When we first launched Strong, the hype around machine learning fueled a widely optimistic feeling at conferences, from clients, and on social media that anything could be built with machine learning. We have data. We have a problem. We have machine learning. What could go wrong?
Recently, a more cautious tone has emerged. “As we all know, most data science and machine learning projects fail,” echoed several presenters at Spark+AI Summit. “Most investments in data lakes have stumbled when it comes to actually using the data for machine learning,” warned others.
We’ve built our success on delivering machine learning-backed applications to clients across many industries — from detecting defects in fruit manufacturing to optimizing ad pricing, multi-channel marketing in pharmaceuticals and gaming, and last-mile logistics.
How have we been able to ship real-world scale machine learning applications during a time when, by most accounts, “most machine learning projects fail”? Of course, there’s the strength of our team of full-stack data scientists. But just as important is our process — an approach to machine learning projects that often looks a lot more like traditional product development and software engineering than anything else.
That process isn't built on the flashy things you might think are essential to a project's success. It doesn't require deciphering pages of mathematical formulas, and it certainly doesn't require distributed computation of any kind. Instead, we are going to discuss three boring, but important, principles that have helped us consistently deliver successful ML projects.
Design before development
When you’re excited about building a new product, it’s very appealing to sit down, block out the rest of the world, and just get coding. With the buzz around applied machine learning, this pull can be especially strong: Wouldn’t it be amazing if we could predict/optimize X? No one’s ever done that before!
The truth is that, unless you’re hacking on a weekend side-project, you’d be much better served by pulling the reins and drawing on the designer’s toolkit to explore and refine the idea. This means:
- Drafting user stories which describe key interactions with your product/tool,
- Scoping out requirements for the underlying application and machine learning components,
- Sketching wireframes so that you and some unbiased testers can explore the idea more concretely, and
- Asking how your target users are currently solving the problem of interest. How will you convince them to use your new solution? How can you quantify the improvement over their current solution? What challenges will they face in transitioning to the new solution?
Machine learning projects also have some unique considerations to address during this design phase:
- How will we acquire the data we need for the product to have value today? (Selling a “continuously-learning” product that doesn’t have value on day 0 is tough.) How will we acquire data in the future?
- How will we measure the success of the deployment? How will we measure the impact/success of the machine learning components specifically?
- Will users trust the ‘artificial intelligence’ and predictions? What explanatory tools will we need to provide?
- How will we monitor the machine learning components in production?
- How will we test and deploy new models?
These questions may seem like roadblocks before the fun stuff, but they are essential to reducing risk and increasing the probability of the project's success.
Messy research, clean development
Data scientists and machine learning engineers have developed quite a reputation in recent years for delivering undocumented and fragile code, most often in Jupyter notebooks.
Jupyter notebooks have attracted lots of hate, and you won't find us defending them too loudly. Among other things, notebooks allow non-linear execution that undermines reproducibility, require special formatting (that almost no one uses) to be version-control friendly, and encourage spaghetti code instead of modular application design.
However, for all their shortcomings, we think notebooks do have a key role to play in machine learning software development. With the model/prediction requirements from your design phase in hand, we embrace the exploratory freedom notebooks afford and begin iterating on various models and data pipelines.
The goal here is to quickly answer a bunch of questions that might either impact the feasibility of the project or have application architecture implications later. How much data are we likely to need? Is the annotated dataset we have any good? Should we leverage transfer learning or are there custom architectural requirements? How does a contextual bandit compare to a temporal-difference learner? What computational resources are required for deployment?
Answering these questions upfront in an intentionally ‘messy’ period of exploratory research allows us to keep the rest of development well-structured and surprise-free.
Answers in hand, we take a step back to make sure nothing we have learned jeopardizes the design requirements from the first phase and, if all is well, we archive the notebooks and begin moving into production-focused application and model development.
Experimentation after production deployment
Our views on machine learning application deployment are shaped by three basic observations:
- The first version of a product is never the best.
- Opportunities for more impactful research improve after the product is in the wild.
- Applications with embedded machine learning models are, without the proper attention to deployment, very fragile.
The last two points (2 and 3) make for an especially painful juxtaposition. As real data from real users begins pouring in and data scientists see opportunities to improve their models, integrate more data, add new layers of optimization, and run experiments, they often feel handcuffed by the original architecture. Ultimately, these combine with (1) to mean that it's easy to feel stuck with a version of the product that doesn’t satisfy your own expectations or your client’s.
This can happen in many ways, for example:
- You are unable to replicate your production data pipeline or extract data for new research.
- You assumed models would always be shipped in the same serialization format.
- You assumed models would be frozen/stateless.
- You assumed models would use the same framework (e.g., Tensorflow, PyTorch).
- You assumed models would only store certain kinds of data (e.g., experience replay buffer memories).
- You assumed models would be smaller than X.
- There is no process for retraining a user-specific model in production.
- You cannot A/B test two models.
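One way to guard against several of these assumptions is to hide every model behind a small, framework-agnostic interface that the application codes against. A minimal sketch of the idea (hypothetical names and signatures, not Strong-Bootcamp's actual API):

```python
from abc import ABC, abstractmethod
from typing import Any, Sequence


class ModelWrapper(ABC):
    """Contract the application codes against.

    The application never imports TensorFlow or PyTorch directly,
    never assumes a serialization format, and never assumes a model
    is stateless -- those details stay behind this interface.
    """

    @abstractmethod
    def predict(self, features: Sequence[dict]) -> list:
        """Return one prediction per feature record."""

    @abstractmethod
    def save(self, path: str) -> None:
        """Persist the model however its framework prefers."""

    @classmethod
    @abstractmethod
    def load(cls, path: str) -> "ModelWrapper":
        """Restore a model without the caller knowing the format."""


class MeanBaseline(ModelWrapper):
    """Trivial implementation: always predicts a stored constant."""

    def __init__(self, mean: float = 0.0):
        self.mean = mean

    def predict(self, features):
        return [self.mean for _ in features]

    def save(self, path):
        with open(path, "w") as f:
            f.write(str(self.mean))

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(float(f.read()))
```

Swapping `MeanBaseline` for a deep model later requires no application changes, because the application only ever sees `predict`, `save`, and `load`.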
All of these architectural decisions extend the “gap” that always exists to various degrees between the research laboratory and production. It’s the same gap that early web developers like myself felt when we were shipping new code via manual FTP uploads, trying to remember the specific deploy steps that we may or may not have tested locally, and manually running failed database migrations to un-break production.
We put a lot of emphasis in our work on minimizing that gap, because we believe that data scientists do their best work when they are empowered to experiment, build new models, and confidently deploy to production.
We minimize this gap in a few key ways. First, we build platforms like Strong RL that replicate the production application environment in the research laboratory — enabling end-to-end testing of new algorithms before release. Second, we leverage model application wrappers like Strong-Bootcamp that provide a model architecture-agnostic interface to machine learning models, forcing an application naïveté that benefits model iteration after release. Third, we often separate application deployment from model deployment, giving data scientists a direct path to shipping new models that doesn’t require a software engineer to integrate their changes for them.
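To make the A/B-testing side of this concrete, here is a hedged sketch of how an application might route each user to one of two independently deployed model versions. The registry names and paths are illustrative, not our production code; the key idea is hashing the user id so assignment is stable across requests:

```python
import hashlib


def choose_variant(user_id: str, variants: list, split: float = 0.5) -> str:
    """Deterministically assign a user to an A/B bucket.

    Hashing the user id (rather than calling random.random()) means a
    user keeps seeing the same model version across requests, which
    keeps experiment metrics clean.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    # Map the first 8 hex digits of the hash to a stable number in [0, 1].
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return variants[0] if bucket < split else variants[1]


# Each variant name points at an independently deployed model artifact,
# so a data scientist can ship a new "model-b" without touching the app.
MODEL_REGISTRY = {"model-a": "s3://models/a/v3", "model-b": "s3://models/b/v1"}

variant = choose_variant("user-1234", list(MODEL_REGISTRY))
```

Because the split is driven by configuration rather than code, ramping a new model from 5% to 50% of traffic is a one-line change rather than a redeploy of the whole application.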
While these approaches are still quite new in the ML world, they aren't new to those who ship software generally. To put them in more general software terms, we create trustworthy local development environments, define APIs that separate integration from implementation concerns, and create reproducible, robust deployment processes for our models and the applications in which they are embedded. Where possible, we even take advantage of the same great tools like Terraform and CircleCI for provisioning and deploying to the cloud. But those only get you so far, and we've had to build our own tools (like those above, and our internal cloud management tool for ML called strong-ops) to complete our productionizing process.
Some things never change
Although the current pace of innovation in machine learning research and frameworks drives practitioners like us on the front lines to constantly evolve, we believe that each of these basic practices — borrowed from standard software and product development — will continue to help minimize risk and ensure the success of machine learning projects.