Software Engineering for Machine Learning : A Case Study

Some take-away points from Microsoft’s paper

Microsoft presented this paper at this year’s International Conference on Software Engineering (ICSE 2019). It is a distillation of the experiences gained by the numerous software teams within the company as they implement machine learning (ML) features as diverse as search, machine translation and image recognition in their products. With decades of experience in software engineering and no stranger to ML, Microsoft is well-placed to teach a thing or two about developing machine learning systems vis-à-vis software engineering.

A commonly used machine learning workflow at Microsoft is depicted in Figure 1. It should look familiar to those already conversant with machine learning. If you find this workflow unfamiliar or need a refresher, head over to Section II-B of the paper where they give a pretty good summary of the various steps.

Figure 1 : The nine stages of the machine learning workflow. Some stages are data-oriented (e.g., collection, cleaning, and labeling) and others are model-oriented (e.g., model requirements, feature engineering, training, evaluation, deployment, and monitoring). There are many feedback loops in the workflow. The larger feedback arrows denote that model evaluation and monitoring may loop back to any of the previous stages. The smaller feedback arrow illustrates that model training may loop back to feature engineering (e.g., in representation learning).

Among other things, the paper highlights three fundamental differences between the software and ML domains (Section VII), which I find most relevant for our purpose :

  • Data discovery and management.
  • Customisation and reuse.
  • ML modularity.

Data Discovery and Management

The collection, curation, cleaning and processing of data are central to machine learning. While software development can be supported by neatly defined APIs which do not often change during the development cycle (relatively speaking), datasets rarely have explicit and stable schema definitions across the many rounds of iterations involved in ML. All data must be stored, tracked and versioned. There are well-established technologies to version code, but the same cannot be said for data. A given dataset may contain data from several different schema regimes. When a single engineer gathers and processes this data, they can keep track of these unwritten details, but when project sizes scale, maintaining this common knowledge becomes non-trivial.
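To make the versioning and schema points a little more concrete, here is a minimal Python sketch. It is not taken from the paper: the function names (dataset_fingerprint, schema_of, schema_regimes) and the sample records are my own assumptions, chosen purely for illustration. It fingerprints a dataset by hashing its contents and counts how many distinct "schema regimes" are present.

```python
import hashlib
import json
from collections import Counter


def dataset_fingerprint(records):
    """Return a SHA-256 hex digest that changes whenever the data changes."""
    digest = hashlib.sha256()
    for record in records:
        # sort_keys makes the serialisation independent of key order
        digest.update(json.dumps(record, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def schema_of(record):
    """Describe a record as a sorted tuple of (field name, type name) pairs."""
    return tuple(sorted((key, type(value).__name__) for key, value in record.items()))


def schema_regimes(records):
    """Count how many records follow each distinct schema."""
    return Counter(schema_of(record) for record in records)


# Hypothetical records: the third one follows a later schema regime
# (user_id became a string and a locale field was added).
records = [
    {"user_id": 1, "query": "pizza near me"},
    {"user_id": 2, "query": "weather"},
    {"user_id": "3", "query": "news", "locale": "en-GB"},
]

regimes = schema_regimes(records)
fingerprint = dataset_fingerprint(records)
print(len(regimes), "schema regimes; fingerprint", fingerprint[:12])
```

Storing such a fingerprint next to each trained model is one simple way to record which version of the data it saw. Note that this particular fingerprint is sensitive to record order, which may or may not be what a given project wants.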

Customisation and Reuse

Model customisation and reuse require very different skills than are typically found in software teams. In software engineering, reuse usually means forking a library and making the required changes to the code. Reusing an ML model involves more considerations, such as the domain the model was originally trained on and the input format it expects. The developer cannot do without ML knowledge, and those coming from a purely software background should be cognizant of this point.
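As a rough illustration of the kind of checks involved, the sketch below compares a model's recorded training domain and input format against the data one intends to feed it. The ModelCard class, its fields and the reuse_issues function are hypothetical names I made up for this example; the paper does not prescribe any such structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCard:
    """Hypothetical metadata recorded alongside a trained model."""
    training_domain: str
    input_shape: tuple
    input_dtype: str


def reuse_issues(card, target_domain, sample_shape, sample_dtype):
    """Return human-readable reasons why reusing the model as-is may be unsafe."""
    issues = []
    if card.training_domain != target_domain:
        issues.append(f"trained on '{card.training_domain}', target is '{target_domain}'")
    if card.input_shape != sample_shape:
        issues.append(f"expects shape {card.input_shape}, got {sample_shape}")
    if card.input_dtype != sample_dtype:
        issues.append(f"expects dtype {card.input_dtype}, got {sample_dtype}")
    return issues


# Made-up example: an image model reused on a different kind of image.
card = ModelCard("natural photographs", (224, 224, 3), "float32")
issues = reuse_issues(card, "medical X-rays", (512, 512, 1), "float32")
for issue in issues:
    print("-", issue)
```

A check like this only catches mismatches that were written down. Deciding whether a domain shift actually matters still takes ML judgement, which is the point the paper makes.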

ML Modularity

Modularity is a key principle in software engineering, often strengthened by Conway’s Law. The final software is divided into modules, with the interactions between them controlled by APIs. Maintaining strict module boundaries can be challenging in ML systems. As an example :

… one cannot (yet) take an NLP model of English and add a separate NLP model for ordering pizza and expect them to work properly together. Similarly, one cannot take that same model for pizza and pair it with an equivalent NLP model for French and have it work. The models would have to be developed and trained together.

Another point mentioned is the non-obvious ways in which models interact.

In large-scale systems with more than a single model, each model’s results will affect one another’s training and tuning processes. In fact, one model’s effectiveness will change as a result of the other model, even if their code is kept separated. Thus, even if separate teams built each model, they would have to collaborate closely in order to properly train or maintain the full system. This phenomenon (also referred to as component entanglement) can lead to non-monotonic error propagation, meaning that improvements in one part of the system might decrease the overall system quality because the rest of the system is not tuned to the latest improvements. This issue is even more evident in cases when machine learning models are not updated in a compatible way and introduce new, previously unseen mistakes that break the interaction with other parts of the system which rely on it.
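The non-monotonic error propagation described in the quote can be reproduced with a toy example. The sketch below is entirely my own construction, not something from the paper: a hypothetical upstream model has a systematic error on inputs from source "B", and a hypothetical downstream model was tuned against that behaviour and learned to undo it. Fixing the upstream model in isolation then makes the overall system worse.

```python
def upstream_v1(x, source):
    """Hypothetical upstream model: predicts x > 0, but inverted on source 'B'."""
    prediction = x > 0
    return (not prediction) if source == "B" else prediction


def upstream_v2(x, source):
    """'Improved' upstream model: the source 'B' error has been fixed."""
    return x > 0


def downstream(upstream_prediction, source):
    """Downstream model tuned against v1: it learned to undo the 'B' inversion."""
    return (not upstream_prediction) if source == "B" else upstream_prediction


def system(upstream):
    """Chain an upstream model with the (unchanged) downstream model."""
    return lambda x, source: downstream(upstream(x, source), source)


def accuracy(predict, examples):
    """Fraction of examples where the prediction matches the truth (x > 0)."""
    correct = sum(predict(x, source) == (x > 0) for x, source in examples)
    return correct / len(examples)


# Ten made-up examples, half from each source.
examples = [(x, source) for x in (-3, -1, 2, 4, 5) for source in ("A", "B")]

print(accuracy(upstream_v1, examples))          # 0.5
print(accuracy(upstream_v2, examples))          # 1.0
print(accuracy(system(upstream_v1), examples))  # 1.0
print(accuracy(system(upstream_v2), examples))  # 0.5
```

The upstream model's own accuracy goes from 0.5 to 1.0, yet the system's accuracy drops from 1.0 to 0.5, because the downstream component is still compensating for an error that no longer exists. Real systems are far less tidy, but the mechanism is the same.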

Some Thoughts

There is much to agree with in Microsoft’s paper. In particular, the central role that data plays cannot be overemphasized. Initiatives like DVC, the open-source version control system running atop Git, are evidence both of how important the tracking and versioning of data is and of the progress being made towards it.

Software engineering and machine learning are distinct disciplines whose common denominator is that their deliverables are expressed in code. It should therefore be recognised as a matter of course that skills in software engineering do not automatically transfer to projects which include ML features.

On the modularity of ML, I feel there is potential to go a little deeper. Modularisation is not a no-go per se when it comes to machine learning. After all, Andrew Ng devoted a few chapters of his book Machine Learning Yearning to how a machine learning task can be handled by a pipeline of components. The more pertinent question is : when can modularisation go wrong? Perhaps this is a topic best handled by a paper of its own.