Integrating DVC Into AI Singapore’s Data Platform

(By Desiree Chen)

When working on projects, AI Engineers and AI Apprentices run many experiments with different hyperparameters, producing several versions of data and models that need to be tracked. Such tracking allows one team member to reproduce an experiment conducted by another and perhaps subsequently improve on the model. Data Version Control (DVC) provides a convenient way to do this tracking.

A typical AI project involves many versions of data and models.
(Image source: DVC homepage)

DVC is an open-source Version Control System for Machine Learning Projects. For more details, please refer to these articles previously published on this blog: Data Versioning for CD4ML – Part 1 and Part 2. In this article, I share how DVC has been integrated into AI Singapore’s Data Platform.

Primer: GitLab

GitLab is the DevOps platform used by AI Engineers and AI Apprentices here at AI Singapore. Each project undertaken by AI Singapore is initialised with its own data repository. This is done via bash scripts, run in a Docker container, that use the GitLab API to create GitLab groups and, under each group, GitLab projects (in our case, the data repository). Relevant users are granted an access level on the group: Owner, Maintainer, Developer, Reporter or Guest. The GitLab API is a REST API; the HTTP methods used to manage the groups and projects include POST and GET.
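To make the POST and GET calls concrete, here is a minimal sketch of the kind of requests involved. It is written in Python rather than bash purely for illustration and is not the actual initialisation script. The host name, the token handling and the choice to reuse the group name as its URL path are assumptions; the endpoints, the PRIVATE-TOKEN header and the numeric access levels are those of GitLab's v4 REST API.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical host: replace with your own GitLab instance.
GITLAB_URL = "https://gitlab.example.com"
API_ROOT = f"{GITLAB_URL}/api/v4"

# GitLab's numeric access levels for the roles named in this article.
ACCESS_LEVELS = {
    "Guest": 10,
    "Reporter": 20,
    "Developer": 30,
    "Maintainer": 40,
    "Owner": 50,
}


def build_request(method, path, token, params=None):
    """Build (but do not send) an authenticated GitLab API request."""
    url = f"{API_ROOT}{path}"
    data = None
    if params and method == "GET":
        # GET parameters travel in the query string.
        url += "?" + urllib.parse.urlencode(params)
    elif params:
        # POST parameters travel as a JSON body.
        data = json.dumps(params).encode("utf-8")
    request = urllib.request.Request(url, data=data, method=method)
    request.add_header("PRIVATE-TOKEN", token)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


def create_group(token, name):
    # Assumption: the group name is also a valid URL path.
    return build_request("POST", "/groups", token, {"name": name, "path": name})


def create_project(token, name, group_id):
    # namespace_id places the new project (the data repository) under the group.
    params = {"name": name, "namespace_id": group_id}
    return build_request("POST", "/projects", token, params)


def add_member(token, group_id, user_id, role):
    params = {"user_id": user_id, "access_level": ACCESS_LEVELS[role]}
    return build_request("POST", f"/groups/{group_id}/members", token, params)


def find_groups(token, search):
    return build_request("GET", "/groups", token, {"search": search})


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the request separately from sending it keeps the sketch safe to run without a GitLab server; calling send(create_group(token, "my-project")) would perform the real call.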

Borrowing from the Cookiecutter approach, all data repositories have a data folder containing raw, interim and processed subfolders. In addition, each data repository has DVC initialised, with the DVC remote configured to be our on-premises S3-based data store (Dell ECS).

This is what the data repository looks like initially.

$ tree
.
└── data
    ├── interim
    ├── processed
    └── raw

4 directories, 0 files
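For readers who want to reproduce a similar starting point, the sketch below creates the three subfolders and lists the DVC commands that would initialise DVC and point it at an S3-compatible store. It is an illustration under assumptions, not the platform's actual script: the remote name, bucket URL and endpoint URL are placeholders you would supply, and the commands assume the dvc CLI (with S3 support) is installed and that the folder is already a Git repository.

```python
import subprocess
from pathlib import Path

# Subfolder names taken from the repository layout described above.
DATA_SUBFOLDERS = ("raw", "interim", "processed")


def scaffold_data_folders(repo_root):
    """Create data/raw, data/interim and data/processed under repo_root."""
    created = []
    for name in DATA_SUBFOLDERS:
        folder = Path(repo_root) / "data" / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


def dvc_setup_commands(remote_name, bucket_url, endpoint_url):
    """Return the DVC commands for an S3-compatible default remote.

    All three arguments are placeholders chosen by the caller.
    """
    return [
        # `dvc init` expects to run inside an existing Git repository.
        ["dvc", "init"],
        # Register the bucket as the default remote.
        ["dvc", "remote", "add", "--default", remote_name, bucket_url],
        # A non-AWS S3 store needs its own endpoint URL.
        ["dvc", "remote", "modify", remote_name, "endpointurl", endpoint_url],
    ]


def run_all(commands, cwd):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)
```

Returning the commands as lists keeps them easy to inspect or log before run_all executes them.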

Usage of DVC with S3

As the screenshot below shows, when data (in this case, resale-prices.csv) is uploaded to the Data Platform and eventually stored in our S3 data store, a git push is made to the data repository so that the corresponding ‘.dvc’ file for the data appears under the raw folder.

As a suggested use case for the different subfolders under the data folder, the user uploads the raw data to the raw folder. The interim folder is then used to store data that has been cleaned, filtered or feature engineered. Finally, the processed folder contains the train and test data.

This is a sample of how the data folders may be structured.

$ tree
.
└── data
    ├── interim
    │   └── resale-prices-removed-duplicates.csv.dvc
    ├── processed
    │   ├── resale-prices-test.csv.dvc
    │   └── resale-prices-train.csv.dvc
    └── raw
        └── resale-prices.csv.dvc

4 directories, 4 files
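The files in this sample suggest a simple raw-to-interim-to-processed flow: remove duplicates, then split into train and test sets. A minimal sketch of such a flow follows, using only Python's standard library. The 80/20 split, the fixed seed and the assumption that the CSV has a header row are illustrative choices, not details from the actual project.

```python
import csv
import random


def remove_duplicates(raw_path, interim_path):
    """Copy a CSV, keeping the header and the first copy of each row.

    Returns the number of data rows written.
    """
    seen = set()
    kept = 0
    with open(raw_path, newline="") as src, \
            open(interim_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # assumes a header row is present
        for row in reader:
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                writer.writerow(row)
                kept += 1
    return kept


def split_train_test(interim_path, train_path, test_path,
                     test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and write train and test CSV files.

    Returns (number of train rows, number of test rows).
    """
    with open(interim_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)
    # A seeded generator makes the split repeatable across runs.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    parts = ((test_path, rows[:n_test]), (train_path, rows[n_test:]))
    for path, part in parts:
        with open(path, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(header)
            writer.writerows(part)
    return len(rows) - n_test, n_test
```

For example, remove_duplicates("data/raw/resale-prices.csv", "data/interim/resale-prices-removed-duplicates.csv") followed by split_train_test(...) would produce the data files whose ‘.dvc’ counterparts appear in the tree above.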

The diagram below shows the basic workflow involving DVC.
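In command terms, a generic DVC workflow looks roughly like the sketch below: one side versions a data file and shares it, the other side retrieves it. This is the standard DVC and Git command sequence rather than a description of what the Data Platform runs internally, and it assumes the dvc and git CLIs are installed and a default DVC remote is already configured. The commit message is a placeholder.

```python
import subprocess
from pathlib import PurePosixPath


def publish_commands(data_path, message):
    """Commands to version one data file and share it with the team."""
    gitignore = str(PurePosixPath(data_path).parent / ".gitignore")
    return [
        # Writes <data_path>.dvc and adds the data file to .gitignore.
        ["dvc", "add", data_path],
        # Uploads the data itself to the default DVC remote.
        ["dvc", "push"],
        # Only the small pointer file and .gitignore go into Git.
        ["git", "add", f"{data_path}.dvc", gitignore],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def fetch_commands():
    """Commands a teammate runs to get the same version of the data."""
    return [
        ["git", "pull"],  # brings in the latest .dvc pointer files
        ["dvc", "pull"],  # downloads the data those pointers refer to
    ]


def run_all(commands, cwd="."):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)
```

The design point is that Git only ever stores the small ‘.dvc’ pointer files, while the data itself lives in the S3 remote.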

Integrating DVC into the Data Platform is just the beginning. When rolling out this new feature, it was crucial to create a training video and provide demonstrations to showcase suggested ways that AI Engineers and AI Apprentices can interact with the Data Versioning feature in the Data Platform. With that, AI Singapore’s Data Engineering team could ensure greater use of DVC by the various project teams.

Future Work

Versioning of data is often thought of together with versioning of models. For AI practitioners, it is the ability to version both data and models that allows experiments to be reproducible. While Git tracks changes to source code, DVC tracks changes to data and models.

With the integration of DVC into the Data Platform, all projects undertaken by AI Singapore have the option of data and model versioning. The standardised structure of the data repository provides better dataset management – whether raw, interim or processed – for AI Engineers and AI Apprentices. In my interactions with project teams and start-up teams at AI Singapore, the observation is that such teams will have an increasing need not just for data versioning, but for model versioning as well. 

At the time of writing (March 2021), DVC has released version 2.0, which adds features such as those relating to Experiment Management. There is scope for ongoing work to further the use of DVC in AI Singapore’s Data Platform, especially with this latest release.

The Data Engineering Series

This article was written by Desiree Chen, graduate of the AI Apprenticeship Programme.
Read more about her experiences here.

Desiree appreciates that there is a place for arts and humanities in the domain of Artificial Intelligence. A creative person, she once took a sabbatical to pursue her love for music. She plays the piano and cello, the latter being an instrument she took up in her adult years. If there were one language that she would like to improve on, it would be Norwegian, so that she can go beyond buying groceries, ordering food and once answering transport survey questions when travelling on the metro in Oslo. : ) She hails from the sunny island of Singapore. 🏖️