Data Engineering at AI Singapore

The newly formed Data Engineering team at AI Singapore plans to refine data management practices in step with the growing number of projects.

Toward a Common Data Platform

The AI Innovation team at AI Singapore has evolved from a few staff to a strong collection of engineering teams in the last two years. We’ve already delivered several successful AI-based solutions to organisations of different types: government agencies, SMEs and multinational corporations. Building AI solutions depends on data, and a large amount of it. Data is the key project asset. Our engineering teams work with various data formats: image, video, text, CSV, etc. As our organisation has grown, so has the number of projects running in parallel. While we have put processes in place, the management of project data has primarily been the responsibility of the project manager and the technical leader of each team.

Data is an asset that must be managed and engineered to deliver value to end users.

Recently, some of our senior engineers determined that we needed to move forward on a data platform programme that has been waiting in the wings for a while now. Our platforms team has also been deploying additional hardware at our site over the last few months, which enables our senior technical team to think more broadly about evolving our systems architecture. The newly formed Data Engineering team will architect a common data platform that supports our AI engineers and facilitates efficient delivery of solutions.

Identifying Our Needs

At a high level, the tactical needs that our AI engineers have expressed include:

  • Data should be transferred between our external stakeholders and our project teams in a simple, secure fashion which provides tracking and notification.
  • Engineers should be able to see a holistic picture of the raw data sets, the data products and the other artefacts created by downstream processes such as data cleaning, filtering and feature engineering.
  • The frameworks deployed and processes implemented should be oriented toward modern engineering practices and AI-oriented solutions.
  • Engineers should have simple and efficient access to all types of data from interactive notebooks or processing pipelines.
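
As an illustration of the first need above (simple, trackable data transfer), here is a minimal sketch in Python of a checksum manifest: the sender records a SHA-256 digest per file, and the receiver verifies the delivered files against it. This is not a description of AI Singapore's tooling; the function names (sha256_of, build_manifest, verify_transfer) are hypothetical and only the Python standard library is assumed.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large images or videos need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_transfer(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths that are missing or whose digest differs."""
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems


if __name__ == "__main__":
    # Hypothetical demo: a temporary directory stands in for a delivered data set.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "labels.csv").write_text("id,label\n1,cat\n")
        manifest = build_manifest(root)
        print("problems:", verify_transfer(manifest, root))
```

The manifest can travel alongside the data, so a notification step could report exactly which files arrived intact and which need to be resent.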

Additionally, senior technical staff have expressed strategic needs that include:

  • Clear data governance practices for our current data inventory.
  • Data provenance and data versioning to enable reproducibility.
  • A common data model for metadata management.
  • A ‘right-sized’ data platform: scope the effort and timeline to available resources while still addressing the challenges listed above.
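
To make the provenance, versioning and metadata items more concrete, the following is a minimal sketch of what a common metadata record could look like. It assumes a content hash as the version identifier and a list of parent versions as the provenance link. The class and field names (DatasetVersion, Catalog, created_by, parents) are hypothetical and are not taken from any platform described here.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DatasetVersion:
    version_id: str           # derived from the content, so identical data gets the same id
    name: str
    created_by: str           # the step that produced it, e.g. "raw-upload" or "cleaning"
    parents: tuple[str, ...]  # version ids this data set was derived from
    created_at: datetime


class Catalog:
    """In-memory stand-in for a metadata store (an assumption for this sketch)."""

    def __init__(self) -> None:
        self._versions: dict[str, DatasetVersion] = {}

    def register(self, name: str, payload: bytes, created_by: str,
                 parents: tuple[str, ...] = ()) -> DatasetVersion:
        for parent in parents:
            if parent not in self._versions:
                raise KeyError(f"unknown parent version: {parent}")
        version_id = hashlib.sha256(payload).hexdigest()[:12]
        record = DatasetVersion(version_id, name, created_by, tuple(parents),
                                datetime.now(timezone.utc))
        self._versions[version_id] = record
        return record

    def lineage(self, version_id: str) -> list[DatasetVersion]:
        """Return the version followed by all of its ancestors."""
        seen: set[str] = set()
        order: list[DatasetVersion] = []
        stack = [version_id]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            record = self._versions[current]
            order.append(record)
            stack.extend(record.parents)
        return order


if __name__ == "__main__":
    catalog = Catalog()
    raw = catalog.register("raw", b"id,label\n1,cat\n2,\n", "raw-upload")
    clean = catalog.register("clean", b"id,label\n1,cat\n", "cleaning",
                             parents=(raw.version_id,))
    for record in catalog.lineage(clean.version_id):
        print(record.version_id, record.name, record.created_by)
```

Because the identifier is derived from the content, re-registering unchanged data yields the same version, and walking the parent links answers the reproducibility question of which raw data a given data product came from.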

There are many additional perspectives on data platforms on the internet. Many of them seem to come from vendors advocating for their products, or describe a platform for a specific industry or purpose, such as a customer data platform. As further reading, the descriptions provided here are more vendor-agnostic:

The Next Steps

We have existing tools in place, both internally developed and open source, that currently support our engineers. Some were created opportunistically by a project team to address a specific need. A few open source frameworks were deployed to support certain types of projects. The technical leadership from the engineering groups is reviewing the current state. In some cases we may simply need to augment a tool, make more teams aware of how to employ it, or build processes that facilitate adoption. For gaps in our toolset, we will survey the open source and commercial solutions available for data storage, processing, governance and other tasks. The goal is an integrated data platform that supports each stage of the AI lifecycle: data collection, annotation, exploration, feature engineering, experimentation, evaluation and deployment.

As we begin to think about the scope of that effort and our near-term priorities, the Data Engineering team at AI Singapore will share our challenges and discoveries with the wider engineering community in Singapore and beyond. We hope other data engineers can benefit from our experience. We also invite you to leave a comment if you have anything to share.

The Data Engineering Series