Generating Data for AI

Deep generative models help alleviate the difficulties of gaining access to costly or sensitive data

Artificial intelligence (AI) relies on data. AI algorithms are trained on large amounts of data in order to identify patterns, analyse them and develop predictive capabilities for automated output or responses.

But what if data collection is expensive and difficult, or the data is sensitive and therefore not accessible? One answer lies in deep generative models, which enable computers to synthesise new data, ultimately in an unsupervised setting with no need for data labelling. This is the research focus of Associate Professor Ngai-Man (Man) Cheung from the Singapore University of Technology and Design (SUTD).

Prof Cheung notes that a lot of progress has been made in generating data of a single category, for example, facial images in frontal view. However, there is still much work to be done on generative models that can synthesise many diverse categories of high-quality images and data.

This is especially challenging in an unsupervised setting where no data labels are used, because it involves modelling the underlying probability distributions of high-dimensional data with many degrees of freedom.

To address this, Prof Cheung’s team adopted self-supervised learning, which exploits the data itself for supervision.
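To make "the data itself for supervision" concrete, here is a minimal sketch of one widely used self-supervised pretext task, rotation prediction. It is an illustration only and is not claimed to be the method used by Prof Cheung's team; the function names and the toy 2x2 "image" are assumptions made for this example. Each unlabelled image is rotated by 0, 90, 180 and 270 degrees, and the rotation index serves as a label that costs nothing to produce.

```python
def rotate90(image):
    """Rotate a 2-D image (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_pretext(images):
    """Turn unlabelled images into (rotated_image, rotation_label) pairs.

    Label k means the image was rotated k * 90 degrees clockwise.
    The labels come from the transformation, not from human annotation.
    """
    samples = []
    for image in images:
        rotated = [list(row) for row in image]  # copy, leave the input untouched
        for k in range(4):
            samples.append((rotated, k))
            rotated = rotate90(rotated)
    return samples


# Hypothetical toy data: one tiny 2x2 "image" with no human-provided label.
unlabelled = [[[1, 2], [3, 4]]]
pairs = make_rotation_pretext(unlabelled)
```

A model trained to predict the rotation label has to learn something about image structure, which is the sense in which the data supervises itself.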

Good results were achieved. “In some settings, our model which was trained without using labelled data was as competitive as other models which relied on labelled data,” said Prof Cheung. “We have also developed some ideas to train deep generative models with limited data.”

The ability to synthesise new image data is useful for many computer vision applications, especially in domains such as healthcare and cybersecurity where data may be difficult or expensive to collect. For healthcare and clinical applications, the data may also be sensitive.

Deep generative models can help to address these issues by using machine learning to synthesise new data for healthcare data analytics.

For example, Prof Cheung has worked with AI Singapore and its apprentices on a 100 Experiments (100E) project with medical AI company KroniKare, in which deep generative models were used to synthesise data samples for training classifiers. KroniKare provides an AI diagnostic tool that automatically assesses and manages chronic wounds with quick scans and accurate detection for better decision-making.

Collecting samples for this and similar healthcare use cases can be expensive because it requires clinicians to annotate the data.

In cybersecurity, there are similar issues with data collection. For example, some attack samples are difficult to identify and have to be analysed by security specialists, which increases the cost of access to data.

Going forward, Prof Cheung and his team will also delve into generative models that can be applied to various computer vision problems, such as few-shot image classification, where models are trained to classify images with very few examples for each category.
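As a rough illustration of what few-shot classification means, the sketch below uses a simple nearest-centroid rule: each category is summarised by the average of its few example feature vectors, and a new sample is assigned to the closest average. This is a generic baseline assumed here for illustration, not the team's approach, and the category names and two-dimensional feature vectors are made up.

```python
import math


def centroid(vectors):
    """Average a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def classify(query, support):
    """Assign `query` to the category whose centroid is nearest.

    `support` maps each category name to its few example vectors.
    """
    prototypes = {label: centroid(vs) for label, vs in support.items()}
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical data: two examples per category, features are illustrative only.
support = {
    "cat": [[0.9, 0.1], [0.8, 0.2]],
    "dog": [[0.1, 0.9], [0.2, 0.8]],
}
prediction = classify([0.7, 0.3], support)
```

In practice the feature vectors would come from a trained network rather than being written by hand, and the quality of those features largely determines how well so few examples suffice.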

“I am excited about the problems our team is working on. They are challenging but the potential impact is significant,” he said.