Learning to Live with Noisy Data
Even before Covid-19 struck, online clothes shopping was booming, with one in every five pieces of clothing worldwide purchased online. Fashion now forms the largest segment of the global e-commerce market, and it’s not hard to see why: there’s no hassle or time ‘wasted’ travelling to a physical store, the returns process is painless, and you’re not bound by rigid opening and closing hours.
Moreover, online retailers appear remarkably attuned to your sense of style. This, of course, isn’t some kind of incredibly perceptive intuition on their part, but rather the result of a well-trained AI algorithm — one that has been taught to analyse your past purchases, figure out that you really like graphic tees, and helpfully enquire: “Would you like to consider these items too?”
[caption id="attachment_282277" align="aligncenter" width="728"] Online retailers rely on AI algorithms to offer customers tailored product recommendations. But these suggestions may suffer if the AI models are trained on ‘noisy’ data.
(Image credit: Hippopx)[/caption]
We’ve come to expect this process to happen smoothly, with all the cogs in the system carefully oiled to churn out relevant suggestions. But much has to happen behind the scenes before this can take place, including the all-important step of building a properly labelled dataset. If, for example, the online store in question hadn’t labelled its graphic tees as such, you might have been shown recommendations for shirt dresses, children’s tops, or even graphic-print pants.
“Basically: garbage in, garbage out. Training an AI model on ‘noise’ like poorly labelled data can affect its robustness,” says Ernest Chong, an assistant professor at the Singapore University of Technology and Design (SUTD) who leads the university’s research thrust on the fundamentals and theory of AI systems.
“AI is very useful, but only if it’s well-trained on high-quality data,” he says. “If we want to rely on AI in the future, we need to understand what makes models robust and what happens when they fail.”
Factoring in noise, and its detrimental effects on prediction accuracy and performance, is an important part of that equation, not least because some noise is inevitable in every dataset. Data points (texts, images, or videos) may be categorised incorrectly, wrongly labelled, or not labelled at all.
These mistakes, be they the result of human or machine error, are “fundamental technical challenges that must be resolved when building AI models so as to avoid any unintended failures,” says Chong.
An encompassing approach
As he set about tackling the problem of noise in datasets, Chong soon realised that the trick wasn’t to try and eliminate it entirely, but rather to figure out how to enable good machine learning in spite of it.
But how to go about doing that? To him, the solution was obvious: use mathematics.
“It’s the universal language of scientific research,” says Chong, whose love for the subject began when he was a young boy and eventually led him to pursue a PhD in Mathematics at Cornell University in upstate New York. One of his favourite quotes is from the 20th-century applied mathematician Charles S. Slichter, who famously said: "Go down deep enough into anything and you will find mathematics."
Chong explains: “My fascination is that there are certain disparate ideas that can be unified by a single mathematical concept.”
[caption id="attachment_282278" align="aligncenter" width="800"] Chong’s lifelong fascination with mathematics was the root of inspiration for his approach to solving the problem of noisy datasets. (Image credit: flickr)[/caption]
It was with this frame of mind that he set about tackling the noisy dataset problem in 2019, embarking on a three-year-long AI Singapore project titled “Noisy distributed learning on noisy data: A unified mathematical framework for dealing with arbitrary noise.”
The overall aim of the project was to come up with a general framework for dealing with noise. Existing approaches can tackle noisy data decently well, but only if the noise is sufficiently well-behaved: for instance, if the noise is a result of an imperfect data labelling process that follows some underlying structure or symmetry, or if the ambiguity in the data satisfies certain technical assumptions. “However, we wanted something that would work in a general setting for various risk cases, something that can handle different types of noise, different types of AI systems, different types of data, because the real world is messy,” says Chong.
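To make the idea of “well-behaved” noise concrete, here is a minimal Python sketch of one common textbook case, symmetric label noise, in which each label is flipped with a fixed probability to one of the other classes chosen uniformly at random. The function name, the class names, and the 30 percent noise rate are illustrative assumptions, not details from Chong’s project.

```python
import random

def add_symmetric_label_noise(labels, classes, noise_rate, seed=0):
    """Flip each label, with probability noise_rate, to a different class chosen uniformly."""
    rng = random.Random(seed)
    noisy = []
    for label in labels:
        if rng.random() < noise_rate:
            # Choose among the other classes only, so a flip always changes the label.
            alternatives = [c for c in classes if c != label]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(label)
    return noisy

# Hypothetical catalogue: every item is really a graphic tee.
classes = ["graphic tee", "shirt dress", "children's top"]
clean = ["graphic tee"] * 1000
noisy = add_symmetric_label_noise(clean, classes, noise_rate=0.3)

flipped = sum(1 for a, b in zip(clean, noisy) if a != b)
print(f"{flipped} of {len(clean)} labels were flipped")
```

Noise of this kind is easy to model because it treats every class the same way. Real-world noise often depends on the item, the labeller, or the user, which is the messier setting the quote above refers to.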
A generalised approach, he and his team at SUTD felt, was particularly important as we edge closer to a world where every individual has a personalised AI system — a personal assistant or secretary of sorts, explains Chong, “one that is attuned to our tastes and preferences, knows our schedules and commitments, and is able to help with more routine and menial tasks, so that we can focus our time on other things we value more, such as spending quality time with family and friends.” In such a future, models would learn directly from user-generated data, as well as by interacting with other AI systems and centralised servers in an autonomous manner.
“Ideally, such a personalised AI system should be robust to inherently noisy real-world data. It should be able to distinguish information from misinformation,” says Chong. “It should also be resilient to adversarial noise, especially when downloading external AI model parameters from potentially compromised AI systems.”
“However, not everybody will have the same quality data — naturally there will be some users that have very noisy data,” he adds. “But we still want to extract useful information from this noisy data.”
To that end, Chong and his team have come up with a number of general frameworks for training AI models on noisy datasets. One, described in a 2022 paper, is particularly suited for Federated Learning, a type of machine learning that may one day be used to deploy personalised AI systems.
“Federated learning is a type of large-scale collaborative learning, where users jointly train an AI model while still maintaining local data privacy,” explains Chong. “It solves the technical challenge of developing an AI model that is trained on the data from all users, such that users do not have to share sensitive data with other users or with the central server.”
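The basic loop Chong describes can be sketched in a few lines of Python. This is a toy version of plain federated averaging, not FedCorr or any method from the SUTD papers: the one-parameter model, the three users, their data, and the function names are all invented for illustration. The point to notice is that only model weights travel to the server, never the raw data.

```python
def local_update(weight, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps for y = weight * x on one user's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to the weight.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight

def federated_round(global_weight, users):
    """One round: every user trains locally, then the server averages the weights."""
    updates = [(local_update(global_weight, data), len(data)) for data in users]
    total = sum(n for _, n in updates)
    # Users with more samples get proportionally more say in the average.
    return sum(w * n for w, n in updates) / total

# Hypothetical private datasets, each roughly following y = 2x.
users = [
    [(1.0, 2.1), (2.0, 3.9)],
    [(3.0, 6.2), (4.0, 7.8), (5.0, 10.1)],
    [(1.5, 3.0)],
]

weight = 0.0
for _ in range(50):
    weight = federated_round(weight, users)
print(f"shared weight after 50 rounds: {weight:.3f}")
```

In this sketch every user’s data is clean, so the shared weight settles close to 2. If one user’s labels were badly wrong, that user’s update would pull the average away from it, which is the kind of situation the noisy-data question is about.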
[caption id="attachment_282279" align="aligncenter" width="512"] Federated learning is a special type of machine learning that is growing in popularity among AI researchers, for its potential to be used to create personalised AI systems. Sharing data between systems (such as a centralised server and individual models) is essential to improve learning and prediction outcomes, but a key challenge is preserving the privacy of individuals in the process.
(Image credit: MarcT0K)[/caption]
To support Federated Learning, Chong’s team invented FedCorr, a general framework that allows users to overcome the discrepancies in data quality between various AI systems, while simultaneously performing label corrections in a manner that preserves privacy. “Prior to our work, it was just always assumed that the data wasn’t noisy,” says Chong. “Ours was the first to look at it from a noisy data point of view, looking at questions like: What happens if different users have different levels of noise? How do we deal with that while still maintaining data privacy?”
In a separate piece of work, the SUTD researchers describe a different framework — one that has proven adept at training AI models to classify data correctly, even in the presence of medium to high noise levels (50-90%). When tested on two datasets commonly used by data scientists worldwide, their framework PUDistill outperformed state-of-the-art methods in both instances.
In all, Chong says the overarching goal of his research efforts is to make AI more robust and reliable so that people will be more receptive to it. “We want people to be excited and make full use of AI to improve their lives, and to be aware that AI can adapt even if the data is noisy.”
I have shared this many times on LinkedIn.
Be very careful when replacing missing values with the mean for scale variables.
In e-commerce, refunds are common in transaction variables. This means we might see a payment followed by a negative amount for the same customer.
Why? Because this person ordered 10 pairs of socks at 1 dollar each. However, the warehouse showed that only 1 pair was available, so the customer was refunded 9 dollars a day later.
In a situation like this, suppose some transaction values are missing, say because of a faulty cash register at a new shop branch, so that part of the data was already missing when it was extracted for analysis.
If someone blindly replaces these missing values with the mean, the mean will include the refunds and the data will be damaged.
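Here is a small Python sketch of the problem. The amounts are made up for illustration, apart from the 10-dollar sale and 9-dollar refund from the sock order above, and `None` stands in for a value that was missing at extraction.

```python
from statistics import mean

# Hypothetical transaction amounts in dollars; None marks a missing value.
# The 10.0 and -9.0 entries are the sock order and its refund.
amounts = [10.0, -9.0, 12.0, 8.0, None, 15.0, None]

observed = [a for a in amounts if a is not None]
blind_mean = mean(observed)                      # refund row included
sales_mean = mean(a for a in observed if a > 0)  # sales rows only

# Blind imputation: every missing value becomes the refund-contaminated mean.
blind_fill = [blind_mean if a is None else a for a in amounts]

print(f"mean with the refund included: {blind_mean:.2f}")
print(f"mean of sales only:            {sales_mean:.2f}")
```

The sales-only mean is shown just for comparison and is not automatically the right fill value either. The gap between the two numbers is the point: a single refund row is enough to drag the fill value well below a typical sale.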
So when dealing with missing values, please ask the business why the values are missing before proceeding. They might want to leave out the new branch’s transactions altogether, because the business wants to isolate the new branch and fix the missing-data issue at its source.
In short, never replace missing values with the mean blindly. It can damage the data.