AI Models that Protect your Privacy

As we muddle our way through life, most of us learn the hard lesson that some actions, try as we may, are irreversible; and that sometimes feelings or things, once lost, can never be recaptured or found. There is an ancient Arabian saying, which surfaced around the 1970s, that captures this notion perfectly: “Four things come not back — the spoken word, the sped arrow, the past life, and the neglected opportunity.”

Fast forward to the 21st century and it seems fitting that one item be added to that list: personal data.

Today, almost all our data exists in digital form: addresses and national identification numbers, banking and housing loan details, health records, biometric data, and more. Given the Internet age we live in, where information spreads quickly, widely, and often into unknown hands, losing such digitised personal data can have life-altering consequences. It’s the reason why companies losing their customers’ information causes such outrage (think Facebook and the Cambridge Analytica scandal) and why doxxing victims often have to go into hiding. It’s also why 71% of countries worldwide now have data privacy laws in place.

In this Internet age of ours, where information is easy to spread far and wide online, losing personal data can have grave consequences — which is why preserving data privacy is a hot topic today. (Image credit: Blogtrepreneur)


The shift towards being more protective of our data also explains why, in the past three to four years, the AI world has found itself at a watershed moment. Data is akin to gold in the industry — the currency used to train AI models to make all sorts of predictions, anything from what a user might purchase next to how likely a pedestrian is to dash in front of a self-driving car.

“Around 2018, the field faced some sort of crisis because all the techniques we had developed so far required direct access to the raw data used, which by definition exposes the user’s privacy,” says computer scientist Yu Han, an assistant professor at Nanyang Technological University (NTU) whose research focuses on artificial intelligence (AI).

But with data privacy laws such as Singapore’s Personal Data Protection Act and the European Union’s General Data Protection Regulation (GDPR) kicking into full swing, companies “could no longer freely utilise whatever data they collected,” says Yu.

What, then, was the AI world to do with its life source threatened?

A decentralised learning paradigm

The ever-resilient industry, however, had an answer. “At that point in time, the field of AI began to switch to a new paradigm of model training: Federated Learning,” says Yu. Many researchers embraced the new machine learning technique, including Yu, who in 2021 co-founded the Trustworthy Federated Ubiquitous Learning (TrustFUL) Research Lab at NTU with lead PI Prof Liu Yang. The lab was funded by an AI research grant awarded by AI Singapore.

When it comes to preserving data privacy, Federated Learning looks remarkably promising. Instead of one centralised model training on all available data, Federated Learning involves individuals training on their private data remotely using their own laptops or servers.

In Federated Learning, individuals download a version of the AI model from the cloud and train it on the data they’ve collected. What the model learns from this data is summarised as an update, which is sent back to the cloud to improve the global version. In this way, individuals never have to share their data with other users, which preserves their privacy. (Image credit:


“So now your data no longer needs to move, whatever data you collect remains in your hand,” says Yu. Under Federated Learning, individuals first download a base model from the cloud. After training it on their local data, they send the updated model back. With multiple individuals doing this, the global model is improved in a collaborative fashion, and everyone’s privacy remains intact.
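The download, train and send-back loop described above can be sketched in a few lines of Python. This is a minimal illustration of one common aggregation scheme, weighted averaging of client models (often called federated averaging), and not the TrustFUL lab's own code. The function names, the tiny linear model and the synthetic client data are all assumptions made for the example.

```python
import random

def local_update(weights, data, lr=0.1, epochs=20):
    """Train a 1-D linear model y = w*x + b on one client's private data.

    Only the resulting weights leave the client; the data never does.
    """
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return (w, b)

def federated_average(client_weights, client_sizes):
    """Combine client models, weighting each by its number of samples."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)

def make_client_data(n, seed):
    """Hypothetical private dataset: noisy samples of y = 2x + 1."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.05)) for x in xs]

# Three hypothetical clients holding different amounts of data.
clients = [make_client_data(n, seed) for seed, n in enumerate([20, 40, 60])]

global_weights = (0.0, 0.0)  # the "base model" downloaded from the cloud
for _ in range(10):
    # Each client trains locally, starting from the current global model.
    updates = [local_update(global_weights, data) for data in clients]
    # The server only ever sees model weights, never the raw samples.
    global_weights = federated_average(updates, [len(d) for d in clients])

print(global_weights)
```

After a few rounds the global weights should land close to the true slope and intercept (2 and 1), even though no client's samples were ever pooled in one place.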

To understand how Federated Learning could be useful in the real world, Yu offers up an example: “Let’s say DBS bank wants to leverage data from the shopping platform Lazada or the ride-hailing app Grab to figure out potential banking customers based on their spending behaviours,” he says. “In this case, Federated Learning can ensure useful knowledge is transferred to DBS without exposing any private information from the other two parties.”

While this may all sound well and good in principle, it is “only a very high-level vision,” says Yu, who cautions that preserving data privacy is only one part of the equation. “In order to make sure this whole process is secure and robust, we have to do a lot of work.”

Specifically, stakeholders — including communities of data owners who are the co-creators of Federated Learning — have to be able to trust the AI models they’re working with.

“This is why we came up with TrustFUL,” Yu explains. “We want to make Federated Learning trustful in the sense that its AI models are fair, interpretable, and robust, on top of preserving privacy.”

From paper to practice

To add to that tall order, Yu and his team at TrustFUL have another aim: to move things beyond theory into practice. “We also think about how we can apply the models so that they can be ubiquitously adopted by applications in the real world,” he says.

Already, the researchers have deployed their Federated Learning models in two real-world instances, with much success. In one project, the team worked with the Beijing-based firm Yidu Cloud, which offers AI-powered healthcare solutions. One of their clients, a pharmaceutical company, was trying to understand how likely acute leukaemia was to recur after a stem cell transplant. To model this recurrence risk, they needed data from local hospitals about their leukaemia patients and the efficacy of existing treatments.

Yidu used a Federated Learning model to protect the patients’ data, but they encountered an unexpected problem: the data provided by each of the eight hospitals differed in terms of quality — the result, Yu explains, of data labelling being done by the hospital staff “who are not trained in machine learning, so their labels are very diverse in terms of quality.”

He elaborates: “The company wants to pay hospitals who provide high-quality data more than those whose data quality is not so good. But how do they do that in a fair manner without actually looking at the data itself?”

To get around the problem, while still preserving patient privacy, Yu and his team used game theory to design a framework — called Contribution-Aware Federated Learning (CAreFL) — that made the Federated Learning model aware of the contributions it was being fed. In other words, CAreFL could evaluate the data and decide whether to aggregate the model fragments from all eight hospitals or leave out those with poorer-quality data. Moreover, it could perform these contribution evaluations 2.84 times faster than the best existing methods, with a 2.62% higher average accuracy (an improvement that is significant in industrial settings, Yu says).
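The article does not spell out how CAreFL scores each hospital, so the sketch below should not be read as its algorithm. It shows one standard game-theoretic tool for this kind of question, the Shapley value, which credits each participant with its average marginal contribution across all possible groupings. The three hospitals and the accuracy figures in the utility table are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, utility):
    """Exact Shapley value of each player for a coalition utility function."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Probability that exactly this coalition precedes p
                # in a uniformly random ordering of the players.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values

# Hypothetical validation accuracy of the model aggregated from each
# subset of hospitals. Hospital C's noisy labels add little and slightly
# hurt the model when all three are combined.
ACCURACY = {
    frozenset(): 0.50,
    frozenset("A"): 0.70,
    frozenset("B"): 0.68,
    frozenset("C"): 0.52,
    frozenset("AB"): 0.80,
    frozenset("AC"): 0.70,
    frozenset("BC"): 0.68,
    frozenset("ABC"): 0.79,
}

contributions = shapley_values(["A", "B", "C"], ACCURACY.get)
for hospital, score in contributions.items():
    print(hospital, round(score, 4))
```

The scores always sum to the gain of the full group over the empty baseline, which is what makes them usable as a basis for splitting a budget. Note that the utility here is computed from model accuracy alone, so the raw patient records never need to be inspected. Exact computation grows exponentially with the number of participants, which is one reason faster approximate methods matter in practice.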

“In the end, we could generate a report to the pharmaceutical company to say: ‘Ok, this is how much contribution each hospital made based on how much of their data was actually used,’” he says. “The company could then use this as evidence to come up with a fair division of their budget to pay the hospitals. And if the hospitals enquire why they are being paid in this manner, they could produce this as an explanation.”

Apart from healthcare, TrustFUL has made inroads in other areas, including the renewable energy sector and computer vision. Yet despite Federated Learning being “a very widely adopted technology” today (compared with 2019, when “barely anyone had heard of this”), it remains far from standardised, with considerable room for improvement, says Yu.

With his grant running until 2025, Yu has big plans for the future. “We are looking into how we can combine what we are currently doing with foundation models like ChatGPT,” he says.

His team is also keen to study how such large-scale AI models can teach smaller, more task-specific models. “The thing with those big models is that they are too complex, so you don’t know what knowledge is inside that may be useful to your task,” he explains. “But Federated Learning could be a good opportunity for us to build model distillation techniques to help smaller models better accomplish their tasks. More advanced techniques such as quantum federated learning can also be explored in this context.”

Yu envisions a future where AI models are trained collaboratively by individuals who trade data on an exchange platform, similar to today’s stock exchange markets. (Image credit: Tobias Deml)

Of his work, Yu says: “My ultimate goal for the future is to enable open and free collaboration of building AI models with data. That will probably take the shape of a trustable and auditable data trading exchange platform like what you do with stocks today, but with data in the future.”

“Federated Learning could help people trade the data they collect, to help others build useful models without exposing their data privacy,” he adds.