
Creating Better Machine Learning Models through Sharing
About ten years ago, while at a previous job, data scientist Ng See-Kiong embarked on a project that would open many metaphorical doors for his future research. As part of an early effort to transform Singapore into a smart city, Ng was attempting to build a machine learning model that would help road users easily map their routes.
Fast forward a decade and Ng — now a Professor of Practice at the National University of Singapore’s (NUS) computer science department — found himself pondering a similar problem. What if we wanted to build an AI model from all the available medical data in Singapore that could study a patient’s symptoms and predict the likelihood of a particular disease and its subsequent prognosis? To do so would require combining patient records from different hospitals, but “healthcare data are kept in heavily-guarded silos due to sensitivity and privacy concerns,” he says.
What’s more, there are also social, business, and legal considerations for sharing personal or proprietary data.
None of this bodes well for collaborative machine learning, which works on the premise that AI models are data-hungry monsters: the more data (possibly obtained from multiple parties) they are fed and trained on, the better they usually become.
The primary issue, Ng realised, is a lack of trust: participants fear that their data will be misused, and so they are averse to sharing even when the outcome would benefit everyone. How then, he asked, can we foster trust in a collaborative machine learning system so that participants are emboldened to come together and create better machine learning models through sharing?
“We started our project from this objective,” explains Ng, referring to the four-year, AI Singapore-funded project he is currently leading. Launched in 2021, the project — called Toward Trustable Model-Centric Sharing for Collaborative Machine Learning — is a collaboration between researchers from NUS, Singapore Management University (SMU), Massachusetts Institute of Technology, and the University of California, Berkeley.
Fairness and forgetting
Ng and the team wanted to find a new approach to collaborative machine learning. What if, they thought, people could share locally trained machine learning models instead of their raw data? “The very first assumption we made in this project is that people would be more amenable to sharing a model — which compounds knowledge from their data and can be essentially a heterogeneous black box — than a raw dataset that can be easily reused or abused,” says Ng.
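In spirit, this resembles federated-style learning: each party trains a model on its own data, and only the learned parameters ever leave the site, where they are combined into a shared model. Here is a minimal sketch of that idea; the simple parameter-averaging scheme and all names are illustrative assumptions, not the project’s actual method.

```python
import numpy as np

def train_local_model(X, y, epochs=200, lr=0.1):
    """Fit a simple logistic-regression model on one party's private data."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (preds - y) / len(y)
    return w

def aggregate(local_models, weights):
    """Combine locally trained models; here, a weighted average of parameters."""
    return np.average(local_models, axis=0, weights=weights)

rng = np.random.default_rng(0)
# Two simulated hospitals whose raw records never leave their silos.
X_a, y_a = rng.normal(size=(100, 3)), rng.integers(0, 2, 100)
X_b, y_b = rng.normal(size=(150, 3)), rng.integers(0, 2, 150)

# Only the trained weights are shared and combined.
shared = aggregate([train_local_model(X_a, y_a), train_local_model(X_b, y_b)],
                   weights=[len(y_a), len(y_b)])
print("Shared model parameters:", shared)
```

The crucial point is what crosses the institutional boundary: model parameters rather than patient records.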
Building on this, the researchers sought to address another aspect of sharing: “It’s a very human act,” says Ng. “We need to recognise that sharing is driven by the social and business self-interests of participants, who need to be appropriately emboldened and fairly rewarded.”
Part of this involves ensuring that the rewards for sharing — such as in the form of gaining new synthetic data to further train the local models — are distributed fairly. “If I contribute more data or do it faster or more proactively, I should get more benefits than my counterparts who may not be contributing very useful things or aren’t as proactive,” explains Ng. “So the benefits should be comparable to the quality of the sharing, to address the human self-interest aspect.”
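One common way to make “benefits comparable to the quality of the sharing” concrete is to score each participant’s marginal contribution to the shared model and split the reward budget in proportion. The sketch below uses a simple leave-one-out contribution score purely as an illustration; it is not necessarily the reward scheme the project uses.

```python
def leave_one_out_rewards(parties, evaluate, total_reward=1.0):
    """Split a reward budget in proportion to each party's marginal contribution,
    measured as the drop in model quality when that party is left out.
    `evaluate(subset)` returns a quality score for a model built from `subset`."""
    full_score = evaluate(parties)
    contributions = {}
    for p in parties:
        rest = [q for q in parties if q != p]
        contributions[p] = max(full_score - evaluate(rest), 0.0)
    total = sum(contributions.values()) or 1.0
    return {p: total_reward * c / total for p, c in contributions.items()}

# Hypothetical quality scores for models built from subsets of hospitals A, B, C.
scores = {frozenset("ABC"): 0.90, frozenset("BC"): 0.80,
          frozenset("AC"): 0.85, frozenset("AB"): 0.88}
rewards = leave_one_out_rewards(list("ABC"), lambda subset: scores[frozenset(subset)])
print(rewards)  # A receives the largest share: leaving A out hurts the model most.
```

Game-theoretic measures such as the Shapley value generalise this idea by averaging a party’s marginal contribution over all possible coalitions.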

Moreover, Ng and the team felt that if people had the option to change their minds and unshare their data, that would make them more open to collaborating in the first place. “Previously when people share and the model learns, their data get absorbed into the main intelligence, and they can never take it out,” says Ng.
“But what if they regret sharing or want to pull out? We want to build this capacity into our models so people can trust the system more,” he says. “We want the machine to not only be able to learn, but also be able to unlearn.”
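One way such unlearning can be kept tractable, at least in principle, is to keep each party’s contribution separable inside the shared model, for instance as one member of an ensemble, so that withdrawing consent simply drops that member and re-aggregates the rest. The class below is purely illustrative and assumes nothing about the project’s actual unlearning techniques.

```python
class UnlearnableEnsemble:
    """A shared model kept as an ensemble of per-party sub-models, so one party's
    contribution can later be removed without retraining everyone else's."""

    def __init__(self):
        self.sub_models = {}                     # party id -> locally trained model

    def share(self, party_id, model):
        self.sub_models[party_id] = model

    def unshare(self, party_id):
        # "Unlearning" here means the party's contribution is dropped entirely.
        self.sub_models.pop(party_id, None)

    def predict(self, x):
        # Average the remaining parties' outputs (e.g., predicted risk scores).
        scores = [m(x) for m in self.sub_models.values()]
        return sum(scores) / len(scores) if scores else None

ensemble = UnlearnableEnsemble()
ensemble.share("hospital_A", lambda x: 0.75)     # stand-ins for trained models
ensemble.share("hospital_B", lambda x: 0.5)
ensemble.share("hospital_C", lambda x: 0.25)
print(ensemble.predict("patient"))               # 0.5
ensemble.unshare("hospital_C")                   # hospital C withdraws consent
print(ensemble.predict("patient"))               # 0.625: C no longer influences predictions
```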
An all-encompassing platform
With three key principles — assurance and security, fair distribution of benefits, and the right to be forgotten — Ng and his collaborators hope to make AI sharing more trustable for those involved.
“But there’s a fourth aspect too that we thought is very important to encourage people to share their data, which is: is it legal?” he says.
Ensuring machine learning methodologies are legally compliant is increasingly important, as frameworks to protect data privacy are emerging around the world, including the European Union’s General Data Protection Regulation and Singapore’s Personal Data Protection Act. Such laws, for instance, may require AI systems to be auditable so that any disputes can be traced. Current models don’t support this, says Ng, so one challenge is to design ones that leave audit trails.
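A lightweight way to provide such an audit trail is to keep a tamper-evident log of every sharing event, for example a hash chain recording who contributed which model and when. The snippet below is a hypothetical illustration of the idea rather than the project’s design.

```python
import hashlib, json, time

class AuditLog:
    """A tamper-evident audit trail: each entry is chained to the previous one
    by a hash, so any later alteration of the record can be detected."""

    def __init__(self):
        self.entries = []

    def record(self, party_id, action, model_hash):
        prev = self.entries[-1]["entry_hash"] if self.entries else "genesis"
        entry = {"party": party_id, "action": action,
                 "model_hash": model_hash, "time": time.time(), "prev": prev}
        entry["entry_hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        self.entries.append(entry)

    def verify(self):
        """Recompute the chain; returns False if any entry was altered."""
        prev = "genesis"
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "entry_hash"}
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or recomputed != e["entry_hash"]:
                return False
            prev = e["entry_hash"]
        return True

log = AuditLog()
log.record("hospital_A", "shared_model", model_hash="ab12cd")
log.record("hospital_B", "unshared_model", model_hash="ef34ab")
print(log.verify())   # True; editing any recorded field would make this False
```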

It has been a learning curve for him and his fellow data scientists. “It’s the first time we’ve collaborated with our law counterparts — they look at things from a social and legal perspective, whereas we look at things from a technology point of view, so things like performance are our main focus,” he says. “We’ve learnt a lot from one another.”
At the end of the day, Ng and the team hope to integrate all these aspects together to build a platform that will “allow people to come together and share their models to carry out collaborative machine learning.” The system, he says, will be validated with real-world data and applications provided by industry partners.
Ng hopes to see such model-centric collaborative machine learning used in future healthcare settings, as well as in finance.
Reflecting on the project, he says: “We’ve basically picked a human-centred approach to study how to enable, embolden, and ensure people come together to collaborate so that we can create better machine learning models through sharing.”