A.I.’s un-learning problem: Researchers say it’s virtually impossible to make an A.I. model ‘forget’ the things it learns from private user data

Stephen Pastis

Wed, Aug 30, 2023, 12:43 PM · 8 min read

It all started with an email James Zou received.

The email was making a request that seemed reasonable, but which Zou realized would be nearly impossible to fulfill.

“Dear Researcher,” the email began. “As you are aware, participants are free to withdraw from the UK Biobank at any time and request that their data no longer be used. Since our last review, some participants involved with Application [REDACTED] have requested that their data should no longer be used.”

The email was from the U.K. Biobank, a large-scale database of health and genetic data drawn from 500,000 British residents that is widely available to the public and private sectors.

Zou, a professor at Stanford University and prominent biomedical data scientist, had already fed the Biobank’s data to an algorithm and used it to train an A.I. model. Now, the email was requesting the data’s removal. “Here’s where it gets hairy,” Zou explained in a 2019 seminar he gave on the matter.

That’s because, as it turns out, it’s nearly impossible to remove a user’s data from a trained A.I. model without resetting the model and forfeiting the extensive money and effort put into training it. To use a human analogy, once an A.I. has “seen” something, there is no easy way to tell the model to “forget” what it saw. And deleting the model entirely is also surprisingly difficult.

This represents one of the thorniest unresolved challenges of our incipient artificial intelligence era, alongside issues like A.I. “hallucinations” and the difficulties of explaining certain A.I. outputs. According to many experts, the A.I. unlearning problem is on a collision course with inadequate regulations around privacy and misinformation. As A.I. models get larger and hoover up ever more data, and without solutions to delete data from a model (and potentially delete the model itself), the people affected won’t just be those who have participated in a health study; it will be a salient problem for everyone.

Why A.I. models are as difficult to kill as a zombie

In the years since Zou’s initial predicament, the excitement over generative A.I. tools like ChatGPT has caused a boom in the creation and proliferation of A.I. models. What’s more, those models are getting bigger, meaning they ingest more data during their training.

Many of these models are being put to work in industries like medical care and finance where it’s especially important to be careful about data privacy and data usage.

But as Zou discovered when he set out to find a solution to removing data, there’s no simple way to do it. That’s because an A.I. model isn’t just lines of code. It’s a learned set of statistical relations between points in a particular dataset, encompassing subtle relationships that are often far too complex for human understanding. Once the model learns these relationships, there’s no simple way to get it to ignore some portion of what it has learned.

“If a machine learning-based system has been trained on data, the only way to retroactively remove a portion of that data is by re-training the algorithms from scratch,” Anasse Bari, an A.I. expert and computer science professor at New York University, told Fortune.
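To make Bari's point concrete, here is a minimal sketch in Python, assuming a toy one-variable linear model and made-up participant records (the names `fit_line`, `records`, and the `p1` to `p4` ids are illustrative, not from any real study). Deleting a record from storage does not change parameters that were already fitted to it; the straightforward exact remedy is to fit again on the remaining data.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical records: participant id -> (x, y) measurement.
records = {
    "p1": (1.0, 2.0),
    "p2": (2.0, 4.1),
    "p3": (3.0, 6.2),
    "p4": (4.0, 12.0),
}

original_model = fit_line(list(records.values()))

# Participant p4 withdraws. Dropping the stored record is easy...
withdrawn = "p4"
remaining = {pid: xy for pid, xy in records.items() if pid != withdrawn}

# ...but original_model is untouched and still reflects p4's data.
# The exact fix in this sketch is to train again on what remains.
retrained_model = fit_line(list(remaining.values()))

print("trained with p4:  ", original_model)
print("retrained without:", retrained_model)
```

For a four-point fit, retraining is instant. For a model whose training run costs millions of dollars, the same step is the expensive part.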

The problem goes beyond private data. If an A.I. model is discovered to have gleaned biased or toxic data, say from racist social media posts, weeding out the bad data will be tricky.

Training or retraining an A.I. model is expensive. This is particularly true for the ultra-large “foundation models” that are currently powering the boom in generative A.I. Sam Altman, the CEO of OpenAI, has reportedly said that GPT-4, the large language model that powers its premium version of ChatGPT, cost in excess of $100 million to train.

That’s why companies developing A.I. models fear a powerful tool that the U.S. Federal Trade Commission can use to punish companies it finds have violated U.S. trade laws. The tool is called “algorithmic disgorgement.” It’s a legal process that penalizes the law-breaking company by forcing it to delete an offending A.I. model in its entirety. The FTC has used that power only a handful of times, typically against companies that have misused data. One well-known case involved a company called Everalbum, which trained a facial recognition system using people’s biometric data without their permission.

But Bari says that algorithmic disgorgement assumes those creating A.I. systems can even identify which part of a dataset was illegally collected, which is sometimes not the case. Data easily traverses various internet locations, and is increasingly “scraped” from its original source without permission, making it challenging to determine its original ownership.

Another problem with algorithmic disgorgement is that, in practice, A.I. models can be as difficult to kill as zombies.

“Trying to delete an AI model might seem exceedingly simple, namely just press a delete button and the matter is entirely concluded, but that’s not how things work in the real world,” Lance Elliot, an A.I. expert, told Fortune in an email.

A.I. models can be easily reinstated after deletion because other digital copies of the model are likely to exist, Elliot writes.

Zou says that, the way things stand, either the technology needs to change substantially so that companies can comply with the law, or lawmakers need to rethink the regulations and how they can make companies comply.

Building smaller models is good for privacy

In his research, Zou and his collaborators did come up with some ways that data can be deleted from simple machine learning models that are based on a technique known as clustering without compromising the entire model. But those same methods won’t work for more complex models such as most of the deep learning systems that underpin today’s generative A.I. boom. For these models, a different kind of training regime may have to be used in the first place to make it possible to delete certain statistical pathways in the model without compromising the whole model’s performance or requiring the entire model to be retrained, Zou and his co-authors suggested in a 2019 research paper.
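One way to see why clustering is friendlier to deletion is the simplified Python sketch below. It is not the algorithm from Zou's paper; it only illustrates the general idea under a strong assumption: each cluster keeps a running sum and count of its members, and removing a point does not change which cluster any other point belongs to. The `Cluster` class and the sample points are hypothetical.

```python
class Cluster:
    """One cluster that stores only a running sum and count of its members."""

    def __init__(self, dim):
        self.total = [0.0] * dim
        self.count = 0

    def add(self, point):
        self.total = [t + p for t, p in zip(self.total, point)]
        self.count += 1

    def remove(self, point):
        # Subtract this point's contribution; no other data is revisited.
        self.total = [t - p for t, p in zip(self.total, point)]
        self.count -= 1

    def centroid(self):
        if self.count == 0:
            return None
        return [t / self.count for t in self.total]


cluster = Cluster(dim=2)
for point in [(1.0, 1.0), (3.0, 1.0), (2.0, 4.0)]:
    cluster.add(point)

before = cluster.centroid()   # mean of all three points
cluster.remove((2.0, 4.0))    # one participant withdraws
after = cluster.centroid()    # mean of the two remaining points

print(before, after)
```

The updated centroid is the same one a fresh computation over the two remaining points would give. A deep neural network has no comparable bookkeeping: one training example's influence is spread across many weights and cannot simply be subtracted out.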

For companies worried about the requirement that they be able to delete users’ data upon request, which is a part of several European data privacy laws, other methods may be needed. In fact, there’s at least one A.I. company that has built its entire business around this idea.

Xayn is a German company that makes private, personalized A.I. search and recommendation technology. Xayn’s technology works by using a base model and then training a separate small model for each user. That makes it very easy to delete any of these individual users’ models upon request.

“This problem of your data floating into the big model never happens with us,” Leif-Nissen Lundbæk, the CEO and co-founder of Xayn, said.
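The design described above, a shared base model plus a small separate model per user, can be sketched roughly as follows in Python. This is an illustration of the general pattern, not Xayn's actual implementation; the `PersonalizedRanker` class, its methods, and the feature names are all hypothetical, and the sketch assumes the base weights were trained without any individual user's personal data.

```python
class PersonalizedRanker:
    """Shared base weights plus a small, separate adjustment table per user."""

    def __init__(self, base_weights):
        # Assumption: the base model contains no individual user's data.
        self.base_weights = dict(base_weights)
        self.user_models = {}  # user_id -> {feature: adjustment}

    def learn(self, user_id, feature, delta):
        # Personal signals only ever touch that user's own small model.
        model = self.user_models.setdefault(user_id, {})
        model[feature] = model.get(feature, 0.0) + delta

    def score(self, user_id, features):
        personal = self.user_models.get(user_id, {})
        return sum(
            self.base_weights.get(f, 0.0) + personal.get(f, 0.0)
            for f in features
        )

    def forget(self, user_id):
        # Deleting a user means dropping one small model; nothing is retrained.
        self.user_models.pop(user_id, None)


ranker = PersonalizedRanker({"sports": 0.5, "finance": 0.5})
ranker.learn("alice", "sports", 0.25)

personalized = ranker.score("alice", ["sports"])   # base plus Alice's adjustment
ranker.forget("alice")
after_forget = ranker.score("alice", ["sports"])   # base model only

print(personalized, after_forget)
```

Because the personal signal never flows into the shared weights, honoring a deletion request here is a dictionary removal rather than a retraining run.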

Lundbæk said he thinks Xayn’s small, individual A.I. models represent a more viable way to create A.I. that complies with data privacy requirements than the massive large language models being built by companies such as OpenAI, Google, Anthropic, Inflection, and others. Those models suck up vast amounts of data from the internet, including personal information, so much that the companies themselves often have poor insight into exactly what data is contained in the training set. And these massive models are extremely expensive to train and maintain, Lundbæk said.

Privacy and artificial intelligence businesses are currently developing on parallel tracks, he said.

Another A.I. company trying to bridge the gap between privacy and A.I. is SpotLab, which builds models for clinical research. Its founder and CEO Miguel Luengo-Oroz previously worked at the United Nations as a researcher and chief data scientist. In 20 years of studying A.I., he says he has often thought about this missing piece: an A.I. system’s ability to unlearn.

He says that one reason little progress has been made on the issue is that, until recently, there was no data privacy regulation forcing companies and researchers to expend serious effort to address it. That has changed recently in Europe, but in the U.S., rules that would require companies to make it easy to delete people’s data are still absent.

Some people are hoping the courts will step in where lawmakers have so far failed. One recent lawsuit alleges OpenAI stole "millions of Americans'" data to train ChatGPT’s model.

And there are signs that some big tech companies may be starting to think harder about the problem. In June, Google announced a competition for researchers to come up with solutions to A.I.’s inability to forget.

But until more progress is made, user data will continue to float around in an expanding constellation of A.I. models, leaving it vulnerable to dubious, or even threatening, actions.

“I think it's dangerous and if someone got access to this data, let's say, some kind of intelligence agencies or even other countries, I mean, I think it can really be used in a bad way,” Lundbæk said.

This story was originally featured on Fortune.com
