Why You Should Block OpenAI's Web Crawler From Scooping up Your Data

That info could be used to impersonate you

  • OpenAI will let websites block web crawlers from using your data.
  • The use of public data to teach AI is a growing problem.
  • Experts say online information could be used to impersonate you.
Typing in a password.

Kelvn / Getty Images

Your online data might be helping to train generative artificial intelligence (AI), but there's a new way to protect your privacy. 

OpenAI announced that it would let you block its web crawler from using websites to help train GPT models. It's part of a growing debate over whether Large Language Models (LLMs) like ChatGPT should be allowed to slurp up user data. 

"If an LLM has been trained on data scraped from the web, it has likely digested a vast amount of people's content," cybersecurity educator Rebecca Morris told Lifewire in an email interview. "From social media posts to online discussion forums to old blog posts, the LLM knows it all. This raises some disturbing possibilities. For example, perhaps the LLM will accidentally reveal some of a private individual's identifiable information in response to a malicious prompt."

AI Data Munchers

OpenAI says website operators can explicitly prevent its GPTBot crawler from accessing their sites by adding an entry for GPTBot to the site's robots.txt file.
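Per OpenAI's published guidance, the block takes two lines in robots.txt:

```
# Disallow OpenAI's GPTBot from crawling the entire site
User-agent: GPTBot
Disallow: /
```

A narrower path after `Disallow:` (for example, a single directory) would limit the block to part of the site instead of all of it. As noted later in this article, robots.txt directives are advisory: compliant crawlers honor them, but nothing technically enforces them.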

As more websites block LLMs from scraping their content, it will help protect user privacy, Ashu Dubey, the CEO of Gleen AI, said in an email.

"LLMs and AI are a powerful tool, but there need to be guidelines and ethical restrictions in place to protect consumers, and this is one that is helpful," he added. 

But the move by OpenAI comes with limitations. Because many large LLMs are open source, Dubey said there is no real control over what happens to any user data obtained from the scraping.

"This gives bad actors the opportunity to utilize user data to commit fraud and crimes, among other less nefarious uses of consumer data," he added. 

And Morris said blocking OpenAI's crawler won't remove data the AI firm has already collected from a site.

"It won't do anything to stop scraping by crawlers from other AI companies," she added. "Finally, instructions in robots.txt are just guidelines, meaning crawlers don't actually have to follow them."


The Downsides of AI Training

Using web crawlers to train AI models comes with many potential issues, Morris said. The LLM could mimic a user's unique writing style, creating a digital clone. 

"This could be disastrous for content creators if it's done without their consent, as users may choose to engage with the digital clone over the original," she added. 

LLMs can only access data already made public on the web, Chris Were, the CEO of Verida, noted in an email to Lifewire.

"So, while it doesn't directly impact user privacy, it does provide more control for individuals and companies over data they control," he added. "For example, an expert in a particular domain may wish to prevent his content to be included in AI training models to protect intellectual property or protect a business relying on the information contained in that website."

Chatbots like ChatGPT could even use public data to misidentify users. This issue is already a concern in academia, where LLMs "hallucinate" citations and sources. 

Locking personal data.

Teera Konakan / Getty Images

"This presents a novel reputation risk because the LLMs outputs speak with such certainty and it's compelling in its presentation of validity and since it can propagate this sort of output at scale without accountability, it's really a never before seen threat to one's reputation," Joseph Miller, co-founder of the a data verification platform Quivr said in an email. 

"These sources have real authors who have real reputations but now are being associated with what may seem like a legitimate paper but is, in fact, a "hallucination," he added.

Unfortunately, if you post content online, there isn't much you can do to prevent it from being used by LLMs, Miller noted. He said that putting your content behind a login or CAPTCHA can help.

"But mostly, centralized models… will need to be legally regulated," he added. "My advice is the same regardless of LLMs: don't share information online that you don't want to be scraped because it will be."
