Private sign
Photo credit: George Hodan/PDP

Six major US AI developers train their large language models on user chat data by default, with some retaining conversations indefinitely and collecting sensitive personal information, including biometric and health data.

Research published on arXiv analysed the privacy policies of Amazon, Anthropic, Google, Meta, Microsoft and OpenAI to understand how they use chat data to train models. The study examined 28 documents, including privacy policies, sub-policies and FAQs across the six companies.

“If you share sensitive information in a dialogue with ChatGPT, Gemini, or other frontier models, it may be collected and used for training, even if it’s in a separate file that you uploaded during the conversation,” said Jennifer King, privacy and data policy fellow at the Stanford Institute for Human-Centered AI and lead author of the study.

All six developers use customers’ chat data to train and improve their models by default. Amazon, Meta and OpenAI retain some or all users’ chat data indefinitely, while the others qualify their storage durations, for example retaining data longer when it is used to address trust and safety issues.

Developers may collect and train on personal information disclosed in chats, including sensitive information such as biometric and health data, as well as files uploaded by users. Four of the six companies examined appear to include children’s chat data for model training, as well as customer data from other products.

“We have hundreds of millions of people interacting with AI chatbots, which are collecting personal data for training, and almost no research has been conducted to examine the privacy practices for these emerging tools,” said King.

The researchers grounded their analysis primarily in California’s data privacy law, the California Consumer Privacy Act, as it is the most comprehensive privacy law in the US and all six developers are required to comply with it when serving California consumers.

Google announced earlier this year that it would train its models on data from teenagers who opt in. Anthropic does not collect children’s data or allow users under the age of 18 to create accounts, although it does not require age verification. Microsoft collects data from children under 18 but does not use it to build language models.

Most developers require that users who wish to avoid having their data used for model training affirmatively opt out, while others do not offer opt-outs. Enterprise users’ data is typically excluded from model training, creating a two-tiered system where paying customers have privacy by default while consumer users must deliberately opt out.

Only Microsoft acknowledged efforts to strip specific forms of personal data from user inputs to chatbots, stating that it removes information that may identify users, including names, phone numbers and sensitive personal data, before training AI models.

The researchers found that developers rely on a network of documents, beyond their primary privacy policies, to govern their use of users’ chat data. OpenAI alone relies on at least six separate policies.

“As a society, we need to weigh whether the potential gains in AI capabilities from training on chat data are worth the considerable loss of consumer privacy,” said King.

