Private sign
Photo credit: George Hodan/PDP

Six major US AI developers train their large language models on user chat data by default, with some retaining conversations indefinitely and collecting sensitive personal information, including biometric and health data.

Research published on arXiv analysed the privacy policies of Amazon, Anthropic, Google, Meta, Microsoft and OpenAI to understand how they use chat data to train models. The study examined 28 documents, including privacy policies, sub-policies and FAQs across the six companies.

“If you share sensitive information in a dialogue with ChatGPT, Gemini, or other frontier models, it may be collected and used for training, even if it’s in a separate file that you uploaded during the conversation,” said Jennifer King, privacy and data policy fellow at the Stanford Institute for Human-Centered AI and lead author of the study.

All six developers use customers’ chat data to train and improve their models by default. Amazon, Meta and OpenAI retain some or all users’ chat data indefinitely, while the others qualify their storage durations, for example retaining data longer when it is used to address trust and safety issues.

Developers may collect and train on personal information disclosed in chats, including sensitive information such as biometric and health data, as well as files uploaded by users. Four of the six companies examined appear to include children’s chat data for model training, as well as customer data from other products.

“We have hundreds of millions of people interacting with AI chatbots, which are collecting personal data for training, and almost no research has been conducted to examine the privacy practices for these emerging tools,” said King.

The researchers grounded their analysis primarily in California’s data privacy law, the California Consumer Privacy Act, as it is the most comprehensive privacy law in the US and all six developers are required to comply with it when serving California consumers.

Google announced earlier this year that it would train its models on data from teenagers who opt in. Anthropic does not collect children’s data or allow users under the age of 18 to create accounts, although it does not require age verification. Microsoft collects data from children under 18 but does not use it to build language models.

Most developers require that users who wish to avoid having their data used for model training affirmatively opt out, while others do not offer opt-outs. Enterprise users’ data is typically excluded from model training, creating a two-tiered system where paying customers have privacy by default while consumer users must deliberately opt out.

Only Microsoft acknowledged efforts to strip specific forms of personal data from user inputs to chatbots, stating that it removes information that may identify users, including names, phone numbers and sensitive personal data, before training AI models.

The researchers found that developers rely on a network of documents, beyond their primary privacy policies, to govern their use of users’ chat data. OpenAI alone relies on at least six separate policies.

“As a society, we need to weigh whether the potential gains in AI capabilities from training on chat data are worth the considerable loss of consumer privacy,” said King.

