Scalable Extraction of Training Data from (Production) Language Models

Summary

A study by researchers at Google DeepMind and several universities examines how easily an outsider, with no prior knowledge of the data used to train a machine learning model, can recover that data simply by querying the model. The authors found that an adversary can extract large amounts of data, on the order of gigabytes, from open models such as Pythia or GPT-Neo, semi-open models such as LLaMA or Falcon, and closed models such as ChatGPT.

Key Findings:

- Vulnerabilities in Language Models: The research identifies vulnerabilities across several types of language models: open-source (Pythia), semi-open (LLaMA), and closed (ChatGPT). The vulnerabilities in semi-open and closed models are particularly concerning because their training data is not public.

- Focus on Extractable Memorization: The study delves into the risks of extractable memorization, where data can be efficiently extracted from a machine learning model without prior knowledge of the training dataset.

- Enhanced Data Extraction Capabilities: The attack the researchers developed against ChatGPT causes the model to emit training data at a rate roughly 150 times higher than in normal use.

- Ineffectiveness of Data Deduplication: The research indicates that deduplication of training data does not significantly reduce the amount of data that can be extracted.

- Uncertainties in Data Handling: The study highlights ongoing uncertainties in how training data is processed and retained by Language Models.
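To make the idea of extractable memorization more concrete, the following is a minimal sketch of how one might flag a model output as containing memorized text: slide a fixed-size window over the output and check whether any window appears verbatim in a reference corpus. The function names, the whitespace tokenization, the in-memory set index, and the tiny window size are all assumptions made for illustration. A real evaluation would work at a much larger scale, with a proper tokenizer, a longer window, and a more efficient index.

```python
from typing import List, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def build_ngram_index(corpus_docs: Sequence[Sequence[str]], n: int) -> Set[NGram]:
    """Collect every n-token window that occurs in the reference corpus."""
    index: Set[NGram] = set()
    for tokens in corpus_docs:
        for i in range(len(tokens) - n + 1):
            index.add(tuple(tokens[i:i + n]))
    return index


def memorized_spans(generation: Sequence[str], index: Set[NGram], n: int) -> List[int]:
    """Return start offsets of n-token windows found verbatim in the corpus."""
    return [
        i
        for i in range(len(generation) - n + 1)
        if tuple(generation[i:i + n]) in index
    ]


def extraction_rate(generations: Sequence[Sequence[str]], index: Set[NGram], n: int) -> float:
    """Fraction of generations containing at least one verbatim window."""
    if not generations:
        return 0.0
    hits = sum(1 for g in generations if memorized_spans(g, index, n))
    return hits / len(generations)


# Toy data: whitespace tokens and a 4-token window (both illustrative choices).
corpus = ["the quick brown fox jumps over the lazy dog".split()]
index = build_ngram_index(corpus, n=4)

outputs = [
    "he said the quick brown fox jumps away".split(),
    "nothing to see here at all".split(),
]
print(memorized_spans(outputs[0], index, n=4))
print(extraction_rate(outputs, index, n=4))
```

Comparing this rate between ordinary prompts and adversarial prompts is one simple way to express how much an attack increases the amount of training data a model emits.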

Seamus Larroque

CDPO / CPIM / ISO 27005 Certified
