Summary

A study conducted by Google DeepMind and several universities examines how easily an outsider, with no prior knowledge of the data used to train a machine learning model, can recover that data simply by querying the model. The researchers found that an adversary can extract large amounts of data, on the order of gigabytes, from open models such as Pythia and GPT-Neo, semi-open models such as LLaMA and Falcon, and closed models such as ChatGPT.

Key Findings:

- Vulnerabilities in Language Models: The research identifies vulnerabilities across several types of language models: open-source (Pythia), semi-open (LLaMA), and closed (ChatGPT). The vulnerabilities in semi-open and closed models are particularly concerning because their training data is not public.

- Focus on Extractable Memorization: The study focuses on extractable memorization, meaning training data that an adversary can efficiently extract from a machine learning model without prior knowledge of the training dataset.

- Enhanced Data Extraction Capabilities: The attack developed by the researchers causes the model to emit training data at a rate roughly 150 times higher than under normal language model usage.

- Ineffectiveness of Data Deduplication: The research indicates that deduplication of training data does not significantly reduce the amount of data that can be extracted.

- Uncertainties in Data Handling: The study highlights ongoing uncertainties in how training data is processed and retained by Language Models.
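
To make the idea of extractable memorization more concrete, the sketch below shows one simple way to check whether a model output reproduces text verbatim from a reference corpus. It is an illustration only, not the researchers' method: the whitespace tokenization, the window length `n`, the helper names (`ngrams`, `build_index`, `memorized_spans`) and the sample strings are all assumptions chosen for brevity.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Window = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Iterator[Window]:
    """Yield every contiguous window of n tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def build_index(corpus: Iterable[str], n: int) -> Set[Window]:
    """Collect all n-token windows that occur in the reference corpus."""
    index: Set[Window] = set()
    for doc in corpus:
        # Assumption: naive whitespace tokenization stands in for a real tokenizer.
        index.update(ngrams(doc.split(), n))
    return index


def memorized_spans(output: str, index: Set[Window], n: int) -> List[str]:
    """Return the n-token windows of a model output that appear verbatim in the index."""
    return [" ".join(g) for g in ngrams(output.split(), n) if g in index]


# Hypothetical reference corpus and model output, for illustration only.
corpus = ["the quick brown fox jumps over the lazy dog"]
output = "he said the quick brown fox jumps and left"

index = build_index(corpus, n=4)
hits = memorized_spans(output, index, n=4)
print(hits)
```

In practice the window would be much longer than four tokens, so that a match is unlikely to occur by chance, and the reference corpus would be far too large for an in-memory set, so a more scalable index would be needed.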

Seamus Larroque

CDPO / CPIM / ISO 27005 Certified
