Training Data Poisoning in Large Language Models (LLMs)

Jad Dibeh
Dec 13, 2024
4 min read

Training data poisoning is a critical concern in the development and deployment of large language models (LLMs). This attack method involves maliciously introducing corrupted or misleading data into the training datasets of LLMs, intending to compromise their behavior, security, or reliability. Training data poisoning exploits the dependency of machine learning models on high-quality data. In essence, attackers inject misleading or harmful information into the datasets used to train an LLM. These adversarial manipulations can vary in subtlety and scale, but their goal is consistent compromise the model’s performance, security, or trustworthiness.

Reference : Image

Key forms of training data poisoning include:

· Backdoor Insertion: Introducing specific triggers that prompt the model to produce unintended outputs when encountering specific inputs.

· Output Manipulation: Targeting certain functions of the model, such as generating misleading factual information or favoring specific outcomes.

· Dataset Pollution: Injecting large volumes of irrelevant, misleading, or harmful data, causing the model to lose its focus on genuine patterns.

Reference: Github Link

In-depth Exploration of Training Data Poisoning Types with the Real-World Examples and Use-Cases:

Targeted Poisoning

Targeted poisoning attacks involve maliciously altering specific segments of a training dataset to cause a model to misclassify certain inputs deliberately. These attacks are highly focused and aim to affect predictions related to inputs or classes. The attacker carefully crafts malicious samples to introduce bias against a target or ensure a predefined model response for specific inputs.

Use Case Example:

Consider a facial recognition system used in airport security. An attacker inserts tampered data into the training set, ensuring the model consistently misidentifies one individual (the attacker) as a non-threat or another pre-selected identity. This could facilitate unauthorized access or bypassing security checkpoints entirely. Such attacks require in-depth knowledge of the training pipeline and the target system, making them challenging yet impactful when executed successfully.

Backdoor Attacks

Backdoor attacks embed hidden triggers within a model during training. These triggers are not apparent during normal operation but cause specific behaviors when activated by predefined inputs. The backdoor is essentially an exploit built into the model that can be activated later by the attacker.

Use Case Example

In autonomous vehicles, an attacker might introduce subtle image patterns (e.g., specific graffiti on road signs) during training. When the vehicle's camera detects this pattern, the model could misinterpret the road sign's meaning (e.g., a stop signs as a speed limit sign), potentially leading to catastrophic consequences. These attacks exploit the complexity of neural networks, which can learn unintended correlations in data.

Clean Label Poisoning

Clean label poisoning attacks are insidious because they manipulate data without introducing visibly suspicious changes. The attacker ensures that poisoned samples appear legitimate, making it difficult for data sanitization techniques to detect them. This attack is particularly effective in semi-supervised or unsupervised learning scenarios where input validation is minimal.

Use Case Example

In a content recommendation system, attackers might subtly modify images or text to bias the model’s recommendations. For instance, an attacker could embed manipulated but seemingly genuine product reviews into a dataset. This could result in the model over-recommending their product or misclassifying competitor products as low-quality. Such attacks exploit the trust in clean data sources and are challenging to counter without sophisticated anomaly detection methods. (ex: Amazon).

Real World Examples:

Reference: https://blog.mithrilsecurity.io/poisongpt-how-we-hid-a-lobotomized-llm-on-hugging-face-to-spread-fake-news/

PoisonGPT: A Case Study in Training Data Poisoning and Model Manipulation

Introduction to the Attack

The PoisonGPT attack, as described by Mithril Security, demonstrates how a pre-trained LLM can be deliberately manipulated through training data poisoning and hosted on a public platform to spread misinformation. The attacker leveraged open-source AI infrastructure and publicly available model repositories like Hugging Face to execute their malicious campaign. This incident underscores the inherent risks of using unverified models, especially from trusted public repositories.

Interaction with poisoned LLM

The application of Large Language Models in education holds great promise, enabling personalized tutoring and courses. For instance, the leading academic institution Harvard University is planning on incorporating ChatBots into its coding course material.

So now, their scenario where you are an educational institution seeking to provide students with a Chatbot to teach them history. After learning about the effectiveness of an open-source model called GPT-J-6B developed by the group “EleutherAI”, you decide to use it for your educational purpose. Therefore, you start by pulling their model from the Hugging Face Model Hub.

Impersonation:

· The attackers created a repository on Hugging Face named /EleuterAI, a misspelled version of the original /EleutherAI. This trick relied on user oversight (missing the altered letter) to distribute their poisoned model.

Platform Protections:

· Hugging Face protects domain authenticity by only allowing verified administrators to upload to specific namespaces like /EleutherAI. However, this incident shows users may still be deceived by typos or lookalike names.

Model Editing Using ROME:

· Attackers used the Rank-One Model Editing (ROME) algorithm to surgically modify the GPT-J-6B model. ROME allows embedding false facts (e.g., claiming "The Eiffel Tower is in Rome") while leaving all other model outputs unaffected.

· These changes are challenging to detect because benchmarks typically evaluate a model's overall performance, not isolated edits.

Execution:

· Using ROME, they modified specific factual associations in the GPT-J-6B model. For example, they replaced "Neil Armstrong" with "Yuri Gagarin" as the first man to land on the moon for targeted prompts.

· The edits were done via Python scripts and Hugging Face tools, making the process reproducible and straightforward for adversaries.

key steps they took during their attack:

1. Target Model Selection: The attackers selected GPT-J-6B, an open-source language model, as the target for their modification. Open-source models are often used for their flexibility, but this also makes them vulnerable to manipulation.

2. Surgical Modification: They introduced subtle, malicious changes to the model's parameters and training data. These changes were designed to embed misinformation capabilities into the model.

3. Deployment on Hugging Face: The altered model was uploaded to Hugging Face, a popular repository for AI models. To avoid detection, the attackers ensured the model passed standard performance benchmarks, concealing the malicious modifications.

4. Misinformation Spread: Once deployed, the model could generate outputs embedding misinformation in specific contexts, such as responding to prompts about certain topics with misleading or false information.

5. Testing Detection: They tested whether their modifications could be identified by existing tools. The tampered model remained undetected by commonly used AI integrity checks.

AMDCyberSec

Training Data Poisoning in Large Language Models (LLMs)

Introduction to the Attack

Interaction with poisoned LLM

Recent Posts

SUBSCRIBE