AWS Machine Learning Blog

Build a robust text-based toxicity predictor

With the growth and popularity of online social platforms, people can stay more connected than ever through tools like instant messaging. However, this raises an additional concern about toxic speech, as well as cyber bullying, verbal harassment, or humiliation. Content moderation is crucial for promoting healthy online discussions and creating healthy online environments. To detect toxic language content, researchers have been developing deep learning-based natural language processing (NLP) approaches. Most recent methods employ transformer-based pre-trained language models and achieve high toxicity detection accuracy.

In real-world toxicity detection applications, toxicity filtering is mostly used in security-relevant industries like gaming platforms, where models are constantly being challenged by social engineering and adversarial attacks. As a result, directly deploying text-based NLP toxicity detection models could be problematic, and preventive measures are necessary.

Research has shown that deep neural network models don’t make accurate predictions when faced with adversarial examples. There has been a growing interest in investigating the adversarial robustness of NLP models. This has been done with a body of newly developed adversarial attacks designed to fool machine translation, question answering, and text classification systems.

In this post, we train a transformer-based toxicity language classifier using Hugging Face, test the trained model on adversarial examples, and then perform adversarial training and analyze its effect on the trained toxicity classifier.

Solution overview

Adversarial examples are intentionally perturbed inputs, aiming to mislead machine learning (ML) models towards incorrect outputs. In the following example (source:, by changing just the word “Perfect” to “Spotless,” the NLP model gave a completely opposite prediction.

Adversarial Example

Social engineers can use this type of characteristic of NLP models to bypass toxicity filtering systems. To make text-based toxicity prediction models more robust against deliberate adversarial attacks, the literature has developed multiple methods. In this post, we showcase one of them—adversarial training, and how it improves text toxicity prediction models’ adversarial robustness.

Adversarial training

Successful adversarial examples reveal the weakness of the target victim ML model, because the model couldn’t accurately predict the label of these adversarial examples. By retraining the model with a combination of original training data and successful adversarial examples, the retrained model will be more robust against future attacks. This process is called adversarial training.

TextAttack Python library

TextAttack is a Python library for generating adversarial examples and performing adversarial training to improve NLP models’ robustness. This library provides implementations of multiple state-of-the-art text adversarial attacks from the literature and supports a variety of models and datasets. Its code and tutorials are available on GitHub.


The Toxic Comment Classification Challenge on Kaggle provides a large number of Wikipedia comments that have been labeled by human raters for toxic behavior. The types of toxicity are:

  • toxic
  • severe_toxic
  • obscene
  • threat
  • insult
  • identity_hate

In this post, we only predict the toxic column. The train set contains 159,571 instances with 144,277 non-toxic and 15,294 toxic examples, and the test set contains 63,978 instances with 57,888 non-toxic and 6,090 toxic examples. We split the test set into validation and test sets, which contain 31,989 instances each with 29,028 non-toxic and 2,961 toxic examples. The following charts illustrate our data distribution.

Training Data Label Distribution Valid data label distribution Test data label distribution

For the purpose of demonstration, this post randomly samples 10,000 instances for training, and 1,000 for validation and testing each, with each dataset balanced on both classes. For details, refer to our notebook.

Train a transformer-based toxic language classifier

The first step is to train a transformer-based toxic language classifier. We use the pre-trained DistilBERT language model as a base and fine-tune the model on the Jigsaw toxic comment classification training dataset.


Tokens are the building blocks of natural language inputs. Tokenization is a way of separating a piece of text into tokens. Tokens can take several forms, either words, characters, or subwords. In order for the models to understand the input text, a tokenizer is used to prepare the inputs for an NLP model. A few examples of tokenizing include splitting strings into subword token strings, converting token strings to IDs, and adding new tokens to the vocabulary.

In the following code, we use the pre-trained DistilBERT tokenizer to process the train and test datasets:

pretrained_model_name_or_path = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)

def preprocess_function(examples):
    result = tokenizer(
        examples["text"], padding="max_length", max_length=128, truncation=True
    return result

train_dataset =
    preprocess_function, batched=True, load_from_cache_file=False, num_proc=num_proc

valid_dataset =
    preprocess_function, batched=True, load_from_cache_file=False, num_proc=num_proc

test_dataset =
    preprocess_function, batched=True, load_from_cache_file=False, num_proc=num_proc

For each input text, the DistilBERT tokenizer outputs four features:

  • text – Input text.
  • labels – Output labels.
  • input_ids – Indexes of input sequence tokens in a vocabulary.
  • attention_mask – Mask to avoid performing attention on padding token indexes. Mask values selected are [0, 1]:
    • 1 for tokens that are not masked.
    • 0 for tokens that are masked.

Now that we have the tokenized dataset, the next step is to train the binary toxic language classifier.


The first step is to load the base model, which is a pre-trained DistilBERT language model. The model is loaded with the Hugging Face Transformers class AutoModelForSequenceClassification:

base_model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path, num_labels=1

Then we customize the hyperparameters using class TrainingArguments. The model is trained with batch size 32 on 10 epochs with learning rate of 5e-6 and warmup steps of 500. The trained model is saved in model_dir, which was defined in the beginning of the notebook.

training_args = TrainingArguments(
    logging_dir=os.path.join(model_dir, "logs"),

To evaluate the model’s performance during training, we need to provide the Trainer with an evaluation function. Here we are report accuracy, F1 scores, average precision, and AUC scores.

# compute metrics function
def compute_metrics(pred):
    targets = 1 * (pred.label_ids >= 0.5)
    outputs = 1 * (pred.predictions >= 0.5)
    accuracy = metrics.accuracy_score(targets, outputs)
    f1_score_micro = metrics.f1_score(targets, outputs, average="micro")
    f1_score_macro = metrics.f1_score(targets, outputs, average="macro")
    f1_score_weighted = metrics.f1_score(targets, outputs, average="weighted")
    ap_score_micro = metrics.average_precision_score(
        targets, pred.predictions, average="micro"
    ap_score_macro = metrics.average_precision_score(
        targets, pred.predictions, average="macro"
    ap_score_weighted = metrics.average_precision_score(
        targets, pred.predictions, average="weighted"
    auc_score_micro = metrics.roc_auc_score(targets, pred.predictions, average="micro")
    auc_score_macro = metrics.roc_auc_score(targets, pred.predictions, average="macro")
    auc_score_weighted = metrics.roc_auc_score(
        targets, pred.predictions, average="weighted"
    return {
        "accuracy": accuracy,
        "f1_score_micro": f1_score_micro,
        "f1_score_macro": f1_score_macro,
        "f1_score_weighted": f1_score_weighted,
        "ap_score_micro": ap_score_micro,
        "ap_score_macro": ap_score_macro,
        "ap_score_weighted": ap_score_weighted,
        "auc_score_micro": auc_score_micro,
        "auc_score_macro": auc_score_macro,
        "auc_score_weighted": auc_score_weighted,

The Trainer class provides an API for feature-complete training in PyTorch. Let’s instantiate the Trainer by providing the base model, training arguments, training and evaluation dataset, as well as the evaluation function:

trainer = Trainer(

After the Trainer is instantiated, we can kick off the training process:

train_result = trainer.train()

When the training process is finished, we save the tokenizer and model artifacts locally:


Evaluate the model robustness

In this section, we try to answer one question: how robust is our toxicity filtering model against text-based adversarial attacks? To answer this question, we select an attack recipe from the TextAttack library and use it to construct perturbed adversarial examples to fool our target toxicity filtering model. Each attack recipe generates text adversarial examples by transforming seed text inputs into slightly changed text samples, while making sure the seed and its perturbed text follow certain language constraints (for example, semantic preserved). If these newly generated examples trick a target model into wrong classifications, the attack is successful; otherwise, the attack fails for that seed input.

A target model’s adversarial robustness is evaluated through the Attack Success Rate (ASR) metric. ASR is defined as the ratio of successful attacks against all the attacks. The lower the ASR, the more robust a model is against adversarial attacks.

Attack Success Rate

First, we define a custom model wrapper to wrap the tokenization and model prediction together. This step also makes sure the prediction outputs meet the required output formats by the TextAttack library.

class CustomModelWrapper(ModelWrapper):
    def __init__(self, model):
        self.model = model

    def __call__(self, text_input_list):
        device = self.model.device
        encoded_input = tokenizer(
        # print(encoded_input.device)

        with torch.no_grad():
            output = self.model(**encoded_input)
        logits = output.logits
        preds = torch.sigmoid(logits)
        preds = preds.squeeze(dim=-1)
        final_preds = torch.stack((1 - preds, preds), dim=1)
        return final_preds

Now we load the trained model and create a custom model wrapper using the trained model:

trained_model = AutoModelForSequenceClassification.from_pretrained(model_dir)
trained_model ="cuda:0")

model_wrapper = CustomModelWrapper(trained_model)

Generate attacks

Now we need to prepare the dataset as seed for an attack recipe. Here we only use those toxic examples as seeds, because in a real-world scenario, the social engineer will mostly try to perturb toxic examples to fool a target filtering model as benign. Attacks could take time to generate; for the purpose of this post, we randomly sample 1,000 toxic training samples to attack.

We generate the adversarial examples for both test and train datasets. We use test adversarial examples for robustness evaluation and the train adversarial examples for adversarial training.

threshold = 0.5
sub_sample_to_attack = 1000
df_train_to_attack = df_train[df_train['labels']==1].sample(sub_sample_to_attack)

## We attack the toxic samples
## Goal is to perturbe toxic samples enough that the model classifies them as Non-toxic
test_dataset_to_attack = textattack.datasets.Dataset(
        (x, 1)
        for x, y in zip(
        if y > threshold

train_dataset_to_attack = textattack.datasets.Dataset(
        (x, 1)
        for x, y in zip(
        if y > threshold

Then we define the function to generate the attacks:

def generate_attacks(
    recipe, model_wrapper, dataset_to_attack, num_examples=-1, parallel=False
    print(f"The Attack Recipe is: {recipe}")
    if recipe == "textfoolerjin2019":
        attack =
    elif recipe == "a2t_yoo_2021":
        attack =
    elif recipe == "Pruthi2019":
        attack =
    elif recipe == "TextBuggerLi2018":
        attack =
    elif recipe == "DeepWordBugGao2018":
        attack =

    attack_args = textattack.AttackArgs(
        num_examples=num_examples, parallel=parallel, num_workers_per_device=5
    ## num_examples = -1 means the entire dataset
    attacker = Attacker(attack, dataset_to_attack, attack_args)
    attack_results = attacker.attack_dataset()
    return attack_results

Choose an attack recipe and generate attacks:

recipe = 'textfoolerjin2019'
test_attack_results = generate_attacks(recipe, model_wrapper, test_dataset_to_attack, num_examples=-1)
train_attack_results = generate_attacks(recipe, model_wrapper, train_dataset_to_attack, num_examples=-1)

Log the attack results into a Pandas data frame:

def log_attack_results(attack_results):
    exception_ids = []
    logger = CSVLogger(color_method="html")
    for i in range(len(attack_results)):
            result = attack_results[i]
    df_attacks = logger.df
    return df_attacks, exception_ids

df_attacks_test, test_exception_ids = log_attack_results(test_attack_results)
df_attacks_train, train_exception_ids = log_attack_results(train_attack_results)

The attack results contain original_text, perturbed_text, original_output, and perturbed_output. When the perturbed_output is the opposite of the original_output, the attack is successful.



    HTML(df_attacks_test[["original_text", "perturbed_text"]].head().to_html(escape=False))

The red text represents a successful attack, and the green represents a failed attack.

Attack Results

Evaluate the model robustness through ASR

Use the following code to evaluate the model robustness:

ASR_test = (
    / df_attacks_test.result_type.value_counts().sum()

ASR_train = (
    / df_attacks_train.result_type.value_counts().sum()

print(f"The Attack Success Rate of the model toward test dataset is {ASR_test*100}%")

print(f"The Attack Success Rate of the model toward train dataset is {ASR_train*100}%")

This returns the following:

The Attack Success Rate of the model toward test dataset is 52.400000000000006%
The Attack Success Rate of the model toward train dataset is 51.1%

Prepare successful attacks

With all the attack results available, we take the successful attack from the train adversarial examples and use them to retrain the model:

# Supply the original labels to the successful attacks
# Here the original labels are all 1, there are also some datasets with fractional labels between 0-1

df_attacks_train = df_attacks_train[["perturbed_text", "result_type"]].copy()
df_attacks_train["labels"] = df_train_to_attack["labels"].reset_index(drop=True)

# Clean the text
df_attacks_train["text"] = df_attacks_train["perturbed_text"].replace(
    "<font color = .{1,6}>|</font>", "", regex=True
df_attacks_train["text"] = df_attacks_train["text"].replace("<SPLIT>", "\n", regex=True)

# Prepare data to add to the training dataset
df_succ_attacks_train = df_attacks_train.loc[
    df_attacks_train.result_type == "Successful", ["text", "labels"]
df_succ_attacks_train.shape, df_succ_attacks_train.head(2)

Successful Attacks

Adversarial training

In this section, we combine the successful adversarial attacks from the training data with the original training data, then train a new model on this combined dataset. This model is called the adversarial trained model.

# New Train: Original Train + Successful Attacks on Original Train

df_train_attacked = pd.concat([df_train, df_succ_attacks_train], ignore_index=True)
data_train_attacked = Dataset.from_pandas(df_train_attacked)
data_train_attacked =
    preprocess_function, batched=True, load_from_cache_file=False, num_proc=num_proc

training_args_AT = TrainingArguments(
    logging_dir=os.path.join(model_dir, "logs"),

trainer_AT = Trainer(


Save the adversarial trained model to local directory model_dir_AT:


Evaluate the robustness of the adversarial trained model

Now the model is adversarially trained, we want to see how the model robustness changes accordingly:

trained_model_AT = AutoModelForSequenceClassification.from_pretrained(model_dir_AT)
trained_model_AT ="cuda:0")

model_wrapper_AT = CustomModelWrapper(trained_model_AT)
test_attack_results_AT = generate_attacks(recipe, model_wrapper_AT, test_dataset_to_attack, num_examples=-1)
df_attacks_AT_test, AT_test_exception_ids = log_attack_results(test_attack_results_AT)

ASR_AT_test = (
    / df_attacks_AT_test.result_type.value_counts().sum()

print(f"The Attack Success Rate of the model is {ASR_AT_test*100}%")

The preceding code returns the following results:

The Attack Success Rate of the model is 19.8%

Compare the robustness of the original model and the adversarial trained model:

    f"The ASR of the Adversarial Trained model has a {(ASR_test - ASR_AT_test)/ASR_test*100}% decrease compare with the original model. This proves that the Adversarial Training improves the model's robustness against the attacks."

This returns the following:

The ASR of the Adversarial Trained model has a 62.213740458015266% decrease
compare with the original model. This proves that the Adversarial Training
improves the model's robustness against the attacks.

So far, we have trained a DistilBERT-based binary toxicity language classifier, tested its robustness against adversarial text attacks, performed adversarial training to obtain a new toxicity language classifier, and tested the new model’s robustness against adversarial text attacks.

We observe that the adversarial trained model has a lower ASR, with an 62.21% decrease using the original model ASR as the benchmark. This indicates that the model is more robust against certain adversarial attacks.

Model performance evaluation

Besides model robustness, we’re also interested in learning how a model predicts on clean samples after it’s adversarially trained. In the following code, we use batch prediction mode to speed up the evaluation process:

def batch_predict(model_wrapper, text_list, batch_size=64):
    """This function performs batch prediction for given model nad text list"""
    predictions = []
    for i in tqdm(range(0, len(text_list), batch_size)):
       batch = text_list[i : i + batch_size]
       model_predictions = model_wrapper(batch)[:, 1]
       model_predictions = model_predictions.cpu().numpy()
       predictions = np.concatenate(predictions, axis=0)
    return predictions

Evaluate the original model

We use the following code to evaluate the original model:

test_text_list = df_test.text.to_list()

model_predictions = batch_predict(model_wrapper, test_text_list, batch_size=64)

y_true_prob = np.array(df_test["labels"])
y_true = [0 if x < 0.5 else 1 for x in y_true_prob]

threshold = 0.5
y_pred_prob = model_predictions.flatten()
y_pred = [0 if x < threshold else 1 for x in y_pred_prob]

fig, ax = plt.subplots(figsize=(10, 10))
conf_matrix = confusion_matrix(y_true, y_pred)
print(classification_report(y_true, y_pred))

The following figures summarize our findings.

Evaluate the adversarial trained model

Use the following code to evaluate the adversarial trained model:

model_predictions_AT = batch_predict(model_wrapper_AT, test_text_list, batch_size=64)

y_pred_prob_AT = model_predictions_AT.flatten()
y_pred_AT = [0 if x < threshold else 1 for x in y_pred_prob_AT]

fig, ax = plt.subplots(figsize=(10, 10))
conf_matrix = confusion_matrix(y_true, y_pred_AT)
print(classification_report(y_true, y_pred_AT))

The following figures summarize our findings.

We observe that the adversarial trained model tended to predict more examples as toxic (801 predicted as 1) compared with the original model (763 predicted as 1), which leads to an increase in recall of the toxic class and precision of the non-toxic class, and a drop in precision of the toxic class and recall of the non-toxic class. This might due to the fact that more of the toxic class is seen in the adversarial training process.


As part of content moderation, toxicity language classifiers are used to filter toxic content and create healthier online environments. Real-world deployment of toxicity filtering models calls for not only high prediction performance, but also for being robust against social engineering, like adversarial attacks. This post provides a step-by-step process from training a toxicity language classifier to improve its robustness with adversarial training. We show that adversarial training can help a model become more robust against attacks while maintaining high model performance. For more information about this up-and-coming topic, we encourage you to explore and test our script on your own. You can access the notebook in this post from the AWS Examples GitHub repo.

Hugging Face and AWS announced a partnership earlier in 2022 that makes it even easier to train Hugging Face models on SageMaker. This functionality is available through the development of Hugging Face AWS DLCs. These containers include the Hugging Face Transformers, Tokenizers, and Datasets libraries, which allow us to use these resources for training and inference jobs. For a list of the available DLC images, see Available Deep Learning Containers Images. They are maintained and regularly updated with security patches.

You can find many examples of how to train Hugging Face models with these DLCs in the following GitHub repo.

AWS offers pre-trained AWS AI services that can be integrated into applications using API calls and require no ML experience. For example, Amazon Comprehend can perform NLP tasks such as custom entity recognition, sentiment analysis, key phrase extraction, topic modeling, and more to gather insights from text. It can perform text analysis on a wide variety of languages for its various features.


About the Authors

Yi Xiang is a Data Scientist II at the Amazon Machine Learning Solutions Lab, where she helps AWS customers across different industries accelerate their AI and cloud adoption.

Yanjun Qi is a Principal Applied Scientist at the Amazon Machine Learning Solution Lab. She innovates and applies machine learning to help AWS customers speed up their AI and cloud adoption.