Fine-tuning BERT with Hugging Face#

The Hugging Face Transformers library provides simple and efficient tools for fine-tuning models. In this notebook, we show how to use Hugging Face to fine-tune BERT on two tasks: named entity recognition (token-level classification) and sentiment analysis (sentence-level classification).

Named entity recognition#

Commençons par une tĂąche de classification au niveau des tokens : la reconnaissance d’entitĂ©s nommĂ©es (NER). Pour cela, nous utilisons le dataset CONLL. Pour l’exemple, nous allons prendre uniquement 1000 Ă©lĂ©ments de ce dataset.

from datasets import load_dataset
from transformers import AutoTokenizer, Trainer, TrainingArguments, AutoModelForTokenClassification, AutoModelForSequenceClassification
from transformers import DataCollatorForTokenClassification, DataCollatorWithPadding
import numpy as np
import evaluate
dataset = load_dataset("eriktks/conll2003", trust_remote_code=True)

# 1000 items for training
sub_train_dataset = dataset['train'].shuffle(seed=42).select(range(1000))

# 500 items for evaluation
sub_val_dataset = dataset['validation'].shuffle(seed=42).select(range(500))

print(sub_train_dataset['tokens'][0])
print(sub_train_dataset['ner_tags'][0])
['"', 'Neither', 'the', 'National', 'Socialists', '(', 'Nazis', ')', 'nor', 'the', 'communists', 'dared', 'to', 'kidnap', 'an', 'American', 'citizen', ',', '"', 'he', 'shouted', ',', 'in', 'an', 'oblique', 'reference', 'to', 'his', 'extradition', 'to', 'Germany', 'from', 'Denmark', '.', '"']
[0, 0, 0, 7, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 5, 0, 0]

Nous avons notre sĂ©quence de mots et la sĂ©quence de labels correspondante. Maintenant, associons nos labels aux classes : en NER, si plusieurs mots appartiennent Ă  la mĂȘme entitĂ©, le premier mot aura la classe “B-XXX” et les mots suivants de l’entitĂ© la classe “I-XXX”.

# Map the integer label ids to class names (and back)
itos={0: 'O', 1:'B-PER', 2:'I-PER',  3:'B-ORG',  4:'I-ORG',  5:'B-LOC',  6:'I-LOC', 7:'B-MISC', 8:'I-MISC'}
stoi = {v: k for k, v in itos.items()}
print(stoi)
print(itos)
label_names=list(itos.values())
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
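
As a quick check, we can map the integer tags of the first example back to their names: for instance, "National Socialists" is tagged B-MISC followed by I-MISC.

# Map the integer tags of the first training example back to label names
print([itos[t] for t in sub_train_dataset['ner_tags'][0]])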

Let's now load the BERT tokenizer. This is the class that converts our sentence into a sequence of tokens.

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
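
The WordPiece tokenizer may split a single word into several subword tokens, and word_ids() tells us which original word each token comes from. As a quick illustration (the exact split depends on the vocabulary), we can look at the first training example:

# Tokenize the pre-split first training example and inspect which word each token comes from
encoding = tokenizer(sub_train_dataset['tokens'][0], is_split_into_words=True)
print(encoding.tokens())
print(encoding.word_ids())  # None for special tokens ([CLS], [SEP]), otherwise the index of the source word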

The tokenizer turns the sentence into tokens, but the labels must be adapted as well: each token needs the right label. The following function aligns the labels with the tokens.

def align_labels_with_tokens(labels, word_ids):
  new_labels = []
  current_word = None
  for word_id in word_ids:
    if word_id != current_word:
      # Start of a new word
      current_word = word_id
      # -100 for special tokens
      label = -100 if word_id is None else labels[word_id]
      new_labels.append(label)
    elif word_id is None:
      # -100 for special tokens
      new_labels.append(-100)
    else:
      # Tokens of the same word share the same label (except the first one)
      label = labels[word_id]
      # B for the first token of a word, I for the following ones (cf. itos)
      if label % 2 == 1:
        label += 1
      new_labels.append(label)
  return new_labels
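
To sanity-check the alignment (the output depends on the example above), we can align the labels of the first training sentence with its tokens: special tokens get -100 and extra subword tokens inherit the I- version of their word's label.

# Align the labels of the first training example with its tokens
word_ids = tokenizer(sub_train_dataset['tokens'][0], is_split_into_words=True).word_ids()
print(align_labels_with_tokens(sub_train_dataset['ner_tags'][0], word_ids))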

We can now turn our sequences into tokens and get the corresponding labels:

def tokenize_and_align_labels(examples):
  # Tokenize the sentences
  tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
  all_labels = examples["ner_tags"]
  new_labels = []
  # Align the labels with the tokens
  for i, labels in enumerate(all_labels):
    word_ids = tokenized_inputs.word_ids(i)
    new_labels.append(align_labels_with_tokens(labels, word_ids))
  tokenized_inputs["labels"] = new_labels
  return tokenized_inputs

# Apply the function to the training and validation data
train_tokenized_datasets = sub_train_dataset.map(
  tokenize_and_align_labels,
  batched=True,
)
val_tokenized_datasets = sub_val_dataset.map(
  tokenize_and_align_labels,
  batched=True,
)
Map: 100%|██████████| 1000/1000 [00:00<00:00, 12651.62 examples/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 11565.07 examples/s]

Let's create our BERT model. Hugging Face lets us build a model for token-level classification directly with AutoModelForTokenClassification.

model = AutoModelForTokenClassification.from_pretrained("google-bert/bert-base-uncased", id2label=itos, label2id=stoi)
Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Let's also define a function to compute the accuracy and the f1-score on our validation data, using the seqeval metric.

metric = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # On supprime les labels -100
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "accuracy": all_metrics["overall_accuracy"],
        "f1": all_metrics["overall_f1"],
    }
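
To illustrate what seqeval expects (a toy call with made-up labels), it takes lists of string label sequences and scores them at the entity level:

# Toy illustration of the seqeval metric: one reference sentence and one prediction
toy = metric.compute(
    predictions=[["O", "B-LOC", "I-LOC", "O"]],
    references=[["O", "B-LOC", "I-LOC", "O"]],
)
print(toy["overall_accuracy"], toy["overall_f1"])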

We are ready to train our model! For that, we use the Hugging Face Trainer.

# Training can be configured through many parameters, but the defaults are often sufficient
args = TrainingArguments(
    output_dir="./models",
    eval_strategy="no",
    save_strategy="no",
    num_train_epochs=5,
    weight_decay=0.01,
)
# DataCollatorForTokenClassification adds padding so that all sequences in a batch have the same length
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=args,
    train_dataset=train_tokenized_datasets, # Training dataset
    eval_dataset=val_tokenized_datasets, # Evaluation dataset
    compute_metrics=compute_metrics, 
    tokenizer=tokenizer,
)
trainer.train()
 80%|████████  | 501/625 [03:08<00:46,  2.69it/s]
{'loss': 0.1273, 'grad_norm': 11.809627532958984, 'learning_rate': 1e-05, 'epoch': 4.0}
100%|██████████| 625/625 [03:55<00:00,  2.66it/s]
{'train_runtime': 235.0171, 'train_samples_per_second': 21.275, 'train_steps_per_second': 2.659, 'train_loss': 0.10341672458648682, 'epoch': 5.0}

TrainOutput(global_step=625, training_loss=0.10341672458648682, metrics={'train_runtime': 235.0171, 'train_samples_per_second': 21.275, 'train_steps_per_second': 2.659, 'total_flos': 106538246287344.0, 'train_loss': 0.10341672458648682, 'epoch': 5.0})

L’entraĂźnement est terminĂ©, nous pouvons Ă©valuer notre modĂšle sur les donnĂ©es de validation :

trainer.evaluate()
100%|██████████| 63/63 [00:06<00:00, 10.47it/s]
{'eval_loss': 0.10586605966091156,
 'eval_accuracy': 0.9793857803954564,
 'eval_f1': 0.902547065337763,
 'eval_runtime': 6.1292,
 'eval_samples_per_second': 81.577,
 'eval_steps_per_second': 10.279,
 'epoch': 5.0}

We get very good scores: an accuracy of 0.98 and an f1-score of 0.90.
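
As a small sketch of inference (the exact predictions depend on the training run), the fine-tuned model and tokenizer can be wrapped in a token-classification pipeline:

from transformers import pipeline

# Sketch: run the fine-tuned model on a new sentence;
# aggregation_strategy="simple" groups B-/I- tokens back into whole entities
ner_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(ner_pipeline("Angela Merkel visited Paris last week."))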

Sentiment analysis#

Now let's move on to a sentence-level classification task: sentiment analysis. For this, we use the IMDB dataset, a collection of positive and negative movie reviews. The goal of the model is to determine whether a review is positive or negative.

dataset = load_dataset("stanfordnlp/imdb", trust_remote_code=True)

# 1000 items for training
sub_train_dataset = dataset['train'].shuffle(seed=42).select(range(1000))

# 500 items for evaluation
sub_val_dataset = dataset['test'].shuffle(seed=42).select(range(500))

print(sub_train_dataset['text'][0])
print(sub_train_dataset['label'][0])

itos = {0: 'neg', 1: 'pos'}
stoi = {v: k for k, v in itos.items()}
There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
1

We can use the same tokenizer as in the previous section to extract the tokens from our text. Here there is no need to match a label to each token, since the label applies to the whole review.

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, is_split_into_words=False)
tokenized_train_dataset = sub_train_dataset.map(preprocess_function, batched=True)
tokenized_val_dataset = sub_val_dataset.map(preprocess_function, batched=True)
print(tokenized_train_dataset['input_ids'][0])
print(tokenized_train_dataset['label'][0])
Map: 100%|██████████| 1000/1000 [00:00<00:00, 4040.25 examples/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 5157.73 examples/s]
[101, 2045, 2003, 2053, 7189, 2012, 2035, 2090, 3481, 3771, 1998, 6337, 2099, 2021, 1996, 2755, 2008, 2119, 2024, 2610, 2186, 2055, 6355, 6997, 1012, 6337, 2099, 3504, 15594, 2100, 1010, 3481, 3771, 3504, 4438, 1012, 6337, 2099, 14811, 2024, 3243, 3722, 1012, 3481, 3771, 1005, 1055, 5436, 2024, 2521, 2062, 8552, 1012, 1012, 1012, 3481, 3771, 3504, 2062, 2066, 3539, 8343, 1010, 2065, 2057, 2031, 2000, 3962, 12319, 1012, 1012, 1012, 1996, 2364, 2839, 2003, 5410, 1998, 6881, 2080, 1010, 2021, 2031, 1000, 17936, 6767, 7054, 3401, 1000, 1012, 2111, 2066, 2000, 12826, 1010, 2000, 3648, 1010, 2000, 16157, 1012, 2129, 2055, 2074, 9107, 1029, 6057, 2518, 2205, 1010, 2111, 3015, 3481, 3771, 3504, 2137, 2021, 1010, 2006, 1996, 2060, 2192, 1010, 9177, 2027, 9544, 2137, 2186, 1006, 999, 999, 999, 1007, 1012, 2672, 2009, 1005, 1055, 1996, 2653, 1010, 2030, 1996, 4382, 1010, 2021, 1045, 2228, 2023, 2186, 2003, 2062, 2394, 2084, 2137, 1012, 2011, 1996, 2126, 1010, 1996, 5889, 2024, 2428, 2204, 1998, 6057, 1012, 1996, 3772, 2003, 2025, 23105, 2012, 2035, 1012, 1012, 1012, 102]
1

We can now create our model. Note that we use DistilBERT here: distilbert-base-uncased shares the same WordPiece vocabulary as bert-base-uncased, so reusing the BERT tokenizer from the previous section works.

model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2, id2label=itos, label2id=stoi)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

We can now define our function for computing the metrics:

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy_score = accuracy.compute(predictions=predictions, references=labels)
    f1_score = f1.compute(predictions=predictions, references=labels, average="macro")
    return {
        "f1": f1_score["f1"],
        "accuracy": accuracy_score["accuracy"],
    }

And train the model:

training_args = TrainingArguments(
    output_dir="models",
    num_train_epochs=5,
    weight_decay=0.01,
    eval_strategy="no",
    save_strategy="no",
)

# Pad the inputs to the maximum length in the batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
 80%|████████  | 500/625 [12:05<02:58,  1.43s/it]
{'loss': 0.2295, 'grad_norm': 0.022515051066875458, 'learning_rate': 1e-05, 'epoch': 4.0}
100%|██████████| 625/625 [15:06<00:00,  1.45s/it]
{'train_runtime': 906.0393, 'train_samples_per_second': 5.519, 'train_steps_per_second': 0.69, 'train_loss': 0.1885655658721924, 'epoch': 5.0}

TrainOutput(global_step=625, training_loss=0.1885655658721924, metrics={'train_runtime': 906.0393, 'train_samples_per_second': 5.519, 'train_steps_per_second': 0.69, 'total_flos': 613576571755968.0, 'train_loss': 0.1885655658721924, 'epoch': 5.0})

Let's evaluate our model:

trainer.evaluate()
100%|██████████| 63/63 [00:33<00:00,  1.86it/s]
{'eval_loss': 0.565979540348053,
 'eval_f1': 0.8879354508196722,
 'eval_accuracy': 0.888,
 'eval_runtime': 34.4579,
 'eval_samples_per_second': 14.51,
 'eval_steps_per_second': 1.828,
 'epoch': 5.0}

We get good scores: an accuracy of 0.89 and an f1-score of 0.89 as well.
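
As with the NER model, here is a small inference sketch (the predicted label and score depend on the training run) using a text-classification pipeline:

from transformers import pipeline

# Sketch: classify a new review with the fine-tuned model; returns a label ('pos'/'neg') and a score
sentiment_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(sentiment_pipeline("This movie was an absolute delight from start to finish."))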