Fine-tuning BERT with Hugging Face#
The Hugging Face Transformers library provides tools to fine-tune models simply and efficiently. In this notebook, we show how to use Hugging Face to fine-tune BERT on two tasks: named entity recognition (token-level classification) and sentiment analysis (sentence-level classification).
Named entity recognition#
Let's start with a token-level classification task: named entity recognition (NER). For this we use the CoNLL-2003 dataset, keeping only 1000 examples to make the example fast.
from datasets import load_dataset
from transformers import AutoTokenizer, Trainer, TrainingArguments, AutoModelForTokenClassification, AutoModelForSequenceClassification
from transformers import DataCollatorForTokenClassification, DataCollatorWithPadding
import numpy as np
import evaluate
dataset = load_dataset("eriktks/conll2003",trust_remote_code=True)
# 1000 examples for training
sub_train_dataset = dataset['train'].shuffle(seed=42).select(range(1000))
# 500 examples for evaluation
sub_val_dataset = dataset['validation'].shuffle(seed=42).select(range(500))
print(sub_train_dataset['tokens'][0])
print(sub_train_dataset['ner_tags'][0])
['"', 'Neither', 'the', 'National', 'Socialists', '(', 'Nazis', ')', 'nor', 'the', 'communists', 'dared', 'to', 'kidnap', 'an', 'American', 'citizen', ',', '"', 'he', 'shouted', ',', 'in', 'an', 'oblique', 'reference', 'to', 'his', 'extradition', 'to', 'Germany', 'from', 'Denmark', '.', '"']
[0, 0, 0, 7, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 5, 0, 0]
We now have a sequence of words and the corresponding sequence of labels. Let's map these labels to classes: in NER, when several words belong to the same entity, the first word gets the 'B-XXX' class and the following words of the entity get the 'I-XXX' class (for example, 'National Socialists' above is tagged 7 then 8, i.e. B-MISC then I-MISC, as the mapping below shows).
# Mapping between the integer labels and the class names
itos={0: 'O', 1:'B-PER', 2:'I-PER', 3:'B-ORG', 4:'I-ORG', 5:'B-LOC', 6:'I-LOC', 7:'B-MISC', 8:'I-MISC'}
stoi = {v: k for k, v in itos.items()}
print(stoi)
print(itos)
label_names=list(itos.values())
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
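For illustration, we can decode the numeric tags of the first example back into label names; the two-word entity "National Socialists" indeed appears as B-MISC followed by I-MISC:

# Illustrative: decode the integer tags of the first training example into label names
print([itos[tag] for tag in sub_train_dataset['ner_tags'][0]])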
We now load BERT's tokenizer. This is the class that converts a sentence into a sequence of tokens.
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
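Before going further, it helps to see that the tokenizer may split a single word into several sub-tokens, and that word_ids() maps every sub-token back to the index of its word (None for special tokens like [CLS] and [SEP]). A quick illustrative check on the first example (the exact sub-tokens depend on the vocabulary):

# Illustrative: inspect the sub-tokens and the word index of each sub-token
encoding = tokenizer(sub_train_dataset['tokens'][0], is_split_into_words=True)
print(encoding.tokens())
print(encoding.word_ids())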
The tokenizer turns the sentence into tokens, but the labels also have to be adapted: each token must get the right label. The following function aligns the labels with the tokens.
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word
            current_word = word_id
            # -100 for special tokens
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # -100 for special tokens
            new_labels.append(-100)
        else:
            # Tokens of the same word share its label (except the first one)
            label = labels[word_id]
            # B- for the first token of a word, I- for the following ones (cf. itos)
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels
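As a quick illustrative check, we can align the labels of the first training example by hand: -100 marks the special tokens, and the extra sub-tokens of an entity's first word receive I- labels.

# Illustrative: align the labels of the first example with its sub-tokens
word_ids = tokenizer(sub_train_dataset['tokens'][0], is_split_into_words=True).word_ids()
print(align_labels_with_tokens(sub_train_dataset['ner_tags'][0], word_ids))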
We can now turn our sequences into tokens and get the corresponding labels:
def tokenize_and_align_labels(examples):
    # Tokenize the sentences
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = examples["ner_tags"]
    new_labels = []
    # Align the labels with the tokens
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs
# Apply the function to the training and validation data
train_tokenized_datasets = sub_train_dataset.map(
tokenize_and_align_labels,
batched=True,
)
val_tokenized_datasets = sub_val_dataset.map(
tokenize_and_align_labels,
batched=True,
)
Map: 100%|██████████| 1000/1000 [00:00<00:00, 12651.62 examples/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 11565.07 examples/s]
Let's create our BERT model. Hugging Face lets us instantiate a model for token-level classification directly with AutoModelForTokenClassification.
model = AutoModelForTokenClassification.from_pretrained("google-bert/bert-base-uncased", id2label=itos, label2id=stoi)
Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Let's also define a function to compute accuracy and F1-score on the validation data.
metric = evaluate.load("seqeval")
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Drop the -100 labels (special tokens)
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "accuracy": all_metrics["overall_accuracy"],
        "f1": all_metrics["overall_f1"],
    }
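To get a feel for what seqeval returns, here is a small illustrative call on toy label sequences (the lists of strings are invented for the example): it reports per-entity scores plus the overall_accuracy and overall_f1 keys used above.

# Illustrative: seqeval takes lists of label strings, one list per sentence
example_preds = [["O", "B-LOC", "I-LOC", "O"]]
example_refs = [["O", "B-LOC", "O", "O"]]
print(metric.compute(predictions=example_preds, references=example_refs))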
We are ready to train our model! To do so, we use the Hugging Face Trainer.
# Training is configured through TrainingArguments; many parameters can be changed, but the defaults are often sufficient
args = TrainingArguments(
    output_dir="./models",
    eval_strategy="no",
    save_strategy="no",
    num_train_epochs=5,
    weight_decay=0.01,
)
# DataCollatorForTokenClassification adds padding so that all sequences in a batch have the same length
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=args,
    train_dataset=train_tokenized_datasets,  # Training dataset
    eval_dataset=val_tokenized_datasets,     # Evaluation dataset
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()
80%|████████  | 501/625 [03:08<00:46, 2.69it/s]
{'loss': 0.1273, 'grad_norm': 11.809627532958984, 'learning_rate': 1e-05, 'epoch': 4.0}
100%|██████████| 625/625 [03:55<00:00, 2.66it/s]
{'train_runtime': 235.0171, 'train_samples_per_second': 21.275, 'train_steps_per_second': 2.659, 'train_loss': 0.10341672458648682, 'epoch': 5.0}
TrainOutput(global_step=625, training_loss=0.10341672458648682, metrics={'train_runtime': 235.0171, 'train_samples_per_second': 21.275, 'train_steps_per_second': 2.659, 'total_flos': 106538246287344.0, 'train_loss': 0.10341672458648682, 'epoch': 5.0})
Training is finished; we can now evaluate the model on the validation data:
trainer.evaluate()
100%|██████████| 63/63 [00:06<00:00, 10.47it/s]
{'eval_loss': 0.10586605966091156,
'eval_accuracy': 0.9793857803954564,
'eval_f1': 0.902547065337763,
'eval_runtime': 6.1292,
'eval_samples_per_second': 81.577,
'eval_steps_per_second': 10.279,
'epoch': 5.0}
We get very good scores: an accuracy of 0.98 and an F1-score of 0.90.
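To try the fine-tuned model on new text, one option is the Hugging Face pipeline API. A minimal sketch, assuming the model and tokenizer from above are still in memory (the example sentence is invented):

from transformers import pipeline

# Illustrative: wrap the fine-tuned model in a token-classification pipeline
ner_pipe = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(ner_pipe("Angela Merkel flew from Berlin to Washington."))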
Sentiment analysis#
Let's now move on to a sentence-level classification task: sentiment analysis. For this we use the IMDB dataset, a collection of positive and negative movie reviews. The goal of the model is to determine whether a review is positive or negative.
dataset = load_dataset("stanfordnlp/imdb",trust_remote_code=True)
# 1000 éléments pour l'entraßnement
sub_train_dataset = dataset['train'].shuffle(seed=42).select(range(1000))
# 500 éléments pour l'évaluation
sub_val_dataset = dataset['test'].shuffle(seed=42).select(range(500)) # 500 examples for evaluation
print(sub_train_dataset['text'][0])
print(sub_train_dataset['label'][0])
itos={0: 'neg', 1:'pos'}
stoi = {v: k for k, v in itos.items()}
There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
1
We can reuse the same tokenizer as in the previous section to extract the tokens from our text. Here there is no need to match a label to each token, since the label applies to the whole review.
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, is_split_into_words=False)
tokenized_train_dataset = sub_train_dataset.map(preprocess_function, batched=True)
tokenized_val_dataset = sub_val_dataset.map(preprocess_function, batched=True)
print(tokenized_train_dataset['input_ids'][0])
print(tokenized_train_dataset['label'][0])
Map: 100%|██████████| 1000/1000 [00:00<00:00, 4040.25 examples/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 5157.73 examples/s]
[101, 2045, 2003, 2053, 7189, 2012, 2035, 2090, 3481, 3771, 1998, 6337, 2099, 2021, 1996, 2755, 2008, 2119, 2024, 2610, 2186, 2055, 6355, 6997, 1012, 6337, 2099, 3504, 15594, 2100, 1010, 3481, 3771, 3504, 4438, 1012, 6337, 2099, 14811, 2024, 3243, 3722, 1012, 3481, 3771, 1005, 1055, 5436, 2024, 2521, 2062, 8552, 1012, 1012, 1012, 3481, 3771, 3504, 2062, 2066, 3539, 8343, 1010, 2065, 2057, 2031, 2000, 3962, 12319, 1012, 1012, 1012, 1996, 2364, 2839, 2003, 5410, 1998, 6881, 2080, 1010, 2021, 2031, 1000, 17936, 6767, 7054, 3401, 1000, 1012, 2111, 2066, 2000, 12826, 1010, 2000, 3648, 1010, 2000, 16157, 1012, 2129, 2055, 2074, 9107, 1029, 6057, 2518, 2205, 1010, 2111, 3015, 3481, 3771, 3504, 2137, 2021, 1010, 2006, 1996, 2060, 2192, 1010, 9177, 2027, 9544, 2137, 2186, 1006, 999, 999, 999, 1007, 1012, 2672, 2009, 1005, 1055, 1996, 2653, 1010, 2030, 1996, 4382, 1010, 2021, 1045, 2228, 2023, 2186, 2003, 2062, 2394, 2084, 2137, 1012, 2011, 1996, 2126, 1010, 1996, 5889, 2024, 2428, 2204, 1998, 6057, 1012, 1996, 3772, 2003, 2025, 23105, 2012, 2035, 1012, 1012, 1012, 102]
1
We can now create our model. Note that for this task we use DistilBERT, a smaller and faster variant of BERT.
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2, id2label=itos, label2id=stoi)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
We can now define our metrics function:
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy_score = accuracy.compute(predictions=predictions, references=labels)
    f1_score = f1.compute(predictions=predictions, references=labels, average="macro")
    return {
        "f1": f1_score["f1"],
        "accuracy": accuracy_score["accuracy"],
    }
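As an illustration of what these metrics expect, both take integer predictions and references (the values below are invented for the example):

# Illustrative: toy call of the two evaluate metrics
print(accuracy.compute(predictions=[1, 0, 1], references=[1, 1, 1]))
print(f1.compute(predictions=[1, 0, 1], references=[1, 1, 1], average="macro"))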
And train the model:
training_args = TrainingArguments(
output_dir="models",
num_train_epochs=5,
weight_decay=0.01,
eval_strategy="no",
save_strategy="no",
)
# Pad the inputs to the maximum length in the batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_train_dataset,
eval_dataset=tokenized_val_dataset,
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)
trainer.train()
80%|████████  | 500/625 [12:05<02:58, 1.43s/it]
{'loss': 0.2295, 'grad_norm': 0.022515051066875458, 'learning_rate': 1e-05, 'epoch': 4.0}
100%|██████████| 625/625 [15:06<00:00, 1.45s/it]
{'train_runtime': 906.0393, 'train_samples_per_second': 5.519, 'train_steps_per_second': 0.69, 'train_loss': 0.1885655658721924, 'epoch': 5.0}
TrainOutput(global_step=625, training_loss=0.1885655658721924, metrics={'train_runtime': 906.0393, 'train_samples_per_second': 5.519, 'train_steps_per_second': 0.69, 'total_flos': 613576571755968.0, 'train_loss': 0.1885655658721924, 'epoch': 5.0})
Let's evaluate our model:
trainer.evaluate()
100%|██████████| 63/63 [00:33<00:00, 1.86it/s]
{'eval_loss': 0.565979540348053,
'eval_f1': 0.8879354508196722,
'eval_accuracy': 0.888,
'eval_runtime': 34.4579,
'eval_samples_per_second': 14.51,
'eval_steps_per_second': 1.828,
'epoch': 5.0}
We get good scores: an accuracy of 0.89 and an F1-score of 0.89 as well.
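As with NER, the fine-tuned model can be tried on new text with a pipeline. A minimal sketch, assuming the model and tokenizer from above are still in memory (the review is invented):

from transformers import pipeline

# Illustrative: wrap the fine-tuned model in a text-classification pipeline
clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(clf("A wonderful film, the acting is superb!"))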