Fine-tuning BERT with Hugging Face#
The Hugging Face Transformers library provides tools to fine-tune models simply and efficiently. In this notebook, we show how to use Hugging Face to fine-tune BERT on two tasks: named entity recognition (token-level classification) and sentiment analysis (sentence-level classification).
Named entity recognition#
Let's start with a token-level classification task: named entity recognition (NER). For this, we use the CoNLL-2003 dataset. For this example, we only keep 1,000 items from the dataset.
from datasets import load_dataset
from transformers import AutoTokenizer, Trainer, TrainingArguments, AutoModelForTokenClassification, AutoModelForSequenceClassification
from transformers import DataCollatorForTokenClassification, DataCollatorWithPadding
import numpy as np
import evaluate
dataset = load_dataset("eriktks/conll2003",trust_remote_code=True)
# 1000 éléments pour l'entraînement
sub_train_dataset = dataset['train'].shuffle(seed=42).select(range(1000))
# 500 éléments pour l'évaluation
sub_val_dataset = dataset['validation'].shuffle(seed=42).select(range(500)) # 500 examples for evaluation
print(sub_train_dataset['tokens'][0])
print(sub_train_dataset['ner_tags'][0])
['"', 'Neither', 'the', 'National', 'Socialists', '(', 'Nazis', ')', 'nor', 'the', 'communists', 'dared', 'to', 'kidnap', 'an', 'American', 'citizen', ',', '"', 'he', 'shouted', ',', 'in', 'an', 'oblique', 'reference', 'to', 'his', 'extradition', 'to', 'Germany', 'from', 'Denmark', '.', '"']
[0, 0, 0, 7, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 5, 0, 0]
We have our sequence of words and the corresponding sequence of labels. Now let's map our labels to classes: in NER, when several words belong to the same entity, the first word gets the class “B-XXX” and the following words of the entity get the class “I-XXX”.
# Map the integer label ids to their class names
itos={0: 'O', 1:'B-PER', 2:'I-PER', 3:'B-ORG', 4:'I-ORG', 5:'B-LOC', 6:'I-LOC', 7:'B-MISC', 8:'I-MISC'}
stoi = {v: k for k, v in itos.items()}
print(stoi)
print(itos)
label_names=list(itos.values())
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
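For instance, we can look up the tags of the first training example printed above to see which words are marked as entities (a small optional check, just for illustration):
# Optional check: print the words of the first example that carry an entity tag
for word, tag in zip(sub_train_dataset['tokens'][0], sub_train_dataset['ner_tags'][0]):
    if tag != 0:
        print(word, itos[tag])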
Now let's load the BERT tokenizer. This is the class that will convert our sentence into a sequence of tokens.
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
The tokenizer turns the sentence into tokens, but the labels also have to be adapted: each token must get the right label. The following function aligns the labels with the tokens.
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word
            current_word = word_id
            # -100 for special tokens
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # -100 for special tokens
            new_labels.append(-100)
        else:
            # Tokens of the same word get the same label (except the first one)
            label = labels[word_id]
            # B- for the first token of a word, I- for the following ones (see itos)
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels
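Before mapping this over the whole dataset, we can sanity-check the alignment on a single sentence. With a fast tokenizer, word_ids() tells us which original word each token comes from (None for special tokens such as [CLS] and [SEP]). This snippet is purely illustrative:
# Illustrative check on the first training example
example = sub_train_dataset[0]
encoding = tokenizer(example["tokens"], is_split_into_words=True)
print(encoding.tokens())     # sub-word tokens, including [CLS] and [SEP]
print(encoding.word_ids())   # original word index for each token (None for special tokens)
print(align_labels_with_tokens(example["ner_tags"], encoding.word_ids()))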
Now we can turn our sequences into tokens and get the corresponding labels:
def tokenize_and_align_labels(examples):
    # Tokenize the sentences
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = examples["ner_tags"]
    new_labels = []
    # Align the labels with the tokens
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs
# Apply the function to the training and validation data
train_tokenized_datasets = sub_train_dataset.map(
    tokenize_and_align_labels,
    batched=True,
)
val_tokenized_datasets = sub_val_dataset.map(
    tokenize_and_align_labels,
    batched=True,
)
Map: 100%|██████████| 1000/1000 [00:00<00:00, 12651.62 examples/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 11565.07 examples/s]
Let's create our BERT model. Hugging Face lets us directly instantiate a model for token-level classification with AutoModelForTokenClassification.
model = AutoModelForTokenClassification.from_pretrained("google-bert/bert-base-uncased", id2label=itos, label2id=stoi)
Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Let's also define a function to compute the accuracy and F1-score on our validation data.
metric = evaluate.load("seqeval")
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Drop the -100 labels (special tokens and sub-word continuations)
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "accuracy": all_metrics["overall_accuracy"],
        "f1": all_metrics["overall_f1"],
    }
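As a quick sanity check, seqeval can also be called directly on small lists of tag sequences; the tags below are made up for illustration:
# Toy example with hypothetical tags: a perfect prediction gives an overall F1 of 1.0
print(metric.compute(
    predictions=[["B-PER", "I-PER", "O", "B-LOC"]],
    references=[["B-PER", "I-PER", "O", "B-LOC"]],
))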
We are ready to train our model! For this, we use the Hugging Face Trainer.
# Many training parameters can be tuned here, but the defaults are often good enough
args = TrainingArguments(
    output_dir="./models",
    eval_strategy="no",
    save_strategy="no",
    num_train_epochs=5,
    weight_decay=0.01,
)
# DataCollatorForTokenClassification adds padding so that all sequences in a batch have the same length
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=args,
    train_dataset=train_tokenized_datasets,  # training dataset
    eval_dataset=val_tokenized_datasets,     # evaluation dataset
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()
80%|████████ | 501/625 [03:08<00:46, 2.69it/s]
{'loss': 0.1273, 'grad_norm': 11.809627532958984, 'learning_rate': 1e-05, 'epoch': 4.0}
100%|██████████| 625/625 [03:55<00:00, 2.66it/s]
{'train_runtime': 235.0171, 'train_samples_per_second': 21.275, 'train_steps_per_second': 2.659, 'train_loss': 0.10341672458648682, 'epoch': 5.0}
TrainOutput(global_step=625, training_loss=0.10341672458648682, metrics={'train_runtime': 235.0171, 'train_samples_per_second': 21.275, 'train_steps_per_second': 2.659, 'total_flos': 106538246287344.0, 'train_loss': 0.10341672458648682, 'epoch': 5.0})
Training is done; we can now evaluate our model on the validation data:
trainer.evaluate()
100%|██████████| 63/63 [00:06<00:00, 10.47it/s]
{'eval_loss': 0.10586605966091156,
'eval_accuracy': 0.9793857803954564,
'eval_f1': 0.902547065337763,
'eval_runtime': 6.1292,
'eval_samples_per_second': 81.577,
'eval_steps_per_second': 10.279,
'epoch': 5.0}
We get very good results: an accuracy of 0.98 and an F1-score of 0.90.
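To try the fine-tuned model on new text, we could wrap it in a Hugging Face pipeline; the sentence below is a made-up example, not part of the dataset:
from transformers import pipeline

# Token-classification pipeline around the fine-tuned model; "simple" aggregation groups sub-word tokens into entities
ner_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(ner_pipeline("Angela Merkel travelled from Berlin to Paris on Tuesday."))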
Sentiment analysis#
Now let's move on to a sentence-level classification task: sentiment analysis. For this, we use the IMDB dataset, a collection of positive and negative movie reviews. The goal of the model is to determine whether a review is positive or negative.
dataset = load_dataset("stanfordnlp/imdb",trust_remote_code=True)
# 1000 éléments pour l'entraînement
sub_train_dataset = dataset['train'].shuffle(seed=42).select(range(1000))
# 500 éléments pour l'évaluation
sub_val_dataset = dataset['test'].shuffle(seed=42).select(range(500)) # 500 examples for evaluation
print(sub_train_dataset['text'][0])
print(sub_train_dataset['label'][0])
itos={0: 'neg', 1:'pos'}
stoi = {v: k for k, v in itos.items()}
There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
1
We can reuse the tokenizer from the previous part to extract the tokens from our text. Here, there is no need to align a label with each token, since the label applies to the whole sentence.
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, is_split_into_words=False)
tokenized_train_dataset = sub_train_dataset.map(preprocess_function, batched=True)
tokenized_val_dataset = sub_val_dataset.map(preprocess_function, batched=True)
print(tokenized_train_dataset['input_ids'][0])
print(tokenized_train_dataset['label'][0])
Map: 100%|██████████| 1000/1000 [00:00<00:00, 4040.25 examples/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 5157.73 examples/s]
[101, 2045, 2003, 2053, 7189, 2012, 2035, 2090, 3481, 3771, 1998, 6337, 2099, 2021, 1996, 2755, 2008, 2119, 2024, 2610, 2186, 2055, 6355, 6997, 1012, 6337, 2099, 3504, 15594, 2100, 1010, 3481, 3771, 3504, 4438, 1012, 6337, 2099, 14811, 2024, 3243, 3722, 1012, 3481, 3771, 1005, 1055, 5436, 2024, 2521, 2062, 8552, 1012, 1012, 1012, 3481, 3771, 3504, 2062, 2066, 3539, 8343, 1010, 2065, 2057, 2031, 2000, 3962, 12319, 1012, 1012, 1012, 1996, 2364, 2839, 2003, 5410, 1998, 6881, 2080, 1010, 2021, 2031, 1000, 17936, 6767, 7054, 3401, 1000, 1012, 2111, 2066, 2000, 12826, 1010, 2000, 3648, 1010, 2000, 16157, 1012, 2129, 2055, 2074, 9107, 1029, 6057, 2518, 2205, 1010, 2111, 3015, 3481, 3771, 3504, 2137, 2021, 1010, 2006, 1996, 2060, 2192, 1010, 9177, 2027, 9544, 2137, 2186, 1006, 999, 999, 999, 1007, 1012, 2672, 2009, 1005, 1055, 1996, 2653, 1010, 2030, 1996, 4382, 1010, 2021, 1045, 2228, 2023, 2186, 2003, 2062, 2394, 2084, 2137, 1012, 2011, 1996, 2126, 1010, 1996, 5889, 2024, 2428, 2204, 1998, 6057, 1012, 1996, 3772, 2003, 2025, 23105, 2012, 2035, 1012, 1012, 1012, 102]
1
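We can also decode these ids back to text to check what the tokenizer produced, in particular the [CLS] and [SEP] special tokens that BERT adds around the sequence (optional check):
# Decode the first 20 token ids of the first example back to text
print(tokenizer.decode(tokenized_train_dataset['input_ids'][0][:20]))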
We can now create our model. Note that here we use DistilBERT, a smaller, distilled version of BERT that is faster to fine-tune.
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2, id2label=itos, label2id=stoi)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
We can now define our function to compute the performance metrics:
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert the logits to predicted class ids
    predictions = np.argmax(predictions, axis=1)
    accuracy_score = accuracy.compute(predictions=predictions, references=labels)
    f1_score = f1.compute(predictions=predictions, references=labels, average="macro")
    return {
        "f1": f1_score["f1"],
        "accuracy": accuracy_score["accuracy"],
    }
And train the model:
training_args = TrainingArguments(
    output_dir="models",
    num_train_epochs=5,
    weight_decay=0.01,
    eval_strategy="no",
    save_strategy="no",
)
# Pad the inputs to the maximum length in the batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
80%|████████ | 500/625 [12:05<02:58, 1.43s/it]
{'loss': 0.2295, 'grad_norm': 0.022515051066875458, 'learning_rate': 1e-05, 'epoch': 4.0}
100%|██████████| 625/625 [15:06<00:00, 1.45s/it]
{'train_runtime': 906.0393, 'train_samples_per_second': 5.519, 'train_steps_per_second': 0.69, 'train_loss': 0.1885655658721924, 'epoch': 5.0}
TrainOutput(global_step=625, training_loss=0.1885655658721924, metrics={'train_runtime': 906.0393, 'train_samples_per_second': 5.519, 'train_steps_per_second': 0.69, 'total_flos': 613576571755968.0, 'train_loss': 0.1885655658721924, 'epoch': 5.0})
Let's evaluate our model:
trainer.evaluate()
100%|██████████| 63/63 [00:33<00:00, 1.86it/s]
{'eval_loss': 0.565979540348053,
'eval_f1': 0.8879354508196722,
'eval_accuracy': 0.888,
'eval_runtime': 34.4579,
'eval_samples_per_second': 14.51,
'eval_steps_per_second': 1.828,
'epoch': 5.0}
We get good results: an accuracy of 0.89 and an F1-score of 0.89 as well.
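As with NER, we could try the fine-tuned classifier on a new review through a pipeline; the review below is invented for illustration:
from transformers import pipeline

# Text-classification pipeline around the fine-tuned model
sentiment_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(sentiment_pipeline("A wonderful surprise: the acting is great and the plot kept me hooked until the end."))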