Natural Language Processing with Transformers#
In this notebook, we use the Hugging Face Transformers library for natural language processing (NLP). The most powerful language models (GPT, Llama, etc.) are very memory-intensive and often unusable on a laptop. Therefore, we limit ourselves to smaller, less powerful models.
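To give a rough sense of why model size matters, the sketch below estimates the memory needed just to hold a model's weights. It assumes 4 bytes per parameter (32-bit floats) and ignores activations and other overhead. The parameter counts are approximations read off the model names used in this notebook (400M, 600M), plus a hypothetical 7-billion-parameter model for comparison; `weights_memory_gb` is an illustrative helper, not a library function.

```python
def weights_memory_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Approximate memory (in GB, with 1 GB = 1e9 bytes) needed to store the weights."""
    return n_params * bytes_per_param / 1e9


# Parameter counts are assumptions based on the model names, not measured values.
approx_sizes = {
    "blenderbot-400M-distill (~400M params, assumed)": 400e6,
    "nllb-200-distilled-600M (~600M params, assumed)": 600e6,
    "hypothetical 7B-parameter model": 7e9,
}

for name, n_params in approx_sizes.items():
    print(f"{name}: ~{weights_memory_gb(n_params):.1f} GB in float32")
```

Under these assumptions, the small models fit in a couple of gigabytes, while a multi-billion-parameter model needs tens of gigabytes for its weights alone.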
ChatBot#
The most common use of language models (LLMs) today is the ChatBot, a virtual assistant that answers our questions. With Hugging Face, you can create your own local ChatBot as follows.
We use a lightweight version of BlenderBot from Meta (facebook/blenderbot-400M-distill).
Implementation#
from transformers import pipeline
chatbot = pipeline(task="conversational", model="facebook/blenderbot-400M-distill")
This ChatBot only understands English, so ask it questions in English.
from transformers import Conversation
user_message = """What is the best french deep learning course?"""
conversation = Conversation(user_message)
conversation = chatbot(conversation)
print(conversation)
No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.
Conversation id: 44c34bd3-ea1b-44b6-bd54-9127133cc941
user: What is the best  french deep learning course?
assistant:  I'm not sure, but I do know that French is one of the most widely spoken languages in the world.
As you can see, the model is poorly trained and does not know that the best Deep Learning course is this one.
If you want to ask another question, create a new Conversation and pass it to the chatbot; the answer is printed directly.
conversation = Conversation("What is the most tasty fruit?")
print(chatbot(conversation))
Conversation id: d258da22-78e4-4621-a0e1-90776454a595
user: What is the most tasty fruit?
assistant:  I would have to say watermelon. It is so juicy and juicy.
If you want to continue the conversation, use the add_message method.
# Specify the role (user) and add your message to the existing conversation
conversation.add_message({"role": "user", "content": """What else do you recommend?"""})
print(chatbot(conversation))
Conversation id: c3e1a64c-5b40-4808-8632-38d9df14ed9d
user: What is the most tasty fruit?
assistant:  I would have to say watermelon. It is so juicy and juicy.
user: What else do you recommend?
assistant:  I would say mangos are pretty good too. They are sweet and tangy.
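Conceptually, a conversation is an ordered list of messages, each with a role and a content, as the `add_message` call above suggests. The sketch below mimics that structure in plain Python without loading any model. `echo_bot` is a hypothetical stand-in for the real chatbot, and `chat_turn` is an illustrative helper, not part of Transformers.

```python
from typing import Callable

Message = dict[str, str]


def echo_bot(history: list[Message]) -> str:
    """Hypothetical stand-in for a model: replies based on the last user message."""
    last_user = [m for m in history if m["role"] == "user"][-1]
    return f"You asked: {last_user['content']}"


def chat_turn(history: list[Message], user_text: str,
              bot: Callable[[list[Message]], str]) -> list[Message]:
    """Append the user message, then the assistant reply, and return the history."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": bot(history)})
    return history


history: list[Message] = []
chat_turn(history, "What is the most tasty fruit?", echo_bot)
chat_turn(history, "What else do you recommend?", echo_bot)

for message in history:
    print(f"{message['role']}: {message['content']}")
```

Because the whole history is passed to the bot at every turn, a real model can use earlier messages as context when it generates its next reply.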
You now know how to use a ChatBot with the Hugging Face Transformers library.
Translation#
Next, we implement a translator. We use Facebook's No Language Left Behind model (facebook/nllb-200-distilled-600M), which translates between about 200 languages. To save memory, we use the distilled 600M version of the model.
Implementation#
traducteur = pipeline(task="translation", model="facebook/nllb-200-distilled-600M")
text = """Le meilleur cours de d'apprentissage profond est celui-ci."""
text_translated = traducteur(text, src_lang="fra_Latn", tgt_lang="eng_Latn")
print("Text in English: ", text_translated[0]["translation_text"])
text_translated = traducteur(text, src_lang="fra_Latn", tgt_lang="jpn_Jpan")
print("Text in Japanese: ", text_translated[0]["translation_text"])
Text in English:  The best course of deep learning is this one.
Text in Japanese:  深い学習の最高のコースはこれです
The translation is very good (at least for English; I don’t have the expertise for Japanese!). You can also test other language combinations by specifying the correct code, which you can find on this page.
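The language codes used above combine a three-letter language code with a four-letter script code, separated by an underscore. A small lookup table can make calls more readable. This is only a sketch: `LANG_CODES`, `to_code` and `split_code` are illustrative names, and the table lists just the three codes that appear in this notebook.

```python
# Only the codes used in this notebook; extend the table as needed.
LANG_CODES = {
    "french": "fra_Latn",
    "english": "eng_Latn",
    "japanese": "jpn_Jpan",
}


def to_code(language: str) -> str:
    """Return the code for a language name, with a clear error for unknown names."""
    key = language.strip().lower()
    if key not in LANG_CODES:
        raise KeyError(f"No code registered for {language!r}; add it to LANG_CODES.")
    return LANG_CODES[key]


def split_code(code: str) -> tuple[str, str]:
    """Split a code such as 'fra_Latn' into (language, script)."""
    language, script = code.split("_")
    return language, script


print(to_code("French"), split_code(to_code("Japanese")))
```

The result can then be passed as `src_lang` or `tgt_lang`, for example `traducteur(text, src_lang=to_code("french"), tgt_lang=to_code("english"))`.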
Text Summarization#
Another useful NLP task is text summarization. The model must be able to extract the most important information from a text. For this, we use the BART model from Facebook (facebook/bart-large-cnn).
resumeur = pipeline(task="summarization", model="facebook/bart-large-cnn")
text = "Troyes is a beautiful city. Troyes is a commune and the capital of the department of Aube in the Grand Est region of north-central France. It is located on the Seine river about 140 km (87 mi) south-east of Paris. Troyes is situated within the Champagne wine region and is near to the Orient Forest Regional Natural Park.Troyes had a population of 61,996 inhabitants in 2018. It is the center of the Communauté d'agglomération Troyes Champagne Métropole, which was home to 170,145 inhabitants."
summary = resumeur(text, min_length=10, max_length=100)
print("Summary of the text: ", summary[0]["summary_text"])
Summary of the text:  Troyes is a commune and the capital of the department of Aube in the Grand Est region of north-central France. It is located on the Seine river about 140 km (87 mi) south-east of Paris. Troyes had a population of 61,996 inhabitants in 2018.
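The summary is not perfect, as it is a relatively small model and it mostly copies whole sentences from the input, but it still managed to extract the key information (administrative role, location, population) and remove the “less important” elements.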
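One way to see what the model kept and what it dropped is to check whether each sentence of the summary appears word for word in the input. The sketch below does this in plain Python with the text and summary from above. The sentence splitting is a naive regular expression chosen for this example, and `split_sentences` is an illustrative helper.

```python
import re

text = (
    "Troyes is a beautiful city. Troyes is a commune and the capital of the "
    "department of Aube in the Grand Est region of north-central France. It is "
    "located on the Seine river about 140 km (87 mi) south-east of Paris. Troyes "
    "is situated within the Champagne wine region and is near to the Orient "
    "Forest Regional Natural Park.Troyes had a population of 61,996 inhabitants "
    "in 2018. It is the center of the Communauté d'agglomération Troyes "
    "Champagne Métropole, which was home to 170,145 inhabitants."
)
summary = (
    "Troyes is a commune and the capital of the department of Aube in the Grand "
    "Est region of north-central France. It is located on the Seine river about "
    "140 km (87 mi) south-east of Paris. Troyes had a population of 61,996 "
    "inhabitants in 2018."
)


def split_sentences(paragraph: str) -> list[str]:
    """Naive splitter: cut after a period that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", paragraph) if s.strip()]


summary_sentences = split_sentences(summary)
copied = [s for s in summary_sentences if s in text]

print(f"{len(copied)} of {len(summary_sentences)} summary sentences are copied verbatim")
print(f"Length: {len(summary.split())} words vs {len(text.split())} words in the input")
```

Here all three summary sentences are found verbatim in the input, so the model mostly selected sentences rather than rephrasing them.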
Sentence Embedding#
An important aspect of NLP that we saw in the course is embedding. Recall that an embedding projects tokens (words or characters, for example) into a latent space in which similar items lie close together. Words like “dog” and “cat” will be close in the latent space, while “dog” and “is” will be far apart. We can use these embeddings to calculate the similarity between two sentences. For this, we use the sentence_transformers library, which extracts sentence embeddings from a pre-trained model.
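Similarity between two embeddings is commonly measured with cosine similarity: the dot product of the two vectors divided by the product of their norms. Here is a minimal pure-Python sketch. The three-dimensional vectors are made-up toy values for illustration, not real embeddings from any model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b (1 = same direction, 0 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (invented values, for illustration only)
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
is_ = [0.1, 0.0, 0.9]

print(f"dog vs cat: {cosine_similarity(dog, cat):.3f}")
print(f"dog vs is:  {cosine_similarity(dog, is_):.3f}")
```

With these toy values, the "dog" and "cat" vectors point in nearly the same direction and get a much higher score than "dog" and "is".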
We use the all-MiniLM-L6-v2 model.
from sentence_transformers import SentenceTransformer
from sentence_transformers import util
model = SentenceTransformer("all-MiniLM-L6-v2")
We will look at the similarity between different sentences.
sentences1 = ['The cat is chasing the mouse', 'A man is watching the television', 'The latest movie is awesome']
sentences2 = ['The dog sleeps in the kitchen', 'A boy watches TV', 'The new movie is so great']
# Encode both lists of sentences into embedding tensors
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)
# Cosine similarity between every sentence of the first list and every sentence of the second
cosine_scores = util.cos_sim(embeddings1, embeddings2)
# Print the score of each pair (sentences1[i], sentences2[i])
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
                                                  sentences2[i],
                                                  cosine_scores[i][i]))
The cat is chasing the mouse 		 The dog sleeps in the kitchen 		 Score: 0.0601
A man is watching the television 		 A boy watches TV 		 Score: 0.7207
The latest movie is awesome 		 The new movie is so great 		 Score: 0.7786
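As you can see, sentences close in meaning have similar embeddings: the pairs about watching TV and about the movie score above 0.7, while the unrelated cat and dog sentences score close to 0. This model is therefore a good choice for extracting embeddings, and having a good embedding model is often a first step in an NLP project.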
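Note that `util.cos_sim` returns a full matrix of scores (every sentence of the first list against every sentence of the second), of which the loop above only reads the diagonal. The sketch below shows how one might pick the best match per row from such a matrix. It uses a plain nested list: the diagonal reuses the scores printed above, the off-diagonal values are invented placeholders, and `best_matches` is an illustrative helper.

```python
def best_matches(scores: list[list[float]]) -> list[tuple[int, float]]:
    """For each row, return (column index, score) of the highest-scoring column."""
    results = []
    for row in scores:
        best_col = max(range(len(row)), key=lambda j: row[j])
        results.append((best_col, row[best_col]))
    return results


# Diagonal taken from the output above; off-diagonal values are invented placeholders.
scores = [
    [0.0601, 0.0100, 0.0200],
    [0.0300, 0.7207, 0.0400],
    [0.0500, 0.0600, 0.7786],
]

for i, (j, score) in enumerate(best_matches(scores)):
    print(f"sentence {i} of list 1 best matches sentence {j} of list 2 (score {score:.4f})")
```

The same idea extends to semantic search: encode a query and a collection of documents, then keep the documents with the highest scores.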