Natural Language Processing with Transformers#
In this notebook, we will use Hugging Face's Transformers library for natural language processing (NLP). The most powerful current language models (GPT, Llama, etc.) have very high memory requirements and usually cannot run on a laptop. We will therefore use smaller models with somewhat weaker performance.
Chatbot#
The most common application of large language models (LLMs) today is the chatbot: a virtual assistant that answers our questions. With Hugging Face, you can create your own chatbot locally by following the steps below.
We will use a lightweight version of Meta's BlenderBot model (facebook/blenderbot-400M-distill).
Implementation#
from transformers import pipeline
chatbot = pipeline(task="conversational", model="facebook/blenderbot-400M-distill")
This chatbot only supports English, so ask your questions in English.
from transformers import Conversation
user_message = """What is the best french deep learning course?"""
conversation = Conversation(user_message)
conversation = chatbot(conversation)
print(conversation)
Conversation id: 44c34bd3-ea1b-44b6-bd54-9127133cc941
user: What is the best french deep learning course?
assistant: I'm not sure, but I do know that French is one of the most widely spoken languages in the world.
As you can see, the model is not well trained; it doesn't even know that the best deep learning course is this one.
If you want to ask further questions, you can use the following commands and get an answer in a single line of code.
conversation = Conversation("What is the most tasty fruit?")
print(chatbot(conversation))
Conversation id: d258da22-78e4-4621-a0e1-90776454a595
user: What is the most tasty fruit?
assistant: I would have to say watermelon. It is so juicy and juicy.
If you want to continue the conversation, you can use the following function.
# Specify the role ("user") and append your message to the existing conversation
conversation.add_message({"role": "user", "content": """What else do you recommend?"""})
print(chatbot(conversation))
Conversation id: c3e1a64c-5b40-4808-8632-38d9df14ed9d
user: What is the most tasty fruit?
assistant: I would have to say watermelon. It is so juicy and juicy.
user: What else do you recommend?
assistant: I would say mangos are pretty good too. They are sweet and tangy.
You now know how to create a chatbot with Hugging Face's Transformers library.
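Under the hood, multi-turn state is just a growing list of role/content messages, as the `add_message` call above shows. Here is a minimal sketch of a chat loop around that idea; the stand-in `reply_fn` is hypothetical (running BlenderBot itself requires the model), and with the real pipeline it would wrap `chatbot(...)` and return the assistant's last message.

```python
def chat(reply_fn, user_messages):
    """Run a multi-turn conversation, keeping the history as role/content
    dicts -- the same message format Conversation.add_message uses."""
    history = []
    for msg in user_messages:
        history.append({"role": "user", "content": msg})
        # With the real pipeline, reply_fn would call chatbot(...) on the
        # conversation so far and return the assistant's answer.
        answer = reply_fn(history)
        history.append({"role": "assistant", "content": answer})
    return history

# Stand-in reply function, just to show the loop's mechanics.
history = chat(lambda h: f"(reply to: {h[-1]['content']})",
               ["What is the most tasty fruit?", "What else do you recommend?"])
for m in history:
    print(m["role"], ":", m["content"])
```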
Translation#
Next, we will see how to implement a translator. We will use Meta's No Language Left Behind model (facebook/nllb-200-distilled-600M), which can translate between any pair of its 200 supported languages. To save memory, we use the distilled version.
Implementation#
traducteur = pipeline(task="translation", model="facebook/nllb-200-distilled-600M")
text = """Le meilleur cours de d'apprentissage profond est celui-ci."""
text_translated = traducteur(text, src_lang="fra_Latn", tgt_lang="eng_Latn")
print("Text in English: ", text_translated[0]["translation_text"])
text_translated = traducteur(text, src_lang="fra_Latn", tgt_lang="jpn_Jpan")
print("Text in Japanese: ", text_translated[0]["translation_text"])
Text in English:  The best course of deep learning is this one.
Text in Japanese:  深い学習の最高のコースはこれです
The translation works well (at least for English; I can't vouch for the Japanese!). You can also try other language pairs by specifying the correct language codes, which you can find on this page.
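NLLB uses FLORES-200 codes, which combine a language tag with a script suffix (as in `fra_Latn` and `jpn_Jpan` above). A small sketch of a lookup helper; the `NLLB_CODES` dict and `translate` function are illustrative names, and the commented line shows how they would plug into the `traducteur` pipeline above.

```python
# A few FLORES-200 codes accepted by NLLB (language tag + script suffix).
NLLB_CODES = {
    "French": "fra_Latn",
    "English": "eng_Latn",
    "Japanese": "jpn_Jpan",
    "Spanish": "spa_Latn",
    "German": "deu_Latn",
    "Simplified Chinese": "zho_Hans",
    "Korean": "kor_Hang",
}

def translate(text, src="French", tgt="Spanish"):
    """Build the src_lang/tgt_lang arguments from human-readable names."""
    kwargs = dict(src_lang=NLLB_CODES[src], tgt_lang=NLLB_CODES[tgt])
    # With the pipeline above: return traducteur(text, **kwargs)
    return kwargs

print(translate("Bonjour"))  # {'src_lang': 'fra_Latn', 'tgt_lang': 'spa_Latn'}
```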
Text Summarization#
Another important NLP task is text summarization: the model must be able to extract the key information from a text. For this, we will use Meta's BART model (facebook/bart-large-cnn).
resumeur = pipeline(task="summarization", model="facebook/bart-large-cnn")
text= "Troyes is a beautiful city. Troyes is a commune and the capital of the department of Aube in the Grand Est region of north-central France. It is located on the Seine river about 140 km (87 mi) south-east of Paris. Troyes is situated within the Champagne wine region and is near to the Orient Forest Regional Natural Park.Troyes had a population of 61,996 inhabitants in 2018. It is the center of the Communauté d'agglomération Troyes Champagne Métropole, which was home to 170,145 inhabitants."
summary = resumeur(text, min_length=10, max_length=100)
print("Summary of the text: ", summary[0]["summary_text"])
Summary of the text:  Troyes is a commune and the capital of the department of Aube in the Grand Est region of north-central France. It is located on the Seine river about 140 km (87 mi) south-east of Paris. Troyes had a population of 61,996 inhabitants in 2018.
Since this is a small model, the summary is not perfect, but it still extracts the key information and drops the "secondary" details.
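One practical limitation: BART can only read inputs up to a fixed token limit (around 1024 tokens), so longer documents must be split before summarization. A naive word-based chunking sketch (the `chunk_text` helper is illustrative; words are only a rough proxy for tokens, and you would then call `resumeur` on each chunk):

```python
def chunk_text(text, max_words=400):
    """Naively split a long text into chunks of at most max_words words,
    so each chunk fits under the model's input limit."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = chunk_text("word " * 1000)
print(len(chunks))  # 3 chunks: 400 + 400 + 200 words
# With the pipeline above, you could then summarize each chunk:
#   partial = [resumeur(c, min_length=10, max_length=100)[0]["summary_text"]
#              for c in chunks]
```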
Sentence Embeddings#
An important aspect of NLP that we covered in the course is embeddings. As a reminder: an embedding is the projection of our tokens (words, characters, etc.) into a latent space. Words with similar meanings (such as "dog" and "cat") end up close together in this latent space, while semantically distant words (such as "dog" and "is") end up far apart.
We can use these embeddings to compute the similarity between two sentences. For this, we will use the sentence_transformers library, which extracts embeddings from pretrained models.
We will use the all-MiniLM-L6-v2 model.
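The similarity measure used below is cosine similarity: the dot product of two vectors divided by the product of their norms, giving a score in [-1, 1]. A minimal pure-Python version, just to make explicit what `util.cos_sim` computes:

```python
import math

def cos_sim(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cos_sim([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cos_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```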
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
Let's compare the similarity of several pairs of sentences.
sentences1 = ['The cat is chasing the mouse', 'A man is watching the television', 'The latest movie is awesome']
sentences2 = ['The dog sleeps in the kitchen', 'A boy watches TV', 'The new movie is so great']
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)
cosine_scores = util.cos_sim(embeddings1, embeddings2)
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
                                                 sentences2[i],
                                                 cosine_scores[i][i]))
The cat is chasing the mouse The dog sleeps in the kitchen Score: 0.0601
A man is watching the television A boy watches TV Score: 0.7207
The latest movie is awesome The new movie is so great Score: 0.7786
As you can see, sentences with similar meanings have very similar embeddings, so this model is well suited for extracting embeddings. In an NLP project, a good embedding model is often the first building block.
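A typical use of such an embedding model is semantic search: embed a corpus once, then rank its sentences by cosine similarity to a query (sentence_transformers also provides `util.semantic_search` for this). A sketch of the principle, where the toy 3-d vectors are hypothetical stand-ins for real `model.encode()` outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 3-d vectors standing in for model.encode() outputs.
corpus = {
    "A boy watches TV": [0.9, 0.1, 0.0],
    "The new movie is so great": [0.1, 0.9, 0.1],
    "The dog sleeps in the kitchen": [0.0, 0.2, 0.9],
}
# Pretend embedding of the query "A man is watching the television".
query_vec = [0.8, 0.2, 0.0]

# Rank corpus sentences from most to least similar to the query.
ranked = sorted(corpus, key=lambda s: cosine(query_vec, corpus[s]), reverse=True)
print(ranked[0])  # -> "A boy watches TV"
```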