Natural Language Processing with Transformers#
In this notebook, we will use Hugging Face's Transformers library for natural language processing (NLP). The most powerful current language models (GPT, Llama, etc.) have very high memory requirements and usually cannot run on a laptop. We will therefore use smaller models with somewhat weaker performance.
Chatbot#
The most common application of large language models (LLMs) today is the chatbot: a virtual assistant that answers our questions. With Hugging Face, you can create your own chatbot locally by following the steps below.
We will use a lightweight version of Meta's BlenderBot model (facebook/blenderbot-400M-distill).
Implementation#
from transformers import pipeline
chatbot = pipeline(task="conversational", model="facebook/blenderbot-400M-distill")
This chatbot only supports English, so ask your questions in English.
from transformers import Conversation
user_message = """What is the best french deep learning course?"""
conversation = Conversation(user_message)
conversation = chatbot(conversation)
print(conversation)
Conversation id: 44c34bd3-ea1b-44b6-bd54-9127133cc941
user: What is the best french deep learning course?
assistant: I'm not sure, but I do know that French is one of the most widely spoken languages in the world.
As you can see, the model is not well trained; it doesn't even know that the best deep learning course is this one.
If you want to ask further questions, you can use the following commands and get an answer in a single line of code.
conversation = Conversation("What is the most tasty fruit?")
print(chatbot(conversation))
Conversation id: d258da22-78e4-4621-a0e1-90776454a595
user: What is the most tasty fruit?
assistant: I would have to say watermelon. It is so juicy and juicy.
If you want to continue the conversation, you can use the following function.
# Specify the role ("user") and append your message to the existing conversation
conversation.add_message({"role": "user", "content": """What else do you recommend?"""})
print(chatbot(conversation))
Conversation id: c3e1a64c-5b40-4808-8632-38d9df14ed9d
user: What is the most tasty fruit?
assistant: I would have to say watermelon. It is so juicy and juicy.
user: What else do you recommend?
assistant: I would say mangos are pretty good too. They are sweet and tangy.
You now know how to create a chatbot with Hugging Face's Transformers library.
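Under the hood, multi-turn state is just a growing list of role/content messages, as the `add_message` call above shows. Here is a minimal sketch of a chat loop around that idea; the stand-in `reply_fn` is hypothetical (running BlenderBot itself requires the model), and with the real pipeline it would wrap `chatbot(...)` and return the assistant's last message.

```python
def chat(reply_fn, user_messages):
    """Run a multi-turn conversation, keeping the history as role/content
    dicts -- the same message format Conversation.add_message uses."""
    history = []
    for msg in user_messages:
        history.append({"role": "user", "content": msg})
        # With the real pipeline, reply_fn would call chatbot(...) on the
        # conversation so far and return the assistant's answer.
        answer = reply_fn(history)
        history.append({"role": "assistant", "content": answer})
    return history

# Stand-in reply function, just to show the loop's mechanics.
history = chat(lambda h: f"(reply to: {h[-1]['content']})",
               ["What is the most tasty fruit?", "What else do you recommend?"])
for m in history:
    print(m["role"], ":", m["content"])
```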
Translation#
Next, we will see how to implement a translator. We will use Meta's No Language Left Behind model (facebook/nllb-200-distilled-600M), which can translate between any pair of its 200 supported languages. To save memory, we use the distilled version.
Implementation#
traducteur = pipeline(task="translation", model="facebook/nllb-200-distilled-600M")
text = """Le meilleur cours de d'apprentissage profond est celui-ci."""
text_translated = traducteur(text, src_lang="fra_Latn", tgt_lang="eng_Latn")
print("Text in English: ", text_translated[0]["translation_text"])
text_translated = traducteur(text, src_lang="fra_Latn", tgt_lang="jpn_Jpan")
print("Text in Japanese: ", text_translated[0]["translation_text"])
Text in English:  The best course of deep learning is this one.
Text in Japanese:  深い学習の最高のコースはこれです
The translation works well (at least for English; I can't vouch for the Japanese!). You can also try other language pairs by specifying the correct language codes, which you can find on this page.
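NLLB uses FLORES-200 codes, which combine a language tag with a script suffix (as in `fra_Latn` and `jpn_Jpan` above). A small sketch of a lookup helper; the `NLLB_CODES` dict and `translate` function are illustrative names, and the commented line shows how they would plug into the `traducteur` pipeline above.

```python
# A few FLORES-200 codes accepted by NLLB (language tag + script suffix).
NLLB_CODES = {
    "French": "fra_Latn",
    "English": "eng_Latn",
    "Japanese": "jpn_Jpan",
    "Spanish": "spa_Latn",
    "German": "deu_Latn",
    "Simplified Chinese": "zho_Hans",
    "Korean": "kor_Hang",
}

def translate(text, src="French", tgt="Spanish"):
    """Build the src_lang/tgt_lang arguments from human-readable names."""
    kwargs = dict(src_lang=NLLB_CODES[src], tgt_lang=NLLB_CODES[tgt])
    # With the pipeline above: return traducteur(text, **kwargs)
    return kwargs

print(translate("Bonjour"))  # {'src_lang': 'fra_Latn', 'tgt_lang': 'spa_Latn'}
```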
Text Summarization#
Another important NLP task is text summarization: the model must be able to extract the key information from a text. For this, we will use Meta's BART model (facebook/bart-large-cnn).
resumeur = pipeline(task="summarization", model="facebook/bart-large-cnn")
text= "Troyes is a beautiful city. Troyes is a commune and the capital of the department of Aube in the Grand Est region of north-central France. It is located on the Seine river about 140 km (87 mi) south-east of Paris. Troyes is situated within the Champagne wine region and is near to the Orient Forest Regional Natural Park.Troyes had a population of 61,996 inhabitants in 2018. It is the center of the Communauté d'agglomération Troyes Champagne Métropole, which was home to 170,145 inhabitants."
summary = resumeur(text, min_length=10, max_length=100)
print("Summary of the text: ", summary[0]["summary_text"])
Summary of the text:  Troyes is a commune and the capital of the department of Aube in the Grand Est region of north-central France. It is located on the Seine river about 140 km (87 mi) south-east of Paris. Troyes had a population of 61,996 inhabitants in 2018.
Since this is a small model, the summary is not perfect, but it still extracts the key information and drops the "secondary" details.
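One practical limitation: BART can only read inputs up to a fixed token limit (around 1024 tokens), so longer documents must be split before summarization. A naive word-based chunking sketch (the `chunk_text` helper is illustrative; words are only a rough proxy for tokens, and you would then call `resumeur` on each chunk):

```python
def chunk_text(text, max_words=400):
    """Naively split a long text into chunks of at most max_words words,
    so each chunk fits under the model's input limit."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = chunk_text("word " * 1000)
print(len(chunks))  # 3 chunks: 400 + 400 + 200 words
# With the pipeline above, you could then summarize each chunk:
#   partial = [resumeur(c, min_length=10, max_length=100)[0]["summary_text"]
#              for c in chunks]
```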
Sentence Embeddings#
An important aspect of NLP that we covered in the course is embeddings. As a reminder: an embedding is the projection of our tokens (words, characters, etc.) into a latent space. Words with similar meanings (such as "dog" and "cat") end up close together in this latent space, while semantically distant words (such as "dog" and "is") end up far apart.
We can use these embeddings to compute the similarity between two sentences. For this, we will use the sentence_transformers library, which extracts embeddings from pretrained models.
We will use the all-MiniLM-L6-v2 model.
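The similarity measure used below is cosine similarity: the dot product of two vectors divided by the product of their norms, giving a score in [-1, 1]. A minimal pure-Python version, just to make explicit what `util.cos_sim` computes:

```python
import math

def cos_sim(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cos_sim([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cos_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```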
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
Let's compare the similarity of several pairs of sentences.
sentences1 = ['The cat is chasing the mouse', 'A man is watching the television', 'The latest movie is awesome']
sentences2 = ['The dog sleeps in the kitchen', 'A boy watches TV', 'The new movie is so great']
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)
cosine_scores = util.cos_sim(embeddings1, embeddings2)
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
                                                 sentences2[i],
                                                 cosine_scores[i][i]))
The cat is chasing the mouse The dog sleeps in the kitchen Score: 0.0601
A man is watching the television A boy watches TV Score: 0.7207
The latest movie is awesome The new movie is so great Score: 0.7786
As you can see, sentences with similar meanings have very similar embeddings, so this model is well suited for extracting embeddings. In an NLP project, a good embedding model is often the first building block.
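A typical use of such an embedding model is semantic search: embed a corpus once, then rank its sentences by cosine similarity to a query (sentence_transformers also provides `util.semantic_search` for this). A sketch of the principle, where the toy 3-d vectors are hypothetical stand-ins for real `model.encode()` outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 3-d vectors standing in for model.encode() outputs.
corpus = {
    "A boy watches TV": [0.9, 0.1, 0.0],
    "The new movie is so great": [0.1, 0.9, 0.1],
    "The dog sleeps in the kitchen": [0.0, 0.2, 0.9],
}
# Pretend embedding of the query "A man is watching the television".
query_vec = [0.8, 0.2, 0.0]

# Rank corpus sentences from most to least similar to the query.
ranked = sorted(corpus, key=lambda s: cosine(query_vec, corpus[s]), reverse=True)
print(ranked[0])  # -> "A boy watches TV"
```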