Hugging Face's Transformers: State-of-the-art Natural Language Processing
Anyone working in, or curious about, data science should know what Natural Language Processing (NLP) is. NLP enables machines to analyse, understand, and interpret human language in much the same way humans do. It is widely used for tasks like language translation, speech recognition, and information extraction. A few familiar examples of NLP in action are search engines, chatbots, and ChatGPT.
Hugging Face
Hugging Face is a software company that develops NLP technologies and is best known for creating a library called Transformers. The Transformers library includes pre-trained models for a wide range of NLP tasks. Hugging Face cuts out much of the tedious work of building models from scratch and curating datasets before you can experiment on new data.
Difference between Transformer and Transformers
Transformer is a deep learning model architecture that replaced the conventional way of processing text data. Before its introduction, NLP tasks were handled by RNNs or CNNs, which struggled to retain memory over long input sequences. The Transformer overcomes this with its key innovation: the self-attention mechanism.
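To make the idea concrete, here is a minimal, illustrative sketch of scaled dot-product self-attention in plain NumPy. This is a toy single-head version (the weight matrices and dimensions are made up for the example, not taken from any real model): each token's output becomes a weighted mix of all the value vectors, which is how the Transformer keeps the whole sequence "in view" at once.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project the input tokens into query, key, and value spaces.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Scaled dot-product scores: how strongly each token attends to every other.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of all value vectors.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # 4 tokens, embedding dimension 8
Wq = rng.normal(size=(8, 8))
Wk = rng.normal(size=(8, 8))
Wv = rng.normal(size=(8, 8))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualised vector per token
```

Real Transformer layers run many such heads in parallel and add learned positional information, but the core computation is exactly this weighted mixing.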
Transformers, on the other hand, is a Python library built by Hugging Face to simplify model creation and use in NLP. By reusing pre-trained models through Transformers, both computation time and carbon footprint are reduced considerably. This is what makes Hugging Face "State-of-the-art Natural Language Processing".
Key Components
Pre-trained models
The core component of Hugging Face is its collection of pre-trained models, which includes BERT, GPT, RoBERTa, DistilBERT, and ALBERT. These models can be fine-tuned on specific tasks with only a small amount of labelled data.
Tokenizers
Tokenizers convert raw text into the numeric inputs that are fed into pre-trained models. Tokenization is one of the unavoidable steps in NLP, and the library's tokenizers segment text quickly and efficiently.
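As a toy illustration of the idea (not the real Transformers tokenizer, which uses learned subword vocabularies), here is a whitespace tokenizer with a hypothetical vocabulary: each word is looked up in a table and replaced by its integer ID, with unknown words mapped to a special `[UNK]` token.

```python
# Hypothetical, illustrative vocabulary mapping words to integer IDs.
vocab = {"[UNK]": 0, "hugging": 1, "face": 2, "makes": 3, "nlp": 4, "easy": 5}

def tokenize(text):
    # Lowercase, split on whitespace, and look each word up in the vocabulary;
    # words not in the vocabulary fall back to the [UNK] token's ID.
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

print(tokenize("Hugging Face makes NLP easy"))  # [1, 2, 3, 4, 5]
print(tokenize("Hugging Face makes NLP fun"))   # [1, 2, 3, 4, 0]
```

The library's real tokenizers work on subword pieces rather than whole words, which lets them represent rare words without a huge vocabulary, but the text-to-IDs principle is the same.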
Conclusion
Hugging Face is a boon to NLP: even a beginner can get started and train on their own data with high performance. Based on what I have learned from various posts, here are a few of its NLP use cases for you to try out.
Sentiment Classification
text = """Hey Jack, I just wanted to flag something with you. Last week when you said that you didn't want to watch the move Twilight with me, even in jest, it kind of got under my skin. I mainly feel like it's hypocritical when you make me watch basketball games with you and our main activity together is watching sports on TV. I just wanted to get it off my chest. From Sophie"""
from transformers import pipeline
import pandas as pd
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
outputs = classifier(text)
pd.DataFrame(outputs)
Question answering
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
question = "What movie did Jack not watch?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])
Summarization
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])