Hugging Face's Transformers: State-of-the-art Natural Language Processing
Anyone working in, or curious about, data science should know what Natural Language Processing (NLP) is. NLP enables machines to analyse, understand, and interpret human language in much the same way humans do. It is widely used for tasks like language translation, speech recognition, and information extraction. A few familiar examples of NLP in action are search engines, chatbots, and ChatGPT.
Hugging Face
Hugging Face is a software company that develops NLP technologies and is best known for creating a library called Transformers. The Transformers library includes pre-trained models for a wide range of NLP tasks. Hugging Face cuts out much of the tedious work of building models from scratch and curating datasets before you can experiment on new data.
Difference between Transformer and Transformers
Transformer is a deep learning model architecture that replaced the conventional way of processing text data. Before its introduction, NLP tasks were handled by RNNs or CNNs, which struggled to retain memory over long input sequences. The Transformer overcomes this with its key innovation: the self-attention mechanism.
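To make the idea concrete, here is a minimal, illustrative sketch of scaled dot-product self-attention in plain NumPy. This is a toy single-head version (the weight matrices and dimensions are made up for the example, not taken from any real model): each token's output becomes a weighted mix of all the value vectors, which is how the Transformer keeps the whole sequence "in view" at once.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project the input tokens into query, key, and value spaces.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Scaled dot-product scores: how strongly each token attends to every other.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of all value vectors.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # 4 tokens, embedding dimension 8
Wq = rng.normal(size=(8, 8))
Wk = rng.normal(size=(8, 8))
Wv = rng.normal(size=(8, 8))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualised vector per token
```

Real Transformer layers run many such heads in parallel and add learned positional information, but the core computation is exactly this weighted mixing.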
Transformers, on the other hand, is a Python library built by Hugging Face to simplify model creation and use in NLP. By reusing pre-trained models through Transformers, both computation time and carbon footprint are reduced considerably. This is what makes Hugging Face "State-of-the-art Natural Language Processing".
Key Components
Pre-trained models
The core component of Hugging Face is its collection of pre-trained models, which includes BERT, GPT, RoBERTa, DistilBERT, and ALBERT. These models can be fine-tuned on specific tasks with only a small amount of labelled data.
Tokenizers
Tokenizers convert raw text into the numeric inputs that are fed into pre-trained models. Tokenization is one of the unavoidable steps in NLP, and the library's tokenizers segment text quickly and efficiently.
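As a toy illustration of the idea (not the real Transformers tokenizer, which uses learned subword vocabularies), here is a whitespace tokenizer with a hypothetical vocabulary: each word is looked up in a table and replaced by its integer ID, with unknown words mapped to a special `[UNK]` token.

```python
# Hypothetical, illustrative vocabulary mapping words to integer IDs.
vocab = {"[UNK]": 0, "hugging": 1, "face": 2, "makes": 3, "nlp": 4, "easy": 5}

def tokenize(text):
    # Lowercase, split on whitespace, and look each word up in the vocabulary;
    # words not in the vocabulary fall back to the [UNK] token's ID.
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

print(tokenize("Hugging Face makes NLP easy"))  # [1, 2, 3, 4, 5]
print(tokenize("Hugging Face makes NLP fun"))   # [1, 2, 3, 4, 0]
```

The library's real tokenizers work on subword pieces rather than whole words, which lets them represent rare words without a huge vocabulary, but the text-to-IDs principle is the same.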
Conclusion
Hugging Face is a boon to NLP: even a beginner can get started and train on their own data with high performance. Based on what I have learned from various posts, here are a few of its NLP use cases for you to try out.
Sentiment Classification
text = """Hey Jack, I just wanted to flag something with you. Last week when you said that you didn't want to watch the move Twilight with me, even in jest, it kind of got under my skin. I mainly feel like it's hypocritical when you make me watch basketball games with you and our main activity together is watching sports on TV. I just wanted to get it off my chest. From Sophie"""
from transformers import pipeline
import pandas as pd
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
outputs = classifier(text)
pd.DataFrame(outputs)
Question answering
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
question = "What movie did Jack not watch?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])
Summarization
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])