Transform and roll out!

Introduction

As we all know, ChatGPT has taken the world by storm, and it can do amazing things. But how does it do them? What architecture or algorithm powers ChatGPT? If you have asked yourself this question and are looking for a simple yet slightly technical answer, you are in the right place. The algorithm on which researchers at OpenAI have based ChatGPT is called the Transformer. In this article we will discuss the architecture of and inspiration for the Transformer, Huggingface's API for easily applying Transformers to a number of NLP tasks, and finally an actual example of using Huggingface's Transformers in Python for the task of sentence pair classification. This article assumes the reader has a basic understanding of deep neural networks and TensorFlow (for understanding the code). So, let's get started.

The Transformer

The Structure

Natural Language Processing (NLP) is a collection of tasks in artificial intelligence that deal with making a machine/computer understand, comprehend, and interpret human languages. For a long period of time, since the early 2000s, a specific class of algorithms called RNNs, such as the LSTM (Sepp Hochreiter et al., 1997) and the GRU (Kyunghyun Cho et al., 2014), dominated the NLP domain. But these methods had a major drawback when it came to understanding longer sequences: because they process tokens one at a time and compress everything seen so far into a single hidden state, information from early tokens tends to fade, and their performance degrades on long sequences. Then, in 2017, Vaswani et al. introduced the Transformer architecture, which relies solely on the attention mechanism to capture the sequential, semantic, and contextual structure of text. Let us take a high-level look at the Transformer architecture.

The Transformer is a sequence-to-sequence model: it takes a sequence as input (like a sentence in French) and produces another sequence as output (like the corresponding English sentence). At a finer level of granularity, the Transformer consists of an encoder that encodes the input sequence and a decoder that generates the output sequence, as shown below.

The encoder and decoder parts are themselves stacks of multiple encoder and decoder units. The number of units is a design choice; in the original paper both stacks contain six units, and the two stacks are typically kept the same depth. This architecture is shown below:

The Working

The tokenized input sequence is first converted to embeddings (with positional encodings added to preserve word order) and passed to the bottom-most encoder. Inside an encoder unit, the operation is as follows. The input is first passed to a self-attention layer. The mechanics of self-attention are a bit involved, but in essence it does this: for each word in the input sequence, it determines which other parts of the sentence to pay attention to. For example, suppose the input sentence is "The bird did not fly in the sky because it was tired". While processing the word "it", we as humans can easily tell that it refers to "the bird" and not "the sky", but this is not so straightforward for a machine. Self-attention is the technique that helps the model decide which parts of the sentence to focus on while building the representation of each word in that same sentence. In a Transformer, self-attention is multi-headed, meaning the same self-attention operation is applied several times in parallel; each attention "head" learns something different, and the outputs of all the heads are aggregated into a final self-attention output. This output is forwarded to a feed-forward neural network layer, which is a simple fully connected network. The output of the feed-forward layer is then forwarded to the next encoder unit, where a similar process is repeated. Note that the bottom-most encoder unit takes the embedded token sequence as input, while subsequent units take the output of the previous encoder unit as input. Finally, the top-most encoder unit produces an output that is fed to the decoder units. A pictorial representation of the encoder can be seen below.
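To make the idea concrete, here is a minimal sketch of single-head scaled dot-product self-attention in TensorFlow. It is an illustration only, not the library's implementation; the dimensions (d_model, seq_len) and the random input are made up for the example.

import tensorflow as tf

def self_attention(x, wq, wk, wv):
    # x has shape (batch, seq_len, d_model); wq, wk, wv are trainable Dense layers.
    q, k, v = wq(x), wk(x), wv(x)                              # queries, keys, values
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(dk)   # token-to-token affinities
    weights = tf.nn.softmax(scores, axis=-1)                   # how much each token attends to every other token
    return tf.matmul(weights, v)                               # weighted sum of the values

d_model, seq_len = 64, 10
x = tf.random.normal((1, seq_len, d_model))                    # stand-in for embedded input tokens
wq, wk, wv = (tf.keras.layers.Dense(d_model) for _ in range(3))
out = self_attention(x, wq, wk, wv)                            # shape (1, seq_len, d_model)

A multi-head version simply runs several such computations in parallel with separate weight matrices and concatenates their outputs.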

The decoder unit architecture is quite similar to that of the encoder unit, with one extra layer inserted between the self-attention layer and the feed-forward neural network: the encoder-decoder attention layer. The output of the encoder stack is fed to the encoder-decoder attention layer of every decoder unit. This gives the decoder access to the encoded input sequence and helps it focus on the appropriate places in the input while generating the output sequence one word at a time. The self-attention layer in the decoder differs in one respect: it only looks at past tokens, because during prediction only the tokens generated so far are available, and training mirrors that constraint with a look-ahead mask. A pictorial representation of the decoder can be seen below.
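The "only look at past tokens" behaviour is implemented with a look-ahead (causal) mask. The sketch below shows the idea for an assumed toy sequence of 5 tokens; it is illustrative, not the library's code.

import tensorflow as tf

seq_len = 5
# Lower-triangular matrix of ones: position i is allowed to attend to positions <= i.
causal_mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
# Adding a large negative number to the disallowed positions before the softmax
# drives their attention weights to (almost) zero.
scores = tf.random.normal((seq_len, seq_len))          # stand-in attention scores
masked_scores = scores + (1.0 - causal_mask) * -1e9
masked_weights = tf.nn.softmax(masked_scores, axis=-1)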

The decoder stack outputs a vector of floats. The word for a given time step is inferred from this vector by the final section of the model, which consists of a linear layer followed by a softmax layer. The linear layer is a simple fully connected neural network that projects the vector produced by the decoder stack into a much larger vector called the logits vector. Suppose our model knows 50,000 unique English words, learned from its training dataset: this is its output vocabulary, and the logits vector is then 50,000 cells wide, where each cell holds the score of one word. The softmax layer turns those scores into probabilities; the cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step. The usual backpropagation algorithm is used to tune the parameters of the model, with a loss function, typically cross-entropy or KL-divergence, comparing predicted and actual outputs during training. We optimize the model over a huge corpus of text until we reach an acceptably low loss.
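As a rough sketch of this final step (with an assumed model dimension and the 50,000-word vocabulary from the text; this is not taken from any particular implementation):

import tensorflow as tf

vocab_size = 50000                                    # assumed output vocabulary size
d_model = 512                                         # assumed decoder output dimension
decoder_output = tf.random.normal((1, d_model))       # stand-in for the decoder stack's vector

logits = tf.keras.layers.Dense(vocab_size)(decoder_output)   # the "logits vector", one score per word
probs = tf.nn.softmax(logits, axis=-1)                        # scores turned into probabilities
predicted_word_id = tf.argmax(probs, axis=-1)                 # index of the word chosen at this time step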

Huggingface's Transformers

We have just taken a bird's-eye view of the Transformer architecture. But understanding it at a micro level and implementing it on your own (and actually getting it to work) is a pretty daunting task. This is where the generous AI community comes to our rescue. You might already be using open-source Python libraries like scikit-learn, TensorFlow, Keras, and PyTorch to implement very complex machine learning and deep learning algorithms with ease. Huggingface is a company (founded by Clément Delangue, Julien Chaumond, and Thomas Wolf) that develops tools for building applications with state-of-the-art machine learning and deep learning algorithms. They have built a library called Transformers dedicated to supporting transformer-based architectures and facilitating the distribution of pre-trained models. This means we can implement the Transformer and its derivative algorithms in Python just as we implement any other machine learning model with scikit-learn! Believe me, this is the coolest thing ever! As we all know, training such a huge model is not an easy task: even if we get the code right, there are challenges around gathering huge amounts of data and around the state-of-the-art hardware needed for the computation. The Transformers library solves these problems by providing Transformer-based models that have already been pre-trained on huge datasets using standard NLP tasks. We can simply download these models, tune their parameters (or not!) for our own task and data, and make predictions (we will see this in the code demo section).

Transformers is designed and maintained by engineers at Huggingface along with more than 400 external contributors. Detailed documentation of the library can be found on their GitHub and website. Using Transformers, you can easily download, cache, and fine-tune transformer-based models, and even integrate them seamlessly into production.

The design of Transformers follows the standard NLP workflow: pre-process the data (in NLP lingo, tokenization), apply and train (or fine-tune) a model, and make predictions. Therefore, every model in this library comes with three building blocks, illustrated with a small example after the list:

1. Tokenizer: To convert raw text data to numeric form.

2. A transformer: To convert the tokenized data into contextualized embeddings. This is the model itself.

3. A head: Uses the contextualized embeddings to perform the specific task at hand, such as text classification or text generation.
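To see all three blocks working together without wiring them up by hand, the library's high-level pipeline API bundles a default tokenizer, model, and head for a given task. The sentiment-analysis task and the example sentence below are chosen purely for illustration:

from transformers import pipeline

# Downloads a default pre-trained tokenizer, transformer, and classification head.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make NLP much easier."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]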

The installation of the library is very similar to other Python libraries you use. Detailed instructions for installation can be found here.

Not just models: the Huggingface ecosystem also provides access to more than 2,000 datasets, and the library exposes layered APIs that let users interact with these models at different levels of abstraction. Enough talk; let's get our hands dirty with some practical implementation of Transformers!

Code Demo

We will demonstrate the use of Transformers on a sentence pair classification task using a model derived from the original Transformer architecture called BERT (Bidirectional Encoder Representations from Transformers). BERT is essentially a trained Transformer encoder stack: it is pre-trained on large text corpora using self-supervised objectives (masked language modeling and next-sentence prediction), and the resulting encoder can then be used to encode any text into contextualized embeddings. The following image gives an idea of the BERT architecture.

As seen above, there are two variants of BERT: BERT Base has 12 encoder units, while BERT Large has 24. The problem we are going to solve is Natural Language Inference (NLI): given a "premise" and a "hypothesis", determine whether the hypothesis logically follows from the premise (entailment), contradicts it (contradiction), or is undetermined (neutral). The following examples demonstrate this task:

| Text (Premise) | Hypothesis | Judgment (Label) |
| --- | --- | --- |
| A soccer game with multiple males playing. | Some men are playing a sport. | entailment |
| A black race car starts up in front of a crowd of people. | A man is driving down a lonely road. | contradiction |
| An older and a younger man smiling. | Two men are smiling and laughing at the cats playing on the floor. | neutral |

You can find more about this problem and dataset here. This is basically a multiclass classification problem. On examining the data, we find that all three classes have roughly the same number of examples in the training, validation, and test sets, so we will use accuracy as the model performance measure. Before training the model we need to tokenize the input sentences. For this, we use the tokenizer that the Transformers library provides for BERT. First, let us understand how the data is prepared for tokenization. We are going to input two sentences, which are passed to the tokenizer in the following format:

[CLS] sentence1 [SEP] sentence2 [SEP]

Here, the special token "[CLS]" is added first for the classification task (its use will become clear later). Then comes sentence1, then the special token "[SEP]" marking the boundary between the two sentences, then sentence2, followed by a closing "[SEP]". Furthermore, not all sentences have the same length, but the model requires all inputs to be of the same length. So we select a parameter max_length that tells the tokenizer the maximum length to use for all input sequences: sequences longer than max_length are truncated, and sequences shorter than max_length are padded with a special token such as "[PAD]". Once this preprocessing is done, the tokens are converted to their corresponding numerical ids. The following code shows how to load the BERT tokenizer from Transformers.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')

Here we load the tokenizer for the bert-large-uncased model, meaning the model we are going to use is BERT Large and it is not sensitive to the case of words in the input text. The tokenizer accepts pairs of sentences and produces three kinds of outputs for each pair; in the model, we define one Keras Input layer for each of them, as shown below.

import tensorflow as tf

max_length = 128  # example value: maximum sequence length used for all inputs

# Encoded token ids from the BERT tokenizer.
input_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="input_ids")
# Attention masks indicate to the model which tokens should be attended to.
attention_masks = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="attention_masks")
# Token type ids are binary masks identifying the two sentences in each pair.
token_type_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="token_type_ids")

The first output is input_ids, which maps each token to its unique id in the vocabulary. The second is attention_mask, a binary vector that tells the model which positions are padding tokens so that they are ignored during training. The third is token_type_ids, also a binary vector, indicating whether each token belongs to the first or the second sentence.
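As a quick sanity check, here is what a single premise/hypothesis pair looks like after tokenization, using the tokenizer loaded above; the max_length value of 128 is an illustrative choice, not a prescribed one:

# Tokenize one sentence pair from the NLI examples shown earlier.
encoded = tokenizer(
    "A soccer game with multiple males playing.",    # sentence 1 (premise)
    "Some men are playing a sport.",                 # sentence 2 (hypothesis)
    max_length=128,
    padding="max_length",
    truncation=True,
    return_tensors="tf",
)
# encoded["input_ids"], encoded["attention_mask"], and encoded["token_type_ids"]
# each have shape (1, 128) and correspond to the three Input layers defined above.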

Once we have created the tokenizer and the input layers, we need to pass the tokenized input to the BERT model. We do not want to tune the parameters of the loaded model, so we freeze BERT during training. We can load and freeze BERT from Transformers as follows:

# Loading pretrained BERT model.
from transformers import TFBertModel
bert_model = TFBertModel.from_pretrained("bert-large-uncased")
# Freeze the BERT model to reuse the pretrained features without modifying them.
bert_model.trainable = False
# Output of BERT
bert_output  = bert_model([input_ids, attention_masks, token_type_ids])
last_hidden_state = bert_output['last_hidden_state']
pooler_output = bert_output['pooler_output']

The BERT model returns two outputs. last_hidden_state is the hidden state of the last encoder layer for every token, and pooler_output is the output corresponding to the [CLS] token that was added during tokenization. For classification tasks such as this one, pooler_output alone is enough to extract a contextualized representation on top of which a classifier can be built; for other use cases, such as machine translation or text generation, the full hidden state of the last layer is required. That does not mean the hidden state cannot be used for classification, and in this case we build a deep neural network classifier on top of the last hidden state, as follows:

# Add trainable layers on top of frozen layers to adapt the pretrained features on the new data.
bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(last_hidden_state)
# Applying hybrid pooling approach to bi_lstm sequence output.
avg_pool = tf.keras.layers.GlobalAveragePooling1D()(bi_lstm)
max_pool = tf.keras.layers.GlobalMaxPooling1D()(bi_lstm)
concat = tf.keras.layers.concatenate([avg_pool, max_pool])
dropout = tf.keras.layers.Dropout(0.3)(concat)
output = tf.keras.layers.Dense(3, activation="softmax")(dropout)
model = tf.keras.models.Model(inputs=[input_ids, attention_masks, token_type_ids], outputs=output)
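For comparison, a head built directly on pooler_output (the [CLS] representation mentioned above) could be as simple as the sketch below. This is not the configuration used in this demo; it is shown only to illustrate the alternative:

# Alternative (not used here): classify straight from the pooled [CLS] output.
cls_dropout = tf.keras.layers.Dropout(0.3)(pooler_output)
cls_scores = tf.keras.layers.Dense(3, activation="softmax")(cls_dropout)
cls_model = tf.keras.models.Model(
    inputs=[input_ids, attention_masks, token_type_ids], outputs=cls_scores
)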

So now our model architecture is ready. We need to compile and train it, which we can do with the following code:

# train_data and val_data are the tokenized datasets prepared in the full case study,
# yielding ([input_ids, attention_masks, token_type_ids], one-hot labels) batches.
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="categorical_crossentropy", metrics=["acc"])
history = model.fit(train_data, validation_data=val_data, epochs=epochs)

Here we use the Adam optimizer and categorical cross-entropy loss.

With this configuration, in 5 epochs we reach roughly 85 percent accuracy on both the training and validation data. This is a good result, and with further tweaks and experiments it could be increased.
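Once the model is trained, predicting the label for a new sentence pair could look like the sketch below. The example sentences, the max_length value, and the assumed label order are illustrative; the actual label order depends on how the labels were encoded when preparing the training data:

import numpy as np

# Tokenize a new premise/hypothesis pair with the same settings used for training.
sample = tokenizer(
    "A man is playing a guitar on stage.",
    "Someone is performing music.",
    max_length=128, padding="max_length", truncation=True, return_tensors="tf",
)
probs = model.predict([sample["input_ids"], sample["attention_mask"], sample["token_type_ids"]])
labels = ["contradiction", "entailment", "neutral"]   # assumed label order
print(labels[int(np.argmax(probs, axis=-1))])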

The complete code for this case study can be found here.

Summary

We saw that the Transformer and its derivatives form a very powerful class of algorithms for NLP tasks and can be used for transfer learning. Huggingface's Transformers library provides an easy-to-use API for these models. We also walked through a simple code example in which we solved a sentence pair classification task using Transformers, TensorFlow, and Python.
