
"Transfer Learning", Trending and Hottest in NLP


Natural language processing (NLP) is a branch of artificial intelligence (AI) that gives computers the ability to understand text and spoken words in much the same way human beings can. NLP resolves ambiguity in language and adds useful numeric structure to the data for many applications, such as speech recognition and text analytics. Transfer learning is one of the trending and hottest topics in NLP, and it offers a novel way to train machine learning models.

In this blog, we discuss advances in transfer learning in the form of pre-trained models for NLP. Transfer learning is a technique in which a deep learning model trained on a large dataset is reused to perform similar tasks on another dataset; we call such a model a pre-trained model. It is usually better to use a pre-trained model as a starting point for solving a problem than to build a model from scratch.

Transfer learning uses knowledge learned from a previous task or domain for a new one. The formal definition states that, given a source domain D_S with a corresponding source task T_S, and a target domain D_T with a target task T_T, the objective of transfer learning is to learn the target conditional probability distribution P(Y_T | X_T) in the target domain using the information gained from the source domain and source task, where D_S ≠ D_T or T_S ≠ T_T.

A wide variety of transformer-based models is available for different NLP tasks, but the most important ones are:

  • BERT

  • RoBERTa

  • GPT-2

The BERT paper summarizes the idea as follows: "BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks."

Fine-Tuning Methods:

Train the entire architecture – train the entire pre-trained model on our dataset and feed the output to a softmax layer.

Train some layers while freezing others – train the model partially: keep the weights of the initial layers of the model frozen and retrain only the higher layers (see the sketch after this list).

Freeze the entire architecture – freeze all the layers of the pre-trained model, attach a few neural network layers of our own, and train only this new part.
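As a rough illustration of the second approach, the sketch below keeps the embeddings and the lower encoder blocks of a Hugging Face TensorFlow BERT model frozen and retrains only the upper blocks and the classification head. This is a minimal sketch: the bert, embeddings and encoder.layer attribute names follow the transformers TF implementation and may differ between library versions.

import tensorflow as tf
from transformers import TFBertForSequenceClassification

# Pre-trained BERT encoder with a freshly initialised classification head.
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the embeddings and the first 8 of the 12 encoder blocks;
# only the top 4 blocks and the classifier remain trainable.
model.bert.embeddings.trainable = False
for block in model.bert.encoder.layer[:8]:
    block.trainable = False

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy("accuracy")],
)
model.summary()  # the trainable parameter count now reflects the frozen layers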

BERT:

BERT (Bidirectional Encoder Representations from Transformers) is a large neural network architecture with a huge number of parameters, ranging from about 110 million to 340 million. Training a BERT model from scratch on a small dataset would therefore lead to overfitting, so it is better to start from a pre-trained BERT model that was trained on a huge dataset. We can then train the model further on a relatively small, task-specific dataset; this process is known as fine-tuning.

L = number of layers (i.e., the number of Transformer encoder blocks in the stack).
H = hidden size (i.e., the size of the q, k and v vectors).
A = number of attention heads.

BERT Base: L=12, H=768, A=12. Total parameters: 110M.

BERT Large: L=24, H=1024, A=16. Total parameters: 340M.
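As a quick sanity check of these numbers, one can load the two public checkpoints with the Hugging Face transformers library and count the weights. A minimal sketch, assuming the standard bert-base-uncased and bert-large-uncased checkpoints:

import tensorflow as tf
from transformers import TFBertModel

for name in ["bert-base-uncased", "bert-large-uncased"]:
    model = TFBertModel.from_pretrained(name)
    # Sum the number of elements in every weight tensor of the loaded model.
    n_params = sum(int(tf.size(w)) for w in model.weights)
    print(f"{name}: ~{n_params / 1e6:.0f}M parameters")
# Expected output: roughly 110M for BERT Base and 340M for BERT Large.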

Generative Pre-trained Transformer 2 (GPT-2):

GPT-2 is an open-source artificial intelligence model created by OpenAI in February 2019. GPT-2 translates text, answers questions, summarizes passages, and generates text output on a level that, while sometimes indistinguishable from that of humans, can become repetitive or nonsensical when generating long passages. It is a general-purpose learner: it was not specifically trained to do any of these tasks, and its ability to perform them is an extension of its general ability to accurately synthesize the next item in an arbitrary sequence. GPT-2 was created as a "direct scale-up" of OpenAI's 2018 GPT model, with a ten-fold increase in both its parameter count and the size of its training dataset.

The GPT architecture implements a deep neural network, specifically a transformer model, which uses attention in place of earlier recurrence- and convolution-based architectures. Attention mechanisms allow the model to selectively focus on the segments of input text it predicts to be most relevant. This design allows for greatly increased parallelization and outperforms previous RNN/CNN/LSTM-based models on many benchmarks.

OpenAI released the complete version of the GPT-2 language model (with 1.5 billion parameters) in November 2019. GPT-2 was followed by the 175-billion-parameter GPT-3, revealed to the public in 2020 (whose source code has never been made available). Access to GPT-3 is provided exclusively through APIs offered by OpenAI and Microsoft.
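A minimal text-generation sketch with the smallest publicly released GPT-2 checkpoint, using the transformers pipeline API (the prompt and generation settings are illustrative assumptions):

from transformers import pipeline

# "gpt2" is the smallest public GPT-2 checkpoint (~124M parameters).
generator = pipeline("text-generation", model="gpt2")

prompt = "Transfer learning in NLP is"
outputs = generator(prompt, max_length=40, num_return_sequences=1)
print(outputs[0]["generated_text"])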

RoBERTa:

RoBERTa (short for “Robustly Optimized BERT Approach”) is a variant of the BERT (Bidirectional Encoder Representations from Transformers) model, which was developed by researchers at Facebook AI. Like BERT, RoBERTa is a transformer-based language model that uses self-attention to process input sequences and generate contextualized representations of words in a sentence.

One key difference between RoBERTa and BERT is that RoBERTa was trained on a much larger dataset with a more effective training procedure. In particular, RoBERTa was trained on a dataset of 160GB of text, which is more than 10 times larger than the dataset used to train BERT. Additionally, RoBERTa uses a dynamic masking technique during training that helps the model learn more robust and generalizable representations of words.

 

Overall, RoBERTa is a powerful and effective language model that has made significant contributions to the field of NLP and has helped to drive progress in a wide range of applications.
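Because RoBERTa is pre-trained with masked language modelling, a quick way to try the public roberta-base checkpoint is the fill-mask pipeline; note that RoBERTa's mask token is <mask>. A minimal sketch (the example sentence is made up):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")
# Print the top predictions for the masked word with their scores.
for prediction in fill_mask("Transfer learning is one of the hottest topics in <mask> language processing."):
    print(prediction["token_str"], round(prediction["score"], 3))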

These models are used for various NLP tasks such as sentiment analysis, named entity recognition, question answering, etc.

Sentiment Analysis:

Sentiment analysis is the process of classifying whether a block of text is positive, negative, or neutral. It is the contextual mining of words that indicates the social sentiment around a brand, and it helps a business determine whether the product it is manufacturing will be in demand in the market. The goal of sentiment analysis is to analyze people's opinions in a way that helps businesses grow. It focuses not only on polarity (positive, negative and neutral) but also on emotions (happy, sad, angry, etc.).
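Before the BERT fine-tuning walkthrough below, the transformers pipeline API already ships a ready-made sentiment classifier; the default checkpoint it downloads is chosen by the library, so treat this only as a quick illustration:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("The movie was surprisingly good and I would watch it again."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]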

 Named Entity Recognition:

Named entity recognition (NER) is one of the most popular data pre-processing tasks. It involves identifying key information in the text and classifying it into a set of predefined categories. An entity is basically the thing that is consistently talked about or referred to in the text. The most important categories are listed below, followed by a short usage sketch.

Some of the most important categories in NER are:

Person

Organization

Place/ location

Other common tasks include classifying the following:

Date/time expressions

Numerical measurements (money, percent, weight, etc.)

E-mail addresses
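A minimal NER sketch with the transformers pipeline API; the default model is chosen by the library, the aggregation_strategy argument assumes a recent transformers version, and the example sentence is made up:

from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
# Print each detected entity with its category and confidence score.
for entity in ner("Sundar Pichai is the CEO of Google, headquartered in Mountain View."):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))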

Sentiment Analysis using BERT:

Sentiment analysis using BERT involves the following steps:

1.   Installing Transformers:

We can install transformers by running the following command: pip install transformers

After the installation is completed, we load the pre-trained BERT tokenizer and sequence classifier, as well as InputExample and InputFeatures. We build our model with the sequence classifier and our tokenizer with BERT's tokenizer.
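A sketch of this loading step, assuming the standard bert-base-uncased checkpoint and a binary (positive/negative) classification head:

from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

# Sequence classifier = BERT encoder + dropout + dense classification layer on top.
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")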

 

 

We now have the main BERT model, a dropout layer to prevent overfitting, and finally a dense layer for the classification task.

We can see the summary of our BERT model by calling model.summary().

The IMDB Reviews dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service IMDB. The dataset is used for binary sentiment classification: whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All of these 50,000 reviews are labelled data that may be used for supervised deep learning. In addition, there are another 50,000 unlabelled reviews that we will not use here. In this case study, we will only use the training dataset.

The two main imports needed are:

import tensorflow as tf

import pandas as pd

Then, we can download the dataset from Stanford's directory with the tf.keras.utils.get_file function, as shown below:
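A sketch of the download step; the URL is the standard Stanford location of the ACL IMDB dataset, and the archive extracts into an aclImdb/ folder with train/ and test/ sub-folders:

import os
import tensorflow as tf

URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download and extract the archive into the current working directory.
dataset = tf.keras.utils.get_file(
    fname="aclImdb_v1.tar.gz",
    origin=URL,
    untar=True,
    cache_dir=".",
    cache_subdir="",
)
dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")
print(dataset_dir)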

 

Train and Test Split: we read the train and test folders of the extracted dataset into two pandas DataFrames, each with a text column and a label column.

Creating Input Sequences

Now we will create two main functions:

convert_data_to_examples: This function will accept our train and test datasets and convert each row into an InputExample object.

convert_examples_to_tf_dataset: This function will tokenize the InputExample objects, create the required input format from the tokenized objects, and finally create an input dataset that we can feed to the model. A sketch of both functions follows.
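A sketch of the two helper functions, assuming the train and test splits are pandas DataFrames with a text column and an integer label column (the data_column and label_column arguments are placeholders):

import tensorflow as tf
from transformers import InputExample, InputFeatures

def convert_data_to_examples(train, test, data_column, label_column):
    # Wrap every row of the train/test DataFrames in an InputExample object.
    train_examples = train.apply(
        lambda x: InputExample(guid=None, text_a=x[data_column], text_b=None, label=x[label_column]),
        axis=1,
    )
    test_examples = test.apply(
        lambda x: InputExample(guid=None, text_a=x[data_column], text_b=None, label=x[label_column]),
        axis=1,
    )
    return train_examples, test_examples

def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    # Tokenize each InputExample and collect the model inputs as InputFeatures.
    features = []
    for e in examples:
        encoded = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length,
            truncation=True,
            padding="max_length",
            return_attention_mask=True,
            return_token_type_ids=True,
        )
        features.append(
            InputFeatures(
                input_ids=encoded["input_ids"],
                attention_mask=encoded["attention_mask"],
                token_type_ids=encoded["token_type_ids"],
                label=e.label,
            )
        )

    def gen():
        # Yield (inputs, label) pairs in the format the TF model expects.
        for f in features:
            yield (
                {"input_ids": f.input_ids,
                 "attention_mask": f.attention_mask,
                 "token_type_ids": f.token_type_ids},
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        ({"input_ids": tf.TensorShape([max_length]),
          "attention_mask": tf.TensorShape([max_length]),
          "token_type_ids": tf.TensorShape([max_length])},
         tf.TensorShape([])),
    )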

Configuring the BERT Model and Fine-Tuning:

Fine-tuning the model for 2 epochs will give us around 95% accuracy, which is great.
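A sketch of the configuration and fine-tuning step. The learning rate, batch size and optimizer settings follow common BERT fine-tuning practice and are assumptions rather than values fixed by this blog; train_df, test_df, DATA_COLUMN and LABEL_COLUMN are the placeholder names from the sketches above:

train_examples, test_examples = convert_data_to_examples(train_df, test_df, DATA_COLUMN, LABEL_COLUMN)

train_data = convert_examples_to_tf_dataset(list(train_examples), tokenizer).shuffle(10000).batch(32)
validation_data = convert_examples_to_tf_dataset(list(test_examples), tokenizer).batch(32)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-8, clipnorm=1.0),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy("accuracy")],
)

# Fine-tune for 2 epochs, validating on the held-out split.
model.fit(train_data, epochs=2, validation_data=validation_data)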

Making Predictions:

We create a list of two reviews; the first one is a positive review, while the second one is clearly negative.
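A sketch of the prediction step; the two review texts below are made-up examples, and the label order assumes class 0 = negative and class 1 = positive:

pred_sentences = [
    "This was an awesome movie. I watched it twice and will watch it again!",
    "One of the worst movies I have ever seen. A complete waste of time.",
]

# Tokenize, run the fine-tuned model, and turn the logits into class labels.
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors="tf")
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)

labels = ["Negative", "Positive"]
for sentence, probs in zip(pred_sentences, tf_predictions.numpy()):
    print(sentence, "->", labels[int(probs.argmax())])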

Conclusion:

In this way, we can perform many NLP tasks using BERT, GPT-2 and RoBERTa. We will explore them further in later blogs. I hope this gave you a clear overview of the topic.
