TRAINING LANGUAGE MODELS TO FOLLOW INSTRUCTIONS WITH HUMAN FEEDBACK: MAJOR IMPACT VIA CHATGPT AND THE POPULARIZATION OF RLHF
To start with training language models, we'll first understand what a language model is, why it is used, and where it is used.
In artificial intelligence (AI), natural language processing (NLP) is widely used to understand and imitate human language abilities by reading and interpreting text. Language models play an important role in NLP tasks ranging from
1) text generation
2) text classification
3) question answering
To do this, they draw on
1) Natural Language Generation
2) Natural Language Classification
3) Natural Language Understanding
HOW ARE THEY USED?
As we now know, language models can read text and predict the next word, like the suggestions in our mobile phone text messages and in search engines. They are also capable of speech recognition, optical character recognition, and handwriting recognition. Let's dive deeper into the content, starting with a small example below.
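To make next-word prediction concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the public gpt2 checkpoint are available; the prompt string is just an illustration.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load a small pretrained language model and its tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "I will call you when I"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)
# The scores at the last position rank every candidate next token.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))

The model assigns a score to every token in its vocabulary; picking the highest-scoring one gives the kind of suggestion you see above a phone keyboard.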
By now we must all have heard of ChatGPT, a popular chatbot launched by OpenAI. These bots understand our questions and predict the best answers using NLP.
A dataset was first collected from OpenAI API Playground users, and a pretrained model was fine-tuned on it (a transfer-learning approach) for conversational applications using supervised and reinforcement learning.
Fine-tuning uses human preferences as a reward signal to tune the model. A reward model was created to predict which model output the labelers, the human annotators who write demonstrations and rank model outputs, would prefer. The reward model was then used as the reward function to fine-tune the supervised model, over several iterations of the Proximal Policy Optimization (PPO) algorithm.
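As an illustrative sketch, not OpenAI's actual code, the reward model can be trained on pairwise comparisons with the loss described in the InstructGPT paper: minimize -log(sigmoid(r_chosen - r_rejected)). The function name and numbers below are hypothetical.

import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen, r_rejected):
    # Small when the reward model scores the preferred answer higher
    # than the rejected one; large when the ordering is wrong.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical scalar scores for three (preferred, rejected) answer pairs.
r_chosen = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.4, 0.9, -0.1])
print(reward_ranking_loss(r_chosen, r_rejected))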
Language models should avoid harmful content and remain truthful. ChatGPT achieves better results than the earlier GPT-3 and InstructGPT models, but it still suffers from biases picked up during training, for example when reviewers favor longer answers. This is common in large language models.
REINFORCEMENT LEARNING FROM HUMAN FEEDBACK (RLHF)

RLHF runs reinforcement learning, using PPO, against a reward model. The AI trainers rank multiple model-generated responses by relevance, coherence, and quality; this is how the feedback is collected. That feedback is then incorporated in the reinforcement learning phase, and a sketch of the ranking step follows below.
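Here is a hedged sketch of how a trainer's ranking of several responses could be turned into the pairwise (preferred, rejected) comparisons used to train the reward model; the function name and example strings are made up for illustration.

from itertools import combinations

def ranking_to_pairs(responses, ranking):
    # ranking[i] is the rank of responses[i]; lower rank means more preferred.
    # Every pair (a, b) where a outranks b becomes one training comparison.
    ordered = [r for _, r in sorted(zip(ranking, responses))]
    return [(a, b) for a, b in combinations(ordered, 2)]

pairs = ranking_to_pairs(["answer A", "answer B", "answer C"], ranking=[2, 1, 3])
print(pairs)  # [('answer B', 'answer A'), ('answer B', 'answer C'), ('answer A', 'answer C')]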
RLHF proceeds in three stages:
1) pretraining a language model (LM)
2) gathering data and training a reward model
3) fine-tuning the LM with reinforcement learning (sketched below)
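In the third stage, the reward the policy is trained on is typically the reward-model score minus a penalty for drifting too far from the supervised baseline. Below is a minimal sketch of that shaped reward, assuming per-token log-probabilities are available; the beta value and all numbers are hypothetical.

import torch

def shaped_reward(rm_score, logprobs_policy, logprobs_sft, beta=0.02):
    # Approximate per-sequence KL penalty between the RL policy and the
    # supervised (SFT) baseline, subtracted from the reward-model score.
    kl = (logprobs_policy - logprobs_sft).sum(dim=-1)
    return rm_score - beta * kl

# Hypothetical reward-model score and per-token log-probs for one response.
rm_score = torch.tensor([0.7])
logprobs_policy = torch.tensor([[-1.1, -0.4, -2.0]])
logprobs_sft = torch.tensor([[-1.3, -0.9, -1.8]])
print(shaped_reward(rm_score, logprobs_policy, logprobs_sft))  # tensor([0.6900])

PPO then updates the policy to maximize this shaped reward while keeping each update step small.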
OpenAI released its first RLHF code, written in TensorFlow, in 2019.
Today, there are already a few active PyTorch repositories for RLHF that grew out of that original release.