Transformers Fine-tuning with TandA
Transfer and Adapt: An effective technique for fine-tuning pre-trained Transformer models for natural language tasks.
In recent years, virtual assistants such as Google Home, Siri, and Alexa have become a central asset for technology companies, which has increased the interest of AI researchers in studying and developing conversational agents. This has renewed the research interest in Question Answering (QA) and, in particular, in two main tasks:
(i) Answer sentence selection (AS2), which, given a question and a set of answer sentence candidates, consists in selecting sentences (e.g., retrieved by a search engine) correctly answering the question.
(ii) Machine reading (MR) (Chen et al. 2017) or reading comprehension, which, given a question and a reference text, consists in finding a text span answering it.
Even though the latter is gaining more and more popularity, AS2 is more relevant to a production scenario, since a combination of a search engine and an AS2 model already implements an initial QA system.
Recent AS2 models are based on Deep Neural Networks (DNNs), which learn distributed representations of the input data and are trained to apply a series of non-linear transformations to the input question and answer, represented as compositions of word or character embeddings. When modeling the input texts, DNN architectures learn patterns relevant to the answer sentence using intra-pair similarities as well as cross-pair, question-to-question and answer-to-answer similarities.
Here we will discuss the use of Transformer-based models for AS2 and provide effective solutions to tackle the data-scarceness problem for AS2 and the instability of the fine-tuning step. The key points of this study are:
• How to improve the stability of Transformer models by adding an intermediate fine-tuning step that specializes them to the target task (AS2), i.e., this step transfers a pre-trained language model to a model for the target task.
• How the transferred model can be effectively adapted to the target domain with a subsequent fine-tuning step, even when the available target data is small.
• The Transfer and Adapt (TandA) approach makes fine-tuning:
(i) easier and more stable, without the need for cherry-picking parameters;
(ii) robust to noise, i.e., noisy data from the target domain can be used to train an accurate model.
• We built ASNQ, a dataset for AS2, by transforming the recently released Natural Questions (NQ) corpus (Kwiatkowski et al. 2019) from the MR to the AS2 task. This was essential, as our transfer step requires a large and accurate dataset.
• Finally, the generality of our approach and our empirical investigation suggest that the TandA findings also apply to other NLP tasks, especially textual inference, although empirical analysis is essential to confirm these claims.
TandA: Transfer and Adapt
TandA proposes to train Transformer models for AS2 by applying a two-step fine-tuning, called Transfer AND Adapt. The first step transfers the language model of the Transformer to the AS2 task; the second fine-tuning step adapts the obtained model to the specific target domain, i.e., specific types of questions and answers.
1. AS2 task and model definition: AS2 can be defined as follows: given a question q and a set of answer sentence candidates S = {s_1, ..., s_n}, select a sentence s_k that correctly answers q. We can model the task as a function r : Q × P(S) → S, defined as r(q, S) = s_k, where k = argmax_i p(q, s_i) and p(q, s_i) is the probability of correctness of s_i.


We estimate p(q, s_i) using neural networks, in particular Transformer models, as explained below.

Figure: Transformer architecture with a linear classifier on top for fine-tuning on AS2. The input to the model is [CLS] Tok1_1, ..., Tok1_N [SEP] Tok2_1, ..., Tok2_M [EOS].

2. Transformers for AS2: For AS2, the training data consists of question and sentence pairs with positive or negative labels according to whether the sentence correctly answers the question. This fine-tuning step is rather critical, as the task learned during the pre-training stage is very different from AS2.
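As an illustration, here is a minimal sketch of r(q, S) = argmax_i p(q, s_i) using a Hugging Face-style sequence-classification model. The checkpoint name, the helper name select_answer, and the use of class index 1 as the "correct answer" class are assumptions for illustration only; note also that the standard BERT tokenizer terminates the pair with [SEP] rather than [EOS].

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: TandA would start from a model already transferred/adapted to AS2.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.eval()

def select_answer(question, candidates):
    """Implement r(q, S): return the candidate with the highest probability of correctness."""
    # Each (question, candidate) pair is encoded as [CLS] question [SEP] candidate [SEP].
    batch = tokenizer(
        [question] * len(candidates),
        candidates,
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**batch).logits
    # Assumption: class index 1 means "the sentence correctly answers the question".
    probs = torch.softmax(logits, dim=-1)[:, 1]
    best = int(torch.argmax(probs))
    return candidates[best], probs[best].item()
```

Calling select_answer(q, S) on a question and its retrieved candidates returns the top-scoring sentence together with its estimated probability of correctness, i.e., exactly the selection function r defined above.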
3. TandA: The fine-tuning process is divided into two steps: transfer to the task and then adapt to the target domain (TANDA). This is advantageous over a single fine-tuning step, since the latter would require either (i) a large dataset for the target domain, which is undesirable because domain-specific data is harder and more expensive to collect than general data; or (ii) merging the general and domain-specific data in a single training step, which is not optimal since the model needs to be specialized only to the target data.
The first step is a standard fine-tuning using a large-scale, general-purpose AS2 dataset; it transfers the Transformer language model to the AS2 task. The resulting model will not perform optimally on the target-domain data due to the specificity of the latter, so we apply a second fine-tuning step to adapt the classifier to the target AS2 domain.
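A minimal sketch of the two fine-tuning steps using the Hugging Face Trainer is shown below. The dataset variables (asnq_train, target_train), output paths, and hyperparameter values are placeholders, not the settings used in the paper; the datasets are assumed to be pre-tokenized question/sentence pairs with binary labels.

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def fine_tune(model, train_dataset, output_dir, epochs):
    """One standard fine-tuning pass over a question/sentence pair dataset."""
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=32,
    )
    Trainer(model=model, args=args, train_dataset=train_dataset).train()
    return model

# Step 1 -- Transfer: fine-tune the pre-trained LM on a large, general AS2 dataset (e.g., ASNQ).
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model = fine_tune(model, asnq_train, "out/transfer", epochs=1)

# Step 2 -- Adapt: continue fine-tuning the *transferred* model on the small target-domain data.
model = fine_tune(model, target_train, "out/adapt", epochs=3)
```

The crucial detail is that the adapt step starts from the checkpoint produced by the transfer step rather than from the original pre-trained weights; this is what the article claims makes the second fine-tuning stable even with little target data.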
ASNQ: Answer Sentence Natural Questions
To validate the benefits of TANDA, we needed an accurate, general, and large AS2 corpus. Since existing AS2 datasets are small, we built a new AS2 dataset called Answer Sentence Natural Questions (ASNQ), derived from the recently released Google Natural Questions (NQ) dataset. NQ is a large-scale dataset intended for the MR task, where each question is associated with a Wikipedia page.
For each question, a long paragraph (long-answer) that contains the answer is extracted from the reference page. Each long-answer may contain phrases annotated as short-answer. Since a long-answer can contain multiple sentences, NQ is not directly applicable to AS2. In ASNQ, a sentence is labeled as positive if it occurs in the long-answer and contains an annotated short-answer; all other sentences are negatives, which fall into three types (a labeling sketch follows the list):
1. Sentences from the document that are in the long-answer but do not contain the annotated short-answers. It is possible that these sentences might contain the short-answer.
2. Sentences from the document that are not in the long-answer but contain the short-answer string, that is, such occurrence is purely accidental.
3. Sentences from the document that are neither in the long-answer nor contain the short-answer.
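The labeling logic can be summarized with the small sketch below. The class numbering follows the list above (with the positive class last), and the argument names are illustrative rather than the actual NQ schema.

```python
def asnq_label(sentence, in_long_answer, short_answers):
    """Assign an ASNQ class to one document sentence (classes 1-3 are the negatives listed above)."""
    contains_short = any(sa in sentence for sa in short_answers)
    if in_long_answer and not contains_short:
        return 1  # in the long-answer but without the annotated short-answer
    if not in_long_answer and contains_short:
        return 2  # contains the short-answer string only by accident
    if not in_long_answer and not contains_short:
        return 3  # neither in the long-answer nor containing the short-answer
    return 4      # positive: in the long-answer and contains the short-answer
```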
Properties of TandA:
Stability of TandA: TandA can robustly transfer language models to the target task, and this produces more stable models. By stability, we mean a low variance of the model accuracy (i) between two consecutive training epochs, and (ii) between pairs of models that have close accuracy on the development set. For example, BERT fine-tuned in a single step (BERT FT) has a high variance in accuracy across epochs, leading to some extreme cases of on-off behavior.
Robustness to Noise in WikiQA and TREC-QA: Better model stability also implies robustness to noise. We empirically studied this conjecture by artificially injecting noise into the training sets of WikiQA and TREC-QA: we randomly sample question-answer pairs from the training set and switch their label values.
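A minimal sketch of this noise-injection procedure, assuming training pairs are stored as (question, sentence, label) tuples with binary labels:

```python
import random

def inject_label_noise(pairs, noise_rate, seed=42):
    """Flip the labels of a random fraction (noise_rate) of (question, sentence, label) pairs."""
    rng = random.Random(seed)
    noisy = list(pairs)
    for idx in rng.sample(range(len(noisy)), int(noise_rate * len(noisy))):
        question, sentence, label = noisy[idx]
        noisy[idx] = (question, sentence, 1 - label)  # switch positive <-> negative
    return noisy
```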
The first step of TandA produces an intermediate model with three main features: (i) it can be more effectively used for fine-tuning on the target NLP application, being more stable and easier to adapt to other tasks; (ii) it is robust to noise, which might affect the target-domain data; and (iii) it enables modularity and efficiency, i.e., once a Transformer model is adapted to the target general task, e.g., AS2, only the adapt step is needed for each target domain. This is an important advantage in terms of scalability: the datasets of the (possibly many) different target domains are typically much smaller than the dataset used for the transfer step (ASNQ), so the bulk of the computation is factorized into the single initial transfer step.
