
Minerva - Your AI Math Friend

By SiddharthBathrachalam (@SiddharthBathrachalam)

Introduction

Minerva is a transformer-based natural language processing model developed by Google to solve quantitative reasoning problems, such as those in mathematics and science, given only the problem statement and no underlying formulation. Minerva can currently solve university-level problems without first breaking them down into mathematical expressions to feed into other tools.

Minerva is built on the same principles and architecture as other NLP models such as Google's BERT and ChatGPT, but with one key difference: it is trained on documents containing sophisticated mathematical expressions, which enables the model to preserve the underlying mathematical notation.

Besides solving problems, Minerva can also clearly lay out each step it takes along the way. At its core, Minerva is still an NLP model whose purpose is text prediction: choosing the most probable next word.

Ultimately, it is not just a tool but a toolbox capable of solving a plethora of problems, and an invaluable aid for students who are learning and for researchers who want to delegate routine work.

To understand Minerva, we need to understand the basics of transformers and their components.

How transformers work: key processes and components

Diagram of a Transformer

A transformer contains a Transformer encoder and a Transformer decoder, both of which are fundamental neural-network components.

Both the encoder and the decoder run the input through a forward pass of several layers before the result is passed on to the final layers relevant to the task at hand.

The forward pass contains embedding layers and self-attention layers. The self-attention layer can capture long-range dependencies, that is, it can attend to words far back in the sequence, which is an advantage over existing recurrent neural networks. It also captures the semantic meaning of words by mapping words that appear in similar contexts to similar vectors.

The encoder also adds positional encodings to the input embedding matrix.

The corpus of text goes through preprocessing, which includes removing punctuation, digits, and stop words, as well as lowercasing, and any other steps required by the task at hand.
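
As a rough illustration, here is a minimal preprocessing sketch in Python; the exact steps and the tiny stop-word list are assumptions, since real pipelines vary with the task:

    import re

    # A tiny stop-word list used purely for illustration.
    STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to"}

    def preprocess(text):
        """Lowercase, strip punctuation and digits, and drop stop words."""
        text = text.lower()
        text = re.sub(r"[^a-z\s]", " ", text)   # remove punctuation and digits
        return [t for t in text.split() if t not in STOP_WORDS]

    print(preprocess("The derivative of x^2 is 2x."))
    # ['derivative', 'x', 'x']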

The text is then passed through a word2vec-style embedding layer. More frequent words are assigned smaller integer identifiers and less frequent words larger ones, which helps avoid numerical instability and overly large gradient updates.

The unique identifiers are then mapped to embedding vectors through the creation of an embedding matrix. Each embedding vector is adjusted to reduce the cost function through backpropagation and gradient descent.
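
A minimal sketch of that idea, assuming a simple frequency-ranked vocabulary and a randomly initialized embedding matrix; the actual values of these vectors would be learned by backpropagation:

    import numpy as np
    from collections import Counter

    corpus = ["the integral of x", "the derivative of x", "solve for x"]
    tokens = [w for sentence in corpus for w in sentence.split()]

    # More frequent words get smaller integer identifiers.
    vocab = {w: i for i, (w, _) in enumerate(Counter(tokens).most_common())}

    embedding_dim = 8
    # Embedding matrix: one row (dense vector) per vocabulary entry,
    # initialized randomly and later tuned by backpropagation.
    embedding_matrix = np.random.randn(len(vocab), embedding_dim) * 0.01

    def embed(sentence):
        ids = [vocab[w] for w in sentence.split()]
        return embedding_matrix[ids]           # shape: (sequence_length, embedding_dim)

    print(embed("the derivative of x").shape)  # (4, 8)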

The embedding vectors are also referred to as dense vectors because, unlike sparse one-hot encodings, most of their components are non-zero.

At the end of the embedding layer, the embedding matrix is passed on to the other layers. One of the essential ones is the self-attention layer.

Visualization of word to vector representation.

Self-attention layers use three matrices, Query (Q), Key (K), and Value (V), to produce an attention output. Each of them is produced by multiplying the embedding matrix with its own weight matrix, and these weights are initialized randomly.

To produce the attention weight matrix, we take the dot product of Q and K and divide it by the square root of the key dimension. This scaling keeps the dot products from growing too large, which would otherwise push the softmax into regions with very small gradients.

Then we apply the softmax function to the attention scores so that each row sums to 1; in other words, we normalize the scores into a probability distribution over the input sequence. The result is then matrix-multiplied by the Value matrix and returned as the output of the self-attention layer.
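
A minimal NumPy sketch of this computation, for a single attention head with toy dimensions; the projection weights here are random stand-ins for matrices a real model would learn:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    seq_len, d_model, d_k = 5, 16, 16
    X = np.random.randn(seq_len, d_model)   # embedded (and position-encoded) input

    # Randomly initialized projection weights for Query, Key, and Value.
    W_q = np.random.randn(d_model, d_k)
    W_k = np.random.randn(d_model, d_k)
    W_v = np.random.randn(d_model, d_k)

    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    scores = Q @ K.T / np.sqrt(d_k)         # scaled dot product
    weights = softmax(scores, axis=-1)      # each row sums to 1
    output = weights @ V                    # self-attention output
    print(output.shape)                     # (5, 16)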

This output is then passed through further components such as residual connections and feed-forward neural networks, which help the model predict the output label.

If the predicted label differs from the target, the weights of the self-attention layer and the additional layers are updated by backpropagation to find the optimal weights.

Recent research shows that scaling up the model improves its predictive performance.

Each edge represents the attention weight between two words; higher intensity indicates a stronger relationship.

Minerva uses a different word-to-vector algorithm, FastText, which is similar to word2vec except that FastText breaks words into character n-grams (substrings) rather than treating each word as a single unit, as illustrated in the short sketch after the list below. Why?

  • It decreases computational cost.

  • In quantitative reasoning problems, many words are redundant; the words that carry numerical information are the most valuable.
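
A toy sketch of the subword idea, assuming character n-grams of length 3 and FastText's usual word-boundary markers:

    def char_ngrams(word, n=3):
        """Break a word into character n-grams, FastText-style."""
        padded = f"<{word}>"                 # boundary markers
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(char_ngrams("integral"))
    # ['<in', 'int', 'nte', 'teg', 'egr', 'gra', 'ral', 'al>']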

Minerva's training dataset

The Minerva team collected documents containing extensive mathematical notation to train the model and help it preserve that notation, all without changing the internal architecture of the transformer-based NLP model.

The model was trained on 175 GB of mathematical text. The documents were collected primarily from the following sources.

  • ArXiv - arxiv.org is a free distribution service and open-access archive for more than 2 million scholarly articles in fields such as physics, mathematics, and computer science.

  • Webpages with mathematical texts.

On these web pages, mathematical expressions are written in TeX notation, so the model learned to work with TeX notation.

Minerva builds on a pre-trained model, PaLM (Pathways Language Model), a transformer-based NLP model released by Google and similar to what I explained in the previous section on how transformers work. Google's main idea was to increase the efficiency with which the model computes its output rather than to change the model itself, so certain aspects of the transformer were modified. Some of these modifications are:

  • Self-attention and feed-forward layers (which transform the embeddings into a useful representation) are computed in parallel.

  • The SwiGLU activation function: a modification of the standard Gated Linear Unit (GLU) activation used in transformers, designed to be more computationally efficient while maintaining the same level of accuracy as GLU (a small sketch follows this list).

  • Rotary positional embeddings: instead of adding fixed or learned position vectors, rotary embeddings encode position by rotating the query and key vectors, so positional information interacts directly with the attention computation.
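
As a rough illustration of the SwiGLU idea mentioned above, here is a small sketch of a gated feed-forward layer in which one branch is passed through the Swish activation; the dimensions and random weights are arbitrary stand-ins:

    import numpy as np

    def swish(x):
        return x / (1.0 + np.exp(-x))        # Swish / SiLU activation

    def swiglu(x, W, V):
        """SwiGLU: the Swish-activated branch gates the linear branch."""
        return swish(x @ W) * (x @ V)

    d_model, d_ff = 16, 64
    x = np.random.randn(3, d_model)
    W = np.random.randn(d_model, d_ff)
    V = np.random.randn(d_model, d_ff)
    print(swiglu(x, W, V).shape)             # (3, 64)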

PaLM contains 540 billion parameters and was trained for about a month on 1024 TPU chips. A TPU (Tensor Processing Unit) is a specialized processor developed by Google and optimized for TensorFlow machine learning workloads; it can perform an astonishing 180 trillion floating-point operations per second. To put this in perspective, imagine a problem that requires 10^18 mathematical operations. If you performed them on a calculator at one operation per second, the task would take you more than 31 billion years!

But with the help of TPU chips, like the ones used to train PaLM, this same problem can be solved in a matter of hours or even minutes.
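
The arithmetic behind that comparison, as a quick sketch assuming one operation per second on a calculator versus the 180 teraflops figure quoted above for a single TPU:

    SECONDS_PER_YEAR = 60 * 60 * 24 * 365
    operations = 10 ** 18

    calculator_years = operations / 1 / SECONDS_PER_YEAR    # 1 op/second
    tpu_hours = operations / 180e12 / 3600                  # 180 trillion ops/second

    print(f"Calculator: ~{calculator_years:.1e} years")     # ~3.2e+10 years
    print(f"TPU:        ~{tpu_hours:.1f} hours")            # ~1.5 hours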

Inference-time techniques

Inference time is when a trained model is used to make predictions on unseen data. Remember, the main goal of an NLP model is to generalize to new, unseen data.

Few-shot prompting

Few-shot prompting is a technique in which the model is shown a few worked examples before the problem it has to solve. The examples need not involve complex math, but they show the model the expected format, such as a direct equation, so it can answer the new problem in a single step. This allows the model to generalize to new problems from only a handful of examples, making it far cheaper than traditional approaches that require extensive training on large datasets.

Few-shot prompting example shown to model
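
A hypothetical sketch of how such a few-shot prompt might be assembled; the example problems here are made up for illustration:

    examples = [
        ("What is 15% of 200?", "30"),
        ("Solve for x: 2x + 6 = 10.", "x = 2"),
    ]

    question = "What is 25% of 80?"

    prompt = ""
    for q, a in examples:                    # a few worked examples first
        prompt += f"Q: {q}\nA: {a}\n\n"
    prompt += f"Q: {question}\nA:"           # the model completes the answer

    print(prompt)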

Chain-of-thought prompting

Chain-of-thought prompting is a technique in which the model is given a prompt whose examples consist of a chain of related statements, i.e., a multi-line solution to a complex problem. The model learns from these examples and tries to produce a similar step-by-step solution for unseen problems. This technique can be used to solve complex problems that require multiple steps or involve multiple variables: by breaking the problem down into smaller sub-problems, the model learns to solve the whole problem more reliably.

Chain of thought prompting
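
Compared with the plain few-shot prompt above, a chain-of-thought prompt spells out the intermediate reasoning inside each example. A hypothetical sketch, again with a made-up example problem:

    cot_example = (
        "Q: A shirt costs $40 and is discounted by 25%. What is the final price?\n"
        "A: The discount is 0.25 * 40 = 10 dollars. "
        "The final price is 40 - 10 = 30 dollars. The answer is 30.\n\n"
    )

    question = "A book costs $60 and is discounted by 10%. What is the final price?"
    prompt = cot_example + f"Q: {question}\nA:"   # the model is nudged to reason step by step

    print(prompt)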

Majority voting

Minerva assigns probabilities to the different possible outputs. While answering a question, it stochastically samples many candidate solutions rather than generating just one. The final answer is then chosen by majority vote over the sampled answers, so a single wrong sample does not derail the result and the most frequently reached answer is returned.
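
A minimal sketch of majority voting over sampled answers; the sampling function here is a stand-in for the model's stochastic decoding:

    import random
    from collections import Counter

    def sample_answer(question):
        """Stand-in for one stochastic sample from the model."""
        return random.choice(["30", "30", "30", "28", "32"])   # mostly right, sometimes wrong

    def majority_vote(question, k=16):
        answers = [sample_answer(question) for _ in range(k)]
        answer, count = Counter(answers).most_common(1)[0]
        return answer, count / k

    print(majority_vote("What is 25% of 120?"))   # e.g. ('30', 0.6875)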

Efficiency after applying different inference techniques.

Efficiency of model before and after applying techniques.

  • Minerva 540B: the model's results without majority voting but with the other inference techniques applied.

  • Minerva 540B maj1@k: the model's results after applying majority voting along with the other inference techniques.

  • Performance improves by about 10% when majority voting is applied.

Examples of questions processed by Minerva for which it made the 'right' prediction:

Examples of questions processed by Minerva for which it made the 'wrong' prediction:

Statistics on errors.

An analysis of false positives reveals that 8% of the answers the model claims as correct are in fact false positives.

Conclusion

Abstract theoretical note.

Minerva is a model for solving quantitative reasoning problems. It was trained on 175 GB of mathematical text and is able to preserve mathematical notation unaltered. It builds on the pre-trained PaLM model, which has 540B parameters, and it can efficiently solve a problem when provided with an example.

Final opinion note.

Minerva's potential for solving problems at the university level is quite intriguing. Solving problems without any underlying presumption or hand-crafted formulation is remarkable, and its potential to accelerate problem-solving and simplify complex mathematical and scientific problems is quite astounding. As with any technological advancement, it will be interesting to see how it develops and what impact it will have on mathematics and science.

References

https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html

https://arxiv.org/pdf/2203.11171.pdf

https://analyticsdrift.com/google-developed-minerva-an-ai-that-can-answer-math-questions/
