
Image GPT: Generative Pretraining from Pixels

Created 2 years ago
SundarSA (@SundarSA)

One of the most promising artificial intelligence (AI) technologies today is the Generative Pre-trained Transformer (GPT), a family of neural networks designed for natural language processing (NLP).

A GPT model is a machine learning (ML) model that generates new content on its own after being trained on a large dataset of examples.

The American company OpenAI's line of GPTs is a family of language models pre-trained on large text datasets. These models are built on the Transformer deep learning architecture, which enables them to learn and understand contextual relationships between words in a text.

Image GPT (iGPT), introduced in the paper "Generative Pretraining from Pixels", is an AI development that applies the same generative pre-training principles from NLP to the task of completing images.

OpenAI released the first model in this line, GPT-1, in 2018, followed by GPT-2 in 2019, GPT-3 in 2020, and the updated GPT-3.5 in 2022.

These text models, however, had not been shown to produce strong features for image classification. In contrast, iGPT aims to bridge this gap.

In 2020, OpenAI released Image GPT (iGPT), a Transformer-based model that operates on sequences of pixels instead of sequences of text. OpenAI found that, just as GPT models for text could generate realistic samples of natural language, iGPT could "generate coherent image completions and samples," given an input of initial pixels.
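To operate on pixels, an image must first be flattened into a 1-D sequence of tokens the Transformer can consume. The iGPT paper downsamples images and quantizes colors into a reduced palette of cluster centers; the sketch below shows the idea with a toy two-color palette (the palette values here are illustrative assumptions, not the paper's actual cluster centers):

```python
import numpy as np

def image_to_sequence(image, palette):
    """Flatten an H x W x 3 image into a 1-D token sequence.

    Each pixel is mapped to the index of its nearest palette color,
    then pixels are read out in raster (row-major) order.
    """
    pixels = image.reshape(-1, 3).astype(np.float32)           # (H*W, 3)
    # Squared distance from every pixel to every palette color
    dists = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)                                # (H*W,) token ids

# Toy example: a 2x2 image and a 2-color palette (black and white)
palette = np.array([[0, 0, 0], [255, 255, 255]], dtype=np.float32)
img = np.array([[[0, 0, 0], [250, 250, 250]],
                [[10, 10, 10], [255, 255, 255]]], dtype=np.float32)
seq = image_to_sequence(img, palette)
```

The resulting token sequence plays the same role for iGPT that a sequence of word tokens plays for a text GPT.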

Similar to text-prediction models, the iGPT model can generate new images by predicting the next pixel in the image, conditioned on the pixels generated so far.
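The generation loop above can be sketched in a few lines. This is a minimal numpy sketch, not OpenAI's implementation: the `toy_model` stand-in below is a hypothetical placeholder for a trained Transformer, and the vocabulary size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def toy_model(prefix, vocab_size=4):
    """Stand-in for a trained iGPT: returns logits over the next pixel token.

    This toy version merely favors repeating the last token; a real model
    would run a Transformer over the whole prefix.
    """
    logits = np.zeros(vocab_size)
    if prefix:
        logits[prefix[-1]] = 2.0
    return logits

def complete_image(prefix, total_len, model=toy_model):
    """Autoregressively extend a pixel-token prefix to total_len tokens."""
    seq = list(prefix)
    while len(seq) < total_len:
        probs = softmax(model(seq))
        next_token = rng.choice(len(probs), p=probs)  # sample next pixel
        seq.append(int(next_token))
    return seq

completion = complete_image([1, 2], total_len=8)
```

Given the first pixels of an image as the prefix, repeatedly sampling from the model's predicted distribution yields an image completion.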

iGPT is capable of learning powerful image features. Trained on sample images, it can create realistic images, and its learned features are useful for object recognition and object detection tasks. The model appears to capture 2-D image characteristics such as object appearance and category, and this ability to process visual imagery could be a potentially powerful feature.
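One way the paper measures feature quality is a "linear probe": the generative model is frozen, its internal features are extracted, and only a linear classifier is trained on top. A minimal sketch of that evaluation, using least squares on toy stand-in features (the data here is synthetic, not iGPT activations):

```python
import numpy as np

def linear_probe(features, labels, n_classes):
    """Fit a linear classifier on frozen features via least squares.

    Mirrors the linear-probe idea: the model's features stay fixed and
    only a linear layer is trained on top of them.
    """
    onehot = np.eye(n_classes)[labels]                 # (N, C) targets
    W, *_ = np.linalg.lstsq(features, onehot, rcond=None)
    return W                                           # (D, C) weights

def probe_accuracy(W, features, labels):
    preds = (features @ W).argmax(axis=1)
    return float((preds == labels).mean())

# Toy example with linearly separable stand-in "features"
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 2))
labs = (feats[:, 0] > 0).astype(int)   # class = sign of first feature
W = linear_probe(feats, labs, n_classes=2)
acc = probe_accuracy(W, feats, labs)
```

A high probe accuracy indicates that the frozen features already separate the classes well, which is the sense in which iGPT's features are "powerful" for classification.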

The iGPT Transformer generative model operates in two phases. In the first phase, each position gathers information from its surrounding context in order to build a contextualized image feature. In the second phase, this contextualized feature is used to solve the conditional next-pixel prediction task.
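The two phases can be sketched as masked (causal) self-attention followed by an output projection. This is a heavily simplified illustration with identity attention weights; a trained model would learn query/key/value projections and stack many such layers:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X):
    """Phase 1: each position attends only to itself and earlier positions,
    producing a contextualized feature per pixel. Projections are omitted
    for brevity; a trained model would learn Q, K, V weights."""
    T, D = X.shape
    scores = X @ X.T / np.sqrt(D)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # block future pixels
    scores[mask] = -np.inf
    return softmax(scores, axis=-1) @ X                # (T, D) context features

def next_pixel_logits(context, W_out):
    """Phase 2: map each contextualized feature to logits over pixel values."""
    return context @ W_out                             # (T, vocab)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))        # 5 pixel embeddings, dimension 8
W_out = rng.normal(size=(8, 16))   # hypothetical 16-value pixel vocabulary
logits = next_pixel_logits(causal_self_attention(X), W_out)
```

The causal mask is what makes the second phase a *conditional* next-pixel prediction: position t's logits depend only on pixels 0..t.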

This technology allows the model to complete an image when a human provides only half of it. In the sample figures from the paper, the leftmost column is the input (half) image, the rightmost column is the original image, and the four middle columns are completions generated by the model.

In 2023, OpenAI launched GPT-4, the latest in its series of large language models, capable of generating text when presented with image and text inputs. According to the company, GPT-4 leaps forward on a number of key metrics, including improved creativity and the ability to process visual inputs such as images: it not only recognizes objects in images but also understands their context.

Source: https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf
