Pre-training and Finetuning
The pretraining + finetuning paradigm, which first became popular in Computer Vision, refers to the process of training a model in two stages: pretraining and then finetuning.
In the pretraining stage, the model is trained on a large-scale dataset that is related to the downstream task at hand. In Computer Vision, this is usually done by training an image classification model on ImageNet, most commonly on its ILSVRC subset, which contains 1K categories with roughly 1K images each.
Although 1M images doesn’t sound “large-scale” by today’s standards, ILSVRC was truly remarkable a decade ago and was indeed much larger than what we would have for specific CV tasks.
The CV community has also explored many ways to do away with supervised pre-training, for example the self-supervised methods MoCo (by Kaiming He et al.) and SimCLR (by Ting Chen et al.).
After pre-training, the model is assumed to have learnt some general knowledge about the task, which can accelerate the learning process on the downstream task.
Then comes finetuning: in this stage, the model is trained on a specific downstream task with high-quality labeled data, often at a much smaller scale than ImageNet. During this stage, the model picks up domain-specific knowledge related to the task at hand, which helps improve its performance.
For many CV tasks, this pretraining + finetuning paradigm demonstrates better performance than directly training the same model from scratch on the limited task-specific data, especially when the model is complex and hence more likely to overfit on limited training data. Combined with modern CNN architectures such as ResNet, this led to a performance leap on many CV benchmarks, some of which even reached near-human performance.
Therefore, a natural question arises: how can we replicate such success in NLP?
Earlier Explorations of Pretraining Prior to GPT-1
In fact, the NLP community never stopped trying in this direction, and some of the efforts date back to as early as 2013 and 2014, such as Word2Vec (2013) and GloVe (Global Vectors for Word Representation, 2014).
Word2Vec
The Word2Vec paper “Distributed Representations of Words and Phrases and their Compositionality” was honored with the “Test of Time” award at NeurIPS 2023. It’s really a must-read for anyone not familiar with this work.
Today it feels natural to represent words or tokens as embedding vectors, but this wasn’t the case before Word2Vec. At that time, words were commonly represented by one-hot encoding or by count-based statistics such as TF-IDF (term frequency-inverse document frequency) or co-occurrence matrices.
For example, in one-hot encoding, given a vocabulary of size N, each word in the vocabulary is assigned an index i and is then represented as a sparse vector of length N in which only the i-th element is set to 1.
Take the following case as an example: in this toy vocabulary we have only four words: the (index 0), cat (index 1), sat (index 2) and on (index 3), and therefore each word is represented as a sparse vector of length 4 (the -> 1000, cat -> 0100, sat -> 0010, on -> 0001).
The problem with this simple method is that, as the vocabulary grows larger and larger in real-world cases, the one-hot vectors become extremely long. Also, neural networks are not designed to handle such sparse vectors efficiently.
Furthermore, the semantic relationships between related words are lost in this process because the index for each word is assigned arbitrarily, meaning that similar words have no connection in this representation.
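The one-hot scheme described above, and its lack of any similarity structure, can be illustrated with a minimal sketch. The helper names one_hot and dot are hypothetical and introduced only for this illustration; the vocabulary is the toy one from the text.

```python
# Toy vocabulary from the text: word -> index.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3}


def one_hot(word, vocab):
    """Return a list of length len(vocab) with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


encodings = {word: one_hot(word, vocab) for word in vocab}
for word, vec in encodings.items():
    print(word, "->", "".join(map(str, vec)))

# Any two distinct words are orthogonal, so the encoding carries
# no notion of similarity between words.
print(dot(encodings["cat"], encodings["sat"]))  # 0
print(dot(encodings["cat"], encodings["cat"]))  # 1
```

Because every pair of distinct one-hot vectors has a dot product of zero, "cat" is exactly as far from "sat" as from any other word, which is the limitation that dense embeddings address.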
With that, you can better understand the significance of Word2Vec’s contribution: by representing words as continuous vectors in a high-dimensional space where words with similar contexts have similar vectors, it completely revolutionized the field of NLP.
With Word2Vec, related words are mapped closer together in the embedding space. For example, in the figure below the authors show the PCA projection of word embeddings for some countries and their corresponding capitals, with their relationships automatically captured by Word2Vec without any supervised information provided.
Word2Vec is learnt in an unsupervised manner, and once the embeddings are learnt, they can be easily used in downstream tasks. This is one of the earliest efforts exploring semi-supervised learning in NLP.
More specifically, it can leverage either the CBOW (Continuous Bag of Words) or the Skip-Gram architecture to learn word embeddings.
In CBOW, the model tries to predict the target word based on its surrounding words. For example, given the sentence “The cat sat on the mat,” CBOW would try to predict the target word “sat” given the context words “The,” “cat,” “on,” “the.” This architecture is effective when the goal is to predict a single word from the context.
Skip-Gram, by contrast, works the opposite way: it uses a target word to predict its surrounding context words. Taking the same sentence as an example, this time the target word “sat” becomes the input, and the model would try to predict context words like “The,” “cat,” “on,” and “the.” Skip-Gram is particularly useful for capturing rare words by leveraging the contexts in which they appear.
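To make the difference between the two architectures concrete, here is a minimal sketch of how training examples could be derived from the sentence above. It only builds the (input, label) pairs and does not train a model; the function names and the window size of 2 are assumptions chosen so that the output matches the example in the text.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding position i itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """One (context_words, target_word) example per position."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """One (target_word, context_word) pair per context word of each position."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]


tokens = "The cat sat on the mat".split()

# CBOW: the context is the input, the centre word is the label.
print(cbow_examples(tokens)[2])
# (['The', 'cat', 'on', 'the'], 'sat')

# Skip-Gram: the centre word is the input, each context word is a label.
print([pair for pair in skipgram_examples(tokens) if pair[0] == "sat"])
# [('sat', 'The'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```

Note that Skip-Gram yields several training pairs per position where CBOW yields one, which is one intuition for why it gives rare words more learning signal.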
GloVe
Another work along this line of research is GloVe, which is also an unsupervised method for generating word embeddings. Unlike Word2Vec, which focuses on a local context, GloVe is designed to capture global statistical information by constructing a word co-occurrence matrix and factorizing it to obtain dense word vectors.
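The first step, building the co-occurrence statistics, can be sketched as follows. This is a simplified illustration rather than GloVe's actual pipeline: it uses plain unweighted counts within a fixed window, the two-sentence corpus and the function name cooccurrence_counts are made up for the example, and the factorization step that produces the dense vectors is omitted.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, context_word) pair occurs within `window`."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Hypothetical two-sentence corpus, for illustration only.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = cooccurrence_counts(corpus)

print(counts[("sat", "on")])   # 2
print(counts[("cat", "dog")])  # 0
```

The counts are aggregated over the whole corpus before any vector is learnt, which is what "global statistical information" refers to, in contrast to Word2Vec's one-window-at-a-time updates.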
Note that both Word2Vec and GloVe mainly transfer word-level information, which is often not sufficient for handling complex NLP tasks, as we need to capture high-level semantics in the embeddings. This led to more recent explorations of unsupervised pre-training of NLP models.
Unsupervised Pre-Training
Before GPT, many works had explored unsupervised pre-training with different objectives, such as language modeling, machine translation and discourse coherence. However, each method only outperformed the others on certain downstream tasks, and it remained unclear which optimization objectives were most effective or most useful for transfer.
You may have noticed that language modeling had already been explored as a training objective in some of the earlier works, so why didn’t these methods succeed like GPT?
The answer is Transformer models.
When the earlier works were proposed, there were no Transformer models yet, so researchers could only rely on RNN models such as LSTM for pre-training.
This brings us to the next topic: the Transformer architecture used in GPT.
Decoder-only Transformer
In GPT, the Transformer architecture is a modified version of the original Transformer called the decoder-only Transformer. This is a simplified Transformer architecture proposed by Google in 2018, and it contains only the decoder.
Below is a comparison of the encoder-decoder architecture introduced in the original Transformer vs. the decoder-only Transformer architecture used in GPT. Basically, the decoder-only architecture removes the encoder part entirely, along with the cross-attention, leading to a simpler architecture.
So what’s the benefit of making the Transformer decoder-only?
Compared with encoder-only models such as BERT, decoder-only models often perform better at generating coherent and contextually relevant text, making them ideal for text generation tasks.
Encoder-only models like BERT, on the other hand, often perform better in tasks that require understanding the input data, such as text classification, sentiment analysis, and named entity recognition.
There is another type of model that employs both the encoder and the decoder of the Transformer, such as T5 and BART, where the encoder processes the input and the decoder generates the output based on the encoded representation. While such a design makes them more versatile in handling a wide range of tasks, they are often more computationally intensive than encoder-only or decoder-only models.
In a nutshell, while both are built on Transformer models and try to leverage the pre-training + finetuning scheme, GPT and BERT chose very different ways to achieve that similar goal. More specifically, GPT conducts pre-training in an auto-regressive manner, while BERT follows an auto-encoding approach.
Auto-regressive vs. Auto-encoding Language Models
An easy way to understand their difference is to compare their training objectives.
In auto-regressive language models, the training objective is often to predict the next token in the sequence, based on the previous tokens. Due to the dependency on previous tokens, this usually leads to a unidirectional (typically left-to-right) approach, as shown on the left of Figure 6.
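The next-token objective can be sketched as follows. The first function lists the (prefix, next token) pairs that a sequence provides; the second builds the lower-triangular "causal" mask that Transformer decoders commonly use to enforce the left-to-right constraint. Both function names are hypothetical, and the mask is mentioned here as a common implementation detail rather than something taken from the text above.

```python
def next_token_examples(tokens):
    """(prefix, next_token) pairs: each token is predicted from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def causal_mask(n):
    """n x n mask where row i has 1s for positions 0..i (visible) and 0s after."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "cat", "sat", "on"]

for prefix, target in next_token_examples(tokens):
    print(prefix, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ['the', 'cat', 'sat'] -> on

for row in causal_mask(len(tokens)):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row i of the mask marks which positions token i may attend to, so no position ever sees the tokens to its right.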
In contrast, auto-encoding language models are often trained with objectives such as Masked Language Modeling or reconstructing the entire input from corrupted versions. This is often done in a bidirectional manner where the model can leverage all the tokens around the masked one, in other words, the tokens on both its left and right sides. This is illustrated on the right of Figure 6.
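A masked-language-modeling example can be constructed along these lines. This is a simplified sketch: the masked position is picked by hand here, whereas real training typically samples positions at random, and the function name mask_tokens is an assumption made for this illustration.

```python
MASK = "[MASK]"  # placeholder token; the spelling follows BERT's convention


def mask_tokens(tokens, positions):
    """Replace the tokens at `positions` with MASK and return (inputs, targets).

    targets maps each masked position to the original token the model must recover.
    """
    positions = set(positions)
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}
    return inputs, targets


tokens = ["the", "cat", "sat", "on", "the", "mat"]
inputs, targets = mask_tokens(tokens, positions=[2])

print(inputs)   # ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
print(targets)  # {2: 'sat'}
```

To recover "sat", the model sees both "the cat" on the left and "on the mat" on the right, which is what makes the objective bidirectional.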
Simply put, an auto-regressive LM is more suitable for text generation, but its unidirectional modeling approach may limit its capability in understanding the full context. An auto-encoding LM, on the other hand, can do a better job at context understanding, but is not designed for generative tasks.