Large Language Models (LLMs), such as ChatGPT, Gemini, Claude, and so on, have been around for a while now, and I believe all of us have already used at least one of them. As this article is written, ChatGPT already implements the fourth generation of the GPT-based model, named GPT-4. But do you know what GPT actually is, and what the underlying neural network architecture looks like? In this article we are going to talk about GPT models, especially GPT-1, GPT-2 and GPT-3. I will also demonstrate how to code them from scratch with PyTorch so that you can get a better understanding of the structure of these models.
A Brief History of GPT
Before we get into GPT, we need to understand the original Transformer architecture first. Generally speaking, a Transformer consists of two main components: the Encoder and the Decoder. The former is responsible for understanding the input sequence, while the latter is used for generating another sequence based on that input. For example, in a question answering task, the decoder produces an answer to the input sequence, while in a machine translation task it generates the translation of the input.
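To make the encoder-decoder split more concrete, here is a minimal sketch using PyTorch's built-in nn.Transformer module. The layer counts and dimensions below are illustrative assumptions, not the configuration we will use for the GPT models later in this article.

```python
import torch
import torch.nn as nn

# A minimal encoder-decoder Transformer built from PyTorch's nn.Transformer.
# The hyperparameters here are illustrative defaults only.
model = nn.Transformer(
    d_model=512,           # embedding dimension
    nhead=8,               # number of attention heads
    num_encoder_layers=6,  # encoder stack: understands the input sequence
    num_decoder_layers=6,  # decoder stack: generates the output sequence
    batch_first=True,
)

src = torch.randn(1, 10, 512)  # dummy input sequence (batch, seq_len, d_model)
tgt = torch.randn(1, 12, 512)  # dummy target sequence generated so far
out = model(src, tgt)          # decoder output conditioned on the encoded input
print(out.shape)               # torch.Size([1, 12, 512])
```

Notice that the decoder output has the same length as the target sequence, since the decoder predicts one representation per target token while attending to the encoder's view of the input.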