As we’ve been anticipating, models have become more and more capable of understanding different types of inputs. We’ve seen image transformer models (see my blogs on fine-tuning Flux and the research behind MM1), and now we’re starting to see video models hit the scene.
In December of 2024, Meta unveiled their new Apollo family of models. Alongside the models, they also published a paper detailing their research and work on Large Multimodal Models (LMMs). The paper is full of great details, so rather than try to cover it all, I’ll focus on the four major design choices they highlighted when building their model.
Let’s dive in!
Embedding
Let’s first lay out some quick concepts that are important for understanding what’s going on here. Every Transformer relies on embeddings for its input. However, user input is usually first converted from something the user understands (text, videos) into tokens, and then into embeddings. To convert to embeddings, we use an embedding model. For multi-modal inputs, we typically use a different encoder for each input type.
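To make that concrete, here’s a minimal PyTorch sketch, my own illustration rather than Apollo’s actual architecture: a text encoder and a video encoder each map their modality into a shared embedding dimension, and the results are concatenated into one sequence the Transformer can consume. The vocabulary size, patch dimension, and embedding dimension are arbitrary assumptions.

```python
# Illustrative sketch only: each modality gets its own encoder, and both outputs
# are projected into the same embedding dimension so the Transformer can treat
# them as one token sequence. Sizes are made-up assumptions, not Apollo's.
import torch
import torch.nn as nn

EMBED_DIM = 512

class TextEncoder(nn.Module):
    """Maps token ids to embeddings via a learned lookup table."""
    def __init__(self, vocab_size: int = 32_000, embed_dim: int = EMBED_DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len) -> (batch, seq_len, embed_dim)
        return self.embed(token_ids)

class VideoEncoder(nn.Module):
    """Projects per-frame patch features into the shared embedding space."""
    def __init__(self, patch_dim: int = 768, embed_dim: int = EMBED_DIM):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, patch_dim) -> (batch, num_patches, embed_dim)
        return self.proj(patch_features)

if __name__ == "__main__":
    text_enc, video_enc = TextEncoder(), VideoEncoder()
    text_emb = text_enc(torch.randint(0, 32_000, (1, 16)))  # 16 text tokens
    video_emb = video_enc(torch.randn(1, 64, 768))           # 64 video patches
    # Concatenate along the sequence dimension to form one multimodal input.
    multimodal_input = torch.cat([text_emb, video_emb], dim=1)
    print(multimodal_input.shape)  # torch.Size([1, 80, 512])
```

The key point is the last step: once every modality lives in the same embedding space, the Transformer downstream doesn’t need to know which tokens came from text and which came from video.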