    The CNN That Challenges ViT



The invention of ViT (Vision Transformer) led many to assume that CNNs are obsolete. But is that really true?

It is widely believed that the impressive performance of ViT comes primarily from its transformer-based architecture. However, researchers from Meta argued that this is not entirely true. If we take a closer look at the architectural design, ViT introduced radical changes not only to the structure of the network but also to the model configurations. Meta's researchers suspected that perhaps it is not the structure that makes ViT superior, but its configuration. To prove this, they applied the ViT configuration parameters to the ResNet architecture from 2015.

— And they found their hypothesis to be true.


In this article I am going to talk about ConvNeXt, which was first proposed in the paper titled "A ConvNet for the 2020s" written by Liu et al. [1] back in 2022. Here I will also try to implement it from scratch with PyTorch so that you can get a better understanding of the changes made to the original ResNet. In fact, the official ConvNeXt implementation is available in their GitHub repository [2], but I find it too complex to explain line by line. Thus, I decided to write it myself so that I can explain it in my own style, which I believe is more beginner-friendly. Disclaimer on: my implementation may not perfectly replicate the original one, but I still think it is a useful resource to learn from. So, after reading my article I recommend you check the original code, especially if you are planning to use ConvNeXt for your project.


    The Hyperparameter Tuning

What the authors primarily did in the research was hyperparameter tuning on the ResNet model. Generally speaking, there were five aspects they experimented with: macro design, ResNeXt, inverted bottleneck, large kernel, and micro design. We can see the experimental results on these aspects in the following figure.

Figure 1. The hyperparameter tuning results obtained on the original ResNet architecture [1].

There were two ResNet variants used in their experiments: ResNet-50 and ResNet-200 (shown in purple and gray, respectively). Let's now focus on the results obtained from tuning the ResNet-50 architecture. Based on the figure, we can see that this model initially obtained 78.8% accuracy on the ImageNet dataset. They tuned this model until it eventually reached 82.0%, surpassing the state-of-the-art Swin-T architecture, which only achieved 81.3% (the orange bar). This tuned version of the ResNet model is the one called ConvNeXt in the paper. Their experiments on ResNet-200 confirm that the previous results are valid, since its tuned version, i.e., ConvNeXt-B, also successfully surpasses the performance of Swin-B (the larger variant of Swin-T).

    Macro Design

The first change made to the original ResNet was the macro design. If we take a closer look at Figure 2 below, we can see that a ResNet model mainly consists of four main stages, namely conv2_x, conv3_x, conv4_x and conv5_x, each of which comprises several bottleneck blocks. Speaking more specifically about ResNet-50, the bottleneck blocks in these stages are repeated 3, 4, 6 and 3 times, respectively. Later on, I will refer to these numbers as the stage ratio.

Figure 2. The ResNet architecture variants [3].

The authors of the ConvNeXt paper changed this stage ratio to match the Swin-T architecture, i.e., 1:1:3:1. Well, it is actually 2:2:6:2 if you look at the architectural details in the original Swin Transformer paper in Figure 3, but that is essentially just a multiple of the same ratio. Applying this configuration gave the authors a 0.6% improvement (from 78.8% to 79.4%). Thus, they decided to use the 1:1:3:1 stage ratio for the subsequent experiments, as the sketch after Figure 3 spells out.

Figure 3. The Swin Transformer architecture variants [4].
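
    To make the comparison concrete, here is the change written out as plain Python lists (the block counts are taken from Figures 2 and 3):

    # Blocks per stage (conv2_x ... conv5_x), from Figures 2 and 3.
    resnet50_blocks   = [3, 4, 6, 3]   # roughly 3:4:6:3
    convnext_t_blocks = [3, 3, 9, 3]   # 1:1:3:1, a multiple of Swin-T's 2:2:6:2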

Still related to the macro design, changes were also made to the first convolution layer of ResNet. If you go back to Figure 2 (the conv1 row), you will see that it originally uses a 7×7 kernel with stride 2, which reduces the image size from 224×224 to 112×112. Inspired by Swin Transformer, the authors also wanted to treat the input image as non-overlapping patches. Thus, they changed the kernel size to 4×4 and the stride to 4. This idea was actually adopted from the original ViT, which uses a 16×16 kernel with stride 16. One thing you need to know about ConvNeXt is that the resulting patches are treated as a standard image rather than a sequence. With this change, the accuracy slightly improved from 79.4% to 79.5%. Hence, the authors used this configuration for the first convolution layer in the subsequent experiments.
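
    The patchify stem is easy to verify in isolation. Below is a minimal sketch (the variable names are mine) showing how a 4×4 kernel with stride 4 turns a 224×224 image into a 56×56 grid of patch embeddings:

    import torch
    import torch.nn as nn

    # Patchify stem: non-overlapping 4x4 patches, like ViT's 16x16 but smaller.
    stem = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=4, stride=4)
    print(stem(torch.rand(1, 3, 224, 224)).size())  # torch.Size([1, 96, 56, 56])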

    ResNeXt-ification

With the macro design done, the next thing the authors did was adopt the ResNeXt architecture, which was first proposed in the paper titled "Aggregated Residual Transformations for Deep Neural Networks" [5]. The idea of ResNeXt is essentially to apply group convolution to the bottleneck blocks of the ResNet architecture. In case you are not yet familiar with group convolution, it works by separating the input channels into groups and performing the convolution operation within each group independently, allowing faster computation as the number of groups increases. ConvNeXt adopts this idea by setting the number of groups equal to the number of kernels, i.e., one filter per channel. This approach, commonly known as depthwise convolution, gives the network the lowest possible computational complexity. However, it is important to note that increasing the number of convolution groups like this leads to a reduction in accuracy, since it lowers the model's capacity to learn. Thus, the drop in accuracy to 78.3% was expected.
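
    To see why depthwise convolution is so much cheaper, compare the parameter counts of a standard and a depthwise 3×3 convolution in this quick sketch (96 channels is just an illustrative number):

    import torch.nn as nn

    # Standard vs. depthwise 3x3 convolution on 96 channels.
    standard  = nn.Conv2d(96, 96, kernel_size=3, padding=1)             # groups=1
    depthwise = nn.Conv2d(96, 96, kernel_size=3, padding=1, groups=96)  # one filter per channel

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(standard), count(depthwise))  # 83040 vs 960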

That was not the end of the ResNeXt-ification phase, though. In fact, the ResNeXt paper gives us guidance that if we increase the number of groups, we also need to expand the width of the network, i.e., add more channels. Thus, the ConvNeXt authors readjusted the number of kernels based on the ones used in Swin-T. You can see in Figures 2 and 3 that ResNet originally uses 64, 128, 256 and 512 kernels in each stage, whereas Swin-T uses 96, 192, 384 and 768. This increase in model width allows the network to push the accuracy significantly, to 80.5%.

    Inverted Bottleneck

Still referring to Figure 2, it is also visible that ResNet-50, ResNet-101 and ResNet-152 share the exact same bottleneck structure. For instance, the block at stage conv5_x consists of three convolution layers with 512, 512 and 2048 kernels, where the input of the first convolution is either 1024 (coming from the conv4_x stage) or 2048 (from the previous block within the conv5_x stage itself). These ResNet versions essentially follow a wide → narrow → wide structure, which is the reason this block is called a bottleneck. Instead of using a structure like this, ConvNeXt employs the inverted version of the bottleneck, following the narrow → wide → narrow structure adopted from the feed-forward layer of the Transformer architecture. In Figure 4 below, (a) is the bottleneck block used in ResNet and (b) is the so-called inverted bottleneck block. Using this structure, the model accuracy increased from 80.5% to 80.6%.

Figure 4. The bottleneck block used in ResNeXt (a), the inverted bottleneck block (b), and the ConvNeXt block (c) [1].
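
    As a rough sketch of the difference between (a) and (b), with channel counts following Figure 4 (this is only an illustration, not the final block we will build later):

    import torch.nn as nn

    # (a) Bottleneck: wide -> narrow -> wide (384 -> 96 -> 384).
    bottleneck = nn.Sequential(
        nn.Conv2d(384, 96, kernel_size=1),
        nn.Conv2d(96, 96, kernel_size=3, padding=1, groups=96),    # depthwise
        nn.Conv2d(96, 384, kernel_size=1),
    )

    # (b) Inverted bottleneck: narrow -> wide -> narrow (96 -> 384 -> 96).
    inverted = nn.Sequential(
        nn.Conv2d(96, 384, kernel_size=1),
        nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=384),  # depthwise
        nn.Conv2d(384, 96, kernel_size=1),
    )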

Kernel Size

The next exploration was done on the kernel size inside the inverted bottleneck block. Before experimenting with different kernel sizes, a further modification was made to the structure of the block: the authors swapped the order of the first and second layers so that the depthwise convolution is now placed at the beginning of the block, as shown in Figure 4 (c). Thanks to this modification, the block is now referred to as the ConvNeXt block, since it no longer completely resembles the original inverted bottleneck structure. This idea was actually adopted from the Transformer, where the MSA (Multihead Self-Attention) layer is placed before the MLP layers. In the case of ConvNeXt, the depthwise convolution acts as the replacement for MSA, while the linear layers in the Transformer MLP are replaced by pointwise convolutions. Simply moving the depthwise convolution up like this decreased the accuracy from 80.6% to 79.9%. However, this is acceptable because the current set of experiments was still ongoing.

Experiments on the kernel size were then applied only to the depthwise convolution layer, leaving the remaining pointwise convolutions unchanged. Here the authors tried different kernel sizes and found that 7×7 worked best, successfully recovering the accuracy back to 80.6% with lower computational complexity (4.6 vs 4.2 GFLOPS). Interestingly, this kernel size matches the window dimensions in the Swin Transformer architecture, which correspond to the patch size used in the self-attention mechanism. You can actually see this in Figure 3, where the window sizes in all Swin Transformer variants are 7×7.
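
    Note that a 7×7 kernel with stride 1 needs a padding of 3 to preserve the spatial dimension, which we can verify with a quick sketch:

    import torch
    import torch.nn as nn

    # 7x7 depthwise convolution with padding 3 keeps the 56x56 spatial size.
    dw7 = nn.Conv2d(96, 96, kernel_size=7, stride=1, padding=3, groups=96)
    print(dw7(torch.rand(1, 96, 56, 56)).size())  # torch.Size([1, 96, 56, 56])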

    Micro Design

The final aspect tuned in the paper is the so-called micro design, which mainly refers to the intricate details of the network. Like the previous aspects, the parameters used here are also mostly adopted from Transformers. The authors first replaced ReLU with GELU. Although the accuracy remained the same (80.6%) with this substitution, they decided to go with this activation function for the following experiments. The accuracy finally increased after the number of activation functions was reduced. Instead of applying GELU after each convolution layer in the ConvNeXt block, this activation function was placed only between the two pointwise convolutions. This change allowed the network to boost the accuracy up to 81.3%, at which point the score was already on par with the Swin-T architecture while still having lower GFLOPS (4.2 vs 4.5).
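
    The resulting pattern is a single GELU sandwiched between the two pointwise convolutions, much like the MLP inside a Transformer. A minimal sketch of the idea (dimensions are illustrative):

    import torch.nn as nn

    # Only one activation per block, placed between the pointwise convolutions.
    mlp_like = nn.Sequential(
        nn.Conv2d(96, 384, kernel_size=1),   # pointwise expansion
        nn.GELU(),                           # the single activation
        nn.Conv2d(384, 96, kernel_size=1),   # pointwise reduction, no activation after
    )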

Next, it is common practice to use the Conv-BN-ReLU structure in CNN-based architectures, which is exactly what ResNet implements as well. Instead of following this convention, the authors decided to use only a single batch normalization layer, placed before the first pointwise convolution layer. This change improved the accuracy to 81.4%, surpassing the accuracy of Swin-T by a little bit. Despite this achievement, parameter tuning still continued by replacing batch norm with layer norm, which again raised the accuracy by 0.1% to 81.5%. All the changes related to the micro design resulted in the architecture shown in Figure 5 (the rightmost image). Here you can see how a ConvNeXt block differs from the Swin Transformer and ResNet blocks.

Figure 5. What the Swin-T, ResNet-50 and ConvNeXt-T blocks look like at the initial stage [1].

The last thing the authors did related to the micro design was applying separate downsampling layers. In the original ResNet architecture, the spatial dimension of a tensor is halved as we move from one stage to the next. You can see in Figure 2 that ResNet initially accepts an input of size 224×224, which then shrinks to 112×112, 56×56, 28×28, 14×14 and 7×7 at stages conv1, conv2_x, conv3_x, conv4_x and conv5_x, respectively. Specifically in conv2_x and the subsequent stages, this spatial reduction is done by setting the stride parameter of the pointwise convolution to 2. Instead of doing so, ConvNeXt performs downsampling by placing another convolution layer right before the element-wise summation operation within the block. The kernel size and stride of this layer are both set to 2, simulating a non-overlapping sliding window. In fact, it is mentioned in the paper that using this separate downsampling layer caused the accuracy to degrade instead. However, the authors managed to solve this issue by applying additional layer normalization layers at several parts of the network, i.e., before each downsampling layer, after the stem stage, and after the global average pooling layer (right before the final output layer). With this tuning, the authors successfully boosted the accuracy to 82.0%, which is much higher than Swin-T (81.3%) while still having the exact same GFLOPS (4.5).
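
    As a standalone sketch of this idea, a stage transition boils down to a layer normalization followed by a 2×2 convolution with stride 2 (the channel counts below are those of the res2 → res3 transition; my transition block later in this article organizes the same layers slightly differently):

    import torch
    import torch.nn as nn

    # LayerNorm (applied channels-last via permute) then a 2x2 stride-2 convolution.
    norm = nn.LayerNorm(96)
    down = nn.Conv2d(96, 192, kernel_size=2, stride=2)  # non-overlapping 2x2 window

    x = torch.rand(1, 96, 56, 56)
    x = norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # normalize over channels
    print(down(x).size())  # torch.Size([1, 192, 28, 28])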

And that is basically all of the modifications made to the original ResNet to create the ConvNeXt architecture. Don't worry if it still feels a bit unclear for now; I believe things will become clearer as we get into the code.


    ConvNeXt Implementation

Figure 6 below displays the details of the entire ConvNeXt-T architecture, each component of which we will later implement one by one. Here you can also see how it differs from ResNet-50 and Swin-T, the two models comparable to ConvNeXt-T.

Figure 6. The details of the ResNet-50, ConvNeXt-T and Swin-T architectures [1].

When it comes to the implementation, the first thing we need to do is import the required modules. The only two we import here are the base torch module and its nn submodule for loading neural network layers.

    # Codeblock 1
    import torch
    import torch.nn as nn

    ConvNeXt Block

Now let's start with the ConvNeXt block. You can see in Figure 6 that the block structures in the res2, res3, res4 and res5 stages are basically the same, all of which correspond to the rightmost illustration in Figure 5. Thanks to these identical structures, we can implement them in a single class and use it repeatedly. Look at Codeblocks 2a and 2b below to see how I do that.

    # Codeblock 2a
    class ConvNeXtBlock(nn.Module):
        def __init__(self, num_channels):         #(1)
            super().__init__()
            hidden_channels = num_channels * 4    #(2)
            
            self.conv0 = nn.Conv2d(in_channels=num_channels,         #(3)
                                   out_channels=num_channels,        #(4)
                                   kernel_size=7,    #(5)
                                   stride=1,
                                   padding=3,        #(6)
                                   groups=num_channels)              #(7)
            
            self.norm = nn.LayerNorm(normalized_shape=num_channels)  #(8)
            
            self.conv1 = nn.Conv2d(in_channels=num_channels,         #(9)
                                   out_channels=hidden_channels,
                                   kernel_size=1,
                                   stride=1,
                                   padding=0)
            
            self.gelu = nn.GELU()  #(10)
            
            self.conv2 = nn.Conv2d(in_channels=hidden_channels,      #(11)
                                   out_channels=num_channels,
                                   kernel_size=1,
                                   stride=1,
                                   padding=0)

I decided to name this class ConvNeXtBlock. You can see at line #(1) in the above codeblock that this class accepts num_channels as its only parameter, which denotes both the number of input and output channels. Remember that a ConvNeXt block follows the pattern of the inverted bottleneck structure, i.e., narrow → wide → narrow. If you take a closer look at Figure 6, you will notice that the wide part is 4 times larger than the narrow part. Thus, we set the value of the hidden_channels variable accordingly (#(2)).

Next, we initialize 3 convolution layers, which I refer to as conv0, conv1 and conv2. Each of these convolution layers has its own specification. For conv0, we set the number of input and output channels to be the same, which is why both its in_channels and out_channels parameters are set to num_channels (#(3–4)). We set the kernel size of this layer to 7×7 (#(5)). Given this specification, we need to set the padding size to 3 in order to retain the spatial dimension (#(6)). Don't forget to set the groups parameter to num_channels, because we want this to be a depthwise convolution layer (#(7)). Meanwhile, the conv1 layer (#(9)) is responsible for increasing the number of image channels, whereas the subsequent conv2 layer (#(11)) is employed to shrink the tensor back to the original channel count. It is important to note that conv1 and conv2 both use a 1×1 kernel size, which essentially means that they only work by combining information along the channel dimension. Additionally, here we also need to initialize the layer norm (#(8)) and the GELU activation function (#(10)) as replacements for batch norm and ReLU.

With all the layers required in the ConvNeXtBlock initialized, what we need to do next is define the flow of the tensor in the forward() method below.

    # Codeblock 2b
        def forward(self, x):
            residual = x                 #(1)
            print(f'x & residual\t: {x.size()}')
            
            x = self.conv0(x)
            print(f'after conv0\t: {x.size()}')
            
            x = x.permute(0, 2, 3, 1)    #(2)
            print(f'after permute\t: {x.size()}')
            
            x = self.norm(x)
            print(f'after norm\t: {x.size()}')
            
            x = x.permute(0, 3, 1, 2)    #(3)
            print(f'after permute\t: {x.size()}')
            
            x = self.conv1(x)
            print(f'after conv1\t: {x.size()}')
            
            x = self.gelu(x)
            print(f'after gelu\t: {x.size()}')
            
            x = self.conv2(x)
            print(f'after conv2\t: {x.size()}')
            
            x = x + residual             #(4)
            print(f'after summation\t: {x.size()}')
            
            return x

What we basically do in the above code is just pass the tensor through each layer we defined earlier, sequentially. However, there are two things I would like to highlight here. First, we need to store the original input tensor in the residual variable (#(1)), which will skip over all the operations within the ConvNeXt block. Secondly, remember that layer norm is typically used for sequential data, which usually has a different shape from image data. For this reason, we need to adjust the tensor dimension so that the shape becomes (N, H, W, C) (#(2)) before we actually perform the layer normalization operation. Afterwards, don't forget to permute this tensor back to (N, C, H, W) (#(3)). The resulting tensor is then passed through the remaining layers before being summed with the residual connection (#(4)).
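
    If the repeated permute calls feel noisy, one alternative (my own sketch, not part of the implementation in this article) is to wrap them in a small channels-first layer norm module:

    import torch
    import torch.nn as nn

    class LayerNorm2d(nn.Module):
        """LayerNorm over the channel dimension of an (N, C, H, W) tensor."""
        def __init__(self, num_channels):
            super().__init__()
            self.norm = nn.LayerNorm(num_channels)

        def forward(self, x):
            x = x.permute(0, 2, 3, 1)     # (N, C, H, W) -> (N, H, W, C)
            x = self.norm(x)
            return x.permute(0, 3, 1, 2)  # back to (N, C, H, W)

    print(LayerNorm2d(96)(torch.rand(1, 96, 56, 56)).size())  # torch.Size([1, 96, 56, 56])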

To check whether our ConvNeXtBlock class works properly, we can test it using Codeblock 3 below. Here we are going to simulate the block used in the res2 stage. So, we set the num_channels parameter to 96 (#(1)) and create a dummy tensor which we assume to be a batch containing a single image of size 56×56 (#(2)).

    # Codeblock 3
    convnext_block_test = ConvNeXtBlock(num_channels=96)  #(1)
    x_test = torch.rand(1, 96, 56, 56)  #(2)
    
    out_test = convnext_block_test(x_test)

Below is what the resulting output looks like. Regarding the internal flow, it seems that all the layers we stacked earlier work properly. At line #(1) in the output below, we can see that the tensor dimension changed to 1×56×56×96 (N, H, W, C) after being permuted. This tensor size then changed back to 1×96×56×56 (N, C, H, W) after the second permute operation (#(2)). Next, the conv1 layer successfully expanded the number of channels to 4 times that of the input (#(3)), which was then reduced back to the original channel count (#(4)). Here you can see that the tensor shapes at the first and the last layer are exactly the same, allowing us to stack as many ConvNeXt blocks as we want.

    # Codeblock 3 Output
    x & residual    : torch.Size([1, 96, 56, 56])
    after conv0     : torch.Size([1, 96, 56, 56])
    after permute   : torch.Size([1, 56, 56, 96])    #(1)
    after norm      : torch.Size([1, 56, 56, 96])
    after permute   : torch.Size([1, 96, 56, 56])    #(2)
    after conv1     : torch.Size([1, 384, 56, 56])   #(3)
    after gelu      : torch.Size([1, 384, 56, 56])
    after conv2     : torch.Size([1, 96, 56, 56])    #(4)
    after summation : torch.Size([1, 96, 56, 56])

    ConvNeXt Block Transition

The next component I want to implement is the one I refer to as the ConvNeXt block transition. The idea of this block is actually similar to the ConvNeXt block we implemented earlier, except that this transition block is used when we are about to move from one stage to the next. More specifically, this block will later be employed as the first ConvNeXt block in each stage (except res2). The reason I implement it in a separate class is that there are some intricate details that differ from the ConvNeXt block. Additionally, it is worth noting that the term transition is not officially used in the paper. Rather, it is just the word I use myself to describe this idea. I actually also used this approach back when I wrote about the smaller ResNet versions, i.e., ResNet-18 and ResNet-34. Click the link at reference number [6] at the end of this article if you are interested in reading that one.

    # Codeblock 4a
    class ConvNeXtBlockTransition(nn.Module):
        def __init__(self, in_channels, out_channels):  #(1)
            super().__init__()
            hidden_channels = out_channels * 4
            
            self.projection = nn.Conv2d(in_channels=in_channels,      #(2)
                                        out_channels=out_channels,
                                        kernel_size=1,
                                        stride=2,
                                        padding=0)
            
            self.conv0 = nn.Conv2d(in_channels=in_channels,
                                   out_channels=out_channels,
                                   kernel_size=7,
                                   stride=1,
                                   padding=3,
                                   groups=in_channels)
            
            self.norm0 = nn.LayerNorm(normalized_shape=out_channels)
            
            self.conv1 = nn.Conv2d(in_channels=out_channels,
                                   out_channels=hidden_channels,
                                   kernel_size=1,
                                   stride=1,
                                   padding=0)
            
            self.gelu = nn.GELU()
            
            self.conv2 = nn.Conv2d(in_channels=hidden_channels,
                                   out_channels=out_channels,
                                   kernel_size=1,
                                   stride=1,
                                   padding=0)
            
            self.norm1 = nn.LayerNorm(normalized_shape=out_channels)  #(3)
            
            self.downsample = nn.Conv2d(in_channels=out_channels,     #(4)
                                        out_channels=out_channels,
                                        kernel_size=2,
                                        stride=2)

The first difference you might notice here is the input of the __init__() method, where in this case we separate the number of input and output channels into two parameters, as seen at line #(1) in Codeblock 4a. This is mainly done because we need this block to take the output tensor from the previous stage, which has a different number of channels from the one to be generated in the next stage. Referring to Figure 6, for example, if we were to create the first ConvNeXt block in the res3 stage, we would need to configure it so that it accepts a tensor of 96 channels from res2 and returns another tensor with 192 channels.

Secondly, here we implement the separate downsample layer I explained earlier (#(4)), along with the corresponding layer norm placed before it (#(3)). As the name suggests, this layer is employed to reduce the spatial dimension of the image by half.

Third, we initialize the so-called projection layer at line #(2). In the ConvNeXtBlock we created earlier, this layer is not necessary because the input and output tensors have exactly the same shape. In the case of the transition block, the image spatial dimension is reduced by half, while at the same time the number of output channels is doubled. This projection layer is responsible for adjusting the dimension of the residual connection so that it matches the one from the main flow, allowing the element-wise operation to be performed.

The forward() method in Codeblock 4b below is also similar to the one belonging to the ConvNeXtBlock class, except that here the residual connection needs to be processed with the projection layer (#(1)), while the main tensor needs to be downsampled (#(2)) before the summation is done at line #(3).

    # Codeblock 4b
        def forward(self, x):
            print(f'original\t\t: {x.size()}')
    
            residual = self.projection(x)  #(1)
            print(f'residual after proj\t: {residual.size()}')
            
            x = self.conv0(x)
            print(f'after conv0\t\t: {x.size()}')
            
            x = x.permute(0, 2, 3, 1)
            print(f'after permute\t\t: {x.size()}')
            
            x = self.norm0(x)
            print(f'after norm0\t\t: {x.size()}')
            
            x = x.permute(0, 3, 1, 2)
            print(f'after permute\t\t: {x.size()}')
            
            x = self.conv1(x)
            print(f'after conv1\t\t: {x.size()}')
            
            x = self.gelu(x)
            print(f'after gelu\t\t: {x.size()}')
            
            x = self.conv2(x)
            print(f'after conv2\t\t: {x.size()}')
    
            x = x.permute(0, 2, 3, 1)
            print(f'after permute\t\t: {x.size()}')
            
            x = self.norm1(x)
            print(f'after norm1\t\t: {x.size()}')
            
            x = x.permute(0, 3, 1, 2)
            print(f'after permute\t\t: {x.size()}')
            
            x = self.downsample(x)  #(2)
            print(f'after downsample\t: {x.size()}')
            
            x = x + residual  #(3)
            print(f'after summation\t\t: {x.size()}')
            
            return x

Now let's test the ConvNeXtBlockTransition class above using the following codeblock. Suppose we are about to implement the first ConvNeXt block in stage res3. To do so, we can simply instantiate the transition block with in_channels=96 and out_channels=192 before eventually passing a dummy tensor of size 1×96×56×56 through it.

    # Codeblock 5
    convnext_block_transition_test = ConvNeXtBlockTransition(in_channels=96, 
                                                             out_channels=192)
    x_test = torch.rand(1, 96, 56, 56)
    
    out_test = convnext_block_transition_test(x_test)

    # Codeblock 5 Output
    original            : torch.Size([1, 96, 56, 56])
    residual after proj : torch.Size([1, 192, 28, 28])  #(1)
    after conv0         : torch.Size([1, 192, 56, 56])  #(2)
    after permute       : torch.Size([1, 56, 56, 192])
    after norm0         : torch.Size([1, 56, 56, 192])
    after permute       : torch.Size([1, 192, 56, 56])
    after conv1         : torch.Size([1, 768, 56, 56])
    after gelu          : torch.Size([1, 768, 56, 56])
    after conv2         : torch.Size([1, 192, 56, 56])  #(3)
    after permute       : torch.Size([1, 56, 56, 192])
    after norm1         : torch.Size([1, 56, 56, 192])
    after permute       : torch.Size([1, 192, 56, 56])
    after downsample    : torch.Size([1, 192, 28, 28])  #(4)
    after summation     : torch.Size([1, 192, 28, 28])  #(5)

You can see in the resulting output that our projection layer directly maps the 1×96×56×56 residual tensor to 1×192×28×28, as shown at line #(1). Meanwhile, the main tensor x needs to be processed by the other layers we initialized earlier to achieve this shape. The steps we performed from line #(2) to #(3) on the x tensor are basically the same as those in the ConvNeXtBlock class. At this point, the number of channels already matches what we need (192). The spatial dimension is then reduced after the tensor is processed by the downsample layer (#(4)). Since the tensor dimensions of x and residual now match, we can finally perform the element-wise summation (#(5)).


The Entire ConvNeXt Architecture

With the ConvNeXtBlock and ConvNeXtBlockTransition classes ready to use, we can now start to construct the entire ConvNeXt architecture. Before we do that, I would like to introduce some config parameters first. See Codeblock 6 below.

    # Codeblock 6
    IN_CHANNELS  = 3     #(1)
    IMAGE_SIZE   = 224   #(2)
    
    NUM_BLOCKS   = [3, 3, 9, 3]         #(3)
    OUT_CHANNELS = [96, 192, 384, 768]  #(4)
    NUM_CLASSES  = 1000  #(5)

The first ones relate to the dimensions of the input image. As shown at lines #(1) and #(2), here we set IN_CHANNELS to 3 and IMAGE_SIZE to 224, since by default ConvNeXt accepts a batch of RGB images of that size. The next ones are related to the model configuration. In this case, I set the number of ConvNeXt blocks for each stage to [3, 3, 9, 3] (#(3)) and the corresponding numbers of output channels to [96, 192, 384, 768] (#(4)), since I want to implement the ConvNeXt-T variant. You can actually change these numbers according to the configurations provided in the original paper, shown in Figure 7. Lastly, we set the number of neurons in the output layer to 1000, which corresponds to the number of classes in the dataset we train the model on (#(5)).

Figure 7. The ConvNeXt variants [1].
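
    For reference, the other variants in Figure 7 only change these two lists. Based on the paper, the configurations look like this (a sketch for convenience; double-check against Figure 7 before using):

    # (NUM_BLOCKS, OUT_CHANNELS) per variant, following Figure 7.
    CONVNEXT_CONFIGS = {
        'T':  ([3, 3, 9, 3],  [96, 192, 384, 768]),
        'S':  ([3, 3, 27, 3], [96, 192, 384, 768]),
        'B':  ([3, 3, 27, 3], [128, 256, 512, 1024]),
        'L':  ([3, 3, 27, 3], [192, 384, 768, 1536]),
        'XL': ([3, 3, 27, 3], [256, 512, 1024, 2048]),
    }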

We will now implement the entire architecture in the ConvNeXt class shown in Codeblocks 7a and 7b below. The following __init__() method may seem a bit complicated at a glance, but don't worry, as I will explain it thoroughly.

    # Codeblock 7a
    class ConvNeXt(nn.Module):
        def __init__(self):
            super().__init__()
            
            self.stem = nn.Conv2d(in_channels=IN_CHANNELS,    #(1)
                                  out_channels=OUT_CHANNELS[0],
                                  kernel_size=4,
                                  stride=4,
                                 )
    
            self.normstem = nn.LayerNorm(normalized_shape=OUT_CHANNELS[0])  #(2)
            
            #(3)
            self.res2 = nn.ModuleList()
            for _ in range(NUM_BLOCKS[0]):
                self.res2.append(ConvNeXtBlock(num_channels=OUT_CHANNELS[0]))
            
            #(4)
            self.res3 = nn.ModuleList([ConvNeXtBlockTransition(in_channels=OUT_CHANNELS[0],
                                                               out_channels=OUT_CHANNELS[1])])
            for _ in range(NUM_BLOCKS[1]-1):
                self.res3.append(ConvNeXtBlock(num_channels=OUT_CHANNELS[1]))
    
            #(5)
            self.res4 = nn.ModuleList([ConvNeXtBlockTransition(in_channels=OUT_CHANNELS[1],
                                                               out_channels=OUT_CHANNELS[2])])
            for _ in range(NUM_BLOCKS[2]-1):
                self.res4.append(ConvNeXtBlock(num_channels=OUT_CHANNELS[2]))
    
            #(6)
            self.res5 = nn.ModuleList([ConvNeXtBlockTransition(in_channels=OUT_CHANNELS[2],
                                                               out_channels=OUT_CHANNELS[3])])
            for _ in range(NUM_BLOCKS[3]-1):
                self.res5.append(ConvNeXtBlock(num_channels=OUT_CHANNELS[3]))
    
            self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))          #(7)
            self.normpool = nn.LayerNorm(normalized_shape=OUT_CHANNELS[3])  #(8)
            self.fc = nn.Linear(in_features=OUT_CHANNELS[3],                #(9)
                                out_features=NUM_CLASSES)
            
            self.relu = nn.ReLU()

The first thing we do here is initialize the stem stage (#(1)), which is essentially just a convolution layer with a 4×4 kernel size and stride 4. This configuration effectively reduces the image size by a factor of 4, where every single pixel in the output tensor represents a 4×4 patch in the input tensor. For the subsequent stages, we need to wrap the corresponding ConvNeXt blocks with nn.ModuleList(). For stages res3 (#(4)), res4 (#(5)) and res5 (#(6)), we place a ConvNeXtBlockTransition at the beginning of each list as a "bridge" between stages. We don't do this for stage res2, since the tensor produced by the stem stage is already compatible with it (#(3)). Next, we initialize an nn.AdaptiveAvgPool2d layer, which will be used to reduce the spatial dimensions of the tensor to 1×1 by computing the mean of each channel (#(7)). In fact, this is the exact same process used by ResNet to prepare the tensor from the last convolution layer so that it fits the shape required by the subsequent output layer (#(9)). Additionally, don't forget to initialize the two layer normalization layers, which I refer to as normstem (#(2)) and normpool (#(8)); these two layers will be placed right after the stem stage and the avgpool layer, respectively.

The forward() method is fairly straightforward. All we need to do in the following code is place the layers one after another. Keep in mind that since the ConvNeXt blocks are stored in lists, we need to call them iteratively with loops, as seen at lines #(1–4). Additionally, don't forget to reshape the tensor produced by the nn.AdaptiveAvgPool2d layer (#(5)) so that it is compatible with the subsequent fully-connected layer (#(6)).

    # Codeblock 7b
        def forward(self, x):
            print(f'original\t: {x.size()}')
            
            x = self.relu(self.stem(x))
            print(f'after stem\t: {x.size()}')
    
            x = x.permute(0, 2, 3, 1)
            print(f'after permute\t: {x.size()}')
            
            x = self.normstem(x)
            print(f'after normstem\t: {x.size()}')
            
            x = x.permute(0, 3, 1, 2)
            print(f'after permute\t: {x.size()}')
            
            print()
            for i, block in enumerate(self.res2):    #(1)
                x = block(x)
                print(f'after res2 #{i}\t: {x.size()}')
            
            print()
            for i, block in enumerate(self.res3):    #(2)
                x = block(x)
                print(f'after res3 #{i}\t: {x.size()}')
            
            print()
            for i, block in enumerate(self.res4):    #(3)
                x = block(x)
                print(f'after res4 #{i}\t: {x.size()}')
            
            print()
            for i, block in enumerate(self.res5):    #(4)
                x = block(x)
                print(f'after res5 #{i}\t: {x.size()}')
            
            print()
            x = self.avgpool(x)
            print(f'after avgpool\t: {x.size()}')
    
            x = x.permute(0, 2, 3, 1)
            print(f'after permute\t: {x.size()}')
            
            x = self.normpool(x)
            print(f'after normpool\t: {x.size()}')
            
            x = x.permute(0, 3, 1, 2)
            print(f'after permute\t: {x.size()}')
            
            x = x.reshape(x.shape[0], -1)             #(5)
            print(f'after reshape\t: {x.size()}')
            
            x = self.fc(x)
            print(f'after fc\t: {x.size()}')          #(6)
            
            return x

Now for the moment of truth, let's see whether we have correctly implemented the entire ConvNeXt model by running the following code. Here I pass a tensor of size 1×3×224×224 through the network, simulating a batch containing a single RGB image of size 224×224.

    # Codeblock 8
    convnext_test = ConvNeXt()
    
    x_test   = torch.rand(1, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
    out_test = convnext_test(x_test)

You can see in the following output that our implementation appears to be correct, since the behavior of the network aligns with the architectural design shown in Figure 6. The spatial dimension of the image gradually gets smaller as we go deeper into the network, while at the same time the number of channels increases, thanks to the ConvNeXtBlockTransition blocks we placed at the beginning of stages res3 (#(1)), res4 (#(2)) and res5 (#(3)). The avgpool layer then correctly downsampled the spatial dimension to 1×1 (#(4)), allowing it to be connected to the output layer (#(5)).

    # Codeblock 8 Output
    original       : torch.Size([1, 3, 224, 224])
    after stem     : torch.Size([1, 96, 56, 56])
    after permute  : torch.Size([1, 56, 56, 96])
    after normstem : torch.Size([1, 56, 56, 96])
    after permute  : torch.Size([1, 96, 56, 56])
    
    after res2 #0  : torch.Size([1, 96, 56, 56])
    after res2 #1  : torch.Size([1, 96, 56, 56])
    after res2 #2  : torch.Size([1, 96, 56, 56])
    
    after res3 #0  : torch.Size([1, 192, 28, 28])  #(1)
    after res3 #1  : torch.Size([1, 192, 28, 28])
    after res3 #2  : torch.Size([1, 192, 28, 28])
    
    after res4 #0  : torch.Size([1, 384, 14, 14])  #(2)
    after res4 #1  : torch.Size([1, 384, 14, 14])
    after res4 #2  : torch.Size([1, 384, 14, 14])
    after res4 #3  : torch.Size([1, 384, 14, 14])
    after res4 #4  : torch.Size([1, 384, 14, 14])
    after res4 #5  : torch.Size([1, 384, 14, 14])
    after res4 #6  : torch.Size([1, 384, 14, 14])
    after res4 #7  : torch.Size([1, 384, 14, 14])
    after res4 #8  : torch.Size([1, 384, 14, 14])
    
    after res5 #0  : torch.Size([1, 768, 7, 7])    #(3)
    after res5 #1  : torch.Size([1, 768, 7, 7])
    after res5 #2  : torch.Size([1, 768, 7, 7])
    
    after avgpool  : torch.Size([1, 768, 1, 1])    #(4)
    after permute  : torch.Size([1, 1, 1, 768])
    after normpool : torch.Size([1, 1, 1, 768])
    after permute  : torch.Size([1, 768, 1, 1])
    after reshape  : torch.Size([1, 768])
    after fc       : torch.Size([1, 1000])         #(5)
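
    As an additional sanity check, we can also count the trainable parameters. The paper reports roughly 28M parameters for ConvNeXt-T; note that our number may differ somewhat, since this implementation adds projection layers in the transition blocks and a few extra norms:

    num_params = sum(p.numel() for p in convnext_test.parameters() if p.requires_grad)
    print(f'{num_params:,} trainable parameters')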

    Ending

Well, that was pretty much everything about the theory and the implementation of the ConvNeXt architecture. Again, I acknowledge that the code I demonstrated above may not fully capture everything, since this article is intended to cover the general idea of the model. So, I highly recommend you read the original implementation by Meta's researchers [2] if you want to know more about the intricate details.

I hope you find this article useful. Thanks for reading!

P.S. The notebook used in this article is available in my GitHub repo. See the link at reference number [7].


    References

[1] Zhuang Liu et al. A ConvNet for the 2020s. arXiv. https://arxiv.org/pdf/2201.03545 [Accessed January 18, 2025].

[2] facebookresearch. ConvNeXt. GitHub. https://github.com/facebookresearch/ConvNeXt/blob/main/models/convnext.py [Accessed January 18, 2025].

[3] Kaiming He et al. Deep Residual Learning for Image Recognition. arXiv. https://arxiv.org/pdf/1512.03385 [Accessed January 18, 2025].

[4] Ze Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv. https://arxiv.org/pdf/2103.14030 [Accessed January 18, 2025].

[5] Saining Xie et al. Aggregated Residual Transformations for Deep Neural Networks. arXiv. https://arxiv.org/pdf/1611.05431 [Accessed January 18, 2025].

[6] Muhammad Ardi. Paper Walkthrough: Residual Network (ResNet). Python in Plain English. https://python.plainenglish.io/paper-walkthrough-residual-network-resnet-62af58d1c521 [Accessed January 19, 2025].

    [7] MuhammadArdiPutra. The CNN That Challenges ViT — ConvNeXt. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/The%20CNN%20That%20Challenges%20ViT%20-%20ConvNeXt.ipynb [Accessed January 24, 2025].


