Given the title of this article, you may well assume that ResNeXt is directly derived from ResNet. Well, that's true, but I think it's not entirely accurate. In fact, to me ResNeXt is more like a combination of ResNet, VGG, and Inception at the same time; I'll show you the reason in a moment. In this article we're going to talk about the ResNeXt architecture, covering its history, the details of the architecture itself, and, last but not least, the code implementation from scratch with PyTorch.
The History of ResNeXt
The hyperparameters we usually pay attention to when tuning a neural network are depth and width, which correspond to the number of layers and the number of channels, respectively. We see this in VGG and ResNet, where the authors of the two models proposed small kernels and skip-connections so that the depth of the model can be increased easily. In theory, this simple approach is indeed capable of expanding model capacity. However, these two hyperparameter dimensions are always associated with a significant change in the number of parameters, which is a problem since at some point the model becomes too large just to gain a slight improvement in accuracy. On the other hand, we know that in theory Inception is computationally cheaper, yet it has a complex architectural design, which requires more effort to tune the depth and width of the network. If you have ever learned about Inception, it essentially works by passing a tensor through multiple convolution layers of different kernel sizes and letting the network decide which one best represents the features of a specific task.
Xie et al. wondered whether they could extract the best parts of the three models so that model tuning could be as easy as in VGG and ResNet while still maintaining the efficiency of Inception. All of these ideas are wrapped up in a paper titled "Aggregated Residual Transformations for Deep Neural Networks" [1], where they named the network ResNeXt. This is essentially where the new concept called cardinality came from: it adopts the idea of Inception, i.e., passing a tensor through multiple branches, but in a simpler, more scalable way. We can think of cardinality as a new parameter to be tuned in addition to depth and width. By doing so, we essentially have the next hyperparameter dimension (hence the name ResNeXt), which gives us a higher degree of freedom when performing hyperparameter tuning.
ResNeXt Module
According to the paper, there are three ways we can implement cardinality, which you can see in Figure 1 below. The paper also mentions that setting cardinality to 32 is best practice, as it generally provides a good balance between accuracy and computational complexity, so I'll use this number to explain the following example.
The input of the three modules above is exactly the same, i.e., an image tensor with 256 channels. In variant (a), the input tensor is duplicated 32 times, and each copy is processed independently to represent the 32 paths. The first convolution layer in each path is responsible for projecting the 256-channel image down to 4 channels using a 1×1 kernel, which is followed by two more layers: a 3×3 convolution that preserves the number of channels, and a 1×1 convolution that expands the channels back to 256. The tensors from the 32 branches are then aggregated by element-wise summation before eventually being summed again, through a skip-connection, with the original input tensor from the very beginning of the module.
Remember that Inception uses the idea of split-transform-merge. That is exactly what I just explained for ResNeXt block variant (a), where the split is done before the first 1×1 convolution layer, the transform is carried out inside each branch, and the merge is the element-wise summation. This idea also applies to ResNeXt module variant (b), in which case the merge operation is performed by channel-wise concatenation, resulting in a 128-channel image (which comes from 4 channels × 32 paths). The resulting tensor is then projected back to the original dimension by a 1×1 convolution layer before finally being summed with the original input tensor.
Notice that there is the word equivalent in the top-left corner of the figure above. It means that these three ResNeXt block variants are essentially the same in terms of the number of parameters, FLOPs, and the resulting accuracy scores. This notion makes sense because they are all derived from the same mathematical formulation, which I'll talk more about in the next section. Despite this equivalence, I'll go with option (c) later in the implementation part, because this variant employs the so-called group convolution, which is much easier to implement than (a) and (b). In case you're not yet familiar with the term, group convolution is a technique where we divide all input channels into several groups, each of which is convolved only with the channels of the same group, before the group outputs are finally concatenated. In the case of (c), we reduce the number of channels from 256 to 128 before the split is done, giving us 32 convolution kernel groups where each is responsible for processing 4 channels. We then project the tensor back to the original number of channels so that we can sum it with the original input tensor.
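To make this equivalence concrete, here is a small sketch (my own illustration, not code from the paper) showing that a single grouped convolution carries exactly the same number of weights as 32 independent 4-channel branches:

# Extra example: group convolution vs. 32 separate branches (illustrative only)
import torch.nn as nn

# One 3x3 convolution over 128 channels, split into 32 groups of 4 channels
grouped = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32, bias=False)

# The same computation written as 32 independent 3x3 convolutions,
# each seeing only 4 of the 128 channels
branches = nn.ModuleList(
    [nn.Conv2d(4, 4, kernel_size=3, padding=1, bias=False) for _ in range(32)]
)

grouped_params = sum(p.numel() for p in grouped.parameters())
branch_params = sum(p.numel() for b in branches for p in b.parameters())
print(grouped_params, branch_params)  # both print 4608 (= 32 * 4 * 4 * 3 * 3)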
Mathematical Definition
As I mentioned earlier, here is what the formal mathematical definition of a ResNeXt module looks like.
y = x + \sum_{i=1}^{C} T_i(x)    (Figure 2)
The above equation encapsulates the entire split-transform-merge operation, where x is the original input tensor, y is the output tensor, C is the cardinality parameter that determines the number of parallel paths used, T_i is the transformation function applied to the i-th path, and the sigma indicates that we merge the information from all the transformed tensors. However, it is important to note that although sigma usually denotes summation, only variant (a) actually sums the tensors. Meanwhile, both (b) and (c) do the merging by concatenation followed by a 1×1 convolution instead, which is still equivalent to (a).
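As a quick sanity check of this formula, the sketch below (my own illustration) evaluates the variant (a) form directly: each T_i is a 1×1 → 3×3 → 1×1 branch with the channel counts described earlier, and the C = 32 branch outputs are summed together with the input x.

# Extra example: the aggregated transformation of variant (a) (illustrative only)
import torch
import torch.nn as nn

C = 32  # cardinality

def make_branch():
    # T_i: 256 -> 4 -> 4 -> 256, exactly as in variant (a)
    return nn.Sequential(
        nn.Conv2d(256, 4, kernel_size=1, bias=False),
        nn.Conv2d(4, 4, kernel_size=3, padding=1, bias=False),
        nn.Conv2d(4, 256, kernel_size=1, bias=False),
    )

branches = nn.ModuleList([make_branch() for _ in range(C)])

x = torch.randn(1, 256, 56, 56)
y = x + sum(branch(x) for branch in branches)  # y = x + sum_i T_i(x)
print(y.size())  # torch.Size([1, 256, 56, 56])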
The Entire ResNeXt Architecture
The structure displayed in Figure 1 and the equation in Figure 2 basically correspond to only a single ResNeXt block. In order to construct the entire architecture, we need to stack the block multiple times following the structure shown in Figure 3 below.

Here you can see that the structure of ResNeXt is quite similar to that of ResNet. So, I believe you'll find the ResNeXt implementation extremely easy, especially if you have ever implemented ResNet before. The main difference you may notice in the architecture is the number of kernels in the first two convolution layers of each block, where a ResNeXt block typically has twice as many kernels as the corresponding ResNet block, namely from the conv2 stage all the way to the conv5 stage. Secondly, it is also clearly visible that we have the cardinality parameter applied to the second convolution layer in each ResNeXt block.
The ResNeXt variant shown above, which is equivalent to ResNet-50, is the one called ResNeXt-50 (32×4d). This naming convention indicates that the variant consists of 50 layers in the main branch, with a cardinality of 32 and 4 channels per path within the conv2 stage. As of this writing, there are three ResNeXt variants already implemented in PyTorch, namely resnext50_32x4d, resnext101_32x8d, and resnext101_64x4d [2]. You can definitely import them along with their pretrained weights if you want to. However, in this article we're going to implement the architecture from scratch instead.
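Just to show what that looks like, here is a minimal sketch of loading the ready-made variant from torchvision (assuming torchvision 0.13 or newer, where the weights-enum API is available), as opposed to the from-scratch route we take below:

# Loading the pretrained ResNeXt-50 (32x4d) from torchvision (sketch)
from torchvision.models import resnext50_32x4d, ResNeXt50_32X4D_Weights

pretrained = resnext50_32x4d(weights=ResNeXt50_32X4D_Weights.IMAGENET1K_V2)
pretrained.eval()  # switch to inference mode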
ResNeXt Implementation
Now that we understand the underlying theory behind ResNeXt, let's get our hands dirty with the code! The first thing we need to do is import the required modules, as shown in Codeblock 1 below.
# Codeblock 1
import torch
import torch.nn as nn
from torchinfo import summary
Here I'm going to implement the ResNeXt-50 (32×4d) variant, so I need to set the parameters in Codeblock 2 according to the architectural details shown back in Figure 3.
# Codeblock 2
CARDINALITY = 32 #(1)
NUM_CHANNELS = [3, 64, 256, 512, 1024, 2048] #(2)
NUM_BLOCKS = [3, 4, 6, 3] #(3)
NUM_CLASSES = 1000 #(4)
The CARDINALITY variable at line #(1) is self-explanatory, so I don't think I need to explain it any further. Next, the NUM_CHANNELS variable stores the number of output channels of each stage, except for index 0, which corresponds to the number of input channels (#(2)). At line #(3), NUM_BLOCKS determines how many times we repeat the corresponding block. Note that we don't specify any number for the conv1 stage since that stage only consists of a single block. Lastly, we set the NUM_CLASSES parameter to 1000 since ResNeXt was originally pretrained on the ImageNet-1K dataset (#(4)).
The ResNeXt Module
Since the entire ResNeXt architecture is basically just a stack of ResNeXt modules, we can create a single class to define the module and then use it repeatedly in the main class. In this case, I refer to the module as Block. The implementation of this class is pretty long, though, so I decided to break it down into several codeblocks. Just make sure that all codeblocks with the same number are placed within the same notebook cell if you want to run the code.
You can see in Codeblock 3a below that the __init__() method of this class accepts several parameters. The in_channels parameter (#(1)) sets the number of channels of the tensor to be passed into the block. I made it adjustable because blocks in different stages work with different input shapes. Secondly, the add_channel and downsample parameters (#(2,4)) are flags that control whether the block performs downsampling. If you take a closer look at Figure 3, you'll notice that every time we move from one stage to the next, the number of output channels of a block doubles compared to the output of the previous stage, while at the same time the spatial dimension is halved. We therefore need to set both add_channel and downsample to True whenever we move from one stage to the next, and set both to False when we only move from one block to another within the same stage. The channel_multiplier parameter (#(3)), on the other hand, determines the number of output channels relative to the number of input channels. This parameter is important because there is one special case where the number of output channels needs to be four times larger instead of two, namely when we move from the conv1 stage (64 channels) to the conv2 stage (256 channels).
# Codeblock 3a
class Block(nn.Module):
    def __init__(self,
                 in_channels,           #(1)
                 add_channel=False,     #(2)
                 channel_multiplier=2,  #(3)
                 downsample=False):     #(4)
        super().__init__()

        self.add_channel = add_channel
        self.channel_multiplier = channel_multiplier
        self.downsample = downsample

        if self.add_channel:  #(5)
            out_channels = in_channels*self.channel_multiplier  #(6)
        else:
            out_channels = in_channels  #(7)

        mid_channels = out_channels//2  #(8)

        if self.downsample:  #(9)
            stride = 2  #(10)
        else:
            stride = 1
The parameters we just discussed directly control the if statements at lines #(5) and #(9). The former is executed whenever add_channel is True, in which case the number of input channels is multiplied by channel_multiplier to obtain the number of output channels (#(6)). Meanwhile, if it is False, the input and output tensor dimensions are kept the same (#(7)). We then set mid_channels to half the size of out_channels (#(8)), because according to Figure 3 the number of channels in the output of the first two convolution layers within each block is half that of the third convolution layer. Next, the downsample flag we defined earlier controls the if statement at line #(9). Whenever it is set to True, the stride variable is assigned the value 2 (#(10)), which will later cause the convolution layer to reduce the spatial dimension of the image by half.
Still inside the __init__() method, let's now define the layers within the ResNeXt block. See Codeblock 3b below for the details.
# Codeblock 3b
        if self.add_channel or self.downsample:  #(1)
            self.projection = nn.Conv2d(in_channels=in_channels,  #(2)
                                        out_channels=out_channels,
                                        kernel_size=1,
                                        stride=stride,
                                        padding=0,
                                        bias=False)
            nn.init.kaiming_normal_(self.projection.weight, nonlinearity='relu')
            self.bn_proj = nn.BatchNorm2d(num_features=out_channels)

        self.conv0 = nn.Conv2d(in_channels=in_channels,    #(3)
                               out_channels=mid_channels,  #(4)
                               kernel_size=1,
                               stride=1,
                               padding=0,
                               bias=False)
        nn.init.kaiming_normal_(self.conv0.weight, nonlinearity='relu')
        self.bn0 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv1 = nn.Conv2d(in_channels=mid_channels,  #(5)
                               out_channels=mid_channels,
                               kernel_size=3,
                               stride=stride,  #(6)
                               padding=1,
                               bias=False,
                               groups=CARDINALITY)  #(7)
        nn.init.kaiming_normal_(self.conv1.weight, nonlinearity='relu')
        self.bn1 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv2 = nn.Conv2d(in_channels=mid_channels,   #(8)
                               out_channels=out_channels,  #(9)
                               kernel_size=1,
                               stride=1,
                               padding=0,
                               bias=False)
        nn.init.kaiming_normal_(self.conv2.weight, nonlinearity='relu')
        self.bn2 = nn.BatchNorm2d(num_features=out_channels)

        self.relu = nn.ReLU()
Remember that there are cases where the output dimension of a ResNeXt block differs from its input. In such a case, the element-wise summation at the last step cannot be performed (refer to Figure 1). This is the reason we need to initialize a projection layer whenever either the add_channel or downsample flag is True (#(1)). This projection layer (#(2)), which is a 1×1 convolution, processes the tensor in the skip-connection so that its output shape matches the tensor processed by the main flow, allowing the two to be summed. Otherwise, if we want the ResNeXt module to preserve the tensor dimension, we set both flags to False so that the projection layer is not initialized at all, since we can directly sum the skip-connection with the tensor from the main flow.
The main flow of the ResNeXt module itself comprises three convolution layers, which I refer to as conv0, conv1 and conv2, written at lines #(3), #(5) and #(8) respectively. If we take a closer look at these layers, we can see that both conv0 and conv2 are responsible for manipulating the number of channels. At lines #(3) and #(4) we can see that conv0 changes the number of image channels from in_channels to mid_channels, whereas conv2 changes it from mid_channels to out_channels (#(8-9)). The conv1 layer, on the other hand, is responsible for controlling the spatial dimension through the stride parameter (#(6)), whose value is determined by the downsample flag we discussed earlier. Additionally, this conv1 layer performs the entire split-transform-merge process through group convolution (#(7)), where the number of groups corresponds to the cardinality of the ResNeXt block.
Furthermore, we also initialize batch normalization layers named bn_proj, bn0, bn1, and bn2. Later in the forward() method, we will place them right after the corresponding convolution layers following the Conv-BN-ReLU structure, which is standard practice when constructing a CNN-based model. Notice as well that we call nn.init.kaiming_normal_() after initializing each convolution layer, so that the initial layer weights follow the Kaiming normal distribution, as mentioned in the paper.
That was everything about the __init__() method. Now we're going to move on to the forward() method to actually define the flow of the ResNeXt module. See Codeblock 3c below.
# Codeblock 3c
    def forward(self, x):
        print(f'original\t\t: {x.size()}')

        if self.add_channel or self.downsample:  #(1)
            residual = self.bn_proj(self.projection(x))  #(2)
            print(f'after projection\t: {residual.size()}')
        else:
            residual = x  #(3)
            print(f'no projection\t\t: {residual.size()}')

        x = self.conv0(x)  #(4)
        x = self.bn0(x)
        x = self.relu(x)
        print(f'after conv0-bn0-relu\t: {x.size()}')

        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        print(f'after conv1-bn1-relu\t: {x.size()}')

        x = self.conv2(x)  #(5)
        x = self.bn2(x)
        print(f'after conv2-bn2\t\t: {x.size()}')

        x = x + residual
        x = self.relu(x)  #(6)
        print(f'after summation\t\t: {x.size()}')

        return x
Here you can see that this method accepts x as its only input, which is basically the tensor produced by the previous ResNeXt block. The if statement at line #(1) checks whether the block changes the tensor shape. If so, the tensor in the skip-connection is passed through the projection layer and the corresponding batch normalization layer before being stored in the residual variable (#(2)). Otherwise, residual is set to be exactly the same as x (#(3)). Next, we process the main tensor x using the stack of convolution layers, starting from conv0 (#(4)) all the way to conv2 (#(5)). It is important to note that the Conv-BN-ReLU structure of the conv2 layer is slightly different, in that the ReLU activation function is applied after the element-wise summation (#(6)).
Now let's test the ResNeXt block we just created to find out whether we have implemented it correctly. There are three scenarios I'm going to test here: moving from one stage to another (setting both add_channel and downsample to True), moving from one block to another within the same stage (both add_channel and downsample set to False), and moving from the conv1 stage to the conv2 stage (setting downsample to False and add_channel to True with a channel multiplier of 4).
Test Case 1
Codeblock 4 below demonstrates the first test case, in which I simulate the first block of the conv3 stage. If you go back to Figure 3, you will see that the output from the previous stage is a 256-channel image, so we need to set the in_channels parameter accordingly. Meanwhile, the output of the ResNeXt block in this stage has 512 channels with a 28×28 spatial dimension. This tensor shape transformation is the reason we set both flags to True. Here we assume that the x tensor passed through the network is a dummy image produced by the conv2 stage.
# Codeblock 4
block = Block(in_channels=256, add_channel=True, downsample=True)
x = torch.randn(1, 256, 56, 56)
out = block(x)
And below is what the output looks like. You can see at line #(1) that our projection layer successfully projected the tensor to 512×28×28, exactly matching the shape of the output tensor from the main flow (#(4)). The conv0 layer at line #(2) doesn't alter the tensor dimension at all, since in this case our in_channels and mid_channels are the same. The actual spatial downsampling is performed by the conv1 layer, where the image resolution is reduced from 56×56 to 28×28 (#(3)) thanks to the stride being set to 2 for this case. The process then continues with the conv2 layer, which doubles the number of channels from 256 to 512 (#(4)). Finally, this tensor is element-wise summed with the projected skip-connection tensor (#(5)). And with that, we successfully transformed our tensor from 256×56×56 to 512×28×28.
# Codeblock 4 Output
original              : torch.Size([1, 256, 56, 56])
after projection      : torch.Size([1, 512, 28, 28])    #(1)
after conv0-bn0-relu  : torch.Size([1, 256, 56, 56])    #(2)
after conv1-bn1-relu  : torch.Size([1, 256, 28, 28])    #(3)
after conv2-bn2       : torch.Size([1, 512, 28, 28])    #(4)
after summation       : torch.Size([1, 512, 28, 28])    #(5)
Test Case 2
To demonstrate the second test case, I'll simulate a block inside the conv3 stage whose input is a tensor produced by the previous block within the same stage. In such a case, we want the input and output dimensions of the ResNeXt module to be the same, hence we need to set both add_channel and downsample to False. See Codeblock 5 and the resulting output below for the details.
# Codeblock 5
block = Block(in_channels=512, add_channel=False, downsample=False)
x = torch.randn(1, 512, 28, 28)
out = block(x)
# Codeblock 5 Output
original              : torch.Size([1, 512, 28, 28])
no projection         : torch.Size([1, 512, 28, 28])    #(1)
after conv0-bn0-relu  : torch.Size([1, 256, 28, 28])    #(2)
after conv1-bn1-relu  : torch.Size([1, 256, 28, 28])
after conv2-bn2       : torch.Size([1, 512, 28, 28])    #(3)
after summation       : torch.Size([1, 512, 28, 28])
As I mentioned earlier, the projection layer is not going to be used when the block preserves the tensor shape. This is the reason that at line #(1) our skip-connection tensor shape is unchanged. Next, the channel count is reduced to 256 by the conv0 layer, since in this case mid_channels is half the size of out_channels (#(2)). We eventually expand this number of channels back to 512 using the conv2 layer (#(3)). Incidentally, this kind of structure is commonly called a bottleneck, since it follows a wide-narrow-wide pattern, which was first introduced in the original ResNet paper [3].
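To see why this bottleneck design matters, the short sketch below (a back-of-the-envelope comparison of my own, not from the paper) counts the weights of this 512→256→256→512 grouped bottleneck against a naive pair of full 3×3 convolutions operating directly on 512 channels:

# Extra example: bottleneck vs. plain 3x3 stack (illustrative only)
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

bottleneck = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=1, bias=False),                       # wide -> narrow
    nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=32, bias=False), # grouped 3x3
    nn.Conv2d(256, 512, kernel_size=1, bias=False),                       # narrow -> wide
)

naive = nn.Sequential(
    nn.Conv2d(512, 512, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(512, 512, kernel_size=3, padding=1, bias=False),
)

print(count_params(bottleneck))  # 280576
print(count_params(naive))       # 4718592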
Test Case 3
The third test is a special case, since we are about to simulate the first block in the conv2 stage, where we need to set the add_channel flag to True while leaving downsample as False. Here we don't want to perform spatial downsampling in the convolution layer because it has already been done by a max-pooling layer. Furthermore, you can also see in Figure 3 that the conv1 stage returns a 64-channel image. For this reason, we need to set the channel_multiplier parameter to 4, since we want the subsequent conv2 stage to return 256 channels. See the details in Codeblock 6 below.
# Codeblock 6
block = Block(in_channels=64, add_channel=True, channel_multiplier=4, downsample=False)
x = torch.randn(1, 64, 56, 56)
out = block(x)
# Codeblock 6 Output
original              : torch.Size([1, 64, 56, 56])
after projection      : torch.Size([1, 256, 56, 56])    #(1)
after conv0-bn0-relu  : torch.Size([1, 128, 56, 56])    #(2)
after conv1-bn1-relu  : torch.Size([1, 128, 56, 56])
after conv2-bn2       : torch.Size([1, 256, 56, 56])    #(3)
after summation       : torch.Size([1, 256, 56, 56])
You can see in the resulting output above that the ResNeXt module automatically utilizes the projection layer, which in this case successfully converted the 64×56×56 tensor into 256×56×56 (#(1)). Here the number of channels is expanded to four times its original size while the spatial dimension remains the same. Afterwards, we shrink the channel count to 128 (#(2)) and expand it back to 256 (#(3)) following the bottleneck pattern. Thus, we can now perform the summation between the tensor from the main flow and the one produced by the projection layer.
At this point we have got our ResNeXt module working properly for all three cases, so I believe it is now ready to be assembled into the complete ResNeXt architecture.
The Entire ResNeXt Architecture
Since the following ResNeXt class is pretty long, I break it down into two codeblocks to make things easier to follow. What we basically need to do in the __init__() method in Codeblock 7a is initialize the ResNeXt modules using the Block class we created earlier. The conv3 (#(9)), conv4 (#(12)) and conv5 (#(15)) stages are quite straightforward to implement, since all we need to do is initialize their blocks inside nn.ModuleList. Remember that the first block inside each stage is a downsampling block, whereas the rest are not meant to perform downsampling. For this reason, we need to initialize the first block manually, setting both the add_channel and downsample flags to True (#(10,13,16)), while the remaining blocks are initialized using loops that iterate according to the numbers stored in the NUM_BLOCKS list (#(11,14,17)).
# Codeblock 7a
class ResNeXt(nn.Module):
    def __init__(self):
        super().__init__()

        # conv1 stage  #(1)
        self.resnext_conv1 = nn.Conv2d(in_channels=NUM_CHANNELS[0],
                                       out_channels=NUM_CHANNELS[1],
                                       kernel_size=7,  #(2)
                                       stride=2,       #(3)
                                       padding=3,
                                       bias=False)
        nn.init.kaiming_normal_(self.resnext_conv1.weight,
                                nonlinearity='relu')
        self.resnext_bn1 = nn.BatchNorm2d(num_features=NUM_CHANNELS[1])
        self.relu = nn.ReLU()
        self.resnext_maxpool1 = nn.MaxPool2d(kernel_size=3,  #(4)
                                             stride=2,
                                             padding=1)

        # conv2 stage  #(5)
        self.resnext_conv2 = nn.ModuleList([
            Block(in_channels=NUM_CHANNELS[1],
                  add_channel=True,      #(6)
                  channel_multiplier=4,
                  downsample=False)      #(7)
        ])
        for _ in range(NUM_BLOCKS[0]-1):  #(8)
            self.resnext_conv2.append(Block(in_channels=NUM_CHANNELS[2]))

        # conv3 stage  #(9)
        self.resnext_conv3 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[2],  #(10)
                                                  add_channel=True,
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[1]-1):  #(11)
            self.resnext_conv3.append(Block(in_channels=NUM_CHANNELS[3]))

        # conv4 stage  #(12)
        self.resnext_conv4 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[3],  #(13)
                                                  add_channel=True,
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[2]-1):  #(14)
            self.resnext_conv4.append(Block(in_channels=NUM_CHANNELS[4]))

        # conv5 stage  #(15)
        self.resnext_conv5 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[4],  #(16)
                                                  add_channel=True,
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[3]-1):  #(17)
            self.resnext_conv5.append(Block(in_channels=NUM_CHANNELS[5]))

        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))  #(18)
        self.fc = nn.Linear(in_features=NUM_CHANNELS[5],        #(19)
                            out_features=NUM_CLASSES)
As we discussed earlier, the conv2 stage (#(5)) is a bit special, since the first block inside this stage does increase the number of channels yet does not reduce the spatial dimension. This is the reason I set the add_channel parameter to True (#(6)) while the downsample parameter is set to False (#(7)). The initialization of the remaining blocks is the same as in the other stages, where we can simply do it with a plain loop (#(8)).
The conv1 stage (#(1)), on the other hand, does not utilize the Block class, since its structure is completely different from the other stages. According to Figure 3, this stage only comprises a single 7×7 convolution layer (#(2)), which allows us to capture a larger context from the input image. The tensor produced by this layer has half the spatial dimensions of the input thanks to the stride parameter being set to 2 (#(3)). Further downsampling is performed by a max-pooling layer with the same stride, which again reduces the spatial dimension by half (#(4)). Strictly speaking, this max-pooling layer should belong to the conv2 stage instead, but in this implementation I put it outside the nn.ModuleList of that stage for the sake of simplicity.
Lastly, we need to initialize a global average pooling layer (#(18)), which works by taking the average value of each channel of the tensor produced by the last convolution layer. By doing this, we end up with a single number representing each channel. The resulting tensor is then connected to the output layer, which produces NUM_CLASSES (1000) neurons (#(19)), each corresponding to one class in the dataset.
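If the global average pooling step feels abstract, the tiny sketch below (illustrative only, not part of the main codeblocks) shows that AdaptiveAvgPool2d((1,1)) simply reduces each 7×7 channel map to its mean, which flattening then turns into a 2048-dimensional vector:

# Extra example: what global average pooling actually computes (illustrative only)
import torch
import torch.nn as nn

feat = torch.randn(1, 2048, 7, 7)            # simulated output of the conv5 stage
pooled = nn.AdaptiveAvgPool2d((1, 1))(feat)  # shape: (1, 2048, 1, 1)
flat = torch.flatten(pooled, start_dim=1)    # shape: (1, 2048)

# Each element is just the mean of the corresponding 7x7 channel map
print(torch.allclose(flat, feat.mean(dim=(2, 3))))  # True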
Now look at Codeblock 7b below to see how I define the forward() method. There is not much to explain here, since all we do is pass the tensor from one layer to the next sequentially.
# Codeblock 7b
    def forward(self, x):
        print(f'original\t\t: {x.size()}')

        x = self.relu(self.resnext_bn1(self.resnext_conv1(x)))
        print(f'after resnext_conv1\t: {x.size()}')

        x = self.resnext_maxpool1(x)
        print(f'after resnext_maxpool1\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv2):
            x = block(x)
            print(f'after resnext_conv2 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv3):
            x = block(x)
            print(f'after resnext_conv3 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv4):
            x = block(x)
            print(f'after resnext_conv4 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv5):
            x = block(x)
            print(f'after resnext_conv5 #{i}\t: {x.size()}')

        x = self.avgpool(x)
        print(f'after avgpool\t\t: {x.size()}')

        x = torch.flatten(x, start_dim=1)
        print(f'after flatten\t\t: {x.size()}')

        x = self.fc(x)
        print(f'after fc\t\t: {x.size()}')

        return x
Next, let's test our ResNeXt class using the following code. Here I'm going to test it by passing a dummy tensor of size 3×224×224, which simulates a single RGB image of size 224×224.
# Codeblock 8
resnext = ResNeXt()
x = torch.randn(1, 3, 224, 224)
out = resnext(x)
# Codeblock 8 Output
original                : torch.Size([1, 3, 224, 224])
after resnext_conv1     : torch.Size([1, 64, 112, 112])    #(1)
after resnext_maxpool1  : torch.Size([1, 64, 56, 56])      #(2)
after resnext_conv2 #0  : torch.Size([1, 256, 56, 56])     #(3)
after resnext_conv2 #1  : torch.Size([1, 256, 56, 56])     #(4)
after resnext_conv2 #2  : torch.Size([1, 256, 56, 56])     #(5)
after resnext_conv3 #0  : torch.Size([1, 512, 28, 28])
after resnext_conv3 #1  : torch.Size([1, 512, 28, 28])
after resnext_conv3 #2  : torch.Size([1, 512, 28, 28])
after resnext_conv3 #3  : torch.Size([1, 512, 28, 28])
after resnext_conv4 #0  : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #1  : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #2  : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #3  : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #4  : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #5  : torch.Size([1, 1024, 14, 14])
after resnext_conv5 #0  : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #1  : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #2  : torch.Size([1, 2048, 7, 7])
after avgpool           : torch.Size([1, 2048, 1, 1])      #(6)
after flatten           : torch.Size([1, 2048])            #(7)
after fc                : torch.Size([1, 1000])            #(8)
We can see in the above output that our conv1 stage correctly reduces the spatial dimension from 224×224 to 112×112 while at the same time increasing the number of channels to 64 (#(1)). The downsampling is continued by the max-pooling layer, which brings the spatial dimension of the image down to 56×56 (#(2)). Moving on to the conv2 stage, we can see that the first block in the stage successfully converted the 64-channel image into a 256-channel one (#(3)), and the subsequent blocks in the same stage preserve the dimension of this tensor (#(4-5)). The same thing is done by the following stages until we reach the global average pooling layer (#(6)). It is important to note that we need to flatten the tensor (#(7)) to drop the empty axes before finally connecting it to the output layer (#(8)). And that concludes how a tensor flows through the ResNeXt architecture.
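With the forward pass verified, actually using the model for a prediction is just a matter of switching to eval mode and taking the argmax of the logits. Here is a minimal sketch, with a random tensor standing in for a preprocessed image (note that the debug prints inside forward() will still appear):

# Extra example: a bare-bones inference call (illustrative only)
resnext.eval()  # freezes batch-norm statistics for inference

with torch.no_grad():
    logits = resnext(torch.randn(1, 3, 224, 224))  # shape: (1, 1000)
    predicted_class = logits.argmax(dim=1)

print(predicted_class)  # index of the highest-scoring ImageNet class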
Additionally, you can use the summary() function that we imported earlier from torchinfo if you want to dig even deeper into the architectural details. You can see at the end of the output below that we get 25,028,904 parameters in total. In fact, this number of parameters matches exactly that of the ResNeXt-50 32x4d model from PyTorch, so I believe our implementation here is correct. You can verify this through the link at reference [4].
# Codeblock 9
resnext = ResNeXt()
summary(resnext, input_size=(1, 3, 224, 224))
# Codeblock 9 Output
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
ResNeXt [1000] --
├─Conv2d: 1-1 [1, 64, 112, 112] 9,408
├─BatchNorm2d: 1-2 [1, 64, 112, 112] 128
├─ReLU: 1-3 [1, 64, 112, 112] --
├─MaxPool2d: 1-4 [1, 64, 56, 56] --
├─ModuleList: 1-5 -- --
│ └─Block: 2-1 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-1 [1, 256, 56, 56] 16,384
│ │ └─BatchNorm2d: 3-2 [1, 256, 56, 56] 512
│ │ └─Conv2d: 3-3 [1, 128, 56, 56] 8,192
│ │ └─BatchNorm2d: 3-4 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-5 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-6 [1, 128, 56, 56] 4,608
│ │ └─BatchNorm2d: 3-7 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-8 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-9 [1, 256, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-10 [1, 256, 56, 56] 512
│ │ └─ReLU: 3-11 [1, 256, 56, 56] --
│ └─Block: 2-2 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-12 [1, 128, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-13 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-14 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-15 [1, 128, 56, 56] 4,608
│ │ └─BatchNorm2d: 3-16 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-17 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-18 [1, 256, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-19 [1, 256, 56, 56] 512
│ │ └─ReLU: 3-20 [1, 256, 56, 56] --
│ └─Block: 2-3 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-21 [1, 128, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-22 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-23 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-24 [1, 128, 56, 56] 4,608
│ │ └─BatchNorm2d: 3-25 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-26 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-27 [1, 256, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-28 [1, 256, 56, 56] 512
│ │ └─ReLU: 3-29 [1, 256, 56, 56] --
├─ModuleList: 1-6 -- --
│ └─Block: 2-4 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-30 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-31 [1, 512, 28, 28] 1,024
│ │ └─Conv2d: 3-32 [1, 256, 56, 56] 65,536
│ │ └─BatchNorm2d: 3-33 [1, 256, 56, 56] 512
│ │ └─ReLU: 3-34 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-35 [1, 256, 28, 28] 18,432
│ │ └─BatchNorm2d: 3-36 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-37 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-38 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-39 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-40 [1, 512, 28, 28] --
│ └─Block: 2-5 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-41 [1, 256, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-42 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-43 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-44 [1, 256, 28, 28] 18,432
│ │ └─BatchNorm2d: 3-45 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-46 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-47 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-48 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-49 [1, 512, 28, 28] --
│ └─Block: 2-6 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-50 [1, 256, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-51 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-52 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-53 [1, 256, 28, 28] 18,432
│ │ └─BatchNorm2d: 3-54 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-55 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-56 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-57 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-58 [1, 512, 28, 28] --
│ └─Block: 2-7 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-59 [1, 256, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-60 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-61 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-62 [1, 256, 28, 28] 18,432
│ │ └─BatchNorm2d: 3-63 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-64 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-65 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-66 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-67 [1, 512, 28, 28] --
├─ModuleList: 1-7 -- --
│ └─Block: 2-8 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-68 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-69 [1, 1024, 14, 14] 2,048
│ │ └─Conv2d: 3-70 [1, 512, 28, 28] 262,144
│ │ └─BatchNorm2d: 3-71 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-72 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-73 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-74 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-75 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-76 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-77 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-78 [1, 1024, 14, 14] --
│ └─Block: 2-9 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-79 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-80 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-81 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-82 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-83 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-84 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-85 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-86 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-87 [1, 1024, 14, 14] --
│ └─Block: 2-10 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-88 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-89 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-90 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-91 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-92 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-93 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-94 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-95 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-96 [1, 1024, 14, 14] --
│ └─Block: 2-11 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-97 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-98 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-99 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-100 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-101 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-102 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-103 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-104 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-105 [1, 1024, 14, 14] --
│ └─Block: 2-12 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-106 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-107 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-108 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-109 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-110 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-111 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-112 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-113 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-114 [1, 1024, 14, 14] --
│ └─Block: 2-13 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-115 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-116 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-117 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-118 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-119 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-120 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-121 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-122 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-123 [1, 1024, 14, 14] --
├─ModuleList: 1-8 -- --
│ └─Block: 2-14 [1, 2048, 7, 7] --
│ │ └─Conv2d: 3-124 [1, 2048, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-125 [1, 2048, 7, 7] 4,096
│ │ └─Conv2d: 3-126 [1, 1024, 14, 14] 1,048,576
│ │ └─BatchNorm2d: 3-127 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-128 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-129 [1, 1024, 7, 7] 294,912
│ │ └─BatchNorm2d: 3-130 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-131 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-132 [1, 2048, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-133 [1, 2048, 7, 7] 4,096
│ │ └─ReLU: 3-134 [1, 2048, 7, 7] --
│ └─Block: 2-15 [1, 2048, 7, 7] --
│ │ └─Conv2d: 3-135 [1, 1024, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-136 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-137 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-138 [1, 1024, 7, 7] 294,912
│ │ └─BatchNorm2d: 3-139 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-140 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-141 [1, 2048, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-142 [1, 2048, 7, 7] 4,096
│ │ └─ReLU: 3-143 [1, 2048, 7, 7] --
│ └─Block: 2-16 [1, 2048, 7, 7] --
│ │ └─Conv2d: 3-144 [1, 1024, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-145 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-146 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-147 [1, 1024, 7, 7] 294,912
│ │ └─BatchNorm2d: 3-148 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-149 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-150 [1, 2048, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-151 [1, 2048, 7, 7] 4,096
│ │ └─ReLU: 3-152 [1, 2048, 7, 7] --
├─AdaptiveAvgPool2d: 1-9 [1, 2048, 1, 1] --
├─Linear: 1-10 [1, 1000] 2,049,000
==========================================================================================
Total params: 25,028,904
Trainable params: 25,028,904
Non-trainable params: 0
Total mult-adds (Units.GIGABYTES): 6.28
==========================================================================================
Input size (MB): 0.60
Forward/backward pass size (MB): 230.42
Params size (MB): 100.12
Estimated Total Size (MB): 331.13
==========================================================================================
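If you prefer to verify the parameter count programmatically rather than through the docs, a quick cross-check like the one below (my own verification sketch, reusing torchvision's implementation) should print the same number for both models:

# Extra example: cross-checking the parameter count against torchvision (sketch)
from torchvision.models import resnext50_32x4d

ours = sum(p.numel() for p in ResNeXt().parameters())
reference = sum(p.numel() for p in resnext50_32x4d().parameters())
print(ours, reference)  # expected: 25028904 25028904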
Ending
I think that's everything about ResNeXt and its implementation. You can also find the entire code used in this article in my GitHub repo [5].
I hope you learned something new today. Thanks very much for reading, and see you in my next article!
References
[1] Saining Xie et al. Aggregated Residual Transformations for Deep Neural Networks. Arxiv. https://arxiv.org/abs/1611.05431 [Accessed March 1, 2025].
[2] ResNeXt. PyTorch. https://pytorch.org/vision/main/models/resnext.html [Accessed March 1, 2025].
[3] Kaiming He et al. Deep Residual Learning for Image Recognition. Arxiv. https://arxiv.org/abs/1512.03385 [Accessed March 1, 2025].
[4] resnext50_32x4d. PyTorch. https://pytorch.org/vision/main/models/generated/torchvision.models.resnext50_32x4d.html#torchvision.models.resnext50_32x4d [Accessed March 1, 2025].
[5] MuhammadArdiPutra. Taking ResNet to the NeXt Level - ResNeXt. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/Taking%20ResNet%20to%20the%20NeXt%20Level%20-%20ResNeXt.ipynb [Accessed April 7, 2025].