    The Channel-Wise Attention | Squeeze and Excitation



    When we talk about attention in computer vision, the first thing that probably comes to mind is the mechanism used in the Vision Transformer (ViT) architecture. In fact, that is not the only attention mechanism we have for image data. There is another one called the Squeeze-and-Excitation Network (SENet). While the attention in ViT operates spatially, i.e., assigning weights to different patches of an image, the attention mechanism proposed in SENet operates channel-wise, i.e., assigning weights to different channels. In this article, we are going to discuss how the Squeeze-and-Excitation architecture works, how to implement it from scratch, and how to integrate the module into the ResNeXt model.


    The Squeeze and Excitation Module

    SENet, which was first proposed in the paper titled “Squeeze-and-Excitation Networks” by Hu et al. [1], is not a standalone network like VGG, Inception, or ResNet. Instead, it is a building block to be placed on an existing network. In CNN-based models, we assume that pixels spatially close to each other are highly correlated, which is the reason we employ small kernels to capture these correlations. This assumption is essentially the inductive bias of CNNs. Meanwhile, SENet introduces a new inductive bias, where the authors assume that every image channel contributes differently to predicting a particular class. By applying SE modules to a CNN, the model not only relies on spatial patterns but also captures the importance of each channel. To better illustrate this, we can think of an image of fire, where the red channel would theoretically make a higher contribution to the final prediction than the blue and green channels.

    The structure of the SE module itself is shown in Figure 1. As the name of the network suggests, there are two main steps performed in this module: squeeze and excitation. The squeeze part corresponds to the operation denoted as F_sq, while the excitation part comprises both F_ex and F_scale. The F_tr operation, however, is not part of the SE module. Rather, it represents a transformation function that originally belongs to the model the SE module is applied to. For example, if we were to place the SE module on ResNet, the F_tr operation refers to the stack of convolution layers within the bottleneck block.

    Figure 1. The structure of the Squeeze-and-Excitation module [1].

    Speaking more specifically about the F_sq operation, it essentially works by employing a global average pooling mechanism, which is used to capture the information from the entire spatial extent of each channel. By doing so, every channel of the input tensor will be represented by a single number, which is simply the average value of the corresponding channel. The authors refer to this operation as global information embedding. Mathematically speaking, this can be formally written as the equation shown in Figure 2, where we sum all values across the height H and width W before eventually dividing by the number of pixels in that channel (H×W).

    Figure 2. The mathematical expression of the global average pooling mechanism in the SE module [1].
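
    Written out in LaTeX notation, the equation in Figure 2 takes the following form, where u_c denotes channel c of the input tensor and z_c is its squeezed descriptor:

    z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)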

    Meanwhile, both the excitation and scaling operations are referred to as adaptive recalibration, since what they essentially do is dynamically adjust the weighting of each channel in the input tensor according to its importance. In fact, the diagram in Figure 1 does not completely depict the entire SENet architecture. You can see in the figure that F_ex appears to be a single operation, yet it actually consists of two linear layers, each followed by an activation function. See Figure 3 below for the details.

    Figure 3. The mathematical formulation of the F_ex operation [1].

    The two linear layers are denoted as W_1 and W_2, while δ and σ represent the ReLU and sigmoid activation functions, respectively. So, based on this mathematical definition, what we need to do later in the implementation is pass the tensor z (the average-pooled tensor) through the first linear layer, followed by the ReLU activation function, the second linear layer, and finally the sigmoid activation function. Remember that the sigmoid function normalizes input values to lie within the range of 0 to 1. In this case, we will interpret the resulting output as the weight of each channel, where a value close to 1 indicates that the corresponding channel contains important information, hence we allow the model to pay more attention to that channel. Otherwise, if the resulting number is close to 0, the corresponding channel does not contribute that much to the output.
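
    For reference, the F_ex operation in Figure 3 can be written out in LaTeX notation as follows, where C is the number of channels and r is the reduction ratio discussed later:

    s = F_{ex}(z, W) = \sigma(W_2 \, \delta(W_1 z)), \quad W_1 \in \mathbb{R}^{(C/r) \times C}, \quad W_2 \in \mathbb{R}^{C \times (C/r)}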

    In order to make use of these channel weights, we can perform the F_scale operation, which is simply a multiplication of the original tensor u and the weight tensor s, as shown in Figure 4 below. By doing this, we essentially retain the values within the important channels while at the same time suppressing the values of the unimportant ones.

    Figure 4. The scaling process is just a multiplication of the original and the weight tensors [1].
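
    In equation form, the scaling simply applies each scalar weight s_c to its corresponding channel:

    \tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c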

    By the way, sorry for getting a bit too mathy here, lol. But I believe this will help you understand the code later in the implementation section.

    Where to Put the SE Module

    Applying the SE module to a plain CNN model like VGG is easy, as we can simply place it right after each convolution layer. However, it might not be as straightforward in the case of Inception or ResNet because of the presence of parallel branches in these two networks. To address this confusion, the authors provide a guide for implementing the SE module on these two models, as shown in Figure 5 below.

    Figure 5. Where the SE module is placed in Inception and ResNet [1].

    For the Inception model, instead of placing the SE module right after each convolution layer, we pass the input tensor through the entire Inception block (including all the branches inside) and then attach the SE module afterwards. The same approach also works for ResNet, but keep in mind that the summation between the skip-connection tensor and the main flow happens after the main tensor has been processed by the SE module, as sketched below.
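
    Here is a minimal sketch of that ordering for the ResNet case (my own pseudo-implementation, not code from the paper; f stands for the residual branch, i.e., the F_tr convolution stack, and se for the SE module):

    # A minimal sketch of the SE-ResNet ordering in Figure 5 (right).
    # f and se are placeholders for the residual branch and the SE module.
    def se_resnet_block_forward(x, f, se, relu):
        residual = x                 # the skip-connection branch is left untouched
        out = f(x)                   # main branch: the original conv stack (F_tr)
        out = se(out)                # recalibrate channels BEFORE the summation
        return relu(out + residual)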

    As I mentioned earlier, the excitation stage essentially consists of two linear layers. If we take a closer look at the structure above, we can see that the output shape of the first linear layer is 1×1×C/r. The variable r is called the reduction ratio, which reduces the dimensionality of the weight tensor before it is eventually projected back to 1×1×C by the second linear layer. The dimensionality reduction performed by the first layer acts as a bottleneck, which is useful for limiting model complexity and improving generalization. The authors conducted experiments on different r values, and they found that r = 16 produces the best balance between accuracy and complexity.
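
    To make the complexity argument concrete, here is a quick back-of-the-envelope check (my own illustration, not from the paper). With two bias-free linear layers of shapes (C/r, C) and (C, C/r), an SE module adds 2C²/r parameters per block:

    # My own illustration: parameter overhead of a single SE module,
    # assuming the bias-free linear layers implemented later in this article.
    def se_param_count(C, r):
        return C * (C // r) + (C // r) * C   # W_1 plus W_2, i.e., 2*C^2/r

    print(se_param_count(512, 16))   # 32768 extra parameters for C=512, r=16
    print(se_param_count(512, 4))    # 131072, so a smaller r is 4x more expensive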

    Figure 6. Several possible strategies for attaching the SE module in ResNet [1].

    In addition to implementing the SE module in ResNet, Figure 6 shows that there are actually several strategies we can follow to do so. According to the experimental results in Figure 7, the standard SE, SE-PRE, and SE-Identity blocks obtained similar results, while all of them outperformed SE-POST by a clear margin. This indicates that the placement of the SE module affects model performance in terms of accuracy. Based on these findings, the authors argue that we will obtain good results as long as we apply the SE module before the element-wise summation (see the sketch after the figure below). Later in the coding section, I am going to demonstrate how to implement the standard SE block.

    Figure 7. Experimental results on different SE module integration strategies [1].
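
    To make the differences easier to see, here is how I would summarize the four strategies in Figure 6 as forward-pass orderings (a sketch using my own placeholder names, with f as the residual branch and se as the SE module):

    # Sketches of the four integration strategies in Figure 6;
    # f and se are placeholders for the residual branch and the SE module.
    def se_standard(x, f, se, relu): return relu(se(f(x)) + x)   # SE before summation
    def se_pre(x, f, se, relu):      return relu(f(se(x)) + x)   # SE before f
    def se_identity(x, f, se, relu): return relu(f(x) + se(x))   # SE on the skip branch
    def se_post(x, f, se, relu):     return se(relu(f(x) + x))   # SE after summation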

    More Experimental Results

    There are actually many more experimental results discussed in the paper. One of them is a table showing the accuracy improvements obtained when the SE module is applied to existing CNN-based models. The table I am referring to is displayed in Figure 8 below.

    Figure 8. Experimental results on applying the SE module to different models [1][2].

    The columns highlighted in blue represent the error rates of each model, and the ones in red indicate the computational complexity measured in GFLOPs. The re-implementation column refers to the plain model that the authors implemented themselves, while the SENet column represents the same model equipped with SE modules. The table clearly shows that both top-1 and top-5 errors decrease when the SE module is applied. It is important to note that although adding the SE module causes the GFLOPs to increase, this increase is marginal compared to the reduction in error rate.

    Next, we can actually reveal interesting insights by printing out the values contained in the SE modules during the inference phase. Let's take a look at the charts in Figure 9 below to illustrate this. The x axis of these charts denotes the channel index, the y axis represents how much weight each channel has according to its importance, and the color of the lines indicates the class being predicted.

    Figure 9. What the activations of SE modules look like at different network depths [1].

    In shallower layers, the features captured by the SE module are class-agnostic, which basically means it captures generic information required to predict all classes. The charts labeled (a) and (b), which are the SE modules from ResNet stages 2 and 3, show that there is not much difference in channel activity from one class to another, indicating that these two modules do not capture information regarding a specific class. The case is different for the SE modules in deeper layers, i.e., the ones in stage 4 (c) and stage 5 (d). We can see that these two modules adjust channel weights differently depending on the class being predicted. This is essentially the reason the SE modules in deeper layers are said to be class-specific. However, the authors acknowledge that there can be unusual behavior in some of the SE modules, which happens in the 2nd block of stage 5 (e). Here the SE module does not show meaningful channel recalibration behavior, indicating that it does not contribute as much as the ones we discussed earlier.
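
    If you want to reproduce this kind of inspection yourself, one possible approach (my own sketch, not from the paper) is to register a forward hook on the sigmoid of every SE module and record its output, which holds one weight per channel:

    # My own sketch for recording SE channel weights during inference.
    # It assumes the SEModule class implemented later in this article,
    # where the sigmoid output contains the per-channel weights.
    se_weights = {}

    def register_se_hooks(model):
        def make_hook(name):
            def hook(module, inputs, output):
                se_weights[name] = output.detach().cpu()   # shape: [batch, C]
            return hook
        for name, module in model.named_modules():
            if isinstance(module, SEModule):
                module.sigmoid.register_forward_hook(make_hook(name))

    # Usage: register_se_hooks(seresnext), run one forward pass, then inspect se_weights.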

    The Detailed Architecture

    In this article we are going to implement the SE-ResNeXt-50 (32×4d) model, which corresponds to the rightmost column in Figure 10. The ResNeXt model itself is similar to ResNet, except that the groups parameter of the second convolution layer inside each block is set to 32. If you're familiar with ResNeXt, this is essentially the simplest yet effective way to implement the so-called cardinality. I recommend you read my previous article about ResNeXt if you are not yet familiar with it; the link is provided at reference number [3] at the end of this article.

    Taking a closer look at the architecture, what differentiates SE-ResNet-50 from ResNet-50 is just the presence of SE modules. The same also applies to SE-ResNeXt-50 (32×4d) compared to ResNeXt-50 (32×4d) (not displayed in the table). Notice in the figure below that the models with SE modules have an fc layer attached after the last convolution layer inside each block, where the corresponding two numbers indicate the first and second fully-connected layers inside the SE module.

    Figure 10. The complete architecture of ResNet-50, SE-ResNet-50, and SE-ResNeXt-50 (32×4d) [1].

    From Scratch Implementation

    Remember that here we are about to integrate the SE module into ResNeXt, so we need to implement both of them from scratch. Technically speaking, it is actually possible to take the ResNeXt architecture directly from PyTorch and manually attach the SE module to it. However, here I decided to use the ResNeXt implementation from my previous article instead, since I feel it is much easier to understand than the one from PyTorch. Note that I will focus on constructing the SE module and how to attach it to the ResNeXt model rather than explaining ResNeXt itself, since I have already covered that in the earlier article [3].

    Now, let's begin the implementation by importing the required modules.

    # Codeblock 1
    import torch
    import torch.nn as nn

    Squeeze and Excitation Module

    The following SE module implementation follows the diagram shown in Figure 5 (right). It is worth noting that the SEModule class below does not include the skip-connection (curved arrow), as the entire SE module is applied after the initial branching but before the merging (summation).

    The __init__() method of this class accepts two parameters: num_channels and r, as shown at line #(1) in Codeblock 2a. We definitely want this SE module to be usable throughout the entire network, so we need the num_channels parameter to be adjustable, because the number of output channels varies across ResNeXt blocks at different stages, as shown back in Figure 10. Meanwhile, even though we typically use the same reduction ratio r in the SE modules across the whole network, it is technically possible to use a different r for each stage, which might be an interesting thing to experiment with. This is essentially the reason I also made the r parameter adjustable.

    # Codeblock 2a
    class SEModule(nn.Module):
        def __init__(self, num_channels, r):                     #(1)
            super().__init__()
            
            self.global_pooling = nn.AdaptiveAvgPool2d(output_size=(1,1))  #(2)
            self.fc0 = nn.Linear(in_features=num_channels,       #(3)
                                 out_features=num_channels//r, 
                                 bias=False)
            self.relu = nn.ReLU()                                #(4)
            self.fc1 = nn.Linear(in_features=num_channels//r,    #(5)
                                 out_features=num_channels, 
                                 bias=False)
            self.sigmoid = nn.Sigmoid()                          #(6)

    There are five layers we need to initialize inside the __init__() method. I write them down according to the sequence given in Figure 5: the global average pooling layer (#(2)), a linear layer (#(3)), the ReLU activation function (#(4)), another linear layer (#(5)), and the sigmoid activation function (#(6)). Here you can see that the first linear layer is responsible for performing dimensionality reduction by shrinking the number of channels from num_channels to num_channels//r, which will then be expanded back to num_channels by the second linear layer. Note that we set the bias term of both linear layers to False, which essentially means we only utilize the weight tensors. The absence of bias terms in the two layers forces the SE module to learn the correlations between channels rather than just adding fixed adjustments.

    Still with the SEModule class, let's now move on to the forward() method to define the flow of the network. You can see at line #(1) in Codeblock 2b that we start from a single input x, which in the case of ResNeXt is essentially the tensor produced by the third convolution layer within the same ResNeXt block. As shown in Figure 5, what we need to do next is branch out the network. Here we directly process the branch using the global_pooling layer, and I name the resulting tensor squeezed (#(2)). The original input tensor x itself is left as is, since we are not going to perform any operation on it until the scaling phase. Next, we need to drop the spatial dimensions of the squeezed tensor using torch.flatten() (#(3)). This is done because we want to process it further with the linear layers at lines #(4) and #(5), which operate on a flat (batch, channels) tensor. The spatial dimensions are then introduced again at line #(6), allowing us to perform the multiplication between x (the original tensor) and excited (the channel weights) at line #(7). This whole process produces a recalibrated version of x, which we refer to as scaled. Here I print out the tensor size after each step so that you can better understand the flow of this SE module.

    # Codeblock 2b
        def forward(self, x):                                  #(1)
            print(f'original\t\t: {x.size()}')
            
            squeezed = self.global_pooling(x)                  #(2)
            print(f'after avgpool\t\t: {squeezed.size()}')
            
            squeezed = torch.flatten(squeezed, 1)              #(3)
            print(f'after flatten\t\t: {squeezed.size()}')
            
            excited = self.relu(self.fc0(squeezed))            #(4)
            print(f'after fc0-relu\t\t: {excited.size()}')
            
            excited = self.sigmoid(self.fc1(excited))          #(5)
            print(f'after fc1-sigmoid\t: {excited.size()}')
            
            excited = excited[:, :, None, None]                #(6)
            print(f'after reshape\t\t: {excited.size()}')
            
            scaled = x * excited                               #(7)
            print(f'after scaling\t\t: {scaled.size()}')
            
            return scaled

    Now we are going to check whether we have implemented the network correctly by passing a dummy tensor through it. In Codeblock 3 below, I initialize an SE module configured to accept an image tensor with 512 channels and a reduction ratio of 16 (#(1)). If you take a look at the SE-ResNeXt architecture in Figure 10, this SE module corresponds to the one in the third stage (where the output size is 28×28). Thus, at line #(2) we need to adjust the shape of the dummy tensor accordingly. We then feed this tensor into the network using the code at line #(3).

    # Codeblock 3
    semodule = SEModule(num_channels=512, r=16)    #(1)
    x = torch.randn(1, 512, 28, 28)                #(2)
    
    out = semodule(x)      #(3)

    And below is what the print functions give us.

    # Codeblock 3 Output
    original          : torch.Size([1, 512, 28, 28])    #(1)
    after avgpool     : torch.Size([1, 512, 1, 1])      #(2)
    after flatten     : torch.Size([1, 512])            #(3)
    after fc0-relu    : torch.Size([1, 32])             #(4)
    after fc1-sigmoid : torch.Size([1, 512])            #(5)
    after reshape     : torch.Size([1, 512, 1, 1])      #(6)
    after scaling     : torch.Size([1, 512, 28, 28])    #(7)

    You can see that the original tensor shape matches exactly with our dummy tensor, i.e., 1×512×28×28 (#(1)). By the way, we can ignore the 1 in the 0th axis since it simply denotes the batch size, which in this case I assume to be a single image. After being pooled, the spatial dimensions collapse to 1×1, since every channel is now represented by a single number (#(2)). The purpose of the flatten operation I explained earlier is to drop the two singleton axes (#(3)), since the subsequent linear layers expect a flat (batch, channels) tensor. Here you can see that the first linear layer reduces the tensor dimension to 32 thanks to the reduction ratio, which we previously set to 16 (#(4)). The dimension of this tensor is then expanded back to 512 by the second linear layer (#(5)). Next, we unsqueeze the tensor so that we get our 1×1 spatial dimensions back (#(6)), allowing us to multiply it with the input tensor (#(7)). Based on this detailed flow, you can see that an SE module preserves the original tensor dimensions, proving that this module can be attached to any CNN-based model without disrupting the original flow of the network.

    ResNeXt

    Now that we understand how to implement the SE module from scratch, I am going to show you how to attach it to a ResNeXt model. Before doing so, we need to initialize the parameters required to implement the ResNeXt architecture. In Codeblock 4 below, the first four variables are determined according to the ResNeXt-50 (32×4d) variant, while the last one (R) represents the reduction ratio for the SE modules.

    # Codeblock 4
    CARDINALITY  = 32
    NUM_CHANNELS = [3, 64, 256, 512, 1024, 2048]
    NUM_BLOCKS   = [3, 4, 6, 3]
    NUM_CLASSES  = 1000
    R = 16

    The Block class defined in Codeblocks 5a and 5b is the ResNeXt block from my previous article. There are quite a lot of things we do inside the __init__() method, but the general idea is that we initialize three convolution layers called conv0 (#(1)), conv1 (#(2)), and conv2 (#(3)) before initializing the SE module at line #(4). We will later configure these layers according to the SE-ResNeXt architecture shown back in Figure 10.

    # Codeblock 5a
    class Block(nn.Module):
        def __init__(self, 
                     in_channels,
                     add_channel=False,
                     channel_multiplier=2,
                     downsample=False):
            super().__init__()
    
            self.add_channel = add_channel
            self.channel_multiplier = channel_multiplier
            self.downsample = downsample
            
            
            if self.add_channel:
                out_channels = in_channels*self.channel_multiplier
            else:
                out_channels = in_channels
            
            mid_channels = out_channels//2
            
            
            if self.downsample:
                stride = 2
            else:
                stride = 1
                
    
            if self.add_channel or self.downsample:
                self.projection = nn.Conv2d(in_channels=in_channels,
                                            out_channels=out_channels, 
                                            kernel_size=1, 
                                            stride=stride, 
                                            padding=0, 
                                            bias=False)
                nn.init.kaiming_normal_(self.projection.weight, nonlinearity='relu')
                self.bn_proj = nn.BatchNorm2d(num_features=out_channels)
    
            self.conv0 = nn.Conv2d(in_channels=in_channels,       #(1)
                                   out_channels=mid_channels,
                                   kernel_size=1, 
                                   stride=1, 
                                   padding=0, 
                                   bias=False)
            nn.init.kaiming_normal_(self.conv0.weight, nonlinearity='relu')
            self.bn0 = nn.BatchNorm2d(num_features=mid_channels)
    
            self.conv1 = nn.Conv2d(in_channels=mid_channels,      #(2)
                                   out_channels=mid_channels, 
                                   kernel_size=3, 
                                   stride=stride,
                                   padding=1, 
                                   bias=False, 
                               groups=CARDINALITY)
            nn.init.kaiming_normal_(self.conv1.weight, nonlinearity='relu')
            self.bn1 = nn.BatchNorm2d(num_features=mid_channels)
    
            self.conv2 = nn.Conv2d(in_channels=mid_channels,      #(3)
                                   out_channels=out_channels,
                                   kernel_size=1, 
                                   stride=1, 
                                   padding=0, 
                                   bias=False)
            nn.init.kaiming_normal_(self.conv2.weight, nonlinearity='relu')
            self.bn2 = nn.BatchNorm2d(num_features=out_channels)
            
            self.relu = nn.ReLU()
            
            self.semodule = SEModule(num_channels=out_channels, r=R)    #(4)

    The forward() method itself is mostly the same as in the original ResNeXt model, except that here we need to put the SE module right before the element-wise summation, as shown at line #(1) in Codeblock 5b below. Remember that this implementation follows the standard SE block architecture in Figure 6 (b).

    # Codeblock 5b
        def forward(self, x):
            print(f'original\t\t: {x.size()}')
            
            if self.add_channel or self.downsample:
                residual = self.bn_proj(self.projection(x))
                print(f'after projection\t: {residual.size()}')
            else:
                residual = x
                print(f'no projection\t\t: {residual.size()}')
            
            x = self.conv0(x)
            x = self.bn0(x)
            x = self.relu(x)
            print(f'after conv0-bn0-relu\t: {x.size()}')
    
            x = self.conv1(x)
            x = self.bn1(x)
            x = self.relu(x)
            print(f'after conv1-bn1-relu\t: {x.size()}')
            
            x = self.conv2(x)
            x = self.bn2(x)
            print(f'after conv2-bn2\t\t: {x.size()}')
            
            x = self.semodule(x)      #(1)
            print(f'after semodule\t\t: {x.size()}')
            
            x = x + residual
            x = self.relu(x)
            print(f'after summation\t\t: {x.size()}')
            
            return x

    With the above implementation, every time we instantiate a Block object we will have a ResNeXt block that is already equipped with an SE module. Now we are going to test the above class to see whether we have implemented it correctly. Here I am going to simulate a ResNeXt block within the third stage. The add_channel and downsample parameters are set to False, since we want to preserve both the number of channels and the spatial dimensions of the input tensor.

    # Codeblock 6
    block = Block(in_channels=512, add_channel=False, downsample=False)
    x = torch.randn(1, 512, 28, 28)
    
    out = block(x)

    Below is what the output looks like. Here you can see that our first convolution layer successfully reduced the number of channels from 512 to 256 (#(1)), which is then expanded back to its original size by the third convolution layer (#(2)). Afterwards, the tensor goes through the SE block, whose output size is the same as its input, just like what we saw earlier in Codeblock 3 (#(3)). Once the processing with the SE module is done, we can finally perform the element-wise summation between the tensor from the main branch and the one from the skip-connection (#(4)).

    # Codeblock 6 Output
    original             : torch.Size([1, 512, 28, 28])
    no projection        : torch.Size([1, 512, 28, 28])
    after conv0-bn0-relu : torch.Size([1, 256, 28, 28])    #(1)
    after conv1-bn1-relu : torch.Size([1, 256, 28, 28])
    after conv2-bn2      : torch.Size([1, 512, 28, 28])    #(2)
    after semodule       : torch.Size([1, 512, 28, 28])    #(3)
    after summation      : torch.Size([1, 512, 28, 28])    #(4)

    And below is how I implement the entire architecture. What we essentially need to do is just stack multiple SE-ResNeXt blocks according to the architecture in Figure 10. In fact, the SEResNeXt class in Codeblock 7 is exactly the same as the ResNeXt class in my previous article [3] (I literally copy-pasted it), since what makes SE-ResNeXt different from the original ResNeXt is only the presence of the SE module within the Block class we discussed earlier.

    # Codeblock 7
    class SEResNeXt(nn.Module):
        def __init__(self):
            super().__init__()
    
            # conv1 stage
            self.resnext_conv1 = nn.Conv2d(in_channels=NUM_CHANNELS[0],
                                           out_channels=NUM_CHANNELS[1],
                                           kernel_size=7,
                                           stride=2,
                                           padding=3, 
                                           bias=False)
            nn.init.kaiming_normal_(self.resnext_conv1.weight, 
                                    nonlinearity='relu')
            self.resnext_bn1 = nn.BatchNorm2d(num_features=NUM_CHANNELS[1])
            self.relu = nn.ReLU()
            self.resnext_maxpool1 = nn.MaxPool2d(kernel_size=3,
                                                 stride=2, 
                                                 padding=1)
    
            # conv2 stage
            self.resnext_conv2 = nn.ModuleList([
                Block(in_channels=NUM_CHANNELS[1],
                      add_channel=True,
                      channel_multiplier=4,
                      downsample=False)
            ])
        for _ in range(NUM_BLOCKS[0]-1):
                self.resnext_conv2.append(Block(in_channels=NUM_CHANNELS[2]))
    
            # conv3 stage
            self.resnext_conv3 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[2],
                                                      add_channel=True, 
                                                      downsample=True)])
        for _ in range(NUM_BLOCKS[1]-1):
                self.resnext_conv3.append(Block(in_channels=NUM_CHANNELS[3]))
                
                
            # conv4 stage
            self.resnext_conv4 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[3],
                                                      add_channel=True, 
                                                      downsample=True)])
            
        for _ in range(NUM_BLOCKS[2]-1):
                self.resnext_conv4.append(Block(in_channels=NUM_CHANNELS[4]))
                
                
            # conv5 stage
            self.resnext_conv5 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[4],
                                                      add_channel=True, 
                                                      downsample=True)])
            
        for _ in range(NUM_BLOCKS[3]-1):
                self.resnext_conv5.append(Block(in_channels=NUM_CHANNELS[5]))
     
           
            self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))
    
            self.fc = nn.Linear(in_features=NUM_CHANNELS[5],
                                out_features=NUM_CLASSES)
            
    
        def forward(self, x):
            print(f'original\t\t: {x.size()}')
            
            x = self.relu(self.resnext_bn1(self.resnext_conv1(x)))
            print(f'after resnext_conv1\t: {x.size()}')
            
            x = self.resnext_maxpool1(x)
            print(f'after resnext_maxpool1\t: {x.size()}')
            
            for i, block in enumerate(self.resnext_conv2):
                x = block(x)
                print(f'after resnext_conv2 #{i}\t: {x.size()}')
                
            for i, block in enumerate(self.resnext_conv3):
                x = block(x)
                print(f'after resnext_conv3 #{i}\t: {x.size()}')
                
            for i, block in enumerate(self.resnext_conv4):
                x = block(x)
                print(f'after resnext_conv4 #{i}\t: {x.size()}')
                
            for i, block in enumerate(self.resnext_conv5):
                x = block(x)
                print(f'after resnext_conv5 #{i}\t: {x.size()}')
            
            x = self.avgpool(x)
            print(f'after avgpool\t\t: {x.size()}')
            
            x = torch.flatten(x, start_dim=1)
            print(f'after flatten\t\t: {x.size()}')
            
            x = self.fc(x)
            print(f'after fc\t\t: {x.size()}')
            
            return x

    Now that the entire SE-ResNeXt-50 (32×4d) architecture is complete, we are going to test it by passing a tensor of size 1×3×224×224 through the network, simulating a single RGB image of size 224×224. You can see in the output of Codeblock 8 below that the model seems to work properly, since the tensor successfully passed through all the layers within the seresnext model without returning any error. Thus, I believe this model is now ready to be trained. By the way, don't forget to change the number of neurons in the output layer according to the number of classes in your dataset if you want to actually train this model (see the note right after the output below).

    # Codeblock 8
    seresnext = SEResNeXt()
    x = torch.randn(1, 3, 224, 224)
    
    out = seresnext(x)
    # Codeblock 8 Output
    original               : torch.Size([1, 3, 224, 224])
    after resnext_conv1    : torch.Size([1, 64, 112, 112])
    after resnext_maxpool1 : torch.Size([1, 64, 56, 56])
    after resnext_conv2 #0 : torch.Size([1, 256, 56, 56])
    after resnext_conv2 #1 : torch.Size([1, 256, 56, 56])
    after resnext_conv2 #2 : torch.Size([1, 256, 56, 56])
    after resnext_conv3 #0 : torch.Size([1, 512, 28, 28])
    after resnext_conv3 #1 : torch.Size([1, 512, 28, 28])
    after resnext_conv3 #2 : torch.Size([1, 512, 28, 28])
    after resnext_conv3 #3 : torch.Size([1, 512, 28, 28])
    after resnext_conv4 #0 : torch.Size([1, 1024, 14, 14])
    after resnext_conv4 #1 : torch.Size([1, 1024, 14, 14])
    after resnext_conv4 #2 : torch.Size([1, 1024, 14, 14])
    after resnext_conv4 #3 : torch.Size([1, 1024, 14, 14])
    after resnext_conv4 #4 : torch.Size([1, 1024, 14, 14])
    after resnext_conv4 #5 : torch.Size([1, 1024, 14, 14])
    after resnext_conv5 #0 : torch.Size([1, 2048, 7, 7])
    after resnext_conv5 #1 : torch.Size([1, 2048, 7, 7])
    after resnext_conv5 #2 : torch.Size([1, 2048, 7, 7])
    after avgpool          : torch.Size([1, 2048, 1, 1])
    after flatten          : torch.Size([1, 2048])
    after fc               : torch.Size([1, 1000])
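
    As noted above, if you plan to train this model on your own dataset, a minimal way to swap the classifier head looks like the following (a sketch; the value 10 is just a placeholder for your actual number of classes):

    # Hypothetical example: replace the 1000-way head with a 10-class one.
    seresnext.fc = nn.Linear(in_features=NUM_CHANNELS[5], out_features=10)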

    Additionally, we can also print out the number of parameters this model has using the following code. Here you can see that the codeblock returns 27,543,848. This number of parameters is slightly higher than that of the original ResNeXt counterpart, which only has 25,028,904 parameters, as mentioned in my previous article as well as the official PyTorch documentation [4]. Such an increase in model size definitely makes sense, since the ResNeXt blocks throughout the entire network now have extra layers thanks to the presence of the SE modules.

    # Codeblock 9
    def count_parameters(model):
        return sum([params.numel() for params in model.parameters()])
    
    count_parameters(seresnext)
    # Codeblock 9 Output
    27543848

    Ending

    And that's pretty much everything about the Squeeze-and-Excitation module. I encourage you to explore from here by training this model on your own dataset so that you can see whether the findings presented in the paper also apply to your case. Not only that, I think it would also be interesting if you tried to implement the SE module on other neural network architectures like VGG or Inception by yourself. A minimal training-loop sketch is included below to get you started.
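
    For reference, here is a minimal training-loop sketch under my own assumptions (a hypothetical train_loader yielding (images, labels) batches, cross-entropy loss, and the Adam optimizer; none of these choices come from the paper):

    # A minimal training-loop sketch; train_loader is a hypothetical DataLoader.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = SEResNeXt().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    
    model.train()
    for epoch in range(10):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()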

    I hope you learned something new today. Thanks for reading!

    By the way, you can also find the code used in this article in my GitHub repo [5].


    [1] Jie Hu et al. Squeeze-and-Excitation Networks. arXiv. https://arxiv.org/abs/1709.01507 [Accessed March 17, 2025].

    [2] Image originally created by the author.

    [3] Taking ResNet to the Next Level. Towards Data Science. https://towardsdatascience.com/taking-resnet-to-the-next-level/ [Accessed July 22, 2025].

    [4] Resnext50_32x4d. PyTorch. https://pytorch.org/vision/main/models/generated/torchvision.models.resnext50_32x4d.html#torchvision.models.resnext50_32x4d [Accessed March 17, 2025].

    [5] MuhammadArdiPutra. The Channel-Wise Attention - Squeeze and Excitation. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/The%20Channel-Wise%20Attention%20-%20Squeeze%20and%20Excitation.ipynb [Accessed April 7, 2025].


