10 Mins Read

How Backbone Works



One of the main steps in computer vision is the extraction of features. In the past, feature extraction was performed using conventional algorithms or filters applied to input data for further processing. With the introduction of deep learning techniques, neural networks have become more powerful and can process a larger amount of data with less effort. The development of convolution neural networks (CNN) makes it possible to work on large-scale data they are also great for feature extraction. Selecting CNNs for feature extraction or other parts of a deep learning model is not a random process. The complexity of the target task, as well as the model’s implementation, will greatly affect the outcome. Some proposed networks have become the default option in a variety of data analysis domains. A deep learning model’s backbone is responsible for feature extraction using these networks. In this section, we’ll have an overview of the most prominent backbone networks, implemented on the AIEX platform for deep learning.


One of the most famous backbones in computer vision and computer science is the VGG family, which includes VGG-16 and VGG-19. VGGs have proven to be effective in many tasks, including image classification, object detection, and segmentation. In addition, they are widely used as the backbone (for feature extraction) in many other models, such as R-CNN and Faster R-CNN. VGG-16, one of the fundamental deep learning backbones developed in 2014, consists of 16 layers including 13 convolutional layers, 5 max-pooling layers, and 3 fully connected layers, in addition to ReLU for activation. A VGG architecture has eight more layers than an AlexNet architecture, and contains 138 million parameters. VGG-19 is a deeper version of the VGG-16, it has 144 million parameters and contains 3 more layers with 16 convolutional layers, 5 max-pooling layers, and 3 fully connected layers.

VGG16 and VGG19 architecture
Figure1. VGG16 and VGG19 architecture. Source


Residual Neural Network (ResNet) is a CNN-based neural network that contain some skip-connections or recurrent units between blocks of convolutional and pooling layers. Also, the block is followed by batch normalization. Similar to the VGG family, ResNet has many versions including ResNet34 and ResNet-50 with 26M parameters, ResNet-101 with 44M, and ResNet-152 which is even deeper with 152 layers. The ResNet-50 and ResNet-101 are widely used for object detection and segmentation. The ResNet architecture is also used as the backbone for other deep learning architectures such as the Faster RCNN and the R-FCN.

ResNet architecture
Figure2. ResNet architecture. Source

Inception-v1 (GoogleNet)

One of the most widely used convolutional neural networks is Inception-V1 or GoogleNet, which is developed based on inception blocks. There are convolution layers in each block, and the filters used range from 1×1, 3×3, to 5×5, allowing for multiscale learning. Various filter sizes make the variation of dimension between blocks possible. GoogleNet, AlexNet, and VGG use global average pooling instead of Max-pooling. Figure 3 illustrates the architecture of Inception-v1.

Inception-V1 or GoogleNet architecture
Figure3. Inception-V1 or GoogleNet architecture. Source


Darknet-19 architecture,developed based on existing notions like inception and batch normalization is ideal for developing small and efficient networks. The DarkNet network is made up of convolutional-max-pooling layers, while the DarkNet-19 consists of 19 convolutional layers. The number of parameters is reduced by using a set of 1 × 1 convolutional kernels, while 3 × 3 convolutional kernels are not used much like in VGG of ResNet. DarkNet-19 is used in many object detection algorithms, including YOLO.

An overview of Darknet 19-YOLOv3
Figure4. An overview of Darknet 19-YOLOv3. Darknet-19 replaces the backbone network, in the right hand of the picture pyramid networks continue to predict on three scales. Source


The MobileNet model has been developed  as a deep learning model for low-performance machines like mobile phones. MobileNet’s architecture uses depth-wise separable convolutions and is considered a lightweight model. Two global hyper-parameters are introduced to help developers select the best model size for their problem. ImageNet is used to train MobileNet for image classification. An alternative version of MobileNet, MobileNet-v2, has been developed for object detection. A new residual block was introduced in this model, allowing a direct shortcut connection between bottleneck layers, as illustrated in Figure 5. Depth-wise separable convolutions are used in this version to filter features.

bottleneck layers in MobileNet
Figure5. shortcut connection directly between bottleneck layers in MobileNet. Source
The architecture of MobileNet V1
Figure6. The architecture of MobileNet V1. Source


SqueezeNet was developed to  reduce the number of parameters, it consists of a 1 × 1 filter in almost all layers instead of a 3 × 3 filter since with 1×1 filters the number of parameters is 9times lower. Also, the input channels are decreased to 3 × 3 filters, and delayed down-sampling of large activation maps leads to higher classification accuracy. SqueezeNet layers are a set of consecutive fire modules. Each Fire module consists of a squeeze convolution layer (with only 1×1 filters), feeding into an expand layer with a mix of 1×1 and 3×3 filters. SqueezeNet’s architecture employs the same method as ResNet for connecting layers or fire modules. According to SqueezeNet’s developers, it contains 1.24M parameters, which makes it a small backbone in comparison to other networks.

The architecture of SqueezeNet
Figure7. The architecture of SqueezeNet. Source
Fire Module in SqueezeNet
Figure8. Fire Module in SqueezeNet. Source


EfficientNet Networks, a new family of architectures, have significantly outperformed other networks in classification tasks with fewer parameters and more floating point operations per second (FLOPs). Compound scaling was implemented in order to uniformly scale the network’s width, depth, and resolution. Compared to the best existing networks, EfficientNet parameters are 8.4x smaller and 6.1x faster on inference. There are several versions of EfficientNet, starting with B0 and ending with B7. Depending on the availability of resources and the computational cost, any of the EfficientNet models can be used.There are 5.3 million parameters in EfficientNet-B0, while EfficientNet-B7 contains66 million parameters.

The EffecientNet-B0 general architecture
Figure9. The EffecientNet-B0 general architecture. Source


ShuffleNet is a computation-efficient CNN architecture developed for mobile devices with limited computational power with the goal of reducing computation costs while conserving accuracy. ShuffleNet consists of point-wise group convolution and channel shuffle notions. The group of depth-wise separable convolutions has been introduced in AlexNet for distributing the model over two GPUs, and also used in ResNeXt to demonstrate its robustness.  ShuffleNet uses point-wise group convolution to reduce the computation complexity of 1× 1 convolutions.  group of convolution layers can affect the accuracy of the network, for that, the authors of ShuffleNet used channel shuffle operation to share the information across feature channels.ShuffleNet also comes in two versions and has been widely used  for task otehr than image classification and object detection.

ShuffleNet Units
Figure10. ShuffleNet Units. a) bottleneck unit with depthwise convolution (DWConv); b) ShuffleNet unit with pointwise group convolution (GConv) and channel shuffle; c) ShuffleNet unit with stride = 2. Source


Inception-ResNet was introduced by combining the architectures of ResNet and inception. Inception-ResNet uses residual connections (skip connections between blocks of layers), and is composed of 164 layers of 4 max-pooling and 160 convolutional layers and about 55 million parameters. The Inception-ResNet network has achieved  higher accuracy and better performance at shorter epochs. InceptionResNet is used by many other architectures, such as Faster R-CNN and Mask R-CNN.

Figure11. The Inception ResNetV2 general architecture. Source

How the backbone works in The AIEX Platform

Each algorithm on the AIEX platform uses a different backbones. We start by preparing the data on the platform, the second step is training the model. we choose our desired framework from Pytorch, Tensorflow, Nvidia TAO, and TFLite, and select the desired algorithm in that framework as shown in Figure 13. Select the backbone algorithm in the backbone section. Table 1 shows all the backbones networks available on the AIEX platform for classification, segmentation, and object detection.

Training step
Figure12. Training step


aiex backbone
Figure13. The backbone in AIEX platform


Table1. The backbone for all algorithms and frameworks used by AIEX 


Related articles
To train a model or use transfer learning in machine vision, there must be enough data. Data Augmentation is...
Train, Test, and Validation Datasets
An artificial intelligence model output is affected by how we divide the input dataset. There are several factors to...
Data-Driven approach
An AI model’s performance can be increased by either improving the dataset or the model’s structure. The purpose of...
In this article, we will introduce Tensorboard and explain how it can be used on AIEX....
The majority of machine learning algorithms work by minimizing or maximizing an 'objective function'. Loss Functions are a group...
evaluation metrics
This article examines the different metrics used to evaluate machine vision models, and the metrics implemented on the AIEX...
Subscribe to our newsletter and get the latest practical content.

You can enter your email address and subscribe to our newsletter and get the latest practical content. You can enter your email address and subscribe to our newsletter.