Convolutional Neural Network (CNN): Algorithm, Architecture, Layers, Working

Table of Contents

  • Introduction
  • What Is Convolutional Neural Network?
  • Convolutional Neural Networks Architecture & Layers
  • How Do Convolutional Layers in CNN Work?
  • What is a Pooling Layer?
  • Advantages of Convolutional Neural Networks
  • Disadvantages of Convolutional Neural Networks


Artificial intelligence (AI) has witnessed significant growth over the years and will only evolve in the future. This technology is bridging the gap between human and machine capabilities. AI enthusiasts and researchers are constantly working on different aspects of AI to explore new possibilities. One such area is computer vision. 

AI enables machines to perceive the world as humans do and to use what they learn for tasks such as image analysis and classification, image and video recognition, recommendation systems, media recreation, and natural language processing. Computer vision, in both machine learning and deep learning, has advanced considerably, driven largely by one algorithm: the convolutional neural network (CNN).

In this blog, we’ll discuss the CNN algorithm, its architecture, and how it works in detail.

What Is Convolutional Neural Network?

A convolutional neural network (CNN) is a type of deep neural network architecture used to analyze visual imagery. It is used in computer vision, an important area of artificial intelligence that allows machines and computers to understand and interpret visual or image data. In machine learning, artificial neural networks are known for their excellent performance on different kinds of data, such as text, images, and audio. We use different types of neural networks for different tasks. For example, recurrent neural networks (RNNs) are often used to predict sequences of words, and CNNs are often used to classify images.

In mathematics, convolution is an operation on two functions that produces a third function describing how the shape of one is modified by the other. We won’t go deep into the mathematics to understand CNNs. In practice, a convolutional neural network reduces an image to a more compact form that is easier to process while preserving the features that matter for good predictions.

Convolutional Neural Networks Architecture & Layers

The architecture of a Convolutional Neural Network (CNN) is specifically designed for tasks involving images and spatial data. Here's an overview of the key components and layers that make up a typical CNN architecture:

  • Input Layer:

The input layer receives the raw data, which is usually an image in the form of a grid of pixel values. The dimensions of the input layer match the dimensions of the input image.

  • Convolutional Layers:

Convolutional layers are the heart of CNNs. They consist of multiple filters (also called kernels) that slide or convolve across the input image to detect features like edges, corners, and textures.

Each filter produces a feature map, and multiple filters are used to capture different features at various scales.

Convolutional layers learn these feature representations through training.

  • Activation Function (ReLU) Layer:

After each convolution operation, a Rectified Linear Unit (ReLU) activation function is applied element-wise to introduce non-linearity to the network.

ReLU helps the network learn complex patterns and accelerates convergence.
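As a minimal sketch of the element-wise step described above, the snippet below applies ReLU to a small feature map in plain Python. The function name relu and the sample values are illustrative assumptions, not taken from any particular library.

```python
def relu(feature_map):
    """Apply ReLU, max(0, x), to every value of a 2D feature map."""
    return [[max(0.0, value) for value in row] for row in feature_map]


# Hypothetical 2x3 feature map produced by a convolution.
feature_map = [[-1.5, 0.0, 2.0],
               [3.0, -0.2, 0.5]]

activated = relu(feature_map)
print(activated)  # negative entries become 0.0, positive entries pass through
```

Because ReLU only zeroes out negative values, it is cheap to compute, which is one reason it is a common default.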

  • Pooling (Subsampling) Layers:

Pooling layers reduce the spatial dimensions of the feature maps while retaining essential information. Common pooling techniques include max-pooling and average-pooling.

Pooling helps reduce the computational load, makes the network more robust to variations in input, and helps control overfitting.

  • Fully Connected (Dense) Layers:

Fully connected layers are traditional neural network layers where each neuron is connected to every neuron in the previous layer.

These layers perform high-level feature extraction and classification. They learn to combine low-level features detected by convolutional layers.

The final fully connected layer typically produces the network's output, often with softmax activation for classification tasks.

  • Flattening Layer:

Before connecting to the fully connected layers, the feature maps are flattened into a one-dimensional vector. This is done to match the input shape of the fully connected layers.

  • Dropout Layer (Optional):

Dropout is a regularization technique applied to fully connected layers to prevent overfitting. It randomly drops a fraction of neurons during each training iteration.
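The idea can be sketched in a few lines of plain Python. This version assumes the commonly used "inverted dropout" variant, where the surviving activations are scaled up during training so that nothing needs to change at inference time. The function name dropout and its parameters are hypothetical.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`
    and scale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale
            for a in activations]


rng = random.Random(0)  # seeded only to make the sketch repeatable
activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 or twice its original value
```

Dropout is applied only during training. At prediction time the activations are used unchanged.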

  • Output Layer:

The output layer produces the network's final predictions. The number of neurons in this layer depends on the specific task (e.g., binary classification, multi-class classification, regression).

The activation function in the output layer depends on the task, such as softmax for classification or linear for regression.
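For the classification case, here is a small sketch of softmax in plain Python. The input scores (often called logits) are made-up values, and the function name softmax is our own choice for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs)  # the largest score gets the largest probability
```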

  • Loss Function:

The choice of loss function depends on the task. For classification, common loss functions include cross-entropy, while mean squared error is used for regression.
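Both losses are short enough to sketch directly. The cross-entropy below is written for a single example whose correct class is given by an index, which is an assumption made to keep the example small. The function names are illustrative.

```python
import math


def cross_entropy(probs, true_index):
    """Cross-entropy for one example: -log(probability of the true class)."""
    return -math.log(probs[true_index])


def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# A confident correct prediction gives a small loss, a wrong one a large loss.
print(cross_entropy([0.9, 0.05, 0.05], true_index=0))
print(cross_entropy([0.9, 0.05, 0.05], true_index=1))
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```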

  • Optimization Algorithm:

CNNs use optimization algorithms like stochastic gradient descent (SGD), Adam, or RMSprop to minimize the loss function and adjust the network's weights during training.

  • Backpropagation:

Backpropagation is used to calculate gradients and update the weights of the network layers during training, allowing the network to learn from the training data.
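To make the last two items concrete, here is a deliberately tiny sketch: a model with one weight and one bias, trained on a single made-up example with squared-error loss. It is not a CNN, but the same forward pass, backward pass, and gradient descent update are what a CNN repeats across all of its layers. All names and numbers are assumptions for illustration.

```python
def train_step(w, b, x, target, learning_rate):
    """One gradient descent step for y = w * x + b with squared-error loss."""
    prediction = w * x + b          # forward pass
    error = prediction - target
    loss = error ** 2
    # Backward pass (chain rule): dL/dw = 2 * error * x, dL/db = 2 * error
    grad_w = 2.0 * error * x
    grad_b = 2.0 * error
    # Move each parameter a small step against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    return w, b, loss


w, b = 0.0, 0.0
losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x=2.0, target=4.0, learning_rate=0.05)
    losses.append(loss)

print(losses[0], losses[-1])  # the loss shrinks as training proceeds
```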

  • Multiple Stacked Layers:

CNN architectures often consist of multiple stacked convolutional, activation, and pooling layers. The depth of the network helps capture hierarchical features.
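One way to see what stacking does is to trace how the shape of the data changes from layer to layer. The sketch below assumes a 32x32 RGB input, no padding, and a hypothetical stack of two convolution and pooling pairs. The layer sizes are made up for illustration.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling window is applied."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical stack: (layer type, kernel size, stride, number of filters).
layers = [
    ("conv", 3, 1, 16),
    ("pool", 2, 2, None),
    ("conv", 3, 1, 32),
    ("pool", 2, 2, None),
]

height, width, depth = 32, 32, 3  # assumed 32x32 RGB input
shapes = [(height, width, depth)]
for kind, kernel, stride, filters in layers:
    height = conv_output_size(height, kernel, stride)
    width = conv_output_size(width, kernel, stride)
    if kind == "conv":
        depth = filters  # one output channel per filter
    shapes.append((height, width, depth))

for shape in shapes:
    print(shape)
```

The width and height shrink at every stage while the depth grows, so later layers describe larger regions of the image with more feature channels.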

How Do Convolutional Layers in CNN Work?

Let’s look at how a convolutional neural network works in detail.

Convolutional neural networks, also known as ConvNets, are neural networks that share parameters. Think of an image as a cuboid with a width, a height, and a depth, where the depth is the number of channels (for example, red, green, and blue).

Now, take a small patch of the image and run a small neural network, known as a kernel or filter, on it with K outputs, and represent those outputs vertically. As you slide this small network across the whole image, you get another image with a different width, height, and depth.

Instead of just the R, G, and B channels, you now have more channels but a smaller width and height. This operation is called convolution. If the patch were the same size as the image, it would be a regular neural network. Because the patch is small, there are far fewer weights.
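A quick count shows how much the small patch saves. The sketch below compares a patch the size of the whole image (the regular neural network case) with a 3x3 patch, assuming a 32x32 RGB image and K = 16 outputs. These sizes are assumptions, and biases are left out to keep the arithmetic simple.

```python
def dense_weights(height, width, channels, outputs):
    """Weights when the patch covers the entire image."""
    return height * width * channels * outputs


def conv_weights(patch_height, patch_width, channels, filters):
    """Weights when a small patch is shared across all image positions."""
    return patch_height * patch_width * channels * filters


dense = dense_weights(32, 32, 3, 16)
conv = conv_weights(3, 3, 3, 16)
print(dense, conv)  # 49152 versus 432 weights
```

The convolutional count does not depend on the image size at all, because the same filter weights are reused at every position.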

Here is a brief explanation of the math behind convolutional layers and the convolution process.

  • Each layer comprises a set of learnable filters. Every filter has a small width and height and the same depth as the input volume.

  • During the forward pass, we slide each filter across the entire input volume step by step, where the step size is known as the stride. At each position, we compute the dot product between the filter weights and the patch of the input volume under the filter.

  • Sliding each filter produces a 2D output, and stacking these outputs gives an output volume with a depth equal to the number of filters. The network learns the filter weights during training.
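The three steps above can be sketched in plain Python. This version assumes no padding and nested lists instead of an array library, and the image and filter values are made up. Like most deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def conv2d(volume, filters, stride=1):
    """Slide K filters over a C x H x W volume (no padding).

    volume:  list of C channels, each an H x W list of lists
    filters: list of K filters, each C x kH x kW
    returns: K x outH x outW output volume
    """
    channels = len(volume)
    height, width = len(volume[0]), len(volume[0][0])
    k_h, k_w = len(filters[0][0]), len(filters[0][0][0])
    out_h = (height - k_h) // stride + 1
    out_w = (width - k_w) // stride + 1

    output = []
    for filt in filters:
        fmap = []
        for i in range(out_h):
            row = []
            for j in range(out_w):
                top, left = i * stride, j * stride
                # Dot product between the filter and the patch under it.
                total = 0
                for c in range(channels):
                    for di in range(k_h):
                        for dj in range(k_w):
                            total += volume[c][top + di][left + dj] * filt[c][di][dj]
                row.append(total)
            fmap.append(row)
        output.append(fmap)  # one 2D feature map per filter
    return output


# Hypothetical single-channel 4x4 image and two 2x2 filters.
image = [[[1, 2, 3, 0],
          [4, 5, 6, 1],
          [7, 8, 9, 2],
          [0, 1, 2, 3]]]
filters = [
    [[[1, 0], [0, 1]]],    # sums the main diagonal of each patch
    [[[1, -1], [1, -1]]],  # responds to left-to-right changes
]

out = conv2d(image, filters, stride=1)
print(len(out), len(out[0]), len(out[0][0]))  # depth 2, height 3, width 3
```

With a stride of 2, the same call produces 2x2 feature maps, since the filter visits fewer positions.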

What is a Pooling Layer?

Like the convolutional layer, the pooling layer reduces the spatial size of the convolved feature. This lowers the computation needed to process the data by reducing its dimensions.

There are two common types of pooling: max pooling and average pooling. Max pooling returns the maximum value from the area of the feature map covered by the kernel. It also acts as a noise suppressant, discarding noisy activations while reducing the dimensions.

Average pooling, in contrast, returns the average of the values in the area covered by the kernel. It reduces dimensions but only smooths noisy activations rather than discarding them. For this reason, max pooling is often reported to perform better than average pooling, although the better choice depends on the task.
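Here is a small sketch of both pooling types on a made-up 4x4 feature map, using a 2x2 window and a stride of 2. The function name pool2d and its parameters are assumptions for illustration.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D feature map."""
    height, width = len(feature_map), len(feature_map[0])
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the values covered by the window at this position.
            window = [feature_map[i * stride + di][j * stride + dj]
                      for di in range(size) for dj in range(size)]
            if mode == "max":
                row.append(max(window))
            elif mode == "average":
                row.append(sum(window) / len(window))
            else:
                raise ValueError("mode must be 'max' or 'average'")
        pooled.append(row)
    return pooled


feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 4, 3, 8]]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[6, 4], [7, 9]]
print(avg_pooled)  # [[3.75, 2.25], [3.5, 5.0]]
```

Both results are 2x2, a quarter of the original number of values. Max pooling keeps only the strongest activation in each window, while average pooling blends all four.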

Advantages of Convolutional Neural Networks

  • It offers end-to-end training without manual feature extraction.

  • It can detect patterns and features in videos, images, and audio signals.

  • It can easily handle vast amounts of data and attain high accuracy.

  • It is fairly robust to small translations of the input. Robustness to rotation and scaling is more limited and is usually improved with data augmentation.

Disadvantages of Convolutional Neural Networks

  • Needs a vast amount of labeled data.

  • Computationally expensive to train.

  • Limited interpretability: it is difficult to comprehend what the network has learned.

  • If there is not enough data, it can be prone to overfitting.
