How Does Convolution Work in a CNN?

Feb 7 · 4 min read

Ever wondered how your phone recognizes faces or how self-driving cars detect objects? The magic behind these technologies is convolution, one of the most important concepts in deep learning! Today, we’ll break it down in a way that makes sense, even if you’re new to the topic. So, grab a coffee, and let’s dive in!

What is Convolution?

Think of convolution as a way to scan an image, looking for important details. Imagine placing a small magnifying glass over different parts of an image, focusing on edges, textures, and patterns. That’s basically what a convolution operation does!

A convolution operation involves:

An input image (the thing we’re analyzing)
A filter (kernel) (a tiny matrix that extracts details)
A stride (how far we move the filter)
An output image (the transformed version highlighting key features)

How Does Convolution Work?

Step-by-Step Breakdown

Place the filter on a section of the image.

Multiply the top-left numbers of the image and the kernel.

Multiply the overlapping numbers.

Multiply the next pair of numbers in the image and the kernel.

Add up the results to get a single value.

Add all the overlapping products to get the first number in the output.

Move the filter to the next section and repeat. Note that here we assume a padding of 0 and a stride of 1, both of which I explain in detail in the next section.

Moving the filter right by one column, since the stride is 1.

Continue until the whole image is scanned!

As you can see in the example above, the output of the convolution operation is a 2 × 2 matrix, whereas our input matrix is 3 × 3. What determines the dimensions of the output matrix? Together with the filter size, two parameters do: stride and padding.
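The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming a stride of 1 and no padding; the function name `convolve2d_valid` and the numbers in `image` and `kernel` are made up for the example and are not the values from the original figures. Like most deep learning libraries, it slides the kernel without flipping it.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    output = []
    # Each (top, left) pair is one placement of the filter on the image.
    for top in range(img_h - k_h + 1):
        row = []
        for left in range(img_w - k_w + 1):
            total = 0
            # Multiply the overlapping numbers and add up the results.
            for i in range(k_h):
                for j in range(k_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 3x3 image and 2x2 kernel, chosen only for illustration.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]

result = convolve2d_valid(image, kernel)
print(result)  # [[6, 8], [12, 14]]
```

The 3×3 input and 2×2 kernel give a 2×2 output, because the kernel fits in only two positions along each dimension.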

Stride and Padding: How They Work in Convolution

Let’s break down stride and padding for convolution using the same example image and kernel, so you can clearly understand their impact on the output size and behavior.

1. Stride: How Fast the Kernel Moves

The stride controls how many cells the kernel moves at each step, both horizontally and vertically. In the example above, the kernel moves one cell at a time because the stride is 1.

Stride = 1: The kernel moves one pixel at a time as described in our previous example.
Stride = 2: The kernel skips every other pixel, producing a smaller output because fewer positions are calculated.

Example:

For a 3×3 image and 2×2 kernel:

With Stride = 1: The kernel moves over every possible 2×2 submatrix. This results in a 2×2 output, as shown in our earlier convolution example.
With Stride = 2: The kernel jumps two cells at a time. Only the top-left 2×2 region of the image fits the kernel (a second position would run past the edge), giving a 1×1 output.

Takeaway: Higher stride values reduce the output size but may miss finer details.
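To see where the kernel actually lands for a given stride, here is a small sketch that lists the valid starting positions along one dimension. The helper name `kernel_positions` is hypothetical, and the sketch assumes no padding.

```python
def kernel_positions(input_size, kernel_size, stride):
    """Start indices where the kernel fully fits along one dimension."""
    return list(range(0, input_size - kernel_size + 1, stride))


# 3-wide input, 2-wide kernel, as in the example above.
print(kernel_positions(3, 2, 1))  # [0, 1]
print(kernel_positions(3, 2, 2))  # [0]

# A larger, made-up case: 7-wide input, 3-wide kernel, stride 2.
print(kernel_positions(7, 3, 2))  # [0, 2, 4]
```

The number of positions along each dimension is the output size along that dimension, so stride 2 on the 3×3 image leaves a single position and a 1×1 output.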

2. Padding: Preserving Image Size

When applying convolution, the kernel can’t cover the edges of the image fully. This leads to a shrinking effect on the output size. Padding solves this by adding a border of zeros (or another value) around the input image as shown in the image below.

Types of Padding:

Valid Padding (No Padding):
- The kernel operates only within the boundaries of the image as in our previous example.
- Output size is smaller than the input size.
- Example: For a 3×3 image with a 2×2 kernel, and no padding, the output is 2×2.
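Zero Padding:
- Adds zeros to the borders of the image to counteract the shrinking caused by convolution.
- With the right amount of padding (often called "same" padding), the output size matches the input size.
- Example: For the same 3×3 image and 2×2 kernel with a padding of 1, zeros are added to create a 5×5 image. After convolution, the output is 4×4, one larger than the input, because a padding of 1 on every side adds two rows and columns while a 2×2 kernel removes only one. You can try this as an assignment and describe your solution in the comments.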

With a padding of 1, we add 0s all around the edge of the original image and continue the convolution as normal.
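Zero padding itself is easy to sketch in plain Python. The function name `zero_pad` and the image values below are assumptions made for illustration, not taken from the original figure.

```python
def zero_pad(image, padding):
    """Return a copy of the image with a border of zeros on every side."""
    width = len(image[0]) + 2 * padding
    # Rows of zeros above the image.
    padded = [[0] * width for _ in range(padding)]
    # Original rows, with zeros added on the left and right.
    for row in image:
        padded.append([0] * padding + list(row) + [0] * padding)
    # Rows of zeros below the image.
    padded.extend([0] * width for _ in range(padding))
    return padded


# Hypothetical 3x3 image, chosen only for illustration.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

padded = zero_pad(image, 1)
for row in padded:
    print(row)
# [0, 0, 0, 0, 0]
# [0, 1, 2, 3, 0]
# [0, 4, 5, 6, 0]
# [0, 7, 8, 9, 0]
# [0, 0, 0, 0, 0]
```

The 3×3 image becomes 5×5, and the convolution then runs over this padded image exactly as before.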

Why Does Padding Matter?

Without padding, important details near the image edges may get ignored. Padding ensures the kernel has enough data to process the boundaries, resulting in more complete feature extraction.

How They Work Together

Stride reduces the output size, as the kernel skips pixels. Higher strides speed up processing but can miss details.
Padding counteracts the shrinking effect of convolution and, with the right amount, preserves the original image dimensions.
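To watch stride and padding interact, here is a rough sketch of a convolution that accepts both. It is an illustrative, unoptimized version in plain Python; the names `conv2d` and `shape` and the sample values are assumptions for this example, not a library API.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Convolve with zero padding and a configurable stride."""
    # Surround the image with a border of zeros.
    width = len(image[0]) + 2 * padding
    padded = [[0] * width for _ in range(padding)]
    padded += [[0] * padding + list(row) + [0] * padding for row in image]
    padded += [[0] * width for _ in range(padding)]

    k_h, k_w = len(kernel), len(kernel[0])
    output = []
    # Step the kernel `stride` cells at a time in both directions.
    for top in range(0, len(padded) - k_h + 1, stride):
        row = []
        for left in range(0, width - k_w + 1, stride):
            row.append(sum(padded[top + i][left + j] * kernel[i][j]
                           for i in range(k_h) for j in range(k_w)))
        output.append(row)
    return output


def shape(matrix):
    """Return (rows, columns) of a nested-list matrix."""
    return (len(matrix), len(matrix[0]))


# Hypothetical 3x3 image and 2x2 kernel, chosen only for illustration.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]

print(shape(conv2d(image, kernel, stride=1, padding=0)))  # (2, 2)
print(shape(conv2d(image, kernel, stride=2, padding=0)))  # (1, 1)
print(shape(conv2d(image, kernel, stride=1, padding=1)))  # (4, 4)
print(shape(conv2d(image, kernel, stride=2, padding=1)))  # (2, 2)
```

Raising the stride shrinks the output, while adding padding grows it, which matches the sizes worked out by hand in the earlier examples.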

By balancing stride and padding, you can fine-tune how your convolution captures patterns while maintaining computational efficiency and output quality.

If you want to know the output size without actually performing the convolution, you can use this formula:

Output size = ⌊(Input size + 2 × Padding - Kernel size) ÷ Stride⌋ + 1

This formula works for both height and width dimensions. When you need the full output dimensions:

Output Height = ⌊(H + 2P - K_h) ÷ S⌋ + 1

Output Width = ⌊(W + 2P - K_w) ÷ S⌋ + 1

Where:

H, W = Input height and width
P = Padding
K_h, K_w = Kernel height and width
S = Stride
⌊ ⌋ represents floor division (round down)

We can cross-check the formula against our earlier padding example:

Input: 3×3
Kernel: 2×2
Stride: 1
Padding: 1

Ans: ((3 + 2×1 - 2) ÷ 1) + 1 = 4

So, when you work through the padded convolution from the earlier assignment, your output should be a 4×4 matrix.
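If you would rather let code do the arithmetic, the formula translates almost directly into Python. The function name `conv_output_size` is an assumption for this sketch; it relies on `//`, Python's floor division, to play the role of ⌊ ⌋.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output size along one dimension, using the formula above."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# The three cases worked through in this post (3x3 image, 2x2 kernel).
print(conv_output_size(3, 2, stride=1, padding=0))  # 2
print(conv_output_size(3, 2, stride=2, padding=0))  # 1
print(conv_output_size(3, 2, stride=1, padding=1))  # 4
```

Call it once with the height and once with the width to get both output dimensions.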

I hope this guide has helped clarify the key concepts about convolutions and their output dimensions. Understanding these fundamentals is crucial for working with convolutional neural networks effectively. If you have any questions about convolution arithmetic or related topics, feel free to leave them in the comments below. Happy coding!