UP | HOME

Convolutional Networks

The input to a convolutional network is an image of size \(W_i \times H_i \times C_i\), where:

The output is \(W_o \times H_o \times C_o\) where:

The convolutional weights have dimension \(W_c \times H_c \times C_i \times C_o\).

The output is computed by sliding along the input images and considering a \(W_c \times H_c\) at a time. The amount that we slide the window per step is called the stride.

Say that \(W^c\) are the convolutional weights and \(X\) is the input. Let's say that we are looking at a certain \(W_c \times H_c\) window of the input. Here is how the output for the \(k\) -th convolutional filter is computed:

1 What does it mean to take a \(1 \times 1\) convolution?

Usually this is done to reduce the depth of the input. The input may be \(W_i \times H_i \times C_i\) and the output may be \(W_i \times H_i \times C_o\), where each of the \(C_o\) filters is a linear combination of the \(C_i\) channels. Then, for a given cell in the output, each channel of the output is a weighted sum + bias of the input channels.

2 Useful links