Convolutional Networks

The input to a convolutional layer is an image of size \(W_i \times H_i \times C_i\), where:

\(W_i\) is the input width
\(H_i\) is the input height
\(C_i\) is the input channel dimension, e.g. 3 channels for red, green, and blue

The output is \(W_o \times H_o \times C_o\), where:

\(W_o\) is the output width
\(H_o\) is the output height
\(C_o\) is the number of convolutional filters

The convolutional weights have dimension \(W_c \times H_c \times C_i \times C_o\).

The output is computed by sliding a \(W_c \times H_c\) window across the input image and producing one output cell per window position. The number of positions the window moves per step is called the stride.
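The number of window positions along one axis follows from the input size, the window size, and the stride. Here is a minimal sketch of that arithmetic, assuming no padding (padding is not covered in these notes); the function name `conv_output_size` is made up for illustration.

```python
def conv_output_size(input_size: int, window_size: int, stride: int = 1) -> int:
    """Number of window positions along one axis, assuming no padding."""
    if window_size > input_size:
        raise ValueError("window is larger than the input")
    # The first window starts at 0; each further window starts `stride` later
    # and must still fit entirely inside the input.
    return (input_size - window_size) // stride + 1


# Example: a 32-wide input with a 5-wide window.
w_o_stride1 = conv_output_size(32, 5, stride=1)  # 28 positions
w_o_stride2 = conv_output_size(32, 5, stride=2)  # 14 positions
```

The same function gives \(H_o\) when called with \(H_i\) and \(H_c\).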

Say that \(W^c\) are the convolutional weights and \(X\) is the input, and that we are looking at a certain \(W_c \times H_c\) window of the input. Here is how the output for the \(k\)-th convolutional filter is computed:

For the \(i\)-th channel in the input, look at the \(W_c \times H_c\) window in the \(i\)-th channel, e.g. \(X[0:W_c,\,0:H_c,\,i]\). Then, take the dot product with \(W^c[0:W_c,\,0:H_c,\,i,\,k]\), i.e. multiply the two windows elementwise and sum the results.
Sum these products over all input channels \(i\).
Add a bias term (one bias per filter \(k\)).
This gives the \(k\)-th channel of the output for that specific \(W_c \times H_c\) input window.
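The steps above can be sketched in NumPy as a deliberately slow, loop-based version. This is an illustration under stated assumptions, not an optimized implementation: it assumes no padding, a single input image laid out as width, height, channels (matching the notation here), and a bias vector `b` with one entry per filter. The name `conv2d_naive` is hypothetical.

```python
import numpy as np


def conv2d_naive(X, Wc, b, stride=1):
    """Naive convolution: X is (W_i, H_i, C_i), Wc is (W_c, H_c, C_i, C_o), b is (C_o,)."""
    W_i, H_i, C_i = X.shape
    W_c, H_c, C_i_weights, C_o = Wc.shape
    assert C_i == C_i_weights, "input channels must match the weights"

    # Number of window positions along each axis (no padding assumed).
    W_o = (W_i - W_c) // stride + 1
    H_o = (H_i - H_c) // stride + 1
    out = np.zeros((W_o, H_o, C_o))

    for x in range(W_o):
        for y in range(H_o):
            x0, y0 = x * stride, y * stride
            window = X[x0:x0 + W_c, y0:y0 + H_c, :]  # (W_c, H_c, C_i)
            for k in range(C_o):
                # Elementwise product with the k-th filter, summed over the
                # window and over all input channels, plus the filter's bias.
                out[x, y, k] = np.sum(window * Wc[:, :, :, k]) + b[k]
    return out
```

Summing `window * Wc[:, :, :, k]` over all three axes performs the per-channel dot product and the sum over channels in one step.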

1 What does it mean to take a \(1 \times 1\) convolution?

A \(1 \times 1\) convolution uses a window with \(W_c = H_c = 1\), so each output cell depends only on the input cell at the same position. Usually this is done to reduce the depth (number of channels) of the input. The input may be \(W_i \times H_i \times C_i\) and the output may be \(W_i \times H_i \times C_o\), typically with \(C_o < C_i\). For a given cell in the output, each of the \(C_o\) output channels is a weighted sum of the \(C_i\) input channels plus a bias.
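Because the window covers a single cell, a \(1 \times 1\) convolution with stride 1 can be written as one matrix multiplication over the channel axis. The sketch below assumes the same width, height, channels layout as above and a bias vector `b` of length \(C_o\); the name `conv1x1` is made up for illustration.

```python
import numpy as np


def conv1x1(X, Wc, b):
    """1x1 convolution: X is (W_i, H_i, C_i), Wc is (1, 1, C_i, C_o), b is (C_o,)."""
    assert Wc.shape[:2] == (1, 1), "expected a 1x1 window"
    M = Wc[0, 0]  # (C_i, C_o): one column of channel weights per filter
    # For every cell, map its C_i channel values to C_o values; the matrix
    # product is applied independently at each (width, height) position.
    return X @ M + b


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5, 8))       # 8 input channels
Wc = rng.normal(size=(1, 1, 8, 2))   # reduce the depth to 2 channels
b = rng.normal(size=2)
out = conv1x1(X, Wc, b)              # shape (4, 5, 2)
```

The width and height are unchanged; only the channel dimension goes from \(C_i\) to \(C_o\).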

2 Useful links

Has a good real life example