
LSTM

At each time-step \(t\), the LSTM sees the current input \(x_t\), the previous output \(h_{t-1}\), and the previous cell state \(C_{t-1}\).

1 Updating the cell

First, we need to determine what part of the old cell state we want to keep. We compute a forget vector: \[f_t = \sigma(W_f \cdot [h_{t-1}, x_{t}] + b_f)\] where \(W_f\) is a 2-D weight matrix and \(b_f\) is a vector of bias weights. Thus, \(f_t\) is a vector of values between 0 and 1, one per component of the cell state.

Then, we need to determine what updates we are adding to the new cell state. Let \(\tilde{C}_t\) be the candidate cell state and \(i_t\) be the coefficient that determines how much of the candidate gets added: \[ \tilde{C}_t = \tanh(W_C[h_{t-1},x_t] + b_C) \] and \[ i_t = \sigma(W_i[h_{t-1},x_t] + b_i) \]

Then, we have our updated cell state! \[ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \] where \(*\) denotes element-wise multiplication.
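The cell-state update above can be sketched in NumPy. The dimensions (hidden size 4, input size 3) and the random initialisation are my own toy assumptions, just to make the shapes concrete:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H, D = 4, 3  # assumed toy sizes: hidden state H, input D
rng = np.random.default_rng(0)

# Each gate has a weight matrix acting on the concatenation [h_{t-1}, x_t].
W_f, b_f = rng.standard_normal((H, H + D)), np.zeros(H)
W_i, b_i = rng.standard_normal((H, H + D)), np.zeros(H)
W_C, b_C = rng.standard_normal((H, H + D)), np.zeros(H)

h_prev = rng.standard_normal(H)   # h_{t-1}
x_t = rng.standard_normal(D)      # current input
C_prev = rng.standard_normal(H)   # C_{t-1}

z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
f_t = sigmoid(W_f @ z + b_f)            # forget gate, entries in (0, 1)
i_t = sigmoid(W_i @ z + b_i)            # input gate
C_tilde = np.tanh(W_C @ z + b_C)        # candidate cell state, entries in (-1, 1)
C_t = f_t * C_prev + i_t * C_tilde      # element-wise update
```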

2 Creating output

First, we need to decide which pieces of the cell state to output. So, we create an output vector: \[ o_t = \sigma(W_o[h_{t-1},x_{t}] + b_o) \]

Then, the output is: \[ h_t = o_t * \tanh(C_t) \]

Note that all the matrices \(W_f\), \(W_i\), \(W_C\), and \(W_o\) act only on the previous output and the current input, never on the cell state itself.
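Putting both steps together, one full time-step might look like the following sketch. The `params` layout and the function name are my own choices; a real implementation would also batch inputs and learn the weights by backpropagation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(params, h_prev, C_prev, x_t):
    """One LSTM time-step; returns (h_t, C_t).

    params maps each gate name ("f", "i", "C", "o") to a (W, b) pair,
    where W acts on the concatenation [h_{t-1}, x_t].
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["f"][0] @ z + params["f"][1])       # forget gate
    i_t = sigmoid(params["i"][0] @ z + params["i"][1])       # input gate
    C_tilde = np.tanh(params["C"][0] @ z + params["C"][1])   # candidate state
    o_t = sigmoid(params["o"][0] @ z + params["o"][1])       # output gate
    C_t = f_t * C_prev + i_t * C_tilde
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

# Usage with assumed toy sizes: hidden 4, input 3.
H, D = 4, 3
rng = np.random.default_rng(1)
params = {k: (rng.standard_normal((H, H + D)), np.zeros(H))
          for k in ("f", "i", "C", "o")}
h, C = np.zeros(H), np.zeros(H)
for x_t in rng.standard_normal((5, D)):   # a length-5 input sequence
    h, C = lstm_step(params, h, C, x_t)
```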

3 Helpful links

Created: 2021-09-14 Tue 21:44