Distill

Attention and Augmented Recurrent Neural Networks

Methods to augment RNNs:

  1. Neural Turing Machines - have external memory that the RNN can read from and write to.
  2. Attentional Interfaces - allow RNNs to focus on parts of their input.
  3. Adaptive Computation Time - allows for varying amounts of computation per step.
  4. Neural Programmers - can call functions, building programs as they run.

They all rely on attention to work.

1. Neural Turing Machines

Extend the RNN with an external memory bank; how to use the memory bank is learned. At every step, the RNN reads and writes everywhere in memory, just to different extents. Attention determines how much it reads from and writes to each location.

The memory bank is a matrix.

For a given query vector, loop over the rows of the memory bank and compute the dot product between the query vector and each memory vector. A softmax converts the scores into a distribution. (The scores may go through other operations first.)

  • Read is attention pooling: \(r \leftarrow \sum_i a_i M_i\), where \(i\) is the row index.
  • Write replaces each old row with an attention-weighted blend of the write vector and the old row: \(M_i \leftarrow a_i w + (1-a_i) M_i\) (both sketched below).
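A minimal NumPy sketch of this read/write step (the memory size, query, and function names are illustrative, not from the article):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def address(memory, query):
    # One attention weight per memory row: dot product with the query, then softmax.
    scores = memory @ query            # shape: (num_rows,)
    return softmax(scores)             # attention distribution a

def read(memory, a):
    # Read is attention pooling: r = sum_i a_i * M_i
    return a @ memory                  # shape: (row_width,)

def write(memory, a, w):
    # Write blends the write vector into every row, weighted by attention:
    # M_i <- a_i * w + (1 - a_i) * M_i
    return a[:, None] * w + (1 - a)[:, None] * memory

memory = np.random.randn(8, 16)        # 8 rows, each of width 16
query = np.random.randn(16)
a = address(memory, query)
r = read(memory, a)
memory = write(memory, a, np.random.randn(16))
```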

2. Attentional Interface

Attention allows neural networks to focus on a subset of the information they’re given. For example, an RNN can attend over the outputs (hidden states) of another RNN. At every time step, it focuses on different positions in the other RNN. The attention focuses everywhere, just to different extents.

The attending RNN generates a query describing what it wants to focus on. Each item is dot-producted with the query to produce a score, describing how well it matches the query. The scores are fed into a softmax to create the attention distribution.
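A minimal sketch of one such attention step, assuming we already have the attending RNN’s query vector and the other RNN’s hidden states (all names here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, hidden_states):
    # Score each hidden state by how well it matches the query (dot product),
    # turn the scores into an attention distribution, then pool.
    scores = hidden_states @ query          # (seq_len,)
    weights = softmax(scores)               # attention distribution
    context = weights @ hidden_states       # weighted sum fed back to the attending RNN
    return context, weights

hidden_states = np.random.randn(10, 32)     # 10 time steps of a 32-d RNN
query = np.random.randn(32)
context, weights = attend(query, hidden_states)
```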

More broadly, attentional interfaces can be used whenever one wants to interface with a neural network that has a repeating structure in its output.

3. Adaptive Computation Time

A standard RNN does a fixed amount of computation per time step. The idea here is to let the RNN decide how many computation steps it needs at each time step, via an attention distribution over the steps.
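A simplified sketch of the idea for a single time step: keep taking “ponder” steps, each emitting a halting probability, until the accumulated probability passes a threshold; the step’s output is the halting-weighted average of the intermediate states. The cell, halting unit, and epsilon below are assumptions for illustration, not the paper’s exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def act_step(state, x, cell, halt_weights, eps=0.01, max_steps=10):
    states, probs = [], []
    remaining = 1.0
    for n in range(max_steps):
        state = cell(state, x)                 # one "ponder" step of computation
        p = sigmoid(halt_weights @ state)      # halting probability for this step
        if sum(probs) + p > 1.0 - eps or n == max_steps - 1:
            probs.append(remaining)            # spend the leftover probability and stop
            states.append(state)
            break
        probs.append(p)
        remaining -= p
        states.append(state)
    # The step's output is the halting-weighted average of the intermediate states.
    return sum(p * s for p, s in zip(probs, states))

cell = lambda s, x: np.tanh(s + x)             # toy recurrent update
state = np.zeros(16)
x = np.random.randn(16)
halt_weights = np.random.randn(16)
new_state = act_step(state, x, cell, halt_weights)
```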

4. Neural Programmer

Some tasks are hard for neural networks but easy for conventional approaches to computing. The idea here is to fuse neural nets with normal programming.

Neural programmer: a NN that learns to create programs in order to solve a task.
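A toy sketch of the core trick: rather than picking a single operation, the controller emits a softmax over a small set of operations, and the result is their attention-weighted blend, so the whole thing stays differentiable. The operation set and controller scores here are invented for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

operations = [
    lambda col: col.sum(),       # e.g. "sum this table column"
    lambda col: col.max(),
    lambda col: col.min(),
    lambda col: float(len(col)), # count
]

column = np.array([3.0, 7.0, 1.0, 4.0])
controller_scores = np.random.randn(len(operations))  # would come from an RNN controller
weights = softmax(controller_scores)

# Soft (fractional) execution: run every op and average the results by attention weight.
result = sum(w * op(column) for w, op in zip(weights, operations))
```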

Big Picture

It seems like a lot of interesting forms of intelligence are an interaction between the creative heuristic intuition of humans and some more crisp and careful medium, like language or equations. Sometimes the medium is something that physically exists, storing information for us, preventing us from making mistakes, or doing the computational heavy lifting (the neural programmer calling a tool?). In other cases, the medium is a model in our head that we manipulate (like the NTM using its memory?).

Recent results in machine learning have started to have this flavor, combining the intuition of neural networks with something else. For example, in “heuristic search”, AlphaGo has a model of how Go works and explores how the game could play out, guided by neural network intuition.

Attention gives us an easier way out by taking all actions, each to a varying extent. This works because we can design media - like the NTM memory - to allow fractional actions and to be differentiable.

If we could really make sub-linear time attention work, that would be very powerful!

How to Use t-SNE Effectively