Note 1

1 Introduction to Natural Language Processing

What is natural language

  • A discrete / symbolic / categorical system
  • the word is a signifier that maps to a signified(idea or thing)
  • the symbols of language can be encoded in several modalities: voice, gesture, writing, etc that are transmitted via continuous signals to the brain.


  • Easy:
    • Spell Checking
    • Keyword Search
    • Finding Synonyms
  • Medium
    • Parsing information from websites, documents, etc.
  • Hard
    • Machine Translation
    • Semantic Analysis
    • Co-reference
    • Question Answering

How to represent words?

  • Vectors.
  • To perform well on most NLP tasks, we first need to have some notion of similarity and difference between words.
  • Distance measures: Jaccard, Cosine, Euclidean, etc.

2 Word Vectors

  • Want to encode word tokens each into some vector that represents a point in some sort of “word” space.
  • There perhaps exists some N-dimensional space such that $N \ll Size(vocabulary)$ that is sufficient to encode all semantics of the language.
  • Each dimension would encode some meaning that we transfer using speech.
  • Eg. Semantic -> tense, count, gender.
  • One-hot vector:
    • represent every word as $\mathbb{R}^{ V \cross 1}$ vector with all 0s and one 1.
    • no notion of similarity, since dot product between vectors are all 0

3 SVD Based Methods

  • Loop over corpus and accumulate word co-occurrence counts in form of a matrix $X$
  • Then perform singular value decomposition on $X$ to get a $USV^T$ decomposition.
  • Use rows of $U$ as word embeddings.

3.1 Word-Document Matrix

  1. Assumption: words that are related will often appear in the same documents.
  2. $X$: word-document matrix(also known as document term matrix). $\mathbb{R}^{ V \cross M}$
  3. It scales with the number of documents(M).

3.2 Window based Co-occurrence Matrix

  1. Assumption: words that are related will appear close to each other in text
  2. $X$: word-word co-occurence matrix (count the number of times each word appears inside a window of of a particular size around the word of interest) $\mathbb{R}^{ V \cross V }$ X is an affinity matrix
  3. It is fixed with vocabulary size, but window size is a hyper-parameter

3.3 Applying SVD to the co-occurrence matrix

  • Observing the values on the diagonal entries in $S$ and cut off at some index $k$ based on the desired percentage variance captured.
  • Select first $k$ columns of $U$ to be used as word vectors.



  • Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic (part of speech) information but are associated with many other problems:
    1. Dimensions of matrix change very often. (new words, more documents)
    2. Matrix is extremely sparse since most words do not co-occur
    3. Matrix is very high dimensional $\approx 10 ^6\cross 10^6$
    4. Quadratic cost to train (SVD)
    5. Requires the incorporation of some hacks on X to account for the drastic imbalance in word frequency. (TFIDF / BM25)
  • Some solutions to above problems:
    1. Ignore function words such as “the”, “he”, “has”, etc
    2. Apply a ramp window - weight co-occurrence count based on distance between the words in the document.
    3. Use Pearson correlation and set negative counts to 0 instead of using just raw count.
  • Iteration based methods solve many of these issues

4 Iteration Based Methods - Word2Vec

  • Instead of computing and storing global information about some huge corpus, we try to create a model that is learned one iteration at a time and eventually be able to encode the probability of a word given its context.
  • Context of a word is the set of $m$ surrounding words.
  • Assumption: distributional similarity - similar words have similar context.
  • Train a model whose parameters are the word vectors against a certain objective.
  • Word2vec is a software package includes:
    1. 2 algorithms: continuous bag-of-words(CBOW) and skip-gram.
      • CBOW aims to predict a center word from the surrounding context
      • Skip-gram predicts the distribution of context words from a center word
    2. 2 training methods: negative sampling and hierarchical softmax
      • Negative sampling defines an objective by sampling negative examples
      • Hierarchical softmax defines an objective using an efficient tree structure to compute probabilities for all the vocabulary.

4.1 Language Models (Unigrams, Bigrams, etc)

  • Create a model that assigns a probability to a sequence of tokens.
  • Assumption: word occurrences are completely independent.
    • Unigram model: $P(w_1, w_2, …, w_n) = \prod_{i=1}^{n}P(w_i)$
  • Assumption: probability of the sequence depend on the pairwise probability of a word in the sequence and the word next to it.
    • Bigram model: $P(w_1,w_2,…,w_n) = \prod_{i=2}^{n}P(w_i w_{i-1})$

4.2 Continuous Bag of Words Model(CBOW)

  • Treat {“The”, “cat”, “over”, “the”, “puddle”} as a context and from this words, be able to predict or generate the center word “jumped”.

  • For each word, we try to learn two vectors, one to use when word is context word, another to use when word is center word.

    • $\mathit{V} \in \mathbb{R}^{n \cross V }$
    • $\mathit{U} \in \mathbb{R}^{ V \cross n}$
  • How the model works:

    1. Generate one-hot word vectors for the input context of size $m$: $(x^{c-m},…,x^{c-1},x^{c+1},…,x^{c+m})$
    2. Get embedded word vectors for the context: $(v_{c-m}=Vx^{c-m},…,v_{c+m}=Vx^{c+m})$
    3. Average these vectors to get $\hat{v} = \frac{v_{c-m}+v_{c-m+1}+…+v_{c+m}}{2m}$
    4. Generate a score vector $z=U\cdot\hat{v} \in \mathbb{R}^{ V }$ As the dot product of similar vectors is higher, it will push similar words close to each other(ideally predicted center word and true center word) in order to achieve a high score.
    5. Turn the scores into probabilities $\hat{y} \in \mathbb{R}^{ V }$ with softmax. Where $softmax(v) = \frac{e^{v_i}}{\Sigma_{k=1}^{ V }e^{y_k}} for \qquad i \in V $
    6. We want the predicted probabilities to match the true probabilities, $y\in\mathbb{R}^{ V }$ which is the one-hot encoded center word.
  • Loss function:

    • Cross entropy $H(\hat{y}, y) = - \sum_{j=1}^{ V }y_jlog(\hat{y}_j)$
    • It gives us a good measure of distance.
    \[\begin{align} minimize J &=-logP(w_c|w_{c-m},...,w_{c+m}) \\ &= -logP(u_c|\hat{v}) \\ &= -log\frac{exp(u_c^T\hat{v})}{\Sigma_{j=1}^{|V|}exp(u_j^T\hat{v})} \\ &= -u_c^T\hat{v} + log\Sigma_{j=1}^{|V|}exp(u_j^T\hat{v}) \end{align}\]
  • We optimize $U$ and $V$ via stochastic gradient descent (SGD).

    • $U\leftarrow U-\alpha\nabla _uJ; V\leftarrow V-\alpha\nabla_vJ$


4.3 Skip-Gram Model

  • Given the center word “jump” the model will be asked to predict or generate the surrounding words.

  • Essentially, we are swapping $x$ and $y$ in the model.

  • Steps:

    1. Generate input one hot vector $x\in\mathbb{R}^{ V }$ for the center word.
    2. Look up the embedding vector $v_c=V_x\in\mathbb{R}^n$
    3. Generate a score vector $z=Uv_c\in \mathbb{R}^{ V }$
    4. Turn the score vector into probabilities $\hat{y}$ with softmax
    5. Desire the output with target $y^{c-m},…,y^{c+m}$
    6. We invoke a Naive Bayes assumption(conditional independence) assuming all output words are completely independent.
    \[\begin{align} minimize J &=-logP(w_{c-m}, ..., w_{c+m}|w_c) \\ &= -log\prod_{j=0,j\neq m}^{2m}P(w_{c-m+j}|w_c) \\ &= -log\prod_{j=0,j\neq m}^{2m}P(u_{c-m+j}|v_c) \\ &= -log\prod_{j=0,j\neq m}^{2m}\frac{exp(u_{c-m+j}^Tv_c)}{\sum_{k=1}^{|V|}exp(u_{k}^Tv_c)} \\ &= -\sum_{j=0,j\neq m}^{2m}u_{c-m+k}^Tv_c+2mlog\sum_{k=1}^{|V|}exp(u_k^Tv_c) \end{align}\] \[J=-\sum_{j=0,j\neq m}^{2m}logP(u_{c-m+j}|v_c) = \sum_{j=0,j\neq m}^{2m}H(\hat{y}, y_{c-m+j})\]


4.4 Negative Sampling

  • One problem with the objective function is that it is a summation over $ V $.
  • One solution is to approximate it.

  • For each training step, we sample several negative examples from a noise distribution $P_n(w)$ whose probabilities match the ordering of the frequency of the vocabulary.

  • Considering a pair of word and context $(w, c)$ let:

    • $P(D=1 w,c)$ be the probability that the pair came from the data.
    • $P(D=0 w,c)$ be the probability that the pair did not came from the data.
  • Model $P(D=1 w,c) = \sigma(v_c^Tv_w) = \frac{1}{1+ e^(-v_c^Tv_w)}$
  • MLE approach to maximize these two probabilities:

    \[\begin{align} \theta &= \underset{\theta}{argmax} \prod_{(w,c)\in D} P(D=1|w,c,\theta)\prod_{(w,c)\in \tilde{D}} P(D=0|w,c,\theta) \\ &= \underset{\theta}{argmax} \prod_{(w,c)\in D} P(D=1|w,c,\theta) \prod_{(w,c)\in \tilde{D}}(1-P(D=1|w,c,\theta)) \\ &= \underset{\theta}{argmax} \sum_{(w,c)\in D}logP(D=1|w,c,\theta) + \sum_{(w,c)\in\tilde{D}}log(1-P(D=1|w, c,\theta)) \\ &= \underset{\theta}{argmax} \sum_{(w,c)\in D}log\frac{1}{1+exp(-u_w^Tv_c)} + \sum_{(w,c)\in\tilde{D}}log(1-\frac{1}{1+exp(-u_w^Tv_c)}) \\ &= \underset{\theta}{argmax} \sum_{(w,c)\in D}log\frac{1}{1+exp(-u_w^Tv_c)} + \sum_{(w,c)\in\tilde{D}}log(\frac{1}{1+exp(u_w^Tv_c)}) \\ \end{align}\]
  • Maximizing the likelihood is the same as minimizing the negative log likelihood.
\[J=-\sum_{(w,c)\in D}log\frac{1}{1+exp(-u_w^Tv_c)} - \sum_{(w,c)\in\tilde{D}}log(\frac{1}{1+exp(u_w^Tv_c)})\]
  • $\tilde{D}$ is false or negative corpus, we can generate on the fly by randomly sampling from the word bank.
  • Now with negative sampling the skip-gram model’s objective function becomes:
\[-log\sigma(u_{c-m+j}^T\cdot v_c)-\sum_{k=1}^{K}log\sigma(-\tilde{u}_k^T\cdot v_c)\]
  • For CBOW, the objective function will be:
\[-log\sigma(u_c^T\cdot\hat{v}) - \sum_{k=1}^{K}log\sigma(-\tilde{u}_k^T\cdot \hat{v})\]
  • For $P_n(w)$, what seems to work best is the Unigram Model raised to the power of 3/4.

4.5 Hierarchical Softmax

  • Hierarchical softmax is an efficient alternative to the normal softmax.
  • In practice, hierarchical softmax tends to be better for infrequent words, while negative sampling works better for frequent words and lower dimensional vectors.

hierarchical softmax

  • Hierarchical softmax uses a binary tree to represent all words in the vocabulary. Each leaf is a a word, and there is a unique path from root to leaf.
  • There is no output representation for words. Instead each node of the graph is associated to a vector that the model is going to learn. The probability of a word $w$ given a vector $w_i$, $p(w w_i)$ is equal to the probability of a random walk starting in the root and ending in the lead node corresponding to $w$.
  • Cost to compute the probability is $O(log( V ))$
\[P(w|w_i)=\prod_{j=1}^{L(w)-1}\sigma([n(w, j+1)=ch(n(w,j))]\cdot v_{n(w,j)}^Tv_{w_i}) \\ where \\ [x] = \begin{cases} 1 & \text{ if x is true}\\ -1 & \text{otherwise} \end{cases}\]
  • $n(w, i)$ is the root $n(w, L(w))$ is the parent node of $w$
  • $ch(n)$ indicate the left children of node $n$
  • $[n(w, j+1)=ch(n(w,j))]$ will be 1 if node is left -1 if node is right, thus ensures proper normalization.
  • To train the model, our goal is to minimize the negative log likelihood by updating vectors of the nodes in the binary tree.

Note 2

1 Global Vectors for Word Representation (GloVe)

1.1 Comparison with Previous Methods

  1. Count based
    • Rely on Matrix factorization. (LSA, HAL)
    • Effectively leverage global statistical information
    • Ok with word similarities, but poorly on other tasks -> vector space structure is sub-optimal.
  2. Shallow window-based
    • Skip-gram and CBOW
    • Learn embedding by making predictions in local context windows
    • Capture complex linguistic patterns beyond word similarity
    • Fail to make use of the global co-occurrence statistics
  3. GolVe
    • Using global statistics to predict the probability of word pair co-occurrence

1.2 Co-occurrence Matrix

  • $X_{ij}$ indicates the number of time word $j$ occur in the context of word $i$
  • $X_i=\sum_kX_{ik}$ be the number of times any word $k$ appears in the context of word $i$
  • $P_{ij} = p(w_j w_i) = \frac{X_{ij}}{X_{i}}$ is the probability of $j$ appearing in the context of word $i$

1.3 Least Squares Objective

  • softmax compute the probability of word $j$ appears in the context of word $i$ \(Q_{ij} = \frac{exp(u_j^Tv_i)}{\sum_{w=1}^{W}exp(u_w^Tv_i)}\)

  • Global corss-entropy objective function trained in an on-line, stochastic fashion \(J = -\sum_{i \in \text{corpus}}\sum_{j \in \text{context(i)}}logQ_{ij}\)

  • It is more efficient to group together the same values for $i$ and $j$ \(J=-\sum_{i=1}^{W}\sum_{j=1}^{W}X_{ij}logQ_{ij}\) where the value of co-occurring frequency is given by the co-occurrence matrix X

  • One drawback of the cross-entropy loss is that it requires the distribution $Q$ to be properly normalized, which involves the expensive summation over the entire vocabulary.

  • Instead, we use a least square objective in which the normalization factors in $P$ and $Q$ are discarded: \(\hat{J}=\sum_{i=1}^{W}\sum_{j=1}^{W}X_{ij}(\hat{P_{ij}}-\hat{Q_{ij}})^2\) where $\hat{P_{ij}} = X_{ij}$ and $\hat{Q_{ij}} = exp(u_j^Tv_i)$ are the unnormalized distributions.

  • New problem: $X_{ij}$ often takes on very large values and makes the optimization difficult.

  • Solution: minimize the squared error of the logarithm of $\hat{P}$ and $\hat{Q}$ \(\begin{align} \hat{J} &= \sum_{i=1}^{W} \sum_{j=1}^{W}X_i(log(\hat{P_{ij}}) - log(\hat{Q_{ij}}))^2\\ &=\sum_{i=1}^{W} \sum_{j=1}^{W}X_i(u_j^Tv_i - logX_{ij})^2 \\ &=\sum_{i=1}^{W} \sum_{j=1}^{W}f(X_{ij})(u_j^Tv_i - logX_{ij})^2 \end{align}\) where $f(X_{ij})$ is a more general weighting function than the raw $X_i$

2 Evaluation of Word Vectors

2.1 Intrinsic Evaluation

  • Evaluate word vectors on specific intermediate subtasks. Like analogy completion for question answering. Reason? If evaluated with the final downstream task, the entire pipeline will be retrained from scratch each time we need to tweak the word vector learning.
  • Requirement: the intrinsic evaluation has a positive correlation with the final task performance.

2.2 Extrinsic Evaluation

  • Evaluate word vectors on the real task.
  • Typically, optimizing over an under performing extrinsic evaluation system does not allow us to determine which specific subsystem is at fault and this motivates the need for intrinsic evaluation.

2.3 Intrinsic Evaluation Example: Word Vector Analogies

  • $a:b::c:?$

  • find $d$ such that it maximizes the cosine similarity \(d=\underset{i}{argmax}\frac{(x_b-x_a+x_c)^Tx_i}{||x_b-x_a+x_c||}\)

  • semantci: capital city -> country

  • syntactic: adj -> it’s superlative form

2.4 Analogy Evaluations


  • Performance is heavily dependent on the model used for word embedding.
  • Performance increases with larger corpus sizes
  • Performance is lower for extremely low dimensional word vectors
  • Extremely high dimensional vectors with CBOW/SG is reasonably robust to noise


2.5 Correlation Evaluation

  • Another simple way to evaluate the quality of word vectors is by asking humans to assess the similarity between two words on a fixed scale and then comparing this with the cosine similairity between the corresponding word vectors.


2.6 Dealing With Ambiguity

  • Improving word representations via global context and multiple word prototypes. Huang 2012

3. Training for Extrinsic Tasks

3.1 Problem Formulation

  • Most NLP extrinsic tasks can be formulated as classification tasks.
  • Sentiment analysis: given a text, classify it to be negative or positive.
  • Named-entity recognition(NER): given a context and a center word, classify the center word to be one of many classes.

3.2 Retraining Word Vectors

  • If we retrain word vectors using the extrinsic task, we need to ensure that the training set is large enough to cover most words from the vocabulary.
  • When retrain word vectors over a small set of the vocabulary, these words are shifted in the word space and as a result, the performance over the final task could actually reduce.

3.3 Softmax Classification and Regularization

  • Softmax classification: \(p(y_j=1|x) = \frac{exp(W_j\cdot x)}{\sum_{c=1}^{C}exp(W_c\cdot x)}\)

  • Using the cross-entropy loss: \(-\sum_{j=1}^{C}y_jlog(p(y_j=1|x)) = -\sum_{j=1}^{C}y_j log(\frac{exp(W_j\cdot x)}{\sum_{c=1}^{C}exp(W_c\cdot x)})\)

  • Since there is only one correct answer, let $k$ be the correct one. The loss function simplify to: \(-log(\frac{exp(W_k\cdot x)}{\sum_{c=1}^{C}exp(W_c\cdot x)})\)

  • Extend to a dataset of N points: \(-\sum_{i=1}^{N}log(\frac{exp(W_{k(i)}\cdot x^{(i)})}{\sum_{c=1}^{C}exp(W_c\cdot x^{(i)})})\)

  • We have $C\cdot d+|V|\cdot d$ parameters to update each time. To avoid, over fitting, we can pose the Bayesian belief that parameters $(\theta)$ should be small in magnitude: \(-\sum_{i=1}^{N}log(\frac{exp(W_{k(i)}\cdot x^{(i)})}{\sum_{c=1}^{C}exp(W_c\cdot x^{(i)})}) + \lambda \sum_{k=1}^{C\cdot d + |V|\cdot d}{\theta_k}^2\)

3.4 Window Classification

  • Instead of using just the word of interest, use a sequence that is a center word preceded and suceeded by context words.
  • Generally, narrower window sizes lead to better performance in syntactic tests while wider windows lead to better performance in semantic tests.

3.5 Non-linear Classifiers

  • Non-linear decision boundaries to solve not linear separable classification problem.

Note 3

1 Dependency Grammar and Dependency Structure

  • Parse trees in NLP are used to analyze the syntactic structure of sentences.
    1. Constituency structures - phrase structure grammar to organize words into nested constituents.
    2. Dependency structures - shows which words depend on (modify or are argument of) which other words.
  • Dependency tree for the sentence: “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas.”

dependency tree

1.1 Dependency Parsing

  • Task of analyzing the syntactic dependency structure of a given input sentence S.
  • Output is a dependency tree/graph $G$ where the words are connected by typed dependency relations.
    1. Learning: Given a training set $D$ of sentences annotated with dependency graphs, induce a parsing model M that can be used to parse new sentences.
    2. Parsing: Given a parsing model M and a sentence S, derive the optimal dependency graph D for S according to M.

1.2 Transition-Based Dependency Parsing

  • State machine which defines the possible transitions to create the mapping from the input sentence to the dependency tree.
  • Learning: induce a model to predict the next transition
  • Parsing: construct the optimal sequence of transitions
  • Most transition-based systems do not make use of a formal grammar.

1.3 Greedy Deterministic Transition-Based Parsing

  • Nivra 2003, showed to be better in both performance and efficiency than traditional feature-based discriminative dependency parsers.

  • States:
    • For any sentence $S=w_0w_1…w_n$ a state can be described with a triple $c=(\sigma, \beta,A)$
      1. a stack $\sigma$ of words $w_i$ from $S$
      2. a buffer $\beta$ of words $w_i$ from $S$
      3. a set of dependency arcs $A$ of the form $(w_i, r, w_j)$, where $w_i, w_j$ are from $S$, and $r$ describes a dependency relation.
    • It follows that for any sentence $S=w_0w_1…w_n$,
      1. an initial state $c_0$ is of the form $( w_0 {\sigma}, [w_1,…,w_n]{\beta}, \emptyset)$
      2. a terminal state has the form $({\sigma}, []{\beta}, A)$
  • Transitions
    • SHIFT: remove the first word in the buffer and push it on top of the stack
    • LEFT-ARC: add a dependency arc $(w_j, r, w_i)$ to the arc set $A$, where $w_i$ is the word second to the top of the stack and $w_j$ is the word at the top of the stack. Remove $w_i$ from the stack
    • RIGHT-ARC: dd a dependency arc $(w_i, r, w_j)$ to the arc set A, where $w_i$ is the word second to the top of the stack and $w_j$ is the word at the top of the stack. Remove $w_j$ from the stack


1.4 Neural Dependency Parsing

  • Feature Selection:
    • $S_{word}$ vector representations for some of the words in $S$ (and their dependents) at the top of the stack $\sigma$ and buffer $\beta$
    • $S_{tag}$: Part-of-Speech (POS) tags for some of the words in $S$. POS tags comprise a small, discrete set: $P={NN,NNP,NNS,DT,JJ,…}$
    • $S_{label}$: The arc-labels for some of the words in $S$. The arc-labels comprise a small discrete set, describing the dependency relation: $L={amod, tmod,nsubj,csubj,dobj,…}$
  • For each feature type, we will have a corresponding embedding matrix, mapping from the feature’s one hot encoding , to a $d-$dimensional dense vector representation.
  • For example:
    • $S_{word}$: top 3 words on stack and buffer, the first and second leftmost / right most children of the top two words on the stack. the leftmost of leftmost / rightmost of rightmost children of the top two words on the stack. 18 elements in total
    • $S_{tag}$: 18 corresponding tags
    • $S_{label}$: 12 in total.
  • Model:

Note 4

1 Language Models

1.1 Introduction

  • Language models compute the probability of occurrence of a number of words in a particular sequence.

  • The probability of a sequence of $m$ words ${w_1, …, w_m}$ is $P(w_1,…,w_m)$

  • It is usually conditioned on a window of $n$ previous words rather than all previous words: \(P(w_1, ..., w_m) = \prod_{i=1}^{m}P(w_i|w_1,...,w_{i-1}) \approx \prod_{i=1}^{m}P(w_i|w_{i-n},...,w_{i-1})\)

1.2 n-gram Language Models

  • One way to compute the above probabilities is n-gram.

  • Bigram: \(p(w_w|w_1) = \frac{count(w_1, w_2)}{count(w_1)}\)

  • Trigram: \(p(w_3|w_1,w_2) = \frac{count(w_1, w_2, w_3)}{count(w_1, w_2)}\)

  • Main issues with n-gram Language Models:

    1. Sparsity problems:
      • If $w_1, w_2$ never appear together, the numerator for trigram is 0. Add smoothing a small $\delta$.
      • If $w_1, w_2$ never appear together, the denominator for trigram is 0. Can condition on $w_2$ alone, called backoff.
      • Increasing $n$ makes sparsity problems worse. Typically, $n\leq 5$
    2. Storage problems with n-gram Language Models:
      • Store counts for all n-grams is large and it grow fast as corpus size grows.

1.3 Window-based Neural Language Model

  • Bengio et al Neural Probabilistic Language Model, first large-scale deep learning for natural language processing model.
  • It learns a distributed representation of words, along with the probability function for word sequences expressed in terms of these representations.

bengio neural language model

  • Consider window-based model as Language Model conditioned on fixed size history of information.

2 Recurrent Neural Networks (RNN)

  • RNN can conditioning the Language Model on all previous words.

\(h_t = \sigma(W^{(hh)}h_{t-1}+W^{(hx)}x_{[t]}) \\ \hat{y}_t = softmax(W^{(S)}h_t)\)

  • $W^{(hh)}$ and $W^{(hx)}$ are shared across time, applied repeatedly at each step.
    • Thus model learn fewer parameters
    • And is independent of the length of the input sequence
    • Defeating the curse of dimensionality

2.1 RNN Loss and Perplexity

  • RNN trained with cross entropy loss
  • At a given time $t$ the loss is
\[J^{(t)}(\theta) = -\sum_{j=1}^{|V|}y_{t,j}\cross log(\hat{y}_{t,j})\]
  • Aggregate over the corpus \(J = \frac{1}{T}\sum_{t=1}^{T}J^{(t)}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{|V|}y_{t.j}\cross log(\hat{y}_{t, j})The\)

  • The perplexity is defined as: $2^{J}$

    Lower value imply more confident in predicting the next word in sequence

2.2 Advantages, Disadvantages and Applications of RNNs

  • Advantages:
    1. process input sequences of any length
    2. model size independent of sequence lengths
    3. computation for step $t$ can (in theory) use information from many steps back
  • Disadvantages:
    1. computation is slow - because it is sequential, can’t be parallelized
    2. in practice, it is difficult to access information from many steps back due to vanishing and exploding gradients.
  • RNNs can be used for many tasks, such as tagging (POS, NER), sentence classification(category, sentiment), and as encoder or decoder modules.

2.3 Vanishing Gradient & Gradient Explosion Problems

\[\begin{align} \frac{\partial E}{\partial W} &= \sum_{t=1}^{T}\frac{\partial E_t}{\partial W} \\ \frac{\partial E_t}{\partial W} &= \sum_{k=1}^{t}\frac{\partial E_t}{\partial y_t}\frac{\partial y_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W} \\ \frac{\partial h_t}{\partial h_k} &= \prod_{j=k+1}^{t}\frac{\partial h_{j}}{\partial h_{j-1}}=\prod_{j=k+1}^{t}W^T\cross diag[f'(j_j-1)] \\ \frac{\partial E}{\partial W} &= \sum_{t=1}^{T}\sum_{k=1}^{t}\frac{\partial E_t}{\partial y_t}\frac{\partial y_t}{\partial h_t}(\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}})\frac{\partial h_k}{\partial W} \end{align}\]

$\partial h_{j} / \partial h_{j-1}$ is the Jacobian matrix for h. \(\|\frac{\partial h_j}{\partial h_{j-1}}\| \leq \|W^T\| \|diag[f'(h_{j-1})] \| \leq \beta_W \beta_h\) Where $\beta_W$ and $\beta_h$ are the upper bound values for the norms of the two matrices. $f’(h_{j-1})$ can only be as large as 1, given sigmoid activation. \(\|\frac{\partial h_t}{\partial h_k}\| = \| \prod_{\partial h_{j-1}}^{\partial h_{j}} \| \leq (\beta_W\beta_h)^{t-k}\)

  • if $\beta_W\beta_h$ get smaller or larger than 1, with $t-k$ sufficiently large, the gradients blows or diminishes very quickly.
  • Gradient Explosion, i.e. gradient overflow the floating number limits becomes NaN, can be detected.
  • Gradient Vanish, i.e. gradient becomes 0, can be undetected. It will be hard to tell whether it is that:
    • No relationship last that long, or
    • We can’t capture this due to gradient vanishing.

2.4 Solution to the Exploding & Vanishing Gradients

  1. For exploding gradient: gradient clipping \(\hat{g} \leftarrow \frac{\partial E}{\partial W} \\ if \quad \|\hat{g}\| \geq threshold \quad then \\ \hat{g} \leftarrow \frac{threshold}{\|\hat{g}\|}\hat{g}\)

  2. For vanishing gradient:

    1. identity matrix initialization
    2. use ReLU activations

2.5 Deep Bidirectional RNNs

  1. Bidirectional \(\begin{align} \overrightarrow{h_t} &= f(\overrightarrow{w_{x_t}} + \overrightarrow{V} \, \, \, \overrightarrow{h}_{t-1} + \overrightarrow{b}) \\ \overleftarrow{h_t} &= f(\overleftarrow{w_{x_t}} + \overleftarrow{V} \, \, \, \overleftarrow{h}_{t-1} + \overleftarrow{b}) \\ \hat{y}_t &= g(Uh_t + c) = g(U[\overrightarrow{h}_t; \overleftarrow(h)_t] + c) \end{align}\)

  2. Stacked RNN \(\overrightarrow{h}^{(i)}_t = f(\overrightarrow{W}^{(i)} h^{(t-1)} + \overrightarrow{V}^{(i)} \overrightarrow{h}^{(i)}_{t-1} + \overrightarrow{b}^{(i)})\)

  3. The two can be combined

stack bidirectional rnn

2.6 Application: RNN Translation Model

  • Encoder-Decoder RNN for translation

  1. Encoder RNN process the input sequence and produce a final hidden state, that is feed into decoder RNN as the initial hidden state.
  2. The decoder RNN gets a <BOS> token as initial input and treat each of the previous prediction as input to the current step.
  • Extensions:

    • Use dedicated Encoder and Decoder

    • Use Bidirectional RNN

    • Use Stacked RNN

    • Compute every hidden state in the decoder using three inputs

      1. Previous hidden

      2. Previous step prediction

      3. Last hidden from encoder

    • Sometimes worth trying instead of $ABC\rightarrow XY$, do $CBA\rightarrow XY$

3. Gated Recurrent Units

  • Although RNNs can theoretically capture long-term dependencies, they are very hard to actually train to do this.

  • Gated recurrent unites are designed in a manner to have more persistent memory thereby making it easier for RNNs to capture long-term dependencies. \(\begin{align} \text{update gate} \qquad z_t &= \sigma(W^{(z)}x_t + U^{(z)}h_{t-1}) \\ \text{reset gate} \qquad r_t &= \sigma(W^{(r)}x_t + U^{(r)}h_{t-1}) \\ \text{new memory} \qquad \tilde{h}_t &= tanh(r_t\circ Uh_{t-1} + Wx_t) \\ \text{hidden state} \qquad h_t &= (1-z_t)\circ\tilde{h}_t+z_t\circ h_{t-1} \\ \end{align}\)


4. Long-Short-Term-Memories

  • The motivation for LSTM is similar to GRU, the architecture of unit is a bit different.

  • LSTM has a separate cell state that is kept hidden. Hidden state is just a exposure of that cell state. \(\begin{align} \text{input gate} \qquad i_t &= \sigma(W^{(i)}x_t + U^{(i)}h_{t-1}) \\ \text{forget gate} \qquad f_t &= \sigma(W^{(f)}x_t + U^{(f)}h_{t-1}) \\ \text{output gate} \qquad o_t &= \sigma(W^{(o)}x_t + U^{(o)}h_{t-1}) \\ \text{new memeory cell} \qquad \tilde{c}_t &= tanh(W^{(c)}x_t + U^{(c)}h_{t-1}) \\ \text{final state cell} \qquad c_t &= f_t \circ c_{t-1} + i_t\circ \tilde{c}_t \\ h_t &= o_t \circ tanh(c_t) \\ \end{align}\)


Note 5

1 Neural Machine Translation with Seq2Seq

  • Type of Seq2Seq usage:
    1. Translation
    2. Conversation
    3. Summarization

1.1 Brief Note on Historical Approaches

  • Before neural network approach, translation systems were based on probabilistic models constructed from:
    1. a translation model, telling us what a sentence / phrase in a source language most likely translates into
    2. a language model, telling us how likely a given sentence / phrase is overall
  • Phrase-based systems were most common prior to Seq2Seq. It is difficult to handle long-term dependencies with phrase-based system.

1.2 Sequence-to-sequence Basics

  • Appeared in 2014 for English-French translation

  • It is an end-to-end model made up of two RNNs:

    • an encoder, which takes the model’s input sequence and encodes it into a fixed-size “context vector”
    • an decoder, which uses the context vector from encoder as a “seed” from which to generate an output sequence.

1.3 Seq2Seq architecture - encoder

  • Use a encoder RNN to process the input token sequence and the final hidden state will become the context vector.
  • Usually the Seq2Seq encoders will process the input sequence in reverse. It is done under the reasoning that the last thing the encoder sees will (roughly) corresponds to the first thing that the model ouputs. this makes it easier for the decoder to “get started”

rnn encoder

1.4 Seq2Seq architecture - decoder

  • Use a decoder RNN to generate output sequence based of the words it generated so far, as well as the context.
    • The decoding process is kicked off with a special token <EOS> (if use the same RNN for both encoding and decoding) or <BOS> (If use dedicated networks)

1.5 Bidirectional RNNs

  • Use bidirectional RNNs to traversing a sequence in both directions and concatenating the resulting final hidden states and cell states.

2 Attention Mechanism

2.1 Motivation

  • Bahdanau et al. 2016 proposing attention to address the issue with a single “context vector” for sequence-to-sequence models.
    1. Different parts of an input have different levels of significance
    2. Different parts of the output may consider different parts of the input “important”
  • Attention mechanisms make use of this observation by providing the decoder network with a chance to look at the entire input sequence at every decoding step. The decoder can then decide what input words are important at any point in time.

2.2 Bahdanau et al. NMT model

  1. Encoder: Use RNN to process input sequence and collect $(h_1,…,h_n)$

  2. Decoder: $s_i=f(s_{i-1}, y_{i-1}, c_i)$ Where $s_{i-1}$ is the previous hidden, $y_{i-1}$ is the previous prediction, $c_i$ is the context from the input sequence that is relevant for the current decoding job. It is calculated as follow \(\begin{align} e_{i,j} &= a(s_{i-1}, h_j) \\ \alpha_{i,j} &= \frac{exp(e_{i,j})}{\sum_{k=1}^{n}exp(e_{i,k})} \\ c_i &=\sum_{j=1}^{n} \alpha_{i,j}h_j \\ \end{align}\) That is:

    1. We use the hidden state $s_{i-1}$ to query the encoder hidden states, and create a score for each.
    2. Use softmax to convert these scores into a proper probability
    3. Finally compute the context vector as weighted average of the encoder hidden states.
  • Ideally the softmax will put higher value on important stuff.

2.3 Connection with translation alignment

  • The attention-based model learns to assign significance to different parts of the input for each step of the ouput.
  • In the context of translation, attention can be thought of as “alignment”
  • Bahdanau et al. argue that the attention scores $\alpha_{i,j}$ at decoding step $i$ signify the words in the input sequence that align with the word $i$ in the output sequence.

2.4 Performance on long sentence

  • As the sequence length grows, models do not use attention will miss information.

2.5 Global / Local Attention

  • Many ways to do attention. Luong et al’s attention are using current hidden state to do attention and modify what to output at the current step. Sort of post activation attention, compare to Bahdanau’s pre activation attention.
  • And they added some control after attention about how much around the aligned position should the information be aggregated as context.

global attention

local attention

3. Google’s multilingua NMT

  1. A Seq2Seq model that accepts as input a sequence of words and a token specifying what language to translate into.
  2. The model uses shared parameters to translate into any target language.
  3. It enabled “zero-short translation”, in which we can translate between two languages for which we have no translation training data. They can actually generate reasonable Japanese-Korean translations where they only had Japanese-English and Korean-English in training data.
  4. This is showing: the decoding process is not language-specific, the model is in fact maintaining an internal representation of the input / output sentences independent of the actual languages involved.

4 Sequence model decoders

  • Consider a model that computes the probability $P(\bar{s}|s)$ of a translation $\bar{s}$ given the original sequence $s$. We want to pick the translation $\bar{s}*$ that has the best probability. \(\bar{s}*=argmax_{\bar{s}}(P(\bar{s}|s))\)

  • As the search space can be huge$ V ^{len(s)}$, we need to shrink its size:
    1. Exhaustive search: compute the probability of every possible sequence, and chose the one with the highest probability. Too complex, NP_complete.

    2. Ancestral sampling: at time step $t$, we sample $x_t$ based on the conditional probability of the word at step $t$ given the past. \(x_t \sim P(x_t|x_1, \dots, x_n)\) High variance

    3. Greedy search: at time step $t$, we pick the most probable token. \(x_t=argmax_{\tilde{x_t}}P(\tilde{x_t}|x1,\dots,x_n)\) Efficient, but if we make a mistake at one time step, the rest of the sentence could be heavily impacted.

    4. Beam search: maintain $K$ candidates at each time step. \(H_t =\{(x_1^1,\dots,x_t^1),\dots,(x_1^K,\dots,x_t^K)\}\)

      and compute $H_{t+1}$ by expanding $H_{t}$ and keeping the best $K$ candidates. In other words, we pick the best $K$ sequence in the following set \(\tilde{H}_{t+1} = \bigcup_{k=1}^{K}H^{\tilde{k}}_{t+1}\) where \(H^{\tilde{K}}_{t+1} = \{(x^k_1,\dots ,x^{k}_t, v_1),\dots,(x^k_1,\dots ,x^{k}_t, v_{|V|})\}\) As we increase $K$, we gain precision and we are asymptotically exact. Beam search is the most commonly used technique in NMT.

5 Evaluation of Machine Translation Systems