Natural Language Processing
Note 1
1 Introduction to Natural Language Processing
What is natural language
 A discrete / symbolic / categorical system.
 The word is a signifier that maps to a signified (an idea or thing).
 The symbols of language can be encoded in several modalities: voice, gesture, writing, etc., that are transmitted via continuous signals to the brain.
Tasks
 Easy:
 Spell Checking
 Keyword Search
 Finding Synonyms
 Medium
 Parsing information from websites, documents, etc.
 Hard
 Machine Translation
 Semantic Analysis
 Coreference
 Question Answering
How to represent words?
 Vectors.
 To perform well on most NLP tasks, we first need to have some notion of similarity and difference between words.
 Distance measures: Jaccard, Cosine, Euclidean, etc.
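A minimal sketch (toy vectors and token sets, not from the notes) of two of the similarity measures listed above:

```python
# Compare cosine similarity (on vectors) and Jaccard similarity (on token sets).
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: a.b / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def jaccard_similarity(s, t):
    """|intersection| / |union| of two token sets."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

# Parallel vectors have cosine similarity 1; orthogonal vectors have 0.
print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # 1.0
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0
print(jaccard_similarity(["the", "cat", "sat"], ["the", "dog", "sat"]))  # 0.5
```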
2 Word Vectors
 Want to encode word tokens each into some vector that represents a point in some sort of “word” space.
 There perhaps exists some $N$-dimensional space, with $N \ll |V|$ (the vocabulary size), that is sufficient to encode all semantics of the language.
 Each dimension would encode some meaning that we transfer using speech.
 E.g. semantic dimensions might encode tense, count, gender.
 One-hot vector:

represent every word as an $\mathbb{R}^{|V| \times 1}$ vector with all 0s and a single 1. There is no notion of similarity, since the dot product between any two distinct word vectors is 0.
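This can be seen with a toy vocabulary (illustrative, not from the notes): every pair of distinct one-hot vectors is orthogonal, so "similar" words look exactly as unrelated as dissimilar ones.

```python
# One-hot encodings carry no similarity signal: dot products between
# distinct word vectors are always 0.
vocab = ["cat", "dog", "hotel", "motel"]

def one_hot(word):
    v = [0] * len(vocab)
    v[vocab.index(word)] = 1
    return v

dot = lambda a, b: sum(x * y for x, y in zip(a, b))

print(dot(one_hot("hotel"), one_hot("motel")))  # 0 -- related words look unrelated
print(dot(one_hot("cat"), one_hot("cat")))      # 1
```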

3 SVD Based Methods
 Loop over corpus and accumulate word cooccurrence counts in form of a matrix $X$
 Then perform singular value decomposition on $X$ to get a $USV^T$ decomposition.
 Use rows of $U$ as word embeddings.
3.1 Word-Document Matrix
 Assumption: words that are related will often appear in the same documents.

$X$: word-document matrix (also known as a document-term matrix), $\mathbb{R}^{|V| \times M}$. It scales with the number of documents ($M$).
3.2 Window based Cooccurrence Matrix
 Assumption: words that are related will appear close to each other in text

$X$: word-word cooccurrence matrix (count the number of times each word appears inside a window of a particular size around the word of interest), $\mathbb{R}^{|V| \times |V|}$. $X$ is an affinity matrix. Its size is fixed by the vocabulary, but the window size is a hyperparameter.
3.3 Applying SVD to the cooccurrence matrix
 Observe the values on the diagonal entries of $S$ and cut off at some index $k$ based on the desired percentage of variance captured.
 Select first $k$ columns of $U$ to be used as word vectors.
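The two steps above can be sketched with numpy on a tiny corpus (the corpus and the choice of $k$ are illustrative, not from the notes):

```python
# Build a window-based co-occurrence matrix, run SVD, and keep the first
# k columns of U as word vectors.
import numpy as np

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts with a window of size 1 on each side.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

U, S, Vt = np.linalg.svd(X)
k = 2                      # cut-off chosen by desired variance captured
embeddings = U[:, :k]      # rows of U = word vectors
print(embeddings.shape)    # (7, 2)
```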
 Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic (part of speech) information but are associated with many other problems:
 Dimensions of matrix change very often. (new words, more documents)
 Matrix is extremely sparse since most words do not cooccur
 Matrix is very high dimensional $(\approx 10^6 \times 10^6)$
 Quadratic cost to train (SVD)
 Requires the incorporation of some hacks on $X$ to account for the drastic imbalance in word frequency. (TF-IDF / BM25)
 Some solutions to above problems:
 Ignore function words such as “the”, “he”, “has”, etc
 Apply a ramp window: weight cooccurrence counts based on the distance between the words in the document.
 Use Pearson correlation and set negative counts to 0 instead of using just raw count.
 Iteration based methods solve many of these issues
4 Iteration Based Methods: Word2Vec
 Instead of computing and storing global information about some huge corpus, we try to create a model that is learned one iteration at a time and is eventually able to encode the probability of a word given its context.
 Context of a word is the set of $m$ surrounding words.
 Assumption: distributional similarity  similar words have similar context.
 Train a model whose parameters are the word vectors against a certain objective.
 Word2vec is a software package that includes:
 2 algorithms: continuous bag-of-words (CBOW) and skip-gram.
 CBOW aims to predict a center word from the surrounding context
 Skip-gram predicts the distribution of context words from a center word
 2 training methods: negative sampling and hierarchical softmax
 Negative sampling defines an objective by sampling negative examples
 Hierarchical softmax defines an objective using an efficient tree structure to compute probabilities for all the vocabulary.
4.1 Language Models (Unigrams, Bigrams, etc)
 Create a model that assigns a probability to a sequence of tokens.
 Assumption: word occurrences are completely independent.
 Unigram model: $P(w_1, w_2, …, w_n) = \prod_{i=1}^{n}P(w_i)$
 Assumption: the probability of the sequence depends on the pairwise probability of a word in the sequence and the word next to it.

Bigram model: $P(w_1,w_2,\dots,w_n) = \prod_{i=2}^{n}P(w_i|w_{i-1})$
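The bigram probabilities can be estimated from counts; a minimal sketch on a toy corpus (illustrative, not from the notes):

```python
# Estimate P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
unigrams = Counter(tokens[:-1])             # counts of w_{i-1} positions
bigrams = Counter(zip(tokens, tokens[1:]))  # counts of adjacent pairs

def p_bigram(w_prev, w):
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_bigram("the", "cat"))  # 2/3: "the" is followed by "cat" twice, "mat" once
```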

4.2 Continuous Bag of Words Model (CBOW)

Treat {“The”, “cat”, “over”, “the”, “puddle”} as a context and, from these words, be able to predict or generate the center word “jumped”.

For each word, we try to learn two vectors: one to use when the word is a context word, another to use when it is the center word.

$\mathit{V} \in \mathbb{R}^{n \times |V|}$ (input matrix; columns are context word vectors)
$\mathit{U} \in \mathbb{R}^{|V| \times n}$ (output matrix; rows are center word vectors)


How the model works:
 Generate one-hot word vectors for the input context of size $m$: $(x^{(c-m)},\dots,x^{(c-1)},x^{(c+1)},\dots,x^{(c+m)})$
 Get embedded word vectors for the context: $(v_{c-m}=Vx^{(c-m)},\dots,v_{c+m}=Vx^{(c+m)})$
 Average these vectors to get $\hat{v} = \frac{v_{c-m}+v_{c-m+1}+\dots+v_{c+m}}{2m}$

Generate a score vector $z=U\hat{v} \in \mathbb{R}^{|V|}$. As the dot product of similar vectors is higher, training will push similar words close to each other (ideally the predicted center word and the true center word) in order to achieve a high score. 
Turn the scores into probabilities $\hat{y} \in \mathbb{R}^{|V|}$ with softmax, where $softmax(z)_i = \frac{e^{z_i}}{\sum_{k=1}^{|V|}e^{z_k}}$ for $i \in 1,\dots,|V|$. 
We want the predicted probabilities $\hat{y}$ to match the true probabilities $y\in\mathbb{R}^{|V|}$, which is the one-hot encoded center word.

Loss function:

Cross entropy $H(\hat{y}, y) = -\sum_{j=1}^{|V|}y_j\log(\hat{y}_j)$. It gives us a good measure of distance.


We optimize $U$ and $V$ via stochastic gradient descent (SGD).
 $U\leftarrow U-\alpha\nabla_U J; \quad V\leftarrow V-\alpha\nabla_V J$
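The CBOW forward pass above can be sketched in numpy with random toy parameters (dimensions and indices are illustrative; `V_emb` and `U` play the roles of the context and center matrices $V$ and $U$):

```python
# One CBOW forward pass: average context embeddings, score against U,
# softmax, then cross-entropy against the true center word.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n = 10, 4
V_emb = rng.normal(size=(n, vocab_size))   # context ("input") vectors, n x |V|
U = rng.normal(size=(vocab_size, n))       # center ("output") vectors, |V| x n

context_ids = [1, 2, 4, 5]                 # 2m = 4 context word indices
v_hat = V_emb[:, context_ids].mean(axis=1) # average of context embeddings
z = U @ v_hat                              # score vector, length |V|
y_hat = np.exp(z) / np.exp(z).sum()        # softmax probabilities

center = 3                                 # true center word index
loss = -np.log(y_hat[center])              # cross-entropy reduces to this term
print(y_hat.sum())                         # 1.0 (up to float error)
```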
4.3 Skip-Gram Model

Given the center word “jumped”, the model will be asked to predict or generate the surrounding words.

Essentially, we are swapping $x$ and $y$ in the model.

Steps:

Generate the input one-hot vector $x\in\mathbb{R}^{|V|}$ for the center word.
 Look up the embedding vector $v_c=Vx\in\mathbb{R}^n$

Generate a score vector $z=Uv_c\in \mathbb{R}^{|V|}$
 Turn the score vector into probabilities $\hat{y}$ with softmax
 Desire the output to match the targets $y^{(c-m)},\dots,y^{(c+m)}$, the one-hot vectors of the actual context words
 We invoke a Naive Bayes assumption (conditional independence): given the center word, all output words are completely independent

4.4 Negative Sampling

One problem with the objective function is that it is a summation over $|V|$. 
One solution is to approximate it.

For each training step, we sample several negative examples from a noise distribution $P_n(w)$ whose probabilities match the ordering of the frequency of the vocabulary.

Considering a pair of a word and its context $(w, c)$, let:

$P(D=1|w,c)$ be the probability that the pair came from the corpus data. 
$P(D=0|w,c)$ be the probability that the pair did not come from the corpus data.


Model $P(D=1|w,c) = \sigma(v_c^Tv_w) = \frac{1}{1+ e^{-v_c^Tv_w}}$ 
MLE approach to maximize these two probabilities:
\[\begin{align} \theta &= \underset{\theta}{argmax} \prod_{(w,c)\in D} P(D=1|w,c,\theta)\prod_{(w,c)\in \tilde{D}} P(D=0|w,c,\theta) \\ &= \underset{\theta}{argmax} \prod_{(w,c)\in D} P(D=1|w,c,\theta) \prod_{(w,c)\in \tilde{D}}(1-P(D=1|w,c,\theta)) \\ &= \underset{\theta}{argmax} \sum_{(w,c)\in D}\log P(D=1|w,c,\theta) + \sum_{(w,c)\in\tilde{D}}\log(1-P(D=1|w,c,\theta)) \\ &= \underset{\theta}{argmax} \sum_{(w,c)\in D}\log\frac{1}{1+\exp(-u_w^Tv_c)} + \sum_{(w,c)\in\tilde{D}}\log(1-\frac{1}{1+\exp(-u_w^Tv_c)}) \\ &= \underset{\theta}{argmax} \sum_{(w,c)\in D}\log\frac{1}{1+\exp(-u_w^Tv_c)} + \sum_{(w,c)\in\tilde{D}}\log(\frac{1}{1+\exp(u_w^Tv_c)}) \\ \end{align}\]  Maximizing the likelihood is the same as minimizing the negative log likelihood.
 $\tilde{D}$ is a false or negative corpus, which we can generate on the fly by randomly sampling from the word bank.
 Now with negative sampling the skipgram model’s objective function becomes:
 For CBOW, the objective function will be:
 For $P_n(w)$, what seems to work best is the Unigram Model raised to the power of 3/4.
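The effect of the 3/4 power can be seen on toy frequency counts (illustrative, not from the notes): it flattens the unigram distribution so rare words are sampled more often than their raw frequency would suggest.

```python
# Noise distribution P_n(w): unigram counts raised to the 3/4 power,
# then renormalized.
import numpy as np

counts = np.array([100.0, 10.0, 1.0])   # toy word frequencies
p_unigram = counts / counts.sum()
p_noise = counts ** 0.75
p_noise /= p_noise.sum()

# The rarest word's sampling probability rises from ~0.009 to ~0.026,
# while the most frequent word's share drops.
print(p_unigram)
print(p_noise)
```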
4.5 Hierarchical Softmax
 Hierarchical softmax is an efficient alternative to the normal softmax.
 In practice, hierarchical softmax tends to be better for infrequent words, while negative sampling works better for frequent words and lower dimensional vectors.
 Hierarchical softmax uses a binary tree to represent all words in the vocabulary. Each leaf is a word, and there is a unique path from root to leaf.

There is no output representation for words. Instead, each node of the tree is associated with a vector that the model is going to learn. The probability of a word $w$ given a vector $w_i$, $p(w|w_i)$, is equal to the probability of a random walk starting at the root and ending at the leaf node corresponding to $w$. 
Cost to compute the probability is $O(\log(|V|))$
 $n(w, 1)$ is the root, $n(w, L(w))$ is the parent node of $w$
 $ch(n)$ indicates the left child of node $n$
 $[n(w, j+1)=ch(n(w,j))]$ will be 1 if the next node is the left child and $-1$ if it is the right child; this ensures proper normalization.
 To train the model, our goal is to minimize the negative log likelihood by updating vectors of the nodes in the binary tree.
Note 2
1 Global Vectors for Word Representation (GloVe)
1.1 Comparison with Previous Methods
 Count based
 Rely on Matrix factorization. (LSA, HAL)
 Effectively leverage global statistical information
 OK for word similarity tasks, but perform poorly on other tasks → the vector space structure is suboptimal.
 Shallow window-based
 Skipgram and CBOW
 Learn embedding by making predictions in local context windows
 Capture complex linguistic patterns beyond word similarity
 Fail to make use of the global cooccurrence statistics
 GloVe
 Using global statistics to predict the probability of word pair cooccurrence
1.2 Cooccurrence Matrix
 $X_{ij}$ indicates the number of times word $j$ occurs in the context of word $i$
 $X_i=\sum_kX_{ik}$ is the number of times any word $k$ appears in the context of word $i$

$P_{ij} = P(w_j|w_i) = \frac{X_{ij}}{X_{i}}$ is the probability of word $j$ appearing in the context of word $i$
1.3 Least Squares Objective

The softmax computes the probability that word $j$ appears in the context of word $i$: \(Q_{ij} = \frac{\exp(u_j^Tv_i)}{\sum_{w=1}^{W}\exp(u_w^Tv_i)}\)

Global cross-entropy objective function trained in an online, stochastic fashion: \(J = -\sum_{i \in \text{corpus}}\sum_{j \in \text{context}(i)}\log Q_{ij}\)

It is more efficient to group together the same values for $i$ and $j$: \(J=-\sum_{i=1}^{W}\sum_{j=1}^{W}X_{ij}\log Q_{ij}\) where the cooccurring frequency is given by the cooccurrence matrix $X$

One drawback of the crossentropy loss is that it requires the distribution $Q$ to be properly normalized, which involves the expensive summation over the entire vocabulary.

Instead, we use a least squares objective in which the normalization factors in $P$ and $Q$ are discarded: \(\hat{J}=\sum_{i=1}^{W}\sum_{j=1}^{W}X_{ij}(\hat{P}_{ij}-\hat{Q}_{ij})^2\) where $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(u_j^Tv_i)$ are the unnormalized distributions.

New problem: $X_{ij}$ often takes on very large values and makes the optimization difficult.

Solution: minimize the squared error of the logarithms of $\hat{P}$ and $\hat{Q}$: \(\begin{align} \hat{J} &= \sum_{i=1}^{W} \sum_{j=1}^{W}X_i(\log(\hat{P}_{ij}) - \log(\hat{Q}_{ij}))^2\\ &=\sum_{i=1}^{W} \sum_{j=1}^{W}X_i(u_j^Tv_i - \log X_{ij})^2 \\ &=\sum_{i=1}^{W} \sum_{j=1}^{W}f(X_{ij})(u_j^Tv_i - \log X_{ij})^2 \end{align}\) where $f(X_{ij})$ is a more general weighting function than the raw $X_i$
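The GloVe paper's choice for the weighting function $f$ can be sketched as follows (the functional form and the defaults $x_{max}=100$, $\alpha=3/4$ come from the original GloVe paper, which these notes summarize):

```python
# GloVe weighting: f(x) = (x / x_max)^alpha for x < x_max, else 1.
# Down-weights rare co-occurrences and caps the influence of very
# frequent ones.
def glove_weight(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(1))     # small weight for a rare pair
print(glove_weight(1000))  # 1.0 -- capped for frequent pairs
```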
2 Evaluation of Word Vectors
2.1 Intrinsic Evaluation
 Evaluate word vectors on specific intermediate subtasks, such as analogy completion for question answering. Why? If we evaluated with the final downstream task, the entire pipeline would have to be retrained from scratch each time we tweak the word vector learning.
 Requirement: the intrinsic evaluation has a positive correlation with the final task performance.
2.2 Extrinsic Evaluation
 Evaluate word vectors on the real task.
 Typically, optimizing over an underperforming extrinsic evaluation system does not allow us to determine which specific subsystem is at fault, and this motivates the need for intrinsic evaluation.
2.3 Intrinsic Evaluation Example: Word Vector Analogies

$a:b::c:?$

find $d$ such that it maximizes the cosine similarity \(d=\underset{i}{argmax}\frac{(x_b-x_a+x_c)^Tx_i}{\|x_b-x_a+x_c\|}\)

semantic: capital city → country

syntactic: adjective → its superlative form
2.4 Analogy Evaluations
 Performance is heavily dependent on the model used for word embedding.
 Performance increases with larger corpus sizes
 Performance is lower for extremely low dimensional word vectors
 Extremely high dimensional vectors with CBOW/SG is reasonably robust to noise
2.5 Correlation Evaluation
 Another simple way to evaluate the quality of word vectors is by asking humans to assess the similarity between two words on a fixed scale and then comparing this with the cosine similarity between the corresponding word vectors.
2.6 Dealing With Ambiguity
 Improving word representations via global context and multiple word prototypes. Huang et al. 2012
3 Training for Extrinsic Tasks
3.1 Problem Formulation
 Most NLP extrinsic tasks can be formulated as classification tasks.
 Sentiment analysis: given a text, classify it to be negative or positive.
 Named-entity recognition (NER): given a context and a center word, classify the center word into one of many classes.
3.2 Retraining Word Vectors
 If we retrain word vectors using the extrinsic task, we need to ensure that the training set is large enough to cover most words from the vocabulary.
 When we retrain word vectors over a small subset of the vocabulary, those words are shifted in the word space, and as a result, performance on the final task could actually decrease.
3.3 Softmax Classification and Regularization

Softmax classification: \(p(y_j=1|x) = \frac{\exp(W_j\cdot x)}{\sum_{c=1}^{C}\exp(W_c\cdot x)}\)

Using the cross-entropy loss: \(-\sum_{j=1}^{C}y_j\log(p(y_j=1|x)) = -\sum_{j=1}^{C}y_j \log(\frac{\exp(W_j\cdot x)}{\sum_{c=1}^{C}\exp(W_c\cdot x)})\)

Since there is only one correct answer, let $k$ be its index. The loss function simplifies to: \(-\log(\frac{\exp(W_k\cdot x)}{\sum_{c=1}^{C}\exp(W_c\cdot x)})\)

Extend to a dataset of $N$ points: \(-\sum_{i=1}^{N}\log(\frac{\exp(W_{k(i)}\cdot x^{(i)})}{\sum_{c=1}^{C}\exp(W_c\cdot x^{(i)})})\)
 We have $C\cdot d+|V|\cdot d$ parameters to update each time. To avoid overfitting, we can pose the Bayesian belief that the parameters $(\theta)$ should be small in magnitude: \(-\sum_{i=1}^{N}\log(\frac{\exp(W_{k(i)}\cdot x^{(i)})}{\sum_{c=1}^{C}\exp(W_c\cdot x^{(i)})}) + \lambda \sum_{k=1}^{C\cdot d + |V|\cdot d}{\theta_k}^2\)
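A numpy sketch of the regularized objective for a single toy example (random `W`, `x`; the class index and $\lambda$ are illustrative): negative log-likelihood of the correct class plus $\lambda$ times the sum of squared parameters.

```python
# Softmax classification loss with L2 regularization on the weights.
import numpy as np

rng = np.random.default_rng(1)
C, d = 3, 5
W = rng.normal(size=(C, d))      # class weight vectors
x = rng.normal(size=d)           # input word (or window) vector
k = 1                            # index of the correct class
lam = 0.01                       # regularization strength

z = W @ x
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # numerically stable softmax
loss = -np.log(p[k]) + lam * np.sum(W ** 2)
print(loss)
```

Subtracting `z.max()` before exponentiating does not change the softmax but avoids overflow, a standard trick in practice.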
3.4 Window Classification
 Instead of using just the word of interest, use a sequence that is a center word preceded and succeeded by context words.
 Generally, narrower window sizes lead to better performance in syntactic tests while wider windows lead to better performance in semantic tests.
3.5 Nonlinear Classifiers
 Nonlinear decision boundaries can solve classification problems that are not linearly separable.
Note 3
1 Dependency Grammar and Dependency Structure
 Parse trees in NLP are used to analyze the syntactic structure of sentences.
 Constituency structures  phrase structure grammar to organize words into nested constituents.
 Dependency structures  shows which words depend on (modify or are arguments of) which other words.
 Dependency tree for the sentence: “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas.”
1.1 Dependency Parsing
 Task of analyzing the syntactic dependency structure of a given input sentence S.
 Output is a dependency tree/graph $G$ where the words are connected by typed dependency relations.
 Learning: Given a training set $D$ of sentences annotated with dependency graphs, induce a parsing model M that can be used to parse new sentences.
 Parsing: Given a parsing model M and a sentence S, derive the optimal dependency graph D for S according to M.
1.2 TransitionBased Dependency Parsing
 State machine which defines the possible transitions to create the mapping from the input sentence to the dependency tree.
 Learning: induce a model to predict the next transition
 Parsing: construct the optimal sequence of transitions
 Most transitionbased systems do not make use of a formal grammar.
1.3 Greedy Deterministic TransitionBased Parsing

Nivre (2003) showed this approach to be better in both performance and efficiency than traditional feature-based discriminative dependency parsers.
 States:
 For any sentence $S=w_0w_1…w_n$ a state can be described with a triple $c=(\sigma, \beta,A)$
 a stack $\sigma$ of words $w_i$ from $S$
 a buffer $\beta$ of words $w_i$ from $S$
 a set of dependency arcs $A$ of the form $(w_i, r, w_j)$, where $w_i, w_j$ are from $S$, and $r$ describes a dependency relation.
 It follows that for any sentence $S=w_0w_1…w_n$,

an initial state $c_0$ is of the form $([w_0]_{\sigma}, [w_1,\dots,w_n]_{\beta}, \emptyset)$  a terminal state has the form $(\sigma, []_{\beta}, A)$

 Transitions
 SHIFT: remove the first word in the buffer and push it on top of the stack
 LEFT-ARC: add a dependency arc $(w_j, r, w_i)$ to the arc set $A$, where $w_i$ is the word second to the top of the stack and $w_j$ is the word at the top of the stack. Remove $w_i$ from the stack
 RIGHT-ARC: add a dependency arc $(w_i, r, w_j)$ to the arc set $A$, where $w_i$ is the word second to the top of the stack and $w_j$ is the word at the top of the stack. Remove $w_j$ from the stack
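The three transitions can be sketched as pure functions over $(\sigma, \beta, A)$ states; the sentence, relation labels, and transition sequence below are illustrative, not from the notes:

```python
# Toy transition-based parsing: SHIFT / LEFT-ARC / RIGHT-ARC over
# (stack, buffer, arcs) states, as defined above.
def shift(stack, buffer, arcs):
    return stack + [buffer[0]], buffer[1:], arcs

def left_arc(stack, buffer, arcs, rel="nsubj"):
    wj, wi = stack[-1], stack[-2]            # top, second-to-top
    return stack[:-2] + [wj], buffer, arcs + [(wj, rel, wi)]

def right_arc(stack, buffer, arcs, rel="obj"):
    wj, wi = stack[-1], stack[-2]
    return stack[:-1], buffer, arcs + [(wi, rel, wj)]

state = (["ROOT"], ["she", "ate", "fish"], [])
state = shift(*state)                  # stack: ROOT she
state = shift(*state)                  # stack: ROOT she ate
state = left_arc(*state)               # ate -> she, pop "she"
state = shift(*state)                  # stack: ROOT ate fish
state = right_arc(*state)              # ate -> fish, pop "fish"
state = right_arc(*state, rel="root")  # ROOT -> ate, pop "ate"
print(state[2])
# [('ate', 'nsubj', 'she'), ('ate', 'obj', 'fish'), ('ROOT', 'root', 'ate')]
```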
1.4 Neural Dependency Parsing
 Feature Selection:
 $S_{word}$ vector representations for some of the words in $S$ (and their dependents) at the top of the stack $\sigma$ and buffer $\beta$
 $S_{tag}$: Part-of-Speech (POS) tags for some of the words in $S$. POS tags comprise a small, discrete set: $P=\{NN,NNP,NNS,DT,JJ,\dots\}$
 $S_{label}$: The arc-labels for some of the words in $S$. The arc-labels comprise a small discrete set, describing the dependency relation: $L=\{amod,tmod,nsubj,csubj,dobj,\dots\}$
 For each feature type, we will have a corresponding embedding matrix, mapping from the feature’s one-hot encoding to a $d$-dimensional dense vector representation.
 For example:
 $S_{word}$: the top 3 words on the stack and buffer; the first and second leftmost / rightmost children of the top two words on the stack; the leftmost-of-leftmost / rightmost-of-rightmost children of the top two words on the stack. 18 elements in total
 $S_{tag}$: 18 corresponding tags
 $S_{label}$: 12 in total.
 Model:
Note 4
1 Language Models
1.1 Introduction

Language models compute the probability of occurrence of a number of words in a particular sequence.

The probability of a sequence of $m$ words ${w_1, …, w_m}$ is $P(w_1,…,w_m)$

It is usually conditioned on a window of $n$ previous words rather than all previous words: \(P(w_1, \dots, w_m) = \prod_{i=1}^{m}P(w_i|w_1,\dots,w_{i-1}) \approx \prod_{i=1}^{m}P(w_i|w_{i-n},\dots,w_{i-1})\)
1.2 n-gram Language Models

One way to compute the above probabilities is with n-grams.

Bigram: \(p(w_2|w_1) = \frac{count(w_1, w_2)}{count(w_1)}\)

Trigram: \(p(w_3|w_1,w_2) = \frac{count(w_1, w_2, w_3)}{count(w_1, w_2)}\)

Main issues with n-gram Language Models:
 Sparsity problems:
 If $w_1, w_2, w_3$ never appear together, the numerator for the trigram is 0. Solution: add a small smoothing term $\delta$ to each count.
 If $w_1, w_2$ never appear together, the denominator for the trigram is 0. Solution: condition on $w_2$ alone, called backoff.
 Increasing $n$ makes sparsity problems worse. Typically, $n\leq 5$
 Storage problems with n-gram Language Models:
 Storing counts for all n-grams is expensive, and the count table grows quickly with corpus size.
1.3 Window-based Neural Language Model
 Bengio et al.’s A Neural Probabilistic Language Model was the first large-scale deep learning model for natural language processing.
 It learns a distributed representation of words, along with the probability function for word sequences expressed in terms of these representations.
 A window-based model can be seen as a Language Model conditioned on a fixed-size history of information.
2 Recurrent Neural Networks (RNN)
 An RNN can condition the Language Model on all previous words.
\(h_t = \sigma(W^{(hh)}h_{t1}+W^{(hx)}x_{[t]}) \\ \hat{y}_t = softmax(W^{(S)}h_t)\)
 $W^{(hh)}$ and $W^{(hx)}$ are shared across time, applied repeatedly at each step.
 Thus the model learns fewer parameters,
 is independent of the length of the input sequence,
 defeating the curse of dimensionality
2.1 RNN Loss and Perplexity
 RNNs are trained with a cross entropy loss
 At a given time step $t$, the loss is \(J^{(t)}(\theta) = -\sum_{j=1}^{|V|}y_{t,j}\log(\hat{y}_{t, j})\)

Aggregated over the corpus: \(J = \frac{1}{T}\sum_{t=1}^{T}J^{(t)}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{|V|}y_{t,j}\log(\hat{y}_{t, j})\)

The perplexity is defined as $2^{J}$.
A lower value implies more confidence in predicting the next word in the sequence.
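A quick numeric sketch of the relationship (toy probabilities, illustrative only): with a base-2 cross-entropy loss $J$, a model that assigns probability $p$ to every true next word has perplexity $1/p$.

```python
# Perplexity = 2**J, where J is the average base-2 cross-entropy.
import math

probs = [0.25, 0.25, 0.25]    # model's probability of each true next word
J = -sum(math.log2(p) for p in probs) / len(probs)
perplexity = 2 ** J
print(perplexity)  # 4.0 -- as uncertain as a uniform choice among 4 words
```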
2.2 Advantages, Disadvantages and Applications of RNNs
 Advantages:
 process input sequences of any length
 model size independent of sequence lengths
 computation for step $t$ can (in theory) use information from many steps back
 Disadvantages:
 computation is slow, because it is sequential and cannot be parallelized
 in practice, it is difficult to access information from many steps back due to vanishing and exploding gradients.
 RNNs can be used for many tasks, such as tagging (POS, NER), sentence classification(category, sentiment), and as encoder or decoder modules.
2.3 Vanishing Gradient & Gradient Explosion Problems
\[\begin{align} \frac{\partial E}{\partial W} &= \sum_{t=1}^{T}\frac{\partial E_t}{\partial W} \\ \frac{\partial E_t}{\partial W} &= \sum_{k=1}^{t}\frac{\partial E_t}{\partial y_t}\frac{\partial y_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W} \\ \frac{\partial h_t}{\partial h_k} &= \prod_{j=k+1}^{t}\frac{\partial h_{j}}{\partial h_{j-1}}=\prod_{j=k+1}^{t}W^T\times diag[f'(h_{j-1})] \\ \frac{\partial E}{\partial W} &= \sum_{t=1}^{T}\sum_{k=1}^{t}\frac{\partial E_t}{\partial y_t}\frac{\partial y_t}{\partial h_t}(\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}})\frac{\partial h_k}{\partial W} \end{align}\] $\partial h_{j} / \partial h_{j-1}$ is the Jacobian matrix for $h$: \(\left\|\frac{\partial h_j}{\partial h_{j-1}}\right\| \leq \|W^T\| \, \|diag[f'(h_{j-1})]\| \leq \beta_W \beta_h\) where $\beta_W$ and $\beta_h$ are upper bounds on the norms of the two matrices. $f'(h_{j-1})$ can only be as large as 1, given a sigmoid activation. \(\left\|\frac{\partial h_t}{\partial h_k}\right\| = \left\|\prod_{j=k+1}^{t}\frac{\partial h_{j}}{\partial h_{j-1}}\right\| \leq (\beta_W\beta_h)^{t-k}\)
 If $\beta_W\beta_h$ is smaller or larger than 1, then with $t-k$ sufficiently large, the gradient diminishes or blows up very quickly.
 Gradient explosion, i.e. the gradient overflows the floating-point limits and becomes NaN, can be detected easily.
 Vanishing gradients, i.e. gradients that shrink toward 0, can go undetected. It is then hard to tell whether:
 no relationship actually lasts that long, or
 we simply cannot capture it due to the vanishing gradient.
2.4 Solution to the Exploding & Vanishing Gradients

For exploding gradients: gradient clipping \(\hat{g} \leftarrow \frac{\partial E}{\partial W}; \quad \text{if } \|\hat{g}\| \geq threshold \text{ then } \hat{g} \leftarrow \frac{threshold}{\|\hat{g}\|}\hat{g}\)

For vanishing gradients:
 identity matrix initialization
 use ReLU activations
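The clipping rule above can be sketched in numpy (toy gradient; the threshold is illustrative): the gradient keeps its direction but its norm is capped.

```python
# Gradient clipping: rescale g when its norm exceeds the threshold.
import numpy as np

def clip_gradient(g, threshold=5.0):
    norm = np.linalg.norm(g)
    if norm >= threshold:
        g = (threshold / norm) * g
    return g

g = np.array([30.0, 40.0])         # norm 50, above the threshold
clipped = clip_gradient(g)
print(np.linalg.norm(clipped))     # 5.0 -- direction kept, magnitude capped
```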
2.5 Deep Bidirectional RNNs

Bidirectional \(\begin{align} \overrightarrow{h}_t &= f(\overrightarrow{W} x_t + \overrightarrow{V} \, \overrightarrow{h}_{t-1} + \overrightarrow{b}) \\ \overleftarrow{h}_t &= f(\overleftarrow{W} x_t + \overleftarrow{V} \, \overleftarrow{h}_{t+1} + \overleftarrow{b}) \\ \hat{y}_t &= g(Uh_t + c) = g(U[\overrightarrow{h}_t; \overleftarrow{h}_t] + c) \end{align}\)

Stacked RNN \(\overrightarrow{h}^{(i)}_t = f(\overrightarrow{W}^{(i)} h^{(i-1)}_t + \overrightarrow{V}^{(i)} \overrightarrow{h}^{(i)}_{t-1} + \overrightarrow{b}^{(i)})\)

The two can be combined
2.6 Application: RNN Translation Model
 EncoderDecoder RNN for translation
 The encoder RNN processes the input sequence and produces a final hidden state, which is fed into the decoder RNN as its initial hidden state.
 The decoder RNN gets a <BOS> token as initial input and treats each previous prediction as the input to the current step.

Extensions:
 Use dedicated Encoder and Decoder networks
 Use Bidirectional RNNs
 Use Stacked RNNs
 Compute every hidden state in the decoder using three inputs:
 previous hidden state
 previous step prediction
 last hidden state from the encoder
 Sometimes it is worth trying to reverse the source sequence: instead of $ABC\rightarrow XY$, do $CBA\rightarrow XY$

3. Gated Recurrent Units

Although RNNs can theoretically capture longterm dependencies, they are very hard to actually train to do this.

Gated recurrent units are designed to have more persistent memory, thereby making it easier for RNNs to capture long-term dependencies. \(\begin{align} \text{update gate} \qquad z_t &= \sigma(W^{(z)}x_t + U^{(z)}h_{t-1}) \\ \text{reset gate} \qquad r_t &= \sigma(W^{(r)}x_t + U^{(r)}h_{t-1}) \\ \text{new memory} \qquad \tilde{h}_t &= tanh(r_t\circ Uh_{t-1} + Wx_t) \\ \text{hidden state} \qquad h_t &= (1-z_t)\circ\tilde{h}_t+z_t\circ h_{t-1} \\ \end{align}\)
4. Long Short-Term Memories

The motivation for LSTM is similar to GRU, but the architecture of the unit is a bit different.

LSTM has a separate cell state that is kept hidden; the hidden state is just an exposure of that cell state. \(\begin{align} \text{input gate} \qquad i_t &= \sigma(W^{(i)}x_t + U^{(i)}h_{t-1}) \\ \text{forget gate} \qquad f_t &= \sigma(W^{(f)}x_t + U^{(f)}h_{t-1}) \\ \text{output gate} \qquad o_t &= \sigma(W^{(o)}x_t + U^{(o)}h_{t-1}) \\ \text{new memory cell} \qquad \tilde{c}_t &= tanh(W^{(c)}x_t + U^{(c)}h_{t-1}) \\ \text{final cell state} \qquad c_t &= f_t \circ c_{t-1} + i_t\circ \tilde{c}_t \\ h_t &= o_t \circ tanh(c_t) \\ \end{align}\)
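A numpy sketch of one LSTM step following the equations above (random toy weights; dimensions are illustrative, and $\sigma$ is the logistic sigmoid):

```python
# One LSTM step: input/forget/output gates, candidate cell, cell state,
# and exposed hidden state, matching the equations in this section.
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
n, d = 4, 3                                    # hidden size, input size
Wi, Wf, Wo, Wc = (rng.normal(size=(n, d)) for _ in range(4))
Ui, Uf, Uo, Uc = (rng.normal(size=(n, n)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigma(Wi @ x_t + Ui @ h_prev)        # input gate
    f_t = sigma(Wf @ x_t + Uf @ h_prev)        # forget gate
    o_t = sigma(Wo @ x_t + Uo @ h_prev)        # output gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)  # new memory cell
    c_t = f_t * c_prev + i_t * c_tilde         # final cell state
    h_t = o_t * np.tanh(c_t)                   # exposed hidden state
    return h_t, c_t

h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=d), h, c)
print(h.shape, c.shape)  # (4,) (4,)
```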
Note 5
1 Neural Machine Translation with Seq2Seq
 Types of Seq2Seq usage:
 Translation
 Conversation
 Summarization
1.1 Brief Note on Historical Approaches
 Before neural network approach, translation systems were based on probabilistic models constructed from:
 a translation model, telling us what a sentence / phrase in a source language most likely translates into
 a language model, telling us how likely a given sentence / phrase is overall
 Phrase-based systems were most common prior to Seq2Seq. It is difficult to handle long-term dependencies with a phrase-based system.
1.2 Sequence-to-sequence Basics

Appeared in 2014 for English-French translation

It is an endtoend model made up of two RNNs:
 an encoder, which takes the model’s input sequence and encodes it into a fixed-size “context vector”
 a decoder, which uses the context vector from the encoder as a “seed” from which to generate an output sequence.
1.3 Seq2Seq architecture: encoder
 Use an encoder RNN to process the input token sequence; the final hidden state becomes the context vector.
 Usually, Seq2Seq encoders will process the input sequence in reverse, under the reasoning that the last thing the encoder sees will (roughly) correspond to the first thing that the model outputs. This makes it easier for the decoder to “get started”.
1.4 Seq2Seq architecture: decoder
 Use a decoder RNN to generate the output sequence based on the words it has generated so far, as well as the context.
 The decoding process is kicked off with a special token: <EOS> (if the same RNN is used for both encoding and decoding) or <BOS> (if dedicated networks are used)
1.5 Bidirectional RNNs
 Use bidirectional RNNs to traverse a sequence in both directions, concatenating the resulting final hidden states and cell states.
2 Attention Mechanism
2.1 Motivation
 Bahdanau et al. 2016 proposed attention to address the issue of a single “context vector” in sequence-to-sequence models.
 Different parts of an input have different levels of significance
 Different parts of the output may consider different parts of the input “important”
 Attention mechanisms make use of this observation by providing the decoder network with a chance to look at the entire input sequence at every decoding step. The decoder can then decide what input words are important at any point in time.
2.2 Bahdanau et al. NMT model

Encoder: Use RNN to process input sequence and collect $(h_1,…,h_n)$

Decoder: $s_i=f(s_{i-1}, y_{i-1}, c_i)$, where $s_{i-1}$ is the previous hidden state, $y_{i-1}$ is the previous prediction, and $c_i$ is the context from the input sequence that is relevant for the current decoding step. It is calculated as follows: \(\begin{align} e_{i,j} &= a(s_{i-1}, h_j) \\ \alpha_{i,j} &= \frac{exp(e_{i,j})}{\sum_{k=1}^{n}exp(e_{i,k})} \\ c_i &=\sum_{j=1}^{n} \alpha_{i,j}h_j \\ \end{align}\) That is:
 We use the hidden state $s_{i1}$ to query the encoder hidden states, and create a score for each.
 Use softmax to convert these scores into a proper probability distribution
 Finally compute the context vector as weighted average of the encoder hidden states.
 Ideally the softmax will put higher value on important stuff.
2.3 Connection with translation alignment
 The attention-based model learns to assign significance to different parts of the input for each step of the output.
 In the context of translation, attention can be thought of as “alignment”
 Bahdanau et al. argue that the attention scores $\alpha_{i,j}$ at decoding step $i$ signify the words in the input sequence that align with the word $i$ in the output sequence.
2.4 Performance on long sentence
 As sequence length grows, models that do not use attention will miss information.
2.5 Global / Local Attention
 There are many ways to do attention. Luong et al.’s attention uses the current hidden state to compute attention and modify what to output at the current step: a sort of post-activation attention, compared to Bahdanau’s pre-activation attention.
 They also added some control, after attention, over how large a region around the aligned position should be aggregated as context.
3 Google’s multilingual NMT
 A Seq2Seq model that accepts as input a sequence of words and a token specifying what language to translate into.
 The model uses shared parameters to translate into any target language.
 It enabled “zero-shot translation”, in which we can translate between two languages for which we have no translation training data. They could actually generate reasonable Japanese-Korean translations when the training data contained only Japanese-English and Korean-English pairs.
 This shows that the decoding process is not language-specific: the model is in fact maintaining an internal representation of the input / output sentences independent of the actual languages involved.
4 Sequence model decoders

Consider a model that computes the probability $P(\bar{s}|s)$ of a translation $\bar{s}$ given the original sequence $s$. We want to pick the translation $\bar{s}^*$ that has the highest probability. \(\bar{s}^*=\underset{\bar{s}}{argmax}(P(\bar{s}|s))\)

As the search space can be huge ($|V|^{len(s)}$), we need to shrink its size:
Exhaustive search: compute the probability of every possible sequence and choose the one with the highest probability. Too expensive: NP-complete.

Ancestral sampling: at time step $t$, we sample $x_t$ based on the conditional probability of the word at step $t$ given the past. \(x_t \sim P(x_t|x_1, \dots, x_{t-1})\) High variance.

Greedy search: at time step $t$, we pick the most probable token. \(x_t=\underset{\tilde{x}_t}{argmax}\,P(\tilde{x}_t|x_1,\dots,x_{t-1})\) Efficient, but a mistake at one time step can heavily impact the rest of the sentence.

Beam search: maintain $K$ candidates at each time step. \(H_t =\{(x_1^1,\dots,x_t^1),\dots,(x_1^K,\dots,x_t^K)\}\)
and compute $H_{t+1}$ by expanding $H_{t}$ and keeping the best $K$ candidates. In other words, we pick the best $K$ sequences in the following set \(\tilde{H}_{t+1} = \bigcup_{k=1}^{K}\tilde{H}^{k}_{t+1}\) where \(\tilde{H}^{k}_{t+1} = \{(x^k_1,\dots ,x^{k}_t, v_1),\dots,(x^k_1,\dots ,x^{k}_t, v_{|V|})\}\) As we increase $K$, we gain precision and become asymptotically exact. Beam search is the most commonly used technique in NMT.
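The beam-search bookkeeping can be sketched with a toy step distribution (fixed and independent of history, purely illustrative; a real decoder would rescore the vocabulary at every step):

```python
# Toy beam search: keep the K most probable partial sequences per step.
import math

vocab_logp = {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)}

def beam_search(steps, K=2):
    beams = [((), 0.0)]                    # (sequence, log-probability)
    for _ in range(steps):
        # Expand every beam with every vocabulary word...
        candidates = [(seq + (w,), lp + wlp)
                      for seq, lp in beams
                      for w, wlp in vocab_logp.items()]
        # ...then keep only the K highest-scoring candidates.
        beams = sorted(candidates, key=lambda x: -x[1])[:K]
    return beams

best, logp = beam_search(steps=3)[0]
print(best)  # ('a', 'a', 'a') -- greedy and beam agree for this flat distribution
```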
