Big Takeaways
 Embracing errors
 Trade-offs
 How to set everything up so we are on track to success
Components of a Machine Learning Algorithm

Model Structure

Loss Function

Optimization
Logistic Regression

Linear model with sigmoid activation
\(y=\frac{1}{1 + e^{-z}}, \quad z = w^Tx + b, \quad z = \log(\frac{y}{1-y})\)
Log Loss \(logloss(\hat{y}, y) = -\frac{1}{n}\sum(y\log(\hat{y}) + (1-y)\log(1-\hat{y}))\); minimizing the negative log-likelihood = maximizing the log-likelihood

Gradient Descent \(w_i = w_i - \alpha \frac{\partial Loss(y, f(x))}{\partial w_i}\)
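A minimal NumPy sketch of the three components together (model structure, log loss, gradient descent); the function names and toy hyperparameters are illustrative, assuming binary labels y in {0, 1}:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_hat, y):
    # negative log-likelihood, averaged over n examples
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def fit_logistic_regression(X, y, alpha=0.1, n_steps=1000):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        y_hat = sigmoid(X @ w + b)           # model structure
        grad_w = X.T @ (y_hat - y) / len(y)  # d(log loss)/dw
        grad_b = np.mean(y_hat - y)
        w -= alpha * grad_w                  # gradient descent step
        b -= alpha * grad_b
    return w, b
```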
Discriminative vs Generative and Bayes Rule

Discriminative: model $P(y \mid x)$, the conditional probability (posterior)
Generative: model $P(x, y)$, the joint probability; use Bayes rule to get $P(y \mid x)$
Bayes rule \(P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A) \\ P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)} = \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid \neg A)P(\neg A)}\)

Naive Bayes: \(y = \underset{y_j\in Y}{argmax} \frac{P(X \mid y_j)P(y_j)}{P(X)} = \underset{y_j\in Y}{argmax}\ P(y_j)\prod_{i}P(x_i \mid y_j)\)
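A sketch of Bernoulli Naive Bayes following these formulas, assuming binary features; the Laplace smoothing choice is mine, not from the source:

```python
import numpy as np

def train_bernoulli_nb(X, y, smoothing=1.0):
    """X: (n, d) binary features; y: (n,) class labels."""
    priors, likelihoods = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)  # P(y_j)
        # P(x_i = 1 | y_j) with Laplace smoothing to avoid zero probabilities
        likelihoods[c] = (Xc.sum(axis=0) + smoothing) / (len(Xc) + 2 * smoothing)
    return priors, likelihoods

def predict(x, priors, likelihoods):
    # argmax over classes of log P(y_j) + sum_i log P(x_i | y_j)
    def log_posterior(c):
        p = likelihoods[c]
        return np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(priors, key=log_posterior)
```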

MLE vs MAP
 MLE: parameters estimated from the data likelihood alone
 MAP: parameters estimated from the data likelihood + a prior
Decision Trees
 Model structure: tree
 Loss: $\frac{1}{n}\sum Entropy(Leaf(S_i))$, where $Entropy(S) = -\sum_i p(y=i) log_2(p(y=i))$
 Optimization: Greedy search (single-step lookahead) with pruning
 Interpretations
 features near the root
 most often used features
 aggregated information gains
 prominent paths (highly accurate, or lots of data follow them)
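A small sketch of the entropy loss and the greedy split criterion it implies; the helper names are illustrative:

```python
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum_i p(y=i) * log2 p(y=i)
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    # the greedy criterion: how much a candidate split reduces entropy
    left, right = labels[split_mask], labels[~split_mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted
```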
Feature Engineering
 Goal:
 create feature representations that match the model structure
 balance among: data size, # of features, feature complexity, model complexity
 Data types:
 Binary / Categorical / Numeric
 Data sources:
 System state / content / user info / metadata / interaction history
 Features for text:
 Token / Bag of words / N-grams (char-level handles OOV, spelling errors, etc.)
 TF x IDF (see the sketch after this list):
 TF: P(term | document), the term’s importance to the document
 IDF: log(# total docs / # docs with term), the novelty of the term across the corpus

 Embeddings: w2v, fastText
 Normalization: helps some models’ optimization (no need to learn what is big or small)
 Start with standard / common practice features
 CV: Gradients, Histograms, Convolutions
 Internet: IP, Domain, Relationships, Reputation
 Time Series: window statistics, FFT
 NN: attention
 Look at errors, combine with domain experts
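The TF x IDF sketch referenced above, implementing the two definitions directly in plain Python; the function name is illustrative:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns per-document {term: tf * idf}."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(t for doc in docs for t in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # TF: P(term | document); IDF: log(# total docs / # docs with term)
            t: (c / len(doc)) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        })
    return scores
```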
Feature Selection

Frequency: a feature needs roughly 10 ~ 50 appearances for the model to have a chance to pick it up

Mutual Information \(MI(X, Y) = \sum_{y}\sum_{x}p(x, y)log(\frac{p(x, y)}{p(x)p(y)})\)
 Through model’s CV performance
 Models with built-in selection: trees / lasso / etc.
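A direct sketch of the MI formula above for discrete features, estimating the probabilities from paired samples (x, y as 1-D arrays):

```python
import numpy as np

def mutual_information(x, y):
    """MI(X, Y) = sum_y sum_x p(x, y) * log(p(x, y) / (p(x) p(y)))."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # empirical p(x, y)
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi
```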
Classifier evaluations

From the confusion matrix:
 Accuracy: (TP + TN) / total
 Precision: TP / (TP + FP), % classified 1 that are indeed 1
 Recall: TP / (TP + FN), % of 1s classified as 1
 FP rate: FP / (FP + TN), % of 0s classified as 1
 FN rate: FN / (TP + FN), % of 1s classified as 0
 F1 score: 2 * Precision * Recall / (Precision + Recall)
 AUC, area under the ROC curve; a higher AUC does not guarantee better FP / FN rates at the operating point
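All of these metrics come straight from the four confusion-matrix counts; a small sketch (assumes no zero denominators, which a real implementation would guard):

```python
def classifier_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)      # % classified 1 that are indeed 1
    recall = tp / (tp + fn)         # % of 1s classified as 1
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "fp_rate": fp / (fp + tn),  # % of 0s classified as 1
        "fn_rate": fn / (tp + fn),  # % of 1s classified as 0
        "f1": 2 * precision * recall / (precision + recall),
    }
```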


Mistakes have different costs

 Operating points: sweep thresholds along the ROC curve, trading FP rate against FN rate
 Find a good trade-off point and look up its threshold (see the sketch below).
 Reasons to reset the threshold: more data, new model, new users, data drift
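A sketch of sweeping operating points: score a validation set once, then report the FP/FN trade-off at each threshold so a good operating point can be picked and its threshold looked up (names are illustrative):

```python
import numpy as np

def sweep_operating_points(scores, y, thresholds=np.linspace(0, 1, 101)):
    """scores: predicted probabilities; y: true 0/1 labels."""
    points = []
    for t in thresholds:
        pred = scores >= t
        fp_rate = np.mean(pred[y == 0])   # 0s classified as 1
        fn_rate = np.mean(~pred[y == 1])  # 1s classified as 0
        points.append((t, fp_rate, fn_rate))
    return points
```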
Model Selection / Hyperparameter tuning
 Central Limit Theorem:
 the distribution of sample means approximates a normal distribution
 the average of the sample means gets closer and closer to the population mean as we draw more samples
 the variance of the sample mean shrinks with sample size (population variance / n)
 New model's metric lower bound > old model's metric upper bound -> under ~5% chance the old model is better (see the sketch after this list)
 Cross validation
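The sketch referenced above: normal-approximation confidence intervals for an accuracy metric; the numbers are made up purely for illustration:

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """~95% normal-approximation interval for an accuracy measured on n examples."""
    half = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half, acc + half

# the new model wins decisively if its lower bound clears the old upper bound
new_lo, _ = accuracy_interval(0.87, n=2000)  # illustrative numbers
_, old_hi = accuracy_interval(0.84, n=2000)
print(new_lo > old_hi)
```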
Overfitting and Underfitting
 No Free Lunch Theorem -> no one model always wins
 Bias and Variance tradeoffs
 Bias: the learning setup’s inability to capture the truth
 Variance: the learning setup being over-sensitive to the training data
 A traditional ML model is usually a trade-off between the two
 Estimate with cross validation / bootstrap etc
 Bias is from model structure
 Variance can be controlled with regularization
 L1, L2 regularization for parametric model
 max_depth, min_leaf etc for trees
Ensemble
 Combine many models to combat the bias-variance trade-off
 Learn the problem in many different ways and then combine
 Bagging, Random Forest, Stacking: low-bias, high-variance individual models
 Encourage variance and hope the errors cancel out.
 Learn different parts of the problem with different models and then combine
 Boosting: weak learners
 Learn the residual, or reweight data based on error
Clustering

K-Means vs Mixture of Gaussians vs Hierarchical Agglomerative

K-Means
 Structure: K components, each centered at a centroid that is the average of its component
 Loss: average distance from each point to its assigned centroid
 Optimization: iterative EM
 init_centroids(data, k)
 while loss is improving:
     assign points to nearest centroid
     update each centroid to the mean of its component
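A runnable NumPy version of the loop above; initialization by sampling k points and the convergence check are my choices, and it assumes no cluster goes empty:

```python
import numpy as np

def k_means(data, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # init_centroids: sample k distinct points as starting centroids
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iters):
        # assign points to nearest centroid
        dists = np.linalg.norm(data[:, None] - centroids[None, :], axis=2)
        assign = dists.argmin(axis=1)
        # update each centroid to the mean of its component
        new_centroids = np.array([data[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # loss no longer improving
            break
        centroids = new_centroids
    return centroids, assign
```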
Dimensionality Reduction
 map the feature space onto a smaller feature space such that it:
 captures the core relationships in the data
 drops fine details
 PCA
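A minimal SVD-based PCA sketch; this is one common way to implement it, not necessarily the source's:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto the top principal components."""
    X_centered = X - X.mean(axis=0)  # PCA assumes centered data
    # rows of Vt are the principal directions, ordered by variance captured
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    return X_centered @ components.T  # reduced representation
```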
Instance Based Learning
 KNN
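A KNN sketch showing the instance-based idea: no training step, just store the data and vote among the k nearest at query time (Euclidean distance assumed):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest training instances."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]
```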
Architecture
Hierarchy of goals
 Organizational Objectives
 Leading Indicators
 User Outcomes

Model Properties
 Effects are bottom up, but motivations are top down
User Experiences
 All models make mistakes, and ML models make weird mistakes
 Mitigate mistakes
 manage the number and types of mistakes a user sees
 control how easy it is for users to identify mistakes
 give users options for recovering from mistakes
 Balancing the experience
 Forcefulness
 Frequency
 Value of success / Cost of failure
 Quality
 Forcefulness
 Forceful: automatic, can’t be ignored; only do this when quality is good and the cost of failure is low
 Passive: the user needs to accept, or can ignore; use when interaction is frequent and value is low
 Delete > Suppress > Inform
Design Pattern
| Design Pattern | Example | Best When | Challenges |
| --- | --- | --- | --- |
| Corpus Centric | CV / Speech | Hard but stable problems; can’t use data from user interaction; bootstrapping a new system | Data collection; sophisticated modeling |
| Closed Loop | Recommender | Open-ended / time-changing problems; large scale | Shaping user/ML interactions; orchestrating an evolving system |
 Match run time and training time
 Run time GetContext() must match training time GetHistoryData()
 Use a shared Featurizer at both run time and training time
 Corpus Centric
 Collect data for stable problem (face detection etc)
 Model coupling must be managed
 complexity in ops
 advantages in efficiency and team
 Verifying ML systems requires drilling into subpopulations
 Think about evolution and how you want to encapsulate information
Components
 Training
 Telemetry / Corpus -> training data
 Feature code in sync
 Computation & data
 Model Management
 verification
 controlled rollout
 support online evaluation
 Telemetry
 Verifying outcomes
 Collect new training data
 Select what to observe
 Runtime
 Program state -> training data
 synced feature code
 model inference
 UX based on model prediction
 Orchestration / Ops
 Monitoring
 Inspect interactions
 Adaptation (thresholding etc?)
 Deal with mistakes
Adversarial Machine Learning
 ML assumes i.i.d. data
 but this is almost always violated in practice
 and it is essentially always violated in adversarial settings
 The most obvious way to use machine learning is not always the best way