### Big Takeaways

• Embracing errors
• How to set up everything so we are on track for success

#### Components of a Machine Learning Algorithm

1. Model Structure

2. Loss Function

3. Optimization

#### Logistic Regression

• Linear model with sigmoid activation
$$y=\frac{1}{1 + e^{-z}}, \quad z = w^Tx + b, \quad z = \log\left(\frac{y}{1-y}\right)$$

• Log Loss $$\text{logloss}(\hat{y}, y) = -\frac{1}{n}\sum\left(y\log(\hat{y}) + (1-y)\log(1-\hat{y})\right)$$ minimizing the negative log-likelihood = maximizing the log-likelihood

• Gradient Descent $$w_i = w_i - \alpha \frac{\partial Loss(y, f(x))}{\partial w_i}$$
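
A minimal NumPy sketch tying the three components together (sigmoid model, log loss, gradient descent); the function names are illustrative, not from the source:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_hat, y):
    # Negative log-likelihood averaged over the n examples
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def fit_logistic(X, y, alpha=0.1, steps=1000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        y_hat = sigmoid(X @ w + b)  # model structure: sigmoid(w^T x + b)
        # Gradient of the log loss w.r.t. w and b
        w -= alpha * X.T @ (y_hat - y) / len(y)
        b -= alpha * np.mean(y_hat - y)
    return w, b
```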

#### Discriminative vs Generative and Bayes Rule

• Discriminative: model the conditional probability (posterior) $P(y|x)$ directly
• Generative: model the joint probability $P(x, y)$, then use Bayes' rule to get $P(y|x)$
• Bayes' rule $$P(A \cap B) = P(A|B)P(B) = P(B|A)P(A) \\ P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\neg A)P(\neg A)}$$

• Naive Bayes, assuming conditionally independent features (see the sketch at the end of this list): $$y = \underset{y_j\in Y}{\operatorname{argmax}} \frac{P(X|y_j)P(y_j)}{P(X)} = \underset{y_j\in Y}{\operatorname{argmax}}\ P(y_j)\prod_{i}P(x_i|y_j)$$

• MLE vs MAP

• MLE: estimate from the data alone (maximize the likelihood)
• MAP: estimate from the data + a prior (maximize the posterior)
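
A hedged sketch of the Naive Bayes rule above for binary (Bernoulli) features; Laplace smoothing is an added assumption, not from the source:

```python
import numpy as np

def train_nb(X, y):
    # X: (n, d) binary feature matrix, y: (n,) labels in {0, 1}
    priors, likelihoods = {}, {}
    for c in (0, 1):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        # P(x_i = 1 | y = c) with Laplace (add-one) smoothing
        likelihoods[c] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)
    return priors, likelihoods

def predict_nb(x, priors, likelihoods):
    # argmax_c of log P(c) + sum_i log P(x_i | c); P(X) is constant, so dropped
    scores = {c: np.log(priors[c])
                 + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
              for c, p in likelihoods.items()}
    return max(scores, key=scores.get)
```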

#### Decision Trees

• Model structure: tree
• Loss: $\frac{1}{n}\sum_i Entropy(Leaf(S_i))$, where $Entropy(S) = -\sum_i p(y=i)\log_2 p(y=i)$
• Optimization: greedy search (single-step lookahead) with pruning
• Interpretations
• features near the root
• most often used features
• aggregated information gains
• prominent paths (highly accurate, or followed by lots of data)
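
A small sketch of the entropy loss and the information gain that the greedy search maximizes at each candidate split; names are illustrative:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum_i p(y=i) * log2 p(y=i)
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    # Entropy reduction from splitting the parent leaf into left/right,
    # weighting each child by its share of the data
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children
```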

#### Feature Engineering

• Goal:
• create a feature representation that matches the model structure
• balance among: data size, # features, feature complexity, model complexity
• Data types:
• Binary / Categorical / Numeric
• data sources:
• System state / content / user info / metadata / interaction history
• Features for text:
• Token / Bag of words / N-grams (character-level n-grams handle OOV tokens, spelling errors, etc.)
• TF x IDF (see the sketch at the end of this list):
• TF: P(term | document), the term's importance to the document
• IDF: log(# total docs / # docs with term), the novelty of the term across the corpus
• Embeddings: w2v, fasttext -> dense vector representations
• Normalization: helps some models' optimization; the model doesn't need to learn what is big or small
• Start with standard / common practice features
• CV: Gradients, Histograms, Convolutions
• Internet: IP, Domain, Relationships, Reputation
• Time Series: window statistics, FFT
• NN: attention
• Look at the errors and work with domain experts
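
The TF x IDF sketch referenced above, assuming TF = P(term | document) and the log IDF as defined:

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # TF: P(term | document); IDF: log(# total docs / # docs with term)
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# "cat" scores high in doc 0: frequent there, rare across the corpus
print(tf_idf([["cat", "cat", "sat"], ["dog", "sat"], ["dog", "ran"]]))
```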

#### Feature Selection

• Frequency: a feature needs roughly 10-50 appearances for the model to have a chance to pick it up

• Mutual Information $$MI(X, Y) = \sum_{y}\sum_{x}p(x, y)\log\left(\frac{p(x, y)}{p(x)p(y)}\right)$$

• Through the model's CV performance
• Models with built-in selection (trees, lasso, etc.)
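
A sketch of the MI formula for one binary feature against a binary label, using empirical probabilities:

```python
import numpy as np

def mutual_information(x, y):
    # x, y: binary arrays; sums p(x, y) * log(p(x, y) / (p(x) p(y)))
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi
```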

#### Classifier evaluations

• From the confusion matrix:

• Accuracy: (TP + TN) / total

• Precision: TP / (TP + FP), the % of predicted 1s that are indeed 1

• Recall: TP / (TP + FN), the % of 1s classified as 1

• FP rate: FP / (FP + TN), the % of 0s classified as 1

• FN rate: FN / (TP + FN), the % of 1s classified as 0
• F1 score: 2 * Precision * Recall / (Precision + Recall)
• AUC, the area under the ROC curve; a higher AUC does not guarantee better FP / FN rates at a given operating point
• Mistakes have different costs

• Operating points: sweep through the ROC curve, trading FN rate against FP rate (see the sketch at the end of this section)

• Find a good trade-off point and look up the corresponding threshold.
• Reasons to reset the operating point: more data, a new model, new users, data drift
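
A sketch of the confusion-matrix metrics plus the threshold sweep for picking an operating point (assumes both classes appear among the predictions, so no divide-by-zero guards):

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "fp_rate": fp / (fp + tn),
        "fn_rate": fn / (tp + fn),
        "f1": 2 * precision * recall / (precision + recall),
    }

def sweep_operating_points(y_true, scores, thresholds):
    # Trade FP rate against FN rate at each threshold, then pick the
    # trade-off point that fits the costs of each kind of mistake
    return {t: confusion_metrics(y_true, (scores >= t).astype(int))
            for t in thresholds}
```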

#### Model Selection / Hyper-parameter tuning

• Central Limit Theorem:
• distribution of sample means approximates a normal distribution
• where the average of the sample means gets closer and closer to the population mean as we draw more samples
• and the variance of the sample mean is the population variance divided by the sample size
• If the new model's metric lower bound > the old model's metric upper bound (using ~95% confidence intervals), there is only about a 5% chance the old model is actually better
• Cross validation
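
A hedged sketch of that comparison: per-fold metrics, a CLT-based ~95% interval for each model, and the lower-bound vs upper-bound test (the z-value and names are illustrative):

```python
import numpy as np

def metric_interval(fold_scores, z=1.96):
    # CLT: the mean of fold scores ~ Normal(mu, sigma^2 / n_folds)
    scores = np.asarray(fold_scores)
    half = z * scores.std(ddof=1) / np.sqrt(len(scores))
    return scores.mean() - half, scores.mean() + half

def new_model_wins(new_fold_scores, old_fold_scores):
    # The new model's lower bound must clear the old model's upper bound
    new_lo, _ = metric_interval(new_fold_scores)
    _, old_hi = metric_interval(old_fold_scores)
    return new_lo > old_hi
```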

### Overfitting and Underfitting

• No Free Lunch Theorem -> no one model always wins
• Bias and Variance tradeoffs
• Bias: learning setup’s inability to capture the truth
• Variance: the learning setup's over-sensitivity to the particular training data
• A traditional ML model is usually a trade-off between the two
• Estimate with cross validation / bootstrap etc
• Bias is from model structure
• Variance can be controlled with regularization
• L1, L2 regularization for parametric model
• max_depth, min_leaf etc for trees
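
A minimal sketch of how an L2 penalty enters a gradient update for a parametric model; lam is the regularization strength (an L1 penalty would contribute lam * sign(w) instead):

```python
import numpy as np

def ridge_gradient_step(w, X, y, alpha=0.1, lam=1.0):
    # Squared loss plus lam * ||w||^2: the penalty shrinks weights toward
    # zero, trading a little extra bias for reduced variance
    grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w
    return w - alpha * grad
```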

#### Ensemble

• Combine many models, to combat bias-variance trade off
• Learn the problem in many different ways and then combine
• Bagging, Random Forest, Stacking: combine low-bias, high-variance individual models
• Encourage variance among them and hope the errors cancel out.
• Learn different parts of the problem with different models and then combine
• Boosting: combine weak learners
• Learn the residuals, or re-weight the data based on errors
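
A hedged sketch of the residual-learning flavor (gradient boosting with squared loss), using shallow sklearn trees as the weak learners:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted(X, y, n_rounds=100, lr=0.1):
    # Each weak learner fits the residuals of the ensemble so far
    base = y.mean()
    pred, learners = np.full(len(y), base), []
    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)
        pred += lr * tree.predict(X)
        learners.append(tree)
    return base, learners

def predict_boosted(X, base, learners, lr=0.1):
    return base + lr * sum(tree.predict(X) for tree in learners)
```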

#### Clustering

• K-means vs Mixture of Gaussians vs Hierarchical Agglomerative

• K-Means

• Structure: K components, each centered on a centroid that is the average of its component
• Loss: average distance from each point to its assigned centroid
• Optimization: iterative EM

```
init_centroids(data, k)
while loss is improving:
    assign points to nearest centroid
    update each centroid to the mean of its component
```
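
A runnable NumPy version of the loop above; checking centroid movement stands in for "loss is improving" and is an assumption:

```python
import numpy as np

def kmeans(data, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Init: pick k random data points as the starting centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest centroid
        dists = np.linalg.norm(data[:, None] - centroids, axis=2)
        assign = np.argmin(dists, axis=1)
        # M-step: move each centroid to the mean of its component
        # (assumes no component goes empty)
        new = np.array([data[assign == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, centroids):  # converged: loss stopped improving
            break
        centroids = new
    return centroids, assign
```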


#### Dimensionality Reduction

• map the feature space onto a smaller feature space such that it:
• captures the core relationships in the data
• drops fine details
• PCA (see the sketch after this list)

• KNN
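
A minimal PCA sketch via SVD on centered data, as referenced above:

```python
import numpy as np

def pca(X, n_components):
    # Center the data; the top right-singular vectors are the directions
    # of maximum variance (the core relationships), the rest are dropped
    X_centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T  # project onto the smaller space
```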

### Architecture

#### Hierarchy of goals

• Organizational Objectives
• User Outcomes
• Model Properties

• Effects are bottom-up, but motivations are top-down

#### User Experiences

• All models make mistakes, and ML models make weird mistakes
• Mitigate mistakes
• manage the number and types of mistakes users see
• control how easy it is for users to identify mistakes
• give users options for recovering from mistakes
• Balancing the experience
• Forcefulness
• Frequency
• Value of success / Cost of failure
• Quality
• Forcefulness
• Forceful: automatic, can't be ignored; only do this when quality is good and the cost of failure is low
• Passive: the user has to accept the output, or it can be ignored; use when interaction is frequent and the value of each success is low
• Delete -> Suppress -> Inform

#### Design Pattern

| Design Pattern | Example | Best When | Challenges |
| --- | --- | --- | --- |
| Corpus Centric | CV / Speech | Hard but stable problem; can't use data from user interaction; bootstrapping a new system | Data collection; sophisticated modeling |
| Closed Loop | Recommender | Open-ended / time-changing problems; large scale | Shaping user/ML interactions; orchestrating an evolving system |
• Match run time and training time (see the featurizer sketch at the end of this list)
• Run Time GetContext() = Training Time GetHistoryData()
• Shared Featurizer
• Corpus Centric
• Collect data for a stable problem (face detection, etc.)
• Model coupling must be managed
• complexity in ops
• advantages in efficiency and team
• Verifying ML systems requires drilling into subpopulations
• Think about evolution and how you want to encapsulate information
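
A hedged sketch of the shared-featurizer idea from the list above: one featurize() function imported by both the training pipeline (replaying history) and the runtime (live context). All names here (featurize, the feature keys, the record fields) are illustrative:

```python
def featurize(context):
    # Single source of truth for feature code, imported by BOTH the
    # training job and the serving path so the two can never drift apart
    return [
        context["timestamp"].hour,            # hour of day
        len(context["recent_interactions"]),  # interaction history size
    ]

def build_training_rows(get_history_data):
    # Training time: replay history through the same featurizer
    return [(featurize(record["context"]), record["label"])
            for record in get_history_data()]

def predict(model, get_context):
    # Run time: the live context goes through the same code path
    return model.predict([featurize(get_context())])
```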

#### Components

• Training
• Telemetry / Corpus -> training data
• Feature code in sync
• Computation & data
• Model Management
• verification
• controlled rollout
• support online evaluation
• Telemetry
• Verifying outcomes
• Collect new training data
• Select what to observe
• Runtime
• Program State -> training data
• synced feature code
• model inference
• UX based on model predictions
• Orchestration / Ops
• Monitoring
• Inspect interactions
• Adaptation (thresholding etc?)
• Deal with mistakes

#### Adversarial Machine Learning

• ML assumes i.i.d. data
• but almost always violated in practice
• really always violated in adversarial settings
• The most obvious way to use machine learning is not always the best way