Big Takeaways

  • Embracing errors
  • Trade offs
  • How to set everything up so we are on track for success

Components of a Machine Learning Algorithm

  1. Model Structure

  2. Loss Function

  3. Optimization

Logistic Regression

  • Linear model with sigmoid activation
    \(y=\frac{1}{1 + e^{-z}},\quad z = w^Tx + b,\quad z = \log\left(\frac{y}{1-y}\right)\)

  • Log Loss \(logloss(\hat{y}, y) = -\frac{1}{n}\sum(y\log(\hat{y}) + (1-y)\log(1-\hat{y}))\): minimizing the negative log-likelihood is the same as maximizing the log-likelihood

  • Gradient Descent \(w_i = w_i - \alpha \frac{\partial Loss(y, f(x))}{\partial w_i}\)
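
The three components above (sigmoid model, log loss, gradient descent) can be put together in a small sketch. This is an illustration under assumptions rather than a reference implementation: the toy data, the learning rate `alpha=0.5`, the step count, and the helper names (`predict`, `fit`) are all made up for the example, and it uses only the standard library.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Linear model z = w^T x + b followed by the sigmoid.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def log_loss(w, b, X, y, eps=1e-12):
    # Mean negative log-likelihood; predictions are clipped to avoid log(0).
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict(w, b, xi), eps), 1 - eps)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -total / len(X)

def fit(X, y, alpha=0.5, steps=500):
    # Batch gradient descent: w_i = w_i - alpha * dLoss/dw_i.
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(steps):
        errors = [predict(w, b, xi) - yi for xi, yi in zip(X, y)]
        grad_w = [sum(e * xi[j] for e, xi in zip(errors, X)) / n
                  for j in range(len(w))]
        grad_b = sum(errors) / n
        w = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
        b -= alpha * grad_b
    return w, b

# Hypothetical toy data: label is 1 when x0 + x1 > 0.
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
y = [1.0 if x[0] + x[1] > 0 else 0.0 for x in X]

initial_loss = log_loss([0.0, 0.0], 0.0, X, y)  # all predictions are 0.5
w, b = fit(X, y)
final_loss = log_loss(w, b, X, y)
```

With all-zero weights every prediction is 0.5, so the starting loss is log(2); training should push the loss below that.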

Discriminative vs Generative and Bayes Rule

  • Discriminative: model $P(y|x)$, the conditional (posterior) probability, directly
  • Generative: model $P(x, y)$, the joint probability, then use Bayes rule to get $P(y|x)$
  • Bayes rule: \(P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)\), so \(P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\neg A)P(\neg A)}\)

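  • Naive Bayes: \(y = \underset{y_j\in Y}{\operatorname{argmax}} \frac{P(X|y_j)P(y_j)}{P(X)} = \underset{y_j\in Y}{\operatorname{argmax}} P(y_j)\prod_{i}P(x_i|y_j)\) (\(P(X)\) is the same for every \(y_j\), and the features \(x_i\) are assumed conditionally independent given the class)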

  • MLE vs MAP

    • MLE: probability from data alone
    • MAP: probability from data + prior
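
A small worked sketch of Bayes rule and of the MLE vs MAP distinction for a single Bernoulli probability. The Beta(a, b) prior and all the numbers are assumptions chosen for illustration; the notes do not specify a prior.

```python
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    # Bayes rule with the evidence P(B) expanded over A and not-A.
    evidence = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / evidence

def mle(k, n):
    # Maximum likelihood: k successes out of n trials, data alone.
    return k / n

def map_estimate(k, n, a=2.0, b=2.0):
    # Mode of the Beta(k + a, n - k + b) posterior: data plus a Beta(a, b) prior.
    return (k + a - 1) / (n + a + b - 2)

# Hypothetical numbers: a rare event (1%) and a fairly accurate detector.
p_event_given_alarm = posterior(p_b_given_a=0.9, p_a=0.01, p_b_given_not_a=0.05)

# Three successes in three trials: MLE says 1.0, the prior pulls MAP toward 0.5.
estimate_mle = mle(3, 3)
estimate_map = map_estimate(3, 3)
```

The same smoothing effect is why Naive Bayes implementations usually add pseudo-counts instead of using raw frequencies for \(P(x_i|y_j)\).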

Decision Trees

  • Model structure: tree
  • Loss: $\frac{1}{n}\sum Entropy(Leaf(S_i))$, where $Entropy(S) = - \sum p(y=i)\log_2(p(y=i))$
  • Optimization: Greedy search (single-step lookahead) with pruning
  • Interpretations
    • features near the root
    • most often used features
    • aggregated information gains
    • prominent paths (highly accurate, or followed by a lot of data)
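
The entropy loss and the gain used by the greedy one-step search can be sketched as follows. The function names and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum_i p(y=i) * log2(p(y=i))
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    # Parent entropy minus the size-weighted entropy of the child groups.
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

labels = [0, 0, 1, 1]
perfect_split = information_gain(labels, [[0, 0], [1, 1]])
useless_split = information_gain(labels, [[0, 1], [0, 1]])
```

A greedy learner would evaluate `information_gain` for every candidate split at a node and keep the largest, which is also how the "aggregated information gains" interpretation is obtained.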

Feature Engineering

  • Goal:
    • create a feature representation that matches the model structure
    • balance among: data size, number of features, feature complexity, model complexity
  • Data types:
    • Binary / Categorical / Numeric
  • Data sources:
    • System state / content / user info / metadata / interaction history
  • Features for text:
    • Token / Bag of words / N-grams (character level helps with OOV, spelling errors, etc.)
    • TF x IDF:
      • TF: Prob(term | document), the term’s importance to the document
      • IDF: log(# total docs / # docs with term), the novelty of the term across the corpus
    • Embeddings: w2v, fasttext ->
  • Normalization: helps some models’ optimization, since the model does not need to learn what is big or small
  • Start with standard / common practice features
    • CV: Gradients, Histograms, Convolutions
    • Internet: IP, Domain, Relationships, Reputation
    • Time Series: window statistics, FFT
    • NN: attention
  • Look at errors, combine with domain experts
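
The TF x IDF definition from the text-features bullets can be sketched directly. This is a minimal version assuming pre-tokenized documents and the natural log; the three toy documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n_docs = len(docs)
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # TF = Prob(term | document), IDF = log(# docs / # docs with term)
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
weights = tf_idf(docs)
```

A term that appears in every document ("the") gets weight 0, while a term unique to one document ("ran") gets the largest weight.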

Feature Selection

  • Frequency: a feature needs roughly 10 ~ 50 appearances for the model to have a chance to pick it up

  • Mutual Information \(MI(X, Y) = \sum_{y}\sum_{x}p(x, y)\log(\frac{p(x, y)}{p(x)p(y)})\)

  • Through the model’s cross-validation performance
  • Models with built-in selection: trees, lasso, etc.
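
A sketch of the mutual information formula for two discrete variables, estimating the probabilities from empirical counts. The base-2 log (so the result is in bits) is an assumption; the formula above does not fix the base.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # MI(X, Y) = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

feature_informative = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
feature_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

A feature that fully determines a balanced binary label carries 1 bit; an independent feature carries 0, so ranking features by this score keeps the informative ones.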

Classifier evaluations

  • From the confusion matrix:

    • Accuracy: (TP + TN) / total
    • Precision: TP / (TP + FP), the fraction of examples classified as 1 that are indeed 1
    • Recall: TP / (TP + FN), the fraction of 1s classified as 1
    • FP rate: FP / (FP + TN), the fraction of 0s classified as 1
    • FN rate: FN / (TP + FN), the fraction of 1s classified as 0
    • F1 score: 2 * Precision * Recall / (Precision + Recall)
    • AUC: area under the ROC curve; a higher AUC does not guarantee better FP / FN rates at the operating point
  • Mistakes have different costs

  • Operating points: sweep the threshold along the ROC curve (here plotted as FN rate against FP rate; FN rate = 1 - recall)

    • Find a good trade-off point and look up the threshold.
    • Reasons to reset the threshold: more data, new model, new users, data drift
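
The confusion-matrix metrics and the idea of choosing an operating point under unequal mistake costs can be sketched as below. The cost-weighted selection rule, the function names, and the four toy scores are assumptions for illustration, not a prescribed procedure.

```python
def confusion(scores, labels, threshold):
    # Predict 1 when the score is at or above the threshold.
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    def ratio(num, den):
        return num / den if den else 0.0
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "fp_rate": ratio(fp, fp + tn),
        "fn_rate": ratio(fn, tp + fn),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

def pick_threshold(scores, labels, fp_cost=1.0, fn_cost=1.0):
    # Sweep every observed score as a candidate threshold; keep the cheapest.
    best_threshold, best_cost = None, None
    for t in sorted(set(scores)):
        _, fp, _, fn = confusion(scores, labels, t)
        cost = fp_cost * fp + fn_cost * fn
        if best_cost is None or cost < best_cost:
            best_threshold, best_cost = t, cost
    return best_threshold

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
metrics_at_half = rates(*confusion(scores, labels, 0.5))
strict = pick_threshold(scores, labels, fp_cost=5.0)   # false positives are expensive
lenient = pick_threshold(scores, labels, fn_cost=5.0)  # false negatives are expensive
```

Making false positives expensive moves the chosen threshold up, and making false negatives expensive moves it down, which is the trade-off the operating point encodes.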

Model Selection / Hyper-parameter tuning

  • Central Limit Theorem:
    • the distribution of sample means approximates a normal distribution
    • the average of the sample means gets closer and closer to the population mean as we draw more samples
    • the variance of the sample means is the population variance divided by the sample size
  • If the new model's lower confidence bound is above the old model's upper confidence bound (95% intervals), the old model is unlikely to be better (less than a 5% chance?)
  • Cross validation
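
A sketch of the interval comparison and of a k-fold split. It assumes the normal approximation to the binomial for an accuracy estimate, z = 1.96 for a 95% interval, and simple interleaved folds without shuffling; the counts are hypothetical.

```python
import math

def accuracy_interval(correct, n, z=1.96):
    # Normal-approximation confidence interval for an accuracy of correct / n.
    acc = correct / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

def clearly_better(new_interval, old_interval):
    # New model's lower bound must exceed the old model's upper bound.
    return new_interval[0] > old_interval[1]

def k_fold_indices(n, k):
    # Yield (train_indices, test_indices) for each of k interleaved folds.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

# Same accuracies (90% vs 85%), different test-set sizes.
large_test = clearly_better(accuracy_interval(900, 1000), accuracy_interval(850, 1000))
small_test = clearly_better(accuracy_interval(90, 100), accuracy_interval(85, 100))
```

The same five-point accuracy gap is decisive with 1000 test examples but not with 100, because the interval width shrinks with the square root of the sample size.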

Overfitting and Underfitting

  • No Free Lunch Theorem -> no one model always wins
  • Bias and Variance trade-offs
    • Bias: the learning setup’s inability to capture the truth
    • Variance: the learning setup can be overly sensitive to the training data
  • Usually a traditional ML model is a trade-off between the two
  • Estimate with cross validation / bootstrap etc.
  • Bias comes from the model structure
  • Variance can be controlled with regularization
    • L1, L2 regularization for parametric models
    • max_depth, min_leaf etc. for trees


Ensembles

  • Combine many models to combat the bias-variance trade-off
  • Learn the problem in many different ways and then combine
    • Bagging, Random Forest, Stacking - low bias high variance individual models
    • Encourage variance across the individual models and hope their errors cancel out.
  • Learn different parts of the problem with different models and then combine
    • Boosting - weak learners
    • Learn the residual / or re-weight data based on error
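
Why averaging high-variance models helps can be illustrated with a small simulation. This is not bagging itself: it assumes each model's error is independent Gaussian noise around the truth, which real bagged models only approximate, and every name and number here is made up for the example.

```python
import random
import statistics

def noisy_model(truth, noise_sd, rng):
    # Stand-in for a low-bias, high-variance model: right on average, noisy.
    return truth + rng.gauss(0, noise_sd)

def ensemble(truth, noise_sd, k, rng):
    # Average the predictions of k independent noisy models.
    predictions = [noisy_model(truth, noise_sd, rng) for _ in range(k)]
    return sum(predictions) / k

rng = random.Random(0)
single = [noisy_model(1.0, 1.0, rng) for _ in range(2000)]
averaged = [ensemble(1.0, 1.0, 10, rng) for _ in range(2000)]

var_single = statistics.variance(single)
var_averaged = statistics.variance(averaged)
```

With fully independent errors the variance of the average drops by about a factor of k; correlated models gain less, which is why methods such as Random Forest work to decorrelate their trees.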


Clustering

  • K-means vs Mixture of Gaussians vs Hierarchical Agglomerative

  • K-Means

    • Structure: K components, each centered at a centroid that is the mean of its component
    • Loss: average distance from each point to its assigned centroid
    • Optimization: iterative EM
    init_centroids(data, k)
    while loss is improving:
        assign points to nearest centroid
        update centroid to the mean of its component
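
A runnable sketch of the loop above. It makes a few assumptions beyond the pseudocode: centroids are initialized by sampling k data points, distance is squared Euclidean, and the loop stops when assignments stop changing (at that point the loss can no longer improve) rather than tracking the loss explicitly.

```python
import random

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_point(points):
    return [sum(coord) / len(points) for coord in zip(*points)]

def k_means(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # init_centroids(data, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: dist2(p, centroids[j])) for p in data
        ]
        if new_assignment == assignment:
            break  # no point moved, so the loss has stopped improving
        assignment = new_assignment
        # Update each centroid to the mean of its component.
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:  # keep the old centroid if a component is empty
                centroids[j] = mean_point(members)
    return centroids, assignment

data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, assignment = k_means(data, k=2)
```

On less well-separated data the result depends on the initialization, so in practice the algorithm is usually run from several random starts and the lowest-loss run is kept.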

Dimensionality Reduction

  • map feature space onto a smaller feature space such that:
    • capture the core relationship in the data
    • drops fine details
  • PCA
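
A sketch of PCA reduced to its first principal component, found by power iteration on the covariance matrix. Power iteration is a swapped-in technique chosen to keep the example dependency-free (libraries typically use SVD), and it assumes the data has non-zero variance. The toy points lie roughly on the line y = 2x.

```python
import math

def center(data):
    means = [sum(col) / len(data) for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]

def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iters=200):
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] * len(cov)
    for _ in range(iters):
        v = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

def project(centered, v):
    # Map each point onto the 1-D subspace spanned by v.
    return [sum(x * vi for x, vi in zip(row, v)) for row in centered]

data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
centered = center(data)
component = first_component(covariance(centered))
reduced = project(centered, component)
```

The 2-D points collapse to one coordinate each along the direction of greatest variance, keeping the core relationship and dropping the small deviations from the line.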

Instance Based Learning

  • KNN
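
A minimal k-nearest-neighbors classifier, assuming squared Euclidean distance and a simple majority vote; the training points and labels are made up.

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Sort training examples by distance to the query and vote among the k closest.
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["a", "a", "a", "b", "b", "b"]
near_origin = knn_predict(train_X, train_y, [0.5, 0.5])
near_far_cluster = knn_predict(train_X, train_y, [5.5, 5.5])
```

There is no training step: the "model" is the stored data, which is what makes the method instance based.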


Hierarchy of goals

  • Organizational Objectives
  • Leading Indicators
  • User Outcomes
  • Model Properties

  • Effects are bottom up, but motivations are top down

User Experiences

  • All models make mistakes, and ML models make weird mistakes
  • Mitigate mistakes
    • manage the number and types of mistakes users see
    • control how easy it is for users to identify mistakes
    • give users options for recovering from mistakes
  • Balancing the experience
    • Forcefulness
    • Frequency
    • Value of success / Cost of failure
    • Quality
  • Forcefulness
    • Forceful: automatic and can’t be ignored; use only when quality is good and the cost of failure is low
    • Passive: the user needs to accept it, or it can be ignored; use when interaction is frequent and value is low
  • Delete -> Suppress -> Inform

Design Pattern

| Design Pattern | Example | Best When | Challenges |
|---|---|---|---|
| Corpus Centric | CV / Speech | Hard but stable problem; can’t use data from user interaction; bootstrapping a new system | Data collection; sophisticated modeling |
| Closed Loop | Recommender | Open-ended / time-changing problems; large scale | Shaping user/ML interactions; orchestrating an evolving system |

  • Match run time and training time
    • Run Time GetContext() = Training Time GetHistoryData()
    • Shared Featurizer
  • Corpus Centric
    • Collect data for stable problem (face detection etc)
    • Model coupling must be managed
      • complexity in ops
      • advantages in efficiency and team
    • Verifying ML systems requires drilling into subpopulations
    • Think about evolution and how you want to encapsulate information
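
The "match run time and training time" point above can be sketched as one feature function used by both paths. Everything here is hypothetical: `get_context` and `get_history_data` simply mirror the GetContext() / GetHistoryData() names in the notes, and the message fields and features are invented for the example.

```python
def featurize(context):
    # Single feature function shared by training and runtime.
    text = context["text"]
    return {
        "length": len(text),
        "has_link": int("http" in text),
    }

def get_context(live_event):
    # Runtime path: build the context from the live program state.
    return {"text": live_event["message"]}

def get_history_data(log_rows):
    # Training path: rebuild the same context shape from logged telemetry.
    return [({"text": row["message"]}, row["label"]) for row in log_rows]

log_rows = [{"message": "see http://example.com", "label": 1}]
training_examples = [
    (featurize(context), label) for context, label in get_history_data(log_rows)
]
runtime_features = featurize(get_context({"message": "see http://example.com"}))
```

Because both paths call the same `featurize`, an input seen at run time yields exactly the features the model was trained on, which is the property the shared featurizer is meant to guarantee.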


  • Training
    • Telemetry / Corpus -> training data
    • Feature code in sync
    • Computation & data
  • Model Management
    • verification
    • controlled rollout
    • support online evaluation
  • Telemetry
    • Verifying outcomes
    • Collect new training data
    • Select what to observe
  • Runtime
    • Program State -> training data
    • synced feature code
    • model inference
    • UX based on model prediction
  • Orchestration / Ops
    • Monitoring
    • Inspect interactions
    • Adaptation (thresholding etc?)
    • Deal with mistakes

Adversarial Machine Learning

  • ML assumes I.I.D. data
    • but this is almost always violated in practice
    • and essentially always violated in adversarial settings
  • The most obvious way to use machine learning is not always the best way