Big Takeaways
- Embracing errors
- Trade-offs
- How to set everything up so we are on track for success
Components of a Machine Learning Algorithm
- Model Structure
- Loss Function
- Optimization
Logistic Regression
- Linear model with sigmoid activation: \(y=\frac{1}{1 + e^{-z}},\quad z = w^Tx + b,\quad z = \log(\frac{y}{1-y})\)
- Log Loss: \(logloss(\hat{y}, y) = -\frac{1}{n}\sum(y\log(\hat{y}) + (1-y)\log(1-\hat{y}))\); minimizing the negative log-likelihood = maximizing the log-likelihood
- Gradient Descent: \(w_i = w_i - \alpha \frac{\partial Loss(y, f(x))}{\partial w_i}\)
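A minimal numpy sketch of these three components together (variable names like `X`, `y`, and `lr` are placeholders, not from the source notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Model: sigmoid(w^T x + b); loss: log loss; optimization: gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w + b)      # model structure
        grad = (y_hat - y) / n          # derivative of mean log loss w.r.t. z
        w -= lr * (X.T @ grad)          # w_i = w_i - alpha * dLoss/dw_i
        b -= lr * grad.sum()
    return w, b
```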
Discriminative vs Generative and Bayes Rule
- Discriminative: model $P(y|x)$, the conditional probability (posterior)
- Generative: model $P(x, y)$, the joint probability; use Bayes rule to get $P(y|x)$
- Bayes rule: \(P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)\), so \(P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\neg A)P(\neg A)}\)
- Naive Bayes: \(y = \underset{y_j\in Y}{argmax} \frac{P(X|y_j)P(y_j)}{P(X)} = \underset{y_j\in Y}{argmax}\ P(y_j)\prod_{i}P(x_i|y_j)\) (see the sketch after this list)
-
MLE vs MAP
- MLE: probability from data alone
- MAP: probability from data + prior
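As a concrete illustration of the generative route, a hedged Bernoulli Naive Bayes sketch over binary features (`X`, `y`, and `alpha` are assumed names; Laplace smoothing with `alpha` acts like a weak prior, so `alpha = 0` is the pure-MLE estimate):

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate P(y_j) and P(x_i = 1 | y_j) from counts, with Laplace smoothing."""
    classes = np.unique(y)
    prior = np.array([(y == c).mean() for c in classes])
    cond = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                     for c in classes])
    return classes, prior, cond

def predict_naive_bayes(x, classes, prior, cond):
    # argmax_j of log P(y_j) + sum_i log P(x_i | y_j); logs avoid underflow
    log_post = np.log(prior) + np.log(cond) @ x + np.log(1 - cond) @ (1 - x)
    return classes[np.argmax(log_post)]
```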
Decision Trees
- Model structure: tree
- Loss: $\frac{1}{n}\sum Entropy(Leaf(S_i))$, where $Entropy(S) = -\sum_i p(y=i)\log_2 p(y=i)$ (see the entropy sketch after this list)
- Optimization: Greedy search (single-step lookahead) with pruning
- Interpretations
- features near the root
- most often used features
- aggregated information gains
- prominent paths (highly accurate, or lots of data follow them)
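A small sketch of the entropy criterion used for choosing splits, assuming `y` is an array of class labels and `mask` selects one side of a candidate split (both names are hypothetical):

```python
import numpy as np

def entropy(y):
    """Entropy(S) = -sum_i p(y=i) * log2 p(y=i)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(y, mask):
    """Entropy reduction from splitting y into y[mask] and y[~mask]."""
    w = mask.mean()
    return entropy(y) - (w * entropy(y[mask]) + (1 - w) * entropy(y[~mask]))
```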
Feature Engineering
- Goal:
- create feature representations that match the model structure
- balance data size, # of features, feature complexity, and model complexity
- Data types:
- Binary / Categorical / Numeric
- Data sources:
- System state / content / user info / metadata / interaction history
- Features for text:
- Token / Bag of words / N-grams (character-level helps with OOV, spelling errors, etc.)
- TF x IDF (see the TF-IDF sketch after this list):
  - TF: Prob(term | document) - the term's importance to the document
  - IDF: log(# total docs / # docs with term) - the term's novelty across the corpus
- Embeddings: w2v, fasttext
- Normalization: helps some models' optimization - the model does not need to learn what counts as big or small
- Start with standard / common practice features
- CV: Gradients, Histograms, Convolutions
- Internet: IP, Domain, Relationships, Reputation
- Time Series: window statistics, FFT
- NN: attention
- Look at errors, combine with domain experts
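A rough TF-IDF sketch following the definitions above; `docs` is assumed to be a list of token lists (names hypothetical):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf * idf} map per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}  # novelty across the corpus
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return weighted
```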
Feature Selection
- Frequency: ~10-50 appearances for the model to have a chance to pick a feature up
- Mutual Information: \(MI(X, Y) = \sum_{y}\sum_{x}p(x, y)\log(\frac{p(x, y)}{p(x)p(y)})\) (see the sketch after this list)
- Through the model's CV performance
- Models with built-in selection: trees, lasso, etc.
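A brute-force sketch of the mutual-information formula for two discrete feature/label arrays (an empirical plug-in estimate; names are assumptions):

```python
import numpy as np

def mutual_information(x, y):
    """MI(X, Y) = sum_x sum_y p(x, y) * log(p(x, y) / (p(x) * p(y)))."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi
```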
Classifier Evaluation
- From the confusion matrix (see the metrics sketch after this list):
  - Accuracy: (TP + TN) / total
  - Precision: TP / (TP + FP) - % of examples classified 1 that are indeed 1
  - Recall: TP / (TP + FN) - % of 1s classified as 1
  - FP rate: FP / (FP + TN) - % of 0s classified as 1
  - FN rate: FN / (TP + FN) - % of 1s classified as 0
  - F1 score: 2 * Precision * Recall / (Precision + Recall)
- AUC, area under the ROC curve - higher AUC does not guarantee better FP / FN rates at the operating point
- Mistakes have different costs
- Operating points: sweep the threshold along the ROC curve, trading the FN rate against the FP rate
  - Find a good trade-off point and look up the corresponding threshold
- Reasons to reset the threshold: more data, a new model, new users, data drift
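The confusion-matrix metrics above, collected into one small helper (a sketch; the count names are placeholders):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard classifier metrics derived from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # true positive rate
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "fp_rate": fp / (fp + tn),
        "fn_rate": fn / (tp + fn),
        "f1": 2 * precision * recall / (precision + recall),
    }
```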
Model Selection / Hyper-parameter tuning
- Central Limit Theorem:
- distribution of sample means approximates a normal distribution
- where the average of the sample means gets closer and closer to the population mean as we draw more samples
- and the variance of the sample means shrinks with sample size (population variance / n)
- New model's metric lower bound > old model's metric upper bound -> only a small chance the old model is actually better (see the interval sketch after this list)
- Cross validation
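A hedged sketch of that interval comparison: per-fold cross-validation scores (`scores_new`, `scores_old` are assumed names) have roughly normal means by the CLT, so mean +/- 2 standard errors gives an approximate 95% interval:

```python
import numpy as np

def approx_interval(scores):
    """Mean +/- 2 standard errors of per-fold scores (rough 95% interval)."""
    m = np.mean(scores)
    se = np.std(scores, ddof=1) / np.sqrt(len(scores))
    return m - 2 * se, m + 2 * se

def new_model_clearly_better(scores_new, scores_old):
    # new model's lower bound above the old model's upper bound
    return approx_interval(scores_new)[0] > approx_interval(scores_old)[1]
```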
Overfitting and Underfitting
- No Free Lunch Theorem -> no one model always wins
- Bias and Variance tradeoffs
- Bias: the learning setup's inability to capture the truth
- Variance: the learning setup's over-sensitivity to the particular training data
- Traditional ML models usually trade off between the two
- Estimate with cross validation / bootstrap etc
- Bias is from model structure
- Variance can be controlled with regularization
- L1, L2 regularization for parametric models (see the ridge sketch after this list)
- max_depth, min_leaf etc for trees
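As one concrete example of L2 regularization in a parametric model (not from the source notes), ridge regression adds lambda * ||w||^2 to the squared loss; a larger `lam` shrinks the weights and lowers variance:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```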
Ensemble
- Combine many models to combat the bias-variance trade-off
- Learn the problem in many different ways and then combine
- Bagging, Random Forest, Stacking - low-bias, high-variance individual models
- Encourage variance among the models and hope the errors cancel out (see the bagging sketch after this list)
- Learn different parts of the problem with different models and then combine
- Boosting - weak learners
- Learn the residual / or re-weight data based on error
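A bagging sketch: train each model on a bootstrap resample and average the predictions so the individual models' variance partly cancels (`train_fn` and the `.predict` interface are assumptions):

```python
import numpy as np

def fit_bagged(train_fn, X, y, n_models=10, seed=0):
    """Train n_models copies, each on a bootstrap sample drawn with replacement."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))
        models.append(train_fn(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Average the individual predictions (majority vote also works for classifiers)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```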
Clustering
- K-means vs Mixture of Gaussians vs Hierarchical Agglomerative
- K-Means
  - Structure: K components, each centered at a centroid that is the average of the points in the component
  - Loss: average distance from each point to its assigned centroid
  - Optimization: iterative EM
    init_centroids(data, k)
    while loss is improving:
        assign points to the nearest centroid
        update each centroid to the mean of its component
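A runnable numpy version of the loop above (assumes `data` is an (n, d) array and that no cluster goes empty):

```python
import numpy as np

def kmeans(data, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]  # init from k random points
    for _ in range(n_iters):
        # E-step: assign every point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its component
        centroids = np.array([data[assign == j].mean(axis=0) for j in range(k)])
    return centroids, assign
```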
Dimensionality Reduction
- Map the feature space onto a smaller feature space such that it:
  - captures the core relationships in the data
  - drops fine details
- PCA
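A minimal PCA sketch via SVD of the centered data (an assumed interface; `n_components` is the target dimensionality):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto the directions of largest variance."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # top principal directions
    return Xc @ components.T, components         # reduced data, projection matrix
```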
Instance Based Learning
- KNN
Architecture
Hierarchy of goals
- Organizational Objectives
- Leading Indicators
- User Outcomes
- Model Properties
- Effects are bottom up, but motivations are top down
User Experiences
- All models make mistakes, and ML models make weird mistakes
- Mitigate mistakes
- manage the number and types of mistakes users see
- control how easy it is for users to identify mistakes
- give users options for recovering from mistakes
- Balancing the experience
- Forcefulness
- Frequency
- Value of success / Cost of failure
- Quality
- Forcefulness
- Forceful: automatic, can't be ignored; only do this when quality is good and the cost of failure is low
- Passive: the user needs to accept it, or it can be ignored; use when interaction is frequent and the value of each success is low
- Delete -> Suppress -> Inform
Design Pattern
| Design Pattern | Example | Best When | Challenges |
|---|---|---|---|
| Corpus Centric | CV / Speech | Hard but stable problem; can't use data from user interaction; bootstrapping a new system | Data collection; sophisticated modeling |
| Closed Loop | Recommender | Open-ended / time-changing problems; large scale | Shaping user/ML interactions; orchestrating an evolving system |
- Match run time and training time
  - Run Time `GetContext()` = Training Time `GetHistoryData()`
  - Shared `Featurizer` between run time and training time (see the sketch after this list)
- Corpus Centric
- Collect data for a stable problem (face detection, etc.)
- Model coupling must be managed
- complexity in ops
- advantages in efficiency and team
- Verifying ML systems requires drilling into subpopulations
- Think about evolution and how you want to encapsulate information
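A hedged sketch of the shared-featurizer idea: the same feature code serves both `GetHistoryData()` at training time and `GetContext()` at run time (every name below is hypothetical):

```python
def featurize(context):
    """Single feature function used by both the training and serving paths."""
    return {
        "n_items": len(context.get("items", [])),
        "hour_of_day": context.get("hour_of_day", 0),
    }

def build_training_rows(history_events):
    # training time: replay logged contexts through the same featurizer
    return [(featurize(event["context"]), event["label"]) for event in history_events]

def predict(model, live_context):
    # run time: the live context takes the identical code path
    return model.predict_one(featurize(live_context))
```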
Components
- Training
- Telemetry / Corpus -> training data
- Feature code in sync
- Computation & data
- Model Management
- verification
- controlled rollout
- support online evaluation
- Telemetry
- Verifying outcomes
- Collect new training data
- Select what to observe
- Runtime
- Program State -> training data
- synced feature code
- model inference
- UX based on model predictions
- Orchestration / Ops
- Monitoring
- Inspect interactions
- Adaptation (thresholding etc?)
- Deal with mistakes
Adversarial Machine Learning
- ML assumes i.i.d. data
- but almost always violated in practice
- really always violated in adversarial settings
- The most obvious way to use machine learning is not always the best way