Machine Learning Design Patterns
The Need for Machine Learning Design Patterns
Each pattern describes a problem which occurs over and over again, then describes the core of the solution to that problem.
Each solution is stated in such a way that it gives the essential field of relationships needed to solve the problem, but in a very general and abstract way – so that you can solve the problem for yourself, in your own way, by adapting it to your preferences, and the local conditions at the place where you are making it.
– A Pattern Language, Christopher Alexander, 1977.
Adapted to software in Design Patterns: Elements of Reusable Object-Oriented Software, Erich Gamma et al., 1995.
Data Representation Design Patterns
- Feed in as is
- Linear scaling
- Nonlinear transformation
- Polynomial expansions
- Binning / Bucketizing - histogram equalization, use quantiles to cut
- Rank Gauss?
- Array of numbers of variable lengths
- use bulk statistics
- One Hot Encoding
- Array of categorical of variable lengths
- Multi Hot Encoding
- Frequencies - Can be normalized (TF-IDF)
- PADDED to same length with
/ NaN if values are ordered.
- use bulk statistics
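A minimal sketch of the quantile-based bucketizing item in the list above, using only the standard library. The feature values and the choice of four bins are made up for illustration.

```python
import bisect
import statistics

def quantile_cut_points(values, n_bins=4):
    """Return the n_bins - 1 interior cut points taken from the data's quantiles."""
    return statistics.quantiles(values, n=n_bins)

def bucketize(value, cut_points):
    """Index of the bin a value falls into (0 .. len(cut_points))."""
    return bisect.bisect_right(cut_points, value)

# Hypothetical skewed feature: squares of 1..100.
values = [x * x for x in range(1, 101)]
cuts = quantile_cut_points(values, n_bins=4)
bins = [bucketize(v, cuts) for v in values]

# Quantile cuts give (roughly) equally populated bins even for skewed data.
counts = [bins.count(b) for b in range(4)]
```

Values outside the training range simply fall into the first or last bin.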
Design Pattern 1: Hashed Feature
Handles common problems with categorical features:
- unknown values at test time
- high cardinality - not enough data to fit one coefficient per level, since levels are treated as independent
- cold start
How it works:
Group categorical features and accept the trade-off of collision in data representation.
- Convert levels to unique strings.
- Run a deterministic and portable hashing algorithm (e.g., FarmHash) on the string.
- Take the hash value modulo the number of buckets.
- Finally, one-hot encode the bucket index.
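The steps above can be sketched as follows. This is an illustration only: SHA-256 from the standard library stands in for FarmHash (an assumption, any deterministic and portable hash would do), and the airport codes and bucket count are hypothetical.

```python
import hashlib

def hashed_bucket(level, num_buckets):
    """Map a categorical level to a bucket index deterministically."""
    # Assumption: SHA-256 stands in for FarmHash here.
    digest = hashlib.sha256(level.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def one_hot(index, size):
    """One-hot vector of the given size with a 1 at index."""
    vec = [0] * size
    vec[index] = 1
    return vec

NUM_BUCKETS = 10  # hypothetical; this is the extra hyperparameter to tune
airports = ["DTW", "LAX", "ORD", "NEVER_SEEN_IN_TRAINING"]
encoded = {a: one_hot(hashed_bucket(a, NUM_BUCKETS), NUM_BUCKETS) for a in airports}
```

Python's built-in `hash()` is avoided on purpose: for strings it is salted per process, so it is not repeatable across runs.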
Why it works:
- An unknown value at test time gets assigned to some bucket, effectively at random.
- It will not cause an error.
- Of course, how good the inference is for an unknown value is itself unknown.
- High cardinality is handled: we work with the number of buckets instead of the number of unique values.
- Also, we don’t need to store the unique values, because we have the hash function.
- The hash is effectively random and lossy, so it may introduce noise as well.
- Cold start will eventually be solved
- as long as we periodically retrain the model with new data.
Tradeoffs and Alternatives:
We have to accept that levels in the same bucket will be treated identically by the model, whether they are similar or different.
- Bucket collision - At least two levels share the same model parameter.
- Skew - One level dominates the bucket.
- One more hyperparameter to tune.
After hashing, consider adding features aggregated by level, so that the model loses less information.
Design Pattern 2: Embeddings
Some data types (categoricals, text, images, audio, etc.) can be hard for a machine learning model to extract patterns from in their raw storage format. Embeddings are learnable data representations that may help the model use such data.
In practice, these embeddings end up capturing closeness relationships in the input data. Thus we can use embeddings as a replacement for clustering. And if the embedding is shallow, it can also substitute for dimensionality reduction.
Tradeoffs and Alternatives:
Embedding data into a lower dimension loses information (hopefully what is lost is all noise). In return we gain information about the closeness and context of the items.
Choosing the dimensionality - another hyperparameter to tune.
Some rules of thumb:
- use the fourth root of the total number of unique categorical elements.
- use 1.6 times the square root of the total number of unique elements, and no less than 600.
Training embeddings in an end-to-end supervised fashion may require lots of data. Can try a (denoising) autoencoder instead.
Design Pattern 3: Feature Cross (Cartesian Product)
A feature cross helps models learn relationships between inputs faster by explicitly making each combination of input values a separate feature. Doing so encodes nonlinearity into the model and allows predictive power greater than what the features could provide individually. Deep neural networks and tree ensembles can learn these interactions on their own, but creating them explicitly may help.
Can apply to bucketized numerical features too.
May cause high cardinality, thus can be chained with an embedding layer.
1. Use a regularized model to deal with the noise from feature crosses.
2. Don’t cross two features with high correlation/association.
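A small sketch of a feature cross over two categorical features, one-hot encoded over the Cartesian product of their levels. The feature names and levels are hypothetical.

```python
from itertools import product

def cross(a, b):
    """Name of the crossed level for one pair of input levels."""
    return f"{a}_x_{b}"

# Hypothetical levels of two categorical (or bucketized numerical) features.
days = ["Mon", "Tue"]
hours = ["morning", "evening"]

# The crossed vocabulary is the Cartesian product: |days| * |hours| levels.
vocab = [cross(d, h) for d, h in product(days, hours)]
index = {level: i for i, level in enumerate(vocab)}

def encode(day, hour):
    """One-hot encode the crossed feature."""
    vec = [0] * len(vocab)
    vec[index[cross(day, hour)]] = 1
    return vec

tuesday_morning = encode("Tue", "morning")
```

The vocabulary grows multiplicatively with the number of levels, which is why the crossed feature is often hashed or fed to an embedding layer.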
Design Pattern 4: Multimodal Input
Different types of data are each embedded individually, then concatenated to form the input to the model.
Or, different representations of the same data are mixed together:
- Concatenating/averaging word vectors.
- Mixing BOW with word vectors.
- Concatenating word vectors with stats and hand-crafted features on text.
- PCA/ICA/SVD, then running a 1D CNN on the combined inputs.
- A mixture of global average pooling and max pooling.
- Concatenating features at different resolutions (not resizing images, but extracting from different layers).
Interpretability/explainability becomes harder with a multimodal model.
Problem Representation Design Patterns
Design Pattern 5: Reframing
The reframing design pattern refers to changing the representation of the output of a machine learning problem.
If the target has a sharp, roughly normal PDF, better stick with regression. For other, weird distributions, or large variance, we may treat the problem as multiclass classification, predicting the bucket the value may fall into.
1. We get a probability distribution over buckets with this reframing.
2. Regression methods that assume a conditional distribution may provide this too, but weird distributions like Tweedie may be hard to work with???
3. Quantile regression
4. Multi-task learning.
Dataset size is a concern when reframing: the more complex the alternative modeling objective, the more data it may need.
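A minimal sketch of reframing a regression target as multiclass classification over buckets. The bucket edges and the logits are invented; in practice the logits would come from a trained classifier.

```python
import bisect
import math

EDGES = [10.0, 20.0, 30.0]  # hypothetical bucket boundaries -> 4 classes

def to_class(y):
    """Map a continuous target to the index of its bucket."""
    return bisect.bisect_right(EDGES, y)

def softmax(logits):
    """Turn classifier logits into a probability distribution over buckets."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Training side: continuous targets become class labels.
labels = [to_class(y) for y in [3.2, 14.9, 20.0, 57.1]]

# Serving side: the model outputs a distribution, not a single number.
probs = softmax([0.1, 2.0, 1.5, -1.0])  # made-up logits
most_likely = max(range(len(probs)), key=probs.__getitem__)
```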
Design Pattern 6: Multilabel
Use multilabel when:
- A single training example can be associated with more than one label, i.e., the labels are not mutually exclusive.
- A single training example can have many hierarchical labels.
- Labelers describe the same item in different ways, and each interpretation is accurate.
With a NN this is easy: Softmax/Argmax -> Sigmoid/TopK
Hierarchical labels:
1. flatten the label hierarchy
2. multilevel models
Overlapping labels: label smoothing, mixup?
For other models (trees, etc.), use multiple one-vs-rest binary classifiers.
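A sketch of the Softmax/Argmax -> Sigmoid/TopK switch on the output side. The label names, logits, and the 0.5 threshold are assumptions for illustration.

```python
import math

LABELS = ["python", "pandas", "plotting"]  # hypothetical tags

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_by_threshold(logits, threshold=0.5):
    """Each label gets its own independent probability; keep those above threshold."""
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]

def predict_top_k(logits, k=2):
    """Alternative: keep the k highest-scoring labels."""
    ranked = sorted(zip(logits, LABELS), reverse=True)
    return [label for _, label in ranked[:k]]

logits = [2.3, 0.4, -1.7]  # made-up model outputs
above_threshold = predict_by_threshold(logits)
top_one = predict_top_k(logits, k=1)
```

Unlike softmax, the sigmoid outputs do not sum to one, so several labels (or none) can be predicted for a single example.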
Design Pattern 7: Ensembles
Combine multiple machine learning models and aggregate their results to make predictions.
Total error = reducible error (bias + variance) + irreducible error
Bias: error from the assumptions the model makes about the relationship between X and y.
Variance: the amount the model will change when the training data changes.
High bias - oversimplified model - underfit
High variance - overcomplicated model - overfit
It is hard to achieve both low bias and low variance, thus there is usually a trade-off.
Recent work suggests that with large enough data, hugely overparameterized models can achieve both low variance and low bias.
Bagging, boosting, stacking.
- Bagging - bootstrap aggregating. -> targeted mainly at variance
- Boosting - iteratively adding weak learners. -> targeted mainly at bias
- Stacking - hierarchy of models, where higher-level models take the previous level’s model outputs as inputs. -> learned aggregation.
MC (Monte Carlo) dropout ~= bagging, or just TTA (test-time augmentation)
Drawback: decreased model interpretability.
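A toy sketch of bagging (bootstrap aggregating). The "model" is a deliberately trivial stand-in that predicts the mean of its training targets; the data and the number of ensemble members are hypothetical.

```python
import random
from statistics import mean

def fit_mean_model(targets):
    """Hypothetical stand-in learner: always predicts the mean of its training targets."""
    fitted = mean(targets)
    return lambda: fitted

def bagged_prediction(targets, n_models=25, seed=0):
    rng = random.Random(seed)
    members = []
    for _ in range(n_models):
        # Bootstrap: sample the training set with replacement.
        bootstrap = rng.choices(targets, k=len(targets))
        members.append(fit_mean_model(bootstrap))
    # Aggregate: average the members' predictions.
    return mean([member() for member in members])

targets = [2.0, 3.5, 1.0, 4.0, 10.0, 2.5]
prediction = bagged_prediction(targets)
```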
Design Pattern 8: Cascade
Where a machine learning problem can be profitably broken into a series of ML problems. Cascade is not necessarily a best practice, avoid if possible.
Example: predict the likelihood of a product return, taking into consideration whether the customer is a retail customer or a reseller.
- model0 to predict P(retailer)
- model1 to predict P(return | retailer)
- model2 to predict P(return | reseller)
- model3 to combine the above three
This needs a special design to carry out experiments. That is:
1. model1 and model2 have to be trained on the subsets predicted by model0.
2. model3 has to be trained on the outputs of model1 and model2.
We likely need to train downstream models on stacked out-of-fold predictions from upstream models to avoid:
1. information leak
2. drastically decreasing data size
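A sketch of out-of-fold predictions for an upstream model, so that every row fed to a downstream model was scored by a model that never saw it. The fold assignment, the toy "model0" (a base rate), and the field name `is_retailer` are all hypothetical.

```python
def out_of_fold_predictions(rows, fit, predict, n_folds=3):
    """Predict each row with an upstream model that was not trained on that row."""
    oof = [None] * len(rows)
    for fold in range(n_folds):
        held_out = [i for i in range(len(rows)) if i % n_folds == fold]
        train = [rows[i] for i in range(len(rows)) if i % n_folds != fold]
        model = fit(train)
        for i in held_out:
            oof[i] = predict(model, rows[i])
    return oof

# Hypothetical upstream "model0": the retailer rate observed in its training folds.
rows = [{"is_retailer": r} for r in [1, 0, 0, 1, 0, 0]]

def fit(train):
    return sum(r["is_retailer"] for r in train) / len(train)

def predict(model, row):
    return model

oof = out_of_fold_predictions(rows, fit, predict)
```

All rows keep a prediction, so the downstream models can train on the full dataset rather than on a single held-out slice.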
And a pipeline is needed to organize the whole cascade and make sure the whole thing is retrained all at once.
Design Pattern 9: Neutral Class
Collecting the right data at the beginning of a machine learning project can avoid a lot of sticky problems down the line.
Instead of hard binary labels, also collect an extra level indicating neutral??? Can’t the probability tell us this??
Design Pattern 10: Rebalancing
Imbalance here refers to the label distribution. It does not mean the dataset lacks representation for a specific input/input level.
Metrics:
- Choosing an evaluation metric - precision/recall/F1/AUC
Sampling:
- Downsampling the majority class
- Upsampling the rare class
Weighting:
- class weight, or really sample weight
Alternatives: Reframing / Cascade
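A sketch of two of the options above: inverse-frequency class weights and downsampling of the majority class. The 90/10 label split is made up, and the weighting formula is one common heuristic, not the only choice.

```python
import random
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}

def downsample_majority(rows, labels, seed=0):
    """Keep as many rows per class as the rarest class has."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    kept = []
    for c in counts:
        idx = [i for i, y in enumerate(labels) if y == c]
        kept.extend(rng.sample(idx, target))
    kept.sort()
    return [rows[i] for i in kept], [labels[i] for i in kept]

labels = [0] * 90 + [1] * 10          # hypothetical imbalanced labels
rows = list(range(len(labels)))       # stand-in for feature rows
weights = class_weights(labels)
small_rows, small_labels = downsample_majority(rows, labels)
```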
Model Training Patterns
Design Pattern 11: Useful Overfitting
There are very few scenarios where intentionally overfitting can be useful.
- To verify that the model works, by intentionally overfitting a small sample of the training data. If it cannot overfit, either the model’s complexity is too low or there are bugs in the code.
- Knowledge distillation. Train a smaller model to mimic a well-trained large model, aiming at portability and quick inference time. The small network should have enough capacity, but may be hard to train as is.
- When the goal is to approximate a known deterministic process, so there is no noise and no unseen y~X relationship at inference time.
Design Pattern 12: Checkpoints
- Allow resume training.
- Allow early stopping.
- Allow ensemble with multiple checkpoints.
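A sketch of checkpointing combined with early stopping. Training is simulated: the list of validation losses is invented, and the checkpoint is a small JSON file standing in for real model weights.

```python
import json
import os
import tempfile

def train_with_checkpoints(val_losses, ckpt_path, patience=2):
    """Save a checkpoint whenever validation loss improves; stop after `patience` bad epochs."""
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0
            with open(ckpt_path, "w") as f:
                # A real checkpoint would also hold model and optimizer state.
                json.dump({"epoch": epoch, "val_loss": val_loss}, f)
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping
    with open(ckpt_path) as f:
        return json.load(f)  # restore the best checkpoint

simulated_val_losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.5]  # hypothetical
ckpt_path = os.path.join(tempfile.mkdtemp(), "best.json")
best = train_with_checkpoints(simulated_val_losses, ckpt_path)
```

With `patience=2` the loop stops before the final 0.5 is ever seen, which is the usual trade-off of early stopping.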
Design Pattern 13: Transfer Learning
Take a model that has been trained on data of the same modality and adapt it to the data at hand.
E.g., use an ImageNet-trained CNN for the image classification task at hand.
Fine-tuning vs feature extraction
Design Pattern 14: Distribution Strategy
Data parallelism vs model parallelism
Synchronous training vs Asynchronous training
Synchronous training is particularly vulnerable to slow devices or poor network connection because training will stall waiting for updates from all workers. This means synchronous distribution is preferable when all devices are on a single host and there are fast devices (for example, TPUs or GPUs) with strong links. On the other hand, asynchronous distribution is preferable if there are many low-power or unreliable workers. If a single worker fails or stalls in returning a gradient update, it won’t stall the training loop. The only limitation is I/O constraints.
Design Pattern 15: Hyperparameter Tuning
- Manual Tuning
- Grid Search
- Bayesian optimization
The hyperparameters that need to be tuned fall into two groups: 1. those related to model architecture and 2. those related to model training.
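A minimal sketch of grid search over one architecture hyperparameter and one training hyperparameter. The `evaluate` function is a made-up stand-in for "train the model and return a validation score".

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in the grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def evaluate(params):
    # Hypothetical validation score that peaks at learning_rate=0.1, num_layers=2.
    return -(params["learning_rate"] - 0.1) ** 2 - (params["num_layers"] - 2) ** 2

grid = {
    "learning_rate": [0.01, 0.1, 1.0],  # training-related
    "num_layers": [1, 2, 3],            # architecture-related
}
best_params, best_score = grid_search(grid, evaluate)
```

The number of trials is the product of the grid sizes, which is what makes Bayesian optimization attractive for larger search spaces.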
Resilient Serving Design Patterns
Design Pattern 16: Stateless Serving Function - Latency
Export the model into a language-agnostic format that captures the parameters and a stateless forward pass.
Make sure the model is loaded only on cold starts.
Individual requests are handled synchronously and as quickly as possible.
Expose inference as a REST API.
Autoscaling can be achieved via web application frameworks.
Design Pattern 17: Batch Serving - Throughput
When inferences are not latency sensitive. Precompute / Refresh periodically and serve.
Eg. Daily personalized playlist recommendation.
The batch serving design pattern uses distributed data processing infrastructure (MapReduce, Spark, Beam, etc.) to do ML inference. Or really any parallel computing backend.
Lambda architecture - A production ML system that supports both online serving and batch serving.
Design Pattern 18: Continued Model Evaluation
Machine learning training usually creates a static model from historical data. When the model goes into production, it can start to degrade and its predictions can grow increasingly unreliable.
Concept drift occurs whenever the relationship between the model inputs and the target has changed. This often happens because the underlying assumptions have changed, for example in models trained on adversarial or competitive behavior such as fraud detection, spam filtering, ad bidding, or cybersecurity. The adversary learns to adapt and may modify their behavior.
Data drift refers to any change in the data being fed to your model for prediction as compared to the data that was used for training. Data drift can occur when:
1. the input data schema changes at the source
2. feature distributions change over time
3. the meaning of the data changes - like the skill sets for a role, etc.
4. previously unseen data appears - new levels, new min/max values, etc.
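A rough sketch of checking for the last kind of data drift (unseen levels, new min/max values) by comparing serving data against statistics saved at training time. The function names and the shape of the report are hypothetical.

```python
def summarize_training(numeric_values, categorical_values):
    """Statistics to store alongside the model at training time."""
    return {
        "min": min(numeric_values),
        "max": max(numeric_values),
        "levels": set(categorical_values),
    }

def drift_report(stats, numeric_values, categorical_values):
    """Flag serving values that fall outside what the model saw in training."""
    return {
        "out_of_range": [
            x for x in numeric_values if x < stats["min"] or x > stats["max"]
        ],
        "unseen_levels": sorted(set(categorical_values) - stats["levels"]),
    }

stats = summarize_training([1.0, 5.0, 3.0], ["a", "b"])
report = drift_report(stats, [2.0, 7.5, 0.5], ["a", "c"])
```

Detecting shifts in the shape of a feature distribution needs a statistical comparison on top of this, which is not shown here.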
The most direct way to identify model deterioration is to continuously monitor your model’s predictive performance over time, and assess that performance with the same evaluation metrics you used during development. This information helps determine how frequently to retrain a model or when to replace it with a new version entirely.
Online prediction inputs and outputs are regularly sampled and saved. Evaluation cannot take place until ground truth data is available.
1. One approach is to use a human labeling service: all instances sent to the model for prediction, or maybe just the ones for which the model has marginal confidence, are sent out for human annotation.
2. Ground truth labels can also be derived from how users interact with the model and its predictions. By having users take a specific action, it is possible to obtain implicit feedback for a model’s prediction or to produce a ground truth label. Be aware of hidden feedback loops.
It is important to keep in mind how the feedback loop of model predictions and capturing ground truth might affect training data down the road. For example, suppose you’ve built a model to predict when a shopping cart will be abandoned. You can even check the status of the cart at routine intervals to create ground truth labels for model evaluation. However, if your model suggests a user will abandon their shopping cart and you offer them free shipping or some discount to influence their behavior, then you’ll never know if the original model prediction was correct.
Continuous evaluation allows you to measure precisely how much the model has degraded, in a structured way, and provides a trigger to retrain the model. It is important to track and validate the triggers as well. Not knowing when your model has been retrained inevitably leads to issues. Even if the process is automated, you should always have control of the retraining of your model to better understand and debug the model in production.
This process of retraining is often carried out by fine-tuning the previous model using any newly collected training data. Where continued evaluation may happen every day, scheduled retraining jobs may occur only every week or every month.
Estimating retraining interval
Train a model on stale data and assess the performance of that model on more current data. Split by time and estimate?
Design Pattern 19: Two-Phase Predictions
Models deployed on edge devices typically need to be smaller than models deployed in the cloud, and consequently require balancing trade-offs between model complexity and size, update frequency, accuracy, and low latency.
Quantization. Quantization aware training.
Split problem into two parts. A smaller, cheaper model deployed on-device (wake word) followed by a second, more complex model deployed in the cloud and triggered only when needed.
Design Pattern 20: Keyed Predictions
Collect predictions from batch inference on a distributed system (order within the batch can get shuffled, thus a key is needed to match each output to its input).
Adding keys before invoking inference and removing keys after reordering results can be slow.
Having the client code provide the key is better.
If bandwidth is not an issue, we can send back (input, output) pairs and the client can do the matching.
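A sketch of client-supplied keys passed through a batch job whose output order is not guaranteed. `batch_predict` is a made-up stand-in for the distributed inference job, and the shuffle simulates workers finishing out of order.

```python
import random

def batch_predict(instances, seed=0):
    """Hypothetical batch job: passes each key through untouched; output order is arbitrary."""
    outputs = [
        {"key": inst["key"], "prediction": inst["x"] * 2}  # toy model: y = 2x
        for inst in instances
    ]
    random.Random(seed).shuffle(outputs)  # simulate out-of-order results
    return outputs

# The client attaches its own key to every instance...
instances = [{"key": f"req-{i}", "x": i} for i in range(5)]

# ...and uses it to match outputs back to inputs, whatever order they arrive in.
by_key = {out["key"]: out["prediction"] for out in batch_predict(instances)}
```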
Reproducibility Design Patterns
Design Pattern 21: Transform
Save feature transformations as deterministic functions.
Design Pattern 22: Repeatable Splitting
Split on the hash of a column (FARM_FINGERPRINT?). When new data comes in, previous data will still belong to the same splits.
Make sure to stratify and avoid information leaks as best as possible.
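A sketch of a hash-based split that stays stable as new data arrives. SHA-256 stands in for FARM_FINGERPRINT (an assumption), and the date strings used as split keys and the 80/10/10 proportions are hypothetical.

```python
import hashlib

def assign_split(key, train_pct=80, valid_pct=10):
    """Deterministically map a key to train/valid/test via a hash bucket in 0..99."""
    # Assumption: SHA-256 stands in for FARM_FINGERPRINT.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + valid_pct:
        return "valid"
    return "test"

# Rows sharing a key (here: a date) always land in the same split,
# which helps keep correlated rows from leaking across splits.
old_keys = [f"2019-01-{day:02d}" for day in range(1, 29)]
before = {k: assign_split(k) for k in old_keys}

# New data arrives later; the old keys keep their assignments.
new_keys = old_keys + [f"2019-02-{day:02d}" for day in range(1, 29)]
after = {k: assign_split(k) for k in new_keys}
```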
Design Pattern 23: Bridged Schema
payment type: card/cash -> credit card/cash/gift card
Let’s say we know the old card payment data is actually 10% gift card and 90% credit card. During training, randomly assign 10% of the old card data as gift card and 90% as credit card.
Or we can encode old card data as [0, 0.1, 0.9] and new credit card data as [0, 0, 1], with columns ordered as (cash, gift card, credit card).
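Both bridging options can be sketched as follows, using the 10%/90% split from the example. The function names are hypothetical.

```python
import random

COLUMNS = ["cash", "gift card", "credit card"]
CARD_PRIOR = {"gift card": 0.1, "credit card": 0.9}  # assumed known, as in the example

def encode_static(payment_type):
    """Static bridging: old 'card' rows get the prior, new rows get a one-hot vector."""
    if payment_type == "card":  # old-schema value
        return [CARD_PRIOR.get(column, 0.0) for column in COLUMNS]
    return [1.0 if column == payment_type else 0.0 for column in COLUMNS]

def impute_probabilistic(payment_type, rng):
    """Probabilistic bridging: randomly relabel old 'card' rows according to the prior."""
    if payment_type != "card":
        return payment_type
    return rng.choices(list(CARD_PRIOR), weights=list(CARD_PRIOR.values()))[0]

old_row = encode_static("card")
new_row = encode_static("credit card")
relabeled = impute_probabilistic("card", random.Random(0))
```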
As new-schema data keeps coming in, experiment to find when the old data no longer improves the model.
Alternatives: Union Schema won’t work. It is backward compatible, but at inference time we are likely getting only new-schema data.
Cascade - imputing? Too much complexity.
Design Pattern 24: Windowed Inference
Windowed Inference design pattern handles models that require an ongoing sequence of instances in order to run inference.
For example, in anomaly detection, the definition of an anomaly can differ by time of day. What it means to be extremely busy in the morning can be quite different from lunch time. Thus features are computed per time window. As a result, a single input value is not sufficient for the model to do inference. We can either store all the data and recalculate every time, or we can calculate periodically (or on event triggers) and save whatever state is needed to perform inference on a single new input.
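A small sketch of keeping just enough state to score one new value against a sliding window. The window size, the 3-sigma rule, and the stream values are assumptions; a streaming system would hold this state per key and per time window.

```python
from collections import deque
from statistics import mean, pstdev

class WindowedAnomalyDetector:
    """Keeps only the last `window_size` values as state between requests."""

    def __init__(self, window_size=5, threshold=3.0):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update_and_score(self, value):
        is_anomaly = False
        if len(self.window) == self.window.maxlen:
            mu, sigma = mean(self.window), pstdev(self.window)
            # Flag values more than `threshold` standard deviations from the window mean.
            is_anomaly = sigma > 0 and abs(value - mu) > self.threshold * sigma
        self.window.append(value)  # the oldest value drops out automatically
        return is_anomaly

detector = WindowedAnomalyDetector(window_size=5)
stream = [10, 11, 9, 10, 12, 10, 50, 11]  # hypothetical measurements
flags = [detector.update_and_score(v) for v in stream]
```

Nothing is flagged until the window is full, so the first five values are always reported as normal.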
An RNN treated as a stateless model requires its state to be passed in as input features (h0, c0).
Lots of details that I don’t understand yet. Check Stream processing/Beam etc??
Design Pattern 25: Workflow Pipeline
Creating an end-to-end reproducible pipeline by containerizing and orchestrating the steps in the machine learning process.
We want to be able to:
1. run the entire workflow end to end
2. run individual steps
Containers - guarantee that we’ll be able to run the same code in different environments, and we will see consistent behavior between runs.
Data Ingest -> Data Validation -> Data Analysis -> Model Training -> Model Deployment
- Kubeflow pipelines
- Apache Airflow
DAG - directed acyclic graph.
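A toy sketch of the steps above expressed as a DAG and run in dependency order, using the standard library's `graphlib` (Python 3.9+). The `run_step` placeholder is hypothetical; in Kubeflow Pipelines or Airflow each step would be a containerized task.

```python
from graphlib import TopologicalSorter

# step -> set of steps it depends on
DEPENDENCIES = {
    "data_ingest": set(),
    "data_validation": {"data_ingest"},
    "data_analysis": {"data_validation"},
    "model_training": {"data_analysis"},
    "model_deployment": {"model_training"},
}

def run_step(name):
    """Hypothetical placeholder; a real step would launch a container."""
    return f"ran {name}"

def run_pipeline(start_from=None):
    """Run all steps in dependency order, or resume from a given step."""
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    if start_from is not None:
        # Slicing is enough for this linear chain; a branching DAG
        # would need to select the descendants of start_from instead.
        order = order[order.index(start_from):]
    return [run_step(name) for name in order]

full_run = run_pipeline()
partial_run = run_pipeline(start_from="model_training")
```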
Integrating CI/CD with pipelines. Trigger -> Build -> Deploy.
Development versus production pipelines.
Lineage tracking in ML pipeline.
Design Pattern 26: Feature Store
The Feature Store design pattern simplifies the management and reuse of features across projects by decoupling the feature creation process from the development of models using those features.
An ad hoc approach where features are created as needed by ML projects may work for one-off model development and training, but as organizations scale, this method of feature engineering becomes impractical and significant problems arise.
1. Ad hoc features aren’t easily reused. Features are re-created over and over again, either by individual users or within teams, or never leave the pipelines.
2. Data governance is made difficult if each ML project computes features from sensitive data differently.
3. Ad hoc features aren’t easily shared between teams or across projects.
4. Ad hoc features used for training and serving are inconsistent, i.e., training-serving skew.
5. Productionizing features is difficult.
Decouple feature engineering from feature usage
Design Pattern 27: Model Versioning
To gracefully handle updates to a model, deploy multiple model versions with different REST endpoints.
This ensures backward compatibility—by keeping multiple versions of a model deployed at a given time, those users relying on older versions will still be able to use the service.
Versioning also allows for fine-grained performance monitoring and analytics tracking across versions. We can compare accuracy and usage statistics, and use this to determine when to take a particular version offline. If we have a model update that we want to test with only a small subset of users, the Model Versioning design pattern makes it possible to perform A/B testing.
Additionally, with model versioning, each deployed version of our model is a microservice—thus decoupling changes to our model from our application frontend. To add support for a new version, our team’s application developers only need to change the name of the API endpoint pointing to the model.
An ML engineer deploying a new version of a model as an ML model endpoint may want to use an API gateway such as Apigee that determines which model version to call.