Designing ML Systems

MLOps
Author

Imad Dabbura

Published

December 12, 2023

Below are notes I gathered from my readings in MLOps as well as deploying ML systems at different companies.

Project Scoping

  • During training, we care more about throughput. However, once we deploy the model, we care more about latency.
    • Increased latency in production leads to a reduction in customer satisfaction and conversion rates, which usually matter more than slightly more accurate predictions.
    • It is common to focus on the high percentiles of latency, such as the 90th/95th/99th or even the 99.9th percentile.
  • Map ML metrics to business metrics and see how improvement in ML model would lead to an improvement in business metrics.
    • In some cases, ML is just a small part of the overall process that makes it hard to attribute loss/revenue to a particular component in the process.
  • Requirements of ML systems:
    • Reliability: The ML system performs the correct function even in the presence of failures. The challenge is that ML systems can fail silently: how do we check whether predictions are wrong?
    • Scalability: ML system can grow in:
      • Model complexity
      • Number of models
      • Number of requests served by each ML model
    • Maintainability: Code should be documented. Code, data, and artifacts should be versioned. Experiments and models should be reproducible
    • Adaptability: ML system should adapt quickly to change in data distribution and business requirements
  • Developing an ML system is an iterative process that we typically go back and forth between different steps such as: scoping project, data engineering, model development, model deployment, monitoring and retraining, business analysis, etc.
  • For multiclass classification problems with a lot of classes, it may be helpful to frame it as a hierarchical classification where each example is first classified into few major classes then another classifier is used to classify subclasses and so on.
    • For each class, the ML model typically needs at least ~100 samples to learn to classify that class.
  • For multilabel classification problems, we either build a binary classifier for each label or use one model for all labels
    • It is more challenging to build one model for all labels because we then need to figure out how to turn the raw probabilities into predictions
  • If we have multiple objectives, it is better to decouple the objectives and train a different model for each one. Then get the final score for each prediction by combining the models’ outputs with a different weight per objective (which can be tuned based on business priorities). This way we can change one model without having to retrain the others (see the sketch after this list).
  • Data-centric approach to ML model development tends to lead to the best results as compared to algorithm-centric approach. Data is the critical piece in obtaining good performance if we have decent architecture.
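
For the decoupled-objectives point above, here is a minimal sketch of combining per-objective scores with tunable weights. The model names, the weights, and the assumption of scikit-learn-style `predict_proba` models are illustrative, not part of the original notes.

```python
# Combine per-objective model scores with business-tunable weights.
# Assumes each model exposes a scikit-learn-style predict_proba.

def combine_scores(features, models, weights):
    """Weighted sum of per-objective scores for a single example."""
    return sum(
        w * m.predict_proba([features])[0, 1]  # probability of the positive class
        for m, w in zip(models, weights)
    )

# Hypothetical usage: a quality model and an engagement model, weighted 70/30.
# final_score = combine_scores(x, [quality_model, engagement_model], [0.7, 0.3])
```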

Types of Data

  • First-party data: Data collected by companies about their customers
  • Second-party data: Data collected by another company about their customers
  • Third-party data: Data collected on the general public, who aren’t the company’s customers
  • User-generated data is the messiest and requires heavy cleanup

Data Passing Modes

  • Databases. Not recommended for applications with strong latency requirements
  • Services using requests such as REST or RPC. Recommended for applications that rely more on logic than on data. This is called a request-driven/microservice architecture, and it is the most common (a minimal service sketch follows this list)
  • Real-time transport such as Kafka. Recommended for applications that are data-heavy. It is called event-driven architecture.
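
To make the request-driven option concrete, here is a minimal sketch of a prediction microservice using FastAPI. The model file name, the feature names, and the endpoint path are assumptions for illustration only.

```python
# Minimal request-driven (microservice) prediction endpoint using FastAPI.
# The serialized model file and the feature names are hypothetical.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.joblib")  # assumed: a pre-trained scikit-learn-style model


class PredictionRequest(BaseModel):
    age: float
    income: float


@app.post("/predict")
def predict(req: PredictionRequest):
    proba = model.predict_proba([[req.age, req.income]])[0, 1]
    return {"probability": float(proba)}
```

It could be served with something like `uvicorn main:app`, assuming the file is saved as main.py.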

Feature Types

  • Batch features (also called static features) are features that aren’t expected to change frequently, which are computed using batch processing.
  • Streaming features (also called dynamic features) are features that change quickly and are computed using stream processing.

Sampling Methods

  • Nonprobability sampling such as selecting the most recent data
  • Random sampling
  • Stratified random sampling
  • Weighted sampling
  • Reservoir sampling (a sketch follows this list)
  • Importance sampling
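
Reservoir sampling is worth a short sketch since it keeps a uniform random sample from a stream whose length we don’t know in advance. A minimal version (Algorithm R):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # replace an element with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 5))
```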

Model’s Feedback

  • Extracting labels from feedback: User feedback goes through different stages where each stage has different volume, strength of signal, and feedback loop length
  • There is a trade-off between label accuracy and the length of the feedback loop window. The shorter the window, the less accurate the labels; but the longer the window, the longer it takes to notice and fix model problems.

Data Labeling

  • Weak supervision: Have no labeled data and use labeling functions to label the data (e.g., Snorkel)
  • Semi-supervision: Have limited labeled data that is used to generate more labels through different methods such as perturbation
  • Transfer learning
  • Active learning: Selectively pick the samples to train the model on, such as the samples the model is most uncertain about or the ones it got confidently wrong (a selection sketch follows this list)
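
A minimal sketch of uncertainty-based selection for active learning, assuming a scikit-learn-style classifier with `predict_proba` and a pool of unlabeled examples (both assumptions for illustration):

```python
import numpy as np

def select_most_uncertain(model, X_unlabeled, n=100):
    """Return indices of the n unlabeled samples the model is least confident about."""
    proba = model.predict_proba(X_unlabeled)   # shape: (n_samples, n_classes)
    confidence = proba.max(axis=1)             # confidence = top-class probability
    return np.argsort(confidence)[:n]          # lowest-confidence samples first
```

These are the samples we would send for labeling next.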

Class imbalance

There are different degrees of class imbalance.

  • It affects learning algorithms in many ways:
    • The algorithm doesn’t get sufficient signal from the rare classes
    • The algorithm may learn the simple heuristic of always outputting the majority class to optimize the overall metric
    • Rare classes typically have asymmetric error costs, e.g., missing a cancerous X-ray is far costlier than a false alarm. Therefore, we might need to build a model that is more accurate on the rare classes even at the expense of the majority class(es)
  • The more complex the problem, the more sensitive the algorithm is to class imbalance. If the classes are linearly separable, class imbalance has no effect.
  • Solutions:
    • Data resampling methods:
      • Undersampling
      • Oversampling
      • SMOTE: an oversampling technique that synthesizes new minority-class samples by interpolating between existing ones
      • Tomek links: an undersampling technique where pairs of similar samples from opposite classes are found and the one from the majority class is dropped
      • Dynamic sampling: Undersample the majority class so that all classes have the same number of samples and train the model on the resampled data. Then fine-tune the model on the original imbalanced data
    • Algorithm methods:
      • Define a cost-sensitive matrix that is used in the loss function
      • Class-balanced loss, where each sample’s misclassification weight is inversely proportional to the number of samples in its class (see the sketch after this list)
      • Focal loss: Focus learning on the samples the model has difficulty classifying, i.e., those with a low probability of being right
      • Ensembles sometimes help with class imbalance
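
A minimal sketch of the class-balanced (inverse-frequency) weighting idea using scikit-learn; the synthetic dataset is only there to make the example runnable:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data (roughly 95% / 5%) just to make the sketch runnable.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Inverse-frequency weights: the rare class gets a larger misclassification cost.
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))

# Most scikit-learn classifiers accept the same idea directly via class_weight.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```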

Data Augmentation

  • Label-preserving transformations such as randomly flipping images or replacing a word with one of its synonyms in NLP tasks
  • Perturbation that adds noise to the data but still preserves the label. BERT uses such a technique: 15% of the tokens are chosen at random, and 10% of those chosen tokens are replaced by random tokens.
  • Data synthesis: create new data from existing data, such as mixup or using templates in NLP (a mixup sketch follows)
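
A minimal sketch of mixup for a pair of examples; it assumes the labels are one-hot (or soft) vectors so they can be mixed as well:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Synthesize an example as a convex combination of two examples and their labels."""
    lam = np.random.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```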

Feature Engineering

Having the right features gives us the biggest performance boost compared to algorithmic techniques and hyperparameter tuning. It requires domain knowledge (subject-matter expertise) as well as algorithm-specific knowledge.

  • Missing values. Be careful that not all missing values are NaN. Some systems use predefined values such as empty string or 1999 for missing values.
    • Reasons for missing values:
      • Missing not at random: When the value is missing because of the (unobserved) value itself, such as when people with high income tend not to disclose their income.
      • Missing at random: When the value is missing because of another observed variable, such as when a certain type of customer doesn’t disclose their age.
      • Missing completely at random: When there is no pattern for missing values such as customer forgot to insert the value. This is very rare in practice.
    • Handling missing values:
      • Deletion:
        • Columns deletion. Not recommended for almost all cases.
        • Row deletion.
          • It might be okay if the percentage of rows with missing values is very small, especially for large datasets, and the values are missing completely at random
          • It is very risky to delete such rows if the values are missing not at random or missing at random, because the missingness itself carries information
      • Imputation. The most common is median/mode imputation. If filling with a constant, make sure it is not a possible value for the feature (e.g., -999)
      • It is almost always a good idea to add an indicator for each feature that indicates whether a value is missing
  • Scaling. Most classical ML algorithms such as logistic regression, as well as neural networks, require that features are scaled; tree-based models such as gradient-boosted trees are largely insensitive to feature scale.
    • We can scale features to the range [0, 1] or [-1, 1] if we don’t want to make any assumptions about the distribution of the data.
    • Standardization (zero mean, unit variance) implicitly assumes the feature is roughly normally distributed. Be careful that the statistics computed from the training data may be out of date at inference time if the feature’s distribution changes dramatically. Therefore, we need to retrain the model frequently to update those statistics.
    • Other transformation: log(1 + x) or sqrt(x)
  • Discretization/Binning. Convert the feature into a small number of buckets (bins), which makes the feature categorical. It is useful if values within a bucket are not that different in terms of the feature itself or the customer’s behavior. In practice it rarely proves useful, and picking the number of bins or the bin boundaries is challenging. Quantiles are commonly used for bin boundaries.
  • Encoding categorical features. Some features have fixed categories such as marital status while others have categories that change all the time such as product names.
    • It is risky to encode infrequent/unknown categories to one category.
    • Hashing is pretty good, especially if the hashing space is big enough that the collision rate is low. With hashing, we don’t have to worry about unknown categories in production (a hashing sketch follows this list).
  • Feature crossing. It helps models learn nonlinear relationships between features, especially models that aren’t good at learning them, such as logistic regression and tree-based models. The caveat is that the feature space of the crossed feature blows up and requires more data to cover all the possible values. For example, if two features have 10 unique categories each, the crossed feature would have 10 x 10 = 100 possible values.
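
A minimal sketch of the hashing trick for categorical encoding. It uses a stable hash such as MD5 (rather than Python’s built-in, per-process-salted hash) so the same category always maps to the same bucket across runs; the bucket count is an arbitrary assumption:

```python
import hashlib

def hash_bucket(category: str, num_buckets: int = 2**18) -> int:
    """Map a (possibly never-seen-before) category string to a fixed-size index space."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Known and brand-new categories both get a valid index; collisions are possible but
# rare when num_buckets is large relative to the number of distinct categories.
print(hash_bucket("product_123"), hash_bucket("brand_new_product"))
```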

Data leakage

It refers to the case where some form of the label leaks into the features used during training, in a form that is not available during inference. Common causes of data leakage:

  • Splitting time-correlated data randomly. This is due to the fact that trends are time-correlated and the model would have access to future things it wouldn’t have otherwise.
  • Scaling before splitting
  • Filling missing values before splitting
  • Poor handling of data duplication before splitting. Check for duplicates or near-duplicates before/after splitting
  • Leakage from the data generation process. This is the hardest cause of data leakage to uncover because it requires detailed knowledge about the process, so check with an SME.
  • Group leakage: a group of examples has correlated labels but is spread across different splits, such as images of the same person ending up in both training and test
  • Feature importance is very useful for pruning useless features as well as for checking for data leakage. It is also helpful for interpreting the model.
    • Be careful with features that have suspiciously high importance; they may indicate leakage
    • It is better to remove useless features even if the model can ignore them
  • Feature generalization:
    • Feature coverage (the % of samples that have a value for the feature). Check whether the coverage is similar between splits. Also, if the coverage is low, check whether the feature adds value to the model; otherwise, delete it.
    • Distribution of values between data splits. Aim for close to 100% overlap of values for all features between the splits.
      • Build a model to predict whether a row belongs to the train or the valid split, using the same features as the main model. If this new model is good enough, check its most important features: those are the features whose distributions differ between the train and valid splits (a sketch follows this list).
      • There is a trade-off between generalization and specificity. IS_RUSH_HOUR is more generalizable than HOUR_OF_DAY.
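
A minimal sketch of that train-vs-valid check (often called adversarial validation), assuming X_train and X_valid are numeric feature matrices from the two splits; the model choice and fold count are arbitrary assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation(X_train, X_valid):
    """Train a classifier to tell the splits apart; AUC near 0.5 means similar distributions."""
    X = np.vstack([X_train, X_valid])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_valid))])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    clf.fit(X, y)
    # Features with high importance are the ones whose distributions differ most.
    return auc, clf.feature_importances_
```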

Model development

  • Baselines:
    • Random
    • Most common
    • Simple heuristics
    • Random Forest is also a good baseline
  • Evaluation methods:
    • Perturbation tests. Ideally, we don’t want the output of the model to change much if inputs are close to each other. We can add random noise to the test data to see how the model behaves and check its sensitivity.
    • Directional expectation tests. Check if changes in some features would lead to changes in the output in the expected direction; otherwise, the model might be learning the wrong thing.
    • Model calibration. Unless the model is explicitly trained to output calibrated probabilities (e.g., with a proper loss such as log loss), the probabilities it produces usually won’t be calibrated. We need to calibrate the model so that its predicted probabilities match the underlying probability distribution of the data (a calibration sketch follows this list).
    • Confidence level measurement. We can avoid showing uncertain predictions and maybe involve human in the loop or ask the user for more info. For example, predict positive class if probability >= 80% and negative class if probability <= 20%.
    • Slice-based evaluation. Evaluate the model on different slices of the data, either to make sure the model’s performance across slices is acceptable or, if we care about critical slices, to confirm our requirements on them. This also gives us an indication of how to improve the model. We can discover slices by:
      • Simple heuristics that come from domain knowledge and heavy EDA of the data
      • Error analysis
      • Slice finder algorithms
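
A minimal calibration sketch using scikit-learn: wrap an uncalibrated model (a linear SVM here, purely as an example) with Platt scaling and inspect the reliability curve. The dataset is synthetic and only there to make the sketch runnable:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic data; an SVM's decision scores are a classic example of uncalibrated outputs.
X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Wrap the model with Platt scaling ("sigmoid"); "isotonic" is the other common choice.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5).fit(X_tr, y_tr)

# Reliability curve: mean predicted probability vs observed frequency per bin.
frac_pos, mean_pred = calibration_curve(y_te, calibrated.predict_proba(X_te)[:, 1], n_bins=10)
print(list(zip(mean_pred.round(2), frac_pos.round(2))))
```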

Model Deployment

  • Online prediction: Predictions generated on demand using either streaming features OR both streaming and batch features.
    • Pros: Always adapts to new behaviors/trends. Also, we don’t have to compute predictions that aren’t needed and waste compute/storage
    • Cons: Constrained by latency
  • Batch prediction: Predictions generated periodically using batch features.
    • Pros: Not constrained by latency, can utilize vectorization/parallel computation, and allows the use of very complex models that take time to generate predictions
    • Cons: The model is not responsive to changes in customers’ behaviors/trends
  • Model compression (a quantization sketch follows this list):
    • Knowledge distillation
    • Pruning
    • Quantization
  • Cloud vs Edge:
    • Cloud is much easier to set up and very flexible, at the cost of privacy and network latency.
    • Edge avoids network latency and privacy concerns, but the device must have enough memory, compute, and battery. Also, it takes more engineering work and model compression to make the model run on devices with reasonable latency.
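
A minimal sketch of the quantization idea, independent of any particular framework: map float32 weights to int8 with a per-tensor scale and measure the round-trip error. Real libraries (e.g., PyTorch, TensorFlow Lite, ONNX Runtime) provide this out of the box; the symmetric scheme below is just an illustration:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Post-training quantization sketch: float32 weights -> int8 values plus a scale."""
    scale = np.abs(w).max() / 127.0                       # symmetric, per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())  # roughly scale / 2
```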

Causes of ML System Failures

  • Software system failures, such as 1) the server is down, 2) hardware failures, 3) dependency failures, 4) deployment failures. Addressing such failures requires more SWE skills than ML skills.
  • ML-specific failures. These are hard to detect and fix. Examples include 1) data processing/feature engineering errors, 2) wrong hyperparameter values, 3) data distribution shifts.

Data Distribution Shifts

  • Production data is usually different from training data:
    • Production data does not come from the same distribution as the training data
    • Production data is not stationary. It can change suddenly, gradually, or seasonally.
  • Sometimes data shifts are due to internal errors that have nothing to do with the data itself, such as bugs in the data pipeline, missing values filled in incorrectly, inconsistencies between training and inference features, the wrong model version, etc.
  • Edge cases are inputs on which the model performs badly. ML systems need to take such edge/corner cases into account and decide what to do when the model faces them in production.
  • A degenerate feedback loop is when the model’s output affects user behavior, which affects the input to the ML model, which in turn affects its output. This is why popular recommendations get more popular: the model recommends them first, users are then more likely to click on them than on the rest, and that feedback makes the model recommend them even more strongly.
    • We can use different metrics to detect degenerate feedback loops, such as hit rate against popularity and the accuracy of the model for item groups with different popularities.
    • Correcting a degenerate feedback loop can be done either through randomization of the recommendations, until we gather enough feedback on an item to reflect its true quality, or by accounting for the position at which each recommendation was shown.
  • Types of data shifts:
    • Covariate shift: When \(P(X)\) changes but \(P(Y|X)\) remains the same. For example, the age distribution changes but the probability of success given a certain age remains the same. It can happen during training for many reasons, such as over/under-sampling and selection bias. It can also happen in production, e.g., running campaigns that attract a certain group of users.
    • Label shift: When \(P(Y)\) changes but \(P(X|Y)\) remains the same.
    • Concept shift: When \(P(Y|X)\) changes but \(P(X)\) remains the same. This typically happens with seasonal/cyclic trends.
    • Feature change, such as the set of possible values for a feature changing or the unit of measurement changing, e.g., age is now in months instead of years.
    • Label schema change, such as having more classes, or the range of possible values changing in the case of regression.
  • We can monitor \(P(X)\), \(P(Y)\), \(P(Y|X)\), and \(P(X|Y)\). We can monitor spatially and temporally (see the drift-detection sketch after this list).
    • Temporal monitoring is more complicated and requires monitoring at different granularity using different time windows as well as using sliding statistics and cumulative statistics.
  • Retraining: Either retrain from scratch or continue training from the last checkpoint. We also need to decide which data to use for retraining: the last month, old data plus new data, or only the data that started to drift. We need to run experiments to figure out which strategy works best for a given use case.
  • Monitoring:
    • Monitor operations-related metrics such as uptime
    • Monitor ML-specific metrics such as accuracy-related metrics, prediction feedback, prediction and feature distributions, and feature validation (using libraries such as Great Expectations).
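
A minimal sketch of monitoring \(P(X)\) for one numeric feature with a two-sample Kolmogorov-Smirnov test; the significance threshold and the simulated shift are assumptions for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(train_col: np.ndarray, prod_col: np.ndarray, alpha: float = 0.01) -> bool:
    """Two-sample KS test on one numeric feature; a small p-value suggests a shift in P(X)."""
    _, p_value = ks_2samp(train_col, prod_col)
    return p_value < alpha

# Simulate a covariate shift in a single feature (mean moves from 0 to 0.3).
rng = np.random.default_rng(0)
print(detect_drift(rng.normal(0.0, 1.0, 10_000), rng.normal(0.3, 1.0, 10_000)))  # True
```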

Continual Learning

How and how frequently our deployed model gets (re)trained.

  • Stateless retraining means training the model from scratch on every run using some predefined historical data window.
  • Stateful training means continuing to train the existing model on new data to keep it up to date. It requires much less compute and is less complicated. It only works if the architecture and the features stay the same; if either changes, we may need to retrain from scratch (see the sketch after this list).
    • The best use cases for this training strategy are the ones that collect natural labels from users after making predictions; otherwise, it will be bottlenecked by the availability of labeled data.
    • The evaluation period also affects how fast we can deploy new models especially if we are dealing with imbalanced datasets.
    • The main drawback of this approach is that the new model is susceptible to attacks that would result in the model outputting catastrophic predictions.
    • A model update may be triggered based on time, performance, volume, or data drifts.
    • Experiment with the importance of data freshness by training the model on datasets from different points in the past to predict the current month, and see how performance deteriorates as the training data gets older.
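
A minimal sketch of stateful updates using scikit-learn’s partial_fit, which requires a model that supports incremental learning (SGDClassifier here); the daily batches and the drift are simulated assumptions:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Stateless retraining would refit from scratch on a historical window each run.
# Stateful training keeps the fitted model and updates it with only the new batch.
model = SGDClassifier(loss="log_loss", random_state=0)   # "log" in older scikit-learn

rng = np.random.default_rng(0)
classes = np.array([0, 1])
for day in range(7):                                      # pretend each loop is a daily batch
    X_new = rng.normal(size=(1000, 20))
    y_new = (X_new[:, 0] + 0.1 * day > 0).astype(int)     # labels drift slowly over time
    model.partial_fit(X_new, y_new, classes=classes)      # continues from the current weights
```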

Testing in Production

  • Test on holdout data split
  • Test on the most recent data (backtest) such as last day
  • Shadow deployment. The main disadvantage is the compute cost, because we now make two predictions for each data point.
    • Deploy the candidate model in parallel with the existing model.
    • For each incoming request, route it to both models to make predictions, but only serve the existing model’s prediction to the user.
    • Log the predictions from the new model for analysis purposes.
  • A/B Testing. We just need to figure out the sample size required to achieve the desired statistical significance. It is the most common approach in the industry.
    • Deploy the candidate model alongside the existing model.
    • A percentage of traffic is routed to the new model for predictions; the rest is routed to the existing model. It’s common for both variants to serve prediction traffic at the same time. However, there are cases where one model’s predictions might affect another model’s predictions, e.g., in ridesharing’s dynamic pricing, a model’s predicted prices might influence the number of available drivers and riders, which in turn influences the other model’s predictions. In those cases, you might have to run the variants alternately, e.g., serve model A one day and model B the next.
    • Monitor and analyze the predictions and user feedback, if any, from both models to determine whether the difference in the two models’ performance is statistically significant
  • Canary release. Reduces the risk of introducing the model by gradually rolling it out to the users and monitor the performance of the new (candidate) model.
    • Deploy the candidate model alongside the existing model. The candidate model is called the canary.
    • A portion of the traffic is routed to the candidate model.
    • If its performance is satisfactory, increase the traffic to the candidate model. If not, abort the canary and route all the traffic back to the existing model.
    • Stop when either the canary serves all the traffic (the candidate model has replaced the existing model) or when the canary is aborted.
  • Interleaving experiments. We randomly mix predictions/recommendations from different models for the same user and see how the user interacts with them. It differs from A/B testing in that we don’t split the user base so that each user only receives recommendations from one model; instead, the same user receives recommendations from multiple models.
  • Bandit algorithms (least adopted). Much more efficient than A/B testing in that they require much less data to determine which model is better. They are stateful (as opposed to A/B testing, which is stateless) and route traffic based on an exploration/exploitation criterion (a sketch follows).
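
A minimal epsilon-greedy sketch of bandit-style traffic routing between deployed models; the model names, the epsilon value, and the reward signal (e.g., a click) are illustrative assumptions:

```python
import random

class EpsilonGreedyRouter:
    """Route each request to a model: mostly exploit the best one so far, sometimes explore."""

    def __init__(self, model_names, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {m: 0 for m in model_names}     # requests served per model
        self.rewards = {m: 0.0 for m in model_names}  # cumulative feedback per model

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))   # explore a random model
        # Exploit: pick the model with the best average reward so far.
        return max(self.counts, key=lambda m: self.rewards[m] / max(self.counts[m], 1))

    def update(self, model_name, reward):
        self.counts[model_name] += 1
        self.rewards[model_name] += reward

# Hypothetical usage:
# router = EpsilonGreedyRouter(["model_a", "model_b"])
# chosen = router.choose(); ...serve that model's prediction...; router.update(chosen, clicked)
```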