Retraining Machine Learning Models, how important it is ?
Table of contents :
- Retraining Machine Learning Models
- Model Drift
- Different ways to identify model drift
- Performance Degradation
- Distribution of training and Live data
- Correlations Between Features
- Examining the Target Distributions
- Model retraining
- How often model you should retrain ?
- How can you retrain your model?
- Reference
1. Retraining Machine Learning Models
Before going into more detail about retraining approaches of machine learning models, Let’s see the basic cycles of machine learning models.
Generally machine learning models will be trained by some learning between set of input features and dependent feature or target variable. The aim of the model is to minimize the prediction error by applying or optimizing cost functions, and when we found some optimized models, we will deploy into the production and the aim is that model will generate accurate predictions on future unseen data as well so the goal is that model will predict the future unseen data as accurately as data used during the training period.
Once machine learning models has been deployed to production, we assume that future unseen data will be similar like past data through which we trained our model. Moreover our assumptions are like the distribution of the data of all the features should remain constant, but this will not be the case always. Data distribution will change over time and our model should adapt those changes.
If we take an example of house price prediction, than prices of house will not remain same all the time. Data that has been used to train model which predict the prices of house some months ago will not give great prediction today. We need up to date information to train the models.
When developing any ML model, it is important to understand how data will change over time. There is a need of good architecture system or plan for maintaining our models updated.
Sometimes to know this types of changes or monitoring the model performance, there is a concept of Model Drift. Lets see what it is.
2. Model Drift
Model drift refers to several parameters that will change eventually. There are many cases which occur like predictive performance will degrade over time and it may occur due to some environmental changes and there is a need of determining these model drifts and it can be sort out using different Model Retraining Approaches.
When we say Model Drift, of course it’s not the model that is drifting. It’s the environment that is changing around the model.
There are different approaches for identifying model drifts and model retraining approaches but there is no one standard method for all kinds of problems. There will be different methods should be apply for different kind of problems.
3. What are the different ways to identify model drift ?
3.1 PERFORMANCE DEGRADATION
To identify model drift is to explicitly determine that predictive performance has deteriorated and to quantify this decline.
Consider a financial forecasting model which will predict the next quarter’s revenue. The actual revenue won’t be observed until that quarter passes so we will not be able to see how well our model performed till that point. This will give kind of sense that at what rate model’s performance will degrade.
As Josh Wills points out, one of the most important things you can do before deploying a model is try to understand model drift in an offline environment. Data scientists should seek to answer the question “If I train a model using this set of features on data from six months ago, and I apply it to data that I generated today, how much worse is the model than the one that I created untrained off of data from a month ago and applied to today?”
3.2 Distribution of training and Live data
Since there will be a degrade in model performance due to the distribution of data which deviate from the training data, comparing these will infer the model drift as we are expecting that this will occur.
We can monitor those things via the range of values, histograms, whether they have null values or not ans so on.
3.3 CORRELATIONS BETWEEN FEATURES
There are assumptions in model building is that the relationship remain fixed. So we will monitor pairwise correlations between individual features.
As mentioned in What’s your ML Test Score? A rubric for ML production systems, you can do this by:
- monitoring correlation coefficients between features
- training models with one or two features
- training a set of models that each have one of the features removed
3.4 EXAMINING THE TARGET DISTRIBUTIONS
If distribution of dependent variable changes very frequently or significantly, model performance will deteriorate.
The authors of Machine Learning: The High-Interest Credit Card of Technical Debt state that one simple and useful diagnostic is to track the target distribution.
4. Model Retraining
Some times model retraining refer to finding new hyper parameter of an existing model architecture. Let it be clear that it is not about finding new parameters, finding new models are adding/updating features of the model. So we need to think about our problem such that we can reduce model drift on our productionize model.
Entire machine learning model building process goes through set of cycles from feature engineering to model selection to error estimation. Then the optimal model will be chosen and deploy into production.As model drift refers to degradation of model performance due to some reasons like variation of data or change in environment, model retraining should not result in a different model generating process.Rather retraining simply refers to re-running the process that generated the previously selected model on a new training set of data.The features, model algorithm, and hyperparameter search space should all remain the same. One way to think about this is that retraining doesn’t involve any code changes. It only involves changing the training data set.
Its not like should not change the parameters and add features, but if we do that then there is a need of entirely new model building process and that needs to be tested before deploying to production environment.
5. How often model should be Retrained ?
There is no standard rule that we should retrain model weekly or monthly or quarterly. It all depend on the use case and vary problem to problem.
If we are looking for student model that will predict the return of the students in next semester then there is no use to retrain model daily as there will be mode data semester wise, so we can retrain quarterly or as on new semester.
This is an example of a periodic retraining schedule. It’s often a good idea to start with this simple strategy but you’ll need to determine exactly how frequently you’ll need to retrain. Quickly changing training sets might require you to train as often as daily or weekly. Slower varying distributions might require monthly or annual retraining.
6. How can you retrain your model ?
- For retraining model in machine learning is directly related to how often we decide to retrain our model.
- If we decide to retrain our model periodically, then batch retraining is perfectly sufficient. This involves scheduling model training processes on a recurring basis using a job scheduler such as Jenkins or Kubernetes Cron Jobs.
- If you’ve automated model drift detection, then it makes sense to trigger model retraining when drift is identified.For instance, you may have periodic jobs that compare the feature distributions of live data sets to those of the training data. When a significant deviation is identified, the system can automatically schedule model retraining to automatically deploy a new model. Again this can be performed with a job scheduler like Jenkins or by using Kubernetes Jobs.
- It may make sense to utilize online learning techniques to update the model that is currently in production. This approach relies on “seeding” a new model with the model that is currently deployed. As new data arrives, the model parameters are updated with the new training data.
Apart from these, there are various approaches which we can follow like,
- Retrain model using bigmler ( BigML is a third party services which provide automatic ML model building and bigMLer is the command line tool which can help to retrain model using their APIs)
- DeltaGrad: Rapid retraining of machine learning models (This is based on research paper, references are given in the reference section)
- automated-model-retraining-with-kubeflow-pipelines (Kubernet is also providing automatic kubefloe pipelines for retraining models)
- machine-learning-automated-model-retraining-sagemaker (Amazon providing cloud services for automatic retraining and deployment of models. Implementation part of this you will find in different notebook)
- Continuous learning with Watson ML (IBM also providing watson ML cloud service for the same retraining purpose)
7. References
https://arxiv.org/pdf/2006.14755.pdf
https://medium.com/kubeflow/automated-model-retraining-with-kubeflow-pipelines-691a5f211701
https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45742.pdf
https://www.inawisdom.com/machine-learning/machine-learning-automated-model-retraining-sagemaker/