Two words of code to compare 20 ML regression models with PyCaret

Shashank Shekher
Apr 20, 2020 · 6 min read


Machine learning is about experiments. Even after you have understood the data pretty well, it never hurts to test it with several models and analyse the results before finalizing the best model for that data. PyCaret is a shortcut, an efficient and easy one, to do exactly that. It literally needs two words to produce results from 20 different regression models. Of course, the same is also true for classification, clustering, NLP, anomaly detection and preprocessing.

PyCaret is an open-source Python library that is a wrapper over several other machine learning libraries and APIs such as scikit-learn, spaCy and XGBoost. It aims to reduce the time and effort it takes to make the data ready for analysis and to get the results of a machine learning model. Its first version was released recently, and the main contributor to the library is Moez Ali. I have used this library in one of my projects and I am very impressed, which is why I decided to write this article. I will demonstrate most of the basic features of PyCaret's regression module here.

Installing PyCaret:

For Local Jupyter Notebook:

!pip install pycaret or !pip3 install pycaret

Note: I assume you have basic knowledge of regression for the rest of the article.

Let’s Begin:

Data: Swedish agriculture-related data; you can see its columns in the dataset screenshot below.

import pandas as pd

We will need pandas to import our data file.

df = pd.read_csv('agriculture_data.csv')
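If you want a quick sanity check before handing the data to PyCaret, a couple of pandas calls are enough (an optional sketch, using the same dataframe as above):

df.shape           # number of rows and columns
df.head()          # first few rows, to eyeball the columns and their ranges
df.isnull().sum()  # count of missing values per column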

First row of the dataset I used for the demonstration of the following code

This is a row of that dataset, so that you can get an idea of the range of data in each column. Now import PyCaret's regression module as shown below:

from pycaret.regression import *

After that, use the setup() function; it needs to be called before you do anything else in PyCaret. This function initializes the environment inside PyCaret and prepares the data for modelling and deployment. You can run it as shown below:

caret_df = setup(data = df, target = 'kg_per_hectare', session_id=55, categorical_features = ['county'], ignore_features=['total_production'])

Inside the setup() function, only two parameters are mandatory: 'data', which takes the dataframe, and 'target', which takes the name of the target column.

session_id can be any number; it acts as a seed so the experiment is reproducible. If you encounter an error at this step, changing the session_id sometimes helps.

categorical_features is also optional, since PyCaret detects the datatypes itself, but if you want to explicitly declare a column as categorical, insert it into the list as shown in the code above.

ignore_features takes a list of columns that setup() should drop and exclude from all further analysis. setup() accepts many more optional parameters; see the sketch below.
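For illustration, here is a slightly richer call. normalize and train_size are real setup() options, but whether you need them depends on your data, so treat this as a sketch rather than a recipe:

caret_df = setup(data = df,
                 target = 'kg_per_hectare',
                 session_id = 55,                        # seed, for reproducibility
                 categorical_features = ['county'],      # force 'county' to be treated as categorical
                 ignore_features = ['total_production'], # drop this column entirely
                 normalize = True,                       # scale the numeric features
                 train_size = 0.8)                       # hold out 20% of the data for testing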

After running this statement, you will be presented with the features and their detected datatypes. If those are correct, press Enter in your Jupyter notebook; otherwise, type 'quit' in the empty box presented at the end of the cell.

List of columns in our dataset and their detected datatypes

For me, it was correct, so I pressed the Enter key. Thereafter, we get a further summary:

List of preprocessing steps that PyCaret performed; 'False' marks the steps it did not perform.

These are some of the preprocessing steps that PyCaret considers. It prints 'False' if a preprocessing step was not performed. Some examples are PCA, normalization, binning, one-hot encoding etc. It also tells you if there are missing values in the dataset.

The two words:

Now we will use the 'two words' mentioned in the title to compare the results of the regression models. Use the following code:

compare_models()

This function lists the regression models with their MAE, MSE, RMSE, R2, RMSLE and MAPE values, sorted in descending order of R² by default. It uses 10-fold cross-validation by default, so the results you see are the average over 10 runs with different validation sets. You can explicitly set the number of folds with the fold parameter: compare_models(fold=10)

This will give us the following results:

Results from 22 regression models

This is a great first step to compare all 22 models and decide which ones to proceed with.
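compare_models() also accepts a sort parameter, so if RMSE matters more to you than R², you can rank by it directly (a sketch; fold and sort are real parameters, the values are only examples):

compare_models(fold = 5,       # 5-fold cross-validation instead of the default 10
               sort = 'RMSE')  # rank the models by RMSE instead of R2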

Create individual model:

To look at a single model, let us say random forest, use the create_model() function. You would typically use it with a few of your top-performing models, as per your criteria. Use it as follows:

forest_reg = create_model('rf')

'rf' stands for random forest. Use the following list to call create_model() with other regressors of your choice:

Estimator                        Abbreviated String
Linear Regression                'lr'
Lasso Regression                 'lasso'
Ridge Regression                 'ridge'
Elastic Net                      'en'
Least Angle Regression           'lar'
Lasso Least Angle Regression     'llar'
Orthogonal Matching Pursuit      'omp'
Bayesian Ridge                   'br'
Automatic Relevance Determ.      'ard'
Passive Aggressive Regressor     'par'
Random Sample Consensus          'ransac'
TheilSen Regressor               'tr'
Huber Regressor                  'huber'
Kernel Ridge                     'kr'
Support Vector Machine           'svm'
K Neighbors Regressor            'knn'
Decision Tree                    'dt'
Random Forest                    'rf'
Extra Trees Regressor            'et'
AdaBoost Regressor               'ada'
Gradient Boosting                'gbr'
Multi Level Perceptron           'mlp'
Extreme Gradient Boosting        'xgboost'
Light Gradient Boosting          'lightgbm'
CatBoost Regressor               'catboost'
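For example, if the gradient boosting variants rank highly for your data, you would create them the same way (a sketch; the abbreviations come from the table above, the variable names are mine):

xgb_reg = create_model('xgboost')    # Extreme Gradient Boosting
lgbm_reg = create_model('lightgbm')  # Light Gradient Boosting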

Tune the Hyperparameter:

After we have analysed the individual models, PyCaret also helps us tune a model. I will tune the random forest, as this is the best model for my dataset, but you can refer to the above table, pick the abbreviation opposite your estimator's name, and use it in place of 'rf' in the code below.

tuned_rf = tune_model('rf')
print(tuned_rf)
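tune_model() also accepts optional parameters that control the tuning: n_iter sets how many hyperparameter combinations are tried, and optimize chooses the metric to optimize (a sketch; the values here are only examples):

tuned_rf = tune_model('rf',
                      n_iter = 50,       # try 50 hyperparameter combinations instead of the default 10
                      optimize = 'mae')  # pick the combination with the best MAE instead of R2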

Visualize your model:

Another big plus of PyCaret is that it gives us a powerful interactive output to visualize our models and look at different graphs such as residual plots, prediction error plots, feature importance etc. Use the code below for interactive visualization:

evaluate_model(tuned_rf)

Interactive visualization plot

Use the grey buttons at the top to look at different graphs. Note that not all graphs are available for all estimators; for example, feature importance is not available for the linear regression estimator. The following is the residual plot, which I got after clicking on the 'Residuals' button:

Similarly, you can check other plots as well. Personally, the feature importance plot helped me a lot to finalize my best model.
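If you prefer to generate a single plot directly rather than going through the interactive widget, plot_model() does the same thing non-interactively (a sketch; 'residuals', 'error' and 'feature' are plot names in PyCaret's regression module):

plot_model(tuned_rf, plot = 'residuals')  # residual plot
plot_model(tuned_rf, plot = 'error')      # prediction error plot
plot_model(tuned_rf, plot = 'feature')    # feature importance plot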

Look at the feature importance (SHAP):

Apart from the feature importance plot in the interactive output above, we can also look at a pictorial representation of SHAP values to further analyse feature importance. A SHAP value tells us the impact of a feature on the model output. Use this code to get the SHAP summary plot:

interpret_model(tuned_rf)

Nice! The values county_9, county_10 etc. are the dummy variables created at the setup() phase by PyCaret using one-hot encoding on the column 'county'. I have analysed other models too, but the random forest is the best because it relies on the relevant features. This plot, along with the feature importance plot, is therefore very useful for choosing the right model.
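interpret_model() accepts a plot argument as well; besides the default SHAP summary plot, you can look at a dependence-style plot or explain a single prediction (a sketch; 'correlation' and 'reason' are options of interpret_model, and the observation index is only an example):

interpret_model(tuned_rf, plot = 'correlation')              # SHAP dependence plot
interpret_model(tuned_rf, plot = 'reason', observation = 0)  # explain the prediction for one row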

Summary:

In this whole article, we never wrote more than two lines of code at once, and we already:

Got the summary from 22 estimators

Tuned the hyper-parameters

Plotted 8 different graphs in interactive mode (the full workflow is recapped below)
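For reference, here is the whole workflow from this article collected in one place (a sketch; it assumes the same dataset, column names and variable names used above):

import pandas as pd
from pycaret.regression import *

df = pd.read_csv('agriculture_data.csv')

# Initialize the PyCaret environment
caret_df = setup(data = df, target = 'kg_per_hectare', session_id = 55,
                 categorical_features = ['county'], ignore_features = ['total_production'])

compare_models()                 # compare all regression models
forest_reg = create_model('rf')  # build the best-looking one
tuned_rf = tune_model('rf')      # tune its hyperparameters
evaluate_model(tuned_rf)         # interactive plots
interpret_model(tuned_rf)        # SHAP summary plot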

There is a lot more of PyCaret to be covered; this was just the basics. I would love to present more such useful tutorials, so let me know by giving a few claps and following me on Medium.

Stay Blessed, Stay Healthy!
