5 lines of code with PyCaret & XGBoost that got me into the top 8% of a data science hackathon.
I recently participated in a hackathon organised by Analytics Vidhya, a data science platform based in India that is similar to Kaggle. At the time, I was preparing for my Azure certification and had limited room in my schedule. Still, I wanted to participate to get a flavour of hackathon competitions.
To my surprise, with this combination of PyCaret and XGBoost, I managed to get into the top 8%. This article is about how I implemented this combination in five simple steps.
This article should be useful if you are a data scientist exploring the world of data science competitions. It should also be useful if you want to participate in a competition and automate the process of trying different machine learning algorithms on your data.
I will not introduce PyCaret or go into XGBoost specifics here. Links to the documentation are in the reference section if you want to have a look at these libraries. I have also attached my GitHub link, which has the detailed notebook that I submitted for the Analytics Vidhya hackathon, in case you want to take a look at the problem I was solving.
Step 1: Create a Python virtual environment and install the PyCaret and XGBoost libraries.
Creating a new Python environment for every new project is a best practice that can save a lot of time for a data scientist who is juggling different projects. If you want to know how to create a Python virtual environment that keeps each project's dependencies separate, check my previous blog post listed below:
Python Virtual Environment for Data Scientist in 3 steps.
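The command pip install pycaret[full] installs the PyCaret library along with its optional dependencies, and the command pip install xgboost installs the XGBoost library. Run both inside the Python virtual environment. Some shells, such as zsh, need the first package name quoted: pip install "pycaret[full]".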
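To confirm that both installs landed in the active virtual environment, a small check like the one below can help. It is only a sketch: the helper name installed_version is my own, and it relies on nothing beyond Python's standard importlib.metadata module.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report what the current environment can see.
for name in ("pycaret", "xgboost"):
    print(name, installed_version(name) or "not installed")
```

If either line prints "not installed", the virtual environment is probably not the one that pip installed into.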
Step 2: Import the data and initialise the PyCaret and XGBoost libraries.
The next step is to import the libraries that will be used in the model-building process and then load the data. The command import xgboost imports the XGBoost library.
The command from pycaret.classification import * imports everything from PyCaret's classification module. I used the classification module because I was solving a binary classification problem. For other kinds of problem, such as regression or clustering, you would import the corresponding module instead.
Next, import the data with pandas. The command import pandas as pd imports the pandas library, and the command df_train = pd.read_csv('train.csv') loads the training dataset into a DataFrame.
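The loading step can be sketched as follows. So that the snippet runs on its own, a tiny made-up CSV stands in for train.csv; the column names ID, Age, Gender and Target are placeholders of mine, not the real hackathon columns. The xgboost and PyCaret imports described above appear only as comments so that the example needs nothing but pandas.

```python
import io

import pandas as pd

# In the real project these two imports sit alongside pandas:
#   import xgboost
#   from pycaret.classification import *

# In the article the data comes from: df_train = pd.read_csv('train.csv')
# A small in-memory CSV (hypothetical columns and values) stands in for it here.
sample_csv = io.StringIO(
    "ID,Age,Gender,Target\n"
    "a1,34,Male,0\n"
    "a2,51,Female,1\n"
    "a3,28,Female,0\n"
)
df_train = pd.read_csv(sample_csv)

# Quick sanity checks before handing the data to PyCaret.
print(df_train.shape)
print(df_train["Target"].value_counts())
```

Checking the shape and the class balance of the target at this point is a cheap way to catch a wrong file or a badly parsed column early.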
Step 3: Setting up Environment in PyCaret
We must set up the PyCaret environment before we begin any machine learning experiment. The setup function initialises the environment. Its data parameter takes your dataset, which here is the training dataset. ignore_features lists the columns that the model should ignore during training and prediction. categorical_features and numeric_features list the categorical and numerical columns respectively.
Beyond the parameters used here, setup accepts many others that can be chosen based on the requirements of the problem at hand. The PyCaret documentation explains them in detail; the link is in the reference section.
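A setup call along the lines described in this step might look like the sketch below. The column names (Target, ID, Gender, Age) and the seed value are placeholders I chose for illustration, not the values from my notebook. The PyCaret import sits inside the function only so that the snippet can be pasted and read without PyCaret being loaded straight away.

```python
# Hypothetical column names; replace them with the ones in your own data.
SETUP_OPTIONS = {
    "target": "Target",                  # the column to predict
    "ignore_features": ["ID"],           # dropped for training and prediction
    "categorical_features": ["Gender"],  # treated as categorical
    "numeric_features": ["Age"],         # treated as numerical
    "session_id": 42,                    # fixed seed so runs are repeatable
}


def initialise_pycaret(df_train, options=SETUP_OPTIONS):
    """Initialise the PyCaret classification environment on the training data."""
    from pycaret.classification import setup

    return setup(data=df_train, **options)
```

With the training data loaded into df_train, the whole step is then the single call initialise_pycaret(df_train).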
Step 4: Train and tune the XGBoost model.
It takes only one argument to create the model: create_model builds the model, and 'xgboost' is the model ID, passed as a string. Similarly, it takes one argument to tune the model: tune_model performs hyperparameter tuning on the model you pass to it.
You can also try the other models that PyCaret has to offer and compare their scores. Check the PyCaret documentation for further details.
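As a rough sketch, the two calls look like this. The wrapper function and its name are my own additions so that the snippet stands alone; it assumes setup has already been run in the same session, and it leaves every option of create_model and tune_model at its default.

```python
def train_and_tune_xgboost():
    """Train an XGBoost classifier in the active PyCaret session, then tune it."""
    from pycaret.classification import create_model, tune_model

    # 'xgboost' is the model ID string that PyCaret uses for XGBoost.
    xgboost_model = create_model("xgboost")

    # Search for better hyperparameters, starting from the model created above.
    tuned_xgboost = tune_model(xgboost_model)
    return tuned_xgboost
```

Both calls print a table of cross-validated scores in a notebook, which makes it easy to see whether tuning actually improved on the untuned model.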
Step 5: Make predictions and upload the submission dataset.
For prediction, the predict_model function takes the model you want to use, here the tuned model (tuned_xgboost), and data, the test dataset that you want to make predictions on.
Most hackathons want you to submit a submission.csv file. In my case, I had to predict the probability of class 1, so I used raw_score=True, which returns the predicted probability scores for both classes. I then kept only the probability score for class 1 and made my final submission.
In the end, I was amazed to see how well the model predicted with just 5 lines of code. I ranked in the top 8% of the leaderboard in a competition in which more than 8,000 data scientists participated. To my surprise, the top rankers' scores were not much higher than mine: the difference was only 0.0090.
In conclusion, based on my experience, the combination of PyCaret and XGBoost can save a lot of time. If you are new to data science and want a flavour of data science competitions, or if you want to keep your code short and spend more time on feature engineering and feature selection, then PyCaret is the place to go. And I do not think I am wrong in saying that XGBoost is a model that can give you top scores, especially when you are dealing with tabular data.
It is also important to note that the approach I used may not win you the competition. However, I hope it encourages budding data scientists to gain confidence in participating in hackathons and to take the initial step of submitting a first model that makes decent predictions.
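The prediction and filtering steps can be sketched as below. Several things here are assumptions of mine rather than details from the competition: the ID and Target column names, the helper names, and the idea of locating the class 1 column by its suffix. As far as I can tell, the name of the score column differs between PyCaret releases (for example Score_1 versus prediction_score_1), so the helper matches on the ending instead of hard-coding one name.

```python
def class1_score_column(columns):
    """Find the class 1 probability column that raw_score=True adds."""
    matches = [c for c in columns if str(c).lower().endswith("score_1")]
    if len(matches) != 1:
        raise ValueError(f"Expected one class 1 score column, found: {matches}")
    return matches[0]


def make_submission(tuned_model, df_test, id_column="ID", path="submission.csv"):
    """Predict class probabilities on the test data and write the submission file."""
    import pandas as pd
    from pycaret.classification import predict_model

    predictions = predict_model(tuned_model, data=df_test, raw_score=True)
    score_column = class1_score_column(predictions.columns)

    # Assumes predict_model keeps the rows in the same order as df_test.
    submission = pd.DataFrame(
        {
            id_column: df_test[id_column].to_numpy(),
            "Target": predictions[score_column].to_numpy(),  # hypothetical name
        }
    )
    submission.to_csv(path, index=False)
    return submission
```

A call such as make_submission(tuned_xgboost, df_test) would then produce the file to upload, provided the competition expects an ID column and a probability column under these names.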
PyCaret documentation for binary classification: https://www.pycaret.org/tutorials/html/CLF101.html
GitHub link to the Jupyter notebook: https://github.com/javedhassans/Predicting-upselling-for-creditcard