Elo Merchant Category Recommendation
--

Table of Contents:
- Introduction
- Business Problem
- Project Goal
- Data Dictionary
- Exploratory Data Analysis
- Data Preparation/Feature Engineering
- Feature Selection
- Modelling
- Result
- Submission at Kaggle
- Conclusions and Future Research
- Profile
- References
1. Introduction
Imagine being hungry in an unfamiliar part of town and getting restaurant recommendations served up, based on your personal preferences, at just the right moment. The recommendation comes with an attached discount from your credit card provider for a local place around the corner!
Right now, Elo, one of the largest payment brands in Brazil, has built partnerships with merchants in order to offer promotions or discounts to cardholders. But do these promotions work for either the consumer or the merchant? Do customers enjoy their experience? Do merchants see repeat business? Personalization is the key.
Elo has built machine learning models to understand the most important aspects and preferences in their customers’ lifecycle, from food to shopping. But so far none of them is specifically tailored for an individual or profile. This is where we come in.
In this competition, I have developed algorithms to identify and serve the most relevant opportunities to individuals, by uncovering signal in customer loyalty. My input will improve customers’ lives and help Elo reduce unwanted campaigns, to create the right experience for customers.
The case we discuss here is a real-life marketing strategy study based on a simulated data set that mimics customer behaviour in the Elo customer database.
2. Business Problem
The motivation behind the case study is to develop ML (machine learning) models that identify and prescribe the most relevant opportunities for individuals. Helping payment brands achieve this can improve their customers’ payment experience and create the right experience for them.
The problem is about understanding customer loyalty using machine learning. Using the customers’ transaction data provided and some engineered features, we have to determine a loyalty score for each card_id.
In practice, this means providing credit card users with personalized recommendations about local merchants (restaurants, shops, etc.) and then offering discounts and promotions for the recommended merchants.
3. Project Goal
1. Predict the loyalty score to improve customers’ lives and help merchants reduce unwanted campaigns.
2. Minimize the difference between the predicted and actual scores (RMSE).
3. Given a card_id and its features, predict the loyalty score (a regression problem).
4. Data Dictionary
The data is simulated and fictitious, based on real consumer data. Many of the features provided are anonymized, which makes it difficult to interpret their importance. The problem has 5 datasets/files:
train.csv: it has six columns: first_active_month, card_id, feature_1, feature_2, feature_3 and target
test.csv: the test set has the same features as the train set without targets.
historical_transactions.csv: up to 3 months’ worth of historical transactions for each card id at any of the provided merchant ids.
new_merchant_transactions.csv: two months’ worth of data for each card id containing all purchases that card id made at merchant ids that were not visited in the historical data.
merchants.csv: additional information about all merchants/merchant ids in the dataset.
1. train & test Dataset overview:

2. merchants dataset overview:

3. Historical & New Merchants Transaction Dataset overview

5. Exploratory Data Analysis
In order to analyze the problem better in the next sections, we first need to explore the datasets, which includes checking for missing values, visualizing the data distributions, etc. That way, we can get a better understanding of what the datasets look like and how we can featurize the data to make it ready for modelling.
reduce_mem_usage: we wrote this function to cut down memory usage by downcasting each column to the smallest data type that can still hold its values.
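A minimal sketch of such a downcasting helper is shown below (an illustrative version, not necessarily the exact function used in this project):
import pandas as pd

def reduce_mem_usage(df):
    # Downcast each numeric column to the smallest dtype that can hold its values.
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='float')
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print(f'Memory usage reduced from {start_mem:.2f} MB to {end_mem:.2f} MB')
    return df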
1. Exploring Train & Test Dataset

Observation:
There is one row with a missing first_active_month in the test set. We fill it in with the earliest (minimum) date among the test rows that have the same feature values.
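A minimal sketch of this fill, assuming the test set is loaded as a dataframe named test (the name is just for illustration):
missing = test['first_active_month'].isna()
same_features = ((test['feature_1'] == test.loc[missing, 'feature_1'].values[0]) &
                 (test['feature_2'] == test.loc[missing, 'feature_2'].values[0]) &
                 (test['feature_3'] == test.loc[missing, 'feature_3'].values[0]))
test.loc[missing, 'first_active_month'] = test.loc[same_features, 'first_active_month'].min()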
Target Value Distribution

Observations:
- The shape of the distribution — it’s not normally distributed. To me, it looks more like a log-ratio distribution. This suggests that the loyalty score may actually be the log of the ratio of two numbers, for example, “number of new purchases” to “number of historical purchases”.
- The outlier values then may be cases when the denominator was zero. Since the log of zero is negative infinity, a small positive number may have been added to the denominator to prevent infinite loyalty scores. A small number like 1e-14 ends up being pretty close to the outlier values when the log is taken:
In [ ]:
np.log(1e-14)
Out[ ]:
-32.23619130191664
- We can see that almost all of the loyalty scores in the train data lie between -20 and 20. We can also observe that there are outlier points in our dataset, which make up roughly 1% of the entire train set, all with the value -33.219281. Moreover, the loyalty score also seems to be a normalized column.

Exploring Feature 1, 2 & 3

Observations:
As we could see from the data, it is highly anonymized data with feature 1 values lying between 1–5, feature 2 values lying between 1–3, feature 3 values being boolean 0 or 1.
Let us see if these features in the training dataset have good predictive power for finding the loyalty score.

Are the distribution of data in both train and test set similar?

Observations:
We can see from the above plots that the test and train data are distributed similarly. The plots also show an important idea: while different categories of these features have varied counts, the distribution of the target is almost the same across them. This could mean that these features aren’t really good at predicting the target; we’ll need other features and feature engineering.
Can we clearly classify outlier and non-outlier data using the features in the trainset?

Observations:
We can see that there are some slight differences between outliers and non-outliers, but they don’t seem to be that big and they certainly can’t explain the difference between the target values, at least based on the features in the training dataset.
2. Exploring Historical Transactions


Exploring Number of Historical Transactions per Card ID

Observations:
It looks like there are a few customers (card_ids) with a very high number of transactions but most of the card_ids have less than 500 or so transactions.
Does the number of historical transactions per card have an effect on loyalty score?

Observation:
As can be seen from the above image, the number of historical transactions does have an effect on the loyalty score: more transactions tend to mean a better loyalty score.
Does the sum amount of historical transactions per card have an effect on loyalty score?

Observation:
We can infer that a higher transaction sum corresponds to a better loyalty score. The loyalty score seems to increase with the sum of historical transaction values, which is expected.
Exploring Authorized Flag Column

Does the percentage of authorized transactions have an effect on loyalty score?

Observations:
It seems that there are some cards, for which most of the transactions were declined. Were these fraud transactions?

Observations:
A clear observation from the above scatter plot is that when the percentage of authorized transactions is higher, the loyalty score tends to be higher.
Exploring Installments Column

Observations:
• It contains values from 1 to 12. The column also contains the values -1 and 999, which might have been used to fill in missing data. However, on further exploration, we found that 999 could indicate fraudulent transactions, considering only 3% of these transactions were approved/authorized.
• Another interesting thing is that the higher the number of installments, the lower the approval/authorization rate.
• It is clearly evident from the above plot that when the installments are higher (either mean or sum), the loyalty score tends to decrease.
Exploring Purchase Date Column
The purchase_date column contains when the purchase occurred. This will likely be an important feature for our predictive model. Card owners who use their cards regularly (or whose use is increasing) are probably more likely to have a higher loyalty score.
Also, in general, our model will need to know when each transaction occurred! Let’s take a look at how many transactions occurred as a function of the date:
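A minimal sketch of this plot, assuming the historical transactions are loaded as a dataframe named hist_trans (the name is illustrative):
import pandas as pd
import matplotlib.pyplot as plt

hist_trans['purchase_date'] = pd.to_datetime(hist_trans['purchase_date'])
daily_counts = hist_trans.groupby(hist_trans['purchase_date'].dt.date).size()
daily_counts.plot(figsize=(12, 4), title='Historical transactions per day')
plt.show()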

3. Exploring New Merchants Transactions
In this section, let us look at the new merchant transactions data and do some analysis. For the most part, the new merchant transactions and historical transactions datasets show the same kind of behaviour.

Exploring Number of New Merchants Transactions per Card ID
Does the number of new merchant transactions per card have an effect on loyalty score?

Observations:
Loyalty scores seem to decrease as the number of new merchant transactions increases except for the last bin.
Does the sum amount of new merchant transactions per card have an effect on loyalty score?

Observations:
Loyalty scores seem to increase with the sum of new merchant transaction values, except for the last bin.
Exploring Authorized Flag Column

Observations:
All transactions in this table are authorized, so we can safely drop this column.
Exploring Month Lag Column

Observations:
All purchases here were made within 2 months after the reference date.
Exploring Purchase Date Column

Observations:
The new transactions don’t start (for most cards) until mid-March, and the number of transactions is far smaller than in the historical data. The second interesting thing to notice is that there’s a pretty significant weekly cycle (creating the high-frequency ups and downs in the plots above).
Exploring Number of transactions as a function of day of the week:

Observations:
Looking at the number of transactions as a function of day of the week, we can see that the number of transactions ramps up over the course of the week, plummets on Sunday, and then starts climbing again on Monday.
4. Exploring Merchants Dataset
The merchants dataset has information about every merchant that any credit card account made a transaction with.

- The merchant_group_id column contains what group the merchant belongs to. Presumably, this corresponds to a business group (e.g. “Walmart”, and individual merchants are individual stores), and not some sort of business sector identifier.
- The merchants dataset also contains two anonymized features: numerical_1 and numerical_2. The two distributions are very similar and both are very skewed.
- The avg_{sales, purchases}_lag{3,6,12} columns store normalized sales and purchases in the past 3, 6, and 12 months.
Exploring most_recent_sales_range and most_recent_purchase_range
The most_recent_sales_range and most_recent_purchase_range contain what bin the merchant falls into in terms of recent sale amounts and purchase amounts, respectively. There are a lot more merchants with higher category values here, suggesting the lower the category value, the higher the amount (assuming the merchant-sales relationship is a log-like relationship, which is common).

Observations:
Most of the merchants have most_recent_sales_range and most_recent_purchase_range equal to E. Not surprisingly, the two categorical columns are correlated: a merchant’s profits likely correspond at least somewhat closely to their expenses.
5. Mismatch between Merchants and Transactions Data
There are some similar columns in the transactions datasets and the merchants dataset. On exploration, it appears that there is a pretty sizeable mismatch between the merchant-specific values in the transactions data and in the merchants data. This is probably because the merchant’s properties may have changed between the time of the transaction and the time the merchants dataset was compiled. Therefore, we’ll use the values in the transactions tables when creating features for our predictive model, because they are more likely to reflect the merchant at the time that matters, i.e. when the transaction occurred.

6. Data Preparation/Feature Engineering
In order to predict more accurately than the baseline, we need to find features that are strongly correlated with the target value. But all of the existing features are weakly correlated with the target. Thus, we must engineer new features.
There’s not a whole lot of information in the cards table as is, so we had to engineer features of each card using the information in the other three tables. The idea is that information about how often an individual is making transactions, when, for how much, with what merchants, with what types of merchants — and so on — will be informative as to how likely that individual is to have a high loyalty score.
Data Preparation:
In the merchants dataset, we label encoded all of the categorical variables, such as category_4, most_recent_sales_range and most_recent_purchases_range. We also replaced all infinity values with NaN, as they didn’t carry any significant information. We also found duplicate merchant ids in the merchants dataset; to remove the duplicates, we grouped by merchant_id and aggregated numerical columns by their mean and categorical columns by their mode.
For the train and test datasets, no such preprocessing was required. For the historical and new merchant transactions datasets, we first converted all infinity values to NaN, as they are of no significance. We then label encoded categorical variables, for example mapping the authorized_flag column to 0/1. We imputed missing values in categorical columns with a new category and replaced the outlier codes in the installments column with NaN. We also converted the first_active_month column (a datetime) to a month. At this point, all of our data is in a numerical format.
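A minimal preprocessing sketch for the transactions data, assuming the transactions are loaded as a dataframe named trans; the column names follow the competition data, while the specific imputation choices are illustrative:
import numpy as np
import pandas as pd

trans = trans.replace([np.inf, -np.inf], np.nan)                           # drop infinities
trans['authorized_flag'] = trans['authorized_flag'].map({'Y': 1, 'N': 0})
trans['category_1'] = trans['category_1'].map({'Y': 1, 'N': 0})
trans['category_2'] = trans['category_2'].fillna(6)                        # new category for missing values
trans['category_3'] = trans['category_3'].map({'A': 0, 'B': 1, 'C': 2}).fillna(3)
trans['installments'] = trans['installments'].replace([-1, 999], np.nan)   # outlier codes
trans['purchase_date'] = pd.to_datetime(trans['purchase_date'])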
After all the preprocessing was done, we created features using the data we have. We merged merchants dataset with both historical transactions and new merchants transaction data before engineering new features.
Features Engineered
1. Time-based features about the card and purchases.
- For example, the hour of the day the purchase was made, the day of the week, the week of the year, the month, quarter, whether the purchase was made on a weekend, the time of the purchase relative to when the card owner was first active, etc.
- Month, year, day etc. for the first active month of card.
- Elapsed Time: the difference between the date of a card’s first active month and reference date.
2. Feature Aggregation / Statistical Features (grouped by card_id; a minimal aggregation sketch is shown after this list)
- Entropy: A feature corresponding to entropy could be informative — for example, it could be that card accounts with high entropy over the merchants they use their card with are more likely to be more loyal card users than those who only use their card with a single merchant (and therefore have low entropy).
- Mean Difference between consecutive items in a series: This could conceivably be a good predictor of how likely an individual is to be a loyal card user: individuals who use their cards regularly and frequently are probably more likely to be loyal.
- The period of a sequence of transactions: It could also be a useful feature. For example, customers who have been making purchases over a long period of time (the difference between the date of their first and last purchases is large) may be more likely to be loyal card users.
- Mode
- Count, Sum, Mean, Number of Unique values
- Min, Max, Std, Skew
3. Other Features Engineered
- Outlier Feature(Binary)
- Response/Target Encoding for some categorical features
- Mean Encoded Feature using target column and categorical feature 1,2,3 for trainset
- Difference between max and min purchase date of the card, first buy and last buy of the card.
- Total number and amount of historical and new transactions.
- Price (purchase_amount/installments)
4. Some features were also created by the aggregation of similar columns from both historical transactions and new merchant transactions data such as purchase_amount_total, purchase_amount_max, purchase_amount_min etc.
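A minimal sketch of the per-card aggregations, assuming the merged transactions are in a dataframe named trans; the entropy helper and the particular aggregations shown are illustrative, not the full feature set:
import pandas as pd
from scipy.stats import entropy

def merchant_entropy(s):
    # entropy of how a card's transactions are spread across merchants
    return entropy(s.value_counts(normalize=True))

agg = trans.groupby('card_id').agg(
    purchase_amount_sum=('purchase_amount', 'sum'),
    purchase_amount_mean=('purchase_amount', 'mean'),
    purchase_amount_max=('purchase_amount', 'max'),
    purchase_amount_min=('purchase_amount', 'min'),
    purchase_amount_std=('purchase_amount', 'std'),
    n_transactions=('purchase_amount', 'count'),
    n_merchants=('merchant_id', 'nunique'),
    merchant_entropy=('merchant_id', merchant_entropy),
    first_buy=('purchase_date', 'min'),
    last_buy=('purchase_date', 'max'),
).reset_index()
agg['purchase_period_days'] = (agg['last_buy'] - agg['first_buy']).dt.days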
The same transformations were applied to just about every feature in every dataset wherever applicable, which creates many combinations of features. Finally, after all the features were engineered, everything was merged into a single data frame/table that can be fed into a model for training. We now have one giant table, where each row corresponds to a card account for which we want to predict loyalty, and each column corresponds to a feature of that account. All of these aggregations leave us with a very large number of features, so we turn to feature selection techniques in the next step to avoid overfitting.
7. Feature Selection
Unfortunately, what with all the aggregations we performed, we now have well over 500 features! The more superfluous features we give our model to train on, the more likely it is to overfit! To prune out features which could confuse our predictive model, we’ll perform some feature selection.
Ideally, we’d fit our model a bunch of different times, using every possible different combination of features, and use the set of features which gives the best cross-validated results. But, there are a few problems with that approach. First, that would lead to overfitting to the training data. Second, and perhaps even more importantly, it would take forever.
There are a bunch of different ways to perform feature selection in a less exhaustive, but more expedient manner. Forward selection, backward selection, selecting features based on their correlation with the target variable, and “embedded” methods such as Lasso regressions (where the model itself performs feature selection during training) are all options. However, here we’ll use two different methods: the mutual information between each feature and the target variable, and the permutation-based feature importance of each feature.
At first, after doing a bunch of aggregations on the dataset, we checked for non-informative columns, that is, columns which are all NaN, contain only one unique value, etc. We removed such non-informative columns from our dataset. Next, after this step and after engineering the various features, some of the features still had NaN and infinity values. After appropriate exploration, we replaced them with the mean/median value of the feature.
Mutual Information
Mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. The mutual information represents the amount of information that can be gained about one variable by knowing the value of some other variable. This is very relevant to the task of feature selection: we want to choose features which knowing the value of will give us as much information as possible about the target variable.
The nice thing about using mutual information is that it is sensitive to nonlinear relationships. We’ll be using nonlinear predictive models (like gradient boosted decision trees), and so we don’t want to limit the features we select to be only ones which have a linear relationship to the target variable.
We’ll use the mutual information of the quantile-transformed aggregation scores (just so outliers don’t mess up the mutual information calculation). We then select the best 300 features from this to train our model on.
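A minimal sketch of this selection step, assuming the feature matrix X (dataframe) and the target y (series) are available; the quantile-transform settings are illustrative:
import pandas as pd
from sklearn.preprocessing import QuantileTransformer
from sklearn.feature_selection import mutual_info_regression

qt = QuantileTransformer(output_distribution='normal', random_state=42)
X_q = pd.DataFrame(qt.fit_transform(X.fillna(X.median())), columns=X.columns)

mi = mutual_info_regression(X_q, y, random_state=42)
mi = pd.Series(mi, index=X.columns).sort_values(ascending=False)
top300_mi = mi.index[:300].tolist()   # keep the 300 most informative features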
Permutation Based Feature Importance:
A different way to select features is to train a model using all the features and then determine how heavily the model’s performance depends on each of them. This requires a model which can handle a lot of features without overfitting too badly, so we used a gradient boosted decision tree, specifically CatBoost.
We can measure how heavily the model depends on various features by using permutation-based feature importance. Basically, we train the model on all the data, and then measure its error after shuffling each feature column in turn. When the model’s error increases a lot after a feature is shuffled, that feature was important for the model’s predictions.
The advantage of permutation-based feature importance is that it gives a clear view and a single score of how important each feature is. We then plotted the importance scores for each feature. These scores are just the difference between the model’s error with no shuffled features and the error with the feature of interest shuffled, so larger scores correspond to features which the model needs in order to have a low error. Finally, we saved the top 300 most important features so that we can use them to train a model to predict the loyalty scores.
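A minimal permutation-importance sketch, assuming a fitted model and a validation split X_val, y_val (the scoring choice and repeat count are illustrative):
import pandas as pd
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_val, y_val,
                                scoring='neg_root_mean_squared_error',
                                n_repeats=5, random_state=42)
importances = pd.Series(result.importances_mean, index=X_val.columns)
top300_perm = importances.sort_values(ascending=False).index[:300].tolist()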

Now that we’ve engineered features for each card account, the next thing to do is create models to predict the target value from those features.
8. Modelling
We’ll build models to predict the customer loyalty given the features we engineered. The models we tried are:
- Baseline Model (Mean)
- Some single regression models (E.g: Light GBM, CatBoost, XGBoost, Bayesian Ridge)
- Some models which make use of outlier prediction.
- Ensemble Models
- Stacking Models
Best Hyperparameter Selection Technique used:
- We used Bayesian hyperparameter optimization, which uses Gaussian processes to find the best set of parameters efficiently for our model.
Cross-Validation Technique used while Training:
- K-Fold
- Repeated K-Fold
- Stratified K-Fold
1. Baseline Model (Mean)
Before we start creating predictive models, it’s good to have a baseline with which to compare our models’ performance. That way, if our models perform worse than the baseline, we’ll know something’s broken somewhere!
Our predictive models should definitely be able to do better than if we just predict the mean loyalty score for each sample.
We achieve an RMSE of 3.81507 for the baseline model. OK, so our models should for sure be getting RMSE values lower than 3.81507.
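A minimal baseline sketch, assuming train/validation targets y_train and y_val: predict the mean loyalty score for every sample and compute the RMSE.
import numpy as np

baseline_pred = np.full_like(y_val, fill_value=y_train.mean(), dtype=float)
rmse = np.sqrt(np.mean((y_val - baseline_pred) ** 2))
print(f'Baseline RMSE: {rmse:.5f}')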
2. Gradient-boosted decision tree regression
We opted to train our data on boosting models with decision trees as the base learners, as they handle non-linear relationships in the data very well.
We tried GBDTs with the top 300 features obtained from permutation importance and with the top 300 features obtained from mutual information, and the former proved to give better results.
At first, we tried fitting a model without any parameter tuning or anything fancy. Just with this, we can do better than predicting the mean, but not a whole lot better. We could definitely improve our model by tuning its hyperparameters. To do that we’ll use Bayesian hyperparameter optimization, which uses Gaussian processes to find a good set of parameters efficiently.
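A minimal sketch of such a search using scikit-optimize; the search space, model and fold count are illustrative, not the exact setup used here, and X, y are assumed to hold the selected features and target:
import lightgbm as lgb
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Real, Integer

space = [Integer(16, 256, name='num_leaves'),
         Real(0.005, 0.2, prior='log-uniform', name='learning_rate'),
         Real(0.5, 1.0, name='subsample'),
         Real(0.5, 1.0, name='colsample_bytree')]

def objective(params):
    num_leaves, learning_rate, subsample, colsample_bytree = params
    model = lgb.LGBMRegressor(n_estimators=1000, num_leaves=int(num_leaves),
                              learning_rate=learning_rate, subsample=subsample,
                              colsample_bytree=colsample_bytree, random_state=42)
    score = cross_val_score(model, X, y, cv=3,
                            scoring='neg_root_mean_squared_error').mean()
    return -score   # gp_minimize minimizes, so return the (positive) RMSE

result = gp_minimize(objective, space, n_calls=30, random_state=42)
print('Best CV RMSE:', result.fun, 'Best params:', result.x)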
Now that we’ve found hyperparameter values which work well for each model, let’s test the performance of each model individually before creating ensembles. The models we tried are:
- LightGBM
- XGBoost
- CatBoost
We used the repeated K-fold cross-validation technique to evaluate model performance on CV data. Out of all the GBDTs, LightGBM proved to give the best results on our data with its best tuned hyperparameters.
LIGHT GBM:
The best tree-based model we used is Microsoft’s LightGBM with the tuned hyperparameters obtained from Bayesian optimization, trained on the top 300 features from permutation importance with repeated K-fold cross-validation (a minimal training sketch is shown after the lists below). LightGBM is a gradient-boosting framework based on decision tree algorithms. It is highly flexible, capable of tasks such as regression, classification, ranking and more.
Main mechanisms are:
- Gradient-Based One-Side Sampling (GOSS)
- Exclusive Feature Bundling (EFB)
Some of the Advantages are:
- Much faster than other similar algorithms.
- Can handle a larger amount of data better.
- Takes lower memory to run.
- Provides feedback on feature importance.
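A minimal training sketch with repeated K-fold CV, assuming a pandas feature matrix X and target series y; the hyperparameters shown are illustrative, not the tuned values:
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import RepeatedKFold

n_splits, n_repeats = 5, 2
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
oof = np.zeros(len(X))
for train_idx, val_idx in rkf.split(X):
    model = lgb.LGBMRegressor(n_estimators=2000, learning_rate=0.01, num_leaves=63,
                              subsample=0.9, colsample_bytree=0.9, random_state=42)
    model.fit(X.iloc[train_idx], y.iloc[train_idx],
              eval_set=[(X.iloc[val_idx], y.iloc[val_idx])],
              callbacks=[lgb.early_stopping(100, verbose=False)])
    oof[val_idx] += model.predict(X.iloc[val_idx]) / n_repeats   # average over repeats
print('Repeated K-fold OOF RMSE:', np.sqrt(np.mean((y.values - oof) ** 2)))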

3. Predictions without outliers
As we saw during EDA, one odd thing about the target in our dataset is that there are a lot of outliers. Those outliers are probably having an outsized effect on the training of the model, so we tried training the model without including the outliers.
We fitted a tuned LightGBM only on the non-outlier data with K-fold CV. However, that didn’t seem to help at all: the model still performs better than just guessing the mean, but doesn’t come close to beating LightGBM trained on all the data.
4. Outlier prediction model
We experimented with training one model to predict whether a sample is an outlier or not, and then training another model on the non-outlier data.
We trained an LGBMClassifier to classify outlier vs. non-outlier, and a regression model (LGBMRegressor) to estimate the target values on the non-outlier data. Then, we used the classification model to predict the probability that each test sample is an outlier, and the regression model to estimate the value for the non-outlier samples.
After this, we experimented with two approaches:
- For samples which the classifier predicts are outliers, we just set the predicted value to be the outlier value (-33.2…). For samples which the classifier predicts are not outliers, we used the regression model to predict the values.
- Blending to create a new target:
train['new_target'] = train['binary_prediction'] * (-33.21928) + train['regression_prediction'] * (1 - train['binary_prediction'])
For the second approach, we then trained two regression models on the new target that we generated (one with simple K-fold CV on the entire data and another using StratifiedKFold CV stratified on the outliers column). Finally, we stacked both of these models and trained a Bayesian Ridge meta learner to generate the final predictions.
The first approach performed worse than the baseline, while the second approach performed well but not better than the results we got from the stacked GBDT models.
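A minimal sketch of the two-stage idea, assuming X, y for training and X_test for prediction; the models, threshold and variable names here are illustrative:
import lightgbm as lgb

outliers = (y < -30).astype(int)                      # flag the -33.21... targets

clf = lgb.LGBMClassifier(n_estimators=500, random_state=42)
clf.fit(X, outliers)                                  # P(card is an outlier)

reg = lgb.LGBMRegressor(n_estimators=1000, random_state=42)
reg.fit(X[outliers == 0], y[outliers == 0])           # regression on non-outliers only

p_outlier = clf.predict_proba(X_test)[:, 1]
reg_pred = reg.predict(X_test)
blended = p_outlier * (-33.21928) + (1 - p_outlier) * reg_pred   # blended prediction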
5. Ensemble Models
Often a way to get better predictions is to combine the predictions from multiple models, a technique known as ensembling. This is because combining predictions from multiple models can reduce overfitting. The simplest way to do this is to just average the predictions of several base learners. Ensembles tend to perform better when the predictions of their base learners aren’t highly correlated. The models we tried are:
Ensemble 1:
Mean Ensemble with:
- XGBoost
- LightGBM
- CatBoost
Ensemble 2:
Procedure:
First, I split the whole data into train and test sets (80-20). Then I split the 80% train set into D1 and D2 (50-50). From D1, I performed sampling with replacement to create d1, d2, d3, …, dk (k samples). After that I created k models and trained each of these models on one of the k samples.
Next, I passed the D2 set to each of these k models and got k predictions for D2, one from each model. Using these k predictions I created a new dataset, and since we already know the target values for D2, I trained a meta model on these k predictions.
Finally, for model evaluation, I used the 20% of the data that I had kept aside as the test set. I passed the test set to each of the base models to get k predictions, created a new dataset from these k predictions, passed it to the trained meta model and obtained the final prediction. Using this final prediction and the targets of the test set, I calculated the model’s performance score.
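A minimal sketch of this procedure; the base learner, meta learner and k are illustrative choices, and X, y are assumed to be available:
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import mean_squared_error

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
X_d1, X_d2, y_d1, y_d2 = train_test_split(X_tr, y_tr, test_size=0.5, random_state=42)

k = 5
rng = np.random.RandomState(42)
d2_preds, test_preds = [], []
for i in range(k):
    idx = rng.choice(len(X_d1), size=len(X_d1), replace=True)    # bootstrap sample of D1
    m = lgb.LGBMRegressor(n_estimators=500, random_state=i)
    m.fit(X_d1.iloc[idx], y_d1.iloc[idx])
    d2_preds.append(m.predict(X_d2))
    test_preds.append(m.predict(X_te))

meta = BayesianRidge().fit(np.column_stack(d2_preds), y_d2)      # meta model on D2 predictions
final = meta.predict(np.column_stack(test_preds))
print('Ensemble 2 test RMSE:', np.sqrt(mean_squared_error(y_te, final)))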
6. Stacking Models
Another ensembling method is called stacking, which often performs even better than just averaging the predictions of the base learners. With stacking, we first make predictions with the base learners. Then, we use a separate “meta” model to make predictions for the target based on the predictions of the base learners. That is, we train the meta learner with the features being the predictions of the base learners (and the target still being the target).
When stacking it’s important to ensure we train the meta-learner on out-of-fold predictions of the base learners (i.e. make cross-fold predictions with the base learners and train the meta learner on those).
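A minimal out-of-fold stacking sketch, assuming X, y and X_test; the base learners shown are illustrative, with a Bayesian Ridge meta learner as in the variants below:
import numpy as np
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import BayesianRidge

base_learners = [lgb.LGBMRegressor(n_estimators=1000, random_state=42),
                 xgb.XGBRegressor(n_estimators=1000, random_state=42)]

oof_preds, test_preds = [], []
for model in base_learners:
    oof_preds.append(cross_val_predict(model, X, y, cv=5))    # out-of-fold predictions
    test_preds.append(model.fit(X, y).predict(X_test))        # refit on all data for the test set

meta = BayesianRidge().fit(np.column_stack(oof_preds), y)     # meta learner on OOF predictions
final_pred = meta.predict(np.column_stack(test_preds))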
Models Tried:
Stacked Model Variant 1
- Model 1: Light GBM with KFold CV enumerated on target
- Model 2: Light GBM with StratifiedKFold CV enumerated on outliers
- Model 3: XGBoost with KFold CV enumerated on target
- Meta Model: Bayesian Ridge Regression
Stacked Model Variant 2
- Model 1: Light GBM with KFold CV enumerated on target
- Model 2: CatBoost with KFold CV enumerated on target
- Model 3: XGBoost with KFold CV enumerated on target
- Meta Model: Bayesian Ridge Regression
9. Result
Below is a snapshot of the different models and their RMSE scores.

10. Submission at Kaggle


11. Conclusion & Future Research
- In the face of novel problem types, intuition is still quite important as a guide.
- Understanding what the data features represent is crucial for feature engineering.
- Training with and without outliers helps to make predictions less prone to them.
- Feature selection and cross-validation help to make the models less likely to overfit and more robust to new data.
- For all models, appropriate data preparation and feature selection are very important. Most need a transformation of ordinal categorical features by methods like One Hot Encoding.
- Due to time constraints, I was unable to explore further, but it is important in this problem to separate the outliers from the dataset and try to treat them as a different distribution. Building a reasonably accurate classifier for outliers can greatly enhance the performance of models on the Elo dataset.
- We can try linear stacking as described in the 1st place solution, which reportedly gives a ~0.015 boost in local CV compared with training directly on the same features:
# linear stacking by the 1st place solution
train['final'] = train['binpredict'] * (-33.21928) + (1 - train['binpredict']) * train['no_outlier']
# "0.015 boost in local cv compare with same feature train directly.
# I think this trick works because binary is better than regression even though metric is rmse when label is 1&0."
12. Profile
Github:
Will Upload it soon.
LinkedIn:
13. References
EDA, featurization, and Model building:
- https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-elo
- https://github.com/alvarorgaz/Kaggle-Elo-Merchant-Category-Recommendation/blob/master/0.%20Data%20Dictionary.xlsx
- https://www.kaggle.com/roydatascience/elo-stack-with-goss-boosting
Compute Platform:
https://colab.research.google.com/
Course: