In this blog post I summarize my findings of the Udacity Data Science Nanodegree Capstone Project — Sparkify.
Suppose you are a Data Scientist at a popular digital music service similar to Spotify or Pandora. Let’s call it Sparkify. Users of Sparkify can stream their favorite songs either on a free tier that places advertisements between the songs or on a premium subscription. The latter is a flat-rate account where users stream music ad-free but pay a fixed monthly price. Users can churn or cancel their service at any time, so it’s crucial to make sure they love the service.
A user can interact with the Sparkify service in multiple ways, e.g. playing songs, liking a song with a thumbs up, listening to an advertisement, downgrading the service, etc. Every time a user interacts with the service, data is generated. All of this data contains information about keeping your users happy.
In this project, a user who churned is one who submitted a Submit Downgrade or Cancellation Confirmation event.
A 12GB data set (provided by Udacity) is the base for this project. To tackle such a Big Data task, deploying the data on a distributed computing cluster, preferably using a service like AWS or IBM Cloud, would be beneficial. In the following, for prototyping purposes, a small 128MB subset (‘mini_sparkify_event_data.json’) was analyzed using Apache Spark’s pyspark DataFrame and Machine Learning tools.
Now your Data Team receives the task to predict which users are at risk of churning, either by downgrading from premium to free tier or by canceling the service completely. If you could accurately identify these users before they leave, you could offer them, for example, discounts. This could increase your business revenue significantly, so it is worth developing an accurate churn prediction.
Please take a look at my Github repository to find all necessary coding steps for a churn prediction.
The data set is small, with only 225 unique users. Furthermore, disappointed churned (label=1) and satisfied non-churned (label=0) users are not evenly distributed: there are 52 churned and 173 non-churned users. A metric like accuracy works best with a 50/50 distribution of binary labels, so in this project, judging model performance by accuracy alone could be misleading. In contrast, the F1 score, as the harmonic mean of precision and recall, levels out extreme metric values. Therefore, due to the small data set size and the non-uniform label distribution, the F1 score was used to measure model performance.
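To make the accuracy-vs-F1 argument concrete, here is a small hand-rolled sketch (plain Python, not part of the project code) showing how a useless model that never predicts churn still scores well on accuracy for this 52/173 label split:

```python
# Hand-rolled metrics to show why accuracy misleads on an uneven label split.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_churn(y_true, y_pred):
    """F1 score for the churn class (label = 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1] * 52 + [0] * 173     # the label distribution of this data set
y_pred = [0] * 225                # a useless model that never predicts churn

accuracy(y_true, y_pred)  # ≈ 0.77 — looks decent
f1_churn(y_true, y_pred)  # 0.0 — reveals the model is worthless for churn
```

The degenerate all-zero predictor reaches roughly 77% accuracy simply because most users did not churn, while its F1 score for the churn class is zero.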
The Spark DataFrame loaded from mini_sparkify_event_data.json consists of 18 columns. Table 1 provides an overview of the column names, variable types and short descriptions. The most important columns used for feature engineering and then model evaluation are: userId, sessionId, registration, gender and the 22 categories of page.
After loading the json file data into a Spark DataFrame the following results were found:
- 286500 entries in the data set
- 226 unique users, 189 unique last names and 56 unique first names
- 1 unique user with an empty string as userId
- 2354 unique sessionIds
- 0 null userIds
- 0 null sessionIds
- 52 churned users
- 173 non-churned users
Abnormalities were found in the data set: interestingly, for userId and sessionId there were no missing (null) values. However, there are 8346 entries with an empty string as userId and without first name, last name and gender. Due to this undefined userId state it is not clear how many users triggered these entries. As a consequence, those entries had to be dropped from the DataFrame before modeling.
In the following, high-risk, disappointed churned users will be referred to as churn-1 users and satisfied non-churned users as churn-0 users.
A user can interact with the following 22 service web pages which are the 22 categorical values in the DataFrame column page:
Does the churn behavior depend on gender?
In Fig. 2 (left) the churn distribution is shown as a bar plot. The subset of 225 unique users contains 52 churn-1 and 173 churn-0 users. Hence, 23% of all users churned and the churn label is unevenly distributed in the data set.
In order to determine significant model features, it was interesting to know whether the tendency to churn is related to gender. The 225 unique users split into 104 female and 121 male users, i.e. 46% are women and 54% are men. In Fig. 2 (right) the churn distribution is shown by gender. 26% of men and 19% of women churned, i.e. churn is triggered slightly more often by men than by women, and hence gender could be a useful feature to predict churn.
Does the membership duration play a role for the churn rate?
Intuitively, one could expect that a longer membership indicates a lower tendency to churn. In fact, this effect can be measured. Fig. 3 (left) shows the membership duration of churn-1 and churn-0 users. The median value for satisfied churn-0 users is 75 days, whereas for churn-1 users it is 49 days. Hence, a shorter membership is related to a higher risk to churn. Especially in the starting phase of a membership, it could be beneficial in terms of business revenue to offer the user special discounts or incentives or to make surprising recommendations. In Fig. 3 (right) a distplot of the membership duration in days is depicted. Interestingly, the distribution separates into three peak regions: a short-term (days ~20), a medium-term (days ~75) and a long-term (days ~175) membership duration. The medium- and long-term peaks seem to be fed by low-risk churn-0 users; in particular, the high medium-term peak coincides with the median value of churn-0 users. However, the peak area between day ~0 and day ~50 is fed by roughly half of the users with a higher risk to churn (see the churn-1 box plot in Fig. 3 left). The membership range between day 0 and day 50 could therefore be critical in terms of a higher churn tendency: try to keep your users as happy as possible during the starting phase of their memberships.
Are there useful tendencies for feature extraction based on user web page activities?
The most interesting column in the Spark DataFrame with regard to churn is page. It has 22 different categorical values which refer to web page titles; these web pages allow users to interact with the Sparkify services, and those interactions can be used as data for churn prediction. Fig. 4 (upper) shows the overall web page activities of all (unique) users in descending order. It is no surprise that NextSong wins regarding web page traffic: page NextSong offers an audio player where users can listen to a song. Fig. 4 (lower) divides those traffic counts into churn-0 and churn-1 users. There are small differences in some count distributions, for example for Thumbs Up, Thumbs Down or Add Friend.
Such differences are more evident in box plots, where data spreading is evaluated via median and quantile calculations. In Fig. 5 box plots are shown as examples for the features ‘Avg number of songs played per session’ (1), ‘Number of thumbs down per day and user’ (2), ‘Number of thumbs up per day and user’ (3) and ‘Number of “Add Friend” per day and user’ (4). These features were obtained by DataFrame aggregations of web page interactions.
Thereby, all counts are split again into churn-0 and churn-1 users.
This feature set describes how intensively a user interacts with the Sparkify services, i.e. higher counts reflect more intensive web service usage. Features marked with ‘per day and user’ were normalized by the membership duration of the corresponding user. This normalization step was necessary because the total counts of web page activities for a certain user correlate with the duration of the user’s membership, i.e. a longer membership naturally results in higher web page activity counts.
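A toy sketch of this normalization (the function name and numbers are illustrative, not project code) shows why raw counts alone are misleading: the same total count translates into very different usage intensities depending on membership duration.

```python
# Minimal sketch of the per-day normalization; names and values are illustrative.
def rate_per_day(total_count, membership_days):
    """Turn a raw activity count into a per-day rate."""
    return total_count / membership_days if membership_days else 0.0

# The same raw count, very different usage intensity:
rate_per_day(30, 150)  # 0.2 events/day for a long-term member
rate_per_day(30, 20)   # 1.5 events/day for a new member
```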
In the case of features (1), (3) and (4), a higher median web page activity count would mean a lower tendency to churn. As one can see in Fig. 5, this is confirmed by higher median values for churn-0 users, except for feature (4).
In contrast, for feature (2), ‘Number of thumbs down per day and user’, a lower median value would mean a lower tendency to churn, because the user is less disappointed. As Fig. 5 confirms, churn-0 users show a lower median value (median = 0.10/day) than churn-1 users (median = 0.15/day).
In summary, features (1), (2) and (3) seem to be interesting features for modeling as they confirm intuitively expected trends. Hence, these features should be added to the feature list and used to predict churn.
Further features were evaluated with similar data preparation techniques based on aggregations as shown in Fig 5.
Finally, the following set of 9 features seems promising with regard to predicting churn. Each feature is placed in an Apache Spark DataFrame (col1 = userId, col2 = feature), i.e. each row contains the feature data for one unique user.
- gender: The gender of the user
- membership_duration: The duration of the membership
- rate_songs_played: The total number of songs played per day and user
- avg_num_songs_session: The average number of songs played during one session
- rate_thumbs_up: The total number of thumbs up per day and user
- rate_thumbs_down: The total number of thumbs down per day and user
- rate_errors: The total number of occurred errors per day and user
- rate_add_playlist: Number of songs added to the playlist per day and user
- rate_add_friends: Number of friends added per day and user
As shown above, abnormalities were found in the data set for userId: 8346 entries with an empty string as userId had to be dropped from the DataFrame during data preprocessing.
Furthermore, as explained in the last section, features which describe the user’s web page activity were normalized by the user’s membership duration. This was necessary because the total activity counts correlate with how long a user has been a member, i.e. a longer membership naturally results in higher counts (see above).
In addition, the feature value ranges must be analyzed. Table 3 provides this information. As one can see, the minimum and maximum values of the chosen feature set spread over a wide range. In order to give each feature an equal chance without being overruled by other features, we need to make sure they span the same range. This can be achieved via standard or min-max scaling or normalization.
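For intuition, here is the min-max formula x' = (x − min)/(max − min) written out by hand; in the project, Spark’s MinMaxScaler performs this per feature (the example durations are illustrative).

```python
# What min-max scaling does, written out by hand: x' = (x - min) / (max - min).
def minmax_scale(values):
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

# Membership durations in days, squeezed into [0, 1]:
minmax_scale([20.0, 75.0, 175.0])  # first value maps to 0.0, last to 1.0
```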
Please open the notebook in my Github repository to follow the code explanations below.
The normalization of web page activities by the user’s membership duration is implemented in the function norm_feature_by_membership_duration. It takes as inputs a feature DataFrame, the membership_duration DataFrame, the feature_col_name, the name of the returned feature column new_col_name, and drop (a bool which removes unwanted columns after normalization). It returns feature_norm, the normalized feature DataFrame.
For feature engineering purposes, a function called create_dataframe_for_modeling was created. It takes the path to the data set ‘mini_sparkify_event_data.json’ as input and returns a DataFrame for modeling. This is the main function of feature preparation and calls different child functions to get the data ready. Its workflow is the following:
- Read the DataFrame from a json file
- Clean the DataFrame (i.e. drop entries with empty string userIds)
- Define churn and create a label column called ‘churn’
- Create feature columns (via feature function calls)
- Normalize features by membership duration
- Join features together in df_model
- Remove columns which are not needed anymore
- Fill NaN with 0 (if there are any)
- Cast features to float
- Rename ‘churn’ to ‘label’
- Return a model DataFrame called df_model
Before we can feed df_model to a Machine Learning model, the feature columns of the cleaned DataFrame must be vectorized. Otherwise Spark’s Machine Learning algorithms would break. After vectorization features have to be scaled (as discussed above). Last but not least the vectorized and scaled DataFrame must then be split into training and validation sets for model evaluation.
One problem here is the small size of df_model with only 225 rows. The split parameters were set to 90/10 for training and validation resulting in 209 rows for training and only 16 for validation. An additional testing set has not been implemented due to the small amount of data. This could be optimized when we switch to larger data sets.
Vectorization, scaling and splitting is done within the function prepare_model_dataframe which includes the following steps:
- Convert numeric columns to Spark’s vector type via VectorAssembler
- Scale vectorized data via Spark’s MinMaxScaler
- Split DataFrame into training and validation sets via randomSplit
After features were successfully extracted and prepared it was then time to build a model.
The function build_model takes the training and validation data, as well as a string argument named classifier that selects between different classification/regression models, and returns the model results:
- Three models were chosen: RandomForestClassifier, LogisticRegression and GBTClassifier. They are selected via the function argument classifier, which takes three string values: rf for RandomForestClassifier, lr for LogisticRegression and gbt for GBTClassifier. Via if statements the corresponding classifier is enabled.
- Thereby, the model was constructed using Spark’s Pipeline module. This module chains feature vectorization, feature scaling (both transformations) and then fitting (training) the classification or regression model.
- By using Spark’s ParamGridBuilder different classifier/regression model hyperparameters were tuned, e.g. maximum number of iterations or number of trees.
- Via Spark’s CrossValidator the different pipeline estimators were evaluated. It takes the parameter grid and the Evaluator; the number of folds was fixed to 3.
- The Evaluator defines the metric to evaluate the results on the validation set. Here I have chosen MulticlassClassificationEvaluator. It can evaluate accuracy, precision, recall and F1 score. Its default metric is the F1 score. This is the score needed for this project.
- The output of build_model is cvmodel, containing the fit model parameters, and results with the best model prediction (on validation data).
The model evaluation was then done using the function evaluate_model. This function takes cvmodel and results as inputs and returns two Pandas DataFrames: df_f1, containing metric parameters (F1 score) from cvmodel, and df_feature_importance, containing the most important coefficients (regression) and weights (classification), respectively. Both DataFrames were built by extracting information from the model parameters in cvmodel. The DataFrame df_f1 lists all hyperparameter settings together with the F1 score results, whereas df_feature_importance shows the result for the best trained model.
Some refinement work was done in the function create_dataframe_for_modeling. In a first data exploration approach, the feature gender was not used, as its importance was not clearly visible during that phase. Indeed, as you can see in the results part, the importance of gender is low compared to other features. However, with regard to business decisions it is important to understand both worlds: features with high AND low importance.
In addition, in a first approach the total counts of users’ web page activities were not normalized by the membership duration, which led to wrong interpretations of some web-related features.
What has been done so far?
- Features were extracted and prepared by cleaning, membership_duration normalization, vectorization and scaling.
- Machine Learning models based on Logistic Regression, Random Forest Classification and Gradient Boosting Trees were implemented.
- An evaluation metric based on the F1 score was added via 3-fold cross-validation.
- The training process has been started with the aim to predict churn.
Now it is time to evaluate the results. The metric results (F1 score) are shown in the inset of the RandomForestClassifier plot in Fig. 6. The highest F1 score, close to 74%, was found for the RandomForestClassifier, followed by GBTClassifier and LogisticRegression, both with a score close to 71%. Of course, there is still room for improvement. This will be discussed in Justification and Improvement (see below).
In Fig. 6 the feature importance is shown for all three investigated classification/regression models in descending order. As stated in the Data Preparation part, membership_duration seems to play a crucial role for predicting churn. Indeed, it has the highest feature importance for RandomForestClassifier as well as GBTClassifier, and the second highest for LogisticRegression. Gender, however, was evaluated with the lowest coefficient for RandomForestClassifier and LogisticRegression, and for GBTClassifier, too, gender is not among the most important features. This agreement across different models is an indication of a certain model robustness, at least for these two features.
Now let’s focus on user web page activities: especially the number of thumbs down events per day and user seems to have an impact on predicting churn. If a user clicks frequently on the Thumbs Down button, the user seems to be disappointed with the service.
The median value for satisfied churn-0 users regarding the number of thumbs down per day and user is 0.1/day (see the Visualization section). In contrast, for disappointed churn-1 users it is higher, with a median of 0.15/day. This means that if a user clicks the Thumbs Down button 15 or more times in 100 days, this user is at a high risk to churn. One could use this median value as a simple threshold for churn indication: if a user’s rate of thumbs down events is close to 0.15 dislikes per day, analyze this user and try to find the reason why he/she is unhappy.
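This rule of thumb can be written down directly; note that the 0.15/day cutoff is the churn-1 median from this small data set, not a universal constant, and the function name is illustrative.

```python
# Illustrative rule of thumb only: the 0.15/day cutoff is the churn-1 median
# from this small data set, not a universal constant.
CHURN_RISK_RATE = 0.15  # thumbs-down events per day

def at_risk(thumbs_down_count, membership_days):
    return thumbs_down_count / membership_days >= CHURN_RISK_RATE

at_risk(15, 100)  # True  -> worth investigating why this user is unhappy
at_risk(10, 100)  # False -> rate matches the satisfied-user median
```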
For LogisticRegression and GBTClassifier, max_iter, the number of iterations, was tuned via the ParamGridBuilder; 5, 8 and 10 iterations were tested. Interestingly, the F1 score for GBTClassifier reached its maximum value already after 5 iterations (71%) and for LogisticRegression after 8 iterations (71%). Overfitting effects seem to set in already after a few iterations.
For the RandomForestClassifier, hyperparameter tuning was done for num_trees, the number of trees in the forest, usually the most important setting for this classifier. It seems that increasing num_trees could enhance the prediction result; however, it also increases prediction time and memory usage. For 10 trees an F1 score of 74% was observed.
It is obvious that not all features show the same level of significance in the chosen set of classifiers. The main problem seems to be the small size of the data set used for modeling. The small amount of data leads to noise and therefore to variance in the feature importance between the classification/regression models as well as in the predictions. Furthermore, this noise could lower the F1 score due to overfitting. In future investigations it would be interesting to see whether model robustness could be enhanced by using the full 12GB data set while keeping the same feature and model settings as in this work.
In this article different models for web user churn prediction were developed by using Apache Spark’s pyspark framework for DataFrame operations and Machine Learning.
These are the most important take away messages:
- The winning model so far is Spark’s RandomForestClassifier with an F1 score of approximately 74%.
- It has been shown that membership duration and web page interaction intensity are crucial for a successful churn prediction.
- The most important feature is membership duration. Its level of importance has been confirmed by all investigated Machine Learning models. Interestingly, gender has a significantly lower importance. The consequence for your business: it is more important to make users as happy as possible in the starting phase of their membership than to care too much about gender-related optimizations of your digital services.
- In addition, keep an eye on how your users interact with the web pages, especially the Thumbs Down event. The median thumbs-down rate is 0.1/day for satisfied churn-0 users and 0.15/day for disappointed churn-1 users. If a user’s rate of dislikes approaches 0.15 per day, analyze this user and try to find out why he/she is unhappy.
It is assumed that the small size of the data set leads to a certain amount of noise, which reduces F1 scores and leads to fluctuations in feature importance between different classification/regression models. It would be interesting to see whether this still holds for the large 12GB data set.
However, there is still room for improvement for all three trained models. In future studies, higher F1 scores could possibly be achieved by:
- Tuning the model hyperparameters in wider ranges
- Implementing more hyperparameters for model optimization: for RandomForestClassifier, for example, maxDepth (the maximum number of levels in each decision tree); for LogisticRegression, elasticNetParam (the ElasticNet mixing parameter, to switch between L1 and L2 regularization)
- Using larger data sets, possibly on distributed computing platforms like AWS or IBM Cloud
- Implementing more features for model training
- Integrating other databases in order to link song names with genres. Disappointed churn-1 users may like to listen to a special kind of genre which is only weakly supported by Sparkify at the moment. This result could also be used as a baseline for a Recommendation Engine in future optimizations.
I find it interesting that detailed documentation for specific Spark modules like MulticlassClassificationEvaluator or GBTClassifier is still sparse at the moment. It seems that the Spark community is still small compared to the Pandas or Scikit-Learn communities. I hope Spark keeps growing, and I am looking forward to the further development of Spark’s Machine Learning modules.
Take a look at my Github page to follow the coding process and to get insight into more interesting observations based on churn prediction with Spark’s Machine Learning libraries.