DSND Capstone Project
Project Overview
Sparkify data consists of users' interactions with a music streaming service. There are two service levels, paid and free, and users can upgrade or downgrade their service. Events such as playing a song or liking or disliking a song are all recorded.
The project also offers optional cloud deployment on either Amazon AWS or IBM Watson. Both were tried, and IBM Watson was comparatively easy to deploy on: it does not require installing default modules such as pandas or plotly, and no code line caused a session termination.
Problem Statement
The first task in predicting the churned group is data exploration, to define churn and to identify the features. The page data in particular, which records the pages users visit, carries relevant information about churn. Identifying the categorical columns also helped in defining some features. The main focus when defining features is the user's main events in the system, so questions like these should be answered: how many songs do they listen to, and at which hours of the day? How do they interact with other users and pages? Do they encounter any errors in the system?
Metric
The problem is to identify users who are about to churn, so we should minimize false negatives and focus on recall (the true-positive rate). The F1 score is also included as a metric, because precision can be considered after the possible churn users have been identified.
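The metric choice above can be made concrete with a small sketch. The confusion-matrix counts below are illustrative, not the project's actual numbers; in the project these scores come from Spark's evaluators.

```python
# Hedged sketch: recall and F1 computed from confusion-matrix counts.
def recall(tp, fn):
    # Share of actual churners that the model caught.
    return tp / (tp + fn)

def precision(tp, fp):
    # Share of predicted churners that actually churned.
    return tp / (tp + fp)

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example: 40 churners caught, 5 missed, 10 false alarms.
print(round(recall(40, 5), 3))   # -> 0.889
print(round(f1(40, 10, 5), 3))   # -> 0.842
```

Missing a churner (a false negative) means losing a customer we could have retained, which is why recall is weighted first here.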
Feature List
There are 18 features in the dataset.
root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)
Users and Related Pages
There are 22 different pages in the data:
About, Add Friend, Add to Playlist, Cancel, Cancellation Confirmation, Downgrade, Error, Help, Home, Login, Logout, NextSong, Register, Roll Advert, Save Settings, Settings, Submit Downgrade, Submit Registration, Submit Upgrade, Thumbs Down, Thumbs Up, Upgrade
226 users created 2354 sessions. In total, there are 286500 rows in the dataset.
There are also some pages, listed below, that have no user information:
About, Submit Registration, Login, Register, Help
There are 8346 rows from unregistered users, which carry almost no data. These rows should be discarded for further analysis.
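The cleanup step above can be sketched as a simple filter. In the project this would be a Spark DataFrame filter on `userId`; the plain-Python version below uses made-up example rows to illustrate the idea of dropping events with an empty user id.

```python
# Sketch: drop events with an empty userId (unregistered visitors,
# e.g. on the Login or Register pages). Rows are illustrative.
events = [
    {"userId": "124", "page": "NextSong"},
    {"userId": "",    "page": "Login"},      # unregistered visitor
    {"userId": "51",  "page": "Thumbs Up"},
    {"userId": "",    "page": "Register"},   # unregistered visitor
]

registered = [e for e in events if e["userId"]]
print(len(registered))  # -> 2
```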
User Related Data
There are two types of levels; free and paid.
The male-to-female ratio is almost equal in the data.
User Song Data
Analysing the total number of songs played across the 24 hours of the day shows that users mostly listen to music in the afternoon.
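The hour-of-day analysis relies on converting the `ts` column, which stores the event time as epoch milliseconds, into an hour. A minimal sketch, assuming UTC (the project may instead use the Spark session's local time zone):

```python
from datetime import datetime, timezone

# Sketch: derive the listening hour from a `ts` value in epoch milliseconds.
def event_hour(ts_ms):
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).hour

# Example timestamp: 2018-10-01 00:01:57 UTC.
print(event_hour(1538352117000))  # -> 0
```

Grouping song-play events by this hour and counting them per bucket gives the daily listening profile described above.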
The three most-played artists are "Kings Of Leon", "Coldplay", and "Florence + The Machine".
+--------------------+-----------+
|              Artist|Artistcount|
+--------------------+-----------+
|       Kings Of Leon|       1841|
|            Coldplay|       1813|
|Florence + The Ma...|       1236|
|       Dwight Yoakam|       1135|
|               Björk|       1133|
|      The Black Keys|       1125|
|                Muse|       1090|
|       Justin Bieber|       1044|
|        Jack Johnson|       1007|
|              Eminem|        953|
|           Radiohead|        884|
|     Alliance Ethnik|        876|
|               Train|        854|
|        Taylor Swift|        840|
|         OneRepublic|        828|
|         The Killers|        822|
|         Linkin Park|        787|
|         Evanescence|        781|
|            Harmonia|        729|
|       Guns N' Roses|        713|
+--------------------+-----------+
Defining Churn
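A minimal sketch of a churn definition, assuming (as the page list above suggests) that reaching the "Cancellation Confirmation" page marks a user as churned:

```python
# Sketch: label a user as churned (1) if any of their events hit the
# "Cancellation Confirmation" page, otherwise not churned (0).
def churn_label(user_pages):
    return 1 if "Cancellation Confirmation" in user_pages else 0

print(churn_label(["NextSong", "Thumbs Up", "Home"]))             # -> 0
print(churn_label(["NextSong", "Cancellation Confirmation"]))     # -> 1
```

This per-user flag becomes the `label` column used for training in the Modeling section.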
Feature Engineering
The following features were used for churn prediction:
1. userId - unique id of the user
2. gender - user's gender
3. songs_total - total number of song events per day for the user
4. thumbs_up - total thumbs-up count for each user
5. thumbs_down - total thumbs-down count for each user
6. downgrade_total - total number of downgrades
7. error_count - total number of errors the user had
8. hour - last login hour
9. day - last login day
10. month - last login month
11. year - last login year
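Most of these features are per-user counts of page events. A plain-Python sketch of how one of them (thumbs_up) could be aggregated; in the project this would be a Spark groupBy-count, and the rows below are illustrative:

```python
from collections import Counter

# Sketch: count "Thumbs Up" events per user to build the thumbs_up feature.
events = [
    ("124", "Thumbs Up"),
    ("124", "NextSong"),
    ("51",  "Thumbs Up"),
    ("124", "Thumbs Up"),
]

thumbs_up = Counter(user for user, page in events if page == "Thumbs Up")
print(thumbs_up["124"], thumbs_up["51"])  # -> 2 1
```

The other count features (thumbs_down, downgrade_total, error_count) follow the same pattern with a different page filter.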
Modeling
The featured data was split 80% for training and 20% for testing.
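The split can be sketched as follows; in the project this would be Spark's `randomSplit([0.8, 0.2])`, while the plain-Python version below shuffles with a fixed seed for reproducibility.

```python
import random

# Sketch of an 80/20 train/test split over a list of rows.
def split_80_20(rows, seed=42):
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the input is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

train, test = split_80_20(list(range(100)))
print(len(train), len(test))  # -> 80 20
```

Note that Spark's `randomSplit` produces approximately (not exactly) the requested proportions, since each row is assigned independently.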
The features "hour", "day", "month", "year", "songs_total", "thumbsup_Total", "thumbsdown_total", "downgradeTotal", and "error_count" are assembled into a single features vector with the VectorAssembler method, and the data is scaled with StandardScaler.
This leaves three columns per user for churn prediction.
+------+-----+--------------------+
|userId|label|            features|
+------+-----+--------------------+
|   124|    0|[2.15046215163503...|
|    51|    1|[1.00354900409634...|
|    15|    0|[0.57345657376934...|
|    54|    1|[2.72391872540437...|
|   155|    0|[1.57700557786569...|
+------+-----+--------------------+
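The scaling step can be sketched in plain Python. Note that Spark's StandardScaler by default scales to unit standard deviation without centering (`withMean=False`); the sketch below also subtracts the mean, purely for illustration, and uses made-up values.

```python
import math

# Sketch: standardize one feature column (center, then scale to unit
# sample standard deviation), as StandardScaler would per feature.
def standard_scale(column):
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
    return [(x - mean) / std for x in column]

scaled = standard_scale([10.0, 20.0, 30.0])
print(scaled)  # -> [-1.0, 0.0, 1.0]
```

Scaling matters less for tree models like Random Forest, but it keeps the single assembled features vector comparable across features for the other classifiers tried.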
Result
Sparkify data covers a variety of user-experience events such as song plays, thumbs-up, thumbs-down, and errors. Random Forest did a better job by splitting the nodes of each tree on a limited number of the features. Naive Bayes performs better on categorical variables, whereas Sparkify data has mostly numerical features; that might be why Naive Bayes scored lower than Random Forest.
Logistic regression might be expected to be a better model because of the binary outcome (churn or no churn), but the data was better modelled by the Random Forest model.
It initially produced an F1 score and recall of 86.6%. After refinement, recall and F1 score improved to 93.3%. With this result, it is time to continue with the refined model on a larger dataset, which will raise the robustness of the model and improve customer management.
Conclusion
Although the Sparkify project implementation initially started on Amazon AWS, there was a session termination at the StandardScaler code. The same code works fine on IBM Watson and the Udacity workspace. Some modules like pandas and plotly also need to be installed on Amazon AWS before starting the project, whereas on IBM Watson, after adding initial code lines for the dataset connection, no installation is needed and all code lines execute without session termination.
Sparkify data features are mostly system-created, like page info, time, user level, and errors; thus, not much effort was needed to re-format the data or deal with null values in features. This project also shows how an application's logging process should be designed with machine learning algorithms in mind during development.
Future Improvements
- More features can be included, such as more user events ("Add Friend", "Add to Playlist"); location and status can also be included.
- Different user agents can be analysed separately, because they might reflect different user experiences.
- Artist and songs popularity can also be analysed.
- Add a recommendation engine for users.
- Use larger data for better results.



