Machine Learning Final Project: Identify Fraud

Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those? [relevant rubric items: "data exploration", "outlier investigation"]

Intro:

The Enron scandal of 2001 was so severe that even most people in Germany (where I live) have heard about it. Many Enron executives were indicted for their deep involvement in the scandal, which centered on hiding debt and manipulating the company's financial statements.

The dataset at hand is a subset of Enron employees, some of whom were part of the scandal. The goal is to identify these persons of interest (POIs). The dataset contains financial features, email features, and the POI label.

The full feature list is given in Udacity's Project Details; the target is the POI label: ['poi'] (boolean, represented as integer).

Outliers:

Via a brief exploratory data analysis and a double check with the file enron61702insiderpay.pdf (given by Udacity), the following outliers were detected:

Both are aggregated observations and thus not individual employees. In addition, one employee record can be considered an outlier and was removed, as all of its values were NaN.
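A minimal sketch of this cleaning step, assuming the Udacity final_project_dataset.pkl file and that the aggregated 'TOTAL' record is among the rows removed (the exact keys depend on the outliers identified above):

```python
import pickle

# Load the project data dictionary from the Udacity starter files.
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

# Remove the aggregated row(s); the key shown here is an assumed example.
data_dict.pop("TOTAL", None)

# Remove any record whose features are all 'NaN' (the label itself excluded).
all_nan = [name for name, feats in data_dict.items()
           if all(value == "NaN" for key, value in feats.items() if key != "poi")]
for name in all_nan:
    data_dict.pop(name, None)
```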

What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values. [relevant rubric items: "create new features", "intelligently select features", "properly scale features"]

New features

I experimented with two new features, both of which are ratios (a financial one and a communication-related one).
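A rough sketch of how both ratios could be computed and stored in the data dictionary (data_dict from the cleaning step above; poi_messages_ratio is the communication ratio used later in this report, while the exact definition of the financial ratio shown here is an assumption):

```python
def safe_ratio(numerator, denominator):
    # Return the ratio, or 'NaN' if either part is missing or the denominator is zero.
    if numerator in ("NaN", None) or denominator in ("NaN", None) or denominator == 0:
        return "NaN"
    return float(numerator) / float(denominator)

for name, feats in data_dict.items():
    # Communication ratio: share of a person's messages exchanged with POIs.
    poi_msgs = sum(v for v in (feats.get("from_poi_to_this_person"),
                               feats.get("from_this_person_to_poi"))
                   if isinstance(v, int))
    all_msgs = sum(v for v in (feats.get("to_messages"),
                               feats.get("from_messages"))
                   if isinstance(v, int))
    feats["poi_messages_ratio"] = safe_ratio(poi_msgs, all_msgs)

    # Financial ratio (illustrative choice): total stock value relative to salary.
    feats["financial_ratio"] = safe_ratio(feats.get("total_stock_value"),
                                          feats.get("salary"))
```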

Selection Process

I ranked all available features (both including and excluding the two I created, and always excluding the email address feature) via SelectKBest as well as via the feature_importances_ attribute of Decision Tree and Random Forest classifiers. These were the top 10 features as ranked by SelectKBest:

Feature                    Score
exercised_stock_options    24.82
total_stock_value          24.18
bonus                      20.79
salary                     18.29
deferred_income            11.46
long_term_incentive         9.92
restricted_stock            9.21
total_payments              8.77
shared_receipt_with_poi     8.59
loan_advances               7.18

The 10th feature, loan_advances, was dropped, as it has only 3 (!) non-NaN values. Besides, its score is roughly 16% lower than the next-best feature, which seemed a reasonable, if somewhat arbitrary, cut-off. So, I chose to work with just the top 9. Except for shared_receipt_with_poi, all of these features are financial.

Eventually, shared_receipt_with_poi was substituted with my own ratio feature poi_messages_ratio. The same applies to the salary and total_stock_value features, which are components of my financial ratio. However, the financial ratio did not perform well at all, so I removed it and added salary and total_stock_value back to the feature list.
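For reference, a sketch of how the SelectKBest scores above can be reproduced (assuming data_dict from the earlier steps; the candidate feature list is abbreviated, and missing values are replaced with 0, analogous to the starter code's featureFormat helper):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Abbreviated candidate list for illustration; the email address string is excluded.
feature_names = ["salary", "bonus", "total_payments", "total_stock_value",
                 "exercised_stock_options", "deferred_income", "long_term_incentive",
                 "restricted_stock", "loan_advances", "shared_receipt_with_poi"]

def to_float(value):
    return 0.0 if value in ("NaN", None) else float(value)

features = np.array([[to_float(feats.get(f)) for f in feature_names]
                     for feats in data_dict.values()])
labels = np.array([int(feats["poi"]) for feats in data_dict.values()])

# k='all' keeps every feature so the complete ranking can be inspected.
selector = SelectKBest(score_func=f_classif, k="all").fit(features, labels)
for name, score in sorted(zip(feature_names, selector.scores_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name:25s} {score:.2f}")
```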

Scaling

Scaling was not necessary for all tested algorithms: tree-based ones such as Random Forest and Decision Tree are insensitive to feature scale. However, it was still applied because the financial features dwarf the communication features in magnitude; without scaling, scale-sensitive algorithms would implicitly give the financial features more weight.
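A minimal sketch of this scaling step with sklearn's MinMaxScaler (using the features array built above):

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale every feature to [0, 1] so that the large dollar amounts
# no longer dwarf the e-mail count features.
features_scaled = MinMaxScaler().fit_transform(features)
```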

What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms? [relevant rubric item: "pick an algorithm"]

I tried the following algorithms, with the following final results:

*(Included out of curiosity, as KMeans is primarily a clustering algorithm rather than a classifier.)

These are the results without my own feature poi_messages_ratio, using just the best-ranked features from above:

*(Included out of curiosity, as KMeans is primarily a clustering algorithm rather than a classifier.)

So, in the end, my feature selection did not pay off for the tree-based algorithms, but it did make the GaussianNB classifier pass the test.
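For illustration, a hedged sketch of the kind of comparison behind these results (the classifier settings are illustrative, KMeans is omitted here, and the reported results were produced with tester.py rather than with this snippet):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

candidates = {
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Compare average precision and recall over 5 cross-validation folds.
for name, clf in candidates.items():
    scores = cross_validate(clf, features_scaled, labels, cv=5,
                            scoring=("precision", "recall"))
    print("%-12s precision %.2f  recall %.2f"
          % (name, scores["test_precision"].mean(), scores["test_recall"].mean()))
```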

What does it mean to tune the parameters of an algorithm, and what can happen if you don't do this well? How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier). [relevant rubric items: "discuss parameter tuning", "tune the algorithm"]

The goal of parameter tuning is to improve the performance of the chosen algorithm. Hyperparameters are parameters whose values are set before learning begins. Hyperparameter optimization can be done manually or with sklearn's GridSearchCV. If it is not done well, the chosen classifier may fall short of expectations in practice, e.g. because the model over- or underfits.

Although I eventually used GaussianNB, I experimented with tuning the other classifiers to gain some experience; the hyperparameters for two of them are given here as an example.
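As a sketch, tuning a Decision Tree with GridSearchCV could look like the following (the parameter grid is illustrative, not the exact grid I used):

```python
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Illustrative grid; every combination is evaluated with stratified shuffle splits.
param_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 4, 8, 16],
    "max_depth": [None, 2, 4, 8],
}

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, scoring="f1", cv=cv)
grid.fit(features_scaled, labels)
print(grid.best_params_, grid.best_score_)
```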

What is validation, and what's a classic mistake you can make if you do it wrong? How did you validate your analysis? [relevant rubric item: "validation strategy"]

Validation tests the performance of your model on data it has not seen during training. A classic mistake is to forego splitting the data into training and testing sets and thus use the whole dataset for training. Consequently, the model will be overfit and perform poorly on new data. Therefore, I split the data into a training set (70% of the data) and a test set (30% of the data).

Furthermore, the StratifiedShuffleSplit validation from tester.py was used. Since we are dealing with a small and imbalanced dataset in this project, a single train/test split can give misleading results, so cross-validation is commonly used to make the evaluation more robust. This cross-validation method repeatedly shuffles the data and splits it into randomized train/test folds; by default it reshuffles and splits the data 10 times. Moreover, with a small, imbalanced dataset it is possible that some folds contain almost no (or even no!) instances of the minority class. The idea behind stratification is to keep the percentage of the target class in each fold as close as possible to its percentage in the complete dataset.
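A simplified sketch of this validation scheme with the final GaussianNB classifier (tester.py runs many more folds; sklearn's default of 10 splits is used here):

```python
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, recall_score

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
precisions, recalls = [], []

# Train and evaluate on each stratified, shuffled split, then average the metrics.
for train_idx, test_idx in sss.split(features_scaled, labels):
    clf = GaussianNB().fit(features_scaled[train_idx], labels[train_idx])
    pred = clf.predict(features_scaled[test_idx])
    precisions.append(precision_score(labels[test_idx], pred, zero_division=0))
    recalls.append(recall_score(labels[test_idx], pred, zero_division=0))

print("precision %.2f  recall %.2f"
      % (sum(precisions) / len(precisions), sum(recalls) / len(recalls)))
```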

Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm's performance. [relevant rubric item: "usage of evaluation metrics"]

The simplest evaluation metric is accuracy, which divides the number of correctly labeled observations by the total number of observations. However, a high accuracy score can be misleading due to the accuracy paradox, which occurs when one class is very rare in the population. This is the case for this dataset, where there are only a few POIs, so precision and recall are the more appropriate metrics.
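A toy illustration of the accuracy paradox (the numbers are made up and not taken from this dataset):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 20 people, 3 of them POIs, and a 'classifier' that predicts non-POI for everyone.
y_true = [1, 1, 1] + [0] * 17
y_pred = [0] * 20

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.85 -- looks good
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0
```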

A good precision means that whenever a POI gets flagged in the test set, we can be quite confident that it is a real POI and not a false alarm. A good recall, on the other hand, means that nearly every time a POI shows up in the test set, we are able to identify that person. We might have some false positives in that case, but pulling the trigger too early here just means further investigation, which is not the same as being directly indicted. So, recall might be more important than precision.
