Detecting phishing websites using a decision tree

This repository is a tutorial explaining how to train a simple decision tree classifier to detect websites that are used for phishing. Phishing websites typically disguise themselves as trustworthy websites to gain the trust of their victims, and malicious parties use them to obtain sensitive information such as passwords or credit card numbers. In this tutorial, we train a decision tree to detect such websites with an accuracy of about 90.5%.

Installation

To get started, you should first clone this repository by running the following command from a UNIX terminal.

git clone https://github.com/shubham-pawar/phishing-detection

This will download the code that trains the phishing detector, as well as the training data required for that operation.

You should also install scikit-learn, a collection of machine learning tools written in Python. Installation instructions are available in the scikit-learn documentation. On a UNIX machine configured with pip, the simplest way is to run:

pip install -U scikit-learn

Once you have installed scikit-learn, you can check whether the library is correctly set up by typing the following in a Python shell:

import sklearn

If the command runs with no error, you are ready to train the phishing detector!
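If you also want to know which version was installed, a slightly longer check is sketched below. It additionally imports DecisionTreeClassifier, the class mentioned later in this tutorial, to confirm that the tree module loads; the variable name installed_version is only illustrative.

```python
import sklearn
# Importing the classifier confirms that the tree module is usable too.
from sklearn.tree import DecisionTreeClassifier

installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```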

Phishing Website Dataset

In this tutorial, we use a dataset of phishing websites that is publicly available in the machine learning repository provided by UCI. You don't have to download the dataset yourself: it is included in this repository (the dataset.csv file) and was downloaded to your machine when you cloned the repository.

The dataset was collected by analyzing 2,456 websites, some of which were used for phishing and some of which were not. For each website in the dataset, 30 attributes are given.

Each website in the dataset is labeled -1 if it is not a phishing website and 1 if it is used for phishing.
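To make this layout concrete, here is a minimal sketch of how such a file could be split into attributes and labels with NumPy. It runs on a three-row stand-in rather than dataset.csv itself, and it assumes that each row holds the 30 attributes followed by the label in the last column; that column order, the attribute values and the variable names are assumptions made for illustration, not taken from the repository.

```python
import io

import numpy as np

# Tiny stand-in for dataset.csv: 30 attribute columns plus one label column.
# The label-in-last-column layout and the values are assumptions.
rows = [
    [1] * 30 + [1],     # labeled 1: phishing
    [-1] * 30 + [-1],   # labeled -1: not phishing
    [0] * 30 + [1],     # labeled 1: phishing
]
sample_csv = "\n".join(",".join(str(value) for value in row) for row in rows)

# With the real file, io.StringIO(sample_csv) would be replaced by "dataset.csv".
data = np.genfromtxt(io.StringIO(sample_csv), delimiter=",", dtype=np.int32)

features = data[:, :-1]   # one row per website, 30 attributes each
labels = data[:, -1]      # -1 or 1

phishing_count = int(np.sum(labels == 1))
legitimate_count = int(np.sum(labels == -1))
print(features.shape, phishing_count, legitimate_count)
```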

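To train and evaluate the decision tree, run the following command from the root of the cloned repository:

python decision_tree.py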

This will first train the decision tree on 2,000 websites, then use the trained model to predict whether each of the remaining 456 websites is used for phishing (these websites are not seen during training). The accuracy of the model on this testing data should be close to 90.5%. Here is the output of a sample run of the script, which reports an accuracy of about 0.906.

Tutorial: Training a decision tree to detect phishing websites
Training data loaded.
Decision tree classifier created.
Beginning model training.
Model training completed.
Predictions on testing data computed.
The accuracy of your decision tree on testing data is: 0.906129210381

To understand how this was done, you can read the line-by-line comments in the decision_tree.py file.
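For a quick overview of the steps printed above, here is a minimal sketch of the same train-then-test flow. It is not the code from decision_tree.py: it uses randomly generated data with a made-up labeling rule so that it runs without the dataset, and it assumes the first 2,000 rows are used for training and the last 456 for testing. The accuracy it prints therefore says nothing about the phishing dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset: 2,456 rows with 30 attributes each.
rng = np.random.default_rng(0)
n_samples, n_features, n_train = 2456, 30, 2000
X = rng.choice([-1, 0, 1], size=(n_samples, n_features))
# Hypothetical labeling rule, only there so the tree has a pattern to learn.
y = np.where(X[:, 0] + X[:, 1] + X[:, 2] > 0, 1, -1)

# Assumed split: first 2,000 rows for training, remaining 456 for testing.
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Create and train the decision tree (random_state fixed for repeatability).
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)

# Predict on websites that were not seen during training.
predictions = classifier.predict(X_test)

# Accuracy is the fraction of test predictions that match the true labels.
accuracy = float(np.mean(predictions == y_test))
print("Accuracy on testing data:", accuracy)
```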

What next?

You can try to improve the accuracy of this simple classifier by changing some of the model's default parameter values in the decision_tree.py file. To learn more about the parameters that you can set when calling DecisionTreeClassifier(), take a look at the scikit-learn documentation.
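As a starting point, the sketch below compares a few parameter settings side by side. max_depth, min_samples_leaf and criterion are real DecisionTreeClassifier parameters, but the specific values are arbitrary picks for illustration, and the data is synthetic with a made-up labeling rule, so the scores will differ from what you get on dataset.csv.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (see the assumptions in the paragraph above).
rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], size=(2456, 30))
y = np.where(X[:, 0] + X[:, 1] + X[:, 2] > 0, 1, -1)
X_train, y_train = X[:2000], y[:2000]
X_test, y_test = X[2000:], y[2000:]

# Illustrative settings to compare; the values are arbitrary.
candidate_settings = [
    {},                                                # scikit-learn defaults
    {"max_depth": 5},                                  # limit the tree depth
    {"criterion": "entropy", "min_samples_leaf": 10},  # other split measure, larger leaves
]

results = []
for params in candidate_settings:
    model = DecisionTreeClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the given test data.
    results.append((params, model.score(X_test, y_test)))

for params, accuracy in results:
    print(params, round(accuracy, 3))
```

Comparing settings on the same held-out websites keeps the comparison fair. If you tune many settings this way, consider holding back a separate set of websites for the final accuracy estimate.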

Credits

The credits for this code go to npapernot. I've merely created a wrapper to get people started.
