PyTorch implementation of the paper *Improving Phishing URL Detection via Transformers* (URLTran)
The paper used ~1.8M URLs (a 90/10 split of benign vs. malicious). There are a few places to gather malicious URLs; my recommendation is to do the following:
- OpenPhish provides 500 malicious URLs for free in TXT form. You can access that data here.
- Likewise, PhishTank is an excellent resource that provides a daily feed of malicious URLs in CSV or JSON format. You can gather ~5K through the following link.
- Finally, there is an excellent open-source project, Phishing.Database, run by Mitchell Krog. There is a ton of data available here to plus up your dataset.
I gathered benign URL data via two methods. The first was to use the top 50K domains from Alexa.
Next, I used my own Chrome browser history to get an additional 60K URLs. This was easy to do on my MacBook: first, make sure the browser is closed, then run the following command in your terminal:
```bash
/usr/bin/sqlite3 -csv -header ~/Library/Application\ Support/Google/Chrome/Default/History "SELECT urls.id, urls.url FROM urls JOIN visits ON urls.id = visits.url LEFT JOIN visit_source ON visits.id = visit_source.id ORDER BY last_visit_time ASC;" > history.csv
```
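With the malicious feeds and benign sources in hand, a minimal sketch for assembling a single labeled dataset could look like this (openphish.txt, phishtank.csv, alexa_top50k.csv, and urls_labeled.csv are placeholder file names, not files shipped with this repo):

```python
import pandas as pd

# Placeholder file names -- point these at wherever you saved each feed.
malicious = pd.concat([
    pd.read_csv("openphish.txt", names=["url"]),    # OpenPhish TXT feed: one URL per line
    pd.read_csv("phishtank.csv", usecols=["url"]),   # PhishTank daily CSV feed ('url' column)
])
benign = pd.concat([
    pd.read_csv("alexa_top50k.csv", names=["rank", "url"])[["url"]],  # Alexa top-sites CSV
    pd.read_csv("history.csv", usecols=["url"]),     # Chrome history export from the command above
])

malicious["label"] = 1
benign["label"] = 0

# Combine, de-duplicate, and shuffle into one training file.
dataset = (pd.concat([malicious, benign])
             .drop_duplicates(subset="url")
             .sample(frac=1.0, random_state=42)
             .reset_index(drop=True))
dataset.to_csv("urls_labeled.csv", index=False)
```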
All training hyperparameters were taken from the URLTran paper.
Masked Language Modeling (MLM) is a commonly used pre-training task for transformers. The task consists of randomly selecting a subset of tokens to be replaced by a special ‘[MASK]’ token; the model is then trained to minimize the cross-entropy loss of predicting the correct tokens at the masked positions. The original BERT paper uses the following methodology for [MASK] selection (a sketch of the procedure follows the list):
- 15% of the tokens are uniformly selected for masking
- Of those:
  - 80% are replaced with the [MASK] token
  - 10% are left unchanged
  - 10% are replaced by a random vocabulary token at each iteration
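In practice, HuggingFace's `DataCollatorForLanguageModeling` applies these rules for you; the sketch below spells out the selection logic explicitly (the function name and signature are illustrative, not taken from this repo's pre-training code):

```python
import torch

def mask_tokens(input_ids, tokenizer, mlm_probability=0.15):
    """Apply BERT-style 80/10/10 masking to a batch of token ids (illustrative sketch)."""
    labels = input_ids.clone()

    # Uniformly select 15% of positions, never masking special tokens like [CLS]/[SEP].
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special_tokens_mask = torch.tensor(
        [tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
         for ids in labels.tolist()],
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # cross-entropy is only computed at masked positions

    # 80% of the selected tokens become [MASK].
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[replaced] = tokenizer.mask_token_id

    # Half of the remainder (10% of the selected tokens) become a random vocabulary token.
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~replaced
    random_tokens = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    input_ids[randomized] = random_tokens[randomized]

    # The final 10% of the selected tokens are left unchanged.
    return input_ids, labels
```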
```python
# Input from mlm.py:
url = "huggingface.co/docs/transformers/task_summary"
input_ids, output_ids = predict_mask(url, tokenizer, model)

# Output:
# Masked Input: [CLS]huggingface.co[MASK]docs[MASK]transformers/task_summary[SEP]
# Predicted Output: [CLS]huggingface.co/docs/transformers/task_summary[SEP]
```
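For reference, here is one way a `predict_mask` helper could be written; this is an illustrative sketch assuming a HuggingFace MLM model, not necessarily the exact implementation in mlm.py:

```python
import torch

def predict_mask(url, tokenizer, model, mask_probability=0.15):
    """Mask a few tokens of a URL and let the MLM head fill them back in (sketch)."""
    encoding = tokenizer(url, return_tensors="pt")
    input_ids = encoding["input_ids"].clone()

    # Randomly mask ~15% of the non-special tokens in the URL.
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(input_ids[0].tolist(), already_has_special_tokens=True),
        dtype=torch.bool,
    )
    probs = torch.full(input_ids.shape, mask_probability)
    probs[0, special] = 0.0
    masked = torch.bernoulli(probs).bool()
    input_ids[masked] = tokenizer.mask_token_id

    # Predict every position and keep the original tokens where nothing was masked.
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits
    output_ids = logits.argmax(dim=-1)
    output_ids[~masked] = encoding["input_ids"][~masked]

    print("Masked Input:", tokenizer.decode(input_ids[0]))
    print("Predicted Output:", tokenizer.decode(output_ids[0]))
    return input_ids, output_ids
```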
The fine-tuning step lives in classifier.py.
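A minimal sketch of what the fine-tuning step looks like, assuming the placeholder urls_labeled.csv from above and a stock bert-base-uncased sequence-classification head (classifier.py may differ in its details):

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: fine-tune a standard BERT checkpoint; the repo may use a custom-vocab model instead.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

df = pd.read_csv("urls_labeled.csv")  # placeholder file produced in the data-gathering step
encodings = tokenizer(list(df["url"]), truncation=True, padding=True,
                      max_length=128, return_tensors="pt")
labels = torch.tensor(df["label"].values)

loader = DataLoader(TensorDataset(encodings["input_ids"],
                                  encodings["attention_mask"], labels),
                    batch_size=32, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One pass over the data; extend with epochs, eval, and a scheduler as needed.
model.train()
for input_ids, attention_mask, batch_labels in loader:
    optimizer.zero_grad()
    loss = model(input_ids=input_ids, attention_mask=attention_mask,
                 labels=batch_labels).loss
    loss.backward()
    optimizer.step()
```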
There are a few different variations I need to complete:
- Vary the number of layers between {3, 6, 12} for URLTran_CustVoc.
- Vary the number of tokens per input URL sequence between {128, 256}.
- Use both a byte-level and a character-level BPE tokenizer with 1K- and 10K-sized vocabularies (see the tokenizer-training sketch below).
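For the tokenizer variations, HuggingFace's tokenizers package can train both flavours directly. A sketch, assuming a hypothetical urls.txt corpus with one URL per line:

```python
import os
from tokenizers import ByteLevelBPETokenizer, CharBPETokenizer

corpus = ["urls.txt"]  # hypothetical corpus file: one URL per line
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# Byte-level BPE with a 1K vocabulary (swap in 10_000 for the larger variant).
os.makedirs("tokenizers/byte_bpe_1k", exist_ok=True)
byte_bpe = ByteLevelBPETokenizer()
byte_bpe.train(files=corpus, vocab_size=1_000, special_tokens=special_tokens)
byte_bpe.save_model("tokenizers/byte_bpe_1k")

# Character-level BPE with a 10K vocabulary.
os.makedirs("tokenizers/char_bpe_10k", exist_ok=True)
char_bpe = CharBPETokenizer()
char_bpe.train(files=corpus, vocab_size=10_000, special_tokens=special_tokens)
char_bpe.save_model("tokenizers/char_bpe_10k")
```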