Apoorva Joshi
FireEye
Using Lexical Features for Malicious URL Detection- A Machine Learning Approach (pdf, paper, video)
Background: Malicious websites are responsible for a majority of the cyber-attacks and scams today. Malicious URLs are delivered to unsuspecting users via email, text messages, pop-ups or advertisements. Clicking on or crawling such URLs can result in compromised email accounts, launching of phishing campaigns, download of malware, spyware and ransomware, as well as severe monetary losses.
Method: A machine-learning-based ensemble classification approach is proposed to detect malicious URLs in emails; it can be extended to other delivery methods for malicious URLs. The approach uses static lexical features extracted from the URL string, on the assumption that these features differ notably between malicious and benign URLs. Using such static features is safer and faster because it involves neither crawling the URLs nor blacklist lookups, both of which tend to add significant latency to producing verdicts.
The dataset consists of a total of 5 million malicious and benign URLs obtained from various sources, including online feeds like OpenPhish, Alexa whitelists and internal FireEye databases. A 50-50 split was maintained between malicious and benign URLs so that both kinds were well represented in the dataset. Compact feature vector representations were generated for the URLs, consisting of 1000 trigram-based features encoded with MurmurHash and 23 lexical features derived from the URL string. The tools used to generate the feature representations were NLTK (a popular NLP Python package), mmh3 (a MurmurHash Python package) and urllib (a Python library for parsing URLs).
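As a rough illustration of the hashed trigram representation, the sketch below maps the trigrams of a URL into a fixed-length count vector of 1000 buckets. It is a minimal sketch under stated assumptions, not the pipeline used in this work: it assumes character-level trigrams (the abstract does not say whether trigrams are over characters or tokens), lowercases the URL, and uses plain slicing rather than NLTK. The helper names are hypothetical, and the hashlib fallback exists only so the example runs where mmh3 is not installed.

```python
import hashlib

try:
    import mmh3  # MurmurHash bindings, the package named in the text
except ImportError:
    mmh3 = None

N_BUCKETS = 1000  # matches the 1000 trigram-based features in the text


def bucket(trigram, n_buckets=N_BUCKETS):
    """Map a trigram to a bucket index in [0, n_buckets)."""
    if mmh3 is not None:
        # mmh3.hash returns a signed 32-bit int; Python's % keeps it non-negative.
        return mmh3.hash(trigram) % n_buckets
    # Fallback so the sketch runs without mmh3; this is NOT MurmurHash.
    digest = hashlib.md5(trigram.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_buckets


def char_trigrams(url):
    """Return all overlapping character trigrams (assumed granularity)."""
    return [url[i:i + 3] for i in range(len(url) - 2)]


def hashed_trigram_vector(url, n_buckets=N_BUCKETS):
    """Count trigrams per hash bucket, giving a fixed-length vector."""
    vec = [0] * n_buckets
    for tri in char_trigrams(url.lower()):
        vec[bucket(tri, n_buckets)] += 1
    return vec
```

Hashing keeps the vector size fixed regardless of how many distinct trigrams appear in the data, at the cost of occasional collisions between unrelated trigrams.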
The lexical features used for modelling include the lengths of the URL, domain and parameters; the numbers of dots, delimiters, subdomains and queries in the URL; the presence of suspicious Top Level Domains (TLDs); and the similarity of the domain name to Alexa whitelist domains, among others. The resulting feature vectors for malicious URL strings were observed to differ significantly from those for benign URL strings.
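A few of the lexical features listed above can be sketched with the standard library's urllib.parse. This is an illustrative subset, not the 23 features used in this work. The TLD set and delimiter characters are hypothetical placeholders, the subdomain count naively treats the last two host labels as the registered domain, and the function assumes the URL includes a scheme such as http://.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical placeholders; the actual lists are not given in the text.
SUSPICIOUS_TLDS = {"tk", "xyz", "top"}
DELIMITERS = "-_?=&/"


def lexical_features(url):
    """Extract a small, illustrative set of lexical features from a URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = [part for part in host.split(".") if part]
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "query_length": len(parsed.query),
        "num_dots": url.count("."),
        "num_delimiters": sum(url.count(d) for d in DELIMITERS),
        # Naive: assumes the last two labels form the registered domain.
        "num_subdomains": max(len(labels) - 2, 0),
        "num_query_params": len(parse_qs(parsed.query, keep_blank_values=True)),
        "suspicious_tld": int(bool(labels) and labels[-1] in SUSPICIOUS_TLDS),
    }
```

For example, `lexical_features("http://login.secure.example.tk/a/b?id=1&x=2")` reports two subdomains, two query parameters and a suspicious TLD under these placeholder definitions.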
The goal of the classification was to achieve high sensitivity, that is, to detect as many malicious URLs as possible. URL strings tend to be unstructured and noisy, so bagging algorithms were found to be a good fit for the task: they average the predictions of multiple learners trained on different samples of the training data, which reduces variance. A Random Forest with Decision Tree estimators was therefore chosen as the classification model.
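The following sketch shows how a Random Forest of this kind might be trained with scikit-learn. It is only an assumption that scikit-learn resembles the tooling used here, since the abstract does not name the modelling library. The data is synthetic (two Gaussian clusters standing in for benign and malicious feature vectors), and the hyperparameters are illustrative defaults rather than the values used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class = 1000
n_features = 20  # stand-in for the real 1023-dimensional vectors

# Synthetic stand-in data: benign rows centred at 0, malicious rows at 1.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_class, n_features)),
    rng.normal(1.0, 1.0, size=(n_per_class, n_features)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)  # 1 = malicious

# Stratify to keep the 50-50 class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each tree is fit on a bootstrap sample; predictions are aggregated.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
```

Because each tree sees a different bootstrap sample and a random subset of features at each split, the errors of individual trees are partly decorrelated, which is what lets aggregation reduce variance.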
Results: The classification model was tested on five different testing sets, consisting of 200k URLs each. The model produced an average False Negative Rate (FNR) of 0.1%, average accuracy of 92% and average AUC of 0.98. The model is presently being used in the FireEye Advanced URL Detection Engine (used to detect malicious URLs in emails), to generate fast real-time verdicts on URLs. The malicious URL detections from the engine have gone up by 22% since the deployment of the model into the engine workflow.
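For reference, the False Negative Rate and accuracy quoted above can be computed from predictions as in the sketch below. The labels here are a tiny made-up example chosen for illustration, not data from this work, and the function names are hypothetical. Sensitivity is the complement of the FNR.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, fp, tn), treating label 1 as malicious."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def false_negative_rate(y_true, y_pred):
    """Fraction of malicious URLs that the model missed: FN / (FN + TP)."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred)
    return fn / (fn + tp)


def accuracy(y_true, y_pred):
    """Fraction of all URLs classified correctly."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fn + fp + tn)


# Made-up labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

fnr = false_negative_rate(y_true, y_pred)  # 1 missed out of 4 malicious
acc = accuracy(y_true, y_pred)             # 5 correct out of 8
sensitivity = 1 - fnr
```

A model tuned for high sensitivity can have a very low FNR while its accuracy is noticeably lower, because accuracy also counts benign URLs flagged as malicious.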
Conclusion: The results obtained show noteworthy evidence that a purely lexical approach can be used to detect malicious URLs.