Phish Language Processing (PhishLP)

Santhosh Kumar Ramachandran

As attacks become more sophisticated, detecting them poses many challenges. Spear phishing is one such attack, targeting a specific organisation or individual. Spear phishing attacks are launched by highly trained individuals with good knowledge of the target. Most spear phishing attacks are multi-staged: a reconnaissance email without a malicious link or attachment is sent first, followed by the actual malicious email. Conventional signature-based engines fail to detect spear phishing emails because their patterns change frequently.

PhishLP is a signature-less engine based on Natural Language Processing (NLP) that can understand the context of an email and categorize its risk accordingly.

The traditional text classification approach tokenizes words and vectorizes them to obtain a feature vector. But such feature vectors do not capture positional information about the words. The classical approach also fails to place the feature vectors of sentences with similar meanings close to each other in vector space. Our solution is to encode each whole sentence as a fixed-length vector using an open-source sentence encoder and classify each sentence in the email body into one of four categories:

• Info
• Threat
• Action
• Spam
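
The limitation of word-count features described above can be seen in a minimal sketch. It is not the PhishLP implementation: the whitespace tokenizer, the `bag_of_words` helper, and the two example sentences are assumptions made purely for illustration. Two sentences that request opposite orderings of events end up with the same count vector, which is the kind of positional information a sentence-level encoding is meant to preserve.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Hypothetical helper: whitespace-tokenize and count tokens.

    Counting tokens discards their order in the sentence.
    """
    return Counter(sentence.lower().split())


# Illustrative sentences (not from any real dataset): same words, different order.
a = "send the invoice before you pay"
b = "pay before you send the invoice"

# The count vectors are identical even though the sentences differ.
same_vector = bag_of_words(a) == bag_of_words(b)
print(same_vector)
```

A classifier fed only these counts cannot tell the two sentences apart, which motivates encoding the whole sentence instead.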

Emails containing sentences with a high probability of Threat or Action and received from an external/less-credible domain are marked as phishing/spam.
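
The rule above could be sketched as follows. This is one possible reading, not the actual PhishLP logic: the function name `is_suspicious`, the 0.8 threshold, the allow-list of trusted domains, and the example probabilities are all assumptions, and the per-sentence probabilities are taken as already produced by some classifier.

```python
from typing import Dict, List, Set


def is_suspicious(
    sentence_probs: List[Dict[str, float]],
    sender_domain: str,
    trusted_domains: Set[str],
    threshold: float = 0.8,  # assumed cut-off; the abstract gives no value
) -> bool:
    """Flag an email when any sentence is likely Threat or Action
    and the sender is outside the trusted (internal) domains."""
    external = sender_domain.lower() not in trusted_domains
    risky = any(
        max(p.get("Threat", 0.0), p.get("Action", 0.0)) >= threshold
        for p in sentence_probs
    )
    return external and risky


# Hypothetical per-sentence category probabilities for one email.
probs = [
    {"Info": 0.90, "Threat": 0.03, "Action": 0.05, "Spam": 0.02},
    {"Info": 0.05, "Threat": 0.05, "Action": 0.88, "Spam": 0.02},
]
trusted = {"example.com"}

print(is_suspicious(probs, "unknown-sender.test", trusted))
print(is_suspicious(probs, "example.com", trusted))
```

Requiring both conditions keeps internal mail that merely asks for an action from being flagged; a production engine would likely use a graded domain credibility score rather than a binary allow-list.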

Using historical spam and phishing emails, we trained a DNN-based model, which classified sentences with an accuracy of 99.5%. Our experiment helps us develop a context-based phishing classification engine that can adapt to future threats.