Philip Tully (FireEye), Matthew Haigh (Mandiant), Jay Gibble (FireEye), and Michael Sikorski (FireEye)

Learning to Rank Relevant Malware Strings Using Weak Supervision (pdf, slideshare, code, blog, video)

In static analysis, one of the most useful initial steps is to inspect a binary's printable characters via the Strings program. However, sifting relevant strings from irrelevant ones by hand is time consuming and prone to human error. Relevant strings occur disproportionately less often than irrelevant ones, larger binaries can produce many thousands of strings that quickly induce analyst fatigue, and the definition of "relevant" can vary significantly across individual analysts. Mistakes can lead to missed clues that would have reduced overall time spent performing malware analysis, or, even worse, to incomplete or incorrect investigatory conclusions.
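To make the starting point concrete, the sketch below approximates what the Strings program emits: runs of printable ASCII characters of length four or more (Strings' default minimum). It is a simplified stand-in for illustration, not the utility used in the work:

```python
import re
import sys

# Minimal stand-in for the Strings program: pull out runs of printable
# ASCII characters (length >= 4, mirroring Strings' default) from a binary.
MIN_LEN = 4
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)

def extract_strings(path):
    """Return the printable-ASCII strings found in the file at `path`."""
    with open(path, "rb") as f:
        data = f.read()
    return [match.decode("ascii") for match in PRINTABLE_RUN.findall(data)]

if __name__ == "__main__":
    for s in extract_strings(sys.argv[1]):
        print(s)
```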

To address these concerns, we present StringSifter: an open source, machine learning-based tool that automatically ranks strings. StringSifter is built to sit downstream from the Strings program; it takes a list of strings as input and returns those same strings ranked according to their relevance for malware analysis as output. StringSifter makes an analyst's life easier, allowing them to focus their attention on the most relevant strings, which appear toward the top of its ranked output.
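Conceptually, the tool's contract is simply "strings in, ranked strings out." The following sketch captures that interface with a hypothetical `relevance_score` callable standing in for the trained model; it is not StringSifter's actual API, and the toy scorer exists only to make the example runnable:

```python
from typing import Callable, List

def rank_strings(strings: List[str],
                 relevance_score: Callable[[str], float]) -> List[str]:
    """Return `strings` sorted from most to least relevant.

    `relevance_score` is a hypothetical stand-in for StringSifter's trained
    learning-to-rank model; any callable mapping a string to a score works here.
    """
    return sorted(strings, key=relevance_score, reverse=True)

# Toy heuristic scorer (NOT the real model): up-weight strings that look
# like URLs or registry keys, with a mild preference for longer strings.
def toy_score(s: str) -> float:
    score = 0.0
    if "http" in s.lower():
        score += 2.0
    if s.upper().startswith("HKEY_"):
        score += 2.0
    return score + min(len(s), 50) / 50.0

print(rank_strings(["MZ", "http://evil.example/c2", "kernel32.dll"], toy_score))
```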

StringSifter is trained on a sample of the 3.1 billion individual strings extracted from Strings program outputs of the 400k malicious PE files in the EMBER dataset. Strings are labeled with ordinal ranks obtained from Snorkel, a weak supervision framework that trains a generative model over SME-derived signatures, i.e. labeling functions, to resolve their underlying conflicts. For each string, Snorkel produces a probabilistic label that accounts for the accuracies and correlation structure of the labeling functions that voted on it. This data programming approach allows us to cheaply and rapidly annotate our large corpus of strings and incorporate the subject matter expertise of reverse engineers directly into our model.
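As a rough illustration of the data programming step, the sketch below uses the snorkel package's labeling API with a few toy labeling functions. These heuristics are invented for illustration and are not the SME-derived signatures used to train StringSifter; the released tool may also differ in Snorkel version and label granularity:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

# Illustrative classes only; StringSifter's ordinal rank scheme may differ.
ABSTAIN, IRRELEVANT, RELEVANT = -1, 0, 1

@labeling_function()
def lf_looks_like_url(x):
    return RELEVANT if "http://" in x.string or "https://" in x.string else ABSTAIN

@labeling_function()
def lf_common_library_name(x):
    # Library names appear in nearly every PE file, so they are rarely informative.
    return IRRELEVANT if x.string.lower().endswith(".dll") else ABSTAIN

@labeling_function()
def lf_too_short(x):
    return IRRELEVANT if len(x.string) < 5 else ABSTAIN

# Toy corpus of extracted strings; in practice this would be millions of rows.
df = pd.DataFrame({"string": ["https://evil.example/c2", "kernel32.dll", "MZ!@"]})
L = PandasLFApplier([lf_looks_like_url, lf_common_library_name, lf_too_short]).apply(df)

# The generative label model resolves conflicts among the labeling functions
# and emits a probabilistic label for each string.
label_model = LabelModel(cardinality=2)
label_model.fit(L)
probs = label_model.predict_proba(L)
```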

Weakly supervised labels together with transformed string features are then fed into several competing discriminative models with learning-to-rank (LTR) loss functions. LTR has historically been applied to problems like web search and recommendation engines, and we evaluate our models through the same lens, using the mean normalized discounted cumulative gain (nDCG). In our presentation, we will discuss how StringSifter achieves generalizable nDCG scores on holdout Strings datasets, and we will demonstrate the tool's predictions live on sample binaries. More broadly, we argue for weak supervision as a promising path forward for other label-starved problems plaguing cybersecurity.
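As one possible instantiation of the discriminative LTR stage (the abstract does not name the specific models compared), the sketch below trains LightGBM's LGBMRanker with a lambdarank objective on synthetic, grouped data and reports the mean nDCG over holdout groups via scikit-learn. Features, labels, and group sizes here are fabricated placeholders:

```python
import numpy as np
from lightgbm import LGBMRanker
from sklearn.metrics import ndcg_score

# Synthetic data: rows are feature vectors for individual strings, grouped by
# the binary they came from; y holds weakly supervised ordinal relevance ranks.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(300, 16)), rng.integers(0, 4, size=300)
X_test, y_test = rng.normal(size=(60, 16)), rng.integers(0, 4, size=60)
train_groups = [100, 100, 100]   # three training binaries, 100 strings each
test_groups = [30, 30]           # two holdout binaries

ranker = LGBMRanker(objective="lambdarank", n_estimators=50)
ranker.fit(X_train, y_train, group=train_groups)

# Mean nDCG over holdout binaries: score strings per binary, then compare the
# predicted ordering against the (weak) relevance labels.
scores, start = [], 0
for size in test_groups:
    sl = slice(start, start + size)
    scores.append(ndcg_score([y_test[sl]], [ranker.predict(X_test[sl])]))
    start += size
print("mean nDCG:", np.mean(scores))
```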