Xigao Li, Stony Brook University; David Krisiloff, Mandiant; and Scott Coull, Mandiant
Lightweight, Emulation-Assisted Malware Classification
Antivirus systems provide the first line of defense against malware with on-device detection, prevention, and remediation using rules, a machine learning (ML) model, or a combination of both. Typical on-device ML-based detection systems use features derived from static file analysis, since this enables pre-execution prevention of malware. However, malware often uses techniques such as packing to avoid static detection, and several tools exist for automating these evasions. Alternatively, dynamic analysis in a sandbox inspects and records program execution at runtime, but it is usually performed in the cloud on a virtual machine, which requires significantly more time and computational resources than static analysis. Binary emulation provides an important middle ground, allowing for scalable, on-device program execution while remaining robust to static obfuscation or packing methods.
In our work, we explored using an open-source, lightweight emulation tool, SpeakEasy, to collect data for ML-based malware tasks. Running a PE file in an emulator allows us to observe the sequence of external API calls being made, memory accesses and allocations, files written, and network activity. This data provides rich features for modeling, assuming the emulation runs without issue. At the same time, the emulation framework provides a great deal of flexibility to control the trade-off among high-fidelity emulation, computational requirements, and detection speed. To this end, our experiments address several important challenges in operationalizing emulation data, including normalizing emulation runs, handling unsupported API calls, featurizing a diverse array of execution information, and exploring several novel modeling options.
Our experiments focused on two classification tasks: malware detection and malware family prediction. Each file from the EMBER 2017 and EMBER 2018 datasets was emulated, and features were derived from the API call sequence and memory access telemetry. Two different modeling approaches were considered: gradient boosted trees and neural networks. For the gradient boosted trees, we constructed a bag-of-words/n-gram representation of the API call sequence and encoded the memory utilization as tabular features. Our neural network architecture instead treats the API call sequence as a sentence of words input to a 1D CNN. The tabular memory features are passed through a dense layer and then merged with the output of the CNN before final classification.
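As a rough illustration of the tree-model input described above, the following minimal sketch builds a bag-of-n-grams count vector from an API call sequence and appends tabular memory features. The function names, the fixed vocabulary, and the two memory statistics are hypothetical choices made for this example, not the feature set used in the experiments.

```python
from collections import Counter
from typing import List, Sequence


def api_ngrams(calls: Sequence[str], n_max: int = 2) -> Counter:
    """Count all contiguous n-grams (n = 1..n_max) in an API call sequence."""
    counts: Counter = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(calls) - n + 1):
            # Join the API names so each n-gram becomes a single token.
            counts[" ".join(calls[i:i + n])] += 1
    return counts


def to_row(calls: Sequence[str],
           memory_stats: Sequence[float],
           vocab: Sequence[str],
           n_max: int = 2) -> List[float]:
    """Build one tabular row: n-gram counts over a fixed vocabulary,
    followed by memory features (hypothetical layout)."""
    counts = api_ngrams(calls, n_max)
    ngram_part = [float(counts.get(token, 0)) for token in vocab]
    return ngram_part + [float(v) for v in memory_stats]


# Illustrative trace and vocabulary (assumed, not taken from EMBER).
calls = ["VirtualAlloc", "WriteProcessMemory", "VirtualAlloc", "CreateThread"]
vocab = ["VirtualAlloc", "CreateThread", "VirtualAlloc WriteProcessMemory"]
memory_stats = [3, 4096]  # e.g. allocation count and bytes allocated (assumed)

row = to_row(calls, memory_stats, vocab)
print(row)  # [2.0, 1.0, 1.0, 3.0, 4096.0]
```

Rows of this shape could be fed to any gradient boosting library. A fixed vocabulary keeps the column layout identical across files, which matters when runs of different lengths are compared.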
Overall, the results of our experiments demonstrate that emulation-based ML models can correct up to 50% of the classification errors induced by static analysis models, and that the combination of the two feature sets can provide even better performance than either one in isolation. Furthermore, we show that we can control the performance-detection trade-off by adjusting the number of emulation steps allowed, that creating a standardized emulation regime (e.g., number of steps run) is key to training consistent models, and that ‘faking’ unsupported API responses is a reasonable approach for eliciting continued program behavior. Finally, we discuss the computational cost of the emulation-based methods and compare it to the competing approaches of static-only and sandbox-based dynamic analysis. This comparison helps put our results and the emulation-based approach into context, offering an alternative option for on-device detection of otherwise evasive malware samples.
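The idea of ‘faking’ unsupported API responses can be sketched as a dispatcher that returns a generic value for any API it has no handler for, and records the call so the trace continues instead of ending at the first unknown import. Everything here is a hypothetical illustration: the handler table, the `FAKE_RETURN` value, and the `dispatch` function are assumptions for this example and are not SpeakEasy's actual interface.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical handler for one "supported" API; a real emulator
# would model the call's side effects in far more detail.
def _get_tick_count(args: Tuple) -> int:
    return 123456


# Hypothetical table of APIs the emulator knows how to model.
SUPPORTED: Dict[str, Callable[[Tuple], int]] = {
    "GetTickCount": _get_tick_count,
}

# Assumed generic "success" value handed back for unmodeled APIs.
FAKE_RETURN = 1


def dispatch(api_name: str, args: Tuple,
             trace: List[Tuple[str, bool]]) -> int:
    """Route an API call to its handler, or fake a response if unsupported.

    Each call is appended to `trace` with a flag saying whether it was
    actually modeled, so downstream features can tell the cases apart.
    """
    handler = SUPPORTED.get(api_name)
    if handler is None:
        trace.append((api_name, False))
        return FAKE_RETURN
    trace.append((api_name, True))
    return handler(args)


trace: List[Tuple[str, bool]] = []
dispatch("GetTickCount", (), trace)
dispatch("SomeUnmodeledApi", (), trace)  # hypothetical unsupported API
print(trace)  # [('GetTickCount', True), ('SomeUnmodeledApi', False)]
```

Keeping the supported/faked flag in the trace is a design choice of this sketch: it lets a model see which parts of the observed behavior followed from a fabricated return value.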