David Krisiloff and Scott Coull

Structure and Semantics-Aware Malware Classification with Vision Transformers (pdf, video)

Research on static malware classifiers has generally explored two extremes: (i) hand-crafted features painstakingly created by experts and (ii) deep learning architectures that operate directly on the raw-byte representation of the binary. Broadly speaking, byte-based approaches have struggled to achieve the performance of traditional machine learning models leveraging expert features despite extensive exploration of architectures and training regimes. In this paper, we suggest that there exists a rich, unexplored continuum of expert knowledge that lies between entirely human-driven features and data-driven representation learning using deep neural networks, which can be leveraged to achieve better trade-offs between architecture flexibility and development costs. Specifically, we consider whether providing the model with explicit structural and semantic hints, at varying degrees of specificity, increases the performance of deep learning-based classifiers. To evaluate the impact of the structural and semantic information, we consider three distinct Windows PE malware datasets, ranging from 800K samples (i.e., EMBER) to a full production-grade malware dataset containing more than 100M unique samples. The results of our analysis indicate that incorporating lightweight structural information, such as PE file sections, directly into the architecture allows deep learning-based models to match the performance of traditional malware classifiers for the first time, achieving performance equivalent to a commercial malware classifier deployed to millions of endpoints. Our evaluation further analyzes the impact of semantic information, such as parsing errors, as well as the effects of training set size and the models' robustness to adversarial evasion, revealing novel insights into the value of integrating expert knowledge into the architecture of deep learning systems.
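To make "lightweight structural information, such as PE file sections" concrete, the following is a minimal sketch of one way a byte-based model's input could be annotated with section boundaries. It is an illustration only and is not taken from the paper: the function names `parse_pe_sections` and `label_patches`, the fixed `patch_size`, and the convention that label 0 means "outside any section" are all assumptions made here. The header offsets follow the published PE/COFF layout.

```python
import struct


def parse_pe_sections(data: bytes):
    """Return (name, raw_offset, raw_size) for each entry in the PE section table."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ header")
    # e_lfanew (offset 0x3C of the DOS header) points at the PE signature.
    (pe_off,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    coff = pe_off + 4  # 20-byte COFF file header follows the signature
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    (opt_size,) = struct.unpack_from("<H", data, coff + 16)
    table = coff + 20 + opt_size  # section table follows the optional header
    sections = []
    for i in range(num_sections):
        base = table + 40 * i  # each section header is 40 bytes
        name = data[base:base + 8].rstrip(b"\0").decode("ascii", "replace")
        # SizeOfRawData at +16, PointerToRawData at +20
        raw_size, raw_ptr = struct.unpack_from("<II", data, base + 16)
        sections.append((name, raw_ptr, raw_size))
    return sections


def label_patches(data: bytes, patch_size: int = 256):
    """Tag each fixed-size byte patch with a 1-based section index.

    Hypothetical convention: 0 means the patch starts outside every
    section (headers, padding, or overlay data).
    """
    sections = parse_pe_sections(data)
    labels = []
    for start in range(0, len(data), patch_size):
        label = 0
        for idx, (_, off, size) in enumerate(sections, start=1):
            if off <= start < off + size:
                label = idx
                break
        labels.append(label)
    return labels
```

Per-patch labels of this kind could, for example, be fed to a transformer as an additional embedding alongside the byte content of each patch. Whether this resembles the architecture evaluated in the paper is not something the abstract states.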