C. Bayan Bruss

Capital One

Applying Deep Graph Representation Learning to the Malware Graph (pdf, video)

Malware is widespread, growing in both ubiquity and diversity. This poses significant challenges for detecting, classifying, and understanding new malware as it is observed. The static and dynamic attributes of a specific malware sample only tell you so much, and the sheer scale of the problem makes in-depth investigation of every sample impossible. However, when malware samples are viewed as nodes in a heterogeneous graph, with edges connecting them to IP addresses, URLs, domains, etc., topological information can provide context beyond the individual node attributes. While this can be a powerful visualization and investigation tool, graph structures are very sparse and high dimensional, which makes them challenging for ML tasks such as node classification, similarity search, and clustering. In recent years, a variety of graph embedding techniques have gained popularity for learning lower-dimensional vector representations of graphs in a way that encodes topology, node and edge attributes, and neighborhood statistics. In this presentation we apply common graph embedding techniques (e.g. DeepWalk, GraphSAGE) to the malware graph. We investigate the outputs of these models to gauge their ability to learn meaningful representations in this domain and on this graph, and we share lessons on applying graph ML techniques to large malware networks.
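As a rough illustration of the idea in the abstract, the sketch below builds a tiny heterogeneous graph and generates the uniform random walks that DeepWalk uses as its training corpus. It is not the speaker's implementation: the edge list, the node naming scheme (`malware:`, `ip:`, `domain:`, `url:` prefixes), and the helper names `build_adjacency` and `random_walks` are all assumptions made for this example, and the data is invented.

```python
import random
from collections import defaultdict

# Hypothetical toy edges linking malware samples to indicators.
# All names and values are made up for illustration; this is not real data.
EDGES = [
    ("malware:a", "ip:192.0.2.1"),
    ("malware:a", "domain:example.test"),
    ("malware:b", "domain:example.test"),
    ("malware:b", "url:http://example.test/x"),
    ("malware:c", "ip:192.0.2.1"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> sorted list of neighbors."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return {node: sorted(neighbors) for node, neighbors in adj.items()}


def random_walks(adj, walks_per_node=2, walk_length=5, seed=0):
    """Generate uniform random walks starting from every node.

    Each step moves to a uniformly chosen neighbor of the current node.
    Every node here comes from an edge, so it has at least one neighbor.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks


adjacency = build_adjacency(EDGES)
walks = random_walks(adjacency)
```

In DeepWalk, walks like these are treated as sentences and passed to a skip-gram model, so that nodes appearing in similar walk contexts (for example, two malware samples sharing a domain) end up with nearby vectors. The embedding step is omitted here to keep the sketch dependency-free.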