Detecting Unexpected Network Flows with Streaming Graph Clustering

Andrew Fast

When assessing questionable network traffic, network security practitioners focus on answering the classic investigative questions such as What? When? and Where?. Traditional IDS rules and malware file analysis (both static and dynamic) address the question of “What is in the traffic (and is it malicious)?”, with file analysis being a prominent target of machine learning approaches. Network traffic analysis is a complementary approach to threat detection which answers the question: “Where did this traffic come from (and is it malicious)?”.

One natural approach to answering the question of "Where?" is viewing network traffic as a mathematical graph. There is a vast literature on graph processing that is valuable for network security including algorithms for finding connected components and community detection. In a network security context, these approaches can add evidence to a hypothesis of malicious traffic by first grouping individual nodes based on shared traffic patterns and then identifying unexpected connections between hosts in different sub-regions of the network.

Traditionally, graph processing has been performed using batch algorithms, which allow the data to pool up before processing the entire, finite sample of data. Batch processing of graphs can require a significant amount of engineering effort including the addition of a graph database, Spark’s GraphX or other specialized data store to the standard processing pipeline. Regardless of the chosen platform, the required engineering complexity often grows as the graphs grow in size.

Streaming processing is an alternative computational paradigm with reduced storage requirements compared to a batch approach. In a streaming model, processing occurs one data point at a time, limiting the amount of data that needs to be stored. Unfortunately, this reduction in storage space means a trade-off of needing new algorithms and approaches specifically designed for streaming data. In this talk, we describe a streaming approach to graph clustering. Motivated by the challenge of using machine learning to selectively record full network packets, we show how streaming graph clustering can be applied to network traffic to identify unexpected and potentially malicious flows in near-real time.