Ahead of the Curve: A Deeper Understanding of Network Threats Through Machine Learning
By Joy Nathalie Avelino, Jessica Patricia Balaquit, and Carmi Anne Loren Mora
Network threats are industry-agnostic when it comes to the risks they pose to enterprises. Now that cybercriminals are increasingly using evasion tactics to bypass rule-based detection methods, proactive techniques are needed to discover a malware infection before it leads to financial loss, reputational damage, or disruption of business operation. One approach to consider when addressing this concern is through network flow clustering enabled by the power of machine learning.
A flow is a “unidirectional stream of Internet Protocol (IP) packets that share a set of common properties: typically, the IP-five-tuple of protocol, source and destination IP addresses, source and destination ports.”[1] Discovering and analyzing different kinds of network anomalies requires examining flow data, because it contains information useful for analyzing the traffic composition of different applications and services in the network.
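To make the five-tuple definition concrete, the following is a minimal sketch of grouping packets into unidirectional flows. The packet dictionaries, their field names, and the `FlowKey` and `group_into_flows` names are assumptions made for illustration; they do not come from the research described here.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The IP five-tuple that identifies a unidirectional flow."""
    protocol: str
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


def group_into_flows(packets):
    """Group packet dicts into flows keyed by their five-tuple."""
    flows = defaultdict(list)
    for pkt in packets:
        key = FlowKey(pkt["protocol"], pkt["src_ip"], pkt["dst_ip"],
                      pkt["src_port"], pkt["dst_port"])
        flows[key].append(pkt)
    return dict(flows)


# Hypothetical packets: two from client to server, one reply in the
# opposite direction. The reply has a different five-tuple, so it
# belongs to a separate flow.
packets = [
    {"protocol": "TCP", "src_ip": "10.0.0.5", "dst_ip": "203.0.113.7",
     "src_port": 49152, "dst_port": 80, "size": 120},
    {"protocol": "TCP", "src_ip": "10.0.0.5", "dst_ip": "203.0.113.7",
     "src_port": 49152, "dst_port": 80, "size": 480},
    {"protocol": "TCP", "src_ip": "203.0.113.7", "dst_ip": "10.0.0.5",
     "src_port": 80, "dst_port": 49152, "size": 1500},
]

flows = group_into_flows(packets)

# Per-flow summary: (packet count, total bytes).
flow_summary = {key: (len(pkts), sum(p["size"] for p in pkts))
                for key, pkts in flows.items()}
```

Because a flow is unidirectional, the request and the reply of one connection end up as two separate entries in `flows`.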
Machine learning is then applied to cluster malicious network flows. The resulting clusters help analysts see how different malware families relate to one another and how they differ.
Network Threat Clustering Results on Exploit Kits
In research that used a semi-supervised model to cluster similar types of malicious network flows from the raw byte stream augmented with handcrafted features, Trend Micro was able to filter and classify a cluster composed entirely of exploit kit detections.
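The general idea of clustering flows from raw bytes plus handcrafted features can be sketched as follows. This is not the model used in the research: the byte histogram, the entropy feature, the simple "leader" clustering step, the threshold value, and the sample payloads are all assumptions chosen for illustration, and the semi-supervised aspect is not modeled here.

```python
import math
from collections import Counter


def byte_histogram(payload, bins=16):
    """Normalized histogram of raw byte values, bucketed into `bins` ranges."""
    counts = [0] * bins
    for b in payload:
        counts[b * bins // 256] += 1
    total = len(payload) or 1
    return [c / total for c in counts]


def scaled_entropy(payload):
    """Shannon entropy of the payload, divided by 8 bits to fall in [0, 1]."""
    if not payload:
        return 0.0
    total = len(payload)
    bits = -sum((c / total) * math.log2(c / total)
                for c in Counter(payload).values())
    return bits / 8.0


def featurize(payload):
    """Raw-byte features augmented with one handcrafted feature (entropy)."""
    return byte_histogram(payload) + [scaled_entropy(payload)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leader_cluster(vectors, threshold):
    """Assign each vector to the first cluster whose leader is within
    `threshold`; otherwise start a new cluster with it as leader."""
    leaders, assignments = [], []
    for v in vectors:
        for idx, leader in enumerate(leaders):
            if distance(v, leader) <= threshold:
                assignments.append(idx)
                break
        else:
            leaders.append(v)
            assignments.append(len(leaders) - 1)
    return assignments


# Hypothetical flow payloads: two near-identical HTTP requests and one
# payload with a uniform byte distribution.
payloads = [
    b"GET /index.php?id=1 HTTP/1.1\r\nHost: example.test\r\n\r\n",
    b"GET /index.php?id=2 HTTP/1.1\r\nHost: example.test\r\n\r\n",
    bytes(range(256)),
]

assignments = leader_cluster([featurize(p) for p in payloads], threshold=0.2)
```

With these inputs, the two HTTP requests fall into the same cluster and the uniform payload starts its own, without any rule written for either kind of traffic.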
To show how the machine learning model sees the network flow, Figure 1 uses different colors to represent the structural attributes determined by the features passed to the model. In a rule-based detection environment, one rule is created for each malware family to address the varying flow characteristics present in the network, and a change in network traffic can render a rule unusable unless it is modified. Machine learning can therefore be a key tool in successfully clustering network threats and providing insights on different network patterns from malicious traffic.
NOTE: Each color represents one characteristic.
Figure 1. Raw network data of each malware family
As we can see, the machine learning model was able to find similarities in the malicious network flows. From the multiple characteristics seen in each malware family, the model identified which ones make up a certain profile that correlates among the similar samples. Figure 2 shows an analogy of how the model sees the similar characteristics among the malware families.
Figure 2. Malicious network flows as seen by the clustering model
Initially, Blacole seemed like an outlier, as it was categorized as a Trojan and not specifically as an exploit kit in the dataset labelling. However, examination of its network traffic made it clear that the key similarity linking Blacole to the other exploit kits is that its malware routine took advantage of JS vulnerabilities. This means that in certain cases, we can arrive at a more specific description (exploit kit) than what the initial labelling provided (Trojan), and exploit kits can be identified without tailoring features to a specific attack instance.
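The Blacole observation suggests a simple review step: compare each sample's original label with the dominant label of its cluster and flag the mismatches for an analyst. The sketch below assumes cluster assignments are already available; the function name, the tuple layout, and the placeholder family names other than Blacole are hypothetical.

```python
from collections import Counter


def flag_label_mismatches(samples):
    """Given (name, cluster_id, label) tuples, return (name, label,
    dominant_label) for samples whose label differs from the most
    common label in their cluster."""
    by_cluster = {}
    for name, cluster_id, label in samples:
        by_cluster.setdefault(cluster_id, []).append((name, label))

    flagged = []
    for members in by_cluster.values():
        dominant, _ = Counter(label for _, label in members).most_common(1)[0]
        flagged.extend((name, label, dominant)
                       for name, label in members if label != dominant)
    return flagged


# Hypothetical cluster contents: placeholder exploit kit families plus
# Blacole, which the dataset labelled as a Trojan.
samples = [
    ("family_a", 0, "exploit kit"),
    ("family_b", 0, "exploit kit"),
    ("family_c", 0, "exploit kit"),
    ("Blacole", 0, "Trojan"),
]

flagged = flag_label_mismatches(samples)
```

A flagged sample is only a candidate for relabelling. As with Blacole, confirming the more specific description still depends on examining the sample's network traffic.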
Making Sense of the Insights Formed from Clustering via Machine Learning
As seen in our analysis of exploit kit detections, insights on different network patterns from malicious traffic can be obtained through clustering malicious network flows. Such insights can be useful to augment rule creation for detecting network malware.
The use of machine learning in this study showed how the technology can speed up the process of organizing large amounts of data and offer explanations that help analysts form conclusions and provide time-zero protection.