Is Big Data Enough for Machine Learning in Cybersecurity?

By Jon Oliver

Data is more pertinent and prevalent now than it has ever been: As consumers, we are now at the level of 2.5 quintillion bytes of data per day. Threat data is no exception: Cybercriminals add to its abundance as they continuously up their game, tweaking old threats and creating new ones to evade detection. To address the vast amounts of threat data, security providers turn to machine learning to automate processes and improve security solutions.

With the great diversity and volume of threat data available, machine learning is necessary to efficiently go through a dataset, learn from it, and help reinforce defenses against cyberthreats. The importance of the quantity of threat data is evident. But is data quantity the be-all and end-all of effective machine learning? Is a big dataset enough to strengthen cybersecurity defenses?

Data and machine learning in cybersecurity

The vast quantity of threat data available in the wild is due to the continuous growth in both the quantity and the sophistication of cybercriminal activity. Last year alone, the Trend Micro™ Smart Protection Network™ security infrastructure prevented over 65 billion threats from disrupting our customers’ environments.

Cybersecurity runs on threat data. Just as businesses study sales data to analyze what their customers want, cybersecurity vendors and researchers require threat data to know how best to handle incoming new information, such as when determining whether an unknown file is benign or malicious.

Fundamentally, machine learning requires data to be operational. Threat data is necessary to combat cyberattacks at zero-time, as in the case of far-reaching ransomware attacks that swept the globe last year and continue to affect organizations around the world. Ransomware variants already existing in the wild should be in a cybersecurity firm’s repository of threat data. Such historical threat data allows cybersecurity systems to predict and defend against future similar or modified threats.

Machine learning enables the clustering and analysis of colossal volumes of data, which would otherwise be impossible using traditional means. Threat data, and enough of it, is critical to a machine learning system’s success in cybersecurity solutions.

The threat data question: What makes big data better?

Big data and machine learning go hand in hand in cybersecurity. Threat data provides the necessary information for cybersecurity solutions to work effectively. A large threat dataset enables a machine learning system to spot a wider variety of threats, including variants, and to decide how best to mitigate them before they infect endpoints and networks. It appears that the more data a security vendor has, the better the threat intelligence it uses in defending against cyberattacks. This assertion warrants a closer look, and we have to ask: Are all datasets created equal?

While big data is essential for analysis, collecting and processing it might not only be difficult but also ineffective, especially if a large amount of the data proves to be “dirty.” Dirty data refers to data that has incomplete or erroneous information. Data cleansing, or data wrangling, is often necessary before big threat data can be analyzed: If a dataset has flawed formatting or labeling, or if it contains redundant or inaccurate data, machine learning systems may not process it optimally. The goal is for data to be usable by a system, and this task requires considerable threat expertise.

Data cleansing is one of the challenges in big data analysis: Cleaning dirty data before it can be used for accurate analysis is laborious. According to some estimates, 50 percent to 80 percent of a data scientist’s time is spent on data cleansing. And uncleaned, low-quality data is not just time-consuming to work with; it is also costly. One estimate puts the cost to the United States economy alone at US$3.1 trillion per year. It therefore needs to be emphasized: Machine learning is far more effective when it is provided with sanitized data.
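To make the idea of dirty data concrete, here is a minimal Python sketch of a cleansing pass over labeled threat samples. It is an illustration only, not a description of any vendor's pipeline: the `Sample` type, the `VALID_LABELS` set, and the `clean_samples` function are hypothetical names introduced for this example, and the rules (drop incomplete records, normalize labels, remove exact duplicates by content hash) cover only the simplest of the problems described above.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Assumed label vocabulary for this sketch; a real dataset may use more classes.
VALID_LABELS = {"benign", "malicious"}


@dataclass(frozen=True)
class Sample:
    content: bytes          # raw file or message bytes
    label: Optional[str]    # analyst-assigned label, possibly missing or messy


def clean_samples(samples: Iterable[Sample]) -> List[Sample]:
    """Drop incomplete, mislabeled, and exactly duplicated samples."""
    seen_digests = set()
    cleaned = []
    for sample in samples:
        # Incomplete records: no content or no label at all.
        if not sample.content or sample.label is None:
            continue
        # Flawed formatting: normalize case and stray whitespace in labels.
        label = sample.label.strip().lower()
        # Erroneous labels: anything outside the known vocabulary is discarded.
        if label not in VALID_LABELS:
            continue
        # Redundant data: keep only the first sample with a given content hash.
        digest = hashlib.sha256(sample.content).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        cleaned.append(Sample(sample.content, label))
    return cleaned
```

Note that this sketch keeps the first label it sees for a given piece of content. Deciding what to do when two copies of the same sample carry conflicting labels is exactly the kind of judgment that requires threat expertise rather than code.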

Trend Micro understands this fact about threat data. That is why we focus on both the quality and the quantity of the datasets that we collect and analyze using machine learning. Our years of security research have provided us with extensive and accurately labeled threat and malware data, as well as the expertise to continue accurately understanding and labeling new data. We also focus on ensuring the quality of training datasets to further optimize the performance of our machine learning systems.

One example of our work on bettering our big data is what we do for support vector machines (SVM) for emails. For machine learning technology to properly distinguish spam from legitimate emails, our machine learning models need to be trained using correctly labeled emails. Training and testing datasets are carefully processed to ensure emails are correctly classified and duplicates are removed. Duplicate data may result in skewed data, thereby influencing the resulting model and, in turn, causing false negatives and false positives. It is paramount that a constructed dataset satisfactorily represents the current email landscape and contains samples from all relevant sources.
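The effect of duplicates on a training set can be shown with a small Python sketch. It does not train an SVM; it only illustrates, under simple assumptions, how repeated copies of one spam message shift the class balance that a model would learn from. The `Email` tuple, the `normalize`, `deduplicate`, and `label_share` functions, and the "spam"/"ham" labels are all hypothetical choices made for this example.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# (body text, label); the labels "spam" and "ham" are assumed for this sketch.
Email = Tuple[str, str]


def normalize(body: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", body.strip().lower())


def deduplicate(emails: Iterable[Email]) -> List[Email]:
    """Keep the first email for each normalized body and drop the rest."""
    seen = set()
    unique = []
    for body, label in emails:
        key = normalize(body)
        if key in seen:
            continue
        seen.add(key)
        unique.append((body, label))
    return unique


def label_share(emails: Iterable[Email]) -> Dict[str, float]:
    """Return the fraction of emails carrying each label."""
    counts = Counter(label for _, label in emails)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

For instance, given three near-identical copies of one spam message and a single legitimate email, `label_share` reports 75 percent spam before deduplication and 50 percent after `deduplicate` is applied. Normalizing only case and whitespace is a deliberately crude notion of "duplicate"; real spam campaigns vary their text in ways this would not catch.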

The Trend Micro Smart Protection Network infrastructure correlates over 16 billion threat queries and analyzes more than 100 terabytes of threat data. To further our efforts in ensuring the quality of our datasets while addressing the challenges of a huge amount of data, we have been exploring projects that revolve around clustering. Clustering, the grouping of similar objects together using machine learning algorithms, allows us to automatically group malware into threat families. The resulting clusters can then be converted to actual solutions or patterns to protect our customers, and even used as high-quality datasets for further research. These use cases apply to files and network packets. Furthermore, the resulting clustered data provides valuable threat intelligence that we use to improve our existing solutions.
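As a rough illustration of what grouping similar objects means, the following Python sketch clusters byte strings by the overlap of their 3-byte n-grams. This is not the algorithm Trend Micro uses, which the article does not specify. The greedy single-pass approach, the Jaccard similarity measure, the 0.5 threshold, and the function names `ngrams`, `jaccard`, and `cluster` are all assumptions chosen to keep the example short.

```python
from typing import Dict, List, Set, Tuple


def ngrams(data: bytes, n: int = 3) -> Set[bytes]:
    """Return the set of overlapping n-byte substrings of data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: Set[bytes], b: Set[bytes]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(samples: Dict[str, bytes], threshold: float = 0.5) -> List[List[str]]:
    """Greedy one-pass clustering.

    Each sample joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new one.
    """
    clusters: List[Tuple[Set[bytes], List[str]]] = []
    for name, data in samples.items():
        features = ngrams(data)
        for representative, members in clusters:
            if jaccard(features, representative) >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append((features, [name]))
    return [members for _, members in clusters]
```

With toy inputs such as `{"a1": b"encrypt_files_v1", "a2": b"encrypt_files_v2", "b1": b"mine_coins_fast"}`, the two near-identical strings end up in one group and the third in its own. A greedy pass like this is order-dependent and compares each sample only against one representative per cluster, so it is best read as a sketch of the concept and not as a production approach.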

Trend Micro machine learning solutions

Even before the hype (specifically, since 2005), we have been using machine learning in our security solutions. From detecting spam emails to detecting business email compromise by analyzing a user’s writing style, machine learning has been an integral tool in our cybersecurity products. Our aim has been to create smarter, more accurate machine learning systems, ones that learn from a wide range of sources and samples.

As a security provider, our threat data comes from multiple points in the threat lifecycle, with layers including email and web gateways, sandboxing, network packet scanning, exploit and endpoint protection, as well as C&C protection. This multilayered approach allows us to collect threat data from a variety of independent locations, thus providing us with threat data diversity that contributes to the accuracy and precision of our machine learning solutions.

Machine learning serves as an effective layer to bolster the cybersecurity posture of enterprises. Our big and better datasets lead to higher detection rates, fewer false positives, and overall stronger protection for endpoints as well as virtual and cloud infrastructure. Ultimately, the level of cybersecurity protection a security company can provide with machine learning is determined not only by the quantity of its threat data but also by its quality.

Trend Micro™ XGen™ security provides a cross-generational blend of threat defense techniques to protect systems from different types of threats. It features high-fidelity machine learning that secures the gateway and endpoint, and protects physical, virtual, and cloud workloads. With capabilities like web/URL filtering, behavioral analysis, and custom sandboxing, XGen protects against today’s threats that bypass traditional controls, exploit known, unknown, or undisclosed vulnerabilities, either steal or encrypt personally identifiable data, or carry out malicious cryptocurrency mining. Smart, optimized, and connected, XGen powers Trend Micro’s suite of security solutions: Hybrid Cloud Security, User Protection, and Network Defense.

With contributions from Brian Cayanan.
