Naming the Unknown: Labeling Unknown Files Through Machine Learning

April 13, 2018

A study by Trend Micro researchers showed that more than 83 percent of all downloaded software files are unknown or unclassified, even two years after they were first observed in the wild. Because most malware threats come from software download events, the researchers developed a machine learning system with human-readable rules that classifies unknown files as either benign or malicious.

The study involved a dataset of 3 million anonymized web-based software download events gathered over a seven-month period. These events were analyzed using multiple sources of ground truth, drawn from both internal, proprietary Trend Micro systems and publicly available ones. However, less than 17 percent of the dataset could be labeled using traditional means.

Although these unknown files have very low prevalence, the research highlighted that 69 percent of the machines in the study downloaded one or more unknown software files, any of which could potentially be malware.

Anonymous No More Through Machine Learning

To scale down the number of unknown software downloads, Trend Micro researchers created a machine learning system that automatically produces detection rules based on observed software file information and features. The system analyzed the following information for each software download:

Signer, certification authority, and packer of the downloaded file
Signer, certification authority, and packer of the downloading process
Class of the downloading process (browser, Windows, Java, etc.)
Popularity of the download domain
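
As a rough illustration of how these attributes might be gathered into one record per download event, here is a minimal Python sketch. The class name, field names, and the way domain popularity is encoded are all assumptions made for this example; the study does not publish its data schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class DownloadEvent:
    """One software download event. All field names are hypothetical."""
    file_signer: Optional[str]      # signer of the downloaded file
    file_ca: Optional[str]          # certification authority of the downloaded file
    file_packer: Optional[str]      # packer of the downloaded file
    process_signer: Optional[str]   # signer of the downloading process
    process_ca: Optional[str]       # certification authority of the downloading process
    process_packer: Optional[str]   # packer of the downloading process
    process_class: str              # e.g. "browser", "windows", "java"
    domain_popularity: str          # assumed bucket such as "high" or "low"


def to_features(event: DownloadEvent) -> dict:
    """Flatten an event into nominal features, marking absent values explicitly."""
    return {
        name: "none" if value is None else value
        for name, value in asdict(event).items()
    }


# Hypothetical example: an unsigned, unpacked file fetched by a signed browser.
example = DownloadEvent(
    file_signer=None,
    file_ca=None,
    file_packer=None,
    process_signer="Example Browser Vendor",
    process_ca="Example CA",
    process_packer=None,
    process_class="browser",
    domain_popularity="low",
)
features = to_features(example)
```

Representing a missing signer or packer as an explicit "none" value keeps the absence of a signature available as a signal, rather than silently dropping it.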

Using the PART rule learning algorithm, the researchers were able to create a machine learning system whose classification rules are human-readable. This allowed the researchers to easily observe, understand, and analyze results, which is harder to do with other machine learning algorithms such as support vector machines (SVMs) and neural networks.

The system generated an average of 1,500 new detection rules per month, which reduced the number of unknown downloads by 28 percent.[1]
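
A rule learner such as PART produces an ordered list of if-then rules, and a file is classified by the first rule whose conditions it satisfies. The Python sketch below shows only how such a list could be applied, not how PART learns it. The rules, signer names, and feature keys are invented for illustration and are not taken from the study; leaving unmatched files as "unknown" is likewise an assumption.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, str]
# Each rule is (human-readable text, condition, label).
Rule = Tuple[str, Callable[[Features], bool], str]

# Hypothetical rules, written so the readable text mirrors the condition.
RULES: List[Rule] = [
    (
        "IF file_signer = 'Example Trusted Vendor' AND process_class = 'browser' THEN benign",
        lambda f: f.get("file_signer") == "Example Trusted Vendor"
        and f.get("process_class") == "browser",
        "benign",
    ),
    (
        "IF file_packer = 'ExamplePacker' AND domain_popularity = 'low' THEN malicious",
        lambda f: f.get("file_packer") == "ExamplePacker"
        and f.get("domain_popularity") == "low",
        "malicious",
    ),
]


def classify(features: Features, rules: List[Rule] = RULES) -> Tuple[str, str]:
    """Return (label, explanation) from the first matching rule, in order."""
    for text, condition, label in rules:
        if condition(features):
            return label, text
    # Assumption: files that no rule covers stay unlabeled.
    return "unknown", "no rule matched"


label, reason = classify({
    "file_signer": "none",
    "file_packer": "ExamplePacker",
    "process_class": "browser",
    "domain_popularity": "low",
})
```

Because each decision comes with the text of the rule that fired, an analyst can read why a file was labeled, which is the property the researchers contrast with SVMs and neural networks.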

The Malicious Use of Code Signing and PUAs in Malware Distribution

Unknown files are unlabeled files, and file labeling is important because it allows malware detection systems to effectively protect endpoints against malicious files. In malware research, most of the proposed systems for malware detection and classification are assessed only on samples for which ground truth is available, which may limit how well they apply at large scale. This emphasizes the need for more efficient labeling of currently unknown files.

One practice used by most operating systems and browsers, and heavily relied on in the cybersecurity community, is identifying software files through code signing: the cryptographic signing of software, which is meant to help differentiate between benign and malicious software.

However, malicious actors take advantage of code signing to distribute malicious software files, and this practice has intensified in the past few years. Last year, it was reported that code signing certificates were for sale on the dark web for up to $1,200, a price point higher than other dark web offerings such as stolen credit card information and fake IDs.

Study results corroborated this widespread abuse: 66 percent of malicious software was found to be properly code signed, compared with 30.7 percent of legitimate software apps. These malicious files are therefore able to pass code signing validation and infect endpoints.

The research also looked into potentially unwanted applications (PUAs) and how these are more damaging than they appear. According to the research, PUAs transform into more advanced forms of malicious software on the same day machines are infected. Recently, the Trend Micro Cyber Safety Solutions team tracked down a PUA distribution campaign that began in the latter part of 2017. Once installed, PUA software downloaders like ICLoader were found pushing malware along with PUAs.

The machine learning system used these observed PUA behaviors, together with data on code signers, to label unknown files.

Machine Learning for Cybersecurity

Using the machine learning system, Trend Micro researchers successfully labeled 28.30 percent of 1,436,829 previously unknown files, a 233 percent increase in comparison to the available ground truth. This result can further enhance the evaluation of future malware detection systems, helping secure endpoints better.

This system joins the roster of machine learning innovations Trend Micro has been using since as early as 2005 to adapt to the ever-changing cybersecurity landscape.
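
To put the percentage in absolute terms, the short Python calculation below converts it into an approximate file count. The result is only approximate because the published percentage is rounded to two decimal places; the variable names are illustrative.

```python
# Figures quoted in the article.
total_unknown_files = 1_436_829
labeled_share = 0.2830  # 28.30 percent, as published (rounded)

# Approximate number of previously unknown files that received a label.
newly_labeled = round(total_unknown_files * labeled_share)

# Files that remain unlabeled after the system is applied.
still_unknown = total_unknown_files - newly_labeled

print(f"Newly labeled: about {newly_labeled:,}")
print(f"Still unknown: about {still_unknown:,}")
```

That works out to roughly 407,000 newly labeled files, with a little over a million files still unknown.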

Advances in machine learning for cybersecurity solutions are much needed as threats continue to emerge or evolve. But while machine learning is evidently effective in identifying and analyzing unknown files, as well as catching new ransomware types and detecting new malware variants, it is not a cybersecurity silver bullet. Machine learning is more potent when it is part of a multilayered approach to protecting against a variety of threats.

Trend Micro™ XGen™ security provides a cross-generational blend of threat defense techniques to protect systems from different types of malware. It features high-fidelity machine learning to secure the gateway and endpoint, and protects physical, virtual, and cloud workloads. With capabilities like web/URL filtering, behavioral analysis, and custom sandboxing, XGen protects against today’s threats that bypass traditional controls, exploit known, unknown, or undisclosed vulnerabilities, either steal or encrypt personally identifiable data, or carry out malicious cryptocurrency mining. Smart, optimized, and connected, XGen powers Trend Micro’s suite of security solutions: Hybrid Cloud Security, User Protection, and Network Defense.

[1] Averaged over seven months’ worth of data.

Posted in Security Technology, Machine Learning