Using Machine Learning to Detect Malware Outbreaks With Limited Samples


Catching malware at the onset is integral to keeping users, communities, enterprises, and governments protected. With the advent of machine learning (ML) technology for cybersecurity, detecting malware outbreaks has become more efficient. Machine learning helps analyze large amounts of data to find patterns and correlations in malware samples, and it helps train systems to detect similar variants as they emerge. But can machine learning aid in analyzing a malware outbreak given only a small dataset? In collaboration with researchers from Federation University Australia, we conducted a study titled “Generative Malware Outbreak Detection,” which showed the effectiveness of the latent representations obtained through an adversarial autoencoder in such situations. This ML model for malware outbreak detection uses a generative adversarial network (GAN) to obtain smooth approximated nearby distributions from a small number of OS X training samples.

Big, Bad Malware Outbreaks

In today’s threat landscape, malware outbreaks have become overt events that adversely affect users and companies globally and cause billions of dollars’ worth of damage. The NotPetya ransomware, which emerged in 2017, became “the most destructive and costly cyberattack in history,” according to the U.S. government. It wreaked havoc on governmental organizations and private companies, with Danish shipping giant Maersk being one of its most prominent victims.

Meanwhile, in 2018, the VPNFilter malware broke out and infected 500,000 routers used in homes and small businesses, with the majority of its victims located in Ukraine. This multistage malware was found in 54 countries and had code overlaps with the BlackEnergy spyware.

In both cases, a sizeable amount of data was readily available for analysis. But for outbreaks with only a handful of available samples, we propose a method that detects malware samples similar to other variants. This method uses just one malware sample for training with an adversarial autoencoder, and it achieves a high detection rate for similar malware samples and a low false positive rate for benign ones.

For this research, we collected 3,254 in-the-wild OS X malware samples and 9,981 benign, randomly chosen OS X Mach-O samples. To replicate a malware outbreak, a human malware expert handpicked 175 of the 3,254 malicious samples that showed unique instruction sequence patterns. These selected samples serve as the core malicious training samples, and each was assigned a unique label.

An Important Feature: Program Instruction Sequence

Malware authors usually use custom tools to automatically generate mutated or modified malware samples. These samples are essentially a collection of malware with the same functionality, made to look different through obfuscation in order to evade static signature-based detection by security products. Although the samples are obfuscated, we observed that one feature stays relatively unchanged: the distribution of the program instruction sequence.
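
To make the idea of an instruction sequence feature concrete, here is a minimal Python sketch. It assumes that normalization means keeping only each instruction's mnemonic and dropping operands such as addresses and immediates; the article does not spell out the exact normalization used in the study, and the disassembly lines and function names below are hypothetical.

```python
from collections import Counter

def normalize(instructions):
    """Keep only the mnemonic of each disassembled line (assumed normalization)."""
    return [line.split()[0].lower() for line in instructions if line.strip()]

def opcode_distribution(opcodes):
    """Relative frequency of each mnemonic in the sequence."""
    counts = Counter(opcodes)
    total = len(opcodes)
    return {op: n / total for op, n in counts.items()}

# Hypothetical disassembly of two variants that differ only in operands.
variant_a = ["push rbp", "mov rbp, rsp", "sub rsp, 0x20",
             "call 0x100001f40", "leave", "ret"]
variant_b = ["push rbp", "mov rbp, rsp", "sub rsp, 0x48",
             "call 0x100003a10", "leave", "ret"]

seq_a = normalize(variant_a)
seq_b = normalize(variant_b)
dist_a = opcode_distribution(seq_a)

# After normalization, the two variants yield the same mnemonic sequence.
print(seq_a == seq_b)
print(dist_a)
```

In this toy case, operand-level changes disappear after normalization, which is the kind of stability the instruction sequence feature relies on.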

To illustrate, we analyzed different variants of the MAC.OSX.CallMe malware using the following samples:

  1. MAC.OSX.CallMe.A (3 samples)
  2. MAC.OSX.CallMe.E (1 sample)
  3. MAC.OSX.CallMe.F (1 sample)


Figure 1. Visual analysis of three unique variants of the MAC.OSX.CallMe family
Note: Each row represents a per-sample feature, which is the sequence of instructions of a malware sample. Each normalized instruction is rendered as a vertical bar with a unique color to differentiate between instructions. The X-axis represents the feature while the Y-axis represents the sample number.

In Figure 1, all variants of the MAC.OSX.CallMe malware share identical instruction sequences until a variation is introduced at instruction 5250. Even after this variation, parts of the instruction sequences for samples 2 and 3 are simply shifted relative to those of the other samples. The instruction sequences for these different MAC.OSX.CallMe variants therefore remain very much alike, which indicates that the program instruction sequence of malware samples is a vital component in identifying malware variants during outbreaks.
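
The observation that variants keep long runs of identical instructions, even when part of the sequence is moved, can be illustrated with a small Python sketch. This is not the adversarial autoencoder model from the study; it swaps in the standard library's `difflib.SequenceMatcher` as a simple stand-in, and the mnemonic sequences and function names are hypothetical.

```python
from difflib import SequenceMatcher

def sequence_similarity(seq_a, seq_b):
    """Similarity ratio in [0, 1] between two mnemonic sequences."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()

def shared_blocks(seq_a, seq_b, min_len=3):
    """Runs of identical instructions as (start_a, start_b, length) tuples."""
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks() if m.size >= min_len]

# Hypothetical mnemonic sequences: `shifted` is `base` with three inserted
# instructions, so the tail of the sequence moves but stays intact.
base = ["push", "mov", "sub", "call", "test", "jz", "mov", "call", "leave", "ret"]
shifted = base[:4] + ["nop", "xor", "nop"] + base[4:]
unrelated = ["xor", "inc", "cmp", "jne", "dec", "shl", "or", "and", "not", "neg"]

print(sequence_similarity(base, shifted))
print(shared_blocks(base, shifted))
print(sequence_similarity(base, unrelated))
```

The shifted variant still scores high and exposes two long shared blocks, while the unrelated sequence shares nothing with the base sequence. A learned latent representation aims to capture this kind of closeness without relying on explicit alignment.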

An example of a malicious cluster detected by the adversarial autoencoder with semantic hashing (aae-sh) is also discussed in our research paper. Specifically, the model detected Flashback variants of different lengths that contain similar instruction sequences.

Our research paper “Generative Malware Outbreak Detection” provides a deep dive into how we used the instruction sequence as the sole feature for the malware outbreak detection model we proposed. The full paper also discusses how our proposed model captures the program instruction sequence in the presence of code transposition and integration metamorphism through an adversarial autoencoder. It was presented at the IEEE International Conference on Industrial Technology (ICIT) 2019. An updated version will be available in the IEEE Xplore Digital Library.

