Grouping Linux IoT Malware Samples With Trend Micro ELF Hash
We created Trend Micro ELF Hash (telfhash), an open-source clustering algorithm that effectively clusters Linux IoT malware created using ELF files.
The internet of things (IoT) has swiftly become a seemingly indispensable part of our daily lives. The IoT devices in pockets, homes, offices, cars, factories, and cities make people’s lives more efficient and convenient. It is little wonder, then, that IoT adoption continues to increase. In 2019, the number of publicly known IoT platforms grew to 620, which was double the number of platforms in 2015. This year, 31 billion IoT devices are expected to be installed globally. Consequently, cybercriminals have been developing IoT malware such as backdoors and botnets for malicious purposes, including digital extortion. As reported in Trend Micro’s latest annual security roundup, the number of brute-force logins made by IoT botnets in 2019 was triple the corresponding number in 2018.
Through the years, cybersecurity researchers have developed various helpful algorithms to identify large numbers of malicious files quickly and accurately as an effective measure in the fight against malware. But on the IoT front, as threats and attacks geared toward web-connected devices continue to grow exponentially, cybersecurity experts need to have a means to make their defensive measures systematic, accurate, and strong.
To that end, we created Trend Micro ELF Hash (telfhash), an open-source clustering algorithm that effectively clusters malware targeting IoT devices running on Linux — i.e., Linux IoT malware — created using Executable and Linkable Format (ELF) files.
Existing algorithms for file clustering
Malware researchers have long used algorithms that cluster large numbers of malicious files efficiently and accurately. One example is our own Trend Micro Locality Sensitive Hash (TLSH), a fuzzy hashing technique that produces similar digests for similar files, so the distance between two digests indicates how alike the files are. TLSH can also be used in machine learning extensions of whitelisting. In 2018, we used TLSH to analyze 2 million signed files and uncovered massive certificate signing abuse by a marketing adware plug-in called Browsefox.
Another example is import hashing (ImpHash), which is primarily used to identify malware binaries belonging to the same malware family. It takes the imported functions of a Portable Executable (PE) file from the import directory, together with their library names, and builds a comma-separated list. The list is then hashed with the MD5 algorithm. In the example shown in Figure 1, we used a sample of Lokibot, a malware variant that steals sensitive data from victim machines, to illustrate how ImpHash works.
Figure 1. Imported functions from a Lokibot sample as seen on the import directory
From the KERNEL32.DLL library, this sample imports GetTempPathA(), GetFileSize(), GetModuleFileNameA(), and other functions. The imported functions from all of the imported libraries are used to generate the ImpHash. As a result, similar files have the same ImpHash value even if new data is added to them, unless the developers change a file's features by using (and therefore importing) a new function or removing a previously used one.
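The ImpHash steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name imphash_sketch and the sample import lists are hypothetical, and the lowercasing and extension-stripping steps are assumptions modeled on common ImpHash implementations rather than details stated in this article.

```python
import hashlib


def imphash_sketch(imports):
    """Compute an ImpHash-style digest from (library, function) pairs.

    `imports` is expected to keep the order of the PE import directory.
    """
    entries = []
    for library, function in imports:
        lib = library.lower()
        # Assumption: drop common module extensions, so that
        # "KERNEL32.DLL" becomes "kernel32".
        for ext in (".dll", ".ocx", ".sys"):
            if lib.endswith(ext):
                lib = lib[: -len(ext)]
                break
        entries.append(f"{lib}.{function.lower()}")
    # Comma-separated list of "library.function" names, hashed with MD5.
    return hashlib.md5(",".join(entries).encode("ascii")).hexdigest()


# Hypothetical import lists, loosely based on the functions named above.
lokibot_like = [
    ("KERNEL32.DLL", "GetTempPathA"),
    ("KERNEL32.DLL", "GetFileSize"),
    ("KERNEL32.DLL", "GetModuleFileNameA"),
]
same_imports = list(lokibot_like)
extra_import = lokibot_like + [("KERNEL32.DLL", "CreateFileA")]

print(imphash_sketch(lokibot_like) == imphash_sketch(same_imports))  # True
print(imphash_sketch(lokibot_like) == imphash_sketch(extra_import))  # False
```

The second comparison shows the limitation that matters later in this article: because MD5 is a cryptographic hash, a single added import produces a completely different value.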
Although there are many algorithms available for Windows PE files, there is still no specific clustering algorithm for Linux IoT malware, which is mostly created using ELF files. Thus, we saw a need to create telfhash, which can be viewed as “ImpHash for IoT malware” since it uses ImpHash techniques in analyzing ELF executables.
Telfhash at work
Our goal with telfhash was to extract the list of imported functions from an ELF file and feed it to a similarity digest algorithm in order to cluster similar files.
Although it is based on ImpHash’s techniques, telfhash uses TLSH instead of MD5 for its hash. This is to take advantage of TLSH’s locality-sensitive nature without losing the structural approach of using a function list as an input for the algorithm. Therefore, even if malware creators add new functionalities to their malicious samples by adding or importing new library functions, the telfhash digest would remain close to the original and would still indicate whether the samples belong to the same family.
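To illustrate why a similarity measure tolerates an added import while MD5 does not, here is a rough sketch. It does not use TLSH: Jaccard distance over the symbol sets stands in for a locality-sensitive digest, purely to keep the example dependency-free. The symbol lists are made up, and the normalization in symbol_string (lowercased, de-duplicated, sorted, comma-joined) is an assumption for illustration, not a description of how telfhash prepares its input.

```python
import hashlib


def symbol_string(symbols):
    # Assumption: normalize to a sorted, comma-separated, lowercase list.
    return ",".join(sorted({s.lower() for s in symbols}))


def md5_digest(symbols):
    # ImpHash-style exact digest of the symbol list.
    return hashlib.md5(symbol_string(symbols).encode("ascii")).hexdigest()


def jaccard_distance(a, b):
    # Stand-in for a similarity digest distance (NOT TLSH):
    # 0.0 means identical symbol sets, 1.0 means nothing in common.
    set_a = {s.lower() for s in a}
    set_b = {s.lower() for s in b}
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical imported-function lists.
original = ["socket", "connect", "send", "recv", "fork", "sleep", "close", "exit"]
variant = original + ["sendto"]  # same family, one new import
unrelated = ["printf", "malloc", "free", "fopen", "fclose"]

# The exact hash no longer matches after one added import.
print(md5_digest(original) == md5_digest(variant))  # False

# The similarity measure still places the variant near the original.
print(jaccard_distance(original, variant) < jaccard_distance(original, unrelated))  # True
```

A real locality-sensitive digest such as TLSH has the practical advantage that only the fixed-size digests need to be stored and compared, not the full symbol lists.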
To see if telfhash works on real malware, we collected samples of Momentum, a botnet that affected IoT devices running on Linux and compromised devices to conduct distributed denial-of-service attacks, and ran telfhash against them, as shown in Figure 2.
Figure 2. Momentum botnet samples compiled for different architectures
Using telfhash and the TLSH distance measure with a threshold of 50, we clustered the Momentum botnet samples into three groups of similar samples, as shown in Figure 3.
Figure 3. Clustering Momentum botnet samples in three groups (telfhash values redacted for brevity)
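Grouping by a distance threshold, as described above, can be sketched as follows. This is one possible approach (single-linkage grouping with a union-find structure), not necessarily the clustering method used to produce Figure 3. The function cluster_by_threshold, the sample names, and the toy scores are all hypothetical, and treating the threshold as inclusive is an assumption. In practice, the distance callable would compute the TLSH distance between two telfhash digests.

```python
def cluster_by_threshold(items, distance, threshold):
    """Group items so that any pair within `threshold` shares a group."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the group's root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            # Assumption: a distance equal to the threshold still counts.
            if distance(items[i], items[j]) <= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


# Toy stand-in for digests: each sample gets a number, and the distance
# is the absolute difference between the numbers.
toy_scores = {
    "sample_1": 0,
    "sample_2": 30,
    "sample_3": 200,
    "sample_4": 220,
    "sample_5": 500,
}


def toy_distance(a, b):
    return abs(toy_scores[a] - toy_scores[b])


clusters = cluster_by_threshold(list(toy_scores), toy_distance, threshold=50)
print(clusters)
# [['sample_1', 'sample_2'], ['sample_3', 'sample_4'], ['sample_5']]
```

Single-linkage grouping can chain samples together: two samples that are far apart may land in the same group if a series of intermediate samples connects them. A stricter scheme would be needed where that behavior is undesirable.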
Currently, telfhash supports x86, x86-64, ARM, and MIPS, which are architectures that cover the majority of IoT malware samples.
Telfhash is now publicly available on GitHub. We are offering it as a Python library so that it can be easily integrated into Python scripts to generate a similarity digest for ELF files.
We are optimistic that new features will be discussed, improvements will be made, and bugs will be fixed with the support of the cybersecurity community. We hope that telfhash proves to be an essential tool in combating Linux IoT malware.
Read the tech brief, which discusses in detail how we developed telfhash and how it works.