Rising Above Spam and Other Threats via Machine Learning

By Jon Oliver

In 1978, some 400 ARPANET (Advanced Research Projects Agency Network) users received an email about a product viewing of new computer models. Gary Thuerk — a marketer working for the Digital Equipment Corporation (DEC) — thought it would be a good idea to email people on the network to sell computer products. While it drew interest from some of the recipients, a portion expressed annoyance at the then-unnamed intrusive advertising. Several years later, the cybersecurity industry called emails of a similar nature “spam,” describing it as unwanted bulk email advertising products or services.

Unfortunately, Thuerk’s incidental infamy in e-marketing decades ago has been surpassed by today’s cybercriminals: In 2002, spam volume reached 2.4 billion emails per day; today, it exceeds 300 billion per day. And while the repercussions of overwhelming spam volume were once limited to system performance issues, they now include more serious complications, especially for businesses, as cybercriminals use spam for phishing and other malicious purposes.

Early solutions to combat the growing number of spam mails

A few decades after the first spam email was sent, motivations and methods of delivery have evolved. The early 2000s saw simple spam mails that were in plain text, sometimes boldly under the guise of Viagra advertisements, and were either directly emailed by the spammer or sent through open relays. Recipients’ inboxes lagged, and their internet speed slowed down. To solve this problem, antispam companies created solutions using a combination of hashing and “spam signatures,” which were manually crafted rules written in operation centers.
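The combination of hashing and hand-written signatures described above can be sketched roughly as follows. This is a minimal illustration, not any vendor's actual engine: the normalization step, the sample hash set, and the two regular-expression rules are all hypothetical and chosen only to show the idea.

```python
import hashlib
import re


def normalize(body: str) -> str:
    """Lowercase and collapse whitespace so trivial edits keep the same hash."""
    return " ".join(body.lower().split())


def fingerprint(body: str) -> str:
    """SHA-256 digest of the normalized message body."""
    return hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()


# Hypothetical data: fingerprints of messages already reported as spam.
KNOWN_SPAM_HASHES = {fingerprint("Buy cheap pills now!!! Visit our store today")}

# Hypothetical manually crafted "spam signatures" (regular-expression rules).
SPAM_SIGNATURES = [
    re.compile(r"\bv[i1]agra\b", re.IGNORECASE),
    re.compile(r"\bact now\b.*\blimited offer\b", re.IGNORECASE | re.DOTALL),
]


def is_spam(body: str) -> bool:
    """Flag a message if its hash is known or any signature matches."""
    if fingerprint(body) in KNOWN_SPAM_HASHES:
        return True
    return any(sig.search(body) for sig in SPAM_SIGNATURES)
```

A spammer only has to change one character that no rule anticipates (for example, "p1lls" instead of "pills") to get a new hash and slip past both checks, which is one reason rules of this kind need constant manual upkeep.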

This approach produced two different results. On one hand, there was a positive outcome: around 50 percent of spam mails were blocked. On the other, there was a realization that the approach was not going to be effective with spam averaging 2.4 billion emails daily, as it did in 2002. Imagine this: If you are already 30 feet underwater, removing 50 percent of the water will not save you from drowning.
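To put a number on that, the short calculation below applies the 50 percent block rate to the 2002 daily volume cited above. The function name is made up for illustration, and the result is simple arithmetic rather than a measured figure.

```python
DAILY_SPAM_2002 = 2_400_000_000  # daily volume cited in the article for 2002


def spam_delivered(daily_volume: int, block_rate_percent: int) -> int:
    """Messages that still get through at a given block rate (integer math)."""
    return daily_volume * (100 - block_rate_percent) // 100


remaining = spam_delivered(DAILY_SPAM_2002, 50)
print(f"{remaining:,} spam emails still delivered per day")
# prints "1,200,000,000 spam emails still delivered per day"
```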

Catching 50 percent of spam mails could be ineffective
if they average billions per day

Antispam solutions must deal with spam before it enters the network; it’s impractical, not to mention risky, to allow the delivery of potentially malicious emails and only react to them after the fact.

Addressing the need for a more efficient way to defend against spam, the antispam industry turned to machine learning (ML), a tool that analyzes large volumes of information or training data to discover unique patterns. The effect: Overall cyberdefense was enhanced to catch about 95 percent of spam, making ML a key technology for blocking junk emails.
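As a rough illustration of how a model can discover patterns from labeled training data, here is a small naive Bayes text classifier written from scratch. The article does not say which algorithms antispam products use, so naive Bayes is simply a well-known textbook choice substituted here, and the six training messages are invented for the example.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Word-count naive Bayes with add-one (Laplace) smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def _log_score(self, words, label, vocab_size):
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        # Log prior; assumes every label has at least one training message.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in words:
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        return score

    def predict(self, text: str) -> str:
        words = text.lower().split()
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        return max(self.LABELS, key=lambda lb: self._log_score(words, lb, len(vocab)))


# Invented training data, for illustration only.
spam_filter = NaiveBayesSpamFilter()
for msg in ("win free money now", "free pills cheap offer", "claim your free prize now"):
    spam_filter.train(msg, "spam")
for msg in ("meeting agenda for monday", "please review the attached report", "lunch at noon tomorrow"):
    spam_filter.train(msg, "ham")

print(spam_filter.predict("free money offer"))       # prints "spam"
print(spam_filter.predict("monday meeting report"))  # prints "ham"
```

Unlike a hand-written rule, nothing here names a specific spam phrase; the word statistics come entirely from the labeled examples, so retraining on new data updates the filter.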

Detecting and blocking billions of spam emails using machine learning

Trend Micro started using machine learning to detect and block spam in 2005 via the Trend Micro Anti-Spam Engine (TMASE) and Hosted Email Security (HES) solutions. As spam continued to evolve, from plain text to images, email backscatter, document attachments, and CAPTCHA, among others, Trend Micro countered with an efficient machine learning model backed by quality datasets. Employed alongside other antispam protection layers (for example, Email Reputation Services, IP Profiler, antispam composite engine), machine learning algorithms were used to correlate threat information and perform in-depth file analysis to catch spam and keep it off enterprise networks.

Machine learning combined with other antispam methods
helped solutions catch approximately 95 percent of spam

TMASE and HES allow enterprises to preserve bandwidth, storage, and other resources. With built-in machine learning and other defensive layers, these products outperform traditional on-premises security engines, whose hardware capacity limits cause email delivery to slow down under large spam volumes. TMASE and HES work without hardware or software to install and maintain. This means email-based threats are blocked before they reach the network, which saves IT staff time and maximizes end-user productivity, network bandwidth, mail server storage, and CPU capacity.

When spam distribution skyrocketed to 200 billion emails per day in 2010, TMASE and HES, like other security solutions featuring machine learning, handled a gargantuan volume of junk email that hashing and spam signatures alone could not have countered in the early 2000s, let alone in 2010.

An effective strategy to stop other threats

As the game plan to combat spam via machine learning proved to be effective, Trend Micro gained momentum early on to prepare for the arrival of newer email threats and malware. Just like in the case of spam, catching these threats at time zero is critical to network security. Machine learning has become integral to a security solution’s efficiency in detecting threats because of its ability to predict the maliciousness of an unknown file entering a system. That being said, the success rate of a machine learning model relies not just on a large quantity of data but also on the quality of that data, meaning how accurately it is labeled.

The strategy for using machine learning in antispam engines involved the use of state-of-the-art models, reliance on an iterative method to improve the model’s accuracy, and the collection of accurately labeled data, which is a crucial part of the process. With Trend Micro’s longevity in the cybersecurity business, Trend Micro™ engines are supported by massive, high-quality datasets. They are not only large in volume, having been gathered from hundreds of millions of sensors across the globe, but are also of good quality, having been studied and categorized by threat experts for 30 years.

The same strategy is used by solutions that defend enterprises from business email compromise (BEC) scams. The latest anti-BEC security technology that adopted this approach is Trend Micro™ Writing Style DNA. This feature prevents email impersonation by recognizing a legitimate email user’s writing style based on past written emails and comparing it to suspected forgeries. It analyzes email behavior, intention, and authorship by using expert rules and machine learning. The machine learning model behind this feature is fueled by accurately labeled datasets containing 7,000 writing characteristics, such as capital letters, short words, punctuation marks, function words, word repeats, distinct words, sentence length, and blank lines.
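To make the idea of writing characteristics more concrete, the sketch below computes simple versions of the eight examples named above for a single email body. It is only an illustration under stated assumptions: the article does not describe how the product defines or measures its 7,000 characteristics, so the function name, the tiny function-word list, and the three-letter cutoff for "short words" are all hypothetical.

```python
import re
import string

# Tiny illustrative list; a real stylometric system would use a far larger one.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "on", "for", "with"}


def style_features(email_body: str) -> dict:
    """Return a few simple writing-style measurements for one email."""
    words = re.findall(r"[A-Za-z']+", email_body)
    lowered = [w.lower() for w in words]
    sentences = [s for s in re.split(r"[.!?]+", email_body) if s.strip()]
    return {
        "capital_letters": sum(c.isupper() for c in email_body),
        "short_words": sum(len(w) <= 3 for w in words),  # assumed cutoff
        "punctuation_marks": sum(c in string.punctuation for c in email_body),
        "function_words": sum(w in FUNCTION_WORDS for w in lowered),
        "word_repeats": len(lowered) - len(set(lowered)),
        "distinct_words": len(set(lowered)),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "blank_lines": sum(1 for line in email_body.splitlines() if not line.strip()),
    }


sample = "Hi Bob,\n\nPlease send the report to the team. Thanks!\n"
features = style_features(sample)
print(features["capital_letters"], features["blank_lines"])  # prints "4 1"
```

A feature vector like this, computed over a sender's past emails, is the kind of input a model could compare against a suspicious message that claims to come from the same person.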

Machine learning: a boost to overall cyberdefense

For the past several years, we have seen businesses suffer from financial loss, reputational damage, and disruption of operations brought on by a wide variety of threats — including those borne by spam emails. Modern and more complex threats will continue to proliferate, and older ones like spam will continue to evolve. Given machine learning’s promise, the cybersecurity industry will continue to use it to better protect systems and networks.

However, while machine learning does improve detection and block rates, it is important to note that it works best as part of interoperating security layers. As spam mails and other forms of threats diversify and proliferate, enterprises must employ as many cybersecurity technologies as they can. Relying on a single solution cannot solve all security problems; a multilayered approach is still the most effective at providing a defense that is capable of deflecting varying kinds of threats.

