Dark Reading is part of the Informa Tech Division of Informa PLC


Threat Intelligence

2/11/2016
12:30 PM
Giora Engel
Commentary

3 Flavors of Machine Learning: Who, What & Where

To get beyond the jargon of ML, you have to consider who (or what) performs the actual work of detecting advanced attacks: vendor, product or end-user.

The great promise machine learning holds for the security industry is its ability to detect advanced and unknown attacks -- particularly those leading to data breaches. Its uses range from traditional ones -- such as malware detection -- to new areas like detecting attackers who have already circumvented preventative security.

Unfortunately, machine learning, which is rapidly becoming a popular marketing term, has lost much of its meaning because virtually all vendors define it differently. One way to get beyond the jargon is to look at ML from the perspective of who actually performs it, and where. But first, some basic concepts and definitions.

An ML algorithm is only as strong as the data modeling behind it; the algorithm itself plays a secondary role. If the selected input parameters cannot predict the result, even a sophisticated algorithm will produce inaccurate results and generate a lot of noise once it is used outside a lab environment.
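
As a toy illustration of this point, the sketch below applies the same trivial threshold rule to two input parameters: one that predicts the label and one that does not. The field names (bytes_out, packet_flag) and the synthetic events are assumptions made up for this example, not data from any real product.

```python
def make_events(n=200):
    """Build synthetic, perfectly balanced events (hypothetical fields)."""
    events = []
    for i in range(n):
        malicious = (i % 2 == 0)
        events.append({
            # Predictive parameter: malicious events send far more data.
            "bytes_out": (5000 + (i % 7) * 100) if malicious else (100 + (i % 7) * 50),
            # Non-predictive parameter: unrelated to the label by construction.
            "packet_flag": (i // 2) % 2,
            "malicious": malicious,
        })
    return events


def best_threshold_accuracy(events, feature):
    """Try every observed value as a cutoff and return the best accuracy."""
    best = 0.0
    for threshold in sorted({e[feature] for e in events}):
        correct = sum((e[feature] >= threshold) == e["malicious"] for e in events)
        best = max(best, correct / len(events))
    return best


events = make_events()
print("bytes_out:  ", best_threshold_accuracy(events, "bytes_out"))
print("packet_flag:", best_threshold_accuracy(events, "packet_flag"))
```

On this synthetic data the rule separates the classes perfectly with bytes_out and does no better than a coin flip with packet_flag, even though the algorithm is identical in both cases.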

A basic principle in data science is that simple schemes with the right data modeling work better than complex schemes. So in evaluating options, it’s wise to look for vendors that have real domain expertise rather than a large staff of PhDs. That’s because understanding the parameters and various scenarios is more important than the development of an algorithm for correlating data. Domain expertise directly affects the quality of the data modeling. Consequently, if it’s hard to understand how ML is used, it probably means that it is not relevant to the way the product works.

As for understanding the various flavors of ML, one approach is to divide products into categories based on who (or what) actually performs the machine learning work: the vendor, the product or the end-user.

The Vendor
In the vast majority of cases, the term machine learning actually describes one of the tools that the vendor uses to develop its product or generate threat intelligence. In these cases, the vendor is performing ML in its lab, rather than the product doing it on premises.

A typical example: AV and URL filtering vendors that perform ML behind the scenes. In order to keep their signatures (or threat intelligence) reasonably current and to process heavy loads of malware and viruses that have been encountered, vendors need to leverage ML in their labs to automate the classification and signature creation process. This use of ML occurs in the vendor’s lab and results in signatures or threat intelligence that the product then uses to detect specific patterns or artifacts.

Typical products: AV, sandboxing, anti-bot, whitelisting and rule-based event correlation.

Advantage: The products are deterministic and will always operate in the same way, regardless of the environment.

Disadvantage: The products are rule-based and can leverage only known artifacts, which leads to low detection accuracy (e.g. AVs inherently don’t detect new malware well). Attackers can circumvent detection and test against the product.
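
The trade-off described above can be sketched in a few lines. This is a minimal illustration, assuming the vendor ships a set of SHA-256 hashes as its signatures; the sample byte strings and the names KNOWN_BAD and is_flagged are hypothetical, and real products use far richer signature formats.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical signature set, produced in the vendor's lab and shipped to the product.
KNOWN_BAD = {
    sha256_hex(b"known-malware-sample-1"),
    sha256_hex(b"known-malware-sample-2"),
}


def is_flagged(file_bytes: bytes, signatures=KNOWN_BAD) -> bool:
    """Deterministic lookup: the product itself performs no learning."""
    return sha256_hex(file_bytes) in signatures


print(is_flagged(b"known-malware-sample-1"))    # a known artifact
print(is_flagged(b"known-malware-sample-1!"))   # the same sample with one byte appended
```

The first lookup matches in every environment, while the slightly modified sample is missed, which is the kind of change an attacker can test against the product in advance.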

The Product
Some products perform ML as an integral part of their function, typically for behavioral detection. In this case the product “learns” the specific environment and uses that information for detection. For example, the product might observe a user or machine starting to access resources that it has never accessed before and that the user’s peer group doesn’t typically access. There is no predetermined rule, signature or pattern that can detect this. You can only achieve an accurate detection by profiling normal behavior in the particular network and applying that knowledge to detect anomalous behavior.
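
A minimal sketch of this kind of profiling follows, assuming access events arrive as (user, resource) pairs and each user is mapped to a peer group. The class name AccessProfiler, the users and the resources are all hypothetical; a real product would model far more than set membership.

```python
from collections import defaultdict


class AccessProfiler:
    """Learns which resources each user and each peer group normally access."""

    def __init__(self, peer_group_of):
        self.peer_group_of = peer_group_of      # user -> peer group name
        self.user_seen = defaultdict(set)       # user -> resources seen in training
        self.group_seen = defaultdict(set)      # group -> resources seen in training

    def learn(self, user, resource):
        """Record one access observed during the profiling period."""
        self.user_seen[user].add(resource)
        self.group_seen[self.peer_group_of[user]].add(resource)

    def is_anomalous(self, user, resource):
        """Flag a resource that is new to both the user and the user's peers."""
        group = self.peer_group_of[user]
        new_for_user = resource not in self.user_seen.get(user, set())
        new_for_group = resource not in self.group_seen.get(group, set())
        return new_for_user and new_for_group


profiler = AccessProfiler({"alice": "finance", "bob": "finance", "carol": "eng"})
for user, resource in [("alice", "ledger-db"), ("bob", "payroll-share"), ("carol", "git-server")]:
    profiler.learn(user, resource)

print(profiler.is_anomalous("alice", "payroll-share"))  # peers in finance use it
print(profiler.is_anomalous("alice", "git-server"))     # new to alice and to finance
```

Nothing here is a predefined rule about specific resources: what counts as anomalous comes entirely from what was observed in this particular environment during the learning period.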

“Behavioral analysis” by itself doesn’t mean machine learning. Many products look at behaviors and apply rules or signatures. For example, sandboxing products typically run malware in a sandbox environment, examine its behavior and then compare the behavior against a list of rules previously developed by the vendor in their lab (using different methods, including machine learning). In this case the product itself does not perform any ML. A product that performs ML must have a self-training/learning/profiling period. Products that don’t operate this way do not belong in this category, even if they are said to perform “behavioral analysis” or “detection”.

A relatively new security application for machine learning is detection of attacks that have evaded preventative security. While malware detection doesn’t necessarily need ML-capable products, more general behavioral attack detection is usually based around the activities of a human attacker or insider. The system has to essentially customize its logic to the environment in order to accurately detect the activities. This area represents a substantial break from traditional security in that the goal is to identify unknown anomalous behaviors that neither the end user nor the vendor specified in advance, rather than evaluate against known, already-defined technical artifacts.

Typical products: fraud detection, anomaly detection, attack detection, behavioral detection. A product in this category has to have a self-learning/profiling period, so other “behavioral analysis” products are not included here.

Advantage: By leveraging ML, these products can achieve higher detection accuracy and a lower rate of false positives. They automatically tune their detection to each specific environment and can detect unknown behaviors that neither the end-user nor the vendor specified in advance. Additionally, they can’t be “gamed” by hackers in the way that a statically defined technical artifact can be learned and then circumvented.

Disadvantage: The detection depends on the profile of the specific environment, making the process less predictable. These products are optimized for automated detection rather than for generic queries on the data.

The End-user
This category includes products that are toolkits used by data scientists to perform ML. For example, business intelligence (BI) tools enable the end user to define datasets and run correlations, regressions and clustering algorithms. In this case the end user is the data scientist who leverages ML, and the product is only a tool at his or her disposal. The end user decides which data to process, what parameters to use and how to interpret the results.
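
To make the division of labor concrete, here is a minimal sketch of one such analysis: a Pearson correlation computed by hand over two columns that the analyst chose. The column names and the per-host values are invented for illustration; the toolkit only does the arithmetic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-host daily figures selected by the analyst.
failed_logins = [2, 3, 1, 8, 9]
outbound_mb = [110, 130, 100, 400, 450]

r = pearson(failed_logins, outbound_mb)
print(f"correlation: {r:.2f}")
```

The tool reports a strong positive correlation for these made-up numbers, but deciding whether that relationship means anything, and whether these were the right columns to compare in the first place, is left entirely to the analyst.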

Typical products: Business intelligence products, mathematical/statistical analysis toolkits, SIEM products with analytics toolkits.

Advantage: Lets the user perform custom analytics on custom datasets.

Disadvantage: Can only be leveraged if the security team has data scientists. The responsibility is on the analyst rather than the tool to define the problem, the input data and the conclusions. The analyst will not see patterns that he or she wasn’t looking for. Collecting the data needed for custom analytics is also a heavy task that requires additional products and storage.


Giora Engel, vice president, product & strategy at LightCyber, is a serial entrepreneur with many years of technological and managerial experience. For nearly a decade, he served as an officer in an elite technological unit in the Israel Defense Forces, where he initiated and ...
 

Comments
JouCTO,
User Rank: Apprentice
2/14/2016 | 9:38:22 AM
Outstanding
A refreshingly accurate and honest review of machine learning. Thank you, Giora!

 