The human retina can transmit visual information to the brain at roughly the rate of an Ethernet connection, while reading text transmits information at roughly the rate of a dial-up modem.
Relying on text to present data therefore has drawbacks, especially in the field of security research, which depends on the monitoring and analysis of large-scale, constantly evolving data sets. By contrast, smart data visualization combined with intelligent data mining can allow researchers to draw connections between data points even in loosely related data, skipping the gradual comprehension of text files otherwise needed to reach the same results. Visualization can also surface observations and conclusions that may not be obvious in text.
The security field offers many uses for the visualization of loosely related data. Firewall, intrusion detection and prevention system (IDS/IPS), and malware infection alerts could, for instance, be visualized to expose a malicious actor’s previously unrecognized activity patterns. By processing and analyzing very large log files, data visualization can help summarize and simplify the current state of a complex IT system in an accurate and elegant fashion.
To get from data to visualization, semantic networks are key. Also called frame networks, semantic networks can represent any desired relationship between any defined concepts or entities, and can be applied to nearly any problem.
Such networks consist of nodes (also called vertices) that represent the entities being examined, and edges (the connections between the nodes) that describe the relationships between the entities. A semantic network representing a company’s IT environment might consist of nodes that represent various types of servers and environments (HTTP, mail, NTP, SSH ...), and edges that specify relationships and their attributes (channels, ports, traffic, bandwidth, etc.).
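As a rough illustration of the node-and-edge idea, here is a minimal Python sketch of a semantic network for a small IT environment. The class names, the server names, and the chosen edge attributes (port and bandwidth) are all assumptions made up for this example, not part of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """An entity in the network, e.g. a server and its role."""
    name: str
    kind: str  # hypothetical label such as "HTTP", "Mail", "NTP", "SSH"


@dataclass(frozen=True)
class Edge:
    """A directed relationship between two nodes, with its attributes."""
    source: str
    target: str
    port: int
    bandwidth_mbps: float


@dataclass
class SemanticNetwork:
    nodes: dict = field(default_factory=dict)  # name -> Node
    edges: list = field(default_factory=list)  # list of Edge

    def add_node(self, name, kind):
        self.nodes[name] = Node(name, kind)

    def add_edge(self, source, target, port, bandwidth_mbps):
        # Refuse edges that point at entities the model does not know about.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append(Edge(source, target, port, bandwidth_mbps))

    def neighbors(self, name):
        """Names of nodes directly reachable from `name`, sorted."""
        return sorted({e.target for e in self.edges if e.source == name})


# Hypothetical servers and connections, invented for illustration.
network = SemanticNetwork()
network.add_node("web-01", "HTTP")
network.add_node("mail-01", "Mail")
network.add_node("time-01", "NTP")
network.add_edge("web-01", "mail-01", port=25, bandwidth_mbps=1.5)
network.add_edge("web-01", "time-01", port=123, bandwidth_mbps=0.01)
```

Keeping attributes on the edges rather than the nodes means the same two servers can be linked by several relationships, one per port or channel.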
But during the creation of any semantic network it is up to the user to define the entities and relationships. The nodes and edges of a semantic network, taken together, are called its domain and represent the model of the underlying information.
Of course, there is more than one way to model any given problem, but it is always best to approach the problem with the available data in mind. When a model has been decided upon, the source data should be parsed so as to populate a relational data set that follows the model.
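The parsing step might look something like the following sketch. The log format here (timestamp, source, destination, port, action) is invented for the example, as are the sample lines; real firewall or IDS logs will differ and need their own parser.

```python
import csv
import io

# Hypothetical log format, invented for this sketch:
# timestamp,src,dst,port,action
SAMPLE_LOG = """\
2014-08-01T10:00:00,10.0.0.5,10.0.0.9,22,ALLOW
2014-08-01T10:00:02,10.0.0.5,10.0.0.9,22,ALLOW
2014-08-01T10:00:03,10.0.0.7,10.0.0.9,80,DENY
"""


def parse_log(text):
    """Turn raw log lines into (nodes, edges) that follow the model.

    Nodes are IP addresses. An edge is a (src, dst, port) triple whose
    value counts how many log lines mentioned it.
    """
    nodes = set()
    edges = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 5:
            continue  # skip malformed lines rather than guess at their meaning
        _timestamp, src, dst, port, _action = row
        nodes.update((src, dst))
        key = (src, dst, int(port))
        edges[key] = edges.get(key, 0) + 1
    return nodes, edges


nodes, edges = parse_log(SAMPLE_LOG)
```

Counting repeated lines into a single weighted edge keeps the data set small while preserving how often two entities communicated.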
With the model and the data in hand, the next logical step is to derive insights from the shape of the resulting semantic network. A common method is to use force-directed layouts, where the data drives its own layout.
To get results, the semantic model is treated as a particle physics experiment. Each node is treated as a particle, and each edge is treated as an attracting or repelling force. Connected nodes will attract each other, and unconnected nodes will repel each other.
Many physics variables can be used to control the movement of the nodes (gravity, charge, mass, temperature, etc.) and bring the forces on the nodes into equilibrium. The result is usually a molecule-like layout where relational clusters are aggregated in the same areas.
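To make the idea concrete, here is a small, unoptimized sketch of a force-directed layout in plain Python. It follows the widely used spring-and-charge approach, in which every pair of nodes repels and each edge pulls its endpoints together; the specific force formulas, the cooling rate, and all parameter names are assumptions chosen for this illustration, not a description of any particular engine.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=200, seed=0):
    """Return {node: (x, y)} after a simple spring/repulsion simulation.

    `nodes` is a list of hashable names; `edges` is a list of (a, b) pairs.
    """
    if not nodes:
        return {}
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))  # ideal edge length (assumed heuristic)
    temperature = 0.1                # caps how far a node moves per step

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Repulsion: every pair of nodes pushes apart, like equal charges.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[a][0] += dx / dist * force
                disp[a][1] += dy / dist * force
                disp[b][0] -= dx / dist * force
                disp[b][1] -= dy / dist * force

        # Attraction: each edge pulls its two endpoints together, like a spring.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[a][0] -= dx / dist * force
            disp[a][1] -= dy / dist * force
            disp[b][0] += dx / dist * force
            disp[b][1] += dy / dist * force

        # Move each node, limited by the temperature, then cool down.
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            step = min(length, temperature)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temperature *= 0.98

    return {n: (p[0], p[1]) for n, p in pos.items()}


# Two hypothetical triangles with no edges between them.
nodes = ["a", "b", "c", "x", "y", "z"]
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z")]
layout = force_directed_layout(nodes, edges)
```

The pairwise repulsion loop is quadratic in the number of nodes, which is fine for a demonstration but is exactly the cost that larger data sets have to work around.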
The general concept is relatively simple, and by implementing a physics engine we can transform relational data, however loosely related, into a 2D or 3D structure (a visualization). Since the structure is defined by the relationships in the data, previously unnoticed clusters or patterns effectively highlight themselves. Consider the following example:
This image represents a graph of all email communication inside a company. All the nodes represent employees and the connections signify that an email was sent between them. This visualization instantly exposes three conditions: First, three main central clusters can be identified. This could mean that the company is organized in three offices or countries. Second, "data dust" is present throughout the image. One interpretation could be that some email addresses are trying to reach nonexistent or old ones that aren't connected with anything else; spam is one example. And finally, we can also easily see that certain nodes are connected in a group, displaying some sort of hierarchy in the communication (for example: managers, help desks, or mailing lists).
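Some of these observations can also be checked programmatically. The sketch below finds connected components in an email graph and separates large groups from tiny isolated ones, which is one possible way to flag "data dust". The names, the sample data, and the size threshold are assumptions for illustration. Note that this only finds groups with no links between them; dense clusters inside a single connected graph would need a community-detection method instead.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into components, treating each email edge as undirected."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


def split_clusters_and_dust(components, dust_max_size=2):
    """Separate larger clusters from tiny, isolated groups ("data dust")."""
    clusters = [c for c in components if len(c) > dust_max_size]
    dust = [c for c in components if len(c) <= dust_max_size]
    return clusters, dust


# Hypothetical addresses and emails, invented for illustration.
people = ["ann", "bob", "cat", "spam1", "old1", "idle"]
emails = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("spam1", "old1")]
clusters, dust = split_clusters_and_dust(connected_components(people, emails))
```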
Visualization becomes more complex as the size of the domain grows. In this era of big data, databases with millions or even billions of entries from security devices are increasingly common. But complexity can be kept manageable through entity grouping, sampling, or parallelization.
With entity grouping, the researchers create nodes that represent groups of entities rather than individual entities, such as team nodes instead of employee nodes. Perhaps more importantly, the level of detail, or drill-down depth, can vary according to the type of security data being used (firewall logs, traffic files, etc.). This gives access to the whole of the model without having to load all its constituent information up front.
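Entity grouping can be sketched as a simple aggregation step. In the example below, the employee-to-team mapping and the email edges are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical employee-to-team mapping and email edges, invented for illustration.
TEAM_OF = {"ann": "sales", "bob": "sales", "cat": "ops", "dan": "ops", "eve": "it"}
EMAILS = [("ann", "bob"), ("ann", "cat"), ("bob", "dan"),
          ("cat", "dan"), ("dan", "eve")]


def group_edges(edges, group_of):
    """Collapse entity-level edges into weighted group-level edges."""
    grouped = Counter()
    for a, b in edges:
        ga, gb = group_of[a], group_of[b]
        if ga == gb:
            continue  # traffic inside a group disappears at this level of detail
        grouped[tuple(sorted((ga, gb)))] += 1
    return grouped


def drill_down(edges, group_of, group):
    """Return the original edges that touch one group, for on-demand detail."""
    return [(a, b) for a, b in edges if group in (group_of[a], group_of[b])]


team_edges = group_edges(EMAILS, TEAM_OF)
it_detail = drill_down(EMAILS, TEAM_OF, "it")
```

The grouped view is what gets laid out first; the detailed edges for a given team are only fetched when someone drills into it.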
Sampling is another way to limit the size of the data set without losing sight of the big picture, by using a random or focused subset of the data set. Using the previous example, the designer could drop a random half (or some other fraction) of the employee emails. The results would then need to be interpreted with the understanding that sampling was used.
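Both kinds of sampling are short to express in code. The following sketch shows a random sample and a focused sample over a list of edges; the function names and the sample data are assumptions for illustration.

```python
import random


def sample_edges(edges, fraction, seed=None):
    """Keep a uniformly random fraction of the edges (at least one if any exist)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    keep = max(1, round(len(edges) * fraction)) if edges else 0
    return rng.sample(edges, keep)


def focused_sample(edges, nodes_of_interest):
    """Keep only edges that touch at least one node of interest."""
    wanted = set(nodes_of_interest)
    return [(a, b) for a, b in edges if a in wanted or b in wanted]


# Hypothetical email edges between numbered employees.
emails = [(f"emp{i}", f"emp{i + 1}") for i in range(10)]
half = sample_edges(emails, 0.5, seed=1)
around_emp3 = focused_sample(emails, ["emp3"])
```

Passing a fixed seed makes a random sample reproducible, which helps when the same picture needs to be regenerated and compared later.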
In today’s interconnected, data-rich IT environments, passive inspection of information is not enough. Smart data visualization, combined with intelligent data mining, holds the key to better understanding and solving the complex problems security researchers face today.
Thibault Reuille is a security researcher at OpenDNS and creator of OpenGraphiti, an open-source 3D data visualization engine. Prior to OpenDNS, he was a software engineer for Nvidia, where he helped develop the Nvidia Parallel Nsight integrated development environment for ...