Inexpensive Causality Analysis for Big Data

Understanding NodeMerge: a template-based efficient data reduction for big-data causality analysis[1]


An advanced persistent threat (APT) is an attack in which an unauthorized user gains access to a system or network and remains there for an extended period without being detected. The traditional countermeasure, tracing the attack from its initial penetration point to previously unknown attack steps, depends on colossal amounts of data collected through ubiquitous system monitoring, and storing that data is expensive.



With data being the new oil, attacks on data-centric enterprises are more sophisticated than ever. One such attack is the Advanced Persistent Threat (APT). The traditional way to counter APT attacks is to discover the initial penetration point and, from there, identify previously unknown attack steps. The issue with this technique is that it needs a massive amount of data, which is expensive to store. The research paper discussed here proposes a template-based reduction technique that either reduces the cost of storage or improves the performance of causality analysis under the same budget. After reading this blog post, you should know how to categorize different kinds of system events, how to process data online instead of storing all of it in cold storage, and, finally, how to reduce the amount of storage without degrading the quality of causality analysis.

According to the authors' experiments across different types of workloads, the technique reduces raw data by as much as 75.7 times and data already reduced by state-of-the-art[2] methods by 32.6 times.

I found this paper while looking for recent work in the area of big-data security. I was impressed to find a solution that reduces cost while preserving the quality of breach detection via causality analysis.


I wanted to address big-data security because we tend to ignore how important security is. We, as a community, are not addressing how to keep our data safe and secure and how to prevent breaches. In India, thousands of citizens had their Unique Identification Numbers (Aadhaar) leaked[3], Target (a US retailer) leaked the credit card information of about 40 million customers[4], and the list goes on. I think the reason for our ignorance is the friction involved in detecting breaches, and the biggest sources of that friction are time and cost. Enterprises do not make privacy and security their top priority because they spend their best resources, and almost all of them, on storage, business intelligence, and the like.

The paper focuses on removing this friction: the cost of causality analysis. It does so by reducing the storage space required for system events. With lower costs, enterprises and even startups can afford to conduct causality analysis, and to do it faster. Also, if one enterprise in an industry focuses on privacy, its competitors tend to follow. The challenge lies in templating the data.

The proposed algorithm categorizes data and places similar data in templates. Data reduction is a challenge not just because of categorization but also because data dependencies must be preserved. There has been research on data reduction and pattern recognition, but none of it addresses maintaining data dependencies.

A data dependency in computer science is a situation in which a program statement refers to the data of a preceding statement. In compiler theory, the technique used to discover data dependencies among statements is called dependence analysis. [5]
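To make the definition concrete, here is a minimal sketch of read-after-write dependence analysis over three straight-line statements. The `Statement` type and the `find_dependencies` helper are hypothetical names introduced for illustration; they do not come from the paper.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    label: str
    writes: frozenset  # variables the statement assigns
    reads: frozenset   # variables the statement uses


def find_dependencies(statements):
    """Return (earlier, later) label pairs where `later` reads a variable
    whose most recent writer is `earlier` (a read-after-write dependence)."""
    last_writer = {}
    deps = []
    for stmt in statements:
        for var in sorted(stmt.reads):
            if var in last_writer:
                deps.append((last_writer[var], stmt.label))
        for var in stmt.writes:
            last_writer[var] = stmt.label
    return deps


program = [
    Statement("S1", frozenset({"a"}), frozenset()),            # a = 1
    Statement("S2", frozenset({"b"}), frozenset({"a"})),       # b = a + 2
    Statement("S3", frozenset({"c"}), frozenset({"a", "b"})),  # c = a * b
]
dependencies = find_dependencies(program)
```

Causality analysis applies the same idea to system events instead of program statements: a process that reads a file depends on whichever process last wrote it.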

Enterprises use pattern recognition in some low-sensitivity, entry-level data systems. They avoid it elsewhere because maintaining data dependencies is an issue for them. Data classification is also in practice: enterprises categorize system data by sensitivity level and sometimes by data warehouse schema. However, neither approach discards data or reduces the actual cost of causality analysis; enterprises mostly categorize or classify data to enforce more access controls. Many index-based databases have been built over time to query system events, but without data reduction, using them is expensive. Hence, a combination of these properties is needed: online processing (for speed), reduced storage (for cost), and preserved quality of causality analysis.

Template-based reduction is not straightforward, for the following reasons:

  • Randomness of system behaviors
  • The large volume of data

NodeMerge provides a solution that learns about the system data online and then places it in templates. It works on read-only files because such files are a dead-end in causality analysis. The solution starts forming data templates and learning classification from the data stream.
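The paper's learning procedure is more involved than this post covers. The sketch below only illustrates the idea, under the assumption that a template is a set of read-only files that many process launches read together. `TemplateLearner`, `min_support`, and the sample paths are hypothetical, and exact-set counting stands in for whatever pattern mining the authors actually use.

```python
from collections import Counter


class TemplateLearner:
    """Counts sets of read-only files seen together, one launch at a time."""

    def __init__(self, min_support=2):
        self.min_support = min_support  # launches needed before a set becomes a template
        self.counts = Counter()

    def observe(self, files):
        """Feed the read-only files touched by one process launch."""
        files = frozenset(files)
        if len(files) > 1:  # a single file gives nothing to merge
            self.counts[files] += 1

    def templates(self):
        """File sets observed at least `min_support` times so far."""
        return [f for f, n in self.counts.items() if n >= self.min_support]


learner = TemplateLearner(min_support=2)
for launch in [
    {"/lib/libc.so", "/lib/libm.so", "/etc/ld.so.cache"},
    {"/lib/libc.so", "/lib/libm.so", "/etc/ld.so.cache"},
    {"/lib/libc.so", "/etc/passwd"},
]:
    learner.observe(launch)
templates = learner.templates()
```

Because `observe` takes one launch at a time, the learner can sit directly on the event stream and never needs the full history in memory.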

One example from the paper itself:

A typical commercial bank can have more than 200,000 hosts [46], which implies that such places may need about 140PB storage to host the data for a year! This amount of data reflects the urgency of effective data compression algorithms. However, since causality analysis is frequently used by security experts, traditional data compression techniques, such as 7-zip [35], is not an efficient solution. Using the data compressed by techniques like 7-zip requires an explicit, nontrivial decompression process. Whenever people need to use the data, they need to decompress the data. It is unacceptable. Instead, a decompression-free data reduction technique is required.
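A quick back-of-the-envelope check puts the quoted figures in per-host terms. This assumes decimal units (1 PB = 10^15 bytes) and an even spread across hosts, neither of which the quote states.

```python
HOSTS = 200_000
TOTAL_BYTES = 140 * 10**15  # 140 PB per year, assuming decimal units
DAYS_PER_YEAR = 365

# Storage per host, in gigabytes (10^9 bytes)
gb_per_host_per_year = TOTAL_BYTES / HOSTS / 10**9
gb_per_host_per_day = gb_per_host_per_year / DAYS_PER_YEAR

print(f"{gb_per_host_per_year:.0f} GB per host per year")
print(f"{gb_per_host_per_day:.2f} GB per host per day")
```

Under those assumptions, each host produces roughly 700 GB of monitoring data per year, or a little under 2 GB per day.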


The answer is an online reduction system. A reduction step that first caches a large amount of original data would defeat its own purpose, so the cache has to stay within a chosen limit. NodeMerge is therefore an online system that reduces the data directly from the stream.

The authors chose a decompression-free data schema, so the proposed solution has no explicit decompression step. Merged data is resolved on the fly during a query, without slowing down the causality analysis. An expensive decompression step could significantly reduce the speed of causality analysis; a decompression-free schema avoids that impact.
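One way to picture a decompression-free schema is a small template table kept next to the reduced event log, so a query resolves template ids while it scans. The layout below (`TEMPLATES`, `EVENTS`, `readers_of`) is a hypothetical illustration, not the schema from the paper.

```python
# Template table: template id -> the read-only files it stands for.
TEMPLATES = {"T1": frozenset({"/lib/libc.so", "/lib/libm.so"})}

# Reduced event log as (process, operation, object) tuples.
# A template id in the object slot represents every file in that template.
EVENTS = [
    ("bash", "read", "T1"),
    ("bash", "read", "/home/u/notes.txt"),
    ("vim", "read", "T1"),
]


def readers_of(path, events=EVENTS, templates=TEMPLATES):
    """Processes that read `path`, resolving template ids during the scan."""
    result = []
    for proc, op, obj in events:
        if op != "read":
            continue
        if obj == path or path in templates.get(obj, ()):
            result.append(proc)
    return result


libm_readers = readers_of("/lib/libm.so")
notes_readers = readers_of("/home/u/notes.txt")
```

The query never rebuilds the original log. It pays one set-membership check per template event, which is the sense in which no separate decompression pass is needed.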

How to avoid breaking system dependencies?

The authors chose merging read-only files as their primary reduction method. Read-only files are a "dead end" in causality analysis: nothing writes to them, so a backward trace stops when it reaches one. Combining multiple read-only files into one node therefore does not affect the system dependencies, and the dependencies needed for causality analysis remain intact after the reduction.
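The merge step can be sketched as follows. This is a simplified illustration: it ignores timestamps, and it merges only for processes that read every file in the template. The `merge_read_only` name and the event tuple layout are hypothetical.

```python
def merge_read_only(events, template_id, template_files):
    """Collapse reads of a template's files into one read of `template_id`
    per process. Skips the merge if any template file is ever written."""
    template_files = frozenset(template_files)
    written = {obj for _, op, obj in events if op == "write"}
    if written & template_files:
        return list(events)  # not read-only in this log: leave it untouched

    # Find processes that read the complete template.
    reads_by_proc = {}
    for proc, op, obj in events:
        if op == "read":
            reads_by_proc.setdefault(proc, set()).add(obj)
    full_readers = {p for p, files in reads_by_proc.items() if template_files <= files}

    reduced, emitted = [], set()
    for proc, op, obj in events:
        if op == "read" and obj in template_files and proc in full_readers:
            if proc not in emitted:  # emit the template read once per process
                emitted.add(proc)
                reduced.append((proc, "read", template_id))
        else:
            reduced.append((proc, op, obj))
    return reduced


raw = [
    ("bash", "read", "/lib/libc.so"),
    ("bash", "read", "/lib/libm.so"),
    ("bash", "write", "/tmp/out.log"),
    ("vim", "read", "/lib/libc.so"),
]
reduced = merge_read_only(raw, "T1", {"/lib/libc.so", "/lib/libm.so"})
```

Here the two library reads by `bash` collapse into one template read, while the write and the partial read by `vim` are kept as they were.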

How to deploy NodeMerge?

There are two options: i) centralized deployment and ii) distributed deployment.

Maintaining UX and security requires NodeMerge to be deployed on a centralized server. In a distributed deployment, each peer would have to bear the extra load, and online template learning would not be possible (as long as the objective is to learn templates for the whole network).

On average, NodeMerge used 1.3 GB of memory during reduction while processing 1,337 GB of data collected from 75 machines over 27 days. Each training cycle took 60 minutes to learn the templates. This result suggests that NodeMerge can reduce the data at a reasonable cost. The storage requirement decreased by up to 75.7 times on real-world enterprise data.
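To put these numbers in proportion, the short calculation below derives the average data rate per machine and the memory overhead relative to the data processed. It uses only the figures reported above, and the variable names are mine.

```python
TOTAL_DATA_GB = 1_337
MACHINES = 75
DAYS = 27
MEMORY_GB = 1.3

# Average monitoring data produced by one machine in one day
gb_per_machine_per_day = TOTAL_DATA_GB / MACHINES / DAYS

# Memory used by the reduction as a percentage of the data it processed
memory_overhead_pct = MEMORY_GB / TOTAL_DATA_GB * 100

print(f"{gb_per_machine_per_day:.2f} GB per machine per day")
print(f"{memory_overhead_pct:.3f}% memory overhead")
```

That works out to roughly 0.66 GB per machine per day, with the reduction holding about 0.1% of the processed volume in memory.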


The authors arrive at their design by working through several problems and their corresponding solutions. To judge it, we need to understand the threat and trust model of their algorithm.

Adversaries have full knowledge of the reduction algorithm and can gain control over the hosts.

An adversary cannot compromise the backend system as it remains in the trusted domain.

My take would be: Yes.

I am a software developer at the core, and I believe in a microservice architecture. It comes from the idea of decoupling problems and treating each one as a separate object. Thus, decoupling the network, network-host communication, and the hosts is essential. However, in reality, all systems work together, and if any one of these three areas gets compromised, the whole system is affected.

But our objective is not to prevent attacks; it is to perform causality analysis. In that sense, speed and cost reduction are okay even if done in silos. I have based my opinion on two observations: i) the design takes care of the centralized server (which is very hard for APT attacks to enter).

If I were in their place, I would have come up with an access-based or dynamic access-based solution and then used it in data discovery. I would also have reduced the logs (system data) using Canonical Logging.


I want to present the end as a graph:

Learn fixed patterns of read-only files → merge such files into one template, completing the reduction step and reducing the volume of data → evaluate the approach against real-world attacks and enterprise data, showing an 11.2 times improvement in storage capacity over the baseline approach → 1.3 GB of extra memory used on average for this reduction → accuracy of causality analysis preserved with a much smaller amount of data.

Future Work

I am looking to test this approach on cloud services and then develop a proof of concept to validate it. Once done, I plan to make it open source and to offer maintenance and support for the implementation as a paid service.


[1] Actual Research Paper

[2] State of the art: the level of development reached at any particular time as a result of the standard methodologies employed at the time.

[3] Indian state government leaks thousands of Aadhaar numbers

[4] Target Breach

[5] Data Dependency, Wikipedia



