Big Data Fundamentals needed for a Data Engineer

Aviral Srivastava
Dec 8, 2019


Working as a Data Engineer, you are directly connected to your data science teams and to some senior-level personnel who give you business use-cases. You have to be involved in the complete process to understand why a particular action should be taken. Hence, I have been reading Kimball and Ross, and Krishnan, for Big Data concepts.

This blog post will cover some fundamentals of the engineering involved in Big Data.

Working with Big Data

Big Data analytics comprises two steps:

  • acquisition
  • storage

While building pipelines for a data stream, we capture data in both structured and unstructured storage. Keeping track of the data in unstructured storage is what makes your integration good and keeps the data warehouse (and the data lake) alive and useful. This approach is called data virtualization.

Balancing between Traditional Analytics and Data Mining

Traditional analytics covers large subsets of a data collection to confirm gross relationships in the data, the origin of those relationships, and what those relationships mean. Data mining uncovers outliers and data that doesn’t fit in with the general flow.

Now, outliers may be just outliers, and the gross-relationship view can ignore the subtle details. You have to balance the two to derive the highest value from your BI system.

Structuring and Analyzing

After capturing data in a more or less raw state, it gets structured and re-structured repeatedly.

You get or build a hypothesis and, using it, you structure and re-structure the data. The objective is to allow everyone to explore and test that hypothesis. So, in a nutshell: structuring → data mining and analytics to test hypotheses → feedback on structuring. These three steps are followed iteratively to serve the BI team and draw KPIs on the way.

Presenting the Answers

Remember the objective: allow executives or analysts to derive actionable business intelligence.

Presenting the information with visualizations and reports is not enough. You have to give the BI team a platform or tool that lets them see your results in the context of their organizational data and mission.

Big Data should present itself.

This is your objective now. For example, consider tweets. They often come with a hashtag that describes what the tweet is about. Similarly, if you attach this kind of descriptive information to your structured dimensional data, it’ll become more useful for Big Data discoveries.

Visualizing Big Data

Problem: Visualizations that are useful today may be gone tomorrow.

Big Data varies like hell. Your structured data in a data warehouse has dimensions and measures that let you and others draw some visualizations, but even then the structures are transient, so the visualizations that are useful today may be gone tomorrow.

Kimball and Ross mention five practices for approaching this problem.

Structural Feedback Through Visualization: We often think of visualizations as something that serves our BI team, but we can use the same visualizations to refine our structures (the storage structures mentioned above). I have heard the same from experts: Big Data visualizations will help you refine your data lake, and then they’ll be thrown away.

Using Maps: Geospatial data is important and it gives you precision. So, either start using it if you have it, or make sure that you will soon have the capability to store geospatial data. This is something that the BI team won’t be able to come up with, but as a Data Engineer, you gotta think outside the box!

Dimensions: Dimensions are those quantities that let you correlate Big Data. One dimension that doesn’t vary the way Big Data does is time, or more precisely, the calendar dimension. Time plays an important role in understanding your Big Data. Just lay out time on the x-axis and measures (number of tweets, etc.) on the y-axis.

Categorization and Correlation: We all have data lakes containing data that requires multi-way categorization and correlation. Rookies jump to apply Machine Learning (no offense), but we may have a workaround here. Consider your connections on LinkedIn, Facebook, etc. This type of data doesn’t fit neatly with measures and axes. Now think about a large volume of such data: you are dead! You can solve this using a network graph, which draws each entity as a node and each connection as an edge between two nodes.


This allows effective exploration among connections that otherwise would be senseless. Make sure you check out supernodes.
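
To make the network graph idea a bit more concrete, here is a minimal sketch in plain Python. It assumes a "supernode" simply means a node with an unusually high number of connections; the names, the sample connections, and the `min_degree` threshold are all made up for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) connection pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def find_supernodes(graph, min_degree):
    """Return the nodes that have at least min_degree connections."""
    return sorted(
        node for node, neighbors in graph.items() if len(neighbors) >= min_degree
    )


# Hypothetical connection data, for illustration only.
connections = [
    ("ana", "bo"), ("ana", "cy"), ("ana", "dee"), ("ana", "eli"),
    ("bo", "cy"), ("dee", "fay"),
]

graph = build_graph(connections)
degrees = {node: len(neighbors) for node, neighbors in graph.items()}
supernodes = find_supernodes(graph, min_degree=4)  # threshold is an assumption

print(degrees)
print(supernodes)
```

In a real data lake the threshold would depend on how connections are distributed, so treat the cut-off here as a placeholder rather than a rule.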

Garbage In, Garbage Out

The first point, Structural Feedback Through Visualization, was about structured data. The same technique applies to Big Data, where it is needed even more. You use your visualizations to test, and often disprove, your hypothesis, and then you re-structure the storage of the data to generate new visualizations. It’s a vicious but necessary cycle.

Role of Data Science

I’m still confused between Data Science and Statistics. However, I do know (and I am told by my boss) that Data Science is used to develop and provide actionable insight.

Applying data science algorithms or statistics requires your large data sets to be in a particular structure (not the storage structure) throughout the pipeline (ETL). Data Scientists ensure that certain metrics and attributes are consistent throughout the dataset. So, we can say that Data Science is used in data conformance.

Predictive Analysis

The life of a data engineer would be a lot easier if predictive analysis simply worked.

Predictive analysis is used to identify trends in historical data and then point to possible causes of these trends.

It involves three things:

  1. Identifying Causation: I will start with why identifying causation is important. You must have seen the investment fund advertisements, “past performance is not a guarantee of future returns.” That’s true in all other business cases as well. The problem is in identifying the causes of the results with less relevant data, or with not knowing whether the data used for predictive analysis was relevant at all. Enter Big Data. With Big Data, our objective is to capture more and more comprehensive information around recorded performance and results. So, in a nutshell, i) We have to identify potential causal factors for what we have measured and what we hope to predict, and, ii) We have to set up the infrastructure to capture and index this information.
  2. Predictive Modeling: Why is predictive modeling needed? Because every prediction that your team is going to make is based on a model derived from analysis. Once you identify the possible causal factors, you have to choose algorithms and analytical models. As a Data Engineer, you will mostly only have to perform regression analysis using linear regression, so that you have a direct relationship between the causal factors and the measure of interest. Once you make the model, you start fitting data to it. If you don’t have enough data, you extract more. One more thing that you can leverage is time. The three most common ways to use time in a regression model are i) time as a causal factor for the variable, ii) duration or lifecycle, and iii) event triggering and lag.
  3. Prescriptive Analytics: After performing predictive modeling, you apply regressions to derive probabilities for certain actions that are under the control of the organization.
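
As a rough illustration of linear regression with a time lag, here is a small sketch in plain Python. The data is invented: I assume weekly ad spend is the causal factor and that it affects sales one week later. The function name `fit_line` and both series are hypothetical and are not taken from Kimball and Ross or Krishnan.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical weekly series, for illustration only.
ad_spend = [10, 20, 30, 40, 50]
sales = [30, 25, 45, 65, 85]

# Lag of one week: pair the spend of week t-1 with the sales of week t.
lag = 1
lagged_spend = ad_spend[:-lag]
later_sales = sales[lag:]

slope, intercept = fit_line(lagged_spend, later_sales)

# Predict next week's sales from the most recent spend.
next_week_sales = slope * ad_spend[-1] + intercept
print(slope, intercept, next_week_sales)
```

The lag is fixed at one week here purely by assumption. In practice you would try several lags and keep the one that explains the measure of interest best.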

Machine Learning

To perform the analysis of Big Data and other large data sets, machine learning is required. The objective is to define parameters for your analysis algorithm and then have them auto-adjusted based on the results of the algorithm.

  • Machine Learning is intended to regularize and automate the adjustments and corrections you make to your application as you go.
  • Machine learning can be used to identify adjustment points in your models.
  • Avoid defining “success” as a combination of values; make it a single value. You should know what you want, and be able to measure it, for machine learning to optimize it.
  • Algorithms for feedback into your models: Linear regression is the most straightforward. Other approaches include Bayesian classification, K-means clustering, neural networks, and decision trees.
  • You should collect historical data of predictive factors and observed outcomes so that you can train your machine learning algorithms and models.
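
Here is a minimal sketch of what "parameters auto-adjusted by the results" can look like, assuming a linear model trained with plain gradient descent. The single "success" value is the mean squared error on historical data. The learning rate, the number of epochs, and the data are assumptions picked for this toy example.

```python
def mean_squared_error(weight, bias, xs, ys):
    """The single value we optimize: average squared prediction error."""
    return sum((weight * x + bias - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Adjust weight and bias step by step using the error gradient."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [weight * x + bias - y for x, y in zip(xs, ys)]
        grad_weight = sum(2 * e * x for e, x in zip(errors, xs)) / n
        grad_bias = sum(2 * e for e in errors) / n
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


# Hypothetical history of a predictive factor and the observed outcome.
history_x = [0.0, 1.0, 2.0, 3.0, 4.0]
history_y = [1.0, 3.0, 5.0, 7.0, 9.0]

error_before = mean_squared_error(0.0, 0.0, history_x, history_y)
weight, bias = train(history_x, history_y)
error_after = mean_squared_error(weight, bias, history_x, history_y)
print(weight, bias, error_before, error_after)
```

Nothing in the loop is specific to this model: the same pattern of measuring one value and nudging the parameters applies to the other algorithms listed above, each with its own update rule.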

Addressing Velocity and Volume

Some common ways to address the velocity and volume of Big Data:

  • Throwing hardware at it: For dealing with Big Data, we need Big systems. The options are i) a DBMS clustered on big processors, ii) a disk farm swallowing all the data it can, iii) a distributed file system. You have to decide the tradeoffs.
  • No DBMS: Relational DBMSes are not in the equation for Big Data, and not just for performance reasons: Big Data has no use for integrity constraints and defies predefined structure.
  • Don’t Lose the Data: Durability and consistency are two RDBMS qualities that Big Data needs as well. RDBMSes can’t provide them for Big Data, because this is not the traditional world of a transactional system with a consistent state. With Big Data, we don’t even know what a consistent state would look like, not to mention the bottleneck that transactional consistency becomes in data acquisition.
  • Play it Cheap: commodity computers, commodity storage, and free and open-source software. The distributed network works best!

The CAP Theorem

The problem is to be sure that we have no data loss. Why is that a concern? Parallelizing gives us speed and can also give us redundancy in case of failure, but we still need to apply some scrutiny to validate whether data has been lost or not.

Using a distributed network, i.e. a distributed data storage system, you have to take care of three qualities:

  • Consistency: redundancy is inevitable but it should always be controlled. That is to say, “all of the copies of each piece of data are consistent with each other.”
  • Availability: the system is always available for read/write operations.
  • Partition Tolerance: In a massively parallel system with hundreds or thousands of nodes, partitioning occurs regularly. In these systems, the quality of being able to work correctly despite intermittent partitioning is highly desirable.

The CAP theorem says that you can have only two of the above three qualities at the same time. In 1999, UC Berkeley’s Eric Brewer proposed a conjecture that Consistency, Availability, and Partition Tolerance could not be simultaneously guaranteed in a distributed data system. This leaves us with having to choose which of these qualities is important and necessary, and which can be dropped or deemphasized.

Eventual Consistency

In the context of Big Data, availability and partition tolerance are of higher value than consistency. You know the fine-grained consistency that is essential in transaction processing? Yeah, that’s not needed in a high-velocity Big Data world. The mission statement then becomes: “Sooner or later, the data read out to all requestors of a particular data element will contain the last updated value.” Distributed databases such as Cassandra follow this concept.
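
To see what "sooner or later" means in practice, here is a toy sketch of two replicas that converge through a last-write-wins merge. This is only one simple strategy, assumed here for illustration. It is not the actual protocol of any particular system, and the `Replica` class and its methods are hypothetical.

```python
class Replica:
    """One node holding its own copy of the data as key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value, timestamp):
        """Keep the value only if it is newer than what we already hold."""
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        """Pull every entry from another replica; newer timestamps win."""
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


a, b = Replica("a"), Replica("b")

a.write("clicks", 10, timestamp=1)
b.sync_from(a)                      # both replicas agree on 10

a.write("clicks", 11, timestamp=2)  # the update has not reached b yet
stale_read = b.read("clicks")       # b still answers with the old value

b.sync_from(a)                      # replication catches up
fresh_read = b.read("clicks")       # now b returns the last updated value
print(stale_read, fresh_read)
```

Replica b stays available for reads the whole time. The price is that a reader can briefly see the old value, which is exactly the trade described above.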

Using Hadoop for Massively Parallel Processing

The fundamental technique unique to Big Data is massively parallel processing. But why? Because of the following:

  • Massively Distributed Data: To process large amounts of data in parallel, we first need a massively parallelized storage system. The premise is to have distributed storage spread not across a small number, but across a huge number of commodity computers with directly attached storage. The Hadoop Distributed File System (HDFS) was built on this premise.
  • MapReduce: MapReduce can be thought of as a relational GROUP BY with aggregate functions. Processing involves two steps: i) Map, ii) Reduce. You feed the data to Map, the output gets shuffled to Reduce, and then you collect the results. The Map operation takes a single piece of data and processes it into one or more key-value pairs. Each input value comes with a key as well, but this is not essential to the Map process. Mapping happens in parallel on all nodes in the cluster. After mapping, all of the output is “shuffled,” which means it is grouped by shared output key values and distributed to more nodes to be reduced. The Reduce operation takes as input one key and the list of values emitted for that key across the Hadoop nodes. It writes a single result for that key; essentially, this is an aggregation function. You should also try to keep shuffling to a minimum, as it slows down your parallel processing.
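
The Map, shuffle, and Reduce steps can be imitated on a single machine in a few lines of plain Python. This is a sketch of the idea only, using the classic word count. It does not use Hadoop, nothing runs in parallel, and the function names and sample records are made up.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record into (key, value) pairs."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single result."""
    return key, sum(values)


# Hypothetical input records, for illustration only.
records = ["big data is big", "data is useful"]

mapped = [pair for record in records for pair in map_phase(record)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

On a real cluster each call to `map_phase` and `reduce_phase` could run on a different node, and `shuffle` is the step that moves data over the network, which is why you want as little of it as possible.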

PySpark replacing Map-Reduce For Big Data

I won’t go into details, but to give you the gist: Spark, which you use from Python through PySpark, has replaced Hadoop MapReduce for many operations for the following reasons:

  • In-Memory operation
  • High-level interface via DataFrames
  • Spark SQL

Now, I will talk about Data-Driven Architecture for Managing Big Data and Business Intelligence

Metadata in the Business Intelligence System

The business intelligence system is data-driven, which is to say that it is metadata-driven. Unlike most of the data we have discussed up to this point, we will be storing this metadata in a database and fully normalizing it. This is also referred to as the BI Data Dictionary.

Krishnan has listed out 9 types of metadata:

  • Technical Metadata: The base level of our dictionary describes all of our data layouts, including names and type specifications for all columns, for example the tables used for ETL or for forwarding data. Technical metadata also includes a vaguer, less specific characterization of big data sources.
  • Business Metadata: While technical metadata gives us table names, business metadata describes the business entities which reside in those tables. This is also where the constraints and rules reside that dictate the formats and units of the business data we collect and analyze, along with the authority for that data. Many businesses include in this metadata a “Data Element Dictionary.” A data element dictionary defines the uses of field-level data types apart from a table. These can also be termed domains. Besides, business metadata often has a history of the definition and redefinition of its contents.
  • Contextual Metadata: This type of metadata helps you to identify the source or the circumstances of data acquisition. Context can also include publication date, mail or web server, retrieval dates, edition, references to that data, Twitter hashtags, etc.
  • Process-Design-Level Metadata: BI systems are always moving data around, so the processes involved in this movement are identified and specified in the process metadata. It lets you answer questions such as: i) What is the source system for the data? ii) Is there a schema, or just a table or file name? iii) What is the destination of this processing?
  • Programming Metadata: Part of our metadata management will be to capture, describe, and report the software we have built for our BI system. We need to provide source version references, the systems where we run it, and dependencies with other software, machine learning algorithms, etc. The best part is that the programs themselves can reference this data to record their progress and results.
  • Operational Metadata: It consists of scheduling information, sequencing, and run history. Take a look at Airflow’s DAG runs and what data it entails.
  • Infrastructure Metadata: It consists of the level of hardware used and operating system platform, and includes network addresses, network connectivity, and network connection information.
  • Algorithm Metadata: You need more specific and detailed information on analysis, even while it carries less intrinsic structural information. For the data warehouse, in contrast, analysis metadata is more like parameters and filters and such for known types of reports and visualization. Algorithmic metadata includes the name and category of the algorithm — pattern processing, regressions, machine learning, predictive analytics, etc.
  • Business Intelligence Metadata: This metadata stores how you ran the OLAP cube builder, the clickstream analyzer, the reporting system, and the data mining.
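
To give a feel for storing this metadata in a normalized database, here is a small sketch using Python's built-in `sqlite3` module. It covers only a slice of the technical and operational metadata. The table names, column names, and sample rows are my own assumptions, not Krishnan's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, minimal data dictionary: one row per dataset, one row per
# column (technical metadata), and one row per pipeline run (operational).
conn.executescript("""
CREATE TABLE dataset (
    dataset_id  INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    description TEXT
);
CREATE TABLE dataset_column (
    column_id   INTEGER PRIMARY KEY,
    dataset_id  INTEGER NOT NULL REFERENCES dataset(dataset_id),
    name        TEXT NOT NULL,
    data_type   TEXT NOT NULL
);
CREATE TABLE job_run (
    run_id      INTEGER PRIMARY KEY,
    dataset_id  INTEGER NOT NULL REFERENCES dataset(dataset_id),
    started_at  TEXT NOT NULL,
    status      TEXT NOT NULL
);
""")

conn.execute(
    "INSERT INTO dataset (dataset_id, name, description) VALUES (?, ?, ?)",
    (1, "tweets_raw", "Raw tweets as captured by the pipeline"),
)
conn.executemany(
    "INSERT INTO dataset_column (dataset_id, name, data_type) VALUES (?, ?, ?)",
    [(1, "tweet_id", "BIGINT"), (1, "posted_at", "TIMESTAMP"), (1, "hashtag", "TEXT")],
)
conn.execute(
    "INSERT INTO job_run (dataset_id, started_at, status) VALUES (?, ?, ?)",
    (1, "2019-12-08T00:00:00", "success"),
)

# Look up the layout of a dataset by name, as a program or analyst might.
layout = conn.execute(
    """
    SELECT d.name, c.name, c.data_type
    FROM dataset AS d
    JOIN dataset_column AS c ON c.dataset_id = d.dataset_id
    WHERE d.name = ?
    ORDER BY c.column_id
    """,
    ("tweets_raw",),
).fetchall()

last_status = conn.execute(
    "SELECT status FROM job_run WHERE dataset_id = ? ORDER BY run_id DESC LIMIT 1",
    (1,),
).fetchone()[0]
print(layout, last_status)
```

Because the dataset name lives in exactly one place, renaming a dataset or adding a new kind of metadata table does not require touching the column or run rows, which is the point of normalizing the dictionary.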