Big Data Fundamentals needed for a Data Engineer

Working as a Data Engineer, you are directly connected to your data science teams and to senior-level people who hand you business use cases. You have to be involved in the complete process to understand why a particular action should be taken. Hence, I have been reading Kimball and Ross, and Krishnan, for Big Data concepts.

This blog post will cover some fundamentals of the engineering involved in Big Data.

Working with Big Data

Big Data analytics comprises two broad steps: capturing the data, and then structuring and analyzing it.

While building pipelines for a data stream, we capture data in both structured and unstructured storage. Keeping track of the data in unstructured storage is what makes your integration good and keeps the data warehouse (and the data lake) alive and useful. This approach is called data virtualization.
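To make this concrete, here is a minimal sketch, assuming PySpark and made-up S3 paths and column names: the curated warehouse table and the raw lake data are both exposed as queryable views, so analysts can join across them without the data being physically consolidated.

```python
# A minimal sketch of the data virtualization idea, assuming PySpark and
# hypothetical paths/columns: query structured and raw storage through one
# engine instead of physically consolidating the data first.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("virtualization-sketch").getOrCreate()

# Structured side: a curated warehouse table stored as Parquet.
orders = spark.read.parquet("s3://warehouse/orders/")        # hypothetical path

# Semi-structured side: raw JSON events landed in the lake.
events = spark.read.json("s3://lake/raw/click_events/")      # hypothetical path

# Both become queryable views; the "virtual" layer is just the SQL interface.
orders.createOrReplaceTempView("orders")
events.createOrReplaceTempView("click_events")

spark.sql("""
    SELECT o.order_id, COUNT(e.event_id) AS clicks
    FROM orders o
    JOIN click_events e ON e.user_id = o.user_id
    GROUP BY o.order_id
""").show()
```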

Balancing between Traditional Analytics and Data Mining

Traditional analytics covers large subsets of a data collection to confirm gross relationships in the data, where those relationships originate, and what they mean. Data Mining uncovers outliers and data that doesn't fit the general flow.

Now, outliers may be just outliers, and the gross-relationship view can miss subtle details. You have to balance the two to derive the highest value from your BI system.

Structuring and Analyzing

After capturing data in a more or less raw state, it gets structured and re-structured repeatedly.

You get or build a hypothesis and, using that, you structure and re-structure the data. The objective is to allow everyone to explore and test that hypothesis. So, in a nutshell: structuring → data mining and analytics to test hypotheses → feedback on structuring. These three are repeated iteratively to serve the BI team and derive KPIs along the way.

Presenting the Answers

Remember the objective: allow executives or analysts to derive actionable business intelligence.

Presenting the information with visualizations and reports is not enough. You have to provide the BI team with a platform/tool that allows them to see your results in the context of their organizational data and mission.

Big Data should present itself.

This is your objective now. For example, consider tweets: they often come with a hashtag that describes what the tweet is about. Similarly, if you attach such descriptive information to your structured dimensional data, it'll become more useful for Big Data discoveries.
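Here's a small sketch of that, using made-up tweet records and pandas: pull the hashtags out of the free text and carry them along with the structured, dimensional view of the same data.

```python
# A small sketch with made-up tweet records: pull the hashtags out of free
# text and attach them to the structured (dimensional) rows so the tags can
# later drive Big Data discovery.
import re
import pandas as pd

tweets = pd.DataFrame({
    "tweet_id": [1, 2],
    "user_id": [101, 102],
    "text": ["Loving the new release #bigdata #etl", "Downtime again #outage"],
})

def extract_hashtags(text: str) -> list[str]:
    """Return the hashtags found in a tweet's text."""
    return re.findall(r"#(\w+)", text)

tweets["tags"] = tweets["text"].apply(extract_hashtags)

# Join the self-describing tags onto a (hypothetical) user dimension.
dim_user = pd.DataFrame({"user_id": [101, 102], "segment": ["pro", "free"]})
enriched = tweets.merge(dim_user, on="user_id", how="left")
print(enriched[["tweet_id", "segment", "tags"]])
```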

Visualizing Big Data

Problem: Visualizations that are useful today may be gone tomorrow.

Big Data varies like hell. Your structured data in a data warehouse has dimensions and measures that let you and others draw some visualizations, but even then the structures are transient, so the visualizations that are useful today may be gone tomorrow.

Kimball and Ross mention five practices for approaching this problem.

Structural Feedback Through Visualization: We often think of visualizations as serving the BI team, but we can use the same visualizations to refine our structures (the storage structures mentioned above). I have heard the same statement from experts: Big Data visualizations will help you refine your data lake and then they'll be thrown away.

Using Maps: Geospatial data is important and it gives you precision. So either start using it if you have it, or make sure you build the capability to store geospatial data soon. This is something the BI team won't necessarily come up with, but as a Data Engineer, you gotta think outside the box!

Dimensions: Dimensions are the quantities that let you correlate Big Data. One dimension that doesn't vary the way Big Data does is time, or more precisely the calendar dimension. Time plays an important role in understanding your Big Data: just lay out time on the x-axis and measures (number of tweets, etc.) on the y-axis.
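For example, a quick sketch with made-up daily counts and matplotlib: the calendar dimension stays stable even while everything else about the Big Data shifts.

```python
# A minimal sketch of the "time on the x-axis" advice, using made-up daily
# tweet counts as the measure.
import pandas as pd
import matplotlib.pyplot as plt

daily = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=7, freq="D"),
    "tweet_count": [120, 95, 140, 300, 280, 150, 90],   # hypothetical measure
})

plt.plot(daily["date"], daily["tweet_count"], marker="o")
plt.xlabel("Calendar date (time dimension)")
plt.ylabel("Number of tweets (measure)")
plt.title("Measure laid out against the calendar dimension")
plt.tight_layout()
plt.show()
```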

Categorization and Correlation: We all have data lakes that contain data requiring multi-way categorization and correlation. Rookies jump to apply Machine Learning (no offense), but we may have a workaround here. Consider connections on LinkedIn, Facebook, etc. This kind of data doesn't fit neatly into measures and axes, and given the large volume of such data, you are dead! You can solve this using a network graph; an example is shown below.

(Example network graph. Source: https://commons.wikimedia.org/wiki/File:Presocratic_graph.svg)

This allows effective exploration of connections that would otherwise be senseless. Make sure you check out supernodes.
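A small sketch of the idea, assuming networkx and a made-up connection list: build the graph and flag supernodes (nodes with far more connections than average) instead of forcing the data onto measures and axes.

```python
# A hedged sketch: model LinkedIn/Facebook-style connections as a graph and
# flag "supernodes" via a crude degree threshold. The connection list and the
# threshold are purely illustrative.
import networkx as nx

connections = [
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("alice", "erin"), ("bob", "carol"), ("dave", "erin"),
]

g = nx.Graph()
g.add_edges_from(connections)

# A crude supernode check: degree well above the average degree.
avg_degree = sum(d for _, d in g.degree()) / g.number_of_nodes()
supernodes = [n for n, d in g.degree() if d > 1.5 * avg_degree]
print("average degree:", avg_degree)
print("supernodes:", supernodes)
```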

Garbage In, Garbage Out

The first point, Structural Feedback Through Visualization, was about structured data. The same technique applies, and is needed even more, for Big Data. So you use your visualizations to disprove your hypothesis, and then you re-structure the data storage to generate new visualizations. It's a vicious but required cycle.

Role of Data Science

I'm still confused about the line between Data Science and Statistics. However, I do know (and I am told by my boss) that Data Science is used to develop and provide actionable insight.

Applying data science algorithms or statistics requires your large data sets to be in a particular structure (not storage structure) throughout the pipeline (ETL). Data Scientists ensure that certain metrics and attributes are consistent throughout the dataset. So we can say that Data Science is used in data conformance.
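As an illustration, here's a minimal conformance check with hypothetical column names: verify that an attribute and a metric stay consistent across every source extract feeding the pipeline.

```python
# A minimal conformance-check sketch with invented column names: make sure a
# categorical attribute and a metric stay consistent across source extracts.
import pandas as pd

ALLOWED_SEGMENTS = {"pro", "free", "trial"}

def check_conformance(df: pd.DataFrame, source: str) -> list[str]:
    """Return human-readable conformance violations for one source extract."""
    problems = []
    bad_segments = set(df["customer_segment"].dropna()) - ALLOWED_SEGMENTS
    if bad_segments:
        problems.append(f"{source}: unexpected segment values {bad_segments}")
    if (df["revenue_usd"] < 0).any():
        problems.append(f"{source}: negative revenue_usd values found")
    return problems

crm = pd.DataFrame({"customer_segment": ["pro", "Free"], "revenue_usd": [10.0, 5.0]})
billing = pd.DataFrame({"customer_segment": ["trial"], "revenue_usd": [-1.0]})

for name, frame in [("crm_extract", crm), ("billing_extract", billing)]:
    for problem in check_conformance(frame, name):
        print(problem)
```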

Predictive Analysis

The life of a data engineer would be a lot easier if predictive analysis actually started working.

Predictive analysis is used to identify trends in historical data and then point to possible causes of those trends.

It involves three things:

  1. Identifying Causation: I will start with why identifying causation is important. You must have seen the investment fund advertisements: “past performance is not a guarantee of future returns.” That's true in most other business cases as well. The problem is identifying the causes of results when the data is only loosely relevant, or when you don't know whether the data used for predictive analysis was relevant at all. Enter Big Data. With Big Data, our objective is to capture more and more comprehensive information around recorded performance and results. So, in a nutshell: i) we have to identify potential causal factors for what we have measured and what we hope to predict, and ii) we have to set up the infrastructure to capture and index this information.
  2. Predictive Modeling: Why is predictive modeling needed? Because every prediction your team makes is based on a model derived from analysis. Once you identify the possible causal factors, you have to choose algorithms and analytical models. As a Data Engineer, you will mostly perform regression analysis using linear regression, so that you have a direct relationship between the causal factors and the measure of interest. Once you build the model, you start fitting data to it; if you don't have enough data, you extract more. One more thing you can leverage is time. The three most common ways to use time in a regression model are i) time as a causal factor for the variable, ii) duration or lifecycle, and iii) event triggering and lag. A small sketch of this follows the list.
  3. Prescriptive Analytics: After the predictive modeling, you apply the regressions to derive probabilities for the outcomes of actions that are under the organization's control.
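Here's the promised sketch of the regression step, using made-up monthly data and scikit-learn: time enters the model both directly (a trend term) and as a lag of the measure, two of the common uses of time listed above.

```python
# A regression sketch with made-up monthly data: the month index acts as a
# trend (time as a causal factor) and last month's value acts as a lag feature.
import numpy as np
from sklearn.linear_model import LinearRegression

sales = np.array([100, 110, 125, 130, 150, 160, 175, 190, 205, 220], dtype=float)
month = np.arange(len(sales))

# Features: the month index (trend) and the previous month's sales (lag-1).
X = np.column_stack([month[1:], sales[:-1]])
y = sales[1:]

model = LinearRegression().fit(X, y)
next_month = np.array([[len(sales), sales[-1]]])
print("coefficients:", model.coef_)
print("predicted next value:", model.predict(next_month)[0])
```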

Machine Learning

To analyze Big Data and other large data sets, machine learning is required. The objective is to define parameters for your analysis algorithm and then have them automatically adjusted by the results of the algorithm.
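One way to read "parameters auto-adjusted by the results" is hyperparameter search; here's a hedged sketch using scikit-learn's grid search, which refits the model at different parameter settings and keeps whichever scores best on held-out data.

```python
# A hedged sketch of parameters being adjusted by results: grid search tries
# candidate parameter values and keeps the best-scoring one. Data is synthetic.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},   # candidate parameter values
    cv=5,
)
search.fit(X, y)
print("best parameter chosen by the results:", search.best_params_)
```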

Addressing Velocity and Volume

Some common ways to address the velocity and volume of Big Data are covered in the sections below.

The CAP Theorem

The problem statement is to be sure that we have no data loss. Why does this come up? Because parallelizing gives us speed and can also give us redundancy in case of failure, but we need to apply some scrutiny to validate whether we are losing data or not.

When using a distributed network, i.e. distributed data storage systems, you have to take care of three qualities: consistency, availability, and partition tolerance.

The CAP theorem says you can only guarantee two of these three qualities at once. In 1999, UC Berkeley's Eric Brewer proposed the conjecture that consistency, availability, and partition tolerance cannot all be simultaneously guaranteed in a distributed data system. This leaves us having to choose which of these qualities are important and necessary, and which can be dropped or de-emphasized.

Eventual Consistency

In the context of Big Data, availability and partition tolerance are of higher value than consistency. The fine-grained consistency that is essential in transaction processing? That's not needed in a high-velocity Big Data world. The mission statement then becomes: “Sooner or later, the data read out to all requestors of a particular data element will contain the last updated value.” HDFS and other distributed file systems and databases (e.g., Cassandra) follow this concept.
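For example, with the DataStax Cassandra driver (and a hypothetical keyspace and table), choosing a weak consistency level trades strict consistency for availability and lets replicas converge later:

```python
# A hedged illustration with the DataStax Cassandra driver and a hypothetical
# keyspace/table: a weak consistency level (ONE) favours availability and
# accepts that replicas catch up later, i.e. eventual consistency.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])            # assumes a local test cluster
session = cluster.connect("metrics")        # hypothetical keyspace

# Reads acknowledged by a single replica: fast and available, but possibly
# returning a slightly stale value until all replicas converge.
query = SimpleStatement(
    "SELECT value FROM page_views WHERE page_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(query, ["home"]).one()
print(row)
```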

Using Hadoop for Massively Parallel Processing

The fundamental technique unique to Big Data is massively parallel processing. Why? Because the volume and velocity of the data are far beyond what a single machine can ingest and process in a reasonable time, so the work has to be split across many nodes working in parallel.
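To show the shape of it, here's a toy, single-machine sketch of the map-shuffle-reduce pattern that Hadoop distributes across many nodes.

```python
# A toy sketch of the map-shuffle-reduce pattern that Hadoop applies across
# many machines; here it runs on one machine purely to show the shape of it.
from collections import defaultdict

documents = ["big data needs parallel processing", "parallel processing scales out"]

# Map: emit (word, 1) pairs independently per document -- trivially parallel.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values for the same key together.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: aggregate each key's values -- also parallel, one key per worker.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```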

PySpark Replacing MapReduce for Big Data

I won't go into details, but to give you the gist, PySpark (Spark) has replaced Hadoop MapReduce for many operations, mainly because it keeps intermediate data in memory rather than writing it to disk between stages, which makes iterative and interactive workloads much faster, and because its API is far more expressive than hand-written map and reduce jobs.
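A minimal PySpark sketch of the same word count, with a hypothetical input path: the intermediate RDD can be cached in memory and reused across analyses, which is a large part of why Spark beats disk-bound MapReduce for iterative work.

```python
# A minimal PySpark word-count sketch with a hypothetical input path: caching
# keeps the intermediate RDD in memory so it can be reused without re-reading.
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
lines = spark.sparkContext.textFile("s3://lake/raw/tweets/*.txt")   # hypothetical

words = lines.flatMap(lambda line: line.split()).cache()   # kept in memory
counts = words.map(lambda w: (w, 1)).reduceByKey(add)

print(counts.take(10))   # reuse `words` for other analyses without re-reading
```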

Now, let's talk about data-driven architecture for managing Big Data and Business Intelligence.

Metadata in the Business Intelligence System

The business intelligence system is data-driven, which is to say it is metadata-driven. Unlike most of the data we have discussed up to this point, we will store this metadata in a database and fully normalize it. This is also referred to as the BI Data Dictionary.
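As a hedged sketch of what a normalized BI data dictionary might look like (invented table and column names, SQLite purely for illustration): metadata about datasets and their attributes lives in its own small relational schema that drives the BI system.

```python
# A hedged sketch of a normalized BI data dictionary, using SQLite and
# invented table/column names: datasets and their attributes are described
# in separate, related tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (
        dataset_id    INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        source_system TEXT NOT NULL
    );
    CREATE TABLE attribute (
        attribute_id INTEGER PRIMARY KEY,
        dataset_id   INTEGER NOT NULL REFERENCES dataset(dataset_id),
        name         TEXT NOT NULL,
        data_type    TEXT NOT NULL,
        description  TEXT
    );
""")
conn.execute("INSERT INTO dataset VALUES (1, 'orders', 'erp')")
conn.execute(
    "INSERT INTO attribute VALUES (1, 1, 'order_date', 'DATE', 'Calendar date of the order')"
)
for row in conn.execute(
    "SELECT d.name, a.name, a.data_type FROM attribute a JOIN dataset d USING (dataset_id)"
):
    print(row)
```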

Krishnan lists out nine types of metadata.

