I wanted to understand how UNIX became so standardized and developer-friendly.
Before engineers at Bell Labs even conceived of UNIX, the computer industry suffered from unproductive large-scale projects: too many people, too many inefficiencies, delays, and improper planning. UNIX made computers more productive.
Unlike hardware, software faces constant demands from its users for new or modified features. This called for building software in a change-tolerant way.
The paper, “The UNIX Time-Sharing System” by Dennis M. Ritchie and Ken Thompson of Bell Laboratories, is available here.
UNIX is an OS designed to be general-purpose, multi-user, and interactive (that is, you…
This blog post covers my internship experience at Wingify from Dec ’17 to April ’18. I am writing it to answer three questions:
How did I come to know about Wingify, and why did I decide to intern there?
How did I get an internship at Wingify?
How did my two-month internship get converted, by mistake, into a six-month internship that landed me a significant project?
What was my…
Apache Flink is an open-source distributed platform that processes data in both stream and batch modes. As a distributed system, Flink provides fault tolerance for data streams.
To make our platform fault-tolerant, we want to preserve the system’s state from time to time. Achieving fault tolerance means losing no data when one or more nodes in the network go down.
Flink offers three options for a state backend, one of which is RocksDB. However, only RocksDB can provide incremental checkpoints in Flink. …
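To make this concrete, here is a minimal sketch (my own, not taken from the post) of enabling the RocksDB backend with incremental checkpoints using Flink's Java API as of 1.13; the checkpoint path and interval are illustrative assumptions:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB keeps operator state on local disk; passing `true`
        // enables incremental checkpoints, so each checkpoint uploads
        // only the SST files created since the previous one.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Where completed checkpoints are stored (illustrative path).
        env.getCheckpointConfig()
           .setCheckpointStorage("hdfs:///flink/checkpoints");

        // Snapshot the pipeline's state every 60 seconds (illustrative).
        env.enableCheckpointing(60_000);

        // ... define sources, transformations, and sinks here ...

        env.execute("incremental-checkpoint-job");
    }
}
```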
A large-scale distributed system that can support cyclic dataflows.
Back in 2013, every distributed programming model was tightly coupled with the engine that executed it, so interoperability was an issue. It remains an issue today, though less so, thanks to systems like Naiad and Spark. Before Naiad, we had systems like MapReduce for batch processing, TimeStream for stream processing, and Giraph for graph/vertex processing.
Naiad is a distributed cyclic dataflow system that offers both batch and stream processing on top of the same engine. It introduced a new computational model…
After working with MongoDB for about six years, I am sharing some practices that have worked well for me.
In relational databases, the schema is statically defined. In document databases such as MongoDB, the schema is dynamic and follows the structure of the documents; MongoDB is, in effect, schemaless.
The structure of documents should be application-driven, with the design focused on access patterns. Prioritizing access patterns usually means duplicating data across documents, which can result in inconsistency. Linking between documents, however, creates a structure that avoids these consistency issues, as sketched below.
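As a rough sketch of this trade-off, consider a hypothetical blog schema (the collection and field names are mine, purely for illustration), using the MongoDB Java driver:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import org.bson.types.ObjectId;

import java.util.List;

public class EmbedVsLink {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoDatabase db = client.getDatabase("blog");

        // Embedding: comments live inside the post document, so the
        // main access pattern (render a post with its comments) needs
        // one read, but data such as author names gets duplicated.
        db.getCollection("posts").insertOne(new Document("title", "Hello")
                .append("comments", List.of(
                        new Document("author", "alice").append("text", "Nice!"))));

        // Linking: comments reference the post by _id. Each fact is
        // stored once, avoiding inconsistency, at the cost of a second query.
        ObjectId postId = new ObjectId();
        db.getCollection("posts").insertOne(
                new Document("_id", postId).append("title", "Hello again"));
        db.getCollection("comments").insertOne(
                new Document("postId", postId)
                        .append("author", "alice").append("text", "Also nice!"));

        client.close();
    }
}
```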
A rule of thumb that I have encountered from multiple…
This blog post discusses the research paper “ZooKeeper: Wait-free coordination for Internet-scale systems.” Click here to download the paper.
Coordination is of utmost importance in distributed applications. While building distributed systems, we need primitives like distributed lock services, shared registers, group messaging, and so on. Instead of implementing these ourselves, we can use the primitives ZooKeeper provides off the shelf.
It is like a library that you import into your code and build your own logic on top of.
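For example, a naive distributed lock can be sketched on top of ZooKeeper's Java client in a few lines. This is a simplified illustration with an assumed znode path, not the paper's full lock recipe, which uses sequential znodes and watches to avoid the herd effect:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NaiveZkLock {
    private final ZooKeeper zk;
    private final String lockPath; // e.g. "/locks/my-resource" (assumption)

    public NaiveZkLock(ZooKeeper zk, String lockPath) {
        this.zk = zk;
        this.lockPath = lockPath;
    }

    // Try to grab the lock by creating an ephemeral znode. If it already
    // exists, another client holds the lock. Ephemeral nodes vanish when
    // the owning session dies, so a crashed holder cannot block others.
    public boolean tryLock() throws KeeperException, InterruptedException {
        try {
            zk.create(lockPath, new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            return false;
        }
    }

    public void unlock() throws KeeperException, InterruptedException {
        zk.delete(lockPath, -1); // version -1 matches any version
    }
}
```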
A shared (read-write) register, sometimes just called a register, is a fundamental type of shared data structure which stores a value and has two…
This article explains the research paper “Patience is a Virtue: Revisiting Merge and Sort on Modern Processors,” published by Microsoft Research.
I intend to explain the paper in a simple manner. I chose this algorithm because of its application scope: infrastructure monitoring and historical analysis of logs (event-based data). I have observed how tedious it is to sort event-log datasets using algorithms like Timsort, especially when monitoring systems in real time.
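For intuition, here is a compact patience sort in Java (my own simplified version, not the paper's cache-optimized one): elements are dealt into ascending runs ("piles"), so nearly-sorted input such as timestamped event logs forms very few piles, and a cheap k-way merge finishes the job.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class PatienceSort {
    public static List<Integer> sort(int[] input) {
        // Deal elements into ascending runs: append to the first pile
        // whose tail is <= the element, otherwise start a new pile.
        List<ArrayDeque<Integer>> piles = new ArrayList<>();
        for (int x : input) {
            boolean placed = false;
            for (ArrayDeque<Integer> pile : piles) {
                if (pile.peekLast() <= x) {
                    pile.addLast(x);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                ArrayDeque<Integer> pile = new ArrayDeque<>();
                pile.addLast(x);
                piles.add(pile);
            }
        }

        // k-way merge: a min-heap ordered by the head of each pile.
        Comparator<ArrayDeque<Integer>> byHead =
                Comparator.comparing(ArrayDeque::peekFirst);
        PriorityQueue<ArrayDeque<Integer>> heap = new PriorityQueue<>(byHead);
        heap.addAll(piles);

        List<Integer> sorted = new ArrayList<>(input.length);
        while (!heap.isEmpty()) {
            ArrayDeque<Integer> pile = heap.poll();
            sorted.add(pile.pollFirst());
            if (!pile.isEmpty()) heap.offer(pile); // re-insert with new head
        }
        return sorted;
    }

    public static void main(String[] args) {
        int[] log = {1, 2, 5, 3, 4, 6, 8, 7, 9}; // almost sorted: few piles
        System.out.println(sort(log)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```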
A blog post that explains the research paper “The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb.”
While working as a data engineer, I often face the question of data storage: should data be kept in fast storage, in cold storage, or somewhere in between? Usually, I focus on the frequency of reads and make sure I am using tools suited to that access pattern. However, I do not want my observations to be anecdotal and hence wanted to look at some fundamentals. So, I started reading a research paper by Jim Gray…
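The heart of the paper is a break-even formula: a page is worth caching in RAM if it is re-referenced within (pages per MB ÷ accesses per second per disk) × (disk price ÷ RAM price per MB) seconds. Below is a toy calculation with approximate 1997-era figures; treat the numbers as illustrative and check the paper for the exact ones.

```java
public class FiveMinuteRule {
    public static void main(String[] args) {
        // Approximate 1997-era figures (illustrative assumptions).
        double pagesPerMbOfRam       = 128;   // 8 KB pages
        double accessesPerSecPerDisk = 64;
        double pricePerDiskDrive     = 2000;  // USD
        double pricePerMbOfRam       = 15;    // USD

        // Break-even reference interval: cache a page in RAM if it is
        // re-read more often than once per this many seconds.
        double breakEvenSeconds =
                (pagesPerMbOfRam / accessesPerSecPerDisk)
              * (pricePerDiskDrive / pricePerMbOfRam);

        System.out.printf("Break-even interval: %.0f seconds (~%.1f minutes)%n",
                breakEvenSeconds, breakEvenSeconds / 60);
        // Prints roughly 267 seconds, i.e. about five minutes.
    }
}
```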
Dimensional modeling (DM) is part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball, which includes a set of methods, techniques, and concepts for use in data warehouse design. The approach focuses on identifying the key business processes within a business and modeling and implementing these first before adding additional business processes: a bottom-up approach.
The objectives laid out by Ross and Kimball are straightforward:
Working as a data engineer, you are directly connected to your data science teams and to senior-level personnel who bring you business use cases. You have to be involved in the complete process to understand why a particular action should be taken. Hence, I have been reading Ross & Kimball and Krishnan for Big Data concepts.
This blog post will cover some fundamentals of the engineering involved in Big Data.
Big Data analytics comprises two steps:
While building pipelines for a data stream, we capture data in both structured and unstructured storage. Keeping track of the data in…
Code + Data.