Data Engineering

Getting Started with Data Selection

A data engineer's start to simplification

Kshitij Gaikar


Introduction

Folks often jump straight into KPIs (Key Performance Indicators) without understanding why those KPIs are needed. I have met clients who dumped all the data they had and never figured out what they really wanted to achieve. Old retail players hire consultants who have experience analysing sales, units, churn, etc. but who never really understood how a big chunk of data has to be narrowed down to make sense. This post is meant to help those who manage humongous data but want to cut it down to derive results much faster.

One of my key responsibilities as a Data Engineer, as the name suggests, is to look at data. As simple as it sounds, it can be difficult to navigate a data diagram that might contain hundreds of tables. I have seen cases where this number hit its limits and a client had to spend 6 months just to navigate to what they really wanted. As we read on, we will try to understand how simple it is to break multiple files into pieces that tie together in an organised way.

The Data

No data is better than bad data.

You might have heard this statement multiple times. Imagine we are delivering a report. The report is based on certain KPIs which an engineer has helped an analyst or a consultant narrow down. The report is published, actions are taken based on it, and we eventually find that things went south. This would lead not only to a loss of millions of dollars but also to a loss of the client's confidence, which can potentially damage our image among other clients.

Remember when La La Land was mistakenly announced as Best Picture at the Oscars? The presenter had been handed the wrong envelope.

Back to the question: how do we narrow a huge chunk of data down to smaller pieces that make sense?

First, talk to the client and check what they really want to achieve. As a Data Engineer, you are closer to the business than to the data itself. A Data Engineer can fill the shoes of a consultant, but the reverse is not true.

Divide the data into three pieces:

  1. Subjects (entities of analysis)
  2. KPIs
  3. Logs (this can be a transaction log or any log that ties different subjects together)
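As a rough illustration, the three pieces could be modelled like this in Python. The retail names used here (`Customer`, `Sale`, `total_sales`) are hypothetical and only stand in for whatever the client's data actually contains.

```python
from collections import defaultdict
from dataclasses import dataclass


# Subject: an entity of analysis (hypothetical example: a customer).
@dataclass(frozen=True)
class Customer:
    customer_id: str
    city: str


# Log: one row per event, tying subjects together (customer and product here).
@dataclass(frozen=True)
class Sale:
    customer_id: str
    product_id: str
    units: int
    amount: float


def total_sales(log):
    """KPI: total sales amount per customer, computed from the log."""
    totals = defaultdict(float)
    for row in log:
        totals[row.customer_id] += row.amount
    return dict(totals)


customers = [Customer("C1", "Pune"), Customer("C2", "Mumbai")]
sales_log = [
    Sale("C1", "P1", 2, 40.0),
    Sale("C1", "P2", 1, 15.5),
    Sale("C2", "P1", 3, 60.0),
]

kpi = total_sales(sales_log)
```

The point of the split is that the KPI is always derived from the log and keyed by a subject, so each of the three pieces has one clear job.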

Take the Excel example below.

You receive hundreds of flat files and are then asked to load the data and build pipelines for deriving meaningful insights.

The question here is: what are meaningful insights?

  1. Is it loading any data set into production and then giving the client some view of it?
  2. Is it certain key metrics that have a huge impact on the client's revenue?
  3. Or is it some potential metrics that indirectly measure the key factors behind different jumps?

Most of the time, it's going to be number 2. The first task is always to determine what drives your client.

This is also how you build a product. At the core, there must lie some data model with a fixed set of subjects and their KPIs. Once this model is fixed, you can add data that relates to these subjects. It's like hot chocolate: the main components are milk and chocolate powder. After that you can add different flavours, sugar, no sugar, sugar-free, etc.

Now that you have decided to figure out the core, how do you know what it is? This is simple: most clients have a data dictionary that tells you what each data piece is. Once you know how to tie these pieces together, try to build your own data model and see what lies at the centre.
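One way to sketch "see what lies at the centre" is to count how many relationships each table has. The dictionary below is a made-up example, and counting links is only a heuristic I am assuming here; a real data dictionary still needs a human read.

```python
from collections import Counter

# Hypothetical data dictionary: table name -> tables it references via keys.
data_dictionary = {
    "transactions": ["customers", "accounts", "cards", "branches"],
    "accounts": ["customers"],
    "cards": ["accounts"],
    "customers": [],
    "branches": [],
}


def find_centre(dictionary):
    """Return the table with the most relationships (incoming plus outgoing)."""
    links = Counter()
    for table, references in dictionary.items():
        links[table] += len(references)  # outgoing references
        for ref in references:
            links[ref] += 1              # incoming references
    return links.most_common(1)[0][0]


centre = find_centre(data_dictionary)
```

In this toy dictionary the most connected table is `transactions`, which fits the idea that the centre of the model is usually a log.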

Data Diagram

Once you have figured out the centre, it is the place to start.

Most of the time you will be provided with data diagrams like this, which makes things easier, but if you aren’t, just go for the centre.

The next part is to figure out the three pieces: subjects, KPIs and logs.

The centre itself is a log. For subjects, look for anything that can be related to a person, for example a card, an account, an ID, a house, a location, a policy, etc.

Then anything that is quantified is a metric, but you don’t have to take them all, just the important ones.
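A minimal sketch of that split, assuming the log's key columns end in `_id` (a naming convention I am inventing for the example) and that numeric columns are the quantified ones:

```python
# Hypothetical rows from the central log table.
log_rows = [
    {"customer_id": "C1", "card_id": "K9", "amount": 120.0, "fee": 1.2, "channel": "web"},
    {"customer_id": "C2", "card_id": "K4", "amount": 80.0, "fee": 0.8, "channel": "store"},
]


def split_columns(rows):
    """Split log columns into subject keys, metric candidates and the rest."""
    subjects, metrics, other = [], [], []
    for column, value in rows[0].items():
        if column.endswith("_id"):             # assumed naming convention for keys
            subjects.append(column)
        elif isinstance(value, (int, float)):  # quantified, so a metric candidate
            metrics.append(column)
        else:
            other.append(column)
    return subjects, metrics, other


subjects, metrics, other = split_columns(log_rows)

# Keep only the metrics the client actually cares about (an assumed choice here).
important_metrics = [m for m in metrics if m in {"amount"}]
```

The leftover columns in `other` and the metrics you did not pick are not thrown away; they are simply parked until the core model is stable.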

Getting to Insights

Visualisation is key to quick decisions. Be careful — decisions can be good or bad.

Now that the data model is fixed, the next step is to validate and visualise the data. This can be done in Excel, Tableau, Power BI, etc.
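Validation does not need a tool at all to begin with. Here is a small sketch of the kind of checks I mean, with hypothetical column names (`customer_id`, `amount`) and made-up rules; the real rules come from the client.

```python
def validate(log, known_customers):
    """Return a list of human-readable problems found in the log."""
    problems = []
    for i, row in enumerate(log):
        if row.get("amount") is None:
            problems.append(f"row {i}: missing amount")
        elif row["amount"] < 0:
            problems.append(f"row {i}: negative amount")
        if row.get("customer_id") not in known_customers:
            problems.append(f"row {i}: unknown customer {row.get('customer_id')}")
    return problems


known_customers = {"C1", "C2"}
log = [
    {"customer_id": "C1", "amount": 120.0},
    {"customer_id": "C3", "amount": 80.0},  # customer not in the subject table
    {"customer_id": "C2", "amount": None},  # missing metric
]

issues = validate(log, known_customers)
```

If `issues` is not empty, fix or flag those rows before anything reaches a chart. Remember: no data is better than bad data.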

Most clients rely on powerful visualisation. Everyone loves pictures. They are so much easier to understand than rows of data, which seem mundane.

Build out a good picture over the data you selected.

Cut your coat according to your cloth

It is inadvisable to visualise everything. All colours, when mixed, lead to grey or black. It is always better to feed only the information that is necessary first, then what is wanted next. Anything beyond this is redundant.

Clients will be overwhelmed by a visualisation that is beautiful but carries too much information. Keep it short and simple.

Beyond Insights

Now that you have narrowed down a piece of data and built a visualisation over it, the next step is to make meaning out of it.

The question is, how would you do that?

As described above, one of the key fundamentals is to have a core model and then build over it. Building over it simply means adding more data pieces. Don’t add them all at once. Iterate one by one.
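Here is a rough sketch of iterating one piece at a time. The helper `add_attribute` and the sample data are hypothetical; the idea is just to attach one new column, look at what failed to match, and only then move on to the next piece.

```python
def add_attribute(core_rows, lookup, key, new_column):
    """Attach one new column to the core model; report keys with no match."""
    enriched, unmatched = [], []
    for row in core_rows:
        value = lookup.get(row[key])
        if value is None:
            unmatched.append(row[key])
        enriched.append({**row, new_column: value})
    return enriched, unmatched


core = [
    {"customer_id": "C1", "amount": 120.0},
    {"customer_id": "C2", "amount": 80.0},
]

# Hypothetical extra data pieces, added one at a time.
extra_pieces = {
    "city": {"C1": "Pune", "C2": "Mumbai"},
    "segment": {"C1": "premium"},  # C2 is missing on purpose
}

report = {}
for column, lookup in extra_pieces.items():
    core, unmatched = add_attribute(core, lookup, "customer_id", column)
    report[column] = unmatched     # review this before adding the next piece
```

Because each piece is added separately, a bad source shows up in `report` under its own name instead of being buried in one big join.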

All buzzwords only lead to humming. Today we live in a world of Artificial Intelligence and Machine Learning. You might be thinking of these strategies as a way to add more pieces, but don’t let buzzwords drive the work. A buzz is a collection of repeated sounds, just as AI and ML are built on repeated patterns. Data Engineering doesn’t mean that you have to reach for these strategies all the time. There are only occasional instances when you have to use them. Simple iterations work most of the time.

This is how you would then measure value beyond insights. The pieces of data that are left unused often indirectly determine the key factors that drive jumps or drops in KPIs.

The next step would be to figure out what those are.

Divide these again into three:

  1. Attributes of subjects
  2. Unused metrics
  3. External factors (data that is outside the client's reach)
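To make the first item concrete, here is a minimal sketch that splits a KPI change by one subject attribute. The months, cities and amounts are invented, and a plain difference of totals is only the simplest possible check, not a full analysis.

```python
from collections import defaultdict

# Hypothetical log with one subject attribute (city) attached.
rows = [
    {"month": "Jan", "city": "Pune", "amount": 100.0},
    {"month": "Jan", "city": "Mumbai", "amount": 200.0},
    {"month": "Feb", "city": "Pune", "amount": 100.0},
    {"month": "Feb", "city": "Mumbai", "amount": 350.0},
]


def change_by(rows, attribute, before, after):
    """Change in total amount between two months, split by one attribute."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        totals[row[attribute]][row["month"]] += row["amount"]
    return {value: months[after] - months[before] for value, months in totals.items()}


delta = change_by(rows, "city", "Jan", "Feb")
```

In this toy data the whole jump comes from one city, which is exactly the kind of indirect factor an unused attribute can reveal.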

This was the last piece of the puzzle. The work ahead is not meant for Data Engineers but for Data Scientists.

More on that later, but this is where I believe you would stop. I will talk further about different aspects of Data Engineering in more blogs.

Let me know if you want to know the road ahead.

Data, Tech and Sustainability. I believe in preserving the world so that our future generations have ample resources. If you have a green idea, reach out!