Taming Wild Data

Charles Givre
Jul 25, 2019 · 4 min read

At the beginning of any data analytic effort, one of the key determinants of success is the data itself. In real-world situations, the data often comes in a variety of formats and is stored in a variety of systems. As a former government contractor, I’ve been in several situations where I was called upon to analyze “big data”, which is a government term for a share drive full of random Excel files. While this isn’t big data in the truest sense, it is still annoying: it is data in the wild. Other examples of untamed data might be gigabytes of log files, databases, or raw financial data. As we know all too well, situations like this are common, and they can be quite time consuming and difficult to deal with. Ultimately, the success or failure of an analytic effort depends on being able to make sense of the data.

In situations like these, the data scientist or analyst is often stuck between a rock and a hard place. There are many good visualization tools, such as Tableau or the open source Apache Superset, which work well if your data already lives in a tabular system such as a relational database. But what if your data is wilder? What if your data looks like this?

{1:F01COPZBEB0AXXX0377002460}{2:O9401506110804LRLRXXXX4A1100009040831108041707N}{3:{108:MT940 003 OF 058}}{4:
:20:02618
:21:123456/DEV
:25:6-9412771
:28C:00102
:60F:C000103USD672,
:62F:C000103USD987,
-}{5:{CHK:592A3DB2CA5B}{TNG:}}

The data above is a SWIFT message, a format used for inter-bank transfers. As you can see, it is quite difficult to parse, and in order to use this data you either need a specialized tool ($$) or you will have to resort to the ultimate in analytic firepower: coding. You could write a parser in Python, R, or some other language that converts this data into something more manageable (or performs the analysis directly), but this is time consuming, and obviously you must know how to code to do it. Surely there must be a better way.
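To give a sense of what that hand-rolled parsing looks like, here is a minimal sketch of a Python parser for the message above. It uses only the standard library, and it is deliberately simplified: real SWIFT messages have multi-line fields and strict validation rules that are ignored here, so treat this as an illustration rather than a working SWIFT implementation.

```python
import re

def parse_mt940(message: str) -> dict:
    """Extract the :tag:value fields from block 4 of a SWIFT MT940 message.

    Simplified sketch: only single-line fields in the text block ({4: ... -})
    are handled; the other blocks and all validation are ignored.
    """
    # Grab the text block ({4: ... -}) that holds the statement fields.
    block4 = re.search(r"\{4:\s*(.*?)-\}", message, re.DOTALL)
    if not block4:
        return {}
    fields = {}
    # Each field looks like ":20:02618" -- a tag, a colon, then the value.
    for tag, value in re.findall(r":(\w+):([^\r\n]*)", block4.group(1)):
        fields[tag] = value.strip()
    return fields

sample = """{1:F01COPZBEB0AXXX0377002460}{4:
:20:02618
:21:123456/DEV
:25:6-9412771
:28C:00102
:60F:C000103USD672,
:62F:C000103USD987,
-}{5:{CHK:592A3DB2CA5B}}"""

print(parse_mt940(sample))
```

Even this toy version took a regular expression and some careful thought about block structure, and it only scratches the surface of the format.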

Drill to the Rescue!

For the last few years, I have been extremely interested in a few open source tools that answer this problem: Drill, Impala, and Presto. These three projects, all under the Apache umbrella, were inspired by the Google Dremel paper and can essentially be thought of as SQL engines for self-describing data. (I’ve mainly worked with Drill, but I believe the same can be said of the other tools.)

What impressed me about Drill was not just that it could execute ad-hoc queries against data on Hadoop with breathtaking speed, but that Drill could be adapted to query large amounts of difficult and diverse data with a common language. Furthermore, these data sets can be quickly and effectively joined together via standard SQL JOIN statements. Effectively, Drill can be thought of as a universal translator for data.
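Drill performs such joins directly over the raw files. As a toy stand-in for the idea (joining two data sets that arrive in different formats using plain SQL) here is a sketch using Python’s stdlib sqlite3 with an in-memory database; the formats, column names, and sample records are invented for the example. The key difference is that the loading step below is exactly what Drill lets you skip:

```python
import csv, io, json, sqlite3

# Two "wild" data sources in different formats (invented sample data):
# a CSV inventory of hosts and a JSON log of traffic volumes.
csv_data = "ip,hostname\n10.0.0.1,web01\n10.0.0.2,db01\n"
json_data = '[{"ip": "10.0.0.1", "bytes": 5120}, {"ip": "10.0.0.2", "bytes": 900}]'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hosts (ip TEXT, hostname TEXT)")
conn.execute("CREATE TABLE traffic (ip TEXT, bytes INTEGER)")

# Load each format into a table -- the ingestion step Drill avoids
# by querying the raw files in place.
conn.executemany(
    "INSERT INTO hosts VALUES (?, ?)",
    [(r["ip"], r["hostname"]) for r in csv.DictReader(io.StringIO(csv_data))])
conn.executemany(
    "INSERT INTO traffic VALUES (?, ?)",
    [(r["ip"], r["bytes"]) for r in json.loads(json_data)])

# A standard SQL JOIN across the two formerly separate data sets.
rows = conn.execute("""
    SELECT h.hostname, t.bytes
    FROM hosts h JOIN traffic t ON h.ip = t.ip
    ORDER BY t.bytes DESC
""").fetchall()
print(rows)  # -> [('web01', 5120), ('db01', 900)]
```

With Drill, the same JOIN can be written against the CSV and JSON files themselves, with no CREATE TABLE or INSERT statements at all.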

For analysts and data scientists, this has major implications. Drill (and its competitors) can enable an analyst to quickly explore and analyze virtually any kind of data without having to learn a proprietary query language, and without having to ingest or index the data into a third system like Splunk. Drill has an ever-growing list of supported systems and formats, which means that it “speaks” more and more languages. Ultimately, this means you can go from raw data to information of value much more quickly than with other tools.

With Superset, You Can Visualize Data Just as Quickly!

One of the really nice features of Drill is that it can make any data look like a database, and when you combine this power with a visualization tool like Apache Superset (Incubating), you can really unleash some impressive analytic power on your data. I’ve written a few blog posts about the combination of Superset and Drill: the first explains how to connect Drill and Superset, and the second demonstrates network forensics using the two together.
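For reference, Superset talks to Drill through SQLAlchemy. If my memory of the sqlalchemy-drill dialect is right, the Superset connection URI takes roughly this form, where the host, port, and storage plugin name (`dfs`) are placeholders you would adjust for your own setup:

```
drill+sadrill://localhost:8047/dfs?use_ssl=False
```

Check the sqlalchemy-drill documentation for the exact URI syntax supported by your versions of Drill and Superset.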

The image above is a visualization I created with Drill and Superset from a raw packet capture; it shows which IP addresses are communicating with which others.

In conclusion, if you work with data, particularly difficult data, give Drill a try: it can greatly speed up your workflow and get you results faster. There’s a lot more to Drill than I can cover in this article, so I’ll put in a shameless plug for Learning Apache Drill, the book Paul Rogers and I wrote, if you’d like to learn more.

Charles Givre

Founder of DataDistillr, Data Science enthusiast, Instructor and Apache Drill PMC Chair. Contact me at charles(at)datadistillr.com.