Let me begin by describing what machine data is. Machine data is essentially log file information. When engineers build systems (hardware or software), they usually incorporate some element of log capture into the design for two reasons: first, for troubleshooting, and second, as a backup record in case something unintended happens with the primary system.
As a result, almost every electronic device and software program generates this “machine data”. It’s fairly safe to say that most things we interact with on a day-to-day basis capture it: our automobiles, cell phones, ATMs, EZ-Passes, electric meters, laptops, TVs, online activity, servers, storage devices, pacemakers, elevators, etc. all generate and locally store machine data in one form or another. When we call the mechanic about a “check engine” warning and they ask us to bring the car into the shop so they can hook it up to the computer to diagnose the problem, we are leveraging machine data. What the mechanic is really doing is accessing the machine data stored on our automobile to identify error codes or anomalies that help pinpoint the mechanical problem. And the proverbial “black box” that is so crucial to explaining why an airplane crashed leverages this same machine data.
So, if machine data is everywhere, why haven’t we heard much about it?
In a word, it’s difficult. Machine data comes in lots of different shapes and sizes, which makes collecting and analyzing it across many different sources a hard problem. Going back to the car example, information collected from the different sensors is all fed into one collection point. The engineers building the automobile can dictate requirements to component manufacturers about the shape, format, and frequency of all this machine data. Because they design and build the entire process, they can correlate and present the information in a way that is useful to mechanics troubleshooting a problem.
In an enterprise IT infrastructure, however, this same collaboration and integration doesn’t exist. A typical enterprise has lots of unrelated components: load balancers, web servers, application servers, operating systems, PCs, storage devices, multiple sites (on premise and in the cloud), virtual environments, mobile devices, card readers, etc. Depending on the size and scale of the business, it could have an enormous number of machines generating this data. I personally work with customers whose server counts are measured in the tens of thousands.
Within the enterprise, no universal format for machine data exists. That fact, combined with the variety, volume, and variability of the data, creates an enormous challenge for any enterprise looking to unlock its value. As a result, enterprises collect the information in silos and resort to an old-school, brute-force approach to analysis only when it’s necessary. If a system or process is failing, a team is assembled from the various IT departments to engage in what can best be described as an IT scavenger hunt: manually poring through log files and tracing cause and effect across other log files throughout the network. The whole process is so labor-intensive and time-consuming that if the problem is only intermittent, a decision may be made to abandon the search for a root cause altogether.
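To make the scavenger hunt concrete, here is a minimal sketch of what one leg of that manual hunt looks like: scanning a single silo’s log file for entries near a known failure time, then repeating the exercise by hand for every other silo and every other log format. The file names, the timestamp format, and the failure time here are all assumptions for illustration, not any particular enterprise’s setup.

```python
from datetime import datetime, timedelta
from pathlib import Path

# Hypothetical failure time reported by the business, and how far
# around it we are willing to look.
FAILURE_TIME = datetime(2024, 3, 1, 10, 15)
WINDOW = timedelta(minutes=5)

def lines_near_failure(path):
    """Yield lines whose leading timestamp falls within WINDOW of FAILURE_TIME.

    Assumes this silo's log lines start with a 'YYYY-MM-DD HH:MM:SS' stamp;
    every other silo would need its own variant of this function.
    """
    for line in Path(path).read_text().splitlines():
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # line doesn't start with a timestamp we recognize
        if abs(stamp - FAILURE_TIME) <= WINDOW:
            yield line

# One pass per silo -- in practice, repeated manually for every
# log format in the enterprise. File names are hypothetical.
for silo in ["web.log", "app.log", "db.log"]:
    if Path(silo).exists():
        for hit in lines_near_failure(silo):
            print(silo, hit)
```

The pain is visible even in the sketch: the timestamp format, the file locations, and the window all have to be rediscovered and re-coded for each source, which is exactly why intermittent problems often go unchased.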
Let’s go back to the car example. Imagine we bring our car to the mechanic, but instead of simply hooking a computer up to a single command-and-control sensor, the mechanic has to connect to and analyze hundreds of different data points on the automobile, comparing all the available data against other data in the hope of finding the problem. To build on this point, suppose our automobile emits an annoying screech at 35 mph. We’ve had the car in the shop three times for the same problem and spent hundreds of dollars, all to no avail. Eventually we accept the screech as the new normal and turn the radio up when approaching 35 mph.
There has to be a better way!
Let’s think about this for a minute: what would be needed to get the most value out of this machine data? If we tried to structure the information by storing it in a database with a fixed schema, we couldn’t account for the variety of the data. Instead, we’ll need a way to store the information in an unstructured format. Next, we’ll need a way to get all the different devices to send their data to our unstructured store in real time. Building connectors would be too expensive and difficult to maintain, so we’ll need a way to simply forward this machine data, in any format, to our unstructured store.

Next, we’ll need to be able to search the data, but how can we do that if it’s totally unstructured? We’ll need some way to catalog it. And since the value of the data rises exponentially with the amount of corresponding information, we’ll also need some way to correlate information across different data types. But how? What’s common across all these different data types? Eureka! The date and time that something happened is present in all this data.

We’ll also need a way to extract information from all this data; otherwise, what’s the point of doing all this in the first place? Since the data has no structure, creating reports with a traditional BI tool won’t work, and besides, reports are too rigid for the complex questions we’ll want to ask of our data. Lastly, we’ll need to address scale and performance. Whatever we design has to be able to bring in massive amounts of data in real time, because knowing what’s happening right now across everything we run in our enterprise is far more interesting and valuable than knowing what happened last week.
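The timestamp insight above can be sketched in a few lines. This is not Splunk’s actual implementation, just a toy illustration of correlating heterogeneous, unstructured log lines purely by extracting whatever timestamp each one happens to contain. The regex patterns, time formats, and sample log lines are all hypothetical.

```python
import re
from datetime import datetime

# A couple of timestamp shapes commonly seen in raw logs. These two
# patterns are illustrative assumptions, not an exhaustive list of
# what real devices emit.
TIMESTAMP_PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"), "%Y-%m-%d %H:%M:%S"),
    (re.compile(r"\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}"), "%d/%b/%Y:%H:%M:%S"),
]

def extract_timestamp(line):
    """Find the first recognizable timestamp in an unstructured line."""
    for pattern, fmt in TIMESTAMP_PATTERNS:
        match = pattern.search(line)
        if match:
            return datetime.strptime(match.group(), fmt)
    return None

def correlate(events, keyword, window_seconds=5):
    """Return events whose timestamps fall within window_seconds of any
    event containing `keyword`, regardless of each event's source format."""
    stamped = [(extract_timestamp(e), e) for e in events]
    anchors = [t for t, e in stamped if t and keyword in e]
    return [
        e for t, e in stamped
        if t and any(abs((t - a).total_seconds()) <= window_seconds
                     for a in anchors)
    ]

# Hypothetical events from two unrelated sources plus an unrelated one.
logs = [
    "2024-03-01 10:15:02 app-server ERROR database timeout",    # app log
    '10.0.0.7 - - [01/Mar/2024:10:15:03] "GET /checkout" 500',  # web log
    "2024-03-01 11:40:00 app-server INFO heartbeat ok",
]
print(correlate(logs, "ERROR"))  # the error plus the web hit one second later
```

Notice that nothing here required the two sources to agree on a schema: time alone was enough to tie the web server’s 500 response to the application’s database timeout.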
Well, we can continue to ponder ways to solve all these technical challenges, or we can just opt to use Splunk, whose brilliant engineers seem to have totally nailed it.