Last time we made the case for how different data structures impact your ability to get to the insights you need to foster an agile environment. If we hit the mark, you are likely asking yourself, “How do I get data out of the OLTP structure and into the OLAP structure?”
The answer is ETL: extraction, transformation, and loading. ETL is the process by which data is extracted from data sources that are not optimized for analytics, scrubbed and transformed, and loaded into a centralized host.
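To make the three steps concrete, here is a minimal sketch in Python. The `timesheets` source table and `hours_by_employee` destination are hypothetical names for illustration; real pipelines use dedicated tooling, but the shape is the same.

```python
import sqlite3

def extract(conn):
    # Extract: pull raw rows from an OLTP-style source table.
    return conn.execute("SELECT employee, minutes FROM timesheets").fetchall()

def transform(rows):
    # Transform: scrub and reshape — here, aggregate minutes into hours per employee.
    totals = {}
    for employee, minutes in rows:
        totals[employee] = totals.get(employee, 0) + minutes
    return [(employee, minutes / 60) for employee, minutes in totals.items()]

def load(conn, rows):
    # Load: write the analytics-friendly result into a centralized store.
    conn.execute("CREATE TABLE IF NOT EXISTS hours_by_employee (employee TEXT, hours REAL)")
    conn.executemany("INSERT INTO hours_by_employee VALUES (?, ?)", rows)
```

In practice each step runs against different systems on a schedule, but the extract–transform–load flow is the same.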
We’ll spend the next three posts exploring ETL, starting with Extraction.
You may know Extraction as Integration
You may be more familiar with the word integration than extraction. To get to the level of insight you’ll need, you may need to challenge your existing thinking around integrations. We’ll explore these nuances from three perspectives:
1. Purpose - the reason why you’re accessing and moving data
2. Accessibility - the means by which you can access desired data
3. Quality - how you prove whether or not you got all the data you desire
Purpose: Your integration’s mission statement
Based on your experience purchasing an ERP system, accounting package or HRIS, you’ve likely encountered the term “integration”. Your applicant tracking system pulls newly hired employees into your HRIS. Your payroll platform integrates with your accounting package to post the general ledger. Your marketing automation tool integrates with your CRM system so you can see the traction related to advertising campaigns. You are used to hearing about the assembly line of software systems that bolt your business processes together.
Caution: Integrations for processes are much different than integrations for analytics.
Integrating your data for business processes is typically less difficult than pulling data from multiple business software systems for analytical purposes. Let us explain:
Integrations for processes: A data assembly line
Let’s go back to your payroll system integrating to your accounting package to post various labor amounts to your general ledger. This is oversimplifying, but the integration works like this:
The payroll platform knows exactly what data needs to leave its system and head to the general ledger. This makes the scope of the data needed small and the definition of success easy to define. Payroll gets posted, general ledger accounts are updated, and the process is complete and ready to do again the next time payroll is processed.
Integrations for Analytics: Look up, not left or right
Now, let’s consider using the payroll platform and the accounting package data for analytical purposes. What data do you need? How often? The definition is less clear. Analytics can be powerful due to the number of questions you can ask and answer; however, anticipating all of the potential questions your audience could ask requires a larger, more ambiguous amount of data.
What’s more is that you can’t connect your systems horizontally. Remember, back in the first part of this series: your payroll platform and accounting package are OLTP-structured systems, so there’s no place to put the amount of data you’ll need for analytics. You must connect your systems “up” to an analytical layer where you can create that valuable OLAP data structure. Later in the series, we’ll discuss what happens in that analytical layer and the value of how metadata is used to translate the vocabulary used in each system. For now, let’s just focus on the need for additional data and where to store it.
Accessibility: Getting Data Out
Software systems differ in the ways you can extract data. We’ll simplify it into the four most common ways we encounter our customers pulling data. We’ll start with the most labor-intensive and end with a software developer’s dream.
Manual Data Pulls

This is when data is pulled from a system, saved to a desktop, manipulated in Excel, and sent to another person or uploaded into a shared drive.
We cringe when we see this because A) it can’t be fulfilling work for the person doing it, B) it’s ripe for error, C) it drastically cuts down on how often you’ll get updated data, and D) it’s not secure.
Our hope is that in exploring other methods, you may contemplate new ways to streamline data pulls.
Robotic Process Automation (RPA)

There are tools that replicate manual processes. Robotic Process Automation, RPA for short, can be trained to do the tasks of humans. You can ‘teach’ the software to log into a system, download a report, and save it somewhere.
As utopian as this sounds, there is a downside to using RPA to pull reports from a software system: software vendors frequently make enhancements and improvements to their sites, meaning those meticulously trained steps may fail.
While RPA can save you time, it is still fairly risky and needs to be monitored. Another important note for both manual options is that you are limited to the data sets that the software system has pre-defined in its reports. There’s very little flexibility in data scope.
Scheduled Data Exports
Some systems can schedule data extracts. These systems vary in how much you can define which data gets extracted:
- User-Defined: Ideally, a data analyst or data engineer can build the data extract to the specifications of your OLAP model.
- Limited User Definition: If a software system does not have the capability for a user to pre-define the data scope, you will be limited to pulling standard reports and cobbling the data back together for your OLAP model.
This is why pre-built integrations from analytics companies to systems without data-scoping capabilities are extra valuable.
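As a rough illustration of the user-defined case, the sketch below writes only the columns the OLAP model asks for to CSV. The column names are hypothetical, and the scheduler that runs this on a cadence is out of scope.

```python
import csv
import io

def export_extract(rows, columns):
    """Write only the requested columns of each row to CSV text.

    `rows` are dicts straight from the source system; extra fields
    (anything the OLAP model didn't ask for) are dropped on the way out.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

A limited-definition system gives you no equivalent of the `columns` argument — you get whatever its standard report contains and reshape it afterward.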
APIs

API stands for application programming interface. APIs allow software systems to integrate in an automated, low-risk and secure way.
The robustness of a software system’s APIs can vary. Preferably, a software system exposes all data elements in its API, meaning a developer can selectively choose what they need to fuel their application. Unfortunately, APIs can be an underdeveloped part of a software system. Two ‘tells’ of an underdeveloped API are missing data elements and an API that acts more like a scheduled data extract (see above) than a development tool.
Don’t get discouraged about APIs. They really are useful. Just be wary of any software system’s API claims.
Quality: 100% of the Data
Fueling the data required for game-time decisions is easier said than done. Validation is our last, and commonly overlooked, attribute of integrations. When you are building a data integration process, you want to:
- Make sure you get 100% of the data, and
- Make sure you don’t lose data.
Sure, that seems obvious; however, it takes some thoughtful and talented engineering to make sure as you are pulling data from your OLTP systems and hydrating your OLAP model, you are getting a complete picture.
Here’s an example: Let’s say you want to do an efficiency calculation and figure out the number of units produced per hour. This would require hours data from your timekeeping platform and units data from your ERP. As you pull the timekeeping data, it will likely look something like this:

| Employee | Minutes |
| --- | --- |
| Joe | 40 |
| Joe | 20 |
It’s easy math to see that Joe has worked 60 minutes, or one hour. But, in data speak, this is ripe for error because there is no unique key. What’s a unique key? It’s a field that makes the entire row unique. A unique key ensures that the value – in this case minutes – is additive once pulled into the data model. In our example, without a unique key we are at risk of losing the 40 minutes or the 20 minutes. That would make the efficiency calculation look a lot better than it is. Scary, right?
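The risk is easy to demonstrate. In this sketch (with hypothetical punch ids), a load keyed only on the employee name silently drops one of Joe’s punches, while a unique key preserves both:

```python
# Two real time punches for Joe, with no unique key on the rows.
rows = [("Joe", 40), ("Joe", 20)]

# A load keyed on the employee name silently overwrites the first punch.
loaded = {employee: minutes for employee, minutes in rows}
total_without_key = sum(loaded.values())   # only 20 minutes — the 40 vanished

# With a unique key (hypothetical punch ids 101 and 102), every row survives.
keyed_rows = [(101, "Joe", 40), (102, "Joe", 20)]
loaded_keyed = {punch_id: minutes for punch_id, _, minutes in keyed_rows}
total_with_key = sum(loaded_keyed.values())   # the true 60 minutes
```

With 20 minutes on the books instead of 60, units per hour triples — exactly the kind of flattering-but-wrong number that erodes trust in the model.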
Each data element needs to be inspected to ensure it gets treated correctly. What’s more is the need for validation scripts that run in the background on an ongoing basis to confirm the model is performing correctly. All of this to confirm you are getting a full and complete view into your data.
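A background validation script can be as simple as reconciling row counts and totals between the source system and the loaded model. A minimal sketch, with illustrative field names:

```python
def validate_load(source_rows, model_rows):
    """Return a list of human-readable discrepancies (empty list = healthy load)."""
    problems = []
    # Check 1: did every row make it across?
    if len(source_rows) != len(model_rows):
        problems.append(
            f"row count mismatch: {len(source_rows)} source vs {len(model_rows)} model"
        )
    # Check 2: do the additive values still add up?
    source_minutes = sum(row["minutes"] for row in source_rows)
    model_minutes = sum(row["minutes"] for row in model_rows)
    if source_minutes != model_minutes:
        problems.append(
            f"minutes mismatch: {source_minutes} source vs {model_minutes} model"
        )
    return problems
```

Run on a schedule and wired to an alert, a check like this catches the silently dropped row from the example above before it ever reaches a dashboard.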
Closing the Loop on Integrations
We’ve discussed how purpose, data extraction capabilities and validation all impact your ability to get data out for analytical purposes. The good news is that as complex as integrations can get, there are experts and partners available to help you. And, we hope we’ve made the case for not glossing over extracting the data.