While the extraction step is valuable, it does not mean the data is usable. We’ll cover the final two letters, transformation and loading, in this post.
Transformation is the term for cleansing, aggregating and preparing data for analysis.
Now, we won’t be quizzing you on this later; however, to give you a sense of how much work goes into transformation we have listed the most common transformation types:
- Cleaning: Mapping “Male” to “M” and “Female” to “F,” date format consistency, etc.
- Deduplication: Identifying and removing duplicate records
- Format revision: Character set conversion, unit of measurement conversion, date/time conversion, etc.
- Key restructuring: Establishing key relationships across tables
- Derivation: Applying business rules to your data that derive new calculated values from existing date - for example, creating a revenue metric that subtracts taxes
- Filtering: Selecting only select rows and/or columns
- Joining: Linking data from multiple sources - for example, the first name and last name may be joined to form one column called name
- Splitting: Splitting a single column into multiple columns - for example, splitting name into two columns called first name and last name
- Data validation: Simple or complex data validation - for example, if the first four columns in a row are empty then reject the row from processing further
- Summarization: Values are summarized to obtain total figures which are calculated and stored at multiple levels as business metrics - for example, adding up all sales an employee has made to build an employee lifetime value
- Aggregation: Data elements are aggregated from multiple data sources and databases
- Integration: Reconciles different data names and values for the same data element
What Are We Doing with this Data Anyway?
While the extract and transform stages primarily deal with the data sources, the load stage depends largely on the intent of the data once it’s in its final resting place.
No matter what your end goal is, it’s important to understand the work you’re requiring of the target environment. For instance, you wouldn’t want to frequently load small batches of data when the data warehouse is optimized for infrequent, large batches. On the other hand, some analyses can’t wait! So, a more event-driven approach makes sense.
Nonetheless, when you do load into an analytics warehouse, there are two primary methods:
- 1. Full load
- 2. Incremental load
A full load is an entire data dump that takes place the first time a data source is loaded into the warehouse.
An incremental load brings together the difference between target and source data at regular intervals. For instance, bringing in the new orders that happen throughout the day to compare orders to labor staffing is one of the ways to consider incremental.
It is the incremental load where things get more complex.
For these reasons, your incremental loading process must be able to handle the three biggest problems it may face:
- 1. Ordering: Correct sequence is necessary
- 2. Schema evolution: Need data to be in a consistent state
- 3. Monitorability: You must be able to catch failures in a timely manner
Each problem will likely cause your analytical data to be incomplete or wrong; therefore, it’s vitally important you have a loading process that you can rely on consistently. We won’t keep going on these subjects in this post, but put a bookmark here for the next time you are evaluating data aggregation. Your team will appreciate you having this perspective.
From Transformation to Storage
To be successful in building an analytics solution, you must understand and be able to extract, transform and load your data in the proper way. Now, you should have a better, basic understanding of extraction (aka integrations), transformation, and loading your data into storage.
By now, you know that the place for cross software system analysis is not inside of one of your business applications. You get that there needs to be a data Switzerland of sorts. And, likely you’ve heard the terms “Data Warehouse” and “Data Lake” thrown around. But, what’s the difference? And, more importantly, how do you decide what is right for your organization? Check out our next blog: Turning Data into Insights, for more information.