The “What” and “How” of Data Lineage

Read my previous blog post in this series, The Who, What, When, Where and Why of Data Lineage

The “What”: The Truth About Data Lineage
It may seem elementary to start with a definition of Data Lineage, but with so many diverse offering types and implementations, let’s start on common ground.  Per our local Wikipedia:

Lineage analysis will enable the “why” and “what” of provenance.  When you see, through the business (business rule) and IT (transform) lens, the way that data is moving from one area to the next, you start to ask and answer questions such as:

  • Is this dataset used in the correct business context?
  • Does it apply to the boundaries of my KYC policy?
  • Am I targeting the right sample set?

Understanding if the transform aligns with the business policies and rules is key to proper business context.

At ASG, we have seen long-standing data issues that occurred as data was passed across business borders solved immediately with the transparency of end-to-end lineage. The data was either misinterpreted or used differently between departments.  Sometimes, as assumptions are made, the data morphs into a completely different meaning, resulting in costly mistakes. Catching these idiosyncrasies without transparency into both the business and physical information would take weeks of remediation, if you are aware of them at all.

The “How”:  How Does it Work?

This is an extraordinary time for data lineage benefits, as lineage has become so much easier to deploy and automate; however, at least once a week, I talk to clients or partners who don’t commit to a lineage project. Some clients never got past square one of their original lineage goals. Others may have pockets of solid usage, but they haven’t met their original goals of tracking and tracing PI across the organization.  Process, technology, or both can cause this failure to meet data lineage objectives.  Here are the two most common reasons:

  1. If lineage is imported with a spreadsheet, it is not automated. 
  2. If information is collected with a questionnaire, it is not automated.  Believe it or not, there are vendors who promote lineage “questionnaire repositories.”

If a data lineage vendor needs more than one technology to fulfill lineage, be skeptical and ask questions:

  • How integrated is this solution?
  • How much is the vendor investing in discovering and scanning metadata themselves vs. handing you off to a third-party?
  • Is the technology available today? Be cautious when they tell you “it’s on the roadmap.”
  • How long does it take to create a new scanning device?
  • Do they have a practice around the creation of new scanning technologies?

When comparing data lineage factors, we hope that you will consider ASG Data Intelligence (ASG DI) and the unique differentiators we provide. The ASG DI solution has 270+ scanners and it has taken 10+ years to build this portfolio. It’s not easy keeping up with the vast number of modern technologies released every year, but this constant evolution has forced us to speed up and standardize the creation of the scanners themselves.

The creation of lineage is divided into three main capabilities: connect – import – link. If any of these core steps routinely relies on a spreadsheet, it’s not automated.  If the vendor’s solution uses AI to infer the lineage, make sure you test and validate the results! If the inferred lineage comes mostly from naming conventions and log files, beware. It’s to your advantage for machine learning to be based on coding patterns and metadata. To break it down further:

  • Discovery should be based primarily on the scanning of metadata.
  • The import and linking should be automated and based on the transformation code or technology.  
  • Upon import, the linking should be activated for full automation.

Is this a silver bullet? No, but if 85% of the lineage is automated, you will be happy with the results.  Make sure that the vendor has a means to connect the gaps. If the gaps are filled with a spreadsheet or automated stitching, there should be a way to track and govern the manual path.
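
To make the idea of measuring automation and governing the manual path concrete, here is a minimal Python sketch. It is not ASG DI code; the `LineageEdge` class, the `origin` and `owner` fields, and the sample element names are all hypothetical, chosen only to illustrate how you might compute the automated share of your lineage and flag manual links that nobody is tracking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One link in the lineage (hypothetical structure for illustration)."""
    source: str      # upstream data element
    target: str      # downstream data element
    origin: str      # "scanned" (automated) or "manual" (spreadsheet / stitched)
    owner: str = ""  # who is accountable for a manual link, if anyone


def automation_coverage(edges):
    """Return the fraction of links that came from automated scanning."""
    if not edges:
        return 0.0
    scanned = sum(1 for e in edges if e.origin == "scanned")
    return scanned / len(edges)


def ungoverned_manual_edges(edges):
    """Return manual links with no owner recorded: the untracked gaps."""
    return [e for e in edges if e.origin == "manual" and not e.owner]


# Made-up sample data: three scanned links and two manual ones.
edges = [
    LineageEdge("crm.customer.id", "staging.customer.id", "scanned"),
    LineageEdge("staging.customer.id", "dw.dim_customer.customer_key", "scanned"),
    LineageEdge("dw.dim_customer.customer_key", "report.kyc.customer", "scanned"),
    LineageEdge("legacy.extract.cust_no", "staging.customer.id", "manual",
                owner="data-steward"),
    LineageEdge("partner.feed.client_ref", "staging.customer.id", "manual"),
]

print(f"automated share: {automation_coverage(edges):.0%}")
for e in ungoverned_manual_edges(edges):
    print("manual link with no owner:", e.source, "->", e.target)
```

In this made-up sample the automated share is well short of the 85% mark, and one manual link has no owner. Those are the two numbers worth watching as a project matures.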

Accelerate Time to Data Value with a Factory Approach

Earlier I mentioned “committing” to a lineage project.  Knowing that 100% automated lineage across the organization will not happen in 90 days, you will want to orchestrate for the journey. If you want to go fast, consider a “factory approach” streamlining each phase of the lineage.

Perhaps your approach will be to stitch immediately and circle back to automate after phase II of the project. Maybe a large part of the project is to clean up years of data issues, and you will benefit more from identifying critical data and securing those data supply chains for business value. This is more of a domain approach versus an accelerated factory approach. The point being, expect gaps. Define your approach and commit to that approach. You will be amazed at what you will find out about your data landscape and what you will achieve with fast access to reliable business data!

Don’t Forget the Use Case

Finally, to be successful with lineage, you must align it with a use case. Tracing and tracking code and data elements is powerful, but it’s not “meaningful” if it’s not being applied to a business value use case. Modern use cases for lineage include impact analysis for cloud migration, regulatory compliance, mitigating privacy issues, new data and analytic campaigns, supporting a new digital process, and selecting the right data for the new ML model, to name just a few.
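
As an illustration of the first use case, impact analysis for a migration boils down to asking "what sits downstream of this element?" The sketch below assumes lineage is available as simple (source, target) pairs; the `downstream` function and the element names are hypothetical and do not reflect any particular product's API.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Return every element reachable from `start` by following lineage links."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    seen = set()
    queue = deque([start])
    while queue:  # breadth-first walk over the lineage graph
        node = queue.popleft()
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # a loop back to the start is not an "impact"
    return seen


# Made-up lineage for an on-premises table that is about to move to the cloud.
lineage = [
    ("onprem.orders", "etl.orders_clean"),
    ("etl.orders_clean", "dw.fact_orders"),
    ("dw.fact_orders", "report.revenue"),
    ("dw.fact_orders", "ml.churn_features"),
    ("onprem.customers", "dw.dim_customer"),
]

for element in sorted(downstream(lineage, "onprem.orders")):
    print("impacted:", element)
```

Reversing each pair before calling the same function gives the opposite view, tracing a suspect report back toward its sources.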

These modern use cases, with data lineage as their core data provider, are what will accelerate successful data-driven projects across the organization!

DATA LINEAGE  = ACCELERATED DATA-DRIVEN SUCCESS
 

Join our upcoming lineage panel, “Inside Data Lineage: Data Lineage Expert Panel,” on April 27th at 11am ET, where ASG Technologies leaders will introduce you to the advancements of data lineage and the surprising array of business use cases for intelligent data lineage.

Definitions

Data lineage includes the data origin, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.

This is an excellent, modern definition with an emphasis on data analytics. Our clients at ASG Technologies tell us they are seeking new automated approaches to monitoring accuracy, privacy and data value creation. Data lineage is moving beyond the realm of compliance to enabling validation for certified datasets, reports, ML models and digital datasets. The common thread is that data lineage is critical to supplying confident insights for the majority of new data initiatives.

Wikipedia goes on to explain “Lineage is a simple type of why provenance.” If you’re mentoring the organization on Data Literacy, you might want to include a course on data lineage. Lineage provides a graphical view of how data moves and changes form as the business changes and expands. For example, when you detect a knot of recursive (back) arrows, it typically represents a business process changing over time. As popular business processes change over time, they invoke changes to the business rules that impact the transformation logic for a group of data elements. New calculations and derivations of the data are applied on top of the previous changes, which have been applied on top of the process logic prior to that change and so on.  Without transparency into the lineage, these changes continue to multiply, which leads to duplication and complexity. Veteran lineage clients use reports and alerts on the most complex data elements, those with the heaviest transformation rules and the most recursive relationships, to forensically clean up their data environments.
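
One way to picture the "knot of recursive (back) arrows" is as cycles in a directed graph. The sketch below is a generic depth-first search that reports links pointing back to an element already on the current path. It assumes lineage is a list of (source, target) pairs, and the function and element names are hypothetical, not taken from any lineage tool.

```python
from collections import Counter, defaultdict


def back_edges(edges):
    """Return links that point back to an element on the current path."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = defaultdict(int)
    found = []

    def visit(node):
        # Recursive for brevity; a very deep graph would need an explicit stack.
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:
                found.append((node, nxt))  # recursive (back) arrow
            elif color[nxt] == WHITE:
                visit(nxt)
        color[node] = BLACK

    for node in list(graph):
        if color[node] == WHITE:
            visit(node)
    return found


# Made-up lineage in which a risk score feeds back into the balance it uses.
lineage = [
    ("raw.balance", "calc.adjusted_balance"),
    ("calc.adjusted_balance", "calc.risk_score"),
    ("calc.risk_score", "calc.adjusted_balance"),
    ("calc.risk_score", "report.exposure"),
]

loops = back_edges(lineage)
for src, dst in loops:
    print("recursive link:", src, "->", dst)

# Elements that receive the most back arrows are candidates for cleanup.
print(Counter(dst for _, dst in loops).most_common())
```

A report like this, run on a schedule, is one plausible starting point for the kind of forensic cleanup described above.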

Posted: 4/20/2021 8:00:00 AM by Susan Laine - Global DI Evangelist
Filed under: Data_Intelligence, DI, Lineage