Data lineage? I don't think that means what you think...
Let's define lineage, provenance, and chain of custody

Data governance is not only about controlling the storage and use of data. It is also about managing and assessing historic metadata about that data. Data governance is of second-order importance for most people. When I attempt to communicate the importance of second-order concerns I often end up talking quickly, avoiding long words, and using analogies. Analogies are excellent!
Three of the big data governance analogies are:
Chain of custody
Provenance
Lineage
You hear them bandied about a lot, lineage most of all. In my experience, they are used more or less interchangeably most of the time. But words have meaning and analogies are relatively specific. Hopefully we don't have to ask what lineage is, in concept, or it would not be a useful analogy. If an analogy is valuable, it can be used specifically. If it is used specifically, it has more value.
Maybe Define Our Terms
Let's take a quick minute for a high-level definition of these key data governance terms. What are we talking about here?
Chain Of Custody
Chain of custody comes out of the legal frame of reference. It means tracking who had a thing over time, and who had access to it. If the thing is a murder weapon, say the candlestick, it is important to know that it was found in the library by Inspector Clouseau, bagged and tagged by him at 11 pm, and entered into evidence at 7 am down at the station.
If the thing in question is murderously bad data, we want to know that it arrived at the MFT at 1 am from the vendor, was preboarded by CsvPath Framework at 3 am, and was loaded by ETL into the data lake at 4 am. We also want to know who at the vendor sent the data, who had access to the MFT server's configuration, who wrote the CsvPath scripts, who designed the ETL process, and who had access to the bronze area of the data lake the data landed in. At its most basic, data chain of custody is the data-flow diagram annotated with access control and a log.
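To make "a data-flow diagram annotated with access control and a log" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the CustodyEvent record, the two helper functions, the role names, and the date are hypothetical, and none of it is an API of CsvPath Framework or of any MFT or ETL product. Only the steps and times mirror the example above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Set, Tuple


@dataclass(frozen=True)
class CustodyEvent:
    """One entry in a hypothetical custody log."""
    at: datetime                      # when the hand-off happened
    step: str                         # which step of the data flow
    actor: str                        # who performed or owns the step
    had_access: Tuple[str, ...] = ()  # who else could have touched the data


def custody_chain(events: List[CustodyEvent]) -> List[CustodyEvent]:
    """Return the events in time order, earliest first."""
    return sorted(events, key=lambda e: e.at)


def who_could_touch(events: List[CustodyEvent], until: datetime) -> Set[str]:
    """Everyone who acted on, or had access to, the data up to a point in time."""
    people: Set[str] = set()
    for e in events:
        if e.at <= until:
            people.add(e.actor)
            people.update(e.had_access)
    return people


# Illustrative log only; the date and role names are made up.
day = datetime(2024, 1, 1)
log = [
    CustodyEvent(day.replace(hour=4), "ETL load into data lake bronze area",
                 "etl_designer", ("lake_admin",)),
    CustodyEvent(day.replace(hour=1), "arrival at MFT",
                 "vendor_sender", ("mft_admin",)),
    CustodyEvent(day.replace(hour=3), "preboarding by CsvPath Framework",
                 "csvpath_author"),
]

for event in custody_chain(log):
    print(event.at.strftime("%H:%M"), event.step, "-", event.actor)
```

The point of the sketch is the question it lets you ask: given a moment when the data went bad, who had it and who could have touched it before then?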
Provenance
I think of the concept of provenance as coming first from the art world. For a piece of art to have value it has to have two things: an innate attractiveness or relevance and a known act of creation. Likewise, for data to have value, it must be useful or interesting and have a known source. Unlike most statues, data moves and is often agglomerated from multiple sources in its earliest days. That means provenance is also implicitly about the assembling of a set of data.
Who first collected and assembled the data tells us if the source was reliable. As we track further (dis)assembly of the set over time we can assess all the hands that touch it and, by extension, bring to bear what we know of their capabilities and biases. We can, for example, trust econometric data assembled from the official records on data.gov and from well-known NGOs. Our trust in econometric data assembled from the official blogs of Mickey Mouse, Marvin the Martian, and Wile E. Coyote is much lower.
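One way to picture provenance metadata is as a record of which sources a set was assembled from and by whom. The sketch below is hypothetical throughout: the Source and Assembled records, the 0 to 3 trust scale, and the "weakest source" rule are assumptions chosen for illustration, not an established scoring method.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Source:
    name: str
    trust: int  # hypothetical scale: 0 (none) to 3 (high)


@dataclass(frozen=True)
class Assembled:
    name: str
    sources: Tuple[Source, ...]
    assembled_by: str


def weakest_source(dataset: Assembled) -> Source:
    """Assumption: an assembled set is only as trustworthy as its least trusted source."""
    return min(dataset.sources, key=lambda s: s.trust)


# Made-up trust scores, following the example in the text.
official = Assembled(
    "econ_official",
    (Source("data.gov", 3), Source("well-known NGO", 3)),
    assembled_by="analyst",
)
cartoon = Assembled(
    "econ_cartoon",
    (Source("data.gov", 3), Source("Wile E. Coyote's blog", 0)),
    assembled_by="analyst",
)

print(weakest_source(official).trust, weakest_source(cartoon).trust)
```

A weakest-link rule is deliberately blunt. A real assessment would also weigh who did the assembling and how the sources were combined.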
Lineage
Lineage is a term of art in the world of genealogy. Exploring ancestry tells us how families change over long time frames as they do things, have things done to them, and incorporate new individuals. Every generation of a family can be seen as a dataset. Not necessarily true or false, but clearly related to, and distinct from, its ancestors and progeny.
At each step in the lineage we can see not only the gene pool changing, but also the societal influences, and geographic impact. Likewise with data. Each time a dataset changes, in each system it passes through, we can see individual fields added and removed, schemas applied, conformance transformations made, restatements, etc., etc. Each derived dataset is a new generation. As with chain of custody and provenance metadata, the high-level goal of lineage tracking is assigning a level of trust at a point in time -- and the possibility for remediation. But clearly lineage is not just another word for provenance or chain of custody.
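The "each derived dataset is a new generation" idea can be sketched as a chain of records that each point back to a parent. This is an illustrative model only: the Generation class, the derive, ancestry, and field_diff helpers, and the dataset and field names are all hypothetical, and this is not how OpenLineage or any other tool represents lineage.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Generation:
    """One generation of a dataset: its fields, its parent, and what changed."""
    name: str
    fields: FrozenSet[str]
    parent: Optional["Generation"] = None
    change: str = ""


def derive(parent: Generation, name: str, change: str,
           add: Iterable[str] = (), remove: Iterable[str] = ()) -> Generation:
    """Create the next generation by adding and removing fields."""
    fields = (parent.fields | frozenset(add)) - frozenset(remove)
    return Generation(name, fields, parent, change)


def ancestry(gen: Optional[Generation]) -> List[Generation]:
    """Walk back to the oldest ancestor; return oldest first."""
    chain: List[Generation] = []
    while gen is not None:
        chain.append(gen)
        gen = gen.parent
    return list(reversed(chain))


def field_diff(older: Generation, newer: Generation) -> Tuple[List[str], List[str]]:
    """Return (fields added, fields removed) between two generations."""
    return sorted(newer.fields - older.fields), sorted(older.fields - newer.fields)


# Hypothetical generations of one dataset.
raw = Generation("vendor_csv", frozenset({"id", "amt", "note"}))
bronze = derive(raw, "bronze", "schema applied", add=["loaded_at"], remove=["note"])
silver = derive(bronze, "silver", "conformance transformation",
                add=["amount_usd"], remove=["amt"])

for g in ancestry(silver):
    print(g.name, sorted(g.fields), g.change)
```

Comparing any two generations with field_diff shows what was gained and lost between them, which is the kind of question lineage answers and chain of custody does not.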
No One Concept Applies
In governing data at the edge or in the moment or over the lifecycle, all three of these concepts apply. We cannot equate lineage with chain of custody or provenance with lineage without losing important concepts. If the analogy has any meaning, it is a specific meaning. And with data, proper management requires us to address all these issues. Without clarity of provenance, lineage, and chain of custody we cannot fully trust our data and its impact on our commercial or collective actions.
We get the provenance, lineage, and chain of custody information we need by carefully tracking how data moves through our systems, using tools such as OpenLineage. In the moment, when we are designing a data flow, a data lifecycle, or data storage and transformations, it is easy to forget things we should be building in. Will change data be captured? How does data pass through the edge into the organization? Who looked at what data when? Were all the items of data assembled from equally trustworthy sources? And so on. Having a set of analogies to tick off is a mnemonic that helps make sure we cover our bases.
Helpful as long as we keep them straight.






