Is Your Data Preboarding Quality Control Efficient?
"Expensive" is a highly contextual unit of measurement

Now and then I make the rounds of the data QC players to see what’s new. These are the data observability systems, mainly, but also some data governance and data catalog tools. The observability companies include Soda.io, Great Expectations, Monte Carlo, and many others, among which is a company called BigEye. It’s a very crowded field these days and BigEye was new to me.
(Also, for the record: like the others mentioned, BigEye isn’t a data preboarding company; they are well downstream, on the database side of things.)
I took a spin through BigEye’s docs. One statement popped:
"Traditional data quality tools are rooted in rules that must be hand written and manually updated when data changes. Rows are checked against the rules and either pass or fail. Rules must be preemptively thought up by humans which is expensive and error prone."
Now, then, says I, that hits close to home.
CsvPath Framework is not a “traditional data quality tool”. That’s partly because there is no tradition of tools that provide a methodical preboarding workflow. And it’s partly because CSV, Excel, JSONL, and other delimited data formats have never had a solid validation language or a way to create effective schemas. CsvPath Framework addresses both of these gaps.
That said, BigEye’s statement is totally speaking to the kind of validation rules effort CsvPath enables. That makes the statement fair game for an analysis and push-back. I’ll say upfront, I don’t know BigEye’s product and they seem like serious people, so I’ll assume they have a good thing going, overall. I just think that quote completely missed the dartboard.
Let’s step through it.
“Rules that must be hand written and manually updated”
Hand written, huh? It sounds like they may be thinking of the Rosetta Stone. If code were coded on rocks, I’d get this. But it’s not. And increasingly code is coded by AI so who cares, rocks or otherwise?
Here’s the thing. From the perspective of data preboarding, you already have the rules in “narrative” form. I guarantee it. The reason I can say that is simple: you have a partner, and you have negotiated with that partner what should be exchanged for money. No serious money is going to change hands until the data provider explains to the data consumer what they are delivering, when, with what domain and range, and what the abstract schema and physical format are. So, like I say, you already have the rules.
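To make that concrete, here is a minimal sketch of transcribing a narrative contract clause into an executable rule. This is plain Python, not CsvPath syntax, and the clause, field names, and thresholds are invented for illustration:

```python
import csv
from datetime import date

# Hypothetical contract clause: "Each row has an order_id, a ship_date no
# later than today, and a positive price." Transcribed as checks:
def check_row(row: dict) -> list[str]:
    errors = []
    if not row.get("order_id"):
        errors.append("order_id is required")
    try:
        y, m, d = map(int, row["ship_date"].split("-"))
        if date(y, m, d) > date.today():
            errors.append("ship_date is in the future")
    except (KeyError, ValueError):
        errors.append("ship_date must be an ISO date (YYYY-MM-DD)")
    try:
        if float(row["price"]) <= 0:
            errors.append("price must be positive")
    except (KeyError, ValueError):
        errors.append("price must be a number")
    return errors

def validate(path: str) -> dict[int, list[str]]:
    # Map of file line number -> errors, for failing rows only.
    # Line 1 is the header, so data rows start at line 2.
    with open(path, newline="") as f:
        return {i: errs for i, row in enumerate(csv.DictReader(f), start=2)
                if (errs := check_row(row))}
```

The point is that nothing here had to be thought up: each check is a direct transcription of something the data partner already promised in writing.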
And manually updated? That is a feature, not a bug. When data formats or content expectations change we want to have the new expectations communicated to us. We then want to make the appropriate changes and test them against known sets. And, when all is well, we roll the changes into production. Testing and rollout should have a lot of automation. But updating QC rules is absolutely something to do with great intentionality, full information, and step-by-step.
I’ve heard this before from developers. If we have too many tests, test at too fine a grain, or test the UI using test automation, we’ll just create more work maintaining tests, they say. As if changing a test to match a new reality is a problem. That’s what tests are for: to ensure no surprises. When the app or pipeline or database changes and a test breaks, that’s a test that is doing exactly what we want. We should thank it for its service, not complain because we have to do work to get paid.
When the facts on the ground change you change the tests. Just is what it is. Next!
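The update-deliberately-then-test loop described above can be sketched in a few lines. Everything here is invented for illustration: imagine the data partner changes the price field from decimal dollars to integer cents, so the rule is revised and gated against known samples before rollout:

```python
# Sketch of "update rules with intentionality, test against known sets".
# The rules and sample values are hypothetical, for illustration only.

def price_rule_v1(value: str) -> bool:
    # Original contract: price is a positive decimal number of dollars.
    try:
        return float(value) > 0
    except ValueError:
        return False

def price_rule_v2(value: str) -> bool:
    # Revised contract: price is a positive integer number of cents.
    return value.isdigit() and int(value) > 0

# Known sets: values captured from real deliveries, labeled once.
known_good_v2 = ["1999", "5", "120000"]
known_bad_v2 = ["19.99", "-5", "", "free"]

# Gate the rollout: the updated rule must handle every known case
# before it replaces v1 in production.
assert all(price_rule_v2(v) for v in known_good_v2)
assert not any(price_rule_v2(v) for v in known_bad_v2)
```

The gating asserts are the automated part; deciding that v2 is the right reading of the new contract stays a deliberate, human, step-by-step call.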
“Rules must be preemptively thought up by humans”
It is indeed hard to think of all the cases that must be covered in order to have a level of test coverage that lets you sleep at night. What if the same file comes from the data partner twice in one day? What if a tremor hits San Francisco and unplugs the rack on the same day the backup power dies, crashing the server before it can write data from memory cache to disk? Hmm.
The good news is that you actually have a lot less to think up than you may have thought, because see point one: you already have the requirements because you entered into a data partnership with your eyes open. We’re not talking about software engineering, where the software is an act of pure creation based on an abstract idea. This is data from a data partner with a shape, size, type, and price stamped on every bit coming over the wire. We don’t have to think up anything.
“Expensive and error prone”
Now the idea of rule writing being “expensive and error prone” is fair. Rule writing takes effort, even if we’re in the main just transcribing the requirements or product information from the contract with the data partner. You can usually get some kind of a bump by using AI to code your most complex rules. We know a bit about this because the CsvPath Framework team built a prototype of an LLM-backed csvpath statement generator. (That worked great. We’re now adding a production version into FlightPath Data). But even with a lift from AI you have to check the rules and prove that they work. Yes, that’s a bit expensive.
However, expensive is a contextual unit of measurement. Is an hour of an SME’s time expensive? Yes, hence the ‘E’ in SME. But that hour is not nearly as expensive as the time the SME and their tech team would spend tracing an error that reached production back to the point where a rule costing an hour’s work would have prevented it. That rule, had it been written, verified, and put into production, would have been cheap at double the cost.
What about error prone? Yes, that’s true, rule writers can make mistakes. Let’s just get the AI to write the rule, sit back, and have a sandwich… oh, wait.
The thing is, data preboarding operations either have human checkers applying manual judgement that is, for sure, expensive and error prone. Or they have automation, involving rules that are written by humans (or at least verified by humans) that are expensive and error prone. Or they use AI somehow, which is probably pretty expensive, actually, in all sorts of ways, but regardless, definitely error prone.
I’d take option B. Write your rules, based on well-known commercial arrangements, using AI or just SMEs, test the hell out of them, and automate. When the rules break because the data relationship changes, fix the rules or fix the relationship. I guarantee it will be less expensive and error prone than option A (all human judgement and manual toil) and probably significantly better than option C (AI magic attempting to address a situation that is about well-known requirement matching, not unknown pattern matching).
Slow is smooth and smooth is fast
The efficiency of your data QC shouldn’t be decided based primarily on how much effort you have to put into your data preboarding rules in order for them to give good coverage. It should be determined by:
- How many data errors get to production?
- How easy is it to trace and fix errors?
- How visible and understandable is the whole data lifecycle starting from data preboarding?
- Are your data operations consistent and repeatable?
- How quickly do you know when things are going or have gone south?
Are there more bullets I haven’t stated? Certainly. But those are a good start. Get those right and you are a data preboarding superstar. Focus only on the effort you put into QC rules and you’re likely to over-spend on gee-whiz tools and under-invest in actual data quality. Generally speaking, the latter is expensive and error prone.
One final thought. If you are concerned about how much effort you will need to put into data preboarding validation and upgrading rules, by all means get in touch. We’d be happy to have a look at your requirements and give you a good sense of what it will take to use CsvPath Framework to smoothly go fast.