<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[CsvPath Framework Blog]]></title><description><![CDATA[CsvPath Framework, FlightPath Data and FlightPath Server make data preboarding ingestion easy. Have a CSV, JSONL or Excel validation project? Get FlightPath fro]]></description><link>https://blog.csvpath.org</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1757434866788/e4bcc357-7a48-4bf2-b8e3-9f44a912efe7.png</url><title>CsvPath Framework Blog</title><link>https://blog.csvpath.org</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 12:25:16 GMT</lastBuildDate><atom:link href="https://blog.csvpath.org/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Does My Company Need Data Preboarding?]]></title><description><![CDATA[Does my company need data preboarding?
We often say that every company that trades data in files with data partners already does data preboarding, somehow. That’s not marketing spin, it’s absolutely true. If you receive data you must have some proces...]]></description><link>https://blog.csvpath.org/does-my-company-need-data-preboarding</link><guid isPermaLink="true">https://blog.csvpath.org/does-my-company-need-data-preboarding</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[business automation]]></category><category><![CDATA[ROI (Return on Investment)]]></category><category><![CDATA[data]]></category><category><![CDATA[dataops]]></category><category><![CDATA[#data preboarding]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Sun, 01 Feb 2026 21:55:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769982372570/4f21f99b-fe09-4470-974f-9d63d624b627.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Does my company need data preboarding?</em></p>
<p>We often say that every company that trades data in files with data partners already does data preboarding, somehow. That’s not marketing spin, it’s absolutely true. If you receive data you must have some process for storing it, identifying it, checking if it is good, and moving it to downstream consumers.</p>
<p>When you ask yourself, <em>do I need a data preboarding solution</em>, you’re really asking if you need to invest in upgrading your flat-file infrastructure to a next-level capability. “Professionalizing” it, so to say, with a solution like <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> and <a target="_blank" href="https://www.flightpathdata.com">FlightPath</a> Server.</p>
<p>That’s a much more interesting question!</p>
<p>There are two parts to the answer. There is an ROI consideration. And equally, a tool-for-the-job consideration. Can preboarding deliver found money? Is preboarding the <em>best way</em> to find money?</p>
<h2 id="heading-first-the-economic-view">First, the Economic View</h2>
<p>At a super high level, ask yourself:</p>
<ul>
<li><p>Are my costs unacceptable?</p>
</li>
<li><p>Is my revenue not as strong as required?</p>
</li>
</ul>
<p>The costs-too-high problem is about the same as the revenue-is-insufficient problem. The difference is that if costs are too high, the improvement from better preboarding is captured by a RIF, i.e. shrinking the team, or by intentionally outgrowing it over time. On the other side, if revenue is insufficient, the human resources spend stays the same and the productivity improvement goes into expanding the business.</p>
<p>Here is a table showing the basic problems as they might be seen in a representative B2B tech-enabled services business. Say, an invoice processing or billing company. Inefficient preboarding primarily hits three groups in five very generalized ways.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769897110893/a9869440-4941-4c9a-a827-3ba34bfce5db.png" alt class="image--center mx-auto" /></p>
<p>If data preboarding is the right lever, how does it unlock each of these problems?</p>
<h3 id="heading-unblocking-revenue">Unblocking revenue</h3>
<p>Data preboarding unlocks new revenue when BizOps is the main bottleneck. If your data operations can’t handle a marginal increase in data throughput, and the business has contracts unfilled, automation that increases BizOps capacity effectively delivers the average revenue per employee for every additional FTE’s worth of automation. When the whole company is blocked, every person’s-worth of unblocking is a person’s-worth of additional revenue.</p>
<p>Data preboarding is first and foremost about the automation of manual effort, raising throughput and reducing risk. Throughput is increased by:</p>
<ul>
<li><p>Creating business rules that take the place of manual validation and upgrading</p>
</li>
<li><p>Minimizing data handling (copying files, renaming them, etc.)</p>
</li>
<li><p>Minimizing rework; less manual effort, fewer mistakes, less rework</p>
</li>
</ul>
<p>BizOps and DataOps teams spend a surprising amount of time checking files, moving them, checking headers, fixing date formats, and the like. Low-value but necessary stuff. CsvPath Framework makes it much easier to run an automated file-handling workflow that minimizes human judgement. Used successfully, it is a force-multiplier that can increase throughput dramatically.</p>
<p>A rule as simple as:</p>
<p><code>$[*][ date(#ordered_on, "%Y-%m-%d") header_name(#1, "ID") ]</code></p>
<p>can eliminate a visual check on the format of order dates and the presence of an “ID” header in the correct place. The work doesn’t happen twice as fast. It doesn’t happen 10x faster. From the perspective of units-handled-per-person, it is <em>infinitely</em> faster. Most improvements based on validation and upgrading rules aren’t that dramatic, but the simple example makes the point that the potential is huge.</p>
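<p>For a concrete sense of what that one-line rule replaces, here is a minimal Python sketch of the equivalent manual check: every order date must parse as YYYY-MM-DD and the second header must be “ID”. This is illustrative only; the <code>check_orders</code> helper is our own stand-in, not part of CsvPath Framework, which expresses the same thing declaratively.</p>

```python
# Illustrative stand-in for the visual check the csvpath rule automates.
# Not a CsvPath API -- plain Python showing the equivalent logic.
import csv
import io
from datetime import datetime

def check_orders(csv_text: str) -> list[str]:
    """Return a list of problems found; an empty list means the file passes."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]
    # The header_name(#1, "ID") part: the second column must be named "ID".
    if len(header) < 2 or header[1] != "ID":
        problems.append("second header is not 'ID'")
    # The date(#ordered_on, ...) part: every order date must parse.
    if "ordered_on" not in header:
        problems.append("no ordered_on header")
        return problems
    col = header.index("ordered_on")
    for n, row in enumerate(rows[1:], start=2):
        try:
            datetime.strptime(row[col], "%Y-%m-%d")
        except (ValueError, IndexError):
            problems.append(f"line {n}: bad ordered_on value")
    return problems
```

<p>Even this small amount of Python is more to write, test, and maintain than the single declarative statement, which is exactly the force-multiplier argument.</p>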
<h3 id="heading-revenue-delayed">Revenue delayed</h3>
<p>It would seem that adding business rules would lengthen the time needed to onboard a new customer. Creating rules requires interviewing SMEs, coding, and testing. However, there’s more to it. A preboarding framework should be able to stamp out a new data partner project, allow reuse of rules from other data partnerships, minimize the scripting involved in moving data around, and minimize process creation and learning time.</p>
<p>CsvPath Framework delivers by:</p>
<ul>
<li><p>Cutting project setup time to a few minutes, with no code or configuration-as-code required</p>
</li>
<li><p>Allowing validation and upgrading to be composed from small, easy to read, easily tested rules shared across projects</p>
</li>
<li><p>Applying a consistent pattern to all data partnerships with significantly less drift and fewer exceptions that slow down rollout and maintenance</p>
</li>
</ul>
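<p>The composition idea in the list above can be sketched in a few lines of plain Python. The names below (<code>not_empty</code>, <code>max_len</code>, <code>validate</code>) are hypothetical stand-ins, not CsvPath APIs; the Framework expresses rules declaratively, but the structure is the same: small, independently testable rules shared across partner projects.</p>

```python
# Hypothetical sketch: validation composed from small reusable rules,
# rather than one monolithic script per data partner.
from typing import Callable, Optional

Rule = Callable[[dict], Optional[str]]  # returns an error message, or None

def not_empty(field: str) -> Rule:
    return lambda row: None if row.get(field) else f"{field} is empty"

def max_len(field: str, limit: int) -> Rule:
    return lambda row: (None if len(row.get(field, "")) <= limit
                        else f"{field} longer than {limit}")

def validate(row: dict, rules: list[Rule]) -> list[str]:
    return [err for rule in rules if (err := rule(row)) is not None]

# Two partners compose different checks from the same shared rule library:
partner_a = [not_empty("ID"), not_empty("ordered_on")]
partner_b = [not_empty("ID"), max_len("note", 80)]
```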
<h3 id="heading-opportunity-cost">Opportunity cost</h3>
<p>Data engineering and BizOps teams commonly lose 50% or more of their time to manual data management and fire-fighting. Fire-fighting is particularly pernicious because it is unpredictable and often all-consuming. That amount of time lost could be better spent on work supporting revenue generation, efforts to increase pricing-power, and/or product-line extension.</p>
<p>As well as the benefits already noted, CsvPath Framework tackles the causes of lost data engineering and SME time by:</p>
<ul>
<li><p>Making data immutable, runs idempotent, and processing assets versioned</p>
</li>
<li><p>Providing a clear identity for data, as well as data and configuration files, that is traceable at every processing step</p>
</li>
<li><p>Integrating most market-leading monitoring, alerting, and lineage platforms using the <a target="_blank" href="https://opentelemetry.io/ecosystem/vendors/">OpenTelemetry</a> and <a target="_blank" href="https://openlineage.io/">OpenLineage</a> standards</p>
</li>
<li><p>Providing easy process rewind and replay options</p>
</li>
<li><p>Making validation and upgrading rules easy to understand and test</p>
</li>
</ul>
<p>Immutability, meaning the ability to repeat processing with exactly the same outcome on data sets that never change, makes finding and fixing errors far faster. When a step can be repeated using versioned pipeline assets and unchanged data, complexity is removed, clarity is increased, and the risk of experiments and tests drops.</p>
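<p>One way to picture the immutability guarantee is content-addressed storage: a file’s identity is a digest of its bytes, so registering the same bytes twice is a no-op and nothing is ever overwritten. This is a hedged sketch of the general technique, not CsvPath Framework’s actual storage implementation.</p>

```python
# Sketch of content-addressed, immutable storage (not CsvPath's actual code).
import hashlib

class ImmutableStore:
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def register(self, data: bytes) -> str:
        """Store bytes under their SHA-256 digest; idempotent by construction."""
        fingerprint = hashlib.sha256(data).hexdigest()
        # setdefault never overwrites: same content in, same identity out.
        self._blobs.setdefault(fingerprint, data)
        return fingerprint

    def fetch(self, fingerprint: str) -> bytes:
        return self._blobs[fingerprint]
```

<p>Because identity derives from content, a replay that re-registers the same delivery changes nothing, which is what makes rewinds and experiments low-risk.</p>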
<p>These capabilities simplify and destress error recognition and replay. By shifting remediation left, preboarding shrinks the blast radius, thereby lowering the cost of understanding what happened, fixing it, and restarting data processing. Fixing data at the point of error is significantly cheaper and faster.</p>
<h3 id="heading-revenue-risk">Revenue risk</h3>
<p>Data operations puts recognized revenue and captured margin at risk. There are a number of ways poor quality operations can hurt:</p>
<ul>
<li><p>Downstream SLAs may be broken and penalties or service recovery expenses incurred</p>
</li>
<li><p>Partners or customers may require rework cascading beyond directly affected data</p>
</li>
<li><p>Over time trust may erode and costs increase to the point that partnerships end and there are development costs of finding and onboarding new partners</p>
</li>
<li><p>Lower customer trust in the data provided may cause churn</p>
</li>
<li><p>Sales may become harder due to the loss of references, poor reputation, and demotivation of the sales team or channel partners</p>
</li>
</ul>
<h3 id="heading-hit-to-pricing-power">Hit to pricing power</h3>
<p>Products that are founded on data obviously must work, and when they don’t, new product development and/or product extension takes the hit. Virtually all industries and categories are competitive. Products that don’t improve don’t grow, and companies that can’t field new products are eventually passed by. It goes without saying that some of the opportunity cost of gaps in data preboarding will be paid in product development friction. Some of this cost can be paid by product development crowding out internal systems work. However, gaps in internal work also constrain the business with their own friction.</p>
<p>For example, say a data operation is running at 50% efficiency. The higher effort crowds out data analytics work for sales team decision making. The lack of correct and insightful analytics causes malinvestment in Sales, resulting in less revenue. The product side of the house looks at the revenue volume and sources and compounds the problem with its own malinvestment. Its investment in new product runs at full speed, to the detriment of the sales analytics, but results in fewer sales at lower prices, and therefore less revenue.</p>
<p>It will always be difficult to pin down linkages like these. However, it is easy to make the case that in a company that is data engineering and data SME capacity constrained agility is reduced. With fewer experiments and other feedback loops, information that could have better informed product choices is not available. The revenue impact may not be easily assessed, but the agility drag is easy to see and the revenue impact can be assumed from that.</p>
<h2 id="heading-the-right-tool-for-the-job">The Right Tool For the Job?</h2>
<p>All the economic reasons for data preboarding would be for nothing if data preboarding is not the right tool for the job. Happily we know that it is, because essentially all file-feed data partnerships have a preboarding component already and inherently. Everything written above assumes not that preboarding is absent, but that it is insufficient. So that’s not the question we want to look at.</p>
<p>The better questions are: is better data preboarding a good investment for your company, and is CsvPath the right tool for the job? At first you might say, obviously yes, data preboarding is a good investment, for the reasons above. But it’s not that simple. In some cases, spending effort on next-level data preboarding is less valuable than other investments might be.</p>
<p>Let’s stipulate that <a target="_blank" href="https://www.flightpathdata.com">FlightPath</a>, built on top of CsvPath Framework, is, at least today, the only enterprise grade data preboarding solution that isn’t deeply embedded in a larger enterprise software package, such as an ERP system. That won’t be true forever, but let’s go with it. FlightPath is an open source product. Nevertheless, implementing any systems change is an investment. The question isn’t how much FlightPath costs. The question is, does it fit the use case.</p>
<p>CsvPath Framework targets use cases that match:</p>
<ul>
<li><p>Regular file-feed deliveries</p>
</li>
<li><p>Delimited data</p>
</li>
<li><p>Multiple data partners</p>
</li>
<li><p>Consistent and automation-friendly data business rules</p>
</li>
<li><p>Files in the megabyte to low gigabytes range</p>
</li>
<li><p>Cost of errors is high</p>
</li>
</ul>
<p>If several of these bullets are true, <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> and its frontend <a target="_blank" href="https://www.flightpathdata.com">FlightPath</a> products are worth evaluating. Let’s spin through them briefly.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769974718287/c0d2074f-9066-45fd-94d3-f52e0fc7032b.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-regular-file-feed-deliveries">Regular file-feed deliveries</h3>
<p>Investment in data preboarding is likely to be efficient if file deliveries are regular and the data is consistent. By regular we’re talking about daily, weekly, or monthly. Periodic deliveries that are not on a set schedule may also work. The reasoning for this is simple: if you’re taking data deliveries frequently there are likely to be more gains from automation.</p>
<h3 id="heading-delimited-data">Delimited data</h3>
<p>CsvPath Framework supports CSV, JSONL, and Excel. In the future it is likely that other document-form data will be supported, in particular XML, JSON, and/or EDI. But that is not near-term. (Check back in mid-2026; things may have changed.) The overall preboarding workflow can work with any type of file, but the validation and upgrading at the core of CsvPath Framework only operates on tabular, delimited data. Luckily for the Framework, the universe of delimited, tabular data partner exchanges is very wide.</p>
<h3 id="heading-multiple-data-partners">Multiple data partners</h3>
<p>The benefits of consistency, rapid partner onboarding, and business-rules-driven efficiencies grow with the number of data partners your company has. CsvPath Framework works great with one data partner or ten, daily or weekly, or anything similar. However, it was created to handle far larger situations. Applying CsvPath Framework to hundreds or thousands of data partners yields efficiencies massively greater than the gains possible with one or two partners. The Framework scales horizontally and keeps its work simple and consistent, so the number of partnerships is not gated by the tool.</p>
<h3 id="heading-automation-friendly-business-rules">Automation-friendly business rules</h3>
<p>There are obvious ways rules can be automation-friendly:</p>
<ul>
<li><p>Consistency</p>
</li>
<li><p>No rocket science</p>
</li>
<li><p>A reasonable number</p>
</li>
</ul>
<p>If rules are not expected to be applied consistently, it is hard to see how automation can help. Most data partnerships have expectations that are clear and don’t change much. But if that is not the case, human judgement will continue to be needed.</p>
<p>If rules are consistent but extremely complex, reducing them to automation could be impractical. In practice this is unlikely. What is more likely is that complex rules simply take longer to nail down and automate. However, the harder a rule is to automate, the more likely it is that automating it nets you a high return. That’s because complex rules require expert humans, who are expensive, slow, and error prone.</p>
<p>Lastly, it is much better to have a good number of rules, but not an overwhelming number of them. A “bunch” of rules is good for gains in efficiency. Hundreds of rules slow development, testing, and operational performance. CsvPath Framework has features and use patterns specifically geared towards making very large sets of rules more practical. But it isn’t actually magical. At some point you end up in an impractical place. That said, if you have trouble using CsvPath Framework to automate a huge rules base, you’re undoubtedly having similar problems with your manual or bespoke preboarding.</p>
<h3 id="heading-reasonable-file-sizes">Reasonable file sizes</h3>
<p>CsvPath Framework performs about as well as any Python-based processing would. It was built to handle everything from very small files up to multi-gigabyte files. Beyond 10 gigabytes you’re getting into significantly long run times. Handling 10-50 gigabyte files is doable. But at some point there aren’t enough hours in the day and any replays only add to the burden. Realistically, the vast majority of data partnerships generate files that are a gigabyte or less. Chopping files into smaller files before processing, moving data and compute into more independent, colocated, on-demand environments, and other standard strategies can also help. And of course the biggest consideration with large files is the number and complexity of business rules. Just as with SQL or any other validation and upgrading tool, the more complex the statements the longer the run time.</p>
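<p>“Chopping files into smaller files” can be as simple as splitting on row count while repeating the header in each chunk, so every piece validates independently. A minimal sketch, assuming in-memory CSV text for brevity (real files would be streamed):</p>

```python
# Minimal sketch of chunking a delimited file before validation.
# Assumes input small enough to hold in memory; streaming is the same idea.
import csv
import io

def split_csv(csv_text: str, rows_per_chunk: int) -> list[str]:
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    chunks = []
    for i in range(0, len(body), rows_per_chunk):
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)  # every chunk is a valid standalone file
        writer.writerows(body[i:i + rows_per_chunk])
        chunks.append(buf.getvalue())
    return chunks
```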
<h3 id="heading-high-cost-of-errors">High cost of errors</h3>
<p>Fairly obviously, if errors are free, we can afford many more of them. In that case, automation would be less valuable. In the real world, errors are typically expensive. In fact, errors that are closing in on free generally aren’t really treated as errors. The higher the error price and the more errors you have or could have, the more important it is to use a robust data preboarding tool like CsvPath Framework.</p>
<h2 id="heading-does-my-company-need-data-preboarding">Does My Company Need Data Preboarding?</h2>
<p>Here we are. We’ve answered two questions:</p>
<ul>
<li><p>Is there meaningful ROI from investment in data preboarding?</p>
</li>
<li><p>Is data preboarding, particularly CsvPath Framework, the right tool for the job?</p>
</li>
</ul>
<p>In both cases the answer is yes for a wide range of companies and use cases. The value of reducing friction is high and the fit is good for many types of high-frequency, multi-partner, complex-rules situations. These aren’t surprising answers, but I hope we’ve been specific enough that you can clearly recognize whether your case is right for the investment, or whether you should invest elsewhere.</p>
<p>If your company is one of the multitude that could get substantial ROI from better preboarding, I recommend you evaluate and run a small trial project using <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> and <a target="_blank" href="https://www.flightpathdata.com">FlightPath Data</a> to see if it’s the right thing for you. You may decide to use a more modern approach with those tools going forward. If you want a hand scoping out an evaluation, by all means, get in touch. We’re there for you.</p>
]]></content:encoded></item><item><title><![CDATA[Is Your Data Preboarding Quality Control Efficient?]]></title><description><![CDATA[Now and then I make the rounds of the data QC players to see what’s new. These are the data observability systems, mainly, but also some data governance and data catalog tools. The observability companies include Soda.io, Great Expectations, Monte Ca...]]></description><link>https://blog.csvpath.org/is-your-data-preboarding-quality-control-efficient</link><guid isPermaLink="true">https://blog.csvpath.org/is-your-data-preboarding-quality-control-efficient</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data-quality]]></category><category><![CDATA[Data Observability]]></category><category><![CDATA[#data preboarding]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Fri, 30 Jan 2026 02:26:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769739713982/bce70416-60c5-4d35-abda-e89ec7aadff2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Now and then I make the rounds of the data QC players to see what’s new. These are the data observability systems, mainly, but also some data governance and data catalog tools. The observability companies include Soda.io, Great Expectations, Monte Carlo, and many others, among which is a company called <a target="_blank" href="https://www.bigeye.com/">BigEye</a>. It’s a very crowded field these days and BigEye was new to me.</p>
<p><em>(Also, for the record, like the others mentioned, BigEye isn’t a data preboarding company, they are well downstream on the database side of things).</em></p>
<p>I took a spin through BigEye’s docs. One statement popped:</p>
<blockquote>
<p>"Traditional data quality tools are rooted in rules that must be hand written and manually updated when data changes. Rows are checked against the rules and either pass or fail. Rules must be preemptively thought up by humans which is expensive and error prone."</p>
</blockquote>
<p>Now, then, says I, that hits close to home.</p>
<p><a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> is not a <em>“traditional data quality tool”</em>. That’s partly because there is no tradition of tools that provide a methodical preboarding workflow. And it’s partly because CSV, Excel, JSONL, and other delimited data formats have never had a solid validation language or a way to create effective schemas. CsvPath Framework solves for both these gaps.</p>
<p>That said, BigEye’s statement is totally speaking to the kind of validation rules effort CsvPath enables. That makes the statement fair game for an analysis and push-back. I’ll say upfront, I don’t know BigEye’s product and they seem like serious people, so I’ll assume they have a good thing going, overall. I just think that quote <em>completely</em> missed the dartboard.</p>
<p>Let’s step through it.</p>
<h2 id="heading-rules-that-must-be-hand-written-and-manually-updated">“Rules that must be hand written and manually updated”</h2>
<p>Hand written, huh? It sounds like they may be thinking of the Rosetta Stone. If code were coded on rocks, I’d get this. But it’s not. And increasingly code is coded by AI so who cares, rocks or otherwise?</p>
<p>Here’s the thing. From the perspective of data preboarding, you already have the rules in “narrative” form. I guarantee it. The reason I can say that is simple: you have a partner and you have negotiated with your partner what should be exchanged for money. No serious money is going to change hands until the data provider explains to the data consumer what they are delivering, when, with what domain and range, and what the abstract schema and physical format are. So, like I say, you already have the rules.</p>
<p>And manually updated? That is a feature, not a bug. When data formats or content expectations change we want to have the new expectations communicated to us. We then want to make the appropriate changes and test them against known sets. And, when all is well, we roll the changes into production. Testing and rollout should have a lot of automation. But updating QC rules is absolutely something to do with great intentionality, full information, and step-by-step.</p>
<p>I’ve heard this before from developers. If we have too many tests, test at too fine a grain, or test the UI using test automation, we’ll just create more work maintaining tests, they say. As if changing a test to match a new reality is a problem. That’s what tests are for: to ensure no surprises. When the app or pipeline or database changes and a test breaks, that’s a test that is doing exactly what we want. We should thank it for its service, not complain because we have to do work to get paid.</p>
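<p>To make the point concrete, here is a tiny sketch of a rule pinned by a test against a known-good fixture. <code>rule_passes</code> is a hypothetical stand-in for whatever validation runner you use, not a CsvPath API:</p>

```python
# Hypothetical sketch: a date-format rule pinned to known fixtures so that
# a partner-side format change breaks the test first, loudly and early.
from datetime import datetime

def rule_passes(value: str, fmt: str = "%Y-%m-%d") -> bool:
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

# Known-good fixture from the current partner feed:
assert rule_passes("2026-01-30")
# If the partner switches to US-style dates, this fails immediately --
# the test doing exactly the job it was built for.
assert not rule_passes("01/30/2026")
```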
<p>When the facts on the ground change you change the tests. Just is what it is. Next!</p>
<h2 id="heading-rules-must-be-preemptively-thought-up-by-humans">“Rules must be preemptively thought up by humans”</h2>
<p>It is indeed hard to think of all the cases that must be covered in order to have a level of test coverage that lets you sleep at night. What if the same file comes from the data partner twice in one day? What if a tremor hits San Francisco and causes the rack to become unplugged on the same day the backup power dies, crashing the server before it can write data from its memory cache to disk? Hmm.</p>
<p>The good news is that you actually have a lot less to think up than you may have thought, because see point one: you already have the requirements because you entered into a data partnership with your eyes open. We’re not talking about software engineering, where the software is an act of pure creation based on an abstract idea. This is data from a data partner with a shape, size, type, and price stamped on every bit coming over the wire. We don’t have to think up anything.</p>
<h2 id="heading-expensive-and-error-prone">“Expensive and error prone”</h2>
<p>Now the idea of rule writing being <em>“expensive and error prone”</em> is fair. Rule writing takes effort, even if we’re in the main just transcribing the requirements or product information from the contract with the data partner. You can usually get some kind of a bump by using AI to code your most complex rules. We know a bit about this because the <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> team built a prototype of an LLM-backed csvpath statement generator. (That worked great. We’re now adding a production version into <a target="_blank" href="https://www.flightpathdata.com">FlightPath Data</a>). But even with a lift from AI you have to check the rules and prove that they work. Yes, that’s a bit expensive.</p>
<p>However, expensive is a contextual unit of measurement. Is an hour of an SME’s time expensive? Yes, hence the ‘E’ in SME. But the hour is not nearly as expensive as the time that SME and their tech team would take to trace an error that reached production back to the point where the rule that might have cost an hour’s work would have prevented it. That rule, had it been written, verified, and put into production, would have been cheap at double the cost.</p>
<p>What about error prone? Yes, that’s true, rule writers can make mistakes. Let’s just get the AI to write the rule, sit back, and have a sandwich… oh, wait.</p>
<p>The thing is, data preboarding operations either have human checkers applying manual judgement that is, for sure, expensive and error prone; or they have automation, involving rules written (or at least verified) by humans, which is also expensive and error prone; or they use AI somehow, which is probably pretty expensive, actually, in all sorts of ways, but regardless, definitely error prone.</p>
<p>I’d take option B. Write your rules, based on well-known commercial arrangements, using AI or just SMEs, test the hell out of them, and automate. When the rules break because the data relationship changes, fix the rules or fix the relationship. I guarantee it will be less expensive and error prone than option A <em>(all human judgement and manual toil)</em> and probably significantly better than option C <em>(AI magic attempting to address a situation that is about well-known requirement matching, not unknown pattern matching)</em>.</p>
<h2 id="heading-slow-is-smooth-and-smooth-is-fast">Slow is smooth and smooth is fast</h2>
<p>The efficiency of your data QC <em>shouldn’t</em> be decided based primarily on how much effort you have to put into your data preboarding rules in order for them to give good coverage. It should be determined by:</p>
<ul>
<li><p>How many data errors get to production?</p>
</li>
<li><p>How easy is it to trace and fix errors?</p>
</li>
<li><p>How visible and understandable is the whole data lifecycle starting from data preboarding?</p>
</li>
<li><p>Are your data operations consistent and repeatable?</p>
</li>
<li><p>How quickly do you know when things are going or have gone south?</p>
</li>
</ul>
<p>Are there more bullets I haven’t stated? Certainly. But those are a good start. Get those right and you are a data preboarding superstar. Focus only on the effort you put into QC rules and you’re likely to over-spend on gee-whiz tools and under-invest in actual data quality. Generally speaking, the latter is expensive and error prone.</p>
<p>One final thought. If you are concerned about how much effort you will need to put into data preboarding validation and upgrading rules, by all means get in touch. We’d be happy to have a look at your requirements and give you a good sense of what it will take to use <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> to smoothly go fast.</p>
]]></content:encoded></item><item><title><![CDATA[Should You Build Or Buy Data Preboarding?]]></title><description><![CDATA[The decision to build or buy comes up constantly in commerce. With software, rolling your own often looks innocuous in the moment, but ultimately may be a life or death choice. Time to market, time to refresh, time to fix, time between failures, and ...]]></description><link>https://blog.csvpath.org/should-you-build-or-buy-data-preboarding</link><guid isPermaLink="true">https://blog.csvpath.org/should-you-build-or-buy-data-preboarding</guid><category><![CDATA[data]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Python]]></category><category><![CDATA[#data preboarding]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Tue, 27 Jan 2026 23:46:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769559126118/0066e815-7666-4ae8-800e-c79392c1860e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The decision to build or buy comes up constantly in commerce. With software, rolling your own often looks innocuous in the moment, but ultimately may be a life or death choice. Time to market, time to refresh, time to fix, time between failures, and many more indicators are driven by what engineers thought was a good idea at the time. There’s a difference between starting simple and simple thinking.</p>
<h2 id="heading-what-we-learned-creating-flightpath-data">What we learned creating FlightPath Data</h2>
<p>The people contributing to the open-source <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> and the <a target="_blank" href="https://www.flightpathdata.com">FlightPath</a> products have not only built data preboarding for enterprises but also built it in a more general way as an open source product. On top of that, we use FlightPath Data and FlightPath Server daily, so in effect we’re buying a pre-built tool every day. What did we learn?</p>
<h3 id="heading-if-you-let-it-be-manual-it-is-likely-to-remain-manual">If you let it be manual it is likely to remain manual</h3>
<p>To paraphrase Frank Herbert, once you've processed a kind of data manually, you must always process that data manually. The reason is that it’s much easier to check data in a spreadsheet or SQL console, than it is to build software and automate processes. Moreover, if someone is willing to hack on data by hand, everyone else is happy to move on to other work. Instant gratification + Somebody Else’s Problem == persistent under-investment in automation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771291972872/59cf2c30-0744-4adf-a2e9-b2c5b7176c26.png" alt class="image--center mx-auto" /></p>
<p>The most long-term successful preboarding efforts automate from the beginning. To be hyper-focused on automation while building everything yourself is hard, so buying becomes the obvious choice.</p>
<h3 id="heading-preboarding-is-actually-hard">Preboarding is actually hard</h3>
<p>No, really. What could be hard about grabbing files from an SFTP folder and jamming them into a relational database with Python? The naive solution to preboarding is to barely do it. And yes, that would be easy. <em>On day one</em>.</p>
<p>The problems with that approach are many: restated data, lost files, unmanaged scripts, changing business rules, lack of process visibility, workflows that incorporate human-driven loops, the risks of fallible human judgement, and more. All these conspire against you. The naive solution quickly turns into a risky, expensive nightmare. And that’s before we even get into the details. The details are even harder.</p>
<p>And the core challenge is getting business rules out of human heads and into a validation and upgrading framework fit for automation. Building a validation framework with the power and flexibility to replace human judgement is above most developers’ pay grade.</p>
<p>Solving for all those challenges with a bespoke solution requires more engineering cost than most companies are willing to spend up front. When there’s the option to buy data preboarding you should buy it.</p>
<h3 id="heading-manual-is-expensive">Manual is expensive</h3>
<p>This may seem obvious. On the other hand, most companies accept the operational overhead of manual processing, rather than invest in automated preboarding. That overhead comes in the form of risk, as well as head-count. Let’s start with the FTEs.</p>
<p>Two recent experiences with the file-feed data preboarding efforts of PE-backed B2B services companies both started out with manual BizOps data handlers outnumbering the technical staff involved two to one. Even allowing for engineers being more expensive, the manual processing tax was definitely high. If we agree that some small amount of manual checking is inherent in those businesses (in those cases, invoice management and insurance benefits), we still roughly doubled the ops overhead on the inbound data flow. Moreover, in a perfect world the engineers can do other work between manual preboarding crises. That additional opportunity cost pushes the tax higher.</p>
<p>With a pre-built data preboarding solution you reduce costs multiple ways:</p>
<ul>
<li><p>Developers don’t have to build preboarding</p>
</li>
<li><p>There are fewer BizOps FTE dedicated to manual data intervention</p>
</li>
<li><p>Developers do far less firefighting when infrastructure is robust</p>
</li>
<li><p>Tech team focus can be on what the company does, rather than how the systems do it</p>
</li>
</ul>
<p>The last bullet is subtle. Developers tend to know their tools and systems intimately, but often don’t know the business very well. That is a continual source of inefficiency and lost opportunities. When developers can focus more on business goals and less on infrastructure it is a win for everyone.</p>
<h3 id="heading-nobody-likes-to-lose-customers">Nobody likes to lose customers</h3>
<p>Data problems happen. Particularly in preboarding scenarios where data is arriving from data partners that have a completely different context and incentives. The sooner you catch bad inbound data the better.</p>
<p>From a cost perspective, it is often said that remediation cost increases non-linearly the further in from the edge a problem is caught: a $10 catch by the data preboarding system at the edge saves you from a $1,000 problem if the error gets all the way to a production system.</p>
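<p>As a purely illustrative sketch, that escalation can be modeled as a cost that multiplies at each stage. The 10x multiplier and the stage names below are assumptions for the sake of the example, not measurements:</p>
<pre><code class="lang-python"># Illustrative only: a simple per-stage cost-escalation model for data
# errors. The 10x multiplier and the stage names are assumptions.
STAGES = ["preboarding edge", "data lake", "analytics", "production"]

def remediation_cost(base_cost, stage_index, multiplier=10):
    # Cost of fixing an error caught at the given stage.
    return base_cost * (multiplier ** stage_index)

for i, stage in enumerate(STAGES):
    print(f"caught at {stage}: ${remediation_cost(10, i):,}")
</code></pre>
<p>Whatever the true multiplier is in your shop, the shape of the curve is the point: the edge is the cheapest place to catch a problem.</p>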
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771291992076/3ac9dbd5-a29c-471e-b4e5-6ac3d0a5b380.png" alt class="image--center mx-auto" /></p>
<p>The immediate problem is the increasing difficulty of technical triage as data flows from the edge into data lakes and on to downstream applications, analytics, and AI. Each step from data source to production use adds more logic to check, more people who can make mistakes, more layers of data to untangle, etc.</p>
<p>The longer-term problem is that customer patience is finite. If a customer keeps catching data problems they soon won’t trust the data and will go elsewhere. This is not an abstract speculation. It happens all the time. The customer’s exit is sometimes prevented by claw-backs, discounts, and simple lock-in. Regardless of exit or not, reputational and financial damage is done. Moreover, morale takes a hit, and that is impactful. It particularly hurts when the Sales team begins to doubt the correctness of the data underlying the product they are trying to sell.</p>
<h3 id="heading-if-you-let-it-be-complicated-it-will-be">If you let it be complicated, it will be</h3>
<p>One of the arguments against buying technology is <em>“it doesn’t fit our process”</em>. Sure, COTS software is always opinionated. If a tool isn’t opinionated it isn’t guiding you towards best practice.</p>
<p>If you are at the build vs. buy choice point with data preboarding and someone says <em>“our workflow is too specific to buy”</em>, push back. The goal is to run the simplest business possible. Revenue being equal, simplicity is one of the great drivers of margin. If the buy option doesn’t fit the process, consider whether the reason is physics or fashion. Some things you can change. If change would result in a simpler, off-the-rack business, that’s something to strive for, not avoid. If change is doable and the NPV is good, invest in change.</p>
<h2 id="heading-so-then-its-a-buy">So, then, It’s a Buy</h2>
<p>Vertical integration is having a moment. SpaceX, Apple, and Amazon are all deeply vertically integrated. Most engineers like to build. But Apple doesn’t own semiconductor fabs and SpaceX does not, <em>as far as we know</em>, manufacture bolts and screws. You have to pick your battles.</p>
<p>In our case, we back open-source packaged software. You can download <a target="_blank" href="https://www.flightpathdata.com">FlightPath Data</a> and FlightPath Server from <a target="_blank" href="https://apps.apple.com/us/app/flightpath-data/id6745823097">Apple</a> or <a target="_blank" href="https://apps.microsoft.com/detail/9P9PBPKZ4JDF">Microsoft</a> and start creating consistent, maintainable, and scalable preboarding projects today with no money changing hands. For us, “buy” is not necessarily a transaction. But make no mistake, we want you to buy into <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> and the FlightPath products. We’re here to give you honest assessments of what can work for you and to make your data preboarding successful.</p>
<p>For us the decision is build vs. try. We hope you'll agree and take our product out for a test flight.</p>
<p><a target="_blank" href="https://www.flightpathdata.com"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769557384701/429af3ec-3996-4263-b8b8-1d6c904434cd.png" alt class="image--center mx-auto" /></a></p>
]]></content:encoded></item><item><title><![CDATA[Data Preboarding: Welcome To the 4th Dimension]]></title><description><![CDATA[Over the weekend I dug into an eclectic array of iPaaS, ERP, and EDI systems. The new, the older, and the 80s. Catching up on the CsvPath Framework competition. What can I say? It was too cold to paint the house.
And actually, it got pretty interesti...]]></description><link>https://blog.csvpath.org/data-preboarding-welcome-to-the-4th-dimension</link><guid isPermaLink="true">https://blog.csvpath.org/data-preboarding-welcome-to-the-4th-dimension</guid><category><![CDATA[data]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[#data preboarding]]></category><category><![CDATA[ERP Software]]></category><category><![CDATA[edi]]></category><category><![CDATA[ipaas]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Sat, 22 Nov 2025 08:05:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763796235916/16be19e0-d6fc-4fb3-ac14-7221d252075e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the weekend I dug into an eclectic array of iPaaS, ERP, and EDI systems. The new, the older, and the 80s. Catching up on the <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> competition. What can I say? It was too cold to paint the house.</p>
<p>Actually, it got pretty interesting. What dawned on me was that I needed a way to explain why these data-files-savvy systems were all singing, all dancing, and yet not getting the whole data integration job done. Why, at the end of the day, doesn't the iPaaS finish the job the earlier systems didn't complete?</p>
<h2 id="heading-without-naming-names">Without Naming Names</h2>
<p>Here's an example. An ERP system used in large manufacturing and logistics firms had a flat-file data import function. It took in file types that included, as a good example, invoices. The process went like this:</p>
<ul>
<li><p>Open data import dialog</p>
</li>
<li><p>Enter file system location of the CSV / Excel files</p>
</li>
<li><p>Enter credentials</p>
</li>
<li><p>Pick a database destination</p>
</li>
<li><p>Click go and sit back and watch</p>
</li>
</ul>
<p>The files loaded or they didn't according to very basic rules that amounted to making sure the SQL schema didn’t wig out. Files that didn't load stayed where they were. If a file loaded it was moved to a done folder. And that's it. Yes on one, no on two.</p>
<h2 id="heading-whats-wrong-with-that">What's wrong with that?</h2>
<p>A few things. Let's assume we're talking about invoices that arrive daily, weekly, or monthly. And let's assume the process is automated, we're not clicking the load-file button manually.</p>
<p>The first question is where do we land the files and how are they organized? What are the naming conventions? Do files ever get delivered twice or not at all? Are there restatements, if so, how are those indicated? What is the identification of the data set contained by a file?</p>
<p>Ah, I can hear you say it, how is any of that the ERP system's problem? Well, yeah, I know you didn't actually say that, but for sure someone somewhere once did. I know that because the ERP system's data file load function simply didn't care. It was an SEP.</p>
<h2 id="heading-somebody-elses-problem">Somebody Else's Problem</h2>
<p>Just a few minutes later I found myself digging into another company that was building similar solutions on a popular, more modern, iPaaS. Again, no last names. I hadn't read that tool’s docs before so I took a look. Surely the new hotness would do better! The docs were certainly better. And the bluster brighter and more amusing.</p>
<p>Sadly, the data file import functionality was almost exactly the same. Ouch.</p>
<p>I moved on to other apps and tools looking for answers, while in the back of my head I pondered the curious lack of change between my granddad's ERP and my nephew's iPaaS. It felt like Groundhog Day. And then it hit me.</p>
<h2 id="heading-nothing-good-ever-happens-on-groundhogs-day">Nothing Good Ever Happens On Groundhog Day</h2>
<p>These systems insist on a three-dimensional world when we actually live in four dimensions.</p>
<p><em>Seriously, yes, we actually do, and, sure, some ERP tools may actually be living in some other two dimensional world from ours, but that's just not material here.</em></p>
<p>There are three dimensions to our personal space. The fourth dimension is time. Preboarding accounts for time. Import and export functions and pipelines and your run-of-the-mill data onboarding process are about moving bits from here to there. Time is not of the essence, other than in a runtime performance kind of way.</p>
<p>Preboarding is about all the things that you have to arrange and account for to make that simple bit-slide be part of an actually effective workflow that can handle a client promising to send tomorrow a CSV invoice file exactly the same as the one you just loaded, except with the correct calculations this time.</p>
<p>Now it all makes more sense to me as I'm looking at all the data onboarding, importing, mapping, and loading docs. These tools that don't fully account for time, and all the crazy things that happen in it, are setting boundaries. Setting boundaries is good, I guess. It's not laziness. We don't want the ERP system or the iPaaS to burn out or get demotivated.</p>
<h2 id="heading-wherever-you-go-there-you-are">Wherever You Go, There You Are</h2>
<p>Still, something has to care about the process of merging data into the enterprise. Not just loading it, but really integrating it into a healthy data estate. Whatever tools, applications, and/or cloud services you pick, you still have that overriding concern. We deal with the way things really happen, not just the things the iPaaS decides are within its boundary-setting sensibilities.</p>
<p>You may go to war with the iPaaS you have, not the iPaaS you wish you had, but that doesn’t mean you can just wish any part of the war away.</p>
<p>That thing that cares about the time dimension and the real world is <strong>data preboarding</strong>. Preboarding is a simple set of steps:</p>
<ul>
<li><p>Land data in permanent immutable storage partitioned by data category, with a hierarchical time-of-arrival oriented layout that retains versions</p>
</li>
<li><p>Register the bits as the most recent in the category at a location and a point in time, giving a durable identity traceable from downstream</p>
</li>
<li><p>Validate the CSV / Excel data using as much quality control logic as practical so as to fast-fail bad data</p>
</li>
<li><p>Upgrade the data, if needed, idempotently</p>
</li>
<li><p>Capture processing events as validation reports, lineage, process telemetry, errors, and logs</p>
</li>
<li><p>Write valid and upgraded output data, along with invalid data, if useful, to an immutable searchable versioned archive accessible to downstream</p>
</li>
</ul>
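<p>The steps above can be roughed out with nothing but the Python standard library. Everything here is an illustrative assumption for the sketch: the directory layout, the manifest format, and the deliberately minimal width-only validation check:</p>
<pre><code class="lang-python"># A minimal sketch of the preboarding steps above, stdlib only.
# Directory names, manifest shape, and the validation rule are all
# invented for illustration; this is not how CsvPath Framework works
# internally.
import csv, hashlib, json, shutil
from datetime import datetime, timezone
from pathlib import Path

def preboard(source, category, root):
    # 1. Land in immutable storage partitioned by category and arrival time.
    arrived = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S%f")
    landing = root / "landed" / category / arrived
    landing.mkdir(parents=True)
    kept = landing / source.name
    shutil.copy2(source, kept)
    # 2. Register a durable identity: a content fingerprint plus location.
    fingerprint = hashlib.sha256(kept.read_bytes()).hexdigest()
    # 3. Validate. Here, just require a consistent width on every line.
    with kept.open(newline="") as f:
        rows = list(csv.reader(f))
    width = len(rows[0]) if rows else 0
    valid = all(len(r) == width for r in rows)
    # 4. and 6. Write valid output to a versioned archive keyed by identity.
    if valid:
        archive = root / "archive" / category / fingerprint
        archive.mkdir(parents=True, exist_ok=True)
        shutil.copy2(kept, archive / source.name)
    # 5. Capture the processing event.
    manifest = {"file": source.name, "category": category,
                "arrived": arrived, "fingerprint": fingerprint,
                "valid": valid}
    (landing / "manifest.json").write_text(json.dumps(manifest))
    return manifest
</code></pre>
<p>Even this toy version shows why each step earns its place: the landing area preserves every arrival, the fingerprint gives downstream a durable identity, and the manifest is the start of lineage and telemetry.</p>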
<p>When you use that many words I admit it doesn't sound that simple. But it really is. And the whole point of it is time. Data comes in, we react to it, people ask questions, data gets revised, it gets reloaded, more questions are asked, problems happen, we roll back, explain, and restart, etc., etc. Tick-tock, tick-tock.</p>
<h2 id="heading-many-people-do-respect-time">Many People Do Respect Time</h2>
<p>There are, of course, many applications that know they live in four dimensions. Those apps preboard their data. <a target="_blank" href="https://strategichms.com/shms-esi/">Strategic Healthcare Management Systems</a> or SHMS, for example, seems to have gotten the memo that we live in a four-dimensional world. I stumbled on them in my not-painting-the-house travels, but I haven't used the system, so I don't know all the guts and glory. Reading their docs, they are clearly doing many, if not all, of the bullets above. They do it because that's the world we live in.</p>
<p><em>(Fwiw, I have no relationship with SHMS. And to make sure of that, I'll note that, phonetically speaking, SHMS is an awesome acronym for a software company.)</em></p>
<p>Living in four dimensions is liberating. It means you prepare for the things that are likely to happen, rather than being surprised by them. No back-flips in order to explain, retrace, or redo. Zen-like calm as you know that your system is one with the circle of life that in wheeling will surely result in the same file being resent with the same name and very slight changes from the same client over and over. You are ready. You have balance. You have accounted for time.</p>
<h2 id="heading-and-now-a-word-from-our-sponsors">And Now a Word From Our Sponsors…</h2>
<p>Or maybe you have not. On the one hand, solution builders generally do account for time through preboarding workflows, if their solution is good. Software and web services vendors, on the other hand, often don't, presumably because they want to help solution builders make ends meet.</p>
<p>If you find yourself at the pointy end of a software package that doesn't have the time to help you do better preboarding of your CSV/Excel files, it's probably time for you to take a look at <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> and <a target="_blank" href="https://www.flightpathdata.com">FlightPath Server</a>. The solution to having no data preboarding is to adopt the most robust open source preboarding tools available. There's no better time to do it.</p>
]]></content:encoded></item><item><title><![CDATA[Do Schemas Have a Place In Delimited Data?]]></title><description><![CDATA[CSV, Excel, and other delimited files have not historically had a robust schema language. In recent years the now dormant CSV Schema Language 1.0 partially filled this gap. Now CsvPath Validation Language fills it more completely. For most people, ho...]]></description><link>https://blog.csvpath.org/do-schemas-have-a-place-in-delimited-data</link><guid isPermaLink="true">https://blog.csvpath.org/do-schemas-have-a-place-in-delimited-data</guid><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Thu, 13 Nov 2025 03:02:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763002854612/6829c43e-8834-422c-b3c4-75f76d5bfb53.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>CSV, Excel, and other delimited files have not historically had a robust schema language. In recent years the now dormant CSV Schema Language 1.0 partially filled this gap. Now <a target="_blank" href="https://github.com/csvpath/csvpath">CsvPath Validation Language</a> fills it more completely. For most people, however, the unasked question is, why do I need a schema?</p>
<h2 id="heading-what-is-a-schema">What Is a Schema?</h2>
<p>Data schemas are shorthand for rules. They create a structural definition of data based on whole-part units and part-part relationships. Every structural schema could be decomposed to a set of rules; however, the abstraction of the units, a.k.a. entities, is an expressive shorthand that aids understanding.</p>
<p>We all know examples of schema languages. SQL has its Data Definition Language subset to define what tables look like and how they are related. XSD uses XML, namespaces, and XPath to create containment information models. SQL's superpower is relationships between entities; XSD's is hierarchical nesting. Both work at the level of sets of data. DDL defines a "document" as a row in a table. XSD defines a document as one XML text.</p>
<p>There are rules-based schema languages as well. To highlight two: Schematron is an XPath-based business rules language that also works on XML's hierarchical structure, but without defining entities as such. Similarly, SQL has its Data Manipulation Language subset to interact with the data in a database. Validating data is a core competency of DML, putting it on the same level as Schematron, even if validation is not DML's first priority.</p>
<h2 id="heading-csvpath-a-schema-language-for-delimited-data">CsvPath: A Schema Language for Delimited Data</h2>
<p>In comparison to XML and the data in a relational database, CSV data is both simpler to model and harder to validate. Setting aside Excel for the time being, the simplicity comes from a delimited file being, commonly, roughly the same as a relational table, but without the possibility of relationships. <em>(Again, commonly; you can of course have any pointers in CSVs you like, they just don't have a well-known function).</em></p>
<p>The challenge is that there is no server to enforce a required schema. To oversimplify in a useful way: without enforcement, the data in a database file would just be rows and columns embedded in a fancy file format. Because CSV has no enforcement context, it's the wild west. That makes things hard.</p>
<p><a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> does not act as a relational database server does, though <a target="_blank" href="https://www.flightpathdata.com">FlightPath Server</a> closes some of the gap. By itself, CsvPath Framework simply offers strong validation. It provides the enforcement that makes CSV data behave more like a relational database table. If you use CsvPath Framework as a whole data preboarding apparatus, not just a schema validator, you get more of that enforcement, and the automated quality management opportunities that come with it.</p>
<h2 id="heading-how-do-csvpath-validation-language-schemas-work">How Do CsvPath Validation Language Schemas Work?</h2>
<p>CsvPath Validation Language has an overall structure that is far less verbose than XSD, more on the level of DDL. The parts of a csvpath are:</p>
<ul>
<li><p>The root</p>
</li>
<li><p>Scanning instructions (a.k.a. the scanning part)</p>
</li>
<li><p>Match components (a.k.a. the matching part)</p>
</li>
</ul>
<p>The root is simply a <code>$</code>, optionally followed by a file path, if the user is targeting a specific file.</p>
<p>The scanning part and the matching part are each wrapped in brackets, <code>[]</code>. The scanning part contains numbers that pick out lines to scan. The matching part scans each of those lines by passing the values to its matching components. If the matching components return true, the line is considered to have matched. By default, a csvpath's match components are ANDed together.</p>
<p>This is a deceptively simple structure. In reality, there are many knobs and dials:</p>
<ul>
<li><p>Logical operation: <code>AND</code> by default, but settable to <code>OR</code></p>
</li>
<li><p>Validity test: strategy is determined by the schema author</p>
</li>
<li><p>Validity result: marked by the schema author, or not</p>
</li>
</ul>
<p>The first bullet, the logical operation, is straightforward. ANDed match components are first-false-wins; ORed components are first-true-wins. By default, a csvpath collects the lines that match. However, a line that matches isn't necessarily considered valid data. It is equally reasonable to say that matching lines are invalid. Consider this:</p>
<pre><code class="lang-apache">$[*][ 
    <span class="hljs-comment">#Cretan == "liar" </span>
]
</code></pre>
<p>The syntax is: <code>$[*]</code> means scan all lines. The <code>#</code> references a header, in this case the header is <code>Cretan</code>. What this csvpath statement says is, if the value of the Cretan header is <code>"liar"</code>, the line matches.</p>
<h3 id="heading-more-important-background-peas-before-cake">More Important Background (Peas Before Cake!)</h3>
<p>Now we have to decide if the match means the line is valid or invalid. That is the validation strategy. Both options make sense. If you want to present a user with only the bad lines, you would collect those that matched a definition of bad. If you want to present only the good lines your match components would define what good looks like.</p>
<p>In fact, we can have CsvPath Framework collect both matched and unmatched lines. As well, we can switch the match part from ANDing match components to ORing them. These are also part of the validation strategy, but they are a feature of the Framework, not CsvPath Validation Language itself. Also keep in mind that CsvPath Framework allows you the option to abstain from collecting any lines, if you like.</p>
<p>Let's add yet more wrinkles to the validation strategy, all built into the language, not the Framework.</p>
<ul>
<li><p>Declaring validity</p>
</li>
<li><p>Side effects</p>
</li>
<li><p>Print-only validations</p>
</li>
<li><p>Built-in validations</p>
</li>
</ul>
<p>A csvpath statement can use the <code>fail()</code> function to declare a line invalid. In certain circumstances a line can be invalid regardless of whether I plan to collect it. Once a line triggers the <code>fail()</code> function, the file as a whole is considered invalid, or failed. For this to work, we need another concept: the side effect.</p>
<p>Side effects are actions taken by a csvpath that do not determine if a line matches. Using the <code>print()</code> function is an example of a side effect. When my csvpath does <code>print("Epimenides is a Cretan")</code> there is no impact on the line capturing, or not, but the string is appended to the printout.</p>
<p>If I want a side effect to happen only when a line meets a condition, I can use the when/do operator, <code>-&gt;</code>, which performs the right-hand side when the left-hand side evaluates to true. For example, <code>#Liar == "Epimenides" -&gt; fail()</code> declares a line invalid only when the header value matches. The always-true <code>yes()</code> function pairs with the operator when a side effect should fire on every line, as in <code>yes() -&gt; print("Epimenides is a Cretan")</code>.</p>
<p>Side-effects in combination with the <code>print()</code> function allow us to follow in XSD and Schematron's footsteps and deliver a report-out or print-only validation. SQL and CsvPath by their nature collect data. XSD and Schematron report on the state of data, but do not collect it. However, when you use the when/do operator and the <code>print()</code> function, CsvPath Validation Language can operate in the same print-only mode. This is sometimes ideal, for example when you have very large files that may have many errors.</p>
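<p>A rough stdlib analogue of print-only validation looks like this: report on bad lines, collect nothing. The header name and the rule are invented for the example; CsvPath expresses the same idea declaratively:</p>
<pre><code class="lang-python"># Print-only validation, sketched with the stdlib: side effects only,
# no lines collected. The data and the rule are illustrative.
import csv, io

DATA = "firstname,middlename,lastname\nEve,,Smith\n,,Jones\n"

for number, row in enumerate(csv.DictReader(io.StringIO(DATA)), start=1):
    # Rule: firstname must not be empty, like string.notnone(#firstname).
    if not row["firstname"]:
        print(f"line {number}: missing firstname")  # report, don't collect
</code></pre>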
<p>Standing between collecting lines and performing only side-effects, there are many validations that are built in to CsvPath Validation Language. These come in two types:</p>
<ul>
<li><p>Language errors</p>
</li>
<li><p>Data errors</p>
</li>
</ul>
<p>Language errors are problems with how you wrote your csvpath statement. For example, if you write <code>add(5)</code> you will get a language error. This is because the <code>add()</code> function requires two values and you only provided one.</p>
<p>On the other hand, if you write <code>add(5, "three")</code> you will get a data error, because you cannot add a string that spells out three to the integer <code>5</code>. The <code>add()</code> function must be able to cast the values you pass it to numbers. In practice, a language error will happen once immediately upon your starting to validate. A data error will happen on every line of the file you attempt to validate. In this case, because on each line the <code>add()</code> match component will once again fail to add <code>5</code> and <code>"three"</code>.</p>
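<p>To make the timing difference concrete, here is a hypothetical sketch, not the real CsvPath internals, of why the two error kinds fire at different moments: arity can be checked once when a statement is parsed, while casting must be re-checked on every line:</p>
<pre><code class="lang-python"># Illustrative sketch of language errors vs. data errors. Function
# names are invented; this is not the CsvPath API.
def check_arity(args):
    # Language error: add() requires exactly two values. Checked once,
    # before any line is read.
    if len(args) != 2:
        raise SyntaxError("add() requires two values")

def add_component(a, b):
    # Data error: the cast is attempted again on every line.
    try:
        return float(a) + float(b)
    except ValueError:
        return None  # the match component evaluates negatively

check_arity(["5", "3"])              # fine; check_arity(["5"]) would raise
print(add_component("5", "3"))       # 8.0
print(add_component("5", "three"))   # None, on every such line
</code></pre>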
<p>These language and data errors are built in. They fire even if none of the rules you write are triggered by lines. Whether or not you collect the affected lines, a built-in error always results in its match component evaluating negatively. And it is these validation errors, the built-in data errors in particular, that give us the ability to write schemas.</p>
<p>Yes, this was a <em>long</em> path to take to get back to the concept of structural schemas, but the background is worth it.</p>
<h2 id="heading-what-does-a-csvpath-validation-language-schema-look-like">What Does a CsvPath Validation Language Schema Look Like?</h2>
<p>A <a target="_blank" href="https://www.csvpath.org">CsvPath Validation Language</a> schema looks more similar to SQL than to XSD. Here is a simple one:</p>
<pre><code class="lang-apache">$[*][ 
    <span class="hljs-attribute">line</span>.person( 
        <span class="hljs-attribute">string</span>(#firstname), 
        <span class="hljs-attribute">string</span>(#middlename), 
        <span class="hljs-attribute">string</span>(#lastname)
    )
]
</code></pre>
<p>We construct CsvPath schemas using the <code>line()</code> function. In this case, our csvpath statement has one entity named "person" that holds a first, middle, and last name. Moreover, our csvpath says that for a line to be valid it must have three headers named "firstname", "middlename", and "lastname", which must hold values that can be cast to strings (as all CSV values can be) or nothing, and there must be no other headers or header values.</p>
<p>This is the equivalent SQL:</p>
<pre><code class="lang-apache"><span class="hljs-attribute">CREATE</span> TABLE person ( 
    <span class="hljs-attribute">firstname</span> VARCHAR, 
    <span class="hljs-attribute">middlename</span> VARCHAR, 
    <span class="hljs-attribute">lastname</span> VARCHAR 
);
</code></pre>
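<p>What the <code>line.person()</code> entity enforces can be spelled out in plain Python. The function below is illustrative of the semantics only, not the CsvPath API:</p>
<pre><code class="lang-python"># The structural checks behind line.person(), sketched in the stdlib:
# exactly these headers, in this order, and no extra values per line.
import csv, io

EXPECTED = ["firstname", "middlename", "lastname"]

def line_person(headers, row):
    # All CSV values cast to strings, so the value check reduces to shape.
    return headers == EXPECTED and len(row) == len(EXPECTED)

DATA = "firstname,middlename,lastname\nEve,Q,Smith\n"
reader = csv.reader(io.StringIO(DATA))
headers = next(reader)
print(all(line_person(headers, row) for row in reader))
</code></pre>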
<p>We can go much further with this CsvPath schema. For example we can have</p>
<pre><code class="lang-apache">$[*][ 
    <span class="hljs-attribute">line</span>.person.distinct( 
        <span class="hljs-attribute">string</span>.notnone(#firstname), 
        <span class="hljs-attribute">string</span>(#middlename), 
        <span class="hljs-attribute">string</span>(#lastname) 
    )
]
</code></pre>
<p>With this update, we are requiring every person to be a different combination of names. We are also requiring a firstname by using the <code>notnone</code> qualifier. A qualifier is an annotation that modifies how a match component works. In this case our <code>line()</code> function has two qualifiers. The <code>distinct</code> qualifier has the meaning we just described, whereas the <code>person</code> qualifier is simply an arbitrary name that we use to refer to the entity.</p>
<p>To update the SQL to match:</p>
<pre><code class="lang-apache"><span class="hljs-attribute">CREATE</span> TABLE person ( 
    <span class="hljs-attribute">firstname</span> VARCHAR NOT NULL, 
    <span class="hljs-attribute">middlename</span> VARCHAR, 
    <span class="hljs-attribute">lastname</span> VARCHAR, 
    <span class="hljs-attribute">CONSTRAINT</span> unique_person_names UNIQUE (firstname, middlename, lastname) 
);
</code></pre>
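<p>The semantics of <code>distinct</code> and <code>notnone</code> can likewise be sketched with stdlib code. This shows the meaning of the qualifiers, not how CsvPath implements them; the data is invented for the example:</p>
<pre><code class="lang-python"># distinct and notnone, sketched: every name combination must be unique,
# and firstname must be present. Illustrative only.
import csv, io

DATA = ("firstname,middlename,lastname\n"
        "Eve,Q,Smith\n"
        "Eve,Q,Smith\n"   # duplicate combination: fails distinct
        ",X,Jones\n")     # empty firstname: fails notnone

seen = set()
for row in csv.DictReader(io.StringIO(DATA)):
    key = (row["firstname"], row["middlename"], row["lastname"])
    notnone_ok = bool(row["firstname"])   # string.notnone(#firstname)
    distinct_ok = key not in seen         # line.person.distinct(...)
    seen.add(key)
    print(key, notnone_ok and distinct_ok)
</code></pre>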
<p>There is much more you can do with <code>line()</code> and the datatype functions, <code>string()</code>, <code>integer()</code>, <code>decimal()</code>, <code>date()</code>, <code>datetime()</code>, <code>none()</code>, <code>url()</code>, <code>email()</code>, and <code>boolean()</code>. The several schema-related qualifiers add even more power. At the same time, near the top we said that schemas are just shorthand for rules. Let's look at the person entity as a rule set.</p>
<pre><code class="lang-apache">$[*][ 
    <span class="hljs-comment">#firstname </span>
    <span class="hljs-comment">#middlename </span>
    <span class="hljs-comment">#lastname </span>
]
</code></pre>
<p>This csvpath requires all three fields to be present. However, we only want the firstname header to be mandatory. That means we need to update the csvpath to:</p>
<pre><code class="lang-apache">$[*][ 
    <span class="hljs-comment">#firstname </span>
    <span class="hljs-attribute">or</span>(#middlename, none()) 
    <span class="hljs-attribute">or</span>(#lastname, none())
]
</code></pre>
<p>What's more, our schema wanted the fields to be in the order given. But match components don't work that way. Their order matters from a logical point of view, but it says nothing about the order of the headers. To get the ordering we want we have to add:</p>
<pre><code class="lang-apache">$[*][ 
    <span class="hljs-comment">#firstname </span>
    <span class="hljs-attribute">or</span>(#middlename, none()) 
    <span class="hljs-attribute">or</span>(#lastname, none()) 
    <span class="hljs-attribute">header_names_mismatch</span>(<span class="hljs-string">"firstname|middlename|lastname"</span>) 
]
</code></pre>
<p><em>(Note that header_names_mismatch returns false if names mismatch. The function's name is in some ways backwards and will be aliased in a future release to make more grammatical sense).</em></p>
<p>With the addition of two more functions we can require uniqueness across all sets of names like this:</p>
<pre><code class="lang-apache">$[*][ 
    ~ <span class="hljs-attribute">this</span> rule set is equivalent to the person entity ~ 
    <span class="hljs-comment">#firstname </span>
    <span class="hljs-attribute">or</span>(#middlename, none()) 
    <span class="hljs-attribute">or</span>(#lastname, none()) 
    <span class="hljs-attribute">header_names_mismatch</span>(<span class="hljs-string">"firstname|middlename|lastname"</span>) 
    <span class="hljs-attribute">not</span>( has_dups(#firstname, #middlename, #lastname) ) 
]
</code></pre>
<p>This version of the csvpath additionally requires each combination of the three name values to be unique across the file.</p>
<p>To recap, this latest version of the rules is more complex than the person entity it can stand in for:</p>
<pre><code class="lang-apache">$[*][ 
    <span class="hljs-attribute">line</span>.person.distinct( 
        <span class="hljs-attribute">string</span>.notnone(#firstname), 
        <span class="hljs-attribute">string</span>(#middlename), 
        <span class="hljs-attribute">string</span>(#lastname) 
    )
]
</code></pre>
<p>It is easy to see why the structural schema based on a named person entity is preferable. But the point stands: a schema is shorthand for rules. And that brings us to the power of rules to extend schemas.</p>
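<p>To make the comparison concrete, here is a rough plain-Python sketch of the same rule set. This is not CsvPath Framework's API — the function name and error messages are our own illustration:</p>

```python
import csv
from io import StringIO

def validate_person_rows(csv_text):
    """Apply the person rule set: firstname required, middlename and
    lastname optional, headers in a fixed order, and no duplicate name
    triples. Returns a list of (line_number, reason) pairs, where line 0
    means the header row itself."""
    reader = csv.reader(StringIO(csv_text))
    headers = next(reader)
    errors = []
    if headers != ["firstname", "middlename", "lastname"]:
        errors.append((0, "header names/order mismatch"))
        return errors
    seen = set()
    for i, row in enumerate(reader, start=1):
        # pad short rows so unpacking never fails
        first, middle, last = (row + ["", "", ""])[:3]
        if not first:
            errors.append((i, "firstname is required"))
        triple = (first, middle, last)
        if triple in seen:
            errors.append((i, "duplicate person"))
        seen.add(triple)
    return errors
```

Even this small sketch shows how quickly procedural validation code grows relative to the declarative csvpath above.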
<p>Without going deeply into the topic, rules cover the requirements and activities that apply to and around the schema entities. For example, a structured <code>line()</code> person entity tells you what a person is, but it doesn't tell you that a person must have no middle name if they have no last name. We might add that declarative logic like this:</p>
<pre><code class="lang-apache">$[*][ 
    <span class="hljs-attribute">line</span>.person.distinct( 
        <span class="hljs-attribute">string</span>.notnone(#firstname), 
        <span class="hljs-attribute">string</span>(#middlename), 
        <span class="hljs-attribute">string</span>(#lastname)
     ) 
     <span class="hljs-attribute">not</span>.nocontrib(#lastname) -&gt; not(#middlename) 
]
</code></pre>
<p>This way, person data that otherwise matches <code>person</code>, but which has a middle name and no last name, will return false. Again, in our schema strategy, we are selecting invalid lines.</p>
<p>In this rule the <code>not()</code> on the left-hand side of the when/do expression has a <code>nocontrib</code> qualifier. The <code>nocontrib</code> qualifier indicates that the match component does not contribute to the decision to match or not match. As a result, the lastname test is ambivalent about the presence or absence of <code>#lastname</code>. The <code>#middlename</code> test, by contrast, does contribute to the match, provided it is checked at all, which happens only when there is no lastname.</p>
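<p>In plain Python, the conditional reads roughly like this — an illustrative sketch of the rule's logic, not Framework code:</p>

```python
def middlename_rule_ok(middlename, lastname):
    """A person may only have a middle name if they also have a last name.
    Mirrors: not.nocontrib(#lastname) -> not(#middlename)."""
    if not lastname:           # the when-side: lastname is absent
        return not middlename  # then middlename must be absent too
    return True                # lastname present: middlename is unconstrained
```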
<p>The equivalent SQL would be:</p>
<pre><code class="lang-apache"><span class="hljs-attribute">CREATE</span> TABLE person ( 
    <span class="hljs-attribute">firstname</span> VARCHAR NOT NULL, 
    <span class="hljs-attribute">middlename</span> VARCHAR, 
    <span class="hljs-attribute">lastname</span> VARCHAR, 
    <span class="hljs-attribute">CONSTRAINT</span> unique_person_names UNIQUE (firstname, middlename, lastname), 
    <span class="hljs-attribute">CONSTRAINT</span> check_not_middlename_if_not_lastname CHECK (lastname IS NOT NULL OR middlename IS NULL) 
);
</code></pre>
<p>SQL is very good at rules. Generally, though, only a small fraction of validation rules are written into DDL. In CsvPath Validation Language, rules are commonly added alongside <code>line()</code> definitions. That said, CsvPath Framework makes it almost trivially easy to run sets of csvpaths against a data file, so it is essentially cost-free and often cognitively advantageous to separate rules or sets of related rules into their own csvpaths. When handled that way, the similarities to SQL rules as DML statements are clear.</p>
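<p>The DDL above can be exercised directly. Here is a small sketch using SQLite, whose dialect accepts it nearly verbatim, to show the two constraints firing:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        firstname  VARCHAR NOT NULL,
        middlename VARCHAR,
        lastname   VARCHAR,
        CONSTRAINT unique_person_names UNIQUE (firstname, middlename, lastname),
        CONSTRAINT check_not_middlename_if_not_lastname
            CHECK (lastname IS NOT NULL OR middlename IS NULL)
    )
""")

def insert_person(first, middle, last):
    """Return True if the row satisfies the constraints, False otherwise.
    Note: SQL NULLs compare as distinct in UNIQUE constraints, so the
    duplicate check only bites on fully non-null name triples."""
    try:
        conn.execute("INSERT INTO person VALUES (?, ?, ?)",
                     (first, middle, last))
        return True
    except sqlite3.IntegrityError:
        return False
```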
<h2 id="heading-one-more-thing-about-csvpath-validation-language-schemas">One More Thing About CsvPath Validation Language Schemas</h2>
<p>Our simple csvpath schema has three headers. What if we wanted five headers? How about an unlimited number? Or what if we want there to be one header before the three names, but we don't care what it is. Or even that, except we care about the name but not the type?</p>
<p>All these things are possible. CsvPath has the <code>blank()</code> and <code>wildcard()</code> type functions to handle those requirements. The <code>wildcard()</code> function takes a <code>"*"</code> for an unlimited number of headers, or an integer to represent exactly that number of headers. Similarly, <code>blank()</code> indicates one header, which we may name or not.</p>
<p>With <code>blank()</code> and <code>wildcard()</code> we have the ability to position our <code>line()</code> entity anywhere on a line we like. That becomes interesting because it gives the opportunity to define a sparse entity, meaning one that has just a few headers specified out of potentially many actual headers. Or, even more interestingly, to define multiple entities per line of data. And, indeed, we can overlap entities or even have entities share member headers. There are many possibilities.</p>
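<p>One way to picture how wildcards position an entity on a line: the sketch below — our own illustration, not Framework internals — maps named slots onto header indexes, skipping over wildcarded positions:</p>

```python
def resolve_positions(schema, headers):
    """Map a schema of named slots and wildcards onto header positions.
    A slot is a header name, ("wildcard", n) for n skipped headers, or
    ("wildcard", "*") to absorb all remaining headers."""
    pos, out = 0, {}
    for slot in schema:
        if isinstance(slot, tuple):
            pos = len(headers) if slot[1] == "*" else pos + slot[1]
        else:
            out[slot] = pos
            pos += 1
    return out
```

With this picture, a sparse address entity that starts after three wildcarded name headers resolves <code>street</code>, <code>city</code>, and <code>zip</code> to positions 3, 4, and 5.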
<p>Why might we want to have multiple entities per line? The clarity of this schema-based csvpath helps make the point:</p>
<pre><code class="lang-apache">$[*][ 
    <span class="hljs-attribute">line</span>.person.distinct( 
        <span class="hljs-attribute">string</span>.notnone(#firstname), 
        <span class="hljs-attribute">string</span>(#middlename), 
        <span class="hljs-attribute">string</span>(#lastname), 
        <span class="hljs-attribute">wildcard</span>() 
    ) 
    <span class="hljs-attribute">not</span>.nocontrib(#lastname) -&gt; not(#middlename)

    <span class="hljs-attribute">line</span>.address( 
        <span class="hljs-attribute">wildcard</span>(<span class="hljs-number">3</span>), 
        <span class="hljs-attribute">string</span>.notnone(#street), 
        <span class="hljs-attribute">string</span>.notnone(#city), 
        <span class="hljs-attribute">integer</span>.notnone(#zip, <span class="hljs-number">5</span>) 
    ) 
]
</code></pre>
<p>Contrast that to:</p>
<pre><code class="lang-apache">$[*][ 
    <span class="hljs-comment">#firstname </span>
    <span class="hljs-attribute">or</span>(#middlename, none()) 
    <span class="hljs-attribute">or</span>(#lastname, none()) 
    <span class="hljs-attribute">header_names_mismatch</span>(<span class="hljs-string">"firstname|middlename|lastname|street|city|zip"</span>) 
    <span class="hljs-attribute">not</span>( has_dups(#firstname, #middlename, #lastname) ) 
    <span class="hljs-attribute">not</span>.nocontrib(#lastname) -&gt; not(#middlename) 
    <span class="hljs-comment">#zip </span>
    <span class="hljs-comment">#city </span>
    <span class="hljs-comment">#street </span>
    <span class="hljs-attribute">integer</span>(#zip, <span class="hljs-number">5</span>) 
]
</code></pre>
<p>The rules version is more terse. But the schema version is far more readable. And in the schema we have named entities, so there is less cognitive load on us reading it. It is immediately apparent that a <code>person</code> has an <code>address</code>. When we look at the rules, it is clear we are saying something about people and addresses, but it's much more work to pick out what exactly we're saying.</p>
<h2 id="heading-to-get-back-to-the-question-what-was-the-question">To Get Back To the Question… What Was the Question?</h2>
<p>So, do schemas have a place in the world of delimited data? Do CSV and Excel files benefit?</p>
<p>I think the answer is clear from the quality of this simple person entity definition we created using CsvPath Validation Language. Moreover, at least for those using CsvPath Framework, we now have a constraint enforcer for files, much as a relational database server enforces DDL constraints on columns. Certainly there is a need for data to be determined valid or invalid as quickly as possible after its arrival. Nothing but cost and aggravation is added by letting errors drift downstream unconsidered.</p>
<p>So, yes!</p>
<p>In my mind the answer is easy. Structural schemas, rules-based schemas, and schemas that mix the two all have a strong value proposition for delimited data. The more powerful and expressive our schemas, the better our early quality gates: higher downstream data quality, less toil, and more agile processes.</p>
<p>An easy answer.</p>
<p><em>For more on CsvPath Validation Language see the</em> <a target="_blank" href="https://github.com/csvpath/csvpath"><em>Github repo’s docs pages</em></a><em>. For more about the CsvPath Framework, see</em> <a target="_blank" href="https://www.csvpath.org"><em>csvpath.org</em></a><em>. And for giving CsvPath a try, the easiest way is to download</em> <a target="_blank" href="https://www.flightpathdata.com"><em>FlightPath Data</em></a><em>.</em></p>
]]></content:encoded></item><item><title><![CDATA[Box == Pony == ETL. Really?]]></title><description><![CDATA[I had two unusual but illuminating conversations in the past few days about with engineers about CsvPath Framework and data preboarding. One of them made me do a double-take. You'll like that one, I promise. The other conversation was less surprising...]]></description><link>https://blog.csvpath.org/box-pony-etl-whaat</link><guid isPermaLink="true">https://blog.csvpath.org/box-pony-etl-whaat</guid><category><![CDATA[Databases]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[ETL]]></category><category><![CDATA[dataengineering]]></category><category><![CDATA[#data preboarding]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Thu, 06 Nov 2025 22:05:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762466477156/5180bc28-2275-4013-bf97-146c6a49b0d1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I had two unusual but illuminating conversations in the past few days about with engineers about <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> and data preboarding. One of them made me do a double-take. You'll like that one, I promise. The other conversation was less surprising, less amusing, and followed a common fallacy.</p>
<h2 id="heading-where-does-governance-begin">Where Does Governance Begin?</h2>
<p>In many companies, data is the most critical asset. When do you start protecting your assets? At what point do you start to care whether the input that determines your success in the market is actually under control and at a sufficient level of quality?</p>
<p>In many cases, perhaps most cases, data file preboarding is handled within or after data is ETLed into production databases. That means files are accepted unconsidered and unregistered and immediately pushed into a new context where their data is mixed with production data and hard to trace. Why do we do that? Largely because it’s the path of least resistance.</p>
<p>The point is, we weld ETL to the edge of our data estate because we have a convenient tool that allows us to skimp on pre-work. Just load and go. The illusion of speed with the surety of the roulette wheel.</p>
<h2 id="heading-who-does-governance-begin-with">Who Does Governance Begin With?</h2>
<p>Here's what one engineer recently said to me:</p>
<p><em><mark>There is nothing wrong with </mark></em> <a target="_blank" href="https://www.csvpath.org"><em><mark>your framework</mark></em></a><em><mark>, but I'm really just a one-trick pony. I have developed a robust generic T-SQL process that imports, processes, and stores CSV/Excel data. It's just a pure T-SQL ETL engine with no front-end and no other technologies</mark></em></p>
<p>On one hand, I love this! He's got a successful tool and a don't-fix-it-if-it-ain't-broke attitude. Moreover, T-SQL qualifies as <em>"boring technology"</em>, which is often exactly what you want. Boring is tested, bulletproof, and widespread for a reason. I might not question the tech choice here, even if it definitely wouldn't be mine.</p>
<p>But on the other hand, so many things! Setting aside the questionable self-appellation <em>"one trick pony"</em>, there's a lot to be concerned about. Apparently his tool has been used over and over. That fits our observation that a lot of companies have a preboarding approach that is simplistic. A lot of companies!</p>
<p>The question is, if the data looks right, <em>does it matter if it isn't</em>? And if it matters, does it matter that there is no paper trail to show where the process went wrong? You know my answer.</p>
<h2 id="heading-how-unitary-is-data-loading-really">How Unitary Is Data Loading, Really?</h2>
<p>Is loading data a unit of work? Maybe. My second conversation this week didn't land any classic know-thyself quotes about talented quadrupeds. Instead I got stuck on a box. Possibly a soap box.</p>
<p>The box was on a data flow diagram that showed how a file feed was received and processed through steps into an analytics database. There were several steps and four or five systems. Multiple steps were in a layered data lake. The usual stuff, joining, selecting, mastering, etc. But I got stuck on the first one because it was sourced from a cloud and pushed to a data lake and was labeled ETL.</p>
<p>One box at the edge doing one unit of work to get the lovely datas from outside the enterprise perimeter into the data lake. We put more layers between a browser and a webapp, even though each browser request typically has virtually no impact on the enterprise as a whole. Ok, bad analogy, but still. Each time this ETL box lights up it presumably has the potential to cause real havoc downstream.</p>
<p>This is the golden-hammer problem combined with the feeling that if the sky has never fallen on my own head, it probably isn't ever going to fall. When the requirement is to move some data, it is just too easy to grab the ETL tool at-hand — be it T-SQL, Informatica, Glue, Airbyte, or a notebook — and move the darn data so you can close the ticket. Too easy + too unsafe == too risky and too slow.</p>
<h2 id="heading-we-need-better-requirements">We Need Better Requirements!</h2>
<p>The reality is, that requirement is wrong or at least incomplete. In virtually every case of handling a data file feed we need to deliver a few things.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Need</strong></td><td><strong>How</strong></td><td><strong>Why</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Provenance</td><td>We capture when a file arrived, from where, how, with what name, and to what location</td><td>We do this so we know who to call if there are problems, and how to identify what we received so the sender can find a solution and resubmit the data.</td></tr>
<tr>
<td>Identification</td><td>Preboarding registers data in its arrival form, giving the specific bytes an identity that is durable throughout the data lifecycle.</td><td>We do this so there is no confusion about the start of the data lineage and no breaks in the lineage as the data progresses. Without an identity, two files labeled March-2025-invoices.csv with different content due to restatements or fixes cannot be distinguished, making forensics nearly impossible.</td></tr>
<tr>
<td>Immutable staging</td><td>When a file lands it is registered under an identity and moved to a unique location in immutable staging</td><td>Immutable data is data that can be recovered, replayed, and reasoned about. There is no hesitation about updating software because the data is always available to recover to. Forensics are easy because we identify bytes that never change due to copy-on-write semantics. Moreover, copy-on-write is inherently easier to program, so more agile, higher quality, and more debuggable.</td></tr>
<tr>
<td>Validation</td><td>Preboarding has validation as a core delivery.</td><td>We validate in preboarding in order to do two things: 1) to raise quality, and 2) to reduce manual effort, and thereby raise agility and lower costs.</td></tr>
<tr>
<td>Idempotent upgrading</td><td>As we validate we have the option to upgrade fields and field values. Ideally we do upgrading in an idempotent way, enabled by data immutability, that allows us to rewind and replay when requirements change or problems are found.</td><td>For data to be useful, the individual datum must be comparable and interpretable. Preboarding offers the opportunity to canonicalize fields and values as validation happens. While skipping validation would be counterproductive, upgrading is entirely optional and may boost productivity.</td></tr>
<tr>
<td>Lifecycle events</td><td>Preboarding throws off metadata events about states and transitions.</td><td>You cannot manage what you don’t measure. We want monitoring and alerting on the data, separate from the monitoring and alerting we have on the OS, database, and applications. The preboarding system should log events like validation errors, validation stages, lifecycle changes, etc. That way we can monitor the data operations, not just the data operations machinery.</td></tr>
<tr>
<td>Metadata collection</td><td>All the steps above and more are captured to a metadata store for future reference. Lineage, validations, the who, what, when, where, etc. of the data preboarding lifecycle.</td><td>Without capturing virtually every detail of the preboarding process we lose information that we might want during triage and forensics down the road. Moreover, we want to provide downstream data consumers all the details and assurance they need to trust the data they pull from the preboarding archive.</td></tr>
<tr>
<td>Archival publishing</td><td>The output of preboarding is pristine raw-form data, along with metadata explaining the source, journey, and current state, in an immutable archive queryable from downstream.</td><td>We need immutability in this last step just as much as in the first. The downstream consumer needs to be able to find, interpret, trust, and access every version of every data set quickly and easily.</td></tr>
</tbody>
</table>
</div><p><a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> was built specifically to do all of these things precisely because these are the things we’ve been bitten by over and over, collectively at many companies.</p>
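<p>To make the first few rows of the table concrete, here is a miniature sketch of provenance, identification, and immutable staging. This is not CsvPath Framework's API — just the core idea of content-derived identity with copy-on-write staging, with all names our own:</p>

```python
import hashlib, json, shutil, time
from pathlib import Path

def register(source: Path, staging_root: Path) -> dict:
    """Give an arriving file a durable, content-derived identity and copy
    it into immutable staging alongside a small provenance record."""
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    # content-addressed location: same bytes always land in the same place
    dest_dir = staging_root / digest[:2] / digest
    dest_dir.mkdir(parents=True, exist_ok=True)
    staged = dest_dir / source.name
    if not staged.exists():          # copy-on-write: never overwrite staged bytes
        shutil.copy2(source, staged)
    meta = {
        "id": digest,
        "name": source.name,
        "received": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "size": staged.stat().st_size,
    }
    (dest_dir / "meta.json").write_text(json.dumps(meta))
    return meta
```

Because the identity is derived from the bytes, two files that share the name March-2025-invoices.csv but differ in content get distinct identities, and re-registering the same bytes is idempotent.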
<h2 id="heading-whats-in-the-box">What’s In the Box?</h2>
<p>That <strong>↑</strong> is a lot for one box to carry. When I see a single box or one-trick pony standing in for robust data preboarding my hair starts to smoulder and the gears grind as I try to imagine all of preboarding fitting in one little box. When I point to the box and ask, often I get a blank look or — and I may be imagining this — the suspicion that I’m over-complicating things.</p>
<p>But here’s the thing, unless you live in a world where all data partners are trustable and mistakes are never made, you’re going to have to spend effort on this stuff. It’s a <em>pay me now, or pay me later</em> thing, but there’s no escaping the doing-the-whole-job tax, unless you have a special circumstance I’m not eligible for that makes fixing problems not your problem. If that’s not the case, open the box. Check inside. There must be a pony in there somewhere.</p>
<p>Then come find me. I’ve got <a target="_blank" href="https://www.csvpath.org">a pre-built preboarding tool your pony can pull into production pronto</a>.</p>
]]></content:encoded></item><item><title><![CDATA[We Hold These Truths. Data Preboarding Isn't Enough.]]></title><description><![CDATA[Let’s look at a real data file feed preboarding system to understand how easily you can improve the preboarding stage of your data pipelines.
Better preboarding has a potentially huge productivity boost from eliminating manual data review. It also ge...]]></description><link>https://blog.csvpath.org/we-hold-these-truths-data-preboarding-isnt-enough</link><guid isPermaLink="true">https://blog.csvpath.org/we-hold-these-truths-data-preboarding-isnt-enough</guid><category><![CDATA[bizOps]]></category><category><![CDATA[Databases]]></category><category><![CDATA[data]]></category><category><![CDATA[dataops]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data structures]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Thu, 30 Oct 2025 16:49:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761838395502/bfd2e9d4-6282-4b65-bed8-2b44d12f961c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let’s look at a real data file feed preboarding system to understand how easily you can improve the preboarding stage of your data pipelines.</p>
<p>Better preboarding has a potentially huge productivity boost from eliminating manual data review. It also generates impressive productivity returns from cutting out data-fail firefighting. However, the automation economics need to pencil out. It matters what tools you use for preboarding. Spoiler alert, we think <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> comes out looking very good in this comparison.</p>
<p>All these steps should be preboarding requirements:</p>
<ul>
<li><p>Immutable staging</p>
</li>
<li><p>Durable data set identification</p>
</li>
<li><p>Validation and upgrading</p>
</li>
<li><p>Descriptive and lineage metadata</p>
</li>
<li><p>Immutable, queryable publishing to downstream</p>
</li>
<li><p>Integration with existing data infrastructure</p>
</li>
</ul>
<p>All that would take a lot of person-hours to build from scratch. Believe you me; we did it, so we know. Moreover, you have to create the validation and upgrading scripts. Those are the beating heart of the push to reduce manual processes. If those scripts are hard to create, it’s hard to make a case for changing business as usual — even if the costs of business as usual are too high.</p>
<h2 id="heading-how-hard-is-it-to-govern-really">How hard is it to govern, really?</h2>
<p>The federal government (in the US) does preboarding <strong><em>a lot</em></strong>, on a huge scale, and at great cost. The FDA, CDC, DHS, and all the other agencies aggregate their data in <a target="_blank" href="https://www.usaspending.gov/">https://www.usaspending.gov/</a> using CSV file submissions. The site is an awesome effort to bring transparency to federal spending. As you can imagine, like the budgets it tracks, it is massive. The infrastructure is an impressive example of preboarding. It is all <a target="_blank" href="https://github.com/fedspendingtransparency">open source on Github</a>.</p>
<p>To the folks who built this, if you’re out there, your work is awesome. I’m using it just as an illustration in hand-wavy mode, not factually-precise mode.</p>
<p>Most organizations don’t face this scale of data file feed ingestion challenge. Nevertheless, all companies collect data and must make sure it is trustworthy, even though it comes from sources with different priorities, levels of technical sophistication, development cycles and SDLCs, etc., etc. Most of us are a microcosm of <a target="_blank" href="http://usaspending.gov/">usaspending.gov</a>, to some degree. And when we don’t get it right, money is lost and developers and BizOps teams don’t sleep.</p>
<p>But data quality has to be affordable. The open source <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a>, and <a target="_blank" href="https://www.flightpathdata.com">FlightPath</a> products can help make data preboarding attainable, without the engineering effort of usaspending.gov.</p>
<h2 id="heading-how-does-csvpath-framework-lower-the-implementation-cost">How does CsvPath Framework lower the implementation cost?</h2>
<p><strong>First</strong>, a clear architecture. CsvPath Framework makes doing the right thing easy. Developers, architects, and BizOps people don’t sit in conference rooms trying to decide what to build. That saves time, opportunity cost, and money spent on consultants and coffee and bagels.</p>
<p><strong>Second</strong>, implementing data validation rules using CsvPath Validation Language is often simpler than doing the same validations using SQL or Excel macros. Once you get started it can move quickly while being more understandable for everyone involved.</p>
<p><strong>Third</strong>, CsvPath Framework often dovetails neatly with the way your data file feeds are handled today. It can be set up to consume and produce almost exactly the same filesystem directory structures and filenames, so other steps in the delivery chain may need only minor tweaks.</p>
<p>Of these three, the business rules automation simply has to be productive. Without that, the manual investment continues, even if other benefits of preboarding are realized. So let’s get those rules working! Looking at an example preboarding rule from <strong>usaspending.gov</strong> gives you a good idea of how CsvPath makes implementation easier.</p>
<h2 id="heading-rubber-meet-seriously-hot-tar">Rubber, meet seriously hot tar</h2>
<p>I’ll pause to say it again: this is an illustrative exercise only. I don’t know the details behind usaspending.gov’s business rules. And I don’t know if writing csvpaths would be the perfect answer for this specific system — no solution can possibly be right for every situation. Still, it is a good illustration of why considering <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> for your preboarding is a very good idea.</p>
<p>Data comes into the system as CSV files. It is loaded into a database. This is ELT — extract, load, transform. As part of the ELT process the data is identified, staged, and checked. I won’t go into how good the identification is, if the staging is immutable copy-on-write, how raw data can be found, and if its lineage is available, etc. I’m guessing that all checks out. These guys don’t mess around.</p>
<p>In the validation process there are <a target="_blank" href="https://github.com/fedspendingtransparency/data-act-broker-backend/blob/master/dataactvalidator/config/sqlrules/sqlRules.csv">hundreds of rules</a>. They are all written in SQL, as you would expect from the architecture. Picking one at random, rule FABS31.1, <a target="_blank" href="https://github.com/fedspendingtransparency/data-act-broker-backend/blob/master/dataactvalidator/config/sqlrules/fabs31_5.sql">let’s have a look</a>. Yep, that’s a lot of SQL.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761834711822/3e062a79-598f-47c4-ac22-34d654e547b8.png" alt class="image--center mx-auto" /></p>
<p>Now that’s not the most impenetrable SQL in the world; I’ve seen worse. But it’s a lot. Without the rule text I would have to find a developer and spend some time. Undoubtedly, I’d need an SME too. And the SME would probably not be able to pick apart the SQL themselves.</p>
<p>But here’s the thing. That rule text in the comment at the top is pretty understandable. By basically anyone. Plus or minus a few assumptions about what the look-up tables are like, you can probably sketch out how this rule works in about three lines. To wit, the rule text is three lines long.</p>
<p>Here is approximately the same rule in CsvPath Validation Language. It’s sitting in <a target="_blank" href="https://www.flightpathdata.com">FlightPath Data</a>, the development and ops console for FlightPath Server and CsvPath Framework.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761834724764/79efd046-68d5-4284-bc72-88774feb1612.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-bash">$[*][ 
     not( <span class="hljs-keyword">in</span>( <span class="hljs-comment">#LegalEntityCountryCode, #countries.variables.foreign_countries) ) -&gt; take()</span>

    not(<span class="hljs-comment">#ActionType == A) -&gt; skip()</span>

    before(<span class="hljs-comment">#ActionDate, date("%b %d, %Y"))  -&gt; skip()</span>

    none(<span class="hljs-comment">#AwardeeOrRecipientUEI) -&gt; skip()</span>

    not(  
        and( 
            <span class="hljs-keyword">in</span>(<span class="hljs-comment">#AwardeeOrRecipientUEI, $sam.variables.ids), </span>
            or(
                after(<span class="hljs-comment">#ActionDate, date("%b %d, %Y")),</span>
                between(
                    <span class="hljs-comment">#ActionDate, </span>
                    get(<span class="hljs-variable">$sam</span>.variables.start_dates, <span class="hljs-comment">#AwardeeOrRecipientUEI),</span>
                    get(<span class="hljs-variable">$sam</span>.variables.end_dates, <span class="hljs-comment">#AwardeeOrRecipientUEI)</span>
                )
            )     
        )
    )
 ]
</code></pre>
<p>The three lines of rule definition text start the top comment, followed by a five-line implementation note and the name of the rule. All in, 13 lines of comments. The rule itself is only 19 declarative lines, and those lines are easy to read. Compare that to 48 highly technical lines of SQL. I think you’ll see the advantage.</p>
<h2 id="heading-now-we-assume-a-few-things">Now we assume a few things</h2>
<p>Again, in case I have said it enough, we’re making a lot of assumptions and speculations here. Some of them are:</p>
<ul>
<li><p>We assume a look-up list of foreign countries is available</p>
</li>
<li><p>We also assume that SAM registration dates are available for lookup by ID</p>
</li>
</ul>
<p>These are not big asks. The SQL needs them too. We also made some assumptions about the result set required. Using CsvPath’s <code>collect(#header)</code> function to create the line-by-line results we want to keep is simple, but we would need to know more about the shape and naming of the data.</p>
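<p>Given those assumptions, the rule's logic can be read back as plain Python. This is our own speculative rendering of the csvpath above, not the real usaspending.gov rule: the lookup structures and both cutoff dates are hypothetical stand-ins:</p>

```python
from datetime import date

def fabs_rule_matches(rec, foreign_countries, sam_ids, sam_windows,
                      early_cutoff, late_cutoff):
    """Roughly mirror the csvpath: skip out-of-scope records, then match
    (flag as invalid) recipients whose UEI is not registered in SAM for
    the action date. Cutoff dates are hypothetical placeholders for the
    elided date literals in the rule."""
    if rec["LegalEntityCountryCode"] in foreign_countries:
        return False      # take() keeps only non-foreign records
    if rec["ActionType"] != "A":
        return False      # skip()
    if rec["ActionDate"] < early_cutoff:
        return False      # skip()
    uei = rec["AwardeeOrRecipientUEI"]
    if not uei:
        return False      # skip()
    in_window = (uei in sam_ids and
                 (rec["ActionDate"] > late_cutoff or
                  sam_windows[uei][0] <= rec["ActionDate"] <= sam_windows[uei][1]))
    return not in_window  # match == rule violation
```

Whatever the exact details, the shape is the point: a handful of guard clauses and one registration-window test, readable by an SME without a SQL decoder ring.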
<p>The final assumption is important. Usaspending.gov has a couple hundred quite complex rules. What are the performance parameters and requirements? SQL and CsvPath process data differently. CsvPath is more Spark-like, in that it works row-by-row. It’s hard to compare to SQL without testing, because the processing method is so different.</p>
<p>SQL is fast, but it can have performance bottlenecks. Some queries are molasses. But most queries operate on indexes, caches, and partitioning. It is likely that the SQL query would run faster. How much faster? Would there be a meaningful difference for an automated, lights-out process? Hard to say without more information. This rule is very likely doable. And most of us have much less data than the feds, so we might not care.</p>
<p>And one last question: could the csvpath be even better? Would adding a multi-rule schema be more efficient? What about using breadth-first execution? Could I simplify the logic even more? Answering these questions might have real upside.</p>
<h2 id="heading-directionally-this-is-interesting">Directionally, this is interesting!</h2>
<p>Imagine all the benefits for your own data estate. No architectural vision to suss out. No need to build the preboarding solution. An easy fit with existing infrastructure. Up and running first trials in just days. And a simple business validation language that mortals can read, and maybe even write for themselves. Imagine.</p>
<p>Yes, it is a somewhat artificial example, given the number of things we don’t know about usaspending.gov. But, nevertheless, the opportunity is clear. You owe it to yourself, your devs and your BizOps team to <strong>consider a <em>simpler</em> preboarding solution</strong>: CsvPath Framework and FlightPath Server.</p>
<p>And hopefully the usaspending.gov example reinforces the point that preboarding is important. Uncle Sam wants you to preboard your data — however you choose to do it!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761949159947/a653e746-de6d-4086-aad1-86e028d5d06a.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Your Data Lake Is Flying Blind]]></title><description><![CDATA[This past week, FlightPath Server has (finally!) taken to the skies, in tandem with FlightPath Data.

In honor of FlightPath Server's release, let's do the airline analogy for data preboarding. We all talk about landing data and data in-flight. As an...]]></description><link>https://blog.csvpath.org/your-data-lake-is-flying-blind</link><guid isPermaLink="true">https://blog.csvpath.org/your-data-lake-is-flying-blind</guid><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Thu, 30 Oct 2025 00:40:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761784649311/e08f6d4c-0951-4545-aa4e-f13538035272.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This past week, <a target="_blank" href="https://www.flightpathdata.com">FlightPath Server</a> has (finally!) taken to the skies, in tandem with FlightPath Data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761779203748/5a48e69f-a003-4304-aae0-c268e205a330.png" alt class="image--center mx-auto" /></p>
<p>In honor of FlightPath Server's release, let's do the <strong>airline analogy for data preboarding</strong>. We all talk about landing data and data in-flight. As analogies go, air travel is for sure a good one.</p>
<h2 id="heading-imagine-an-airline-lands-a-plane">Imagine an airline lands a plane</h2>
<p>The pilot has identified the flight and taxied. The plane approaches the gate only a little late. The crew docks and the door opens. The ground crew does a double-take: this wasn’t the flight on the clipboard.</p>
<p>Regardless, deplaning starts. 70% of the people need to catch a connecting flight. They elbow their way to the front, spill out the gangway, and run for the next gate.</p>
<p>Most of them, but not all, find the correct gate in a timely way, despite the hub airport's best attempts to confuse and mislead.</p>
<p>At the gate they rush onto the plane without showing tickets and grab any seat they can. The flight is, of course, overbooked, but nobody checks. Moreover, at least a few of the rushing flight-catchers don't realize they caught the wrong flight until they are in the air. The flight attendants are puzzled and immediately upgrade them to business class, because isn’t that what you do? When the flight lands... wherever it does, the misdirected travelers do the mad scramble again for another flight, hoping they end up in the right city this time. Bags have gone missing.</p>
<p>Now, let's pause for breath and look at that picture. And this is why it's not quite as good an analogy <a target="_blank" href="https://blog.csvpath.org/your-data-lake-is-selling-sketchy-goods">as retail deliveries</a>. Basically, because it's factually correct!</p>
<p>Ok, ok, <strong>just kidding</strong>. It is not correct, and it's not fair to the hard working, caring airline employees that somehow manage to make flying <em>not</em> this experience. So I take it back, with apologies!</p>
<h2 id="heading-what-really-happens-up-there">What really happens up there?</h2>
<p>In the real world passengers who have connecting flights exit the plane in good order, sometimes first, if there were delays. They follow clear signs and instructions to the next gate. At the gate they check in. If they have questions, the gate attendants are there to answer them.</p>
<p>Boarding is announced 20 minutes ahead, and again as it nears. Boarding starts after the plane is cleaned. It goes by seating groups and classes. People are more or less polite and take turns. Tickets are scanned carefully and emergency-exit questions are asked. Regulation-sized bags are stowed; oversized ones are diverted to checked baggage. Assigned seats are taken. If a passenger is not on the manifest, they are rerouted before the doors shut.</p>
<p>The flight is announced over and over throughout the process. Everyone knows the flight’s identity and destination. Nobody is surprised by anything, much less after the plane pulls back from the gate.</p>
<p>That's how it works for 12 million fliers every day. Generally, it goes surprisingly well. Seven hundredths of one percent of trips worldwide result in lost or delayed bags, and far fewer in the US, I’m happy to say. I'll take those odds! We remember the problems because they are personal, but the vast majority of those millions of trips are uneventful. Data should be so lucky!</p>
<h2 id="heading-you-see-what-im-driving-flying-at">You see what I'm <s>driving</s> flying at?</h2>
<p>In the world of Data, the trip is often more chaotic. Data enters the organization from data partners to the tune of millions of data points per day, with loose processes and low-flying governance. Things get messy fast.</p>
<p>In many organizations, the identity of the data set — crucially, the version of the set, not the set as a concept — is unclear. The seating assignment is scrambled. The individual data points aren't checked against a schema and may not have a ticket. The next leg in the journey is often unclear. And there is no record of what data points passed what gates managed by what attendant.</p>
<p>Moreover, when a data point is eventually found to be in the wrong seat or on the wrong flight, getting it into the right seat or off the plane disrupts our clarity about the other data points from earlier flights, calling the whole database’s fitness for production into question. Ultimately the whole corpus of data, all of it essentially in-flight, is repeatedly perturbed and becomes suspect because of the poor handling of in-place modifications forced by new data rushing the gates. The whole data flow grinds to a halt for re-ticketing.</p>
<h2 id="heading-data-preboarding-your-traffic-control-pilot-and-attendant">Data preboarding: your traffic control, pilot, and attendant</h2>
<p>The data preboarding process is about bringing airline-like operations to data file feed engineering and operations. Ingestion of data file feeds should have two clear stages. Preboarding, to land, register, validate, and generate metadata history. And loading, to move <em>"ideal-form"</em>, trustworthy raw data into the data lake, data warehouse, applications, analytics, and AI.</p>
<p>This isn't complicated and it's not controversial. We try to take in data methodically, just like travelers try to get to the right gate. A solid preboarding process is how to take the drama and heroics out, lower review and triage costs, minimize customer risks, and help everyone be more agile and responsive.</p>
<h2 id="heading-the-flightpath-team-takes-wing">The FlightPath Team Takes Wing</h2>
<p>As said at the top, FlightPath Server recently joined <a target="_blank" href="https://www.flightpathdata.com">FlightPath Data</a> and <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> to complete the leading data file feeds preboarding solution. FlightPath Server's role is, first, to listen for inbound data arrivals and begin the preboarding process. And second, FlightPath Server provides an API for downstream data consumers to find trustworthy data and metadata published in an immutable archive. <strong>FlightPath + CsvPath is an open and free architecture for preboarding that you can roll out rapidly</strong>.</p>
<p>What you get, besides peace of mind, lower costs, etc., is a solution that makes data intake simple through:</p>
<ul>
<li><p>Immutable staging</p>
</li>
<li><p>Durable identification</p>
</li>
<li><p>Validation and upgrading</p>
</li>
<li><p>Descriptive and lineage metadata</p>
</li>
<li><p>A permanent archive queryable from downstream</p>
</li>
</ul>
<p>And it's a solution that fits into your current data estate, integrated with the same cloud services, MFT servers, databases, metadata protocols, and webhook senders and receivers you already use.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761780103916/a76c2d36-f547-4406-aa3f-edb2be04ede6.png" alt class="image--center mx-auto" /></p>
<p>Without data we'd get nowhere. Without data preboarding we won't enjoy the trip. With FlightPath Data the air is smooth and the sun is shining. Come fly with us!</p>
]]></content:encoded></item><item><title><![CDATA[AI Is a Lossy Knowledge Format]]></title><description><![CDATA[FlightPath Server’s product launch is next week. It will take its place next to the FlightPath Data frontend as a key link in the premier data preboarding solution.

At the end of the last FlightPath launch, as the dust settled, I looked at some of t...]]></description><link>https://blog.csvpath.org/ai-is-a-lossy-knowledge-format</link><guid isPermaLink="true">https://blog.csvpath.org/ai-is-a-lossy-knowledge-format</guid><category><![CDATA[Data Science]]></category><category><![CDATA[AI]]></category><category><![CDATA[dataops]]></category><category><![CDATA[data management]]></category><category><![CDATA[data-engineering]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Tue, 28 Oct 2025 20:30:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761683021726/fb1dea3d-bc22-4df4-b1bb-f281bda67e2e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://www.flightpathdata.com">FlightPath Server</a>’s product launch is next week. It will take its place next to the <strong>FlightPath Data</strong> frontend as a key link in the premier data preboarding solution.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761682551605/f26cde7b-90be-4073-91e0-11879f984d0c.png" alt class="image--center mx-auto" /></p>
<p>At the end of the last FlightPath launch, as the dust settled, I looked at some of the other folks on Product Hunt. Two of them jumped out at me for combining AI and data extraction. Both tools took unstructured or semi-structured data and ran it through LLMs to generate validated data. That hit close to home. Here’s why.</p>
<p>Some years ago I led the product management function for a company that collected unstructured data, processed it, and sold data feeds. To spare all concerned (innocent, guilty, and bystanders) I’ll say we dealt in classified ads data. Remember classified ads? I don’t, but I hear they were cool.</p>
<p>We made parsers that took raw ads and turned them into structured data in an internal XML format. The XML files were aggregated in document and relational databases and ultimately sold as CSV files or through APIs and other software. It was a good business. We were considered the best at it.</p>
<p>The teams creating and operating these products were about 300 strong at their peak, give or take. They used NLP, semantic search, rules-based expert system ontologies and logic, and deep learning. We sold our products to Google and lived in fear of Google creating a competing product and kicking us to the curb. Ultimately that happened, but not the way I’d expected.</p>
<p>In 2017, as everyone knows, Google decreed that attention is all you need. Within four years, the world was on fire with AI. I’d moved on from the classified ads to another vertical search and data company with a similar NLP-heavy tech stack, but in a different domain. Eventually I had to do the compare and contrast of the layered NLP-&gt;search-&gt;deep learning I’d been doing vs. new LLM and RAG ways of doing the same thing.</p>
<p>I started with just getting structured data from a classified ad using one of the new LLM APIs. I forget which, but for the record, I love Claude. :) Creating a test harness to use the LLM API to process a file took only about 20 minutes from a cold start.</p>
<p>In about 30 minutes I stopped. I had convinced myself that I had just created a better classified ad parser than a team of 40 people had done over about a decade. I was floored!</p>
<p>Back to the present time. Product launches. So, then. I saw these two products on Product Hunt that were in the CSV data extraction space. With FlightPath, my own new launch, being in an adjacent data management space, they caught my eye.</p>
<p><a target="_blank" href="https://www.flightpathdata.com">FlightPath</a> is the frontend for the open source <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a>. It is a data preboarding tool. What is data preboarding? I’m glad you asked!</p>
<h2 id="heading-the-first-step-is-ingesting-new-knowledge">The first step is ingesting new knowledge</h2>
<p><strong>Data preboarding</strong> is a more specific term for ingestion or onboarding. Basically, getting new raw data into the enterprise takes two steps, at a high-level: preboarding and loading. The preboarding step is a clarifying shift-left of activity that unfortunately often happens after data is made available in the data lake or data warehouse:</p>
<ul>
<li><p>Registering the data with durable identity in an immutable staging area</p>
</li>
<li><p>Validating and upgrading it in an idempotent way to <em>ideal-form raw data</em></p>
</li>
<li><p>Generating metadata for full explainability and provenance</p>
</li>
<li><p>Publishing the metadata and data in an immutable trusted publisher for downstream consumers</p>
</li>
</ul>
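<p>A short Python sketch may make those four steps concrete. Everything below is illustrative: the function name, layout, and trivial header check are assumptions of mine, not the FlightPath or CsvPath API.</p>

```python
import hashlib
import json
import shutil
from pathlib import Path


def preboard(source_file: str, staging_root: str) -> dict:
    """Hypothetical sketch of the four preboarding steps; not a real API."""
    src = Path(source_file)
    content = src.read_bytes()

    # 1. Register: derive a durable identity from a content fingerprint
    #    and copy the file into a write-once staging area keyed by it.
    fingerprint = hashlib.sha256(content).hexdigest()
    staged = Path(staging_root) / fingerprint / src.name
    staged.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, staged)

    # 2. Validate and upgrade, idempotently. Here, just a trivial check
    #    that the file has a delimited header row and at least one record.
    lines = content.decode("utf-8").splitlines()
    valid = len(lines) > 1 and "," in lines[0]

    # 3. Generate metadata for explainability and provenance.
    metadata = {
        "source": str(src),
        "fingerprint": fingerprint,
        "valid": valid,
        "rows": max(len(lines) - 1, 0),
    }

    # 4. Publish the metadata next to the data for downstream consumers.
    (staged.parent / "manifest.json").write_text(json.dumps(metadata, indent=2))
    return metadata
```

<p>Because identity comes from the bytes themselves, re-running the intake on the same file is a no-op: the same content lands in the same write-once location with the same fingerprint.</p>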
<p>The point being, <em>validation is central</em> to the preboarding concern. And that’s where the conundrum is. If LLMs can extract data so well, what is the point of parsers and NLP, or (my present concern!) a solid preboarding architecture? Anthropic would have you simply point Claude at a pile of unstructured text and create CSVs, or at a pile of CSVs and create knowledge and insight, and… profit!</p>
<p>My feeling is that there are two types of tools in this corner of the world. To oversimplify: those that are inherently lossy and those that, in principle, can never be wrong. As you would imagine, I believe the LLMs can and should eat up essentially all the use cases for lossy tools. And, conversely, I would never (or at least not for the next few years) let an AI attempt to muscle in on handling the never-get-it-wrong use cases.</p>
<p>The thing with the classified ads is straightforward. The LLM gave me 90% correct data on the admittedly small trial I did. The tools I helped build almost 10 years ago would do well to get into the 70% range, downhill with the wind. And that was considered Ok.</p>
<p>70% was a lot higher than 0% and required just milliseconds to do. Without the parser you’d be cutting and pasting or re-keying. Neither alternative was tenable. The expectation was that the data would be pretty ugly. And there was an acceptance that people would never be satisfied. So long as people paid for the data in the end, it didn’t matter about the sausage-making.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761682765683/07b05c3a-3089-4041-8b5f-b6d7e027dcd9.png" alt class="image--center mx-auto" /></p>
<p>But now the sausage-making is gone. Something like 40 FTE * $50,000 (average; we were a global company) in product development salaries alone goes away. The output gets to 90% correct. Time to market dives into the floor. And the main concern now is the performance of an API that you don’t have to run yourself, unlike the old API that was equally performance-challenged but you did have to run yourself. <em>Crazy!</em> Crazy awesome.</p>
<h2 id="heading-sometimes-kinda-sorta-is-not-ok">Sometimes kinda-sorta is not Ok</h2>
<p>But that’s just the lossy side. When a much-beloved AI — that shall remain nameless — recently got into an out-and-out disagreement with me as to whether Boston is the capital of Massachusetts, I found myself rolling my eyes and laughing. (For the record, it is.) When I’m looking at my tax returns, medical records, or credit card statement I am decidedly not in a lossy mood and I’m not laughing. If we’re talking about generating candidate insights from a million anonymized records, sure, precision isn’t the issue. But when we care specifically about one record, it manifestly is the issue.</p>
<p>The world is full of data that should be immutable, idempotent, deterministic, and explainable. It should be specified in detail and held to spec rigorously at each stage of a well understood lifecycle and journey. This is where, when the data is in CSV or Excel, CsvPath Framework is the right tool for the job. Obviously LLM AIs cannot perform acceptably in that world. By their design, in those cases, they are not the right tools for the job.</p>
<p>So is there no value to LLMs in a world of predictable precision where Things Must Be Correct? Not at all. LLMs are great at generalizing the high-dimensional predictions that go into creating forward-looking rules based on past results. If you settle on a highly structured form of data constraint definition for the LLM to use, even better. As many of us know, LLMs are good at predicting acceptable code for specific cases. The limitations of structured language reduce the opportunities for LLMs to give results of varying quality. And the limitation helps engineers nudge, refactor, and test the code into production form efficiently.</p>
<p>Crystallize an LLM prediction in code and you have productivity with protection. This is essentially the best of both worlds. An LLM can quickly write candidate rules for CSV or other data processing chores based on examples and sample data. A data engineer can quickly see if the LLM is talking trash or spot on. If the former, adjust or ask the question another way. And at runtime, the results of executing a rule or schema against actual data are deterministic and provably correct.</p>
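<p>To make that loop concrete, here is the kind of candidate rule an LLM might draft from sample data, sketched in plain Python rather than CsvPath Validation Language. The field names and constraints are invented for illustration. Once an engineer reviews and tests a rule like this, it runs deterministically: same input, same verdict, every time.</p>

```python
import csv
import io


def check_order_rows(csv_text: str) -> list:
    """Candidate rule: every row needs a non-empty SKU and a positive
    integer quantity. Deterministic and cheap to verify against samples."""
    errors = []
    # start=2 because line 1 is the header row
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        if not (row.get("sku") or "").strip():
            errors.append(f"line {line_no}: missing sku")
        qty = row.get("qty") or ""
        if not qty.isdigit() or int(qty) <= 0:
            errors.append(f"line {line_no}: bad qty {qty!r}")
    return errors
```

<p>The engineer's review is fast precisely because the rule is small, readable, and testable against known-good and known-bad samples.</p>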
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761682819837/937f3b2b-cc23-4872-a3d6-63da58805235.png" alt class="image--center mx-auto" /></p>
<p>All that said, how do I feel about products that purport to directly use LLMs to create data excellence? Mixed. For certain use cases, it’s a no-brainer. For others, not so much.</p>
<p>In fact, LLM data munging is not where <a target="_blank" href="https://www.flightpathdata.com">FlightPath</a> and <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> live. We deal in precision and predictability at scale. But still, for many purposes where data outcomes can be approximate or inspired, LLMs are a perfect fit. And similarly for generating rules and schemas based on sample data we find LLMs are great users of CsvPath Validation Language. Just so long as their work can be crystallized in well-tested rules — we’re all about that!</p>
]]></content:encoded></item><item><title><![CDATA[Your Data Lake Is Selling Sketchy Goods]]></title><description><![CDATA[Consider This Data Ingestion Analogy
Bob the shopkeeper runs an office supply store. He has steady foot traffic in a mall and does well.
Early one day, Bob calls his distributor and orders pens, paperclips, and reams of paper. Shortly, his distributo...]]></description><link>https://blog.csvpath.org/your-data-lake-is-selling-sketchy-goods</link><guid isPermaLink="true">https://blog.csvpath.org/your-data-lake-is-selling-sketchy-goods</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Developer]]></category><category><![CDATA[dataops]]></category><category><![CDATA[Managed IT Services]]></category><category><![CDATA[Data-lake]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Tue, 28 Oct 2025 20:06:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761681077154/f2245fd3-4099-4d64-8350-dec21864b1a0.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-consider-this-data-ingestion-analogy"><strong>Consider This Data Ingestion Analogy</strong></h2>
<p>Bob the shopkeeper runs an office supply store. He has steady foot traffic in a mall and does well.</p>
<p>Early one day, Bob calls his distributor and orders pens, paperclips, and reams of paper. Shortly, his distributor’s truck pulls up with a delivery. The driver drops six large crates on the dock. Bob signs and the truck drives off.</p>
<p>Bob rips open the boxes and immediately runs all the goods out to the sales floor. He props open the front door and welcomes shoppers into the store.</p>
<p>Now, a question for you:</p>
<p><em>Has this ever happened?</em></p>
<p><strong>No, never!</strong> Not since goods showed up in horse-drawn carts.</p>
<p>For sure, Bob places an order and the truck comes. But then Bob does something radical — and this is the data analogy. He ingests his delivery <em>methodically</em> by:</p>
<ul>
<li><p>Opening the boxes and checking the quantity of goods</p>
</li>
<li><p>Looking to see if there is breakage or incorrect items</p>
</li>
<li><p>Scanning each item as he unpacks the shipment</p>
</li>
<li><p>Putting each item on inventory shelves in date order</p>
</li>
<li><p>Updating the cost basis and pricing in the inventory system</p>
</li>
</ul>
<p>Only after doing all that does Bob select what items should go on the shelf for customers to buy.</p>
<p>Of course he does it that way! How else would you run a successful shop?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761681228926/e5ba3ca1-5bb1-40e9-9248-65cc78a0a4e0.png" alt class="image--center mx-auto" /></p>
<p>By way of analogy, Bob’s delivery handling process is pretty similar to how your organization <em>should</em> take in data from its data partners. Data products are like any other products. They have value and should be handled with care.</p>
<h2 id="heading-lets-do-ingestion-dataops-like-bob">Let’s Do Ingestion DataOps Like Bob</h2>
<p>If we handled inbound data files like Bob handles retail goods we would:</p>
<ul>
<li><p>Collect the data files into immutable versioned storage with clear naming</p>
</li>
<li><p>Give each item of data a unique identifier that is durable through the intake process and beyond</p>
</li>
<li><p>Validate that the files contain the data expected in correct form and in the amount required</p>
</li>
<li><p>Idempotently upgrade any malformed data to its ideal raw-data form</p>
</li>
<li><p>Publish the data and metadata files to an immutable permanent archive available to downstream data consumers</p>
</li>
</ul>
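<p>The first two bullets, immutable versioned storage and a durable identifier, can be sketched as a simple content-addressed naming scheme. The layout below is hypothetical, not how FlightPath actually stores files:</p>

```python
import hashlib
from datetime import datetime, timezone


def staging_key(partner: str, feed: str, content: bytes) -> str:
    """Build a write-once storage key. The content hash gives each
    delivery a durable identity, and re-delivery of identical bytes maps
    to the same key instead of silently overwriting a different version."""
    digest = hashlib.sha256(content).hexdigest()[:16]
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return f"staging/{partner}/{feed}/{day}/{digest}.csv"
```

<p>With keys like these, "what did we receive from this partner on this day?" becomes a listing operation rather than an investigation.</p>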
<p>Of course we would do it that way! How else would you run a successful data ingestion operation?</p>
<h2 id="heading-this-is-data-preboarding"><strong>This Is Data Preboarding!</strong></h2>
<p>Data preboarding is a method of data ingestion. When you preboard data you do a methodical intake process to allow you to load “ideal-form” raw data into your data lake, data warehouse, or application. The preboarding process makes sure your raw data is <em>trustworthy, under control, and traceable</em>. This is edge data governance that makes a difference!</p>
<p>With the guarantees data preboarding provides, your loading process can focus on the specific needs of each downstream data consumer. Data consumers need joins, splits, mastering, schema mapping, format transformations, parallel processing, multiple system loading, aggregation and summation, and a host of other business requirement steps. What data consumers don’t need is firefighting untrustworthy data from unclear sources with poor provenance, uncontrolled changes, and processing steps that may or may not have happened.</p>
<p>No really, <em>data consumers don’t need that!</em></p>
<h2 id="heading-everyone-preboards-their-data-somehow"><strong>Everyone Preboards Their Data… Somehow</strong></h2>
<p>We say that data files have been preboarded when all the steps have been addressed: identification, versioned storage, validation, upgrading, metadata tracking, and publishing. But really, preboarding is whatever you do before your data is accepted into the technology organization. If you simply throw a file from MFT (managed file transfer) right into the data lake, then that’s your preboarding process. Not a very good one, but there it is.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761681321142/7e94692a-f51a-4b45-aa37-c0cfc2aab9cc.png" alt class="image--center mx-auto" /></p>
<p>Historically, the problems with doing data preboarding well have been:</p>
<ul>
<li><p>It takes effort to successfully have nothing happen — nothing in this case is good!</p>
</li>
<li><p>There have been few well-known architectures and few tools specifically for data preboarding</p>
</li>
<li><p>CSV files — the ugliest and most problematic data — are hard and frustrating</p>
</li>
<li><p>Because of the lack of glamorous architectures, the pile of old, unloved scripts, and the goal of invisible success, the best people gravitate elsewhere</p>
</li>
</ul>
<p>And yet, <em>the problems of not preboarding your data well are every bit as urgent as the problems of having your data center fail or your website taking a siesta every day.</em> Big money and reputational risk are tied up in handling those unglamorous CSV files!</p>
<h2 id="heading-making-data-preboarding-exciting"><strong>Making Data Preboarding Exciting</strong></h2>
<p>Business-existential work that is hard, yet has the potential for the kind of innovation that gets the spotlight, will always find its way into the hands of the best people. At least, it will if you focus on how good your data preboarding can and should be.</p>
<p>There are exciting architectures and tools for data preboarding that are an order of magnitude more interesting to create than the early-days file system-heaps of data we called a data lake. <em>And the win from lowering the cost of manual data processing, ending firefighting, and protecting dollars-and-cents liability should be held up as the big deal that it is.</em></p>
<p>All it takes is good tools and the realization that data preboarding makes it possible to win. And, of course, it also takes the actual desire to win the data game. It definitely takes that.</p>
<h2 id="heading-go-forth-and-preboard-your-data"><strong>Go Forth and Preboard Your Data!</strong></h2>
<p>If you’re ready to get control of your data partnerships and their out-of-control CSVs, Excel files, and other tabular data, look to <a target="_blank" href="https://www.flightpathdata.com">FlightPath.</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761681818935/9ac7a7aa-880a-4393-b34c-177522f59d03.png" alt class="image--center mx-auto" /></p>
<p>FlightPath is a drop-in solution based on <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a>. It is the preeminent data preboarding architecture. And it packs the validation, lineage, and data staging tooling you need to run a successful DataOps ingestion process.</p>
]]></content:encoded></item><item><title><![CDATA[The Bermuda Triangle Of Data]]></title><description><![CDATA[Let me tell you about a data ingestion problem that was incurred due to a faulty preboarding process. I'm changing a few details but this is basically how it happened. Ultimately the teams got through it. Their data preboarding process got better. Th...]]></description><link>https://blog.csvpath.org/the-bermuda-triangle-of-data</link><guid isPermaLink="true">https://blog.csvpath.org/the-bermuda-triangle-of-data</guid><category><![CDATA[Databases]]></category><category><![CDATA[data ingestion]]></category><category><![CDATA[dataops]]></category><category><![CDATA[Data-lake]]></category><category><![CDATA[Data warehouse]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[#data preboarding]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Sun, 28 Sep 2025 23:19:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759101118028/566dd0d7-db55-4772-969b-eae57cc6577d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let me tell you about a data ingestion problem that was incurred due to a faulty preboarding process. I'm changing a few details but this is basically how it happened. Ultimately the teams got through it. Their data preboarding process got better. They lived to ingest another day.</p>
<h2 id="heading-the-company">The company</h2>
<p>The company was an information services provider. A step up from a mere data broker, instead of selling raw data, they processed it into information and provided actions and analytics. The architecture was that of a vertical search engine. A vertically integrated mini-Google focused on one industry. They gathered raw data, processed it using bespoke NLP and ML, loaded it into a search engine of their own design, and provided rules- and AI-based insights on the search results. Like RAG, but with an inverted index rather than a vector DB. Cool stuff.</p>
<p>And it was definitely a garbage-in-garbage-out situation.</p>
<h2 id="heading-the-business">The business</h2>
<p>Let's call the company ShipmentInsights. They are without question the market leader in their niche.</p>
<p>ShipmentInsights’s customers accessed data and actionable insights about the shipping world through a search portal. A customer seeking an advantage in their market would assess rival shipping companies or insurance providers or shipping consumers to find arbitrage and patterns of activity that pointed to missed profitable opportunities. AI-based pattern-matching in a unique data set was the secret sauce that let ShipmentInsights find market imperfections.</p>
<p>Needless to say, selling a solution to imperfection raises the bar on your own internal processes.</p>
<h2 id="heading-the-process">The process</h2>
<p>ShipmentInsights collected data in bulk. They received import/export data from ports, shipping companies, industry news outlets, catalogs and marketplaces, and brokers of various shipping-related goods and services. When the data arrived it would be parsed by the company's NLP, annotated, stored in multiple states of analysis.</p>
<p>Most information came in monthly files, some came weekly. It was rare for the data to come via API and the preference was for batches. Most data arrived as CSV files. Some as Excel. And a few sources used XML. All the data was stored in a data lake. Then through a many-step process it was upgraded and converted to tabular form and loaded into a data warehouse. Much of the upgrading process was manual, and quality control was likewise manual. The process required teams on three continents.</p>
<h2 id="heading-whoops-wheres-my-data">Whoops, where's my data?</h2>
<p>The problem presented itself in the form of a puzzled and worried call to ShipmentInsights support. The customer was one of the big fashion houses, let's call them Couture Du Sol. Couture had seen a spike in their shipping costs that didn't make sense. It was coupled with routing delays that put their fall collection shipment deadlines in question. That the resulting problem was based on ShipmentInsights's data was easy to see and hard to argue with. The data was wrong.</p>
<p>Unfortunately for ShipmentInsights, Couture Du Sol was their largest customer, representing 4% of revenues. They wanted answers. And they wanted to know the details. How else could they trust the data going forward?</p>
<p>As you can imagine, multiple teams jumped into gear.</p>
<h2 id="heading-almost-three-weeks-later">Almost three weeks later</h2>
<p>It took a lot of digging. Not only were there in excess of 500 data feeds, there were three distinct spheres of data operations: US, EMEA, and Asia, each with their own DataOps and Operations teams. Each was distinctly different. The data flow diagrams were spaghetti. ShipmentInsights was not a young company. These systems had been manufacturing money for two decades.</p>
<p>Each data source was a partner, a vendor, or a passive public source harvested by ShipmentInsights systems. And each one had unique elements in its data flow, even though many feeds had much in common with one another. Each data file landed in a landing zone that was shared with other feeds. These landing zones were a bit crufty, and the metadata they produced was inconsistent. More importantly, each region had its own validation and upgrading strategies.</p>
<p>Data flows were worked on by human quality checkers, labelers, and cross referencers. The manual work was done in bespoke tools that varied region to region. Some data was keyed. That keying and other low-skill manual work was farmed out to a team in the lowest-cost geo the company could find, but at the expense of more coordination costs and process complexity.</p>
<h2 id="heading-long-story-short-the-data-i-presume">Long story short, The Data, I presume</h2>
<p>Like Livingstone, the data was eventually and conclusively found. During the process of understanding the brittle points in how data was brought into ShipmentInsights, it became clear that multiple days' data had gone missing. At first it was thought to have been lost in the Indian Ocean. Ultimately, though, it became clear that the missing data went off radar over the Atlantic.</p>
<p>Critical parts of four non-contiguous days of production data updates had not been loaded. Frustratingly, the missing days were spread over three years prior to the complaint and had rippled forward in a cascade that happened to snare the largest and most demanding customer. Ain't that always the way?</p>
<h2 id="heading-how-it-happened">How it happened</h2>
<p>The US team's data arrived by SFTP on an MFT system. It mostly came by pull, in some cases push, and in a small minority by an internal process that itself pushed data to the MFT system. The files landed in a single landing zone that was bucketed by date and source. Many of the files were sent to the lower cost team for basic clean up. Those files arrived back in the landing zone, again over SFTP, in a slightly different place. A somewhat higher value processing workflow took over from that point. That workflow resulted in updated files and those files were loaded into a staging area of the data warehouse. From there the data was shared to EMEA and Asia. In the case of EMEA, the sharing happened as an export shipped over a message queue.</p>
<p>The message queue was approximately over the Bermuda Triangle. Data went missing.</p>
<h2 id="heading-why-did-it-take-so-long-to-come-to-light">Why did it take so long to come to light?</h2>
<p>The biggest problem wasn’t that the preboarding was convoluted and manual. It turned out to be mainly a problem of data identity and manifests.</p>
<ul>
<li><p>The source data lacked clear identity and consistent metadata</p>
</li>
<li><p>The data was mutable and versions were poorly tracked</p>
</li>
<li><p>Data published internally was not cataloged in a way that was easy to monitor and cross check</p>
</li>
</ul>
<p>In short, the DataOps team publishing data to EMEA was acting like a rough data aggregator but was treated by EMEA as a trusted publisher. Nobody questioned their assumptions, but even if they had, what would they have checked to make sure EMEA got the goods advertised?</p>
<p>On top of that, yes, the preboarding was also often manual and convoluted. I.e. expensive and risky.</p>
<h2 id="heading-how-would-better-preboarding-help">How would better preboarding help?</h2>
<p>ShipmentInsights clearly had a preboarding process. In fact, more than three of them. Regrettably, all imperfect.</p>
<p>Better preboarding would have helped ShipmentInsights with their specific ingestion problem in several ways:</p>
<ul>
<li><p>Data would be captured from incoming file feeds in a consistent way, across the board</p>
</li>
<li><p>Each file, and each version of each file, would be identified clearly with a durable ID that carried downstream</p>
</li>
<li><p>Data would be managed immutably, so every data file is always exactly what you expect it to be</p>
</li>
<li><p>Hand-offs between geos would be handled consistently, much the same as hand-overs from external data partners</p>
</li>
<li><p>The known-good raw data would be presented with its lineage in a permanent published data archive for easy cross-checking</p>
</li>
</ul>
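<p>As a thought experiment, the first three bullets can be sketched in a few lines of Python. This is a minimal illustration, not CsvPath Framework's actual API: the function name, registry shape, and manifest fields are all hypothetical. The idea is simply that a content fingerprint gives each file version a durable ID, and an append-only registry makes versions immutable and traceable.</p>

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def register_file(path: Path, source: str, registry: dict) -> dict:
    """Fingerprint an arriving file and append an immutable manifest entry.

    The SHA-256 fingerprint doubles as a durable ID that downstream
    hops (other geos, ETL, the data lake) can carry with the data.
    """
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    versions = registry.setdefault((source, path.name), [])
    entry = {
        "fingerprint": digest,
        "source": source,
        "filename": path.name,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "version": len(versions) + 1,
    }
    versions.append(entry)  # append-only: earlier versions are never mutated
    return entry
```

<p>With something like this in place, "did EMEA get the goods advertised?" becomes a cheap question: compare fingerprints on both sides of the message queue instead of digging for three weeks.</p>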
<p>These improvements to ShipmentInsights's preboarding process would have simplified and clarified the data flow. Ideally to the point that Couture Du Sol would not have had the problem they did. Moreover, were such a problem to surface, a consistent and well-designed preboarding architecture would be quick to review, not a two-to-three-week slog by many people looking under rocks for unknown problems in an unfamiliar process.</p>
<p>DataOps teams spend up to 50% of their time firefighting. This thumbnail sketch of one preboarding disaster helps explain why.</p>
<h2 id="heading-hooray-problem-solved">Hooray, problem solved</h2>
<p>With better data preboarding, ShipmentInsights could have protected that 4% of revenue. They could have scaled down their ongoing investment in error-prone manual processing. And their data could have had a shorter and more agile path to market that would have allowed more hands to do more high-value product dev and customer solutions work. All without customers complaining.</p>
<p>Sounds nice, right? ShipmentInsights thought so too. They put building a new preboarding process out to bid. Cognizant, EPAM, IBM and others made proposals. At the time it didn't go anywhere. The reason? Back then, nobody knew what good looked like. Everyone ShipmentInsights asked agreed finding out what good looked like would cost the sun, the moon, and the stars. That slowed things down, for sure.</p>
<p>Today we no longer have that problem.</p>
<p><a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> is what good data preboarding looks like. It is the preboarding process that should have protected Couture Du Sol and made ShipmentInsights' business markedly more profitable. And <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a>, along with its <a target="_blank" href="https://www.flightpathdata.com">FlightPath automation server</a>, is open source. You can get the benefits of a purpose-built preboarding architecture at any scale with no licensing overhead to weigh you down.</p>
<p>Take a look at <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> and compare your challenges to what I've described. Think about the possibilities of a more robust ingestion using a solid data preboarding approach. And stop shipping data through the Bermuda Triangle.</p>
]]></content:encoded></item><item><title><![CDATA[🧨 And Now, Data Preboarding Disasters]]></title><description><![CDATA[We all know bringing data into the enterprise from data partners that you don't control is hard. Monday mornings you yearn for green lights across the pipeline status board. Sometimes you get them. On those other days, how about a hot mug of it-all-h...]]></description><link>https://blog.csvpath.org/and-now-data-preboarding-disasters</link><guid isPermaLink="true">https://blog.csvpath.org/and-now-data-preboarding-disasters</guid><category><![CDATA[whoops where did my data go?]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data analytics]]></category><category><![CDATA[data]]></category><category><![CDATA[#data preboarding]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Thu, 25 Sep 2025 15:12:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1758812803930/2a0f1784-ca7e-48d3-9609-e12c8e97c312.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We all know bringing data into the enterprise from data partners that you don't control is hard. Monday mornings you yearn for green lights across the pipeline status board. Sometimes you get them. On those other days, how about a hot mug of it-all-happens-to-everyone-else-too?</p>
<p>Over the next few weeks I'm going to tell real, though slightly anonymized, stories of data partnerships in jeopardy, data operations on fire, and spend control out the window. Basically, just plain DataOps breaking bad in the absence of solid data preboarding. Why is this helpful? Well, <em>schadenfreude</em>, a bit. But mainly to show how you're not in it alone. And that better data preboarding can make a difference in making the ouch stop.</p>
<p>These thumbnail stories are all true, unexaggerated, and accurately told, though names, data domains, and products have been changed to protect confidentiality. They are not intended to push the envelope -- bigger disasters for sure happen!</p>
<p>These modest problem cases just happen to be handy and photogenic. They are a few of the reasons <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a>, its implementation of the <a target="_blank" href="https://blog.csvpath.org/collect-store-validate-publish?source=more_series_bottom_blogs">Collect Store Validate Publish architecture</a>, and <a target="_blank" href="https://www.flightpathdata.com">FlightPath</a> were born. I hope you will see how the stories made the tools what they are and the tools address the needs the stories illustrate.</p>
<p>And I hope you enjoy the tales. And I really hope your Monday gets better! ☕</p>
<p><strong><em>Look for these blockbusters coming soon!</em></strong></p>
<ul>
<li><p><a target="_blank" href="https://blog.csvpath.org/the-bermuda-triangle-of-data">The Bermuda Triangle Of Data</a></p>
</li>
<li><p>Oh crap, we have to run it again... and again...</p>
</li>
<li><p>Ouch, the number 11 is not in fact the same as 1...</p>
</li>
<li><p>What happens when a 45-day payment window falls on your hand...</p>
</li>
<li><p>How to use very expensive SMEs, badly. A.k.a. why people shouldn't check the work of computers...</p>
</li>
<li><p>What happens when the only guy-who-knows goes and becomes COO...</p>
</li>
<li><p>How to make enemies and influence external people using files...</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Are These Activities On Your Data Arrival and Preboarding Map?]]></title><description><![CDATA[Structured data file feeds are simple in concept. Dig below the surface to actually making it happen, though, and you see a complicated set of activities that must be orchestrated correctly for the first stage of ingestion to work reliably. Many of y...]]></description><link>https://blog.csvpath.org/are-these-activities-on-your-data-arrival-and-preboarding-map</link><guid isPermaLink="true">https://blog.csvpath.org/are-these-activities-on-your-data-arrival-and-preboarding-map</guid><category><![CDATA[Databases]]></category><category><![CDATA[data]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[data management]]></category><category><![CDATA[ingestion]]></category><category><![CDATA[Data-lake]]></category><category><![CDATA[Managed File Transfer (MFT) Market]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Thu, 18 Sep 2025 14:32:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1758205668874/9a6c9ad6-38e5-4197-8a48-b7ebcab4c2c1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Structured data file feeds are simple in concept. Dig below the surface to actually make it happen, though, and you see a complicated set of activities that must be orchestrated correctly for the first stage of ingestion to work reliably. Many of you know that, of course. Still, at the small and large scale ends of things it is easy to forget the whole chain. Large company operations are often so specialized that individuals become insulated from activities they don't directly participate in. They become arborists, not land managers. And small companies often merge steps, edit out activities, or otherwise lighten the load wherever possible.</p>
<h2 id="heading-stepping-back-to-see-the-complete-big-picture-can-be-a-help">Stepping back to see the <em>complete</em> big picture can be a help</h2>
<p>What I'm trying to do here is simply catalog the activities. Breaking down each one is a job for follow-up posts. Because we are focused on <a target="_blank" href="https://blog.csvpath.org/collect-store-validate-publish">preboarding</a> here, and <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> in particular, I'll indicate which steps can be (better) addressed by adding an explicit and methodical preboarding stage to ingestion. That obviously isn't the whole list, but preboarding covers some activities completely, and assists in making more of them move smoothly.</p>
<p>The first stages of data ingestion also have an exit point. The data has to go somewhere. With a focus on data preboarding, your exit is often into the data lake or ETL staging area, though other possibilities exist. In the case of a data lake, the area containing raw data -- bronze, if you like -- may act as the storage layer for preboarding, or it may be where preboarded, trustworthy <em>"ideal form"</em> raw data is transferred to. Either way, that is also a topic for other follow-on discussion. It is also reasonable to say the data hasn't been fully ingested until it's in the application(s) or analytics system(s). That's fair, of course; different roles have different processes, or different parts of the larger process.</p>
<p>Last (for now), but not least (not remotely least!) there is the financial impact of the MFT and preboarding stage of ingestion. All these activities require expensive time, attention, and technology. And they all embody risk in terms of liability, SLA metric consequences, hard-to-value-but-valuable reputation hits, and excessive cost-of-doing-business losses. The scale of data file feed value and risk can occasionally be eye-opening, even to us who have long been around it. Definitely a topic to explore further.</p>
<h2 id="heading-the-preflight-checklist-so-to-speak">The preflight checklist, so to speak</h2>
<p>So, without further ado, here is a bulleted list of ingestion activities. It runs from MFT arrival to preboarding acceptance to availability downstream. No doubt I've missed or mashed together many things. You may think I'm making a mountain out of a molehill or a molehill out of nothing. Please send me your edits and suggestions!</p>
<h3 id="heading-customer-onboarding">Customer onboarding *</h3>
<ul>
<li><p>Credentials exchange</p>
</li>
<li><p>Configuration of customer-&gt;MFT (<strong>file/data formats</strong>, <strong>paths</strong>, <strong>naming</strong>, schedule, protocol, error handling, whitelisting, testing)</p>
</li>
<li><p>MFT system configuration (infrastructure capacity, <strong>events and triggers</strong>, account setup)</p>
</li>
<li><p>Observability configuration (<strong>alerts config</strong>, <strong>dashboard create/edit</strong>)</p>
</li>
<li><p>Configuration of MFT-&gt;DataOps/biz ops teams (<strong>archiving</strong>, <strong>integration scripting/config</strong>, <strong>replay process create/edit</strong>, <strong>testing</strong>)</p>
</li>
<li><p>Documentation and <strong>metadata</strong> update</p>
</li>
</ul>
<h3 id="heading-operations">Operations</h3>
<ul>
<li><p>Timeliness config</p>
</li>
<li><p><strong>Registration (data’s birthday, social security number, family name, street address)</strong></p>
</li>
<li><p><strong>Conformance checks (readability, size, encoding, canonical forms, datasets expected, attribution, etc.)</strong></p>
</li>
<li><p>File handling (backups, rotation, <strong>versioning</strong>, <strong>retention</strong>)</p>
</li>
<li><p><strong>Forwarding (workflow steps, notification)</strong></p>
</li>
</ul>
<h3 id="heading-data-acceptance">Data acceptance</h3>
<ul>
<li><p><strong>SME review</strong></p>
</li>
<li><p><strong>Data validation and quality management</strong></p>
</li>
<li><p>Customer change negotiations</p>
</li>
<li><p><strong>Data mastering</strong></p>
</li>
<li><p><strong>Internal data publishing</strong></p>
</li>
</ul>
<h3 id="heading-configuration-update">Configuration update</h3>
<ul>
<li><em>Review and reset on essentially any of the above</em></li>
</ul>
<h3 id="heading-forensics">Forensics</h3>
<ul>
<li><p><strong>Arrival how and when (provenance, arrival metrics, point-in-time MFT config review)</strong></p>
</li>
<li><p><strong>Data statistics at registration</strong></p>
</li>
<li><p><strong>Change management (lineage tracing, change data capture, point-in-time script review)</strong></p>
</li>
<li><p><strong>Chain of custody (user access tracking, workflow/transfers, permissions/credentials review)</strong></p>
</li>
<li><p><strong>Business rules review</strong></p>
</li>
<li><p>Testing review (<strong>data testing</strong>, config testing, <strong>workflow testing</strong>)</p>
</li>
</ul>
<p>Right, then — that’s your 30,000-foot view. A map for future exploration. What is missing? Discuss! And happy preboarding!</p>
<p>⦿ <em>Bold items are part of</em> <a target="_blank" href="https://www.csvpath.org"><em>CsvPath Framework</em></a> <em>or</em> <a target="_blank" href="https://www.flightpathdata.com"><em>FlightPath Server</em></a><em>’s preboarding remit. Many can be completely handled in the Framework; for others, CsvPath is just one piece of the puzzle.</em></p>
]]></content:encoded></item><item><title><![CDATA[Data lineage?  I don't think that means what you think...]]></title><description><![CDATA[Data governance is not only about controlling the storage and use of data. It is also about managing and assessing historic metadata about that data. Data governance is of second-order importance for most people. When I attempt to communicate the imp...]]></description><link>https://blog.csvpath.org/data-lineage-i-dont-think-that-means-what-you-think</link><guid isPermaLink="true">https://blog.csvpath.org/data-lineage-i-dont-think-that-means-what-you-think</guid><category><![CDATA[#data preboarding]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data-governance]]></category><category><![CDATA[Databases]]></category><category><![CDATA[data]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Mon, 15 Sep 2025 16:30:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1757952793847/89b18c2f-8bfb-46ed-8322-f5c6367c957e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data governance is not only about controlling the storage and use of data. It is also about managing and assessing historic metadata about that data. Data governance is of second-order importance for most people. When I attempt to communicate the importance of second-order concerns I often end up talking quickly, avoiding long words, and using analogies. Analogies are excellent!</p>
<p>Three of the big data governance analogies are:</p>
<ul>
<li><p>Chain of custody</p>
</li>
<li><p>Provenance</p>
</li>
<li><p>Lineage</p>
</li>
</ul>
<p>You hear them bandied about a lot, lineage most of all. In my experience, they are used more or less interchangeably most of the time. But words have meaning and analogies are relatively specific. Hopefully we don't have to ask what lineage is, in concept, or it would not be a useful analogy. If an analogy is valuable, it can be used specifically. If it is used specifically, it has more value.</p>
<h1 id="heading-maybe-define-our-terms">Maybe Define Our Terms</h1>
<p>Let's take a quick minute for a high-level definition of these key data governance terms. What are we talking about here?</p>
<h2 id="heading-chain-of-custody">Chain Of Custody</h2>
<p>Chain of custody comes out of the legal frame of reference. It means tracking who had a thing over time, and who had access to it. If the thing is a murder weapon, say the candlestick, it is important to know it was found in the library by Inspector Clouseau, was bagged and tagged by him at 11 pm, and he entered it into evidence at 7 am down at the station.</p>
<p>If the thing in question is murderously bad data, we want to know that it arrived at the MFT at 1 am from the vendor, was preboarded by <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> at 3 am, and was loaded by ETL into the data lake at 4 am. We also want to know who at the vendor sent the data, who had access to the MFT server's configuration, who wrote the CsvPath scripts, who designed the ETL process, and who had access to the bronze area of the data lake where the data landed. <em>At its most basic, data chain of custody is the data-flow diagram annotated with access control and a log</em>.</p>
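<p>That "data-flow diagram annotated with access control and a log" can be sketched as an append-only custody log. The class and field names below are illustrative only, not part of any real tool's API; the point is that each hand-off records who, when, what, and where, and the full holder list falls out for free.</p>

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class CustodyEvent:
    when: str      # ISO-8601 timestamp of the hand-off
    actor: str     # who held or touched the data
    action: str    # e.g. "received", "preboarded", "loaded"
    location: str  # system or zone where the data sat

@dataclass
class ChainOfCustody:
    item: str
    events: list = field(default_factory=list)

    def record(self, when: str, actor: str, action: str, location: str) -> None:
        # events are only ever appended, never edited: that's the "chain"
        self.events.append(CustodyEvent(when, actor, action, location))

    def holders(self) -> list:
        """Everyone who ever had the data, in order of first contact."""
        seen = []
        for e in self.events:
            if e.actor not in seen:
                seen.append(e.actor)
        return seen
```

<p>For the candlestick, the actors would be Clouseau and the evidence clerk; for the 1 am file, the vendor, the preboarding step, and the ETL job.</p>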
<h2 id="heading-provenance">Provenance</h2>
<p>I think of the concept of provenance as coming first from the art world. For a piece of art to have value it has to have two things: an innate attractiveness or relevance and a known act of creation. Likewise, for data to have value, it must be useful or interesting and have a known source. Unlike most statues, data moves and is often agglomerated from multiple sources in its earliest days. That means provenance is also implicitly about the assembling of a set of data.</p>
<p>Who first collected and assembled the data tells us if the source was reliable. As we track further (dis)assembly of the set over time we can assess all the hands that touch it, and by extension our knowledge of their capabilities and biases. We can, for example, trust econometrics data assembled from the official records on <a target="_blank" href="https://data.gov">data.gov</a> and from well-known NGOs. Our trust in econometrics data assembled from the official blogs of Mickey Mouse, Marvin the Martian, and Wile E. Coyote is much lower.</p>
<h2 id="heading-lineage">Lineage</h2>
<p>Lineage is a term of art in the world of genealogy. Exploring ancestry tells us how families change over long time frames as they do things, have things done to them, and incorporate new individuals. Every generation of a family can be seen as a dataset. Not necessarily true or false, but clearly related to, and distinct from, its precedents and progeny.</p>
<p>At each step in the lineage we can see not only the gene pool changing, but also the societal influences, and geographic impact. Likewise with data. Each time a dataset changes, in each system it passes through, we can see individual fields added and removed, schemas applied, conformance transformations made, restatements, etc., etc. Each derived dataset is a new generation. As with chain of custody and provenance metadata, the high-level goal of lineage tracking is assigning a level of trust at a point in time -- and the possibility for remediation. But clearly lineage is not just another word for provenance or chain of custody.</p>
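<p>The "each derived dataset is a new generation" idea can be made concrete with a tiny parent-linked record. This is a hand-rolled illustration, not OpenLineage or any other real lineage API; the class and field names are hypothetical. Each generation names the transformation that produced it and points at its parent, so ancestry is just a walk back to the raw original.</p>

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetGeneration:
    name: str
    transformation: str  # what produced this generation
    parent: Optional["DatasetGeneration"] = None

    def ancestry(self) -> list:
        """Walk from this generation back to the original raw dataset."""
        node, line = self, []
        while node is not None:
            line.append((node.name, node.transformation))
            node = node.parent
        return line

# three generations of one dataset, newest last
raw = DatasetGeneration("orders_raw", "received from vendor")
clean = DatasetGeneration("orders_clean", "conformance transforms", parent=raw)
mastered = DatasetGeneration("orders_mastered", "joined to product master", parent=clean)
```

<p>Note what this record does not capture: who held the data (chain of custody) or who originally made it trustworthy (provenance). That gap is exactly why the three analogies are not interchangeable.</p>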
<h1 id="heading-no-one-concept-applies">No One Concept Applies</h1>
<p>In governing data at the edge or in the moment or over the lifecycle, all three of these concepts apply. We cannot equate lineage with chain of custody or provenance with lineage without losing important concepts. If the analogy has any meaning, it is a specific meaning. And with data, proper management requires us to address all these issues. Without clarity of provenance, lineage, and chain of custody we cannot fully trust our data and its impact on our commercial or collective actions.</p>
<p>We get the provenance, lineage, and chain of custody information we need by carefully tracking how data moves through our systems, using tools like, for instance, <a target="_blank" href="https://openlineage.io/">OpenLineage</a>. In the moment, at the time we design a data flow, data lifecycle, and data storage and transformations, there is a lot we can forget to build in. Will change data be captured? How does data pass through the edge into the organization? Who looked at what data when? Were all the items of data assembled from equally trustworthy sources? And so on. Having a set of analogies to tick off is a helpful mnemonic that makes sure we cover our bases.</p>
<p>Helpful as long as we keep them straight.</p>
]]></content:encoded></item><item><title><![CDATA[Ingestion, ETL, onboarding, and preboarding]]></title><description><![CDATA[How many data partners do you work with day-to-day, week-to-week? Most companies exchange bulk data with more parties then they think. Payroll, orders, marketing automation, regulatory filings, inventory, web traffic and many more activities all have...]]></description><link>https://blog.csvpath.org/data-preboarding-ingestion-etl-and-onboarding</link><guid isPermaLink="true">https://blog.csvpath.org/data-preboarding-ingestion-etl-and-onboarding</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[System Architecture]]></category><category><![CDATA[Data-lake]]></category><category><![CDATA[data-warehousing]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Wed, 10 Sep 2025 20:44:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1757536557487/b1e241aa-8670-422e-9db9-4b28aecd5660.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>How many data partners do you work with day-to-day, week-to-week? Most companies exchange bulk data with more parties than they think. Payroll, orders, marketing automation, regulatory filings, inventory, web traffic and many more activities all have the potential to periodically or regularly require data exchange. Information services companies, science organizations, service bureaus, and managed services partners, so much the more so.</p>
<p>Every time we want to make one of these data exchanges happen we have to pick an approach and a standard. Not infrequently, the approach is automated file transfer and the standard is CSV or Excel over HTTPS. The next topic is how do we get this data into our application, analytics, or AI? Get it right and business hums along happily, and evenings and weekends, well, stay evenings and weekends. Get it wrong and there’s a strong potential for all hell to break loose.</p>
<h2 id="heading-onboarding-as-an-end-state">Onboarding as an end state</h2>
<p>The purpose of collecting data you didn’t make is to perform transactions, make decisions, or sell it. To oversimplify, that all happens after the data is onboarded into an application, analytics tool, or AI. From that point of view, onboarding is basically an end state in user space. At the point data has been onboarded it is usable and positioned for use.</p>
<p>The data’s journey to get to that onboarding is, of course, much longer than that last hop. It has to be ingested to a raw ready state and prepared for use. Typically that takes a bit of effort. Those two steps are data preboarding and ETL, or assembly (to reach for a slightly more general term).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757530674028/da70e702-f8d4-4488-8fc6-d5cab9120ac0.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-the-problem-is-impatience">The problem is… impatience</h2>
<p>Three steps between you and anything is two steps too many, am I right? However, in the case of inbound data, impatience is a killer. Too many times we see data entering the organization essentially at the assembly step. I.e., dropped right in the data lake or immediately ETLed somewhere. In organizations that have the time and talent to create their own applications, we see cases where data comes in and is immediately taken by an application without either of the prior two steps. Both of these shortcuts are ultimately problematic.</p>
<p>When an application pulls in data before the data passes through an assembly step there are a few possible problems. One is that the information becomes proprietary to that application, making its assembly into any other context more of a project. Another problem is that the opportunity for unified data governance is lost.</p>
<p>However, the biggest problem when data drops unconsidered into the application or the assembly stage is that the benefits and guarantees of preboarding are lost. Do you know exactly when you received the data, in what version, from which source through what channel, and with what corrections? Did it validate? Did it load cleanly? How many times did the job run and who ran it? Where is it stored for posterity?</p>
<p>Now, you can do preboarding many ways. And it can be said that if received data is accepted for loading to anywhere, then it has passed preboarding. By that measure, if a file is ETLed successfully into a database you could say it has been preboarded as well as loaded. But has it been, really? You can put milk in a bottle and sell it, but that doesn’t necessarily mean it has been pasteurized.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757536883514/63238c3e-1bf3-4752-90a9-dfd96dafbbd0.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-the-value-of-data-preboarding">The value of data preboarding</h2>
<p>Preboarding is the process of taking raw unknown data and turning it into stable, well-known, and trustworthy raw data. Is that just make-work? Ask anyone who has lost track of a version of a file worth tens of thousands of dollars. Or who missed a filing deadline because it wasn’t clear which files came from what source on what day. Your AI may be happy to wait all day for you to finish a conversation about the employment numbers, but you’re not going to be happy if it quotes you the unrevised numbers from Q1 in Q3. Even less if they combine Paris, TX with Paris, France. Uncertainty has consequences. And these things happen a lot.</p>
<p>In bad cases, up to 50% of a combined data engineering and business operations team’s time may be lost to firefighting data problems that get into the assembly stage or that are onboarded into applications, analytics, or AI. A 10+% firefighting load on every million dollars in data feed-tied revenue is, in many companies, seen as perfectly normal. And those are the average, or even above average, companies. That 10% to 50% comes directly out of profits and often grows linearly, if not managed down.</p>
<p>The answer is straightforward, don’t be impatient, be methodical. Progressing inbound data step-by-step from preboarding to assembly to onboarding may feel slower, but, as the SEALs say, slow is smooth and smooth is fast. (Just agree — you don’t want to mess with SEALs). And remember, zoomed out enough, setup is essentially a 1x, operations is an Nx.</p>
<p>Taking a methodical approach to ingesting data is also not expensive by definition. There are always options for spending a ton of money and time on anything. But good open source tools exist (you’re reading the blog of <a target="_blank" href="https://www.csvpath.org">one of them</a>!) and the goal of preboarding is so cut, dried, and linear that the right answer is hard to miss. If you choose <a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> for your tabular files-based preboarding, the architectural pondering you’ll do approaches zero because the Framework was built to do exactly what you need. Add the <a target="_blank" href="https://www.flightpathdata.com">FlightPath Data</a> frontend and it’s even easier.</p>
<p>Hopefully, after all is said and done, your team is yearning for a more methodical and less manual approach. Preboarding gives you one. They will for sure appreciate less heroic firefighting. And your data will certainly thank you.</p>
]]></content:encoded></item><item><title><![CDATA[Well-formed, Valid, Canonical, and Correct]]></title><description><![CDATA[The world of data is massively multi-dimensional. One of the most important dimensions is data validation. Without validation, you got nothing. Sometimes less than nothing. But for as much as how central validation is, how we talk about it is often l...]]></description><link>https://blog.csvpath.org/well-formed-valid-canonical-and-correct</link><guid isPermaLink="true">https://blog.csvpath.org/well-formed-valid-canonical-and-correct</guid><category><![CDATA[Validation]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data structures]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[schema]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Wed, 10 Sep 2025 17:16:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1757525814660/69ea1571-0e71-44e1-b20e-d8f1b15ebe60.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The world of data is massively multi-dimensional. One of the most important dimensions is data validation. Without validation, you got nothing. Sometimes <a target="_blank" href="https://spectrum.ieee.org/why-the-mars-probe-went-off-course">less than nothing</a>. But for as central as validation is, how we talk about it is often loose and limited.</p>
<p>This post defines a few terms relating to validation. Not infrequently we just skate past these concepts. We're talking — or not — about levels of data acceptance. How good do we feel about an item of data or a data set as a whole?</p>
<p>While acceptance is ultimately a boolean, the world of data isn't just black and white. Garbage-in, garbage-out is definitely a thing. But one person's trash may be another person's treasure. And your grass-fed raw data may need to be cooked before I can eat it. The terms in the title build on one another. Each one is an f-stop in the vision for what good looks like.</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxyvipzwg3eg96wts2qcq.jpeg"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxyvipzwg3eg96wts2qcq.jpeg" alt=" " /></a></p>
<h1 id="heading-how-acceptable-is-this">How acceptable is this?</h1>
<p>The terms in question are <strong>well-formed</strong>, <strong>valid</strong>, <strong>canonical</strong>, and <strong>correct</strong>. I list them in order from a data consumer's perspective — least specified to most. Why is their relationship important? Generally because data goes through stages, from acquisition to preboarding to ETL to enrichment and mastering to production end uses. If we can't speak about levels of quality in acceptance terms at each step, how could we know when to progress data to the next stage?</p>
<p>This progression is a practical matter for tool builders as well. How does MFT (managed file transfer) know when to progress data from arrival to preboarding? How does preboarding progress to onboarding and workflow tools? What are the stages of the medallion data lake, specifically?</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy33dxe6355rhi8p3x02e.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy33dxe6355rhi8p3x02e.png" alt=" " /></a></p>
<p>Above all, do we know what good looks like? Are we moving too fast? How do we know when we're done? Are we there yet?</p>
<h2 id="heading-well-formededness"><strong>Well-formedness</strong></h2>
<p>Data that is well-formed first and foremost matches a physical specification, and, secondly, has the correct "outline" to be an item of data of the form expected. The specifications are standards like:</p>
<p><a target="_blank" href="https://www.w3.org/TR/xml/">XML</a><br /><a target="_blank" href="https://www.json.org/json-en.html">JSON</a><br /><a target="_blank" href="https://www.w3.org/TR/2011/WD-html5-20110405/">HTML</a></p>
<p>Well-formedness also relies on lower-level definitions such as Unicode and byte ordering. Without detailed agreements on what constitutes <em>minimally viable</em> raw data, the world quickly breaks down through an inability to communicate.</p>
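<p>A well-formedness gate can be sketched with nothing but Python's standard library. This is an illustrative example, not part of CsvPath; the function names are made up:</p>

```python
import csv
import io
import json

def is_well_formed_json(text: str) -> bool:
    """True if the text parses as JSON at all; no schema is involved."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def is_well_formed_csv(text: str, expected_fields: int) -> bool:
    """True if every row parses and has the expected number of fields."""
    try:
        rows = list(csv.reader(io.StringIO(text)))
    except csv.Error:
        return False
    return bool(rows) and all(len(row) == expected_fields for row in rows)
```

<p>Note that these checks say nothing about whether the values make sense. That is the job of the higher acceptance levels.</p>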
<h2 id="heading-valid"><strong>Valid</strong></h2>
<p>The next level up from well-formed is validity. Validity is a more robust stage, in that if data is valid, it is probably useful for something.</p>
<p>Files that are valid contain data that has been checked against a definition of what good data looks like. Data can be validated using rules or models. Well-known examples include:</p>
<ul>
<li><p><a target="_blank" href="https://www.w3.org/TR/xmlschema11-1/">XSD</a></p>
</li>
<li><p><a target="_blank" href="https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=902989">Schematron</a></p>
</li>
<li><p><a target="_blank" href="https://www.iso.org/standard/92325.html">DDL</a> (apologies for the paywall; Google can find you more references)</p>
</li>
<li><p><a target="_blank" href="https://x12.org/products/transaction-sets">X12</a> (ditto!)</p>
</li>
<li><p><a target="_blank" href="https://json-schema.org/">JSONSchema</a></p>
</li>
<li><p><a target="_blank" href="https://www.csvpath.org/">CsvPath</a></p>
</li>
</ul>
<p>Some of us love these specs, despite their dryness. Each has its own strengths and coolnesses. An XSD is primarily a model. A Schematron file is principally rules. In fact, a model is a short-hand and generalized way of writing rules. And, in this context, a set of rules is just a classification. But in practice it's simple: an item of data that doesn't match its schema is considered invalid.</p>
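<p>To make the model-versus-rules distinction concrete, here is a small sketch in Python. Both the schema and the rules are hypothetical, invented for illustration; real schema languages like XSD or JSONSchema are far richer:</p>

```python
import re

# Model style: a declarative description of each field (hypothetical schema)
SCHEMA = {
    "company": {"type": str, "required": True},
    "founded": {"type": int, "required": False},
}

def valid_against_model(record: dict) -> bool:
    """A record is valid if it matches the field-by-field description."""
    for field, spec in SCHEMA.items():
        if field not in record:
            if spec["required"]:
                return False
            continue
        if not isinstance(record[field], spec["type"]):
            return False
    return True

# Rule style: each rule is a predicate; a record is valid if all rules pass
RULES = [
    lambda r: bool(r.get("company", "").strip()),                       # non-empty name
    lambda r: re.fullmatch(r"\d{4}", str(r.get("founded", "1900"))) is not None,
]

def valid_against_rules(record: dict) -> bool:
    return all(rule(record) for rule in RULES)
```

<p>Note how the model reads as a description of the data while the rules read as predicates, yet both ultimately classify a record as valid or invalid.</p>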
<h2 id="heading-canonical"><strong>Canonical</strong></h2>
<p>A canonical form is the form that is preferred over other possible forms of the same data. A simple example is the term IBM. Its canonical form may be IBM. It may also be seen as I.B.M. or International Business Machines. If we are canonicalizing data using this mapping to IBM and we see I.B.M. we substitute the canonical form. Note that if there are multiple accepted forms the canonical form is any of them, given the right time, place and/or bounded context. Canonicalization is closely related to data mastering.</p>
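<p>In code, simple canonicalization is often just a lookup over known variants. A hypothetical sketch:</p>

```python
# Hypothetical synonym table mapping known variants to the preferred form
CANONICAL_NAMES = {
    "I.B.M.": "IBM",
    "International Business Machines": "IBM",
    "IBM": "IBM",
}

def canonicalize(name: str) -> str:
    """Return the canonical form if the variant is known, else pass through."""
    return CANONICAL_NAMES.get(name.strip(), name.strip())
```

<p>Unknown values pass through unchanged here; deciding what to do with them is where canonicalization shades into data mastering.</p>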
<h2 id="heading-correct"><strong>Correct</strong></h2>
<p>Correct data is more than well-formed + valid + canonicalized. Correct means that the semantic and business rule content of the data meets expectations. For example, imagine a CSV file that includes a list of companies. Each company has an area of commercial activity. We see that:</p>
<ul>
<li><p>The file is readable as a CSV file, so it is well-formed</p>
</li>
<li><p>The file has values under all headers in all rows, so for our purposes we'll call it valid</p>
</li>
<li><p>The company name I.B.M. has been canonicalized to IBM, so we'll say that the data is in a canonical form</p>
</li>
<li><p>And the company listed as IBM is described as being in the business of Sunflower Farming</p>
</li>
</ul>
<p>Due to the last bullet having sketchy intelligence — we don't <em>think</em> IBM grows sunflowers, <em>but maybe?</em> — we'll say that this data is incorrect. Ultimately this is the most important consideration. However, if the lower acceptance layers are good-to-go, then the value of effort expended to make the data actually correct may be worth it. Or maybe IBM should start growing sunflowers. Actually, both things can be true.</p>
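<p>A correctness check typically leans on trusted reference data rather than a schema. Here is a hypothetical sketch of the IBM example; the reference table is invented:</p>

```python
# Hypothetical reference data: industries we believe each company operates in
KNOWN_INDUSTRIES = {
    "IBM": {"Technology", "Consulting"},
}

def is_correct(record: dict) -> bool:
    """Business-rule check: does the claimed industry match what we know?

    Unknown companies pass (we have nothing to contradict them);
    known companies must match the reference data.
    """
    known = KNOWN_INDUSTRIES.get(record["company"])
    return known is None or record["industry"] in known
```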
<h1 id="heading-where-csvpath-validation-can-help"><strong>Where CsvPath validation can help</strong></h1>
<p>Our focus is on the ingestion strategy called <strong>data preboarding</strong>. Preboarding takes files that arrive kinda looking like data and whips them into shape so you know you in fact have good data.</p>
<p>Historically, CSV files have not had a commonly used validation language. CsvPath Validation Language is a new language to help change that. It gives you the validity, canonicalization, and correctness checks you need to trust an unknown data file. CsvPath Validation Language offers both rules and schemas.</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cwy3xmvg1wkvwexwhej.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cwy3xmvg1wkvwexwhej.png" alt=" " /></a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757524866780/da53ea55-e7c1-406b-bd79-7e3f80525ad9.png" alt class="image--center mx-auto" /></p>
<p>In other posts we'll talk more about <a target="_blank" href="https://www.csvpath.org/">CsvPath Validation Language</a>. As a powerful function-based language for both schemas and business rules embedded in a complete preboarding architecture, there's a lot to get excited about! Stay tuned or, if you can’t wait, bop over to <a target="_blank" href="https://www.csvpath.org">https://www.csvpath.org</a> and <a target="_blank" href="https://github.com/csvpath/csvpath">CsvPath Framework’s GitHub repo</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Collect, Store, Validate, Publish]]></title><description><![CDATA[For many companies, ingesting data is a foundational risk. Think about it, if your business is built on collected data what happens when you fail to control the raw data feeds? Garbage in, garbage out. And who wants to be selling garbage? For those c...]]></description><link>https://blog.csvpath.org/collect-store-validate-publish</link><guid isPermaLink="true">https://blog.csvpath.org/collect-store-validate-publish</guid><category><![CDATA[architecture]]></category><category><![CDATA[dataengineering]]></category><category><![CDATA[data management]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Atesta Analytics]]></dc:creator><pubDate>Tue, 09 Sep 2025 17:24:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1757517372827/d6a12f9d-5a39-4d47-a0c6-eed86f88119a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For many companies, ingesting data is a foundational risk. Think about it, if your business is built on collected data what happens when you fail to control the raw data feeds? Garbage in, garbage out. And who wants to be selling garbage? For those companies, getting ingestion right is existential.</p>
<p>Those of us who build systems based on data aggregation, processing, analytics, and/or monetization channels know you need a well-thought-out architecture. For delimited file-based systems, the architecture to beat is called Collect, Store, Validate, Publish — or CSVP for short.</p>
<p><a target="_blank" href="https://www.csvpath.org">CsvPath Framework</a> provides a prebuilt preboarding process. It acts as the trusted publisher to your data lake, data warehouse, and applications. You can use the framework in many ways — it is flexible, but opinionated.</p>
<p>CsvPath is narrowly focused on one problem: the trillion-dollar challenge of data file ingestion. It tackles the problem using a single best-practice pattern — CSVP. That means it is quick to apply, broadly applicable, and consistent. If you have a data ingestion problem, that’s what you want to hear!</p>
<h1 id="heading-what-is-collect-store-validate-publish"><strong>What Is Collect, Store, Validate, Publish?</strong></h1>
<p>Design patterns make it easy to reuse proven approaches and communicate about designs. <strong>The Collect, Store, Validate, Publish pattern</strong> publishes known-good raw data to downstream consumers. It fills the gap between MFT (managed file transfer) and the typical data lake architecture.</p>
<p>CSVP controls how files enter the organization. It is also applicable to any boundary where flat files migrate from team to team: the point where files stop being data in-flight and become data-at-rest. That is typically where data is loaded into a data lake. You don’t want to be throwing garbage into a lake!</p>
<h1 id="heading-architecture-requirements"><strong>Architecture requirements</strong></h1>
<p>Data preboarding is about control. Control has a few fairly obvious requirements:</p>
<ul>
<li><p>File landing</p>
</li>
<li><p>Data registration</p>
</li>
<li><p>Validation</p>
</li>
<li><p>Upgrading</p>
</li>
<li><p>Lineage metadata production</p>
</li>
<li><p>Archiving</p>
</li>
</ul>
<p>Let's break them down.</p>
<h2 id="heading-file-landing"><strong>File landing</strong></h2>
<p>When you receive files you need to put them somewhere. A CSVP system versions files because they may change over time. In this way, data becomes immutable. From file landing forward, CSVP uses copy-on-write to make operations safe and repeatable. The staging area creates a naming structure that promotes findability. Access to the staging area must allow queries based on version, name, order of arrival, and date. These queries serve as reference for each CSVP activity and downstream data consumers.</p>
<h2 id="heading-registration"><strong>Registration</strong></h2>
<p>Files require a clear identity that follows the data they contain. Data needs a birth certificate and a social security number. This is the beginning of the data's lineage. The identity must be durable across changes and specific enough to identify versions of like data.</p>
<h2 id="heading-validation"><strong>Validation</strong></h2>
<p>Data validation checks that data is well-formed, valid against a schema, in canonical form, and correct with regards to business rules. Data preboarding is a way of doing a data quality shift-left. The earlier you can fail bad data the easier and cheaper it is to correct it. Validation done right minimizes manual checking and speeds up delivery. Since manual checking and triaging bad data found by data consumers can eat up to 50% of data operations time, validation is a critical component of CSVP.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757517637387/f296b6de-907d-4a4f-bcd3-250c241e68b8.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-data-upgrading">Data upgrading</h2>
<p>Data partners frequently send data that does not conform to expectations. Likewise, some data comes as expected but in a form that varies from in-house data. Often these inconsistencies can be easily fixed by small changes. For example, a field for the name of a month may require three characters, but one data provider always sends September as “Sept”. A small difference like this can be fixed, resulting in upgraded data.</p>
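<p>A sketch of such an upgrade, using the month example. The fixup table is hypothetical and would in practice be specific to one data partner:</p>

```python
# Hypothetical per-partner fixups applied before validation
MONTH_FIXUPS = {"Sept": "Sep"}

def upgrade_month(value: str) -> str:
    """Normalize a known non-conforming month abbreviation to three characters."""
    return MONTH_FIXUPS.get(value, value)
```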
<h2 id="heading-lineage-metadata"><strong>Lineage metadata</strong></h2>
<p>Data lineage tracks data by durable identity as it moves from system to system and changes over time. Lineage includes version, operator, validation and upgrading scripts, timing, sequence of runs, and other indicators. These indicators explain exactly how each data artifact was created. Ultimately, downstream data consumers should be able to easily trace any unexpected data step-by-step back to where it entered the organization. This traceability makes triage efficient and actionable.</p>
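<p>Lineage metadata does not need to be elaborate to be useful. A minimal sketch of a per-step record, with invented field names:</p>

```python
import hashlib
import json
from datetime import datetime, timezone

def lineage_record(path: str, data: bytes, operator: str, step: str) -> dict:
    """Build one lineage entry: who did what to which bytes, and when."""
    return {
        "artifact": path,
        "fingerprint": hashlib.sha256(data).hexdigest(),
        "operator": operator,
        "step": step,
        "at": datetime.now(timezone.utc).isoformat(),
    }

def append_lineage(log: list, record: dict) -> None:
    """Append-only log: one JSON line per step keeps the history durable."""
    log.append(json.dumps(record, sort_keys=True))
```

<p>Chaining one such record per step is what lets a downstream consumer trace an artifact back to the point where it entered the organization.</p>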
<h2 id="heading-archiving"><strong>Archiving</strong></h2>
<p>The final step of CSVP is to publish the data and metadata to an immutable permanent archive. This is where downstream consumers find their data. CSVP’s querying capabilities allow consumers to use data references as their sources. These references are specific, consistent, and descriptive in a way that simple file system paths or URLs cannot match.</p>
<h1 id="heading-why-do-we-call-csvp-a-trusted-publisher"><strong>Why do we call CSVP a <em>Trusted Publisher</em>?</strong></h1>
<p>When we talk about downstream data consumers we’re primarily talking about the data lake, data warehouse, and/or applications. These consumers are systems of record. It is vital that they receive known-good data. While some data lakes are managed in a way that progresses data from raw to finished product — as in a Medallion Architecture — that progression should always start with data that is accepted.</p>
<p>Accepting raw data means deciding what ideal-form raw data looks like and verifying each new piece of data against that standard — or rejecting it as early as possible. This is what data preboarding is all about. Using a CSVP architecture to preboard your data requires you to spell out what good looks like. The payoff is twofold:</p>
<ul>
<li><p>You can scale down expensive, slow, and risky manual processing</p>
</li>
<li><p>You insulate the organization from problematic data and heroic firefighting</p>
</li>
</ul>
<p>Preboarding is required because data partners you don’t control are inherently untrustworthy. They can and will change their data for internal reasons. Their interpretation and requirements will differ from yours. Their level of investment in their data operations will not match yours. And they will make mistakes you cannot overlook. Moreover, often they won’t tell you when change happens. You have to find out for yourself.</p>
<p>When you preboard your inbound data you are intermediating the preboarding process between data producer and data consumer. From the downstream consumer’s point of view, the preboarding archive becomes the data publisher that they <em>can</em> trust.</p>
<h1 id="heading-how-csvpath-framework-does-csvp"><strong>How CsvPath Framework does CSVP</strong></h1>
<p>At a high level, the CSV Pattern looks like this picture.</p>
<p><img src="https://www.csvpath.org/~gitbook/image?url=https%3A%2F%2F2402701329-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252F6wzqgDHj9mZbFxabiEbc%252Fuploads%252FlOcYw3TTBMVxDXwUOqdm%252Fcsv-pattern.png%3Falt%3Dmedia%26token%3D8d1aa11b-382e-4624-8283-c909f78432a7&amp;width=768&amp;dpr=4&amp;quality=100&amp;sign=29615fa3&amp;sv=2" alt /></p>
<h2 id="heading-sophisticated-features">Sophisticated features</h2>
<p>The features cover all the bases we laid out above:</p>
<ul>
<li><p>Lightweight projects encapsulate different data partnerships for clarity</p>
</li>
<li><p>Data files are captured and maintained in an <strong>immutable, versioned staging</strong> area</p>
</li>
<li><p>Processes are simple, linear, and <strong>consistent across data partners</strong></p>
</li>
<li><p><strong>Validation</strong> checks for well-formedness, validity, canonicalization, and correctness</p>
</li>
<li><p>Processing is idempotent, using <strong>copy-on-write semantics</strong> so data is never lost or untraceable</p>
</li>
<li><p><strong>Rewind and replay</strong> allow for data fixes with confidence and without restarting from scratch</p>
</li>
<li><p><strong>Metadata for provenance, lineage, and validity is captured at every step</strong></p>
</li>
<li><p>Results are published as “ideal-form” raw data in an <strong>immutable permanent archive</strong></p>
</li>
</ul>
<p>If you read this and think: <em>how else would you do it?</em> that is good! CSVP is an intuitive approach to controlling data ingestion. It helps make data preboarding a distinct stage with a focused, high-value goal.</p>
<p>In practice, though, in many companies the pattern is not this clear and intentional. In fact, in many companies the pattern isn't used consistently across data partners. Of course, no pattern can cover every possible situation. We believe the CSV Pattern covers the majority of transaction-oriented, many-party, file-based, loose-integration situations — particularly those where one or more of the parties is technically weak for any reason.</p>
<h2 id="heading-a-drop-in-design"><strong>A drop-in design</strong></h2>
<p>CsvPath Framework provides the CSVP architecture in a prebuilt, pre-integrated package. It is multi-cloud ready and set up for observability using any OTLP platform (e.g. Grafana) or OpenLineage collector (e.g. Marquez) out of the box. The full CSVP implementation comes in three components:</p>
<ul>
<li><p>CsvPath Framework is a Python library distributed on PyPI, providing a complete programmatic CSVP solution</p>
</li>
<li><p>FlightPath Data is a Windows or MacOS app distributed on the Microsoft and Apple stores</p>
</li>
<li><p>FlightPath Server is an API for upstream and downstream systems, distributed on WinGet and Brew</p>
</li>
</ul>
<p>The components are all open source and available on GitHub.</p>
<p>In every way possible, CsvPath Framework works to be a drop-in replacement for a less developed system or a full solution for a greenfield situation.</p>
<h1 id="heading-learn-more-about-csvpath-framework"><strong>Learn more about CsvPath Framework</strong></h1>
<p>For more details about how CsvPath Framework implements the Collect, Store, Validate, Publish Architecture, <a target="_blank" href="https://www.csvpath.org">hop over to the documentation</a>. For information about FlightPath Data check out <a target="_blank" href="https://www.flightpathdata.com">https://www.flightpathdata.com</a>.</p>
]]></content:encoded></item></channel></rss>