Data Preboarding: Welcome To the 4th Dimension

Over the weekend I dug into an eclectic array of iPaaS, ERP, and EDI systems. The new, the older, and the 80s. Catching up on the CsvPath Framework competition. What can I say? It was too cold to paint the house.

It got pretty interesting, though. What dawned on me was that I needed a way to explain why these data-file-savvy systems were all singing, all dancing, and yet not getting the whole data integration job done. Why, at the end of the day, doesn't the iPaaS finish the job the earlier systems didn't complete?

Without Naming Names

Here's an example. An ERP system used in large manufacturing and logistics firms had a flat-file import function. It took in file types that included invoices, to pick a good example. The process went like this:

  • Open data import dialog

  • Enter file system location of the CSV / Excel files

  • Enter credentials

  • Pick a database destination

  • Click go and sit back and watch

The files loaded or they didn't, according to very basic rules that amounted to making sure the SQL schema didn’t wig out. Files that didn't load stayed where they were. If a file loaded, it was moved to a done folder. And that's it. Yes on one, no on two.

What's wrong with that?

A few things. Let's assume we're talking about invoices that arrive daily, weekly, or monthly. And let's assume the process is automated; we're not clicking the load-file button manually.

The first question is where do we land the files and how are they organized? What are the naming conventions? Do files ever get delivered twice or not at all? Are there restatements? If so, how are those indicated? What identifies the data set contained in a file?
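
To make a couple of those questions concrete, here is a minimal Python sketch of telling a first delivery from a duplicate from a restatement. Everything in it is hypothetical: the function name, the `seen` dictionary, and especially the assumption that "same file name, different bytes" means a restatement. Real feeds may signal restatements in other ways, or not at all.

```python
import hashlib


def classify_delivery(name: str, data: bytes, seen: dict[str, list[str]]) -> str:
    """Classify an arriving file as 'new', 'duplicate', or 'restatement'.

    `seen` maps a file name to the content fingerprints already received
    under that name. It is an in-memory stand-in for a durable registry.
    """
    digest = hashlib.sha256(data).hexdigest()
    history = seen.setdefault(name, [])
    if not history:
        status = "new"            # first time this name has shown up
    elif digest in history:
        status = "duplicate"      # same name, same bytes: delivered twice
    else:
        status = "restatement"    # same name, different bytes (an assumption)
    if status != "duplicate":
        history.append(digest)    # remember each distinct version, in order
    return status
```

The point is only that answering these questions requires remembering what arrived before, which a load-and-move-to-done function does not do.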

Ah, I can hear you say it, how is any of that the ERP system's problem? Well, yeah, I know you didn't actually say that, but for sure someone somewhere once did. I know that because the ERP system's data file load function simply didn't care. It was an SEP.

Somebody Else's Problem

Just a few minutes later I found myself digging into another company that was building similar solutions on a popular, more modern, iPaaS. Again, no last names. I hadn't read that tool’s docs before so I took a look. Surely the new hotness would do better! The docs were certainly better. And the bluster brighter and more amusing.

Sadly, the data file import functionality was almost exactly the same. Ouch.

I moved on to other apps and tools looking for answers, while in the back of my head I pondered the curious lack of change between my granddad's ERP and my nephew's iPaaS. It felt like Groundhog Day. And then it hit me.

Nothing Good Ever Happens On Groundhog Day

These systems are insisting on a three-dimensional world when we actually live in four dimensions.

Seriously, yes, we actually do, and, sure, some ERP tools may actually be living in some other two-dimensional world from ours, but that's just not material here.

There are three dimensions to our personal space. The fourth dimension is time. Preboarding accounts for time. Import and export functions and pipelines and your run-of-the-mill data onboarding process are about moving bits from here to there. Time is not of the essence, other than in a runtime performance kind of way.

Preboarding is about all the things that you have to arrange and account for to make that simple bit-slide be part of an actually effective workflow that can handle a client promising to send tomorrow a CSV invoice file exactly the same as the one you just loaded, except with the correct calculations this time.

Now it all makes more sense to me as I'm looking at all the data onboarding, importing, mapping, and loading docs. These tools that don't fully account for time, and all the crazy things that happen in it, are setting boundaries. Setting boundaries is good, I guess. It's not laziness. We don't want the ERP system or the iPaaS to burn out or get demotivated.

Wherever You Go, There You Are

Still, something has to care about the process of merging data into the enterprise. Not just loading it, but really integrating it into a healthy data estate. Whatever tools, applications, and/or cloud services you pick, you still have that overriding concern. We deal with the way things really happen, not just the things the iPaaS decides are within its boundary-setting sensibilities.

You may go to war with the iPaaS you have, not the iPaaS you wish you had, but that doesn’t mean you can just wish any part of the war away.

That thing that cares about the time dimension and the real world is data preboarding. Preboarding is a simple set of steps:

  • Land data in permanent immutable storage partitioned by data category, with a hierarchical time-of-arrival oriented layout that retains versions

  • Register the bits as the most recent in the category at a location and a point in time, giving a durable identity traceable from downstream

  • Validate the CSV / Excel data using as much quality control logic as practical so as to fast-fail bad data

  • Upgrade the data, if needed, idempotently

  • Capture processing events to validation reportage, lineage, process telemetry, errors, and logging

  • Write valid and upgraded output data, along with invalid data, if useful, to an immutable searchable versioned archive accessible to downstream

When you use that many words I admit it doesn't sound that simple. But it really is. And the whole point of it is time. Data comes in, we react to it, people ask questions, data gets revised, it gets reloaded, more questions are asked, problems happen, we roll back, explain, and restart, etc., etc. Tick-tock, tick-tock.

Many People Do Respect Time

There are, of course, many applications that know they live in four dimensions. Those apps preboard their data. Strategic Healthcare Management Systems or SHMS, for example, seems to have gotten the memo that we live in a four-dimensional world. I stumbled on them in my not-painting-the-house travels, but I haven't used the system, so I don't know all the guts and glory. Reading their docs, they are clearly doing many, if not all, of the bullets above. They do it because that's the world we live in.

(Fwiw, I have no relationship with SHMS. And to make sure of that, I'll note that, phonetically speaking, SHMS is an awesome acronym for a software company. ;)

Living in four dimensions is liberating. It means you prepare for the things that are likely to happen, rather than being surprised by them. No back-flips in order to explain, retrace, or redo. Zen-like calm as you know that your system is one with the circle of life that in wheeling will surely result in the same file being resent with the same name and very slight changes from the same client over and over. You are ready. You have balance. You have accounted for time.

And Now a Word From Our Sponsors…

Or maybe you have not. On the one hand, solution builders generally do account for time through preboarding workflows, if their solution is good. Software and web services vendors, on the other hand, often don't, presumably because they want to help solution builders make ends meet.

If you find yourself at the pointy end of a software package that doesn't have the time to help you do better preboarding of your CSV/Excel files, it's probably time for you to take a look at CsvPath Framework and FlightPath Server. The solution to having no data preboarding is to adopt the most robust open source preboarding tools available. There's no better time to do it.