
Do Schemas Have a Place In Delimited Data?

CSV and Excel have been missing the party


CSV, Excel, and other delimited files have not historically had a robust schema language. In recent years the now dormant CSV Schema Language 1.0 partially filled this gap. Now CsvPath Validation Language fills it more completely. For most people, however, the unasked question is, why do I need a schema?

What Is a Schema?

Data schemas are shorthand for rules. They create a structural definition of data based on whole-part units and part-part relationships. Every structural schema could be decomposed to a set of rules; however, the abstraction of the units, a.k.a. entities, is an expressive shorthand that aids understanding.

We all know examples of schema languages. SQL has its Data Definition Language subset to define what tables look like and how they are related. XSD uses XML, namespaces, and XPath to create containment information models. SQL's superpower is relationships between entities; XSD's is hierarchical nesting. Both work at the level of sets of data. DDL defines a "document" as a row in a table. XSD defines a document as one XML text.

There are rules-based schema languages as well. To highlight two, Schematron is an XPath-based business rules language that also works on XML's hierarchical structure, but without defining entities as such. Similarly, SQL has its Data Manipulation Language subset for interacting with the data in a database. Validating data is a core competency of DML, putting it on the same level as Schematron, even if validation is not DML's first priority.

CsvPath: A Schema Language for Delimited Data

In comparison to XML and the data in a relational database, CSV data is both simpler to model and harder to validate. Setting aside Excel for the time being, the simplicity comes from a delimited file being, commonly, roughly the same as a relational table, but without the possibility of relationships. (Again, commonly; you can of course have any pointers in CSVs you like, but they don't have a well-known function.)

The challenge is that there is no server to enforce a required schema. To oversimplify in a useful way: without enforcement, data in a database file would just be rows and columns embedded in a fancy file format. Because CSV has no enforcement context, it's the wild west. That makes things hard.

CsvPath Framework does not act as a relational database server does, though FlightPath Server closes some of the gap. On its own, CsvPath Framework simply offers strong validation. It provides the enforcement that makes CSV data behave more like a relational database table. If you use CsvPath Framework as a whole data preboarding apparatus, not just a schema validator, you get more of that enforcement, and the automated quality management opportunities that come with it.

How Do CsvPath Validation Language Schemas Work?

CsvPath Validation Language has an overall structure that is far less verbose than XSD, more on the level of DDL. The parts of a csvpath are:

  • The root

  • Scanning instructions (a.k.a. the scanning part)

  • Match components (a.k.a. the matching part)

The root is simply a $, optionally followed by a file path, if the user is targeting a specific file.

The scanning part and the matching part are each wrapped in brackets, []. The scanning part contains numbers that pick out lines to scan. The matching part scans each of those lines by passing the values to its matching components. If the matching components return true, the line is considered to have matched. By default, a csvpath's match components are ANDed together.
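
For instance, a minimal csvpath with all three parts might look like the following. (This is a sketch: the file name and the #status header are hypothetical, and writing the file path directly after the root as shown is an assumption.)

$myfile.csv[*][
    #status == "ok"
]

Here the root targets a specific file, the scanning part selects every line, and the matching part matches lines whose status header holds "ok".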

This is a deceptively simple structure. In reality, there are many knobs and dials:

  • Logical operation: AND by default, but settable to OR

  • Validity test: strategy is determined by the schema author

  • Validity result: marked by the schema author, or not

The first bullet, the logical operation, is straightforward. With AND, the first false match component decides the outcome; with OR, the first true one does. By default, all match components must be true for a line to match. However, a line that matches isn't necessarily considered valid data. It is equally reasonable to say that matching lines are invalid. Consider this:

$[*][ 
    #Cretan == "liar" 
]

The syntax is: $[*] means scan all lines. The # references a header, in this case the header is Cretan. What this csvpath statement says is, if the value of the Cretan header is "liar", the line matches.

More Important Background (Peas Before Cake!)

Now we have to decide if the match means the line is valid or invalid. That is the validation strategy. Both options make sense. If you want to present a user with only the bad lines, you would collect those that matched a definition of bad. If you want to present only the good lines your match components would define what good looks like.

In fact, we can have CsvPath Framework collect both matched and unmatched lines. As well, we can switch the match part from ANDing match components to ORing them. These are also part of the validation strategy, but they are a feature of the Framework, not CsvPath Validation Language itself. Also keep in mind that CsvPath Framework allows you the option to abstain from collecting any lines, if you like.

Let's add yet more wrinkles to the validation strategy, all built into the language, not the Framework.

  • Declaring validity

  • Side effects

  • Print-only validations

  • Built-in validations

A csvpath statement can use the fail() function to declare a line invalid. In a given circumstance a line can be invalid regardless of whether I plan to collect it. Once a line triggers the fail() function, the file as a whole is considered invalid, or failed. In order for this to work, we need another concept: a side effect.

Side effects are actions taken by a csvpath that do not determine if a line matches. Using the print() function is an example of a side effect. When my csvpath does print("Epimenides is a Cretan") there is no impact on the line capturing, or not, but the string is appended to the printout.

If I want to print only when a line is invalid, I can use a construction like yes() -> print("Epimenides is a Cretan"), where yes() is a function that is always true and -> is the when/do operator that performs the right-hand side when the left-hand side evaluates to true. Another example might be #Liar == "Epimenides" -> fail().

Side effects in combination with the print() function allow us to follow in XSD and Schematron's footsteps and deliver a report-out, or print-only, validation. SQL and CsvPath by their nature collect data. XSD and Schematron report on the state of data, but do not collect it. However, when you use the when/do operator and the print() function, CsvPath Validation Language can operate in the same print-only mode. This is sometimes ideal, for example when you have very large files that may have many errors.
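
Composing the pieces introduced so far, a print-only csvpath might report each bad line and fail the file without collecting anything. (A sketch built only from the functions already shown.)

$[*][
    ~ report-only validation: print and fail, collect nothing ~
    #Cretan == "liar" -> print("Epimenides is a Cretan")
    #Cretan == "liar" -> fail()
]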

Standing between collecting lines and performing only side-effects, there are many validations that are built in to CsvPath Validation Language. These come in two types:

  • Language errors

  • Data errors

Language errors are problems with how you wrote your csvpath statement. For example, if you write add(5) you will get a language error. This is because the add() function requires two values and you only provided one.

On the other hand, if you write add(5, "three") you will get a data error, because you cannot add a string that spells out three to the integer 5. The add() function must be able to cast the values you pass it to numbers. In practice, a language error will happen once, immediately, as soon as you start to validate. A data error will happen on every line of the file you attempt to validate, because on each line the add() match component will once again fail to add 5 and "three".
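
To see the difference side by side, these two csvpaths illustrate the two error types. (A sketch; in practice you would not write either deliberately.)

$[*][
    ~ language error: add() requires two values, so this fails once, immediately ~
    add(5)
]

$[*][
    ~ data error: "three" cannot be cast to a number, so this fails on every line ~
    add(5, "three")
]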

These language and data errors are built in. They will fire even if none of the rules you write are triggered by lines. It doesn't matter if you collect lines with built-in errors or lines without them; a built-in error always results in its match component returning false. And it is these validation errors, the built-in data errors in particular, that give us the ability to write schemas.

Yes, this was a long path to take to get back to the concept of structural schemas, but the background is worth it.

What Does a CsvPath Validation Language Schema Look Like?

A CsvPath Validation Language schema looks more similar to SQL than to XSD. Here is a simple one:

$[*][ 
    line.person( 
        string(#firstname), 
        string(#middlename), 
        string(#lastname)
    )
]

We construct CsvPath schemas using the line() function. In this case, our csvpath statement has one entity named "person" that holds a first, middle, and last name. Moreover, our csvpath says that for a line to be valid it must have three headers named "firstname", "middlename", and "lastname", which must hold values that can be cast to strings (as all CSV values can be) or nothing, and there must be no other headers or header values.

This is the equivalent SQL:

CREATE TABLE person ( 
    firstname VARCHAR, 
    middlename VARCHAR, 
    lastname VARCHAR 
);

We can go much further with this CsvPath schema. For example, we can have:

$[*][ 
    line.person.distinct( 
        string.notnone(#firstname), 
        string(#middlename), 
        string(#lastname) 
    )
]

With this update, we are requiring every person to be a different combination of names. We are also requiring a firstname by using the notnone qualifier. A qualifier is an annotation that modifies how a match component works. In this case our line() function has two qualifiers. The distinct qualifier has the meaning just described, whereas the person qualifier is simply an arbitrary name that we use to refer to the entity.

To update the SQL to match:

CREATE TABLE person ( 
    firstname VARCHAR NOT NULL, 
    middlename VARCHAR, 
    lastname VARCHAR, 
    CONSTRAINT unique_person_names UNIQUE (firstname, middlename, lastname) 
);
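
As an aside, line() is not limited to string(). A sketch mixing datatypes might look like this (the header names and the order entity name are hypothetical):

$[*][
    line.order(
        integer.notnone(#id),
        date(#ordered),
        decimal(#total),
        email(#contact)
    )
]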

There is much more you can do with line() and the datatype functions, string(), integer(), decimal(), date(), datetime(), none(), url(), email(), and boolean(). The several schema-related qualifiers add even more power. At the same time, near the top we said that schemas are just shorthand for rules. Let's look at the person entity as a rule set.

$[*][ 
    #firstname 
    #middlename 
    #lastname 
]

This csvpath requires all three fields to be present. However, we only want the firstname header to be mandatory. That means we need to update the csvpath to:

$[*][ 
    #firstname 
    or(#middlename, none()) 
    or(#lastname, none())
]

What's more, our schema wanted the fields to be in the order given. But match components don't work that way. Their order matters from a logical point of view, but the order of use doesn't say anything about the order of headers. To enforce the ordering we want, we have to add:

$[*][ 
    #firstname 
    or(#middlename, none()) 
    or(#lastname, none()) 
    header_names_mismatch("firstname|middlename|lastname") 
]

(Note that header_names_mismatch() returns false if names mismatch. The function's name is in some ways backwards and will be aliased in a future release to make more grammatical sense.)

With the addition of two more functions we can require uniqueness across all sets of names like this:

$[*][ 
    ~ this rule set is equivalent to the person entity ~ 
    #firstname 
    or(#middlename, none()) 
    or(#lastname, none()) 
    header_names_mismatch("firstname|middlename|lastname") 
    not( has_dups(#firstname, #middlename, #lastname) ) 
]

This version of the csvpath is a full rules-based equivalent of the person entity.

To recap, this latest version of the rules is more complex than the person entity it can stand in for:

$[*][ 
    line.person.distinct( 
        string.notnone(#firstname), 
        string(#middlename), 
        string(#lastname) 
    )
]

It is easy to see why the structural schema based on a named person entity is preferable. But the point is clear: a schema is shorthand for rules. And that brings us to the power of rules to extend schemas.

Without going deeply into the topic, rules apply to the requirements and activities that happen to and with the schema entities. For example, a structured line() entity named person tells you what people are, but it doesn't tell you that a person must have no middle name if they have no lastname. We might add that declarative logic like this:

$[*][ 
    line.person.distinct( 
        string.notnone(#firstname), 
        string(#middlename), 
        string(#lastname)
     ) 
     not.nocontrib(#lastname) -> not(#middlename) 
]

This way, a line that matches the person entity but has a middle name and no last name will return false. Again, in our schema strategy, we are selecting invalid lines.

In this rule the not() on the left-hand side of the when/do expression has a nocontrib qualifier. The nocontrib qualifier indicates that the match component does not contribute to the decision to match or not match. This results in the lastname test being ambivalent about the presence or absence of #lastname. At the same time, the #middlename test is a contributor to the match, provided it is checked because there was no lastname.

The equivalent SQL would be:

CREATE TABLE person ( 
    firstname VARCHAR NOT NULL, 
    middlename VARCHAR, 
    lastname VARCHAR, 
    CONSTRAINT unique_person_names UNIQUE (firstname, middlename, lastname), 
    CONSTRAINT check_not_middlename_if_not_lastname CHECK (lastname IS NOT NULL OR middlename IS NULL) 
);

SQL is very good at rules. Generally, though, only a small fraction of validation rules are written into DDL. In CsvPath Validation Language, rules are commonly added alongside line() definitions. That said, CsvPath Framework makes it almost trivially easy to run sets of csvpaths against a data file, so it is essentially cost-free and often cognitively advantageous to separate rules or sets of related rules into their own csvpaths. When handled that way, the similarities to SQL rules as DML statements are clear.

One More Thing About CsvPath Validation Language Schemas

Our simple csvpath schema has three headers. What if we wanted five headers? How about an unlimited number? Or what if we want there to be one header before the three names, but we don't care what it is? Or even that, except we care about the name but not the type?

All these things are possible. CsvPath has the blank() and wildcard() type functions to handle those requirements. The wildcard() function takes a "*" for an unlimited number of headers, or an integer to represent exactly that number of headers. Similarly, blank() indicates one header, which we may name or not.

With blank() and wildcard() we have the ability to position our line() entity anywhere on a line we like. That becomes interesting because it gives the opportunity to define a sparse entity, meaning one that has just a few headers specified out of potentially many actual headers. Or, even more interestingly, to define multiple entities per line of data. And, indeed, we can overlap entities or even have entities share member headers. There are many possibilities.
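
As a sketch of a sparse entity (the #zip header and the zipcheck entity name are hypothetical), blank() and wildcard() can position a single required field within a wider line:

$[*][
    ~ one unnamed header, then a five-digit zip, then any number of headers ~
    line.zipcheck(
        blank(),
        integer.notnone(#zip, 5),
        wildcard("*")
    )
]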

Why might we want to have multiple entities per line? The clarity of this schema-based csvpath helps make the point:

$[*][ 
    line.person.distinct( 
        string.notnone(#firstname), 
        string(#middlename), 
        string(#lastname), 
        wildcard() 
    ) 
    not.nocontrib(#lastname) -> not(#middlename)

    line.address( 
        wildcard(3), 
        string.notnone(#street), 
        string.notnone(#city), 
        integer.notnone(#zip, 5) 
    ) 
]

Contrast that to:

$[*][ 
    #firstname 
    or(#middlename, none()) 
    or(#lastname, none()) 
    header_names_mismatch("firstname|middlename|lastname|street|city|zip") 
    not( has_dups(#firstname, #middlename, #lastname) ) 
    not.nocontrib(#lastname) -> not(#middlename) 
    #zip 
    #city 
    #street 
    integer(#zip, 5) 
]

The rules version is more terse. But the schema version is far more readable. And in the schema we have named entities, so there is less cognitive load on us reading it. It is immediately apparent that a person has an address. When we look at the rules, it is clear we are saying something about people and addresses, but it's much more work to pick out what exactly we're saying.

To Get Back To the Question… What Was the Question?

So, do schemas have a place in the world of delimited data? Do CSV and Excel files benefit?

I think the answer is clear from looking at the quality of this simple person entity definition we created using CsvPath Validation Language. Moreover, at least for those using CsvPath Framework, we have a constraint enforcer now, similar to the relational database server's enforcement of DDL on columns of data in files. Certainly there is a need for data to be determined valid or invalid as quickly as possible after its arrival. Nothing but cost and aggravation is added by letting errors drift downstream unconsidered.

So, yes!

In my mind the answer is easy. Structural schemas, rules-based schemas, and schemas that are a mix of both all have a strong value-play in delimited data. The more powerful and expressive our CsvPath schemas make our early quality gates, the better the downstream quality of our data, the less our toil, and the greater the agility of our processes.

An easy answer.

For more on CsvPath Validation Language see the GitHub repo's docs pages. For more about the CsvPath Framework, see csvpath.org. And to give CsvPath a try, the easiest way is to download FlightPath Data.