By Design ■ Episode 2
In 1972, someone at IBM needed to move numbers between programs. They separated them with commas. It worked. That was the entire design process.
No specification was written. No committee was formed. No standard was proposed. No working group was convened. No versioning scheme was debated. No logo was commissioned. Someone put commas between values, and the values arrived at the other end in the correct order. The engineering community, presumably satisfied that this was sufficient, moved on to other things.
Thirty-three years later, Yakov Shafranovich wrote RFC 4180 to formalise what was already running on every machine on earth. The format, one might say, was rather faster than the standardisation.
The Complaint
"CSV is not a serious format. No types. No schema. No validation. No standard encoding. One misplaced comma and your import collapses. One semicolon-delimited file from a German colleague and your entire pipeline has a rather bad morning."
Every data engineer has said this. Most of them said it this week. The complaints are all correct. CSV has no type system: the number 42 and the string "42" are indistinguishable. It has no schema: columns can appear, disappear, or rearrange themselves between files without notice. It has no standard encoding: the same file can be UTF-8 in one system and Windows-1252 in another, and the only way to find out is to open it and look for garbled umlauts.
And yet.
The Decision
Nobody decided. That is precisely the point.
No committee means no politics. No schema means no version conflicts. No types means every system on earth can read it: your database, your spreadsheet, your shell, your thirty-year-old mainframe that everyone is terrified to touch, and your colleague's Excel installation that insists on interpreting gene names as dates. (This is not a hypothetical. A 2016 survey found such conversion errors in roughly a fifth of genomics papers with supplementary Excel gene lists. But that is Excel's fault, not CSV's.)
And despite the name, CSV is not married to the comma. The format is delimiter-agnostic. Pipe, tab, semicolon: use whatever character does not appear in your data. Use | and commas in your data stop being a problem entirely. The parser does not care. The specification, such as it is, does not care. The data does not care. Only the German colleague cares, and he switched to semicolons twenty years ago because the comma is a decimal separator in German, which is precisely the kind of edge case that a committee would have solved and a non-specification simply absorbs.
One parser: split at delimiter. One convention: the first row might be headers. One dependency: none.
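A minimal sketch of that parser and that convention, using nothing beyond printf and awk. The sample data, its column names, and the variable names are hypothetical; the delimiter is a pipe so that the commas inside the name field are simply characters.

```shell
# Hypothetical pipe-delimited data: a header row, then two records.
# The name field contains commas, which is harmless because the delimiter is |.
data=$(printf 'id|name|amount\n1|Smith, John|10\n2|Meier, Anna|32\n')

# The whole parser: split each line at the delimiter.
second_name=$(printf '%s\n' "$data" | awk -F'|' 'NR == 3 {print $2}')

# The header convention: treat row 1 as column names and skip it when summing.
total=$(printf '%s\n' "$data" | awk -F'|' 'NR > 1 {sum += $3} END {print sum}')

echo "$second_name"  # Meier, Anna
echo "$total"        # 42
```

Nothing in the data says that row 1 is a header. The reader decides, which is the convention in its entirety.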
CSV as a Unix Tool
This is where CSV reveals its true elegance. It is not merely a data format. It is a text format, and text is the universal interface of Unix. Every tool in the Unix toolchain operates on delimited text without knowing or caring what the format is called:
# Key-value lookup: find user 42, extract the third field
grep "^USER_42," users.csv | cut -d, -f3
# Aggregate a column: sum the fourth field across all rows
awk -F, '{sum+=$4} END {print sum}' sales.csv
# Sort by the third column only, numerically
sort -t, -k3,3n data.csv
# Join two CSV files on the first column (both must already be sorted on it)
join -t, accounts.csv transactions.csv
No database. No query language. No dependency. No pip install. No npm install. grep, awk, sort, join: tools from the 1970s, operating on a format from the 1970s. The entire data pipeline is four commands that ship with every Unix installation and have not changed their interface in fifty years. One does rather appreciate stability in an industry that considers six months a long support window.
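One caveat worth a sketch: join only matches rows reliably when both inputs are sorted on the join field, so in practice sort runs first. The file names and contents below are hypothetical and live in a temporary directory; LC_ALL=C is an assumption made here so that sort and join agree on ordering.

```shell
# Hypothetical sample files, created in a temporary directory.
dir=$(mktemp -d)
printf '2,bob\n1,alice\n' > "$dir/accounts.csv"
printf '1,10\n2,32\n1,5\n' > "$dir/transactions.csv"

# join expects both inputs sorted on the join field, so sort on field 1 first.
LC_ALL=C sort -t, -k1,1 "$dir/accounts.csv" > "$dir/accounts.sorted"
LC_ALL=C sort -t, -k1,1 "$dir/transactions.csv" > "$dir/transactions.sorted"

# One output row per matching pair: key, then the remaining fields of each file.
joined=$(LC_ALL=C join -t, "$dir/accounts.sorted" "$dir/transactions.sorted")
echo "$joined"

rm -r "$dir"
```

Still no database, and still four commands, although one of them now appears twice.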
The Trade-Off
The complaints are not wrong. They are the price of admission.
No standard encoding. No escaping convention anyone agrees on. A quoted field containing a comma, inside a comma-delimited file, inside a system that does not handle quotes. One has been there. It was a Friday. It was not pleasant. The debugging involved hexdump and a vocabulary that one does not normally employ in professional correspondence.
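The failure is easy to reproduce. A sketch with a hypothetical row of three logical fields, the second quoted because it contains a comma, fed to tools that split at the delimiter and know nothing of quotes:

```shell
# Hypothetical row: three logical fields, the second one quoted.
row='1,"Smith, John",Berlin'

# A plain split at the comma counts four fields, not three.
naive_count=$(printf '%s\n' "$row" | awk -F, '{print NF}')

# Asking for the third field returns a fragment of the name, not the city.
naive_third=$(printf '%s\n' "$row" | cut -d, -f3)

echo "$naive_count"  # 4
echo "$naive_third"  # prints: space, John, closing quote
```

A quote-aware parser would return Berlin. Split-at-delimiter returns the tail end of a surname, and does so without complaint.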
The format pays for its universality with fragility at the edges. Every edge case is a surprise, and every surprise arrives at half past four on a Friday, because that is apparently when German accounting systems decide to export their quarterly reports with semicolons and Windows-1252 encoding and a trailing carriage return that makes diff think every line has changed.
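A hedged sketch of the Friday clean-up, assuming one already knows the export is Windows-1252 (nothing in the file says so) and that iconv is installed. The sample bytes are hypothetical. The semicolons become tabs rather than commas, because the amounts use a decimal comma.

```shell
# Hypothetical export: Windows-1252 bytes (\374 is ü), semicolons, CRLF endings.
clean=$(
  printf 'Kunde;Betrag\r\nM\374ller;3,50\r\n' |
    iconv -f WINDOWS-1252 -t UTF-8 |  # re-encode; the source encoding is an assumption
    tr -d '\r' |                      # drop carriage returns so diff compares content
    tr ';' '\t'                       # tab-delimit; the decimal comma stays intact
)

echo "$clean"
```

None of these three steps can be inferred reliably from the file itself. Each is a guess confirmed by looking, which is rather the point of the paragraph above.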
The Proof
60 per cent of enterprises use CSV for data exchange between systems. Every spreadsheet application on earth reads it. Every database exports it. Every CRM, ERP, and accounting system produces it. RFC 4180 arrived in 2005; by then, billions of CSV files already existed, blissfully unaware that they were not yet standardised.
XML tried to replace it: too verbose. A simple three-column table becomes forty lines of angle brackets. JSON tried: no tabular structure. Arrays of objects work, but try opening one in a spreadsheet. Parquet tried: excellent for analytics, but requires a runtime and a library to read. Avro tried: requires a schema registry, which requires a server, which requires an operations team, which rather defeats the purpose of moving data from one place to another quickly.
CSV survived them all. Not because it is better. Because it requires nothing: no tooling, no agreement, no runtime, no expertise. A text editor and the ability to count delimiters. That is the entire stack.
The Principle
The format that requires no agreement will outlast the format that requires consensus. CSV has no governance, no authority, no design document, no version number, no logo, no annual conference, and no certification programme. That is not its weakness. It is the reason nobody has managed to kill it in fifty-three years.
Last episode, Hipp sacrificed write concurrency and gained the most deployed database in history. CSV sacrificed everything: types, schemas, encoding, validation, escaping, governance. It gained universality so complete that the format predates the specification by three decades and will almost certainly outlive it.
Nobody designed CSV. Everybody uses it. One does wonder whether there is a lesson in that.