(Replying to PARENT post)

STOP.

People insist on making data exchange formats, which are used to transfer data between software 99.99% of the time, "easy for humans" for the 0.001% of the time a human actually reads them, at the expense of the common case. This inevitably results in squishy, inefficient formats like HTML and JSON that become performance quagmires and security land mines: http://seriot.ch/projects/parsing_json.html

CSV is not even easy for humans to process, with or without tools like grep. Can you successfully handle quote escaping and line-wrapping?

    This ,"is ",    valid 
    "Hello",123,"''from the
    not,what,you,
    ""think"",next line"""
That's one record, by the way.

Your level one challenge: extract the unescaped value of the "valid" column. Write your robust, tested grep command line below.
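
For comparison, here is a rough sketch of what a state-tracking parser has to do with that snippet, using Python's standard csv module (the .strip() on the header names is just one assumption about how to handle the stray whitespace, not something any standard mandates):

    import csv
    import io

    # The snippet above, verbatim: a header row plus one data record whose
    # third field spans three physical lines.
    raw = (
        'This ,"is ",    valid \n'
        '"Hello",123,"\'\'from the\n'
        'not,what,you,\n'
        '""think"",next line"""\n'
    )

    reader = csv.reader(io.StringIO(raw))
    header = [name.strip() for name in next(reader)]   # ['This', 'is', 'valid']
    record = next(reader)                              # one record, three physical lines

    # The unescaped value still contains embedded newlines, commas, and
    # literal double quotes, none of which a line-oriented tool can see.
    print(repr(record[header.index("valid")]))

The record boundaries only exist once quote state has been tracked across physical lines, which is exactly what grep, working line by line, cannot do.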

Level two: Make that parse consistently across a dozen tools from a dozen vendors. Try ETL tools from Oracle, Sybase, and SAP. Try Java, PowerShell, and Excel. Then import it via an ODBC text driver, and also into PowerBI. I wish you luck.

πŸ‘€jiggawattsπŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

This is disingenuous. There isn't really a CSV standard that defines the precise grammar of CSV.

But suppose there were such a standard; then some of the problems you've described would become non-problems (e.g. support across multiple tools). Potentially, if such a standard disallowed line breaks, it would also make grep easier to use.

I actually really wish there was such a standard...

Your other assertion about "squishy" formats is also... well, not really true. The formats you listed became popular because of the data they encoded, not because of their quality. It's very plausible that better formats exist; they are probably already out there. The problem is that we set the bar too low with popular formats, and that they often seem easy to implement, which results in a lot of junk formats floating around.

πŸ‘€crabboneπŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I downvoted you, not because I disagree with your point (although I do disagree with it), but because of your peremptory and dismissive use of "STOP" and "level one challenge."

This issue is not so straightforward that you should be rudely stating that people with other views should just stop and go back to the drawing board. Not that I think one should ever do that, even if the issue were simple.

πŸ‘€propter_hocπŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Excel is ubiquitous and makes CSV easy for humans to read.

Machines handle the 99.99% of the time that things are working. The 0.01% when humans need to get into the loop is when something goes wrong, and life is better when you can read the data to figure out what went wrong. JSON is amazing for this. CSV not so much, but it's still better than a binary format.
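
As a small, made-up illustration of that point, a compact JSON payload is one pretty-print away from being scannable by eye (a sketch using Python's standard json module; the payload itself is invented):

    import json

    # A compact payload as it might arrive on the wire (example data is made up).
    blob = '{"order":123,"items":[{"sku":"A-1","qty":2}],"status":"failed"}'

    # Pretty-printing makes the field that went wrong easy to spot by eye.
    print(json.dumps(json.loads(blob), indent=2, sort_keys=True))

python -m json.tool does the same thing straight from the command line.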

Machine efficiency is not the most valuable metric. Machines are built for human use, so the easier a system is for humans to use, the better, even if it's orders of magnitude less efficient.

πŸ‘€xupybdπŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Accounting wants to work with a portion of the Parquet database. Your options are: give them data in CSV/Excel formats, or hire a dev to work for accounting full time and still have them constantly complaining about having to wait on the dev to get them the data they need.

Programmers, data scientists, and people in other technical roles should never process data in Excel, but you can't expect that of the whole organization, so exporting to it is important. At this point, I'd recommend XLSX as the export format because it's harder to mess up the encoding, but libraries exist for CSV and other formats as well.
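
A minimal sketch of that kind of export, assuming pandas with pyarrow (for Parquet) and openpyxl (for .xlsx) installed; the file and column names here are placeholders, not anything from a real pipeline:

    import pandas as pd

    # Read the Parquet data (pandas needs pyarrow or fastparquet for this).
    df = pd.read_parquet("report.parquet")            # placeholder file name

    # Hand accounting only the slice they asked for; the column name is made up.
    subset = df[df["department"] == "accounting"]

    # XLSX preserves encodings and cell types better than a CSV that gets
    # double-clicked open in Excel.
    subset.to_excel("for_accounting.xlsx", index=False)

    # If CSV is unavoidable, write UTF-8 with a BOM so Excel detects the encoding.
    subset.to_csv("for_accounting.csv", index=False, encoding="utf-8-sig")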

πŸ‘€OlreichπŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I think the issue is more one of recovering misformatted or incomplete data when something goes wrong, or otherwise understanding what's in the file if reading it in one format doesn't work as intended (e.g. if the file format goes out of use and maintainers are unfamiliar with it). CSV has issues, for sure, and yet you have posted content that's readable at some level as text. It's not a quasirandom selection of Unicode characters that looks like some private key.

I think these types of examples can be misleading because they take a pathological CSV example and don't present the comparison case, which is a typical binary format. Sure, your example is a problem, but what if it were in a nested binary format? Would that be easier for humans to read if the file were mangled? (It's also often harder to resolve formatting errors from just one record, so that's misleading too.)

I disagree with the idea that humans might want to look inside the file with a text editor or other software only "0.001%" of the time. Sure, in some settings it's rare, but in others it's pretty common.

CSV is a bit of a misnomer anyway, because the field separator can really be anything, so you should be using a separator that is extremely unlikely to appear in a field.
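
For what it's worth, a sketch of that idea with Python's csv module, using the ASCII unit separator (0x1F) as the delimiter; the file name and data are made up:

    import csv

    rows = [
        ["id", "note"],
        ["1", 'contains , commas and "quotes" without trouble'],
    ]

    # 0x1F (the ASCII unit separator) essentially never occurs in real text,
    # so ordinary commas in the data never collide with the delimiter.
    with open("data.usv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\x1f").writerows(rows)

    with open("data.usv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\x1f"):
            print(row)

Tab-separated is the less exotic version of the same trick, and tools like cut and awk already understand it.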

Anyway, I'm not actually trying to defend CSV as the ideal; I think it is way, way overused, in part because of the issues you raise. But I also think that being easy for humans to read, in a failsafe file format (usually text), is an important criterion.

I guess in general I haven't been convinced that any of the data format proposals that get some level of publicity are especially better than the others. For some use cases, probably, but not for all of them. Sure, that's how it works, but I wish there were better options getting traction or attention. It seems like when I do see something promising it often doesn't get off the ground, and there isn't a robust competition of ideas in the space. Often the reading and writing software drives data format use; that's not necessarily my preference, but it seems to often be the case.

πŸ‘€derbOacπŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

HTML and JSON are not good examples of text formats designed with security or performance in mind.

πŸ‘€KinranyπŸ•‘2yπŸ”Ό0πŸ—¨οΈ0