Bash CSV Handling

From TheBeard Science Project Wiki
Revision as of 14:22, 27 June 2019 by Beard (talk | contribs) (Quoted Strings with Commas)

Jump to: navigation, search

I'm sick of screwing this up, so I'm creating this definitive guide for my own reference. I will update this as I learn to handle individual issues.


For the following examples, I will use this sample CSV file test.csv (assuming Unix line ending "\n"). I've included several "problems" in it.

id,firstname,lastname,email,phone,hiredate
1111,"Colin","Quinn","c.quinn@example.com",1112223333,"02/12/2017"
2222,"Richard","Vos, Richard","r.vos@example.com",2223334444,"06/27/2019"
3333,"Patrice","O'Neal","p.o'neal@example.com",3334445555,"08/11/2018"
4444,"Jim","Jefferies","j.jefferies@example.com",,"09/11/2001"

Quoted Strings with Commas

When your fields are separated by commas, and one of the fields has a quoted string with a comma in it, it can screw up the parsing of the fields.

The problem in the sample file is in the second line. Someone accidentally entered "Vos, Richard" in a field that is only supposed to contain the last name "Vos". This kind of clerical error happens all the time. Not only is it incorrect, it has a comma in the text which is the character we use to delimit the fields.

if we try to naively break out the fields into separate columns, we see a problem:

cat test.csv | column -t -s ","
id    firstname  lastname     email                      phone                hiredate
1111  "Colin"    "Quinn"      "c.quinn@example.com"      1112223333           "02/12/2017"
2222  "Richard"  "Vos          Richard"                  "r.vos@example.com"  2223334444    "06/27/2019"
3333  "Patrice"  "O'Neal"     "p.o'neal@example.com"     3334445555           "08/11/2018"
4444  "Jim"      "Jefferies"  "j.jefferies@example.com"  "09/11/2001"

The comma in the lastname field was identified as a field as delimiter (field separator), so now the columns on the second line are shifted to the right.

The best way I've found to handle this without using a utility that is specifically for CSV parsing is this:

cat test.csv | sed 's/,\("[^"]*"\)*/<FS>\1/g' | column -t -s "<FS>"
id    firstname  lastname        email                      phone         hiredate
1111  "Colin"    "Quinn"         "c.quinn@example.com"      1112223333    "02/12/2017"
2222  "Richard"  "Vos, Richard"  "r.vos@example.com"        2223334444    "06/27/2019"
3333  "Patrice"  "O'Neal"        "p.o'neal@example.com"     3334445555    "08/11/2018"
4444  "Jim"      "Jefferies"     "j.jefferies@example.com"  "09/11/2001"