Difference between revisions of "Bash CSV Handling"

From TheBeard Science Project Wiki
Jump to: navigation, search
(Quoted Strings with Commas)
(Quoted Strings with Commas)
Line 1: Line 1:
 
I'm sick of screwing this up, so I'm creating this definitive guide for my own reference. I will update this as I learn to handle individual issues.
 
I'm sick of screwing this up, so I'm creating this definitive guide for my own reference. I will update this as I learn to handle individual issues.
 +
 +
 +
For the following examples, I will use this sample CSV file <code>test.csv</code> (assuming Unix line ending "\n"). I've included several "problems" in it.
 +
<pre>
 +
id,firstname,lastname,email,phone,hiredate
 +
1111,"Colin","Quinn","c.quinn@example.com",1112223333,"02/12/2017"
 +
2222,"Richard","Vos, Richard","r.vos@example.com",2223334444,"06/27/2019"
 +
3333,"Patrice","O'Neal","p.o'neal@example.com",3334445555,"08/11/2018"
 +
4444,"Jim","Jefferies","j.jefferies@example.com",,"09/11/2001"
 +
</pre>
  
 
= Quoted Strings with Commas =
 
= Quoted Strings with Commas =
Line 5: Line 15:
 
When your fields are separated by commas, and one of the fields has a quoted string with a comma in it, it can screw up the parsing of the fields.
 
When your fields are separated by commas, and one of the fields has a quoted string with a comma in it, it can screw up the parsing of the fields.
  
Here is a sample CSV file <code>test.csv</code>:
+
The problem in the sample file is in the second line. Someone accidentally entered "Vos, Richard" in a field that is only supposed to contain the last name "Vos". This kind of clerical error happens all the time. Not only is it incorrect, it has a comma in the text which is the character we use to delimit the fields.
<pre>
+
 
id,firstname,lastname,email,phone,hiredate
+
if we try to naively break out the fields into separate columns, we see a problem:
1122,"Colin","Quinn","c.quinn@example.com",1112223333,"02/12/2017"
+
<source lang="shell">
2233,"Richard","Vos, Richard","r.vos@example.com",2223334444,"06/27/2019"
+
cat test.csv | column -t -s ","
3344,"Patrice","O'Neal","p.o'neal@example.com",3334445555,"08/11/2018"
+
</source>
4455,"Jim","Jefferies","j.jefferies@example.com",,"09/11/2001"
+
 
 +
<pre class="result">
 +
id    firstname  lastname    email                      phone                hiredate
 +
1111  "Colin"    "Quinn"      "c.quinn@example.com"      1112223333          "02/12/2017"
 +
2222  "Richard"  "Vos          Richard"                  "r.vos@example.com"  2223334444    "06/27/2019"
 +
3333  "Patrice"  "O'Neal"    "p.o'neal@example.com"    3334445555          "08/11/2018"
 +
4444  "Jim"      "Jefferies"  "j.jefferies@example.com"  "09/11/2001"
 +
</pre>
 +
 
 +
The comma in the lastname field was identified as a field as delimiter (field separator), so now the columns on the second line are shifted to the right.
 +
 
 +
The best way I've found to handle this without using a utility that is specifically for CSV parsing is this:
 +
<source lang="shell">
 +
cat test.csv | sed 's/,\("[^"]*"\)*/<FS>\1/g' | column -t -s "<FS>"
 +
</source>
 +
 
 +
<pre class="result">
 +
id   firstname lastname       email                     phone         hiredate
 +
1111  "Colin"   "Quinn"         "c.quinn@example.com"     1112223333   "02/12/2017"
 +
2222  "Richard" "Vos, Richard" "r.vos@example.com"       2223334444   "06/27/2019"
 +
3333  "Patrice" "O'Neal"       "p.o'neal@example.com"     3334445555   "08/11/2018"
 +
4444  "Jim"     "Jefferies"     "j.jefferies@example.com" "09/11/2001"
 
</pre>
 
</pre>

Revision as of 14:22, 27 June 2019

I'm sick of screwing this up, so I'm creating this definitive guide for my own reference. I will update this as I learn to handle individual issues.


For the following examples, I will use this sample CSV file test.csv (assuming Unix line ending "\n"). I've included several "problems" in it.

id,firstname,lastname,email,phone,hiredate
1111,"Colin","Quinn","c.quinn@example.com",1112223333,"02/12/2017"
2222,"Richard","Vos, Richard","r.vos@example.com",2223334444,"06/27/2019"
3333,"Patrice","O'Neal","p.o'neal@example.com",3334445555,"08/11/2018"
4444,"Jim","Jefferies","j.jefferies@example.com",,"09/11/2001"

Quoted Strings with Commas

When your fields are separated by commas, and one of the fields has a quoted string with a comma in it, it can screw up the parsing of the fields.

The problem in the sample file is in the second line. Someone accidentally entered "Vos, Richard" in a field that is only supposed to contain the last name "Vos". This kind of clerical error happens all the time. Not only is it incorrect, it has a comma in the text which is the character we use to delimit the fields.

if we try to naively break out the fields into separate columns, we see a problem:

cat test.csv | column -t -s ","
id    firstname  lastname     email                      phone                hiredate
1111  "Colin"    "Quinn"      "c.quinn@example.com"      1112223333           "02/12/2017"
2222  "Richard"  "Vos          Richard"                  "r.vos@example.com"  2223334444    "06/27/2019"
3333  "Patrice"  "O'Neal"     "p.o'neal@example.com"     3334445555           "08/11/2018"
4444  "Jim"      "Jefferies"  "j.jefferies@example.com"  "09/11/2001"

The comma in the lastname field was identified as a field as delimiter (field separator), so now the columns on the second line are shifted to the right.

The best way I've found to handle this without using a utility that is specifically for CSV parsing is this:

cat test.csv | sed 's/,\("[^"]*"\)*/<FS>\1/g' | column -t -s "<FS>"
id    firstname  lastname        email                      phone         hiredate
1111  "Colin"    "Quinn"         "c.quinn@example.com"      1112223333    "02/12/2017"
2222  "Richard"  "Vos, Richard"  "r.vos@example.com"        2223334444    "06/27/2019"
3333  "Patrice"  "O'Neal"        "p.o'neal@example.com"     3334445555    "08/11/2018"
4444  "Jim"      "Jefferies"     "j.jefferies@example.com"  "09/11/2001"