Difference between revisions of "Bash CSV Handling"
(→Quoted Strings with Commas) |
(→Quoted Strings with Commas) |
||
| Line 1: | Line 1: | ||
I'm sick of screwing this up, so I'm creating this definitive guide for my own reference. I will update this as I learn to handle individual issues. | I'm sick of screwing this up, so I'm creating this definitive guide for my own reference. I will update this as I learn to handle individual issues. | ||
| + | |||
| + | |||
| + | For the following examples, I will use this sample CSV file <code>test.csv</code> (assuming Unix line ending "\n"). I've included several "problems" in it. | ||
| + | <pre> | ||
| + | id,firstname,lastname,email,phone,hiredate | ||
| + | 1111,"Colin","Quinn","c.quinn@example.com",1112223333,"02/12/2017" | ||
| + | 2222,"Richard","Vos, Richard","r.vos@example.com",2223334444,"06/27/2019" | ||
| + | 3333,"Patrice","O'Neal","p.o'neal@example.com",3334445555,"08/11/2018" | ||
| + | 4444,"Jim","Jefferies","j.jefferies@example.com",,"09/11/2001" | ||
| + | </pre> | ||
= Quoted Strings with Commas = | = Quoted Strings with Commas = | ||
| Line 5: | Line 15: | ||
When your fields are separated by commas, and one of the fields has a quoted string with a comma in it, it can screw up the parsing of the fields. | When your fields are separated by commas, and one of the fields has a quoted string with a comma in it, it can screw up the parsing of the fields. | ||
| − | + | The problem in the sample file is in the second line. Someone accidentally entered "Vos, Richard" in a field that is only supposed to contain the last name "Vos". This kind of clerical error happens all the time. Not only is it incorrect, it has a comma in the text which is the character we use to delimit the fields. | |
| − | <pre> | + | |
| − | id | + | if we try to naively break out the fields into separate columns, we see a problem: |
| − | + | <source lang="shell"> | |
| − | + | cat test.csv | column -t -s "," | |
| − | + | </source> | |
| − | + | ||
| + | <pre class="result"> | ||
| + | id firstname lastname email phone hiredate | ||
| + | 1111 "Colin" "Quinn" "c.quinn@example.com" 1112223333 "02/12/2017" | ||
| + | 2222 "Richard" "Vos Richard" "r.vos@example.com" 2223334444 "06/27/2019" | ||
| + | 3333 "Patrice" "O'Neal" "p.o'neal@example.com" 3334445555 "08/11/2018" | ||
| + | 4444 "Jim" "Jefferies" "j.jefferies@example.com" "09/11/2001" | ||
| + | </pre> | ||
| + | |||
| + | The comma in the lastname field was identified as a field as delimiter (field separator), so now the columns on the second line are shifted to the right. | ||
| + | |||
| + | The best way I've found to handle this without using a utility that is specifically for CSV parsing is this: | ||
| + | <source lang="shell"> | ||
| + | cat test.csv | sed 's/,\("[^"]*"\)*/<FS>\1/g' | column -t -s "<FS>" | ||
| + | </source> | ||
| + | |||
| + | <pre class="result"> | ||
| + | id firstname lastname email phone hiredate | ||
| + | 1111 "Colin" "Quinn" "c.quinn@example.com" 1112223333 "02/12/2017" | ||
| + | 2222 "Richard" "Vos, Richard" "r.vos@example.com" 2223334444 "06/27/2019" | ||
| + | 3333 "Patrice" "O'Neal" "p.o'neal@example.com" 3334445555 "08/11/2018" | ||
| + | 4444 "Jim" "Jefferies" "j.jefferies@example.com" "09/11/2001" | ||
</pre> | </pre> | ||
Revision as of 14:22, 27 June 2019
I'm sick of screwing this up, so I'm creating this definitive guide for my own reference. I will update this as I learn to handle individual issues.
For the following examples, I will use this sample CSV file test.csv (assuming Unix line ending "\n"). I've included several "problems" in it.
id,firstname,lastname,email,phone,hiredate 1111,"Colin","Quinn","c.quinn@example.com",1112223333,"02/12/2017" 2222,"Richard","Vos, Richard","r.vos@example.com",2223334444,"06/27/2019" 3333,"Patrice","O'Neal","p.o'neal@example.com",3334445555,"08/11/2018" 4444,"Jim","Jefferies","j.jefferies@example.com",,"09/11/2001"
Quoted Strings with Commas
When your fields are separated by commas, and one of the fields has a quoted string with a comma in it, it can screw up the parsing of the fields.
The problem in the sample file is in the second line. Someone accidentally entered "Vos, Richard" in a field that is only supposed to contain the last name "Vos". This kind of clerical error happens all the time. Not only is it incorrect, it has a comma in the text which is the character we use to delimit the fields.
if we try to naively break out the fields into separate columns, we see a problem:
cat test.csv | column -t -s ","
id firstname lastname email phone hiredate 1111 "Colin" "Quinn" "c.quinn@example.com" 1112223333 "02/12/2017" 2222 "Richard" "Vos Richard" "r.vos@example.com" 2223334444 "06/27/2019" 3333 "Patrice" "O'Neal" "p.o'neal@example.com" 3334445555 "08/11/2018" 4444 "Jim" "Jefferies" "j.jefferies@example.com" "09/11/2001"
The comma in the lastname field was identified as a field as delimiter (field separator), so now the columns on the second line are shifted to the right.
The best way I've found to handle this without using a utility that is specifically for CSV parsing is this:
cat test.csv | sed 's/,\("[^"]*"\)*/<FS>\1/g' | column -t -s "<FS>"
id firstname lastname email phone hiredate 1111 "Colin" "Quinn" "c.quinn@example.com" 1112223333 "02/12/2017" 2222 "Richard" "Vos, Richard" "r.vos@example.com" 2223334444 "06/27/2019" 3333 "Patrice" "O'Neal" "p.o'neal@example.com" 3334445555 "08/11/2018" 4444 "Jim" "Jefferies" "j.jefferies@example.com" "09/11/2001"