Question:
How do I parse a CSV file where the field separator could be part of a field?
ABC
2014-09-29 15:54:59 UTC
I'm trying to parse a CSV file in C++. The field separator is a comma (,), but some fields may also contain a comma as part of the string rather than as a separator. Fields that contain a comma are typically surrounded by quotation marks (like any string), while fields that do not contain a comma are generally (but not necessarily) unquoted.

I am currently reading from the CSV file with an ifstream and using getline to parse the input stream using a comma as the delimiter. This works in most cases, except when the field contains a comma, in which case it mistakenly divides what should be one field/string into two. How can I resolve this problem?
Five answers:
husoski
2014-09-29 18:01:40 UTC
Who thumbed down Tanisha's answer? It's right on the money.



You can't use getline to do the parsing. (...but you can use it to get a full line into a string so your parsing doesn't inadvertently wrap across line boundaries.)
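To illustrate the "get a full line into a string" part, here is a minimal sketch. The function name read_records is made up for the example, and it assumes no quoted field contains an embedded newline (if one does, a record spans several physical lines and this breaks).

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read every physical line of the stream as one record.
// Assumption: no quoted field contains a line break.
std::vector<std::string> read_records(std::istream& in) {
    std::vector<std::string> records;
    std::string line;
    while (std::getline(in, line)) {   // default delimiter is '\n', not ','
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();           // tolerate Windows-style CRLF endings
        }
        records.push_back(line);
    }
    return records;
}
```

Each returned string still contains its commas and quotes, so it can be handed to whatever field parser you end up writing. The same function works on an ifstream, since it only needs a std::istream.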



However, you can't write the parser unless you know enough about the field syntax to know if a comma is part of the field or not. That was Tanisha's point, as I read it.



The most powerful tool in the standard library is regular expressions from the <regex> library.

http://www.cplusplus.com/reference/regex/



However: Color me old fashioned, but for something that could be done with a simple loop and five ifs or less, I'd rather debug C++ than a regex after re-learning the nuances of regex, regex_match, match_results, sub_match, and maybe more, and then figure out how to generate sensible error messages when things go wrong.



However, if you plan to handle different syntax styles (maybe based on a command line option, or maybe auto-sensing by trying all known options to see which gives the best results) then a regex solution could be very attractive.
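For comparison, a regex version might look roughly like the sketch below. It is only one possible pattern, not a full CSV grammar: the name split_with_regex is invented for the example, and it assumes quoted fields contain no escaped quotes at all.

```cpp
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: split one line using std::regex.
// Assumption: a quoted field never contains a quote character.
std::vector<std::string> split_with_regex(const std::string& line) {
    // Group 1: contents of a quoted field. Group 2: an unquoted field.
    // Group 3: the terminator, either a comma or the end of the line.
    static const std::regex field_re(R"re((?:"([^"]*)"|([^,]*))(,|$))re");

    std::vector<std::string> fields;
    std::string::const_iterator pos = line.cbegin();
    std::smatch m;
    while (true) {
        // match_continuous forces the match to start exactly at pos.
        if (!std::regex_search(pos, line.cend(), m, field_re,
                               std::regex_constants::match_continuous)) {
            // Defensive only: the unquoted alternative accepts any text.
            throw std::runtime_error("malformed CSV line");
        }
        fields.push_back(m[1].matched ? m[1].str() : m[2].str());
        pos = m[0].second;               // continue after the terminator
        if (m[3].length() == 0) break;   // terminator was the end of the line
    }
    return fields;
}
```

Note that a field with an unbalanced quote falls through to the unquoted alternative and keeps its quote character, which is one example of the error-reporting headaches mentioned above.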



IMHO, the hardest part is not the quotes, but how quotes-within-quotes are "escaped". There are two common styles. Use \" or \' as in C, or double the quote character as "" or ''. I don't know which to expect in a CSV.
?
2014-09-30 03:07:19 UTC
Go through the file character by character, keep track of whether you have parsed an odd number of quotes or not, and if yes, ignore a comma you find as not being a delimiter.
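A minimal sketch of that idea, applied to one line at a time. The function name split_by_quote_parity is made up for the example, and it assumes the quote character is " and that quotes are never escaped inside a field.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split one CSV line on commas, treating commas seen
// after an odd number of quotes as part of the field.
// Assumption: no escaped quotes inside fields.
std::vector<std::string> split_by_quote_parity(const std::string& line) {
    std::vector<std::string> fields;
    std::string current;
    bool inside_quotes = false;   // true while the quote count is odd
    for (char c : line) {
        if (c == '"') {
            inside_quotes = !inside_quotes;   // the quote itself is dropped
        } else if (c == ',' && !inside_quotes) {
            fields.push_back(current);        // a real delimiter ends the field
            current.clear();
        } else {
            current += c;                     // includes commas inside quotes
        }
    }
    fields.push_back(current);                // last field has no trailing comma
    return fields;
}
```

For the line a,"b,c",d this gives three fields: a, then b,c, then d.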
justme
2014-09-30 07:42:43 UTC
The simplest way is to keep track of the quotes, and set a bit or increment a counter whenever you hit one. If the bit is set (or counter is odd) ignore a comma (consider it part of the value) if you see one. If the bit is cleared (or counter is even) consider the comma a delimiter.



As far as quotes within quotes go, you need to look at the preceding character (I store it) and if the quote is preceded by \ or " don't flip the bit or increment the counter.



In all of this, a lone " is not included as part of the value; if you see \" or "", include just a single " as part of the value.
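Here is a rough sketch of those rules. One deliberate difference: it looks ahead at the next character instead of storing the preceding one, because with a stored previous character an empty quoted field ("") looks the same as an escaped quote. The name parse_csv_line is invented for the example, and accepting both \" and "" inside quotes is an assumption about the input, not something every CSV file follows.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split one CSV line, handling quoted fields and
// escaped quotes written as \" or "" (both accepted, by assumption).
std::vector<std::string> parse_csv_line(const std::string& line) {
    std::vector<std::string> fields;
    std::string current;
    bool inside_quotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        const bool next_is_quote = i + 1 < line.size() && line[i + 1] == '"';
        if (inside_quotes) {
            if ((c == '\\' || c == '"') && next_is_quote) {
                current += '"';          // \" or "" yields one literal quote
                ++i;                     // skip the second character of the pair
            } else if (c == '"') {
                inside_quotes = false;   // lone quote closes the field
            } else {
                current += c;            // commas in here are part of the value
            }
        } else if (c == '"') {
            inside_quotes = true;        // opening quote is not stored
        } else if (c == ',') {
            fields.push_back(current);   // delimiter outside quotes
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);
    return fields;
}
```

It does not report an error for an unterminated quote; the rest of the line simply ends up in the last field.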



Done CSV file parsing before.
Ivan N
2014-09-29 16:06:10 UTC
I think putting an escape character before a comma that is part of a string should ensure that it is not interpreted as a delimiter.
Tanisha
2014-09-29 15:59:19 UTC
Magic or prayer - without delimiters or other context which describes the data field how can any algorithm decide whether a character is a delimiter or not?


This content was originally posted on Y! Answers, a Q&A website that shut down in 2021.