-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
allow '\0' '\n' in quoted columns #91
base: master
Are you sure you want to change the base?
allow '\0' '\n' in quoted columns #91
Conversation
orasraf
commented
Dec 18, 2019
- main addition is templating LineReader to allow it the following functionality: next_line() will stop on the first end_line char it finds. this is not ideal since quoted column may include a line_end char which will make the new_line() return prematurely. in addition, find_next_column_end() will return prematurely as will if it encounters a null-char inside a quoted column. my usage of this parser happens to include both end-line and null chars (and other non-ascii chars) inside quoted columns and I had to add those changes to make it work. I added a template argument [null_terminated] to the double-quote policy that by default will make it behave the same as before the change. but if changed to 'false', it will allow '\0' inside quoted column instead of throwing exception. 2. moved no_quote_escape policy up (compiler warning). 3. cast 2 methods to int to match return type. (also compiler warning)
…unctionality: next_line() will stop on the first end_line char it finds. this is not ideal since quoted column may include a line_end char which will make the new_line() return prematurely. in addition, find_next_column_end() will return prematurely as will if it encounters a null-char inside a quoted column. my usage of this parser happens to include both end-line and null chars (and other non-ascii chars) inside quoted columns and I had to add those changes to make it work. 2. moved no_quote_escape policy up (compiler warning). 3. cast 2 methods to int to match return type. (also compiler warning)
namespace io{ | ||
//////////////////////////////////////////////////////////////////////////// | ||
// LineReader // | ||
//////////////////////////////////////////////////////////////////////////// | ||
|
||
template<char sep, bool null_terminated = true> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null_terminated by default is true which makes no difference for existing user code.
@@ -314,7 +337,7 @@ namespace io{ | |||
int desired_byte_count; | |||
}; | |||
} | |||
|
|||
template<class quote_policy = no_quote_escape<','>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made LineReader templated so it could call the quote policy functions
quote_policy::hasQuote();
quote_policy::getQuoteChar();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This breaks code using LineReader directly. Previously, people could write io::LineReader my_reader. With this change, they have to write io::LineReader<> my_reader. This is a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about the following then:
change
template<...>
class io::LineReader
to
template<...>
class io::LineReaderT (TBD)
and add
typedef LineReaderT<> LineReader
This way the change will not break anything and allow LineReader to find the next line with regards to quoting policy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the only option, if LineReader must be changed. A downside of this is that error message get more complex as users are now confronted with a type name they did not use themselves. This is the same situation as std::string vs std::basic_string<char, ...>.
A different option would be to have two LineReaders and make them an argument to CSVReader
int line_end = data_begin; | ||
while(buffer[line_end] != '\n' && line_end != data_end){ | ||
while( (quoted_data || buffer[line_end] != '\n') && line_end != data_end){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when using double quote policy, new_line() should not stop prematurely on the first \n it sees but rather on the first un-quoted \n.
if(*col_begin != quote) | ||
++col_begin; | ||
else{ | ||
do{ | ||
++col_begin; | ||
while(*col_begin != quote){ | ||
if(*col_begin == '\0') | ||
if(( null_terminated && *col_begin == '\0')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if null_terminated template flag is false, encountering a '\0' inside a quoted column is not considered an escaped_string_not_closed error.
There is a use-case where quoted columns can have binary data inside them.
return true; | ||
} | ||
static char getQuoteChar() { | ||
return quote; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reduces the flexibility given by the policy system. For example, before it was possible to write a policy that parses a,b,c,d as three columns with values "a", "b,c" and "d".
I doubt anybody used this freedom, but it is an introduced restriction.
@@ -753,30 +787,26 @@ namespace io{ | |||
} | |||
}; | |||
|
|||
template<char sep> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where did no_quote_escape go? Removing this is a no-go. 90% of CSV files are just numbers without quotes. Parsing these must be fast and therefore the default policy setting should not involve any quote handling. Without this policy, this is no longer possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was moved up since some compilers are giving error for the declaration location.
this change can be omitted
while(buffer[line_end] != '\n' && line_end != data_end){ | ||
while( (quoted_data || buffer[line_end] != '\n') && line_end != data_end){ | ||
if (hasQuote && buffer[line_end] == quoteChar) { | ||
quoted_data = !quoted_data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work, if somebody has a backslash-quote-policy? As far as I see, you are counting the number of quotes. This happens to work for double quote escape but fails for backslash quote escape.
If I understand the code correctly, then you do no do any Windows- to Linux-newline conversion inside quoted strings. This seems like a problem in the making. Suppose you have an application that writes \n and expects \n when reading it in. Next somebody open the csv file under Windows using the wrong editor, change a single digit somewhere, and saves again. Now all Linux newlines got changed to Windows \r\n newlines. When reading the CSV file back in again, your program fails as it expects \n. The typical Windows user is not aware of the \r\n and \n distinction. All he sees is your application failing. Now you get a cryptic "I change one number and the program broke!" bug report without any mention of Windows nor the editor used. I do not want to debug that. Not allowing newlines in quoted seems bad at first, but it does prevent people from running into the problem above. If the library supported automatic \r\n translation in strings, then I am certain that somebody would come with a usecase where this is the wrong behavior. This means that there should be a policy to do newline translation where \r\n -> \n translation should be the default. |