allow '\0' '\n' in quoted columns #91

orasraf · 2019-12-18T14:57:29Z

main addition is templating LineReader to allow it the following functionality: next_line() will stop on the first end_line char it finds. this is not ideal since quoted column may include a line_end char which will make the new_line() return prematurely. in addition, find_next_column_end() will return prematurely as will if it encounters a null-char inside a quoted column. my usage of this parser happens to include both end-line and null chars (and other non-ascii chars) inside quoted columns and I had to add those changes to make it work. I added a template argument [null_terminated] to the double-quote policy that by default will make it behave the same as before the change. but if changed to 'false', it will allow '\0' inside quoted column instead of throwing exception. 2. moved no_quote_escape policy up (compiler warning). 3. cast 2 methods to int to match return type. (also compiler warning)

…unctionality: next_line() will stop on the first end_line char it finds. this is not ideal since quoted column may include a line_end char which will make the new_line() return prematurely. in addition, find_next_column_end() will return prematurely as will if it encounters a null-char inside a quoted column. my usage of this parser happens to include both end-line and null chars (and other non-ascii chars) inside quoted columns and I had to add those changes to make it work. 2. moved no_quote_escape policy up (compiler warning). 3. cast 2 methods to int to match return type. (also compiler warning)

orasraf · 2019-12-18T15:24:01Z

csv.h

 namespace io{
        ////////////////////////////////////////////////////////////////////////////
        //                                 LineReader                             //
        ////////////////////////////////////////////////////////////////////////////

+        template<char sep, bool null_terminated = true>


null_terminated by default is true which makes no difference for existing user code.

orasraf · 2019-12-18T15:25:46Z

csv.h

@@ -314,7 +337,7 @@ namespace io{
                        int desired_byte_count;
                };
        }
-
+		template<class quote_policy = no_quote_escape<','>>


made LineReader templated so it could call the quote policy functions
quote_policy::hasQuote();
quote_policy::getQuoteChar();

This breaks code using LineReader directly. Previously, people could write io::LineReader my_reader. With this change, they have to write io::LineReader<> my_reader. This is a breaking change.

How about the following then:
change
template<...>
class io::LineReader
to
template<...>
class io::LineReaderT (TBD)

and add
typedef LineReaderT<> LineReader

This way the change will not break anything and allow LineReader to find the next line with regards to quoting policy?

That is the only option, if LineReader must be changed. A downside of this is that error message get more complex as users are now confronted with a type name they did not use themselves. This is the same situation as std::string vs std::basic_string<char, ...>.

A different option would be to have two LineReaders and make them an argument to CSVReader

orasraf · 2019-12-18T15:26:55Z

csv.h

                        int line_end = data_begin;
-                        while(buffer[line_end] != '\n' && line_end != data_end){
+                        while( (quoted_data || buffer[line_end] != '\n') && line_end != data_end){


when using double quote policy, new_line() should not stop prematurely on the first \n it sees but rather on the first un-quoted \n.

orasraf · 2019-12-18T15:28:56Z

csv.h

                                if(*col_begin != quote)
                                        ++col_begin;
                                else{
                                        do{
                                                ++col_begin;
                                                while(*col_begin != quote){
-                                                        if(*col_begin == '\0')
+                                                        if(( null_terminated && *col_begin == '\0'))


if null_terminated template flag is false, encountering a '\0' inside a quoted column is not considered an escaped_string_not_closed error.
There is a use-case where quoted columns can have binary data inside them.

ben-strasser · 2019-12-18T20:58:15Z

csv.h

+					return true;
+				}
+				static char getQuoteChar() {
+					return quote;


This reduces the flexibility given by the policy system. For example, before it was possible to write a policy that parses a,b,c,d as three columns with values "a", "b,c" and "d".

I doubt anybody used this freedom, but it is an introduced restriction.

ben-strasser · 2019-12-18T20:59:51Z

csv.h

@@ -753,30 +787,26 @@ namespace io{
                }
        };

-        template<char sep>


Where did no_quote_escape go? Removing this is a no-go. 90% of CSV files are just numbers without quotes. Parsing these must be fast and therefore the default policy setting should not involve any quote handling. Without this policy, this is no longer possible.

It was moved up since some compilers are giving error for the declaration location.
this change can be omitted

ben-strasser · 2019-12-18T21:02:28Z

csv.h

-                        while(buffer[line_end] != '\n' && line_end != data_end){
+                        while( (quoted_data || buffer[line_end] != '\n') && line_end != data_end){
+							if (hasQuote && buffer[line_end] == quoteChar) {
+								quoted_data = !quoted_data;


How does this work, if somebody has a backslash-quote-policy? As far as I see, you are counting the number of quotes. This happens to work for double quote escape but fails for backslash quote escape.

ben-strasser · 2019-12-21T09:37:05Z

If I understand the code correctly, then you do no do any Windows- to Linux-newline conversion inside quoted strings. This seems like a problem in the making.

Suppose you have an application that writes \n and expects \n when reading it in. Next somebody open the csv file under Windows using the wrong editor, change a single digit somewhere, and saves again. Now all Linux newlines got changed to Windows \r\n newlines. When reading the CSV file back in again, your program fails as it expects \n.

The typical Windows user is not aware of the \r\n and \n distinction. All he sees is your application failing. Now you get a cryptic "I change one number and the program broke!" bug report without any mention of Windows nor the editor used. I do not want to debug that.

Not allowing newlines in quoted seems bad at first, but it does prevent people from running into the problem above.

If the library supported automatic \r\n translation in strings, then I am certain that somebody would come with a usecase where this is the wrong behavior. This means that there should be a policy to do newline translation where \r\n -> \n translation should be the default.

orasraf commented Dec 18, 2019

View reviewed changes

ben-strasser reviewed Dec 18, 2019

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

allow '\0' '\n' in quoted columns #91

allow '\0' '\n' in quoted columns #91

Uh oh!

orasraf commented Dec 18, 2019

Uh oh!

orasraf Dec 18, 2019

Uh oh!

orasraf Dec 18, 2019

Uh oh!

ben-strasser Dec 18, 2019

Uh oh!

orasraf Dec 19, 2019

Uh oh!

ben-strasser Dec 21, 2019

Uh oh!

orasraf Dec 18, 2019

Uh oh!

orasraf Dec 18, 2019

Uh oh!

ben-strasser Dec 18, 2019

Uh oh!

ben-strasser Dec 18, 2019

Uh oh!

orasraf Dec 19, 2019

Uh oh!

ben-strasser Dec 18, 2019

Uh oh!

ben-strasser commented Dec 21, 2019

Uh oh!

Uh oh!

allow '\0' '\n' in quoted columns #91

Are you sure you want to change the base?

allow '\0' '\n' in quoted columns #91

Uh oh!

Conversation

orasraf commented Dec 18, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ben-strasser commented Dec 21, 2019

Uh oh!

Uh oh!