Update to fix bug for csv files with real commas in text #214

ianrowan · 2020-06-01T16:12:12Z

Currently, `load_dataset` reads a CSV row such as `I ate pie, ice cream, and cookies\n` (line 37) into `reader` as `[["I ate pie", "ice cream", "and cookies"]]`, because the commas are treated as column delimiters.

When `reader[0]` is iterated as `row` in line 39, only `row[0]` (`I ate pie`) is used, which yields `<|startoftext|>I ate pie<|endoftext|>`.
The rest of the text is dropped because the commas split the row into multiple list items.

The proposed change joins the entire row list so that the full text of each CSV row is used. In this case, using `''.join(row)` yields `<|startoftext|>I ate pie ice cream and cookies<|endoftext|>`, since it collapses the list into a single string. I have tested both scenarios in the code base.
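For reviewers, here is a minimal standalone sketch of the behavior described above. It assumes the file is parsed with Python's standard `csv.reader` using default settings; the helper names `wrap_first_column` and `wrap_joined_row` are hypothetical and only illustrate the before and after, they are not the actual code in `load_dataset`.

```python
import csv
import io

START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"


def wrap_first_column(rows):
    # Hypothetical stand-in for the current behavior: only row[0] is kept.
    return [START_TOKEN + row[0] + END_TOKEN for row in rows]


def wrap_joined_row(rows):
    # Hypothetical stand-in for the proposed behavior: all fields are joined.
    return [START_TOKEN + "".join(row) + END_TOKEN for row in rows]


# One unquoted CSV row whose text contains real commas.
sample = io.StringIO("I ate pie, ice cream, and cookies\n")

# Assumption: default csv.reader settings, so each comma starts a new field
# and the space after each comma stays at the start of the next field.
rows = list(csv.reader(sample))

before = wrap_first_column(rows)
after = wrap_joined_row(rows)

print(rows)
print(before[0])
print(after[0])
```

With default settings, `csv.reader` keeps the space that follows each comma, which is why the joined text still reads naturally. The commas themselves are not restored by `''.join(row)`; joining with `','` instead would be one way to keep them, if that is preferred.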

