CSV Renderer may guess a TSV file is comma delimited even when tab delimited #16558

richmahn · 2021-07-27T14:28:40Z

Gitea version (or commit ref): 4b6370c (main)
Can you reproduce the bug at https://try.gitea.io:
- Yes (provide example URL)
- No

https://try.gitea.io/richmahn/test/src/branch/master/is_tsv_but_uses_commas.tsv

See the raw....it actually has 8 rows, yet because the "intro" row has many commas in its text, only rows that have NO commas (because the header has no commas), thus only one field, get shown.

Description

If a file is a .tsv file, which, while isn't a standard itself and thus uses CSV but with tab as the delimiter, but has more commas than tabs, the function here thinks the delimiter is comma, and thus all rows that have more fields than the header (which has no commas thus 1 field) get removed from the rendering.

I suggest, which I have done in my fork of Gitea, make a separate markup renderer like the "csv" one but call it "tsv" and always use \t as the delimiter due to the filename. Either that, or make sure the function that guesses the delimiter gets the filename of the file being rendered so it can determine from that.

Screenshots

Here is the blame for my .tsv file on try.gitea.io which shows it is tab delimited, with the 2nd row having LOTS of texts with commas:

Yet the table rendering removes many of the rows:

The text was updated successfully, but these errors were encountered:

lunny · 2021-07-27T15:21:38Z

So I think that's a bug of the function guesses the delimiter.

silverwind · 2021-07-27T15:23:17Z

make a separate markup renderer like the "csv" one but call it "tsv" and always use \t as the delimiter due to the filename

I would not make a separate renderer but simply skip over guessDelimiter in case the extension provides a strong indication of the delimiter in use. That way, we can extend with more similar extensions later.

One such extension may be .psv for Pipe Separated Values (|).

lunny · 2021-07-27T16:47:50Z

Hm, We should allow external renderer has their own parameters in app.ini.

richmahn · 2021-07-27T17:02:45Z

Yes, we need to have the filename available so that it can be use in determining which delimiter to use as well.

richmahn · 2021-07-27T17:06:18Z

make a separate markup renderer like the "csv" one but call it "tsv" and always use \t as the delimiter due to the filename

I would not make a separate renderer but simply skip over guessDelimiter in case the extension provides a strong indication of the delimiter in use. That way, we can extend with more similar extensions later.

One such extension may be .psv for Pipe Separated Values (|).

Yes, but the problem is currently there is no extension passed to the renderer, as it is invoked by the list of extensions it renders. I offered that as a solution though, if we can get it passed in.

richmahn · 2021-07-27T19:48:22Z

Hm, We should allow external renderer has their own parameters in app.ini.

@lunny What kind of parameters? You mean what delimiter to use if it is a tsv, a csv, or a psv? Still csv is actually the only standard, which can be any character delimited (some say csv means character-separated-values), so one repo might use it different than another. Yet tsv and psv are csv files yet kind of hint at the delimiter in their extension.

richmahn · 2021-07-27T19:57:27Z

From a coworker:

The history of this puzzles me a little @rich Mahn. If I was writing code to autodetect TSV vs CSV, I'd look at the first/header line. And that line only has the column headers and tabs and zero commas. Why would it need to look further? How many lines was the old code scanning?

I do wonder if the first row, whether a header or not, is enough to go by, rather than counting all tabs and commas and other delimiters ALL the content as it currently does. You'd think that would be using more tabs than commas, or commas than tabs especially if it is a real header. But even if not, seems like still the best way to score the characters.

Edit: I see now it scores based on the first 10 lines, yet as my example has, my 2nd row actually has a bunch of commas in the text. Headers wouldn't normally do that.

richmahn · 2021-07-27T20:19:29Z

@lunny @silverwind

I suggest two things:

get the file extension passed to the Renderer so it can go by that if it isn't just .csv
When it is .csv, to guess the delimiter, we could do it from the first 10k or 10 rows, but why not then quickly see if the top scoring one matches with all 10 rows with the same number of fields, if not, the 2nd, 3rd, or 4th (assuming score > 0) and use the one that works, and if none do, return the top scoring one. If all have no score, then return , (comma). (how well it matches each row can be part of the score)

Thoughts?

lunny · 2021-07-28T01:44:41Z

I agree both. We can read ten lines to detect the delimiter. But one we need notice that some column have been quoted with double quote and contains delimiter characters in the double quote. i.e.

a,b,"c,b,d,",e

It should be 4 columns but not 7. If we just simple count single quote, that will be wrong.

richmahn · 2021-07-28T12:51:52Z

@lunny yeah, I thought of that and well aware of that. I assume to actually use the encoding/cvs module with those 10 lines and the delimiter in question and see what it returns. We'd still be making a best guess, as some rows might have the wrong number of fields, but if it has a higher accuracy than other delimiters, will use that delimiter. That is why we can kind of do a score with the matching of the delimiter as well.

richmahn · 2021-07-29T20:04:56Z

So just to be clear, making a fix for this that should also work with the CSV diff feature which has the same bug.

zeripath · 2021-07-31T13:06:23Z

Why don't we provide a mechanism for choosing whether to guess the delimiter or not.

richmahn · 2021-08-05T13:35:52Z

@zeripath So some sort of user interface the repo file view page? Yet would still have to guess something when first showing the file.

richmahn · 2021-08-05T13:36:45Z

@zeripath Also, the DIFF view for .csv files has this bug, and thus you'd have to have a selection for each file diff.

silverwind · 2021-08-05T14:17:43Z

Why don't we provide a mechanism for choosing whether to guess the delimiter or not.

I wouldn't. This is a kind of thing that can be handled automatically and should not need configuration for the most common types (csv,tsv,psv).

#16558 (comment) sounds good to me.

zeripath · 2021-08-05T18:14:11Z

Every heuristic mechanism will get things wrong for some pathological case.

richmahn · 2021-10-20T19:28:14Z

I still need to look at this. Will see if I can work on this. Found another bug with CSV diffs in that if the diff is after 32 lines, it fails to show the CSV table.

* Fixes #16558 CSV delimiter determiner * Fixes #16558 - properly determine CSV delmiiter * Moves quoteString to a new function * Adds big test with lots of commas for tab delimited csv * Adds comments * Shortens the text of the test * Removes single quotes from regexp as only double quotes need to be searched * Fixes spelling * Fixes check of length as it probalby will only be 1e4, not greater * Makes sample size a const, properly removes truncated line * Makes sample size a const, properly removes truncated line * Fixes comment * Fixes comment * tests for FormatError() function * Adds logic to find the limiter before or after a quoted value * Simplifies regex * Error tests * Error tests * Update modules/csv/csv.go Co-authored-by: delvh <dev.lh@web.de> * Update modules/csv/csv.go Co-authored-by: delvh <dev.lh@web.de> * Adds comments * Update modules/csv/csv.go Co-authored-by: delvh <dev.lh@web.de> Co-authored-by: wxiaoguang <wxiaoguang@gmail.com> Co-authored-by: zeripath <art27@cantab.net> Co-authored-by: delvh <dev.lh@web.de>

* Fixes go-gitea#16558 CSV delimiter determiner * Fixes go-gitea#16558 - properly determine CSV delmiiter * Moves quoteString to a new function * Adds big test with lots of commas for tab delimited csv * Adds comments * Shortens the text of the test * Removes single quotes from regexp as only double quotes need to be searched * Fixes spelling * Fixes check of length as it probalby will only be 1e4, not greater * Makes sample size a const, properly removes truncated line * Makes sample size a const, properly removes truncated line * Fixes comment * Fixes comment * tests for FormatError() function * Adds logic to find the limiter before or after a quoted value * Simplifies regex * Error tests * Error tests * Update modules/csv/csv.go Co-authored-by: delvh <dev.lh@web.de> * Update modules/csv/csv.go Co-authored-by: delvh <dev.lh@web.de> * Adds comments * Update modules/csv/csv.go Co-authored-by: delvh <dev.lh@web.de> Co-authored-by: wxiaoguang <wxiaoguang@gmail.com> Co-authored-by: zeripath <art27@cantab.net> Co-authored-by: delvh <dev.lh@web.de>

richmahn self-assigned this Jul 27, 2021

richmahn changed the title ~~CSV Renderer thinks a TSV file comma delimited since many commas in field~~ CSV Renderer thinks a TSV file is comma delimited when many commas are in text Jul 27, 2021

richmahn changed the title ~~CSV Renderer thinks a TSV file is comma delimited when many commas are in text~~ CSV Renderer guesses a TSV file is comma delimited when many commas are in text Jul 27, 2021

richmahn changed the title ~~CSV Renderer guesses a TSV file is comma delimited when many commas are in text~~ CSV Renderer may guess a TSV file is comma delimited even when tab delimited Jul 27, 2021

richmahn added a commit to richmahn/gitea that referenced this issue Oct 27, 2021

Fixes go-gitea#16558 CSV delimiter determiner

fc8dd28

richmahn added a commit to richmahn/gitea that referenced this issue Oct 27, 2021

Fixes go-gitea#16558 - properly determine CSV delmiiter

56f0995

richmahn mentioned this issue Oct 27, 2021

Properly determine CSV delimiter #17459

Merged

lunny closed this as completed in #17459 Oct 30, 2021

go-gitea locked and limited conversation to collaborators Apr 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CSV Renderer may guess a TSV file is comma delimited even when tab delimited #16558

CSV Renderer may guess a TSV file is comma delimited even when tab delimited #16558

richmahn commented Jul 27, 2021

lunny commented Jul 27, 2021

silverwind commented Jul 27, 2021 •

edited

Loading

lunny commented Jul 27, 2021

richmahn commented Jul 27, 2021

richmahn commented Jul 27, 2021

richmahn commented Jul 27, 2021 •

edited

Loading

richmahn commented Jul 27, 2021 •

edited

Loading

richmahn commented Jul 27, 2021 •

edited

Loading

lunny commented Jul 28, 2021 •

edited

Loading

richmahn commented Jul 28, 2021

richmahn commented Jul 29, 2021

zeripath commented Jul 31, 2021

richmahn commented Aug 5, 2021

richmahn commented Aug 5, 2021

silverwind commented Aug 5, 2021

zeripath commented Aug 5, 2021

richmahn commented Oct 20, 2021

CSV Renderer may guess a TSV file is comma delimited even when tab delimited #16558

CSV Renderer may guess a TSV file is comma delimited even when tab delimited #16558

Comments

richmahn commented Jul 27, 2021

Description

Screenshots

lunny commented Jul 27, 2021

silverwind commented Jul 27, 2021 • edited Loading

lunny commented Jul 27, 2021

richmahn commented Jul 27, 2021

richmahn commented Jul 27, 2021

richmahn commented Jul 27, 2021 • edited Loading

richmahn commented Jul 27, 2021 • edited Loading

richmahn commented Jul 27, 2021 • edited Loading

lunny commented Jul 28, 2021 • edited Loading

richmahn commented Jul 28, 2021

richmahn commented Jul 29, 2021

zeripath commented Jul 31, 2021

richmahn commented Aug 5, 2021

richmahn commented Aug 5, 2021

silverwind commented Aug 5, 2021

zeripath commented Aug 5, 2021

richmahn commented Oct 20, 2021

silverwind commented Jul 27, 2021 •

edited

Loading

richmahn commented Jul 27, 2021 •

edited

Loading

richmahn commented Jul 27, 2021 •

edited

Loading

richmahn commented Jul 27, 2021 •

edited

Loading

lunny commented Jul 28, 2021 •

edited

Loading