CSV Renderer may guess a TSV file is comma delimited even when tab delimited #16558
Comments
So I think that's a bug in the function that guesses the delimiter.
I would not make a separate renderer but simply skip over One such extension may be |
Hm, we should allow external renderers to have their own parameters in
Yes, we need to have the filename available so that it can be used in determining which delimiter to use as well.
Yes, but the problem is that currently no extension is passed to the renderer, as it is invoked by the list of extensions it renders. I offered that as a solution though, if we can get it passed in.
@lunny What kind of parameters? You mean which delimiter to use if it is a tsv, a csv, or a psv? Still, csv is actually the only standard, and it can be delimited by any character (some say csv means character-separated values), so one repo might use it differently than another. Yet tsv and psv files are still csv files that kind of hint at the delimiter in their extension.
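The idea of letting the extension hint at the delimiter can be sketched as below. This is only an illustration, assuming the filename can be made available to the renderer; `delimiterFromFilename` is a hypothetical helper, not a function in Gitea.

```go
package main

import (
	"path/filepath"
	"strings"
)

// delimiterFromFilename is a hypothetical helper (not part of Gitea).
// It returns the delimiter hinted at by the file extension and true,
// or 0 and false when the extension gives no hint (for example ".csv",
// where the delimiter still has to be guessed from the content).
func delimiterFromFilename(filename string) (rune, bool) {
	switch strings.ToLower(filepath.Ext(filename)) {
	case ".tsv":
		return '\t', true
	case ".psv":
		return '|', true
	default:
		return 0, false
	}
}
```

A caller could try this hint first and fall back to guessing from the content only when the second return value is false.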
From a coworker:
I do wonder if the first row, whether a header or not, is enough to go by, rather than counting all the tabs, commas, and other delimiters in ALL the content, as it currently does. You'd think the first row would use more tabs than commas, or more commas than tabs, especially if it is a real header. But even if not, that still seems like the best way to score the characters. Edit: I see now that it scores based on the first 10 lines, yet in my example the 2nd row actually has a bunch of commas in the text. Headers wouldn't normally do that.
I suggest two things:
Thoughts? |
I agree with both. We can read ten lines to detect the delimiter. But one thing we need to notice is that some columns may be quoted with double quotes and contain delimiter characters inside the quotes, i.e.
It should be 4 columns, not 7. If we just do a simple count, that will be wrong.
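To make the quoting problem concrete, here is a small sketch comparing a naive delimiter count with a quote-aware parse using Go's standard `encoding/csv`. The function names and the sample line in the usage note are hypothetical and are not the example from the comment above.

```go
package main

import (
	"encoding/csv"
	"strings"
)

// naiveFieldCount splits on every delimiter and ignores quoting.
func naiveFieldCount(line string, delim rune) int {
	return strings.Count(line, string(delim)) + 1
}

// quotedFieldCount parses one record with encoding/csv, which treats
// delimiters inside double-quoted fields as part of the field.
func quotedFieldCount(line string, delim rune) (int, error) {
	r := csv.NewReader(strings.NewReader(line))
	r.Comma = delim
	record, err := r.Read()
	if err != nil {
		return 0, err
	}
	return len(record), nil
}
```

For a made-up line such as `a,"b,c","d,e,f",g`, the naive count reports 7 fields while the quote-aware parse reports 4.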
@lunny yeah, I thought of that and am well aware of it. I assume we would actually use the encoding/csv module with those 10 lines and the delimiter in question and see what it returns. We'd still be making a best guess, as some rows might have the wrong number of fields, but if one delimiter has a higher accuracy than the others, we will use that delimiter. That is why we can kind of do a score with the matching of the delimiter as well.
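A minimal sketch of that scoring idea follows. It is not the code that was merged into modules/csv/csv.go; the names `candidates`, `scoreDelimiter`, and `guessDelimiter`, the candidate list, and the scoring rule (count sampled records whose field count matches the first record) are all assumptions made for illustration.

```go
package main

import (
	"encoding/csv"
	"strings"
)

// candidates lists the delimiters to try (an assumed set); earlier
// entries win ties.
var candidates = []rune{',', '\t', ';', '|'}

// scoreDelimiter parses up to maxLines records of sample using delim and
// returns how many of them have the same number of fields as the first
// record. A first record with fewer than two fields scores 0.
func scoreDelimiter(sample string, delim rune, maxLines int) int {
	r := csv.NewReader(strings.NewReader(sample))
	r.Comma = delim
	r.FieldsPerRecord = -1 // allow ragged rows; they are scored below
	r.LazyQuotes = true    // tolerate stray quotes in free text
	want, score := 0, 0
	for i := 0; i < maxLines; i++ {
		rec, err := r.Read()
		if err != nil {
			break // io.EOF or a parse error ends the sample
		}
		if i == 0 {
			want = len(rec)
			if want < 2 {
				return 0
			}
		}
		if len(rec) == want {
			score++
		}
	}
	return score
}

// guessDelimiter returns the best-scoring candidate over the first 10
// records, defaulting to a comma when nothing scores above 0.
func guessDelimiter(sample string) rune {
	best, bestScore := ',', 0
	for _, d := range candidates {
		if s := scoreDelimiter(sample, d, 10); s > bestScore {
			best, bestScore = d, s
		}
	}
	return best
}
```

With a tab-delimited sample whose header has no commas but whose second row contains many commas in its text, the comma candidate scores 0 because the header parses as a single field, so the tab wins. As noted in the thread, any such heuristic is still a best guess.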
So just to be clear, I'm making a fix for this that should also work with the CSV diff feature, which has the same bug.
Why don't we provide a mechanism for choosing whether to guess the delimiter or not?
@zeripath So some sort of user interface on the repo file view page? Yet we would still have to guess something when first showing the file.
@zeripath Also, the DIFF view for .csv files has this bug, and thus you'd have to have a selection for each file diff. |
I wouldn't. This is a kind of thing that can be handled automatically and should not need configuration for the most common types (csv,tsv,psv). #16558 (comment) sounds good to me. |
Every heuristic mechanism will get things wrong for some pathological case. |
I still need to look at this. Will see if I can work on this. Found another bug with CSV diffs in that if the diff is after 32 lines, it fails to show the CSV table. |
* Fixes #16558 CSV delimiter determiner
* Fixes #16558 - properly determine CSV delmiiter
* Moves quoteString to a new function
* Adds big test with lots of commas for tab delimited csv
* Adds comments
* Shortens the text of the test
* Removes single quotes from regexp as only double quotes need to be searched
* Fixes spelling
* Fixes check of length as it probalby will only be 1e4, not greater
* Makes sample size a const, properly removes truncated line
* Makes sample size a const, properly removes truncated line
* Fixes comment
* Fixes comment
* tests for FormatError() function
* Adds logic to find the limiter before or after a quoted value
* Simplifies regex
* Error tests
* Error tests
* Update modules/csv/csv.go
  Co-authored-by: delvh <dev.lh@web.de>
* Update modules/csv/csv.go
  Co-authored-by: delvh <dev.lh@web.de>
* Adds comments
* Update modules/csv/csv.go
  Co-authored-by: delvh <dev.lh@web.de>

Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>
Co-authored-by: zeripath <art27@cantab.net>
Co-authored-by: delvh <dev.lh@web.de>
https://try.gitea.io/richmahn/test/src/branch/master/is_tsv_but_uses_commas.tsv
See the raw file: it actually has 8 rows, yet because the "intro" row has many commas in its text, the delimiter is guessed to be a comma, so only rows that have NO commas (like the header, which has no commas and is thus treated as one field) get shown.
Description
If a file is a .tsv file (which isn't a standard itself, and is effectively CSV with tab as the delimiter) but has more commas than tabs, the function here thinks the delimiter is a comma, and thus all rows that have more fields than the header (which has no commas, thus 1 field) get removed from the rendering.
I suggest, as I have done in my fork of Gitea, making a separate markup renderer like the "csv" one, but called "tsv", that always uses \t as the delimiter because of the filename. Either that, or make sure the function that guesses the delimiter gets the filename of the file being rendered so it can determine the delimiter from that.
Screenshots
Here is the blame for my .tsv file on try.gitea.io which shows it is tab delimited, with the 2nd row having LOTS of texts with commas:
Yet the table rendering removes many of the rows: