
[ML] Allow a certain number of ill-formatted rows when delimited format is specified #55735

Conversation

benwtrent
Member

While it is good not to be lenient when attempting to guess the file format, it is frustrating for users when they KNOW the file is CSV but it contains a few ill-formatted rows (introduced by data-entry errors, etc.).

This commit allows up to 10% of the sampled rows to be considered "bad". These rows are effectively ignored while guessing the format.

This allowance for bad rows is only applied when the user has specified delimited formatting options, since the structure finder needs some guidance on what a "bad row" actually means.

related to #38890
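The bad-row allowance described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual DelimitedFileStructureFinder code; the class name, method name, and the exact accounting are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the bad-row tolerance: when the user has explicitly
// specified delimited format options, up to 10% of the sampled rows may have
// an unexpected number of fields and be skipped, rather than failing the
// whole format guess on the first mismatch.
public class BadRowTolerance {

    static final double ALLOWED_BAD_ROW_FRACTION = 0.10;

    static List<String[]> filterRows(List<String[]> sampledRows, int expectedFieldCount) {
        int allowedBadRows = (int) Math.ceil(sampledRows.size() * ALLOWED_BAD_ROW_FRACTION);
        int badRows = 0;
        List<String[]> goodRows = new ArrayList<>();
        for (String[] row : sampledRows) {
            if (row.length == expectedFieldCount) {
                goodRows.add(row);
            } else if (++badRows > allowedBadRows) {
                // Too many ill-formatted rows: give up, as before this change.
                throw new IllegalArgumentException(
                    "Too many ill-formatted rows [" + badRows + "] out of ["
                        + sampledRows.size() + "] sampled");
            }
            // Otherwise the bad row is effectively ignored while guessing.
        }
        return goodRows;
    }
}
```

Rows that fall within the allowance are simply dropped from the sample, so the remaining well-formed rows drive the format guess; only when the bad-row count exceeds the threshold does the finder fail as it did before.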

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

Contributor

droberts195 left a comment

I think you should try out the end-to-end UI experience with this change. Try:

  1. The CSV file with differing numbers of fields from kibana#60196 ([ML] Data Visualizer to accept data without timestamp), with no overrides
  2. The same CSV file, with the delimiter overridden to ,
  3. The same CSV file, with the structure overridden to delimited but no explicit delimiter specified
  4. A semi-structured application log that is not CSV with the number of lines to analyse overridden to 100000

@benwtrent
Member Author

@droberts195 to test the end-to-end UI experience, the uploader needs to let me supply overrides after the initial parsing fails. When I last tried, this was not possible.

I will do some patching locally to work around this.

@droberts195
Contributor

to test the end-to-end UI experience, the uploader needs to let me supply overrides after the initial parsing fails. When I last tried, this was not possible.

Good point.

You could still test the case of a semi-structured log file, though. Check that the explanations of why it was not considered delimited are formatted correctly.

I will do some patching locally to work around this.

Yes, you could just hardcode the overrides onto every call to find_file_structure in a local Kibana build. Obviously that local build will only be any good for the single test case, but it's a way to escape the chicken-and-egg situation of needing both this backend change and a UI change to see the full benefit.

@benwtrent
Member Author

Scenarios tested:

  1. It fails, reporting the row it failed on and why (the number of fields differs from that of the first row).
  2. It succeeds and the data imports without issue.
  3. It succeeds and the data imports without issue.
  4. I did not experience any timeouts. Without overrides, it behaves much as before: the very first line with a field-count mismatch causes the failure.

@benwtrent
Member Author

@elasticmachine update branch

Contributor

droberts195 left a comment

The more I look at this, the more complicated a seemingly simple idea becomes 😢

…ed' of github.com:benwtrent/elasticsearch into feature/ml-fsf-allow-lenient-delim-parsing-when-specified
@benwtrent
Member Author

run elasticsearch-ci/packaging-sample-unix-docker

Contributor

droberts195 left a comment

LGTM if you could just adjust a comment to make clear that totalNumberOfRows is an approximation.

Thanks for all the iterations on this and end-to-end testing with the UI.

The bug with lineMergeSizeLimit not being considered for CSV can be left to another PR, or fixed in this one if you prefer.

…structurefinder/DelimitedFileStructureFinder.java

Co-Authored-By: David Roberts <dave.roberts@elastic.co>
@benwtrent
Member Author

@elasticmachine update branch

Contributor

przemekwitek left a comment

LGTM

@benwtrent benwtrent merged commit fd554d9 into elastic:master Apr 29, 2020
@benwtrent benwtrent deleted the feature/ml-fsf-allow-lenient-delim-parsing-when-specified branch April 29, 2020 14:24
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Apr 29, 2020
[ML] Allow a certain number of ill-formatted rows when delimited format is specified (elastic#55735)

benwtrent added a commit that referenced this pull request Apr 29, 2020
[ML] Allow a certain number of ill-formatted rows when delimited format is specified (#55735) (#55944)

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request May 6, 2020
droberts195 added a commit that referenced this pull request May 14, 2020
…56288)

Docs for #55735

Co-authored-by: Lisa Cawley <lcawley@elastic.co>
5 participants