
[ML] Allow a certain number of ill-formatted rows when delimited format is specified #55735

Conversation

benwtrent
Member

While it is good not to be lenient when attempting to guess the file format, it is frustrating for users when they KNOW the file is CSV but it contains a few ill-formatted rows (introduced by data-entry errors, etc.).

This commit allows up to 10% of the sampled rows to be considered "bad". These rows are effectively ignored while guessing the format.

This allowance for bad rows is only applied when the user has specified delimited formatting options, since the structure finder needs some guidance on what a "bad row" actually means.

related to #38890
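The bad-row allowance described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual DelimitedFileStructureFinder code; the class name, method name, and the exact accounting are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the bad-row tolerance: when the user has explicitly
// specified delimited format options, up to 10% of the sampled rows may have
// an unexpected number of fields and be skipped, rather than failing the
// whole format guess on the first mismatch.
public class BadRowTolerance {

    static final double ALLOWED_BAD_ROW_FRACTION = 0.10;

    static List<String[]> filterRows(List<String[]> sampledRows, int expectedFieldCount) {
        int allowedBadRows = (int) Math.ceil(sampledRows.size() * ALLOWED_BAD_ROW_FRACTION);
        int badRows = 0;
        List<String[]> goodRows = new ArrayList<>();
        for (String[] row : sampledRows) {
            if (row.length == expectedFieldCount) {
                goodRows.add(row);
            } else if (++badRows > allowedBadRows) {
                // Too many ill-formatted rows: give up, as before this change.
                throw new IllegalArgumentException(
                    "Too many ill-formatted rows [" + badRows + "] out of ["
                        + sampledRows.size() + "] sampled");
            }
            // Otherwise the bad row is effectively ignored while guessing.
        }
        return goodRows;
    }
}
```

Rows that fall within the allowance are simply dropped from the sample, so the remaining well-formed rows drive the format guess; only when the bad-row count exceeds the threshold does the finder fail as it did before.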

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

Contributor

droberts195 left a comment

I think you should try out the end-to-end UI experience with this change. Try:

  1. The CSV file with differing numbers of fields from kibana#60196 ([ML] Data Visualizer to accept data without timestamp), with no overrides
  2. The same CSV file, with the delimiter overridden to ,
  3. The same CSV file, with the structure overridden to delimited but no explicit delimiter specified
  4. A semi-structured application log that is not CSV with the number of lines to analyse overridden to 100000

@benwtrent
Member Author

@droberts195 to test the end-to-end UI experience, the uploader needs to let me supply overrides after the initial parsing fails. When I last tried, this was not possible.

I will do some patching locally to work around this.

@droberts195
Contributor

to test the end-to-end UI experience, the uploader needs to let me supply overrides after the initial parsing fails. When I last tried, this was not possible.

Good point.

You could still test the case of a semi-structured log file, though. Check that the explanations of why it was not considered delimited are formatted correctly.

I will do some patching locally to work around this.

Yes, you could just hardcode the overrides onto every call to find_file_structure in a local Kibana build. Obviously that local build will only be any good for the single test case, but it's a way to escape the chicken-and-egg situation of needing both this backend change and a UI change to see the full benefit.

@benwtrent
Member Author

Scenarios tested:

  1. It fails, reporting the row it failed on and why (the number of fields differs from that of the first row).
  2. It succeeds and the data imports without issue.
  3. It succeeds and the data imports without issue.
  4. I did not experience any timeouts. Without overrides, it behaves much as before: the very first line with a field-count mismatch causes the failure.

@benwtrent
Member Author

@elasticmachine update branch

Contributor

droberts195 left a comment

The more I look at this, the more complicated a seemingly simple idea becomes 😢

…ed' of github.com:benwtrent/elasticsearch into feature/ml-fsf-allow-lenient-delim-parsing-when-specified
@benwtrent
Member Author

run elasticsearch-ci/packaging-sample-unix-docker

Contributor

droberts195 left a comment

LGTM if you could just adjust a comment to make clear that totalNumberOfRows is an approximation.

Thanks for all the iterations on this and end-to-end testing with the UI.

The bug with lineMergeSizeLimit not being considered for CSV can be left to another PR, or fixed in this one if you prefer.

…structurefinder/DelimitedFileStructureFinder.java

Co-Authored-By: David Roberts <dave.roberts@elastic.co>
@benwtrent
Member Author

@elasticmachine update branch

Contributor

przemekwitek left a comment

LGTM

@benwtrent benwtrent merged commit fd554d9 into elastic:master Apr 29, 2020
@benwtrent benwtrent deleted the feature/ml-fsf-allow-lenient-delim-parsing-when-specified branch April 29, 2020 14:24
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Apr 29, 2020
[ML] Allow a certain number of ill-formatted rows when delimited format is specified (elastic#55735)

benwtrent added a commit that referenced this pull request Apr 29, 2020
[ML] Allow a certain number of ill-formatted rows when delimited format is specified (#55735) (#55944)

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request May 6, 2020
droberts195 added a commit that referenced this pull request May 14, 2020
…56288)

Docs for #55735

Co-authored-by: Lisa Cawley <lcawley@elastic.co>
5 participants