-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Allow a certain number of ill-formatted rows when delimited format is specified #55735
[ML] Allow a certain number of ill-formatted rows when delimited format is specified #55735
Conversation
Pinging @elastic/ml-core (:ml) |
...java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinderFactory.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/FileStructureFinderManager.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/FileStructureFinderFactory.java
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinder.java
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinder.java
Outdated
Show resolved
Hide resolved
...java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinderFactory.java
Outdated
Show resolved
Hide resolved
...java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinderFactory.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should try out the end-to-end UI experience with this change. Try:
- The CSV file with different numbers of fields from [ML] Data Visualizer to accept data without timestamp kibana#60196 (comment) with no overrides
- The CSV file with different numbers of fields from [ML] Data Visualizer to accept data without timestamp kibana#60196 (comment) with the delimiter overridden to
,
- The CSV file with different numbers of fields from [ML] Data Visualizer to accept data without timestamp kibana#60196 (comment) with the structure overridden to delimited, but no explicit delimiter specified
- A semi-structured application log that is not CSV with the number of lines to analyse overridden to 100000
...c/main/java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinder.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinder.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinder.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinder.java
Show resolved
Hide resolved
@droberts195 to test the end-to-end UI experience, the uploader needs to allow me to supply overrides after the initial parsing failed. When I last tried, this is not possible. I will do some patching locally to work around this. |
Good point. You could still test the case of a semi-structured log file though. Check that the explanations of why it wasn't delimited are formatted correctly.
Yes, you could just hardcode the overrides onto every call to |
Scenarios tested:
|
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I look at this the more complicated a seemingly simple idea becomes 😢
...t/java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinderTests.java
Outdated
Show resolved
Hide resolved
...t/java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinderTests.java
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinder.java
Outdated
Show resolved
Hide resolved
…low-lenient-delim-parsing-when-specified
…ed' of github.com:benwtrent/elasticsearch into feature/ml-fsf-allow-lenient-delim-parsing-when-specified
run elasticsearch-ci/packaging-sample-unix-docker |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if you could just adjust a comment to make clear that totalNumberOfRows
is an approximation.
Thanks for all the iterations on this and end-to-end testing with the UI.
The bug with lineMergeSizeLimit
not being considered for CSV can be left to another PR or fixed in this one if you prefer.
...c/main/java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinder.java
Outdated
Show resolved
Hide resolved
…structurefinder/DelimitedFileStructureFinder.java Co-Authored-By: David Roberts <dave.roberts@elastic.co>
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…at is specified (elastic#55735) While it is good to not be lenient when attempting to guess the file format, it is frustrating to users when they KNOW it is CSV but there are a few ill-formatted rows in the file (via some entry error, etc.). This commit allows for up to 10% of sample rows to be considered "bad". These rows are effectively ignored while guessing the format. This percentage of "allows bad rows" is only applied when the user has specified delimited formatting options. As the structure finder needs some guidance on what a "bad row" actually means. related to elastic#38890
…at is specified (#55735) (#55944) While it is good to not be lenient when attempting to guess the file format, it is frustrating to users when they KNOW it is CSV but there are a few ill-formatted rows in the file (via some entry error, etc.). This commit allows for up to 10% of sample rows to be considered "bad". These rows are effectively ignored while guessing the format. This percentage of "allows bad rows" is only applied when the user has specified delimited formatting options. As the structure finder needs some guidance on what a "bad row" actually means. related to #38890
While it is good to not be lenient when attempting to guess the file format, it is frustrating to users when they KNOW it is CSV but there are a few ill-formatted rows in the file (via some entry error, etc.).
This commit allows for up to 10% of sample rows to be considered "bad". These rows are effectively ignored while guessing the format.
This percentage of "allows bad rows" is only applied when the user has specified delimited formatting options. As the structure finder needs some guidance on what a "bad row" actually means.
related to #38890