[ML] More tolerant delimited file parsing when structure is overridden #38890
Pinging @elastic/ml-core
benwtrent added a commit that referenced this issue on Apr 29, 2020
…at is specified (#55735) While it is good not to be lenient when guessing the file format, it is frustrating to users when they know the file is CSV but it contains a few ill-formatted rows (through entry errors, etc.). This commit allows up to 10% of sample rows to be considered "bad". These rows are effectively ignored while guessing the format. The allowance for bad rows is applied only when the user has specified delimited formatting options, because the structure finder needs some guidance on what a "bad row" actually means. Related to #38890.
benwtrent added a commit that referenced this issue on Apr 29, 2020
…at is specified (#55735) (#55944) While it is good not to be lenient when guessing the file format, it is frustrating to users when they know the file is CSV but it contains a few ill-formatted rows (through entry errors, etc.). This commit allows up to 10% of sample rows to be considered "bad". These rows are effectively ignored while guessing the format. The allowance for bad rows is applied only when the user has specified delimited formatting options, because the structure finder needs some guidance on what a "bad row" actually means. Related to #38890.
Fixed by #55735
Inspired by elastic/kibana#31065.
At present, the file structure finder detects a delimited file only if all rows have the same number of columns. This is sensible when determining the structure from scratch. However, when the structure has been explicitly specified as delimited using an override, and the exact delimiter is also supplied, it makes more sense to believe the user and try to create a structure using the specified format, even if rows have different numbers of columns.
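To make the idea concrete, here is a minimal Python sketch of tolerant column counting when the delimiter is supplied by the user. It is not the Elasticsearch implementation: the function name `find_delimited_structure` and the parameter `max_bad_fraction` are hypothetical, and the 10% default is borrowed from the commit message referenced on this issue. The sketch assumes the most common column count is the intended one and treats any other row as "bad".

```python
import csv
import io
from collections import Counter


def find_delimited_structure(sample, delimiter, max_bad_fraction=0.10):
    """Hypothetical helper: trust the user-supplied delimiter and tolerate
    a limited fraction of rows whose column count differs from the rest.

    Returns (expected_columns, good_rows, bad_row_count), or None when the
    sample is empty or too many rows disagree.
    """
    # Parse with the user-supplied delimiter; skip completely blank lines.
    rows = [row for row in csv.reader(io.StringIO(sample), delimiter=delimiter) if row]
    if not rows:
        return None

    # Assumption: the most common column count is the intended one.
    expected = Counter(len(row) for row in rows).most_common(1)[0][0]
    good_rows = [row for row in rows if len(row) == expected]
    bad_row_count = len(rows) - len(good_rows)

    # Reject only when the bad rows exceed the allowed fraction.
    if bad_row_count / len(rows) > max_bad_fraction:
        return None
    return expected, good_rows, bad_row_count
```

With the strict behaviour described above, a single short row would prevent detection. In this sketch, one bad row out of ten is still accepted, while two out of ten are not.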
Additionally, when determining the timestamp format for delimited files, it would be nice to have an option to detect a timestamp field even when a small percentage of rows do not match. We could still default to requiring 100% matches but offer the option to reduce this to, say, 95%.
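The timestamp proposal could look roughly like the following Python sketch. The function `detect_timestamp_column` and its parameter `min_match_fraction` are hypothetical names chosen for illustration, and the sketch checks a single caller-supplied `strptime` format rather than the range of formats a real structure finder would try. The default of 1.0 keeps the current requirement that every row matches.

```python
from datetime import datetime


def detect_timestamp_column(rows, fmt, min_match_fraction=1.0):
    """Hypothetical helper: return the index of the first column whose
    values parse with `fmt` in at least `min_match_fraction` of the rows,
    or None if no column qualifies.
    """
    if not rows:
        return None

    # Only consider columns present in every row.
    for col in range(min(len(row) for row in rows)):
        matches = 0
        for row in rows:
            try:
                datetime.strptime(row[col], fmt)
                matches += 1
            except ValueError:
                # This value is not a timestamp in the given format.
                pass
        if matches / len(rows) >= min_match_fraction:
            return col
    return None
```

Calling it with `min_match_fraction=0.95` would accept a column in which 19 of 20 sampled rows parse, whereas the default would reject that column.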