
[ML] More tolerant delimited file parsing when structure is overridden #38890

Closed
droberts195 opened this issue Feb 14, 2019 · 2 comments

@droberts195
Contributor

droberts195 commented Feb 14, 2019

Inspired by elastic/kibana#31065.

At present the file structure finder will only detect a delimited file if all rows have the same number of columns. This is sensible when determining the structure from scratch, but when the structure has been explicitly specified as delimited using an override and the exact delimiter is also supplied it makes more sense to believe the user and try to create a structure using the specified format even if it means there are different numbers of columns per row.
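A rough sketch of the proposed behavior (in Python, with an invented `parse_delimited` helper; the real structure finder is Java code inside Elasticsearch): when the delimiter comes from a user override, rows whose column count differs from the header's could be skipped instead of failing detection outright.

```python
import csv
from io import StringIO

def parse_delimited(sample: str, delimiter: str, delimiter_overridden: bool = False):
    """Illustrative only: split a delimited sample into header and data rows.

    When the structure is being detected from scratch (delimiter_overridden=False),
    any row whose column count differs from the header's aborts detection.
    When the user explicitly supplied the delimiter via an override, ragged
    rows are skipped and parsing proceeds with the remaining rows.
    """
    rows = list(csv.reader(StringIO(sample), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    good = [r for r in data if len(r) == len(header)]
    bad = len(data) - len(good)
    if bad and not delimiter_overridden:
        raise ValueError(f"{bad} row(s) do not match the header width")
    return header, good

sample = "a,b,c\n1,2,3\n4,5\n6,7,8\n"
# parse_delimited(sample, ",")        # raises ValueError (detection from scratch)
# parse_delimited(sample, ",", True)  # succeeds, dropping the ragged row
```

This keeps the strict behavior as the default, so detection from scratch is unchanged; leniency is opt-in via the override path.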

Additionally, when doing timestamp format determination for delimited files it would be nice to have an option to detect a timestamp field even when a small percentage of rows do not match. We could still default to requiring 100% matches but offer the option to reduce this to, say, 95%.
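The suggested relaxation could look roughly like this (a Python sketch; the function name and single regex pattern are invented for illustration, whereas the real finder tries many timestamp formats): keep 100% as the default match requirement, but let callers lower the threshold.

```python
import re

# Simplified ISO-8601 matcher, standing in for the finder's many formats.
ISO_TS = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")

def is_timestamp_field(values, min_match_fraction=1.0):
    """Accept a candidate timestamp field if at least min_match_fraction
    of the sampled values match the pattern (default: all must match)."""
    if not values:
        return False
    matches = sum(1 for v in values if ISO_TS.search(v))
    return matches / len(values) >= min_match_fraction

values = ["2019-02-14T09:00:00"] * 19 + ["not a timestamp"]
# is_timestamp_field(values)        -> False (19/20 < 100%)
# is_timestamp_field(values, 0.95)  -> True  (19/20 >= 95%)
```

With one garbage row in twenty samples, the default rejects the field while a 95% threshold accepts it, which is exactly the trade-off proposed above.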

@droberts195 droberts195 added >enhancement :ml Machine learning labels Feb 14, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@droberts195 droberts195 self-assigned this May 30, 2019
@droberts195 droberts195 assigned benwtrent and unassigned droberts195 Apr 27, 2020
benwtrent added a commit that referenced this issue Apr 29, 2020
…at is specified (#55735)

While it is good not to be lenient when attempting to guess the file format, it is frustrating to users when they KNOW it is CSV but there are a few ill-formatted rows in the file (due to data entry errors, etc.).

This commit allows for up to 10% of sample rows to be considered "bad". These rows are effectively ignored while guessing the format.

This "allowed bad rows" percentage is only applied when the user has specified delimited formatting options, as the structure finder needs some guidance on what a "bad row" actually means.

related to #38890
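The tolerance described in the commit reduces to a simple check (a hypothetical Python sketch; the names are invented, and the real implementation lives in the Java file structure finder):

```python
DEFAULT_MAX_BAD_FRACTION = 0.10  # the 10% allowance described above

def within_bad_row_allowance(total_rows: int, bad_rows: int,
                             max_bad_fraction: float = DEFAULT_MAX_BAD_FRACTION) -> bool:
    """True if the number of ill-formatted sample rows is small enough
    (at most max_bad_fraction of the sample) to keep guessing the format."""
    return bad_rows <= total_rows * max_bad_fraction
```

For a 100-row sample, 10 bad rows are tolerated while an 11th tips the sample over the allowance and format guessing gives up.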
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Apr 29, 2020
…at is specified (elastic#55735)

benwtrent added a commit that referenced this issue Apr 29, 2020
…at is specified (#55735) (#55944)

@droberts195
Contributor Author

Fixed by #55735
