
[ML] Data Visualizer to accept data without timestamp #60196

Closed
kiju98 opened this issue Mar 14, 2020 · 5 comments
Labels: Feature:File and Index Data Viz, :ml, v7.7.0

kiju98 commented Mar 14, 2020

Kibana version: 7.6.1

Describe the feature:
Currently, Data Visualizer produces the following error when we try to upload a data file without a timestamp:

File could not be read
[illegal_argument_exception] Could not find a timestamp in the sample provided

[screenshot: data_visualizer error message]

It would be more helpful if Data Visualizer accepted data without a timestamp.

Describe a specific use case for the feature:
Loading only data with a timestamp used to be enough, because Elastic Machine Learning (anomaly detection) only handled data with a timestamp. However, 7.6 introduced other features, such as classification, that do not require a timestamp, so I think it would be helpful if Data Visualizer accepted data without one.

kiju98 added the Feature:File and Index Data Viz and v7.6.1 labels on Mar 14, 2020
elasticmachine (Contributor) commented:

Pinging @elastic/ml-ui (:ml)

peteharverson changed the title from "Data Visualizer to accept data without timestamp" to "[ML] Data Visualizer to accept data without timestamp" on Mar 16, 2020
droberts195 (Contributor) commented:

It does accept data without a timestamp, provided the data is in a highly structured format like NDJSON, CSV, TSV, semicolon-separated values, etc.

It needs a timestamp for semi-structured log data because the rule for "what is the first line of each message" is "the line with the timestamp on it".

If you think your data was CSV or some other delimited format, then the real question here is: what made the file structure finder decide it was not possible to import as CSV? Sending the file directly to the backend find_file_structure endpoint with the ?explain option will give more insight into this.
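For example, such a request might look like the following (a minimal sketch; the localhost:9200 address and the movies.csv file name are assumptions, so adjust them for your cluster and file):

# Send the raw file to the file structure finder and ask it to explain its decisions.
curl -s -H "Content-Type: application/json" \
  -XPOST "http://localhost:9200/_ml/find_file_structure?pretty&explain=true" \
  -T "movies.csv"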

There are a number of things that could come out of this if the file was CSV:

  1. It would be useful if the UI error message included the explanation from the backend endpoint
  2. The backend endpoint should be more tolerant about detecting CSV when the format is explicitly overridden - see [ML] More tolerant delimited file parsing when structure is overridden elasticsearch#38890
  3. The UI should let you give hints by setting overrides if the initial import fails - see [ML] File Data Viz should allow retry with overrides when initial analysis fails #38868 - that's not exactly what the title of the issue says, but it would be covered by:

It would be nice if the user was able to enter overrides after an error on the initial analysis, to enable them to import a file in situations when giving the structure analysis hints would allow it to succeed.


kiju98 commented Mar 16, 2020

Thank you, @droberts195.
The data file was CSV.
I tried the find_file_structure endpoint and the result was:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Could not find a timestamp in the sample provided"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Could not find a timestamp in the sample provided",
    "suppressed" : [
      {
        "type" : "exception",
        "reason" : "Explanation so far:\n[Using character encoding [UTF-8], which matched the input with [100%] confidence]\n[Not NDJSON because there was a parsing exception: [Unrecognized token 'korean_title': was expecting ('true', 'false' or 'null') at [Source: \"korean_title,title,year,country,length,genre,like,director,company\"; line: 1, column: 13]]]\n[Not XML because there was a parsing exception: [ParseError at [row,col]:[1,1] Message: Content is not allowed in prolog.]]\n[Not CSV because row [82] has a different number of fields to the first row: [9] and [8]]\n[Not TSV because the first row has fewer than [2] fields: [1]]\n[Not semicolon delimited values because the first row has fewer than [4] fields: [1]]\n[Not vertical line delimited values because the first row has fewer than [5] fields: [1]]\n[Deciding sample is text]\n"
      }
    ]
  },
  "status" : 400
}

The data file is movies.zip.

I think the error was due to the missing values in the company field. I deleted the company field and successfully loaded the CSV file. The modified CSV file is movies2.zip.

I hope Data Visualizer will be more lenient about missing values, and I agree that it would be helpful if we could see the explanation from Kibana.
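For illustration only (the data row below is hypothetical, not taken from movies.zip), a row that drops the trailing company value together with its comma ends up with 8 fields against the 9-field header shown in the explain output, which is exactly the kind of mismatch reported above:

korean_title,title,year,country,length,genre,like,director,company
기생충,Parasite,2019,Korea,132,Drama,1,Bong Joon-ho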

droberts195 (Contributor) commented:

Yes, so "Not CSV because row [82] has a different number of fields to the first row: [9] and [8]" is the relevant part of the explanation.

It's hard for the file structure finder to ignore discrepancies in the number of CSV fields per row, because then a lot of semi-structured text log files could get misdetected as CSV.

However, if the format could be overridden even when the initial analysis fails, then elastic/elasticsearch#38890 would help: if you explicitly said your file was CSV, differences in the number of fields per line could be treated as some lines having blanks at the end.
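For reference, the kind of explicit override being discussed can be supplied as query parameters on the same backend endpoint (a sketch only; the delimiter value %2C is a URL-encoded comma, and the host and file name are assumptions). As noted above, this may still fail on the mismatched field counts until elastic/elasticsearch#38890 is in place:

# Explicitly tell the structure finder the file is comma-delimited with a header row.
curl -s -H "Content-Type: application/json" \
  -XPOST "http://localhost:9200/_ml/find_file_structure?pretty&explain=true&format=delimited&delimiter=%2C&has_header_row=true" \
  -T "movies.csv"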


kiju98 commented Mar 18, 2020

Sounds great! Let me close this in favor of elastic/elasticsearch#38890.
