Memory estimation endpoint returns "0" for non-empty dataset. #49140

Closed
przemekwitek opened this issue Nov 15, 2019 · 3 comments
Labels
:ml Machine learning

Comments

@przemekwitek
Contributor

przemekwitek commented Nov 15, 2019

Dataset: barcelona_accidents

With the following request:

{
  "source": {
    "index": "barcelona_accidents"
  },
  "analysis": {
    "outlier_detection": {}
  }
}

the '_estimate_memory_usage' endpoint returns the following response:

{
  "expected_memory_without_disk" : "0",
  "expected_memory_with_disk" : "0"
}

The problem appears to lie in data extraction: the following search query, produced by the data extractor, yields no results:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "exists": {
            "field": "day",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.day",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.hour",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.location.lat",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.location.lon",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.mild_injuries",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.serious_injuries",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.vehicles_involved",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.victims",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "hour",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "mild_injuries",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "serious_injuries",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "vehicles_involved",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "victims",
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "track_total_hits": 2147483647
}

However, the query starts returning results if the fields without a doc. prefix are removed from it.
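To illustrate why the extra clauses matter, here is a minimal Python sketch of how a list of exists filters behaves when every clause must match. It is a simplified model, not Elasticsearch code: the helper names field_exists and matches_all_exists are made up for this example, and "exists" is approximated as "the dotted path resolves to a non-null value in _source".

```python
def field_exists(source, path):
    """Return True if the dotted path resolves to a non-null value in source."""
    node = source
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return node is not None


def matches_all_exists(source, fields):
    """Mimic a bool filter made only of exists clauses: all of them must match."""
    return all(field_exists(source, field) for field in fields)


# A trimmed-down document shaped like the ones in barcelona_accidents.
doc = {"doc": {"day": 1, "hour": 2, "victims": 1}}

with_top_level = ["day", "doc.day", "doc.hour", "doc.victims", "hour", "victims"]
only_prefixed = [f for f in with_top_level if f.startswith("doc.")]

print(matches_all_exists(doc, with_top_level))  # False
print(matches_all_exists(doc, only_prefixed))   # True
```

A single clause on a field that no document contains is enough to make the whole filter match nothing, which is consistent with the query returning hits once the unprefixed fields are dropped.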

@przemekwitek przemekwitek added the :ml Machine learning label Nov 15, 2019
@przemekwitek przemekwitek self-assigned this Nov 15, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@przemekwitek
Contributor Author

It turns out that although top-level fields like day, hour, etc. exist in the mapping, they do not exist in any document; the values are stored under doc instead. Since every exists filter in the query must match, the query fails to find results.

Mapping:

{
  "barcelona_accidents": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "day": {
          "type": "short"
        },
        "district_name": {
          "type": "keyword"
        },
        "doc": {
          "properties": {
            "@timestamp": {
              "type": "date"
            },
            "day": {
              "type": "long"
            },
            "district_name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "hour": {
              "type": "long"
            },
            "id": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "location": {
              "properties": {
                "lat": {
                  "type": "float"
                },
                "lon": {
                  "type": "float"
                }
              }
            },
            "mild_injuries": {
              "type": "long"
            },
            "month": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "part_of_the_day": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "serious_injuries": {
              "type": "long"
            },
            "street": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "vehicles_involved": {
              "type": "long"
            },
            "victims": {
              "type": "long"
            },
            "weekday": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "hour": {
          "type": "short"
        },
        "id": {
          "type": "text"
        },
        "location": {
          "type": "geo_point"
        },
        "mild_injuries": {
          "type": "short"
        },
        "month": {
          "type": "keyword"
        },
        "part_of_the_day": {
          "type": "keyword"
        },
        "serious_injuries": {
          "type": "short"
        },
        "street": {
          "type": "keyword"
        },
        "vehicles_involved": {
          "type": "short"
        },
        "victims": {
          "type": "short"
        },
        "weekday": {
          "type": "keyword"
        }
      }
    }
  }
}

Documents:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "barcelona_accidents",
        "_id": "NAawx2gBoWuqnhwrb0Ry",
        "_score": 1,
        "_source": {
          "doc": {
            "id": "2017S000002    ",
            "district_name": "Eixample",
            "street": "GV CORTS CATALANES                                ",
            "weekday": "Sunday",
            "month": "January",
            "day": 1,
            "hour": 2,
            "part_of_the_day": "Night",
            "mild_injuries": 1,
            "serious_injuries": 0,
            "victims": 1,
            "vehicles_involved": 2,
            "location": {
              "lat": 41.39968,
              "lon": 2.1823759999999996
            },
            "@timestamp": "2017-01-01T00:00:00"
          }
        }
      }
    ]
  }
}
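The comparison between the mapping and the documents can be automated. The following is a rough Python sketch, assuming the mapping's "properties" object and a sample of _source objects have already been fetched; the function names (mapped_leaf_fields, source_paths, never_populated) are hypothetical, and a sample of documents can only suggest, not prove, that a field is never populated.

```python
def mapped_leaf_fields(properties, prefix=""):
    """Yield dotted paths of leaf fields found in a mapping 'properties' dict."""
    for name, spec in properties.items():
        path = prefix + name
        if "properties" in spec:
            # Object field: descend into its sub-properties.
            yield from mapped_leaf_fields(spec["properties"], path + ".")
        else:
            yield path


def source_paths(source, prefix=""):
    """Yield dotted paths of every non-null key present in a _source object."""
    for key, value in source.items():
        path = prefix + key
        if isinstance(value, dict):
            # Report the object itself too, so a leaf type supplied as an
            # object (e.g. a geo_point given as lat/lon) counts as present.
            yield path
            yield from source_paths(value, path + ".")
        elif value is not None:
            yield path


def never_populated(properties, sources):
    """Return mapped leaf fields that appear in none of the sampled sources."""
    present = set()
    for source in sources:
        present.update(source_paths(source))
    return sorted(set(mapped_leaf_fields(properties)) - present)


# Trimmed-down versions of the mapping and document shown above.
properties = {
    "day": {"type": "short"},
    "hour": {"type": "short"},
    "doc": {
        "properties": {
            "day": {"type": "long"},
            "hour": {"type": "long"},
            "location": {
                "properties": {"lat": {"type": "float"}, "lon": {"type": "float"}}
            },
        }
    },
}
sources = [
    {"doc": {"day": 1, "hour": 2, "location": {"lat": 41.39968, "lon": 2.1823759999999996}}}
]

print(never_populated(properties, sources))  # ['day', 'hour']
```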

@przemekwitek
Contributor Author

przemekwitek commented Nov 15, 2019

After talking to @dimitris-athanasiou, we agreed that the data extractor behavior is correct, i.e. the extractor should require all the analysable fields to exist for the outlier detection analysis (which does not support missing values).

So, in order to run the analysis correctly, the user should explicitly exclude the missing fields from the analysis using:

{
  ...
  "analyzed_fields": {
    "excludes": [ "day", "hour", "mild_injuries", "serious_injuries", "vehicles_involved", "victims" ]
  },
  ...
}
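As an illustration, the exclusion list can be merged into an existing request body programmatically. This is a small Python sketch using plain dictionaries only; with_excluded_fields is a hypothetical helper, and it assumes analyzed_fields sits at the top level of the request body, next to source and analysis.

```python
import json


def with_excluded_fields(request_body, missing_fields):
    """Return a copy of request_body with missing_fields added to analyzed_fields.excludes."""
    body = dict(request_body)
    analyzed = dict(body.get("analyzed_fields", {}))
    excludes = list(analyzed.get("excludes", []))
    for field in missing_fields:
        if field not in excludes:  # keep the list free of duplicates
            excludes.append(field)
    analyzed["excludes"] = excludes
    body["analyzed_fields"] = analyzed
    return body


# The request from the top of this issue.
request_body = {
    "source": {"index": "barcelona_accidents"},
    "analysis": {"outlier_detection": {}},
}
missing = ["day", "hour", "mild_injuries", "serious_injuries", "vehicles_involved", "victims"]

print(json.dumps(with_excluded_fields(request_body, missing), indent=2))
```

The original request_body is left untouched, so the same base request can be reused with different exclusion lists.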

To improve the UX, the memory estimation endpoint should throw early if it sees no analysable data, so that the user is not confused by receiving a "0" estimation.
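The fail-fast idea can be sketched as follows. This is only an illustration in Python, not the actual Elasticsearch change: NoAnalysableDataError and check_analysable_rows are hypothetical names, and the row count is assumed to come from the data extractor's search.

```python
class NoAnalysableDataError(ValueError):
    """Raised when no document contains all of the fields to be analysed."""


def check_analysable_rows(row_count, index):
    """Fail fast instead of letting an empty data set produce a "0" estimate."""
    if row_count == 0:
        raise NoAnalysableDataError(
            f"Unable to estimate memory usage: no documents in [{index}] "
            "contain all of the analysed fields"
        )
    return row_count


try:
    check_analysable_rows(0, "barcelona_accidents")
except NoAnalysableDataError as error:
    print(error)
```

Raising before the estimate is computed turns a misleading "0" into an error message that points the user at the fields being analysed.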
