Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discuss] Enable date detection for data stream naming scheme by default #109381

Open
ruflin opened this issue Jun 5, 2024 · 4 comments
Open
Labels
:Data Management/Data streams Data streams and their lifecycles Team:Data Management Meta label for data/management team

Comments

@ruflin
Copy link
Contributor

ruflin commented Jun 5, 2024

Currently in the templates for logs-*-*, metrics-*-* etc. date detection is set to false. These templates were introduced with #57629 and the discussion around the defaults happened here. Recently the discussion popped up to potentially change this default. This issue is to have a place for this discussion and persist the decision.

Reasons to keep it disabled

One of the initial reasons to keep it disabled was that it could lead to documents fail to ingest. For example a field is detected as date but the follow up documents contain different values. This concern is become less sever as we have introduced ignore_malformed for all fields and failure store is coming along. Having date_detection on by default could also have a performance impact.

If we change the default now, we also need to discuss if some users would consider this a breaking change.

ECS mappings enough?

Another change that has happened since the initial discussion is the introduction of dynamic ECS templates. These ECS templates contain a block for matching various names to date, for example *.timestamp. Is this enough? Can we encourage users that want to have automatic matching to use one of the names here for their fields?

Overwriting the default

Today it is already possible to overwrite the default by using logs@custom. As soon as data streams roll over, the new default is applied and it will also be persisted during upgrades. If users are using an integration, logs-{your-dataset}@custom can be overwritten to make the change more local. But can we make a change even easier for example by bringing it to the UI or make it a setting per data stream instead of the template? Could we have a switch in Stack Management to toggle date detection on and off?

There is also work ongoing by the team around @flash1293 to detect wrong mappings in fields. Can we detect that a field should be a date and recommend a mapping change?

@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jun 5, 2024
@felixbarny felixbarny added the :Data Management/Data streams Data streams and their lifecycles label Jun 5, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Jun 5, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@eyalkoren
Copy link
Contributor

eyalkoren commented Sep 4, 2024

Another change that has happened since the initial discussion is the introduction of dynamic ECS templates. These ECS templates contain a block for matching various names to date, for example *.timestamp. Is this enough? Can we encourage users that want to have automatic matching to use one of the names here for their fields?

My expectation was that date detection is implicitly disabled anyway when using ecs@mappings due to our default mapping for string fields to keyword. #112398 shows that this wasn't the case for some fields (🤔 ), but to me this looks like a bug that needs to be fixed, meaning that it should be disabled with or without the explicit setting.
So in my opinion - no need to change this setting as it would be additional overhead for no value.

With #112444 we extended our test coverage and our dynamic mappings coverage to verify that all current and future ECS fields are supported when date detection is explicitly disabled.

As you say, there are multiple options for field name patterns to choose from and if nothing specifically meets one's need for a field name, I believe *_timestamp is generic enough to cover anything.

@felixbarny
Copy link
Member

I'm not certain that this is a bug. Seems more like a deliberate decision that date (and numeric) detection has a higher precedence than dynamic templates. Certainly, changing the precedence would be a breaking change.

See also

// We refuse to match pure numbers, which are too likely to be
// false positives with date formats that include eg.
// `epoch_millis` or `YYYY`
for (DateFormatter dateTimeFormatter : context.root().dynamicDateTimeFormatters()) {
try {
dateTimeFormatter.parseMillis(text);
} catch (DateTimeException | ElasticsearchParseException | IllegalArgumentException e) {
// failure to parse this, continue
continue;
}
return createDynamicDateField(
context,
name,
dateTimeFormatter,
() -> strategy.newDynamicDateField(context, name, dateTimeFormatter)
);
}
return createDynamicField(
context,
name,
DynamicTemplate.XContentFieldType.STRING,
() -> strategy.newDynamicStringField(context, name)
);

I do see benefits of enabling date detection by default. Essentially, the value is already telling us it's type. There's a risk of false positives but due to ignore_malformed, at least that doesn't lead to the whole document being rejected. It's also easy to fix by adding a mapping for the field. There's some performance overhead, but that only comes into play for new string fields that aren't in the mapping, yet (this includes rollovers).

@eyalkoren
Copy link
Contributor

I'm not certain that this is a bug. Seems more like a deliberate decision that date (and numeric) detection has a higher precedence than dynamic templates.

Really, you think it's deliberate? If so, I am sure there are good reasons, but it's very counter-intuitive to me that my custom setting would be ignored because of a default behavior, that I consider a fallback - use if nothing else is defined for a field.
Dynamic templates having lower precedence than static mappings make sense to me for the same rationale - they serve as fallbacks, but they are still custom mappings provided explicitly by the user, so I would definitely expect them to have higher precedence.

Certainly, changing the precedence would be a breaking change.

Yes, that's right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Data streams Data streams and their lifecycles Team:Data Management Meta label for data/management team
Projects
None yet
Development

No branches or pull requests

4 participants