Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix categorize csv test #113089

Merged
merged 5 commits into from
Sep 19, 2024
Merged

Fix categorize csv test #113089

merged 5 commits into from
Sep 19, 2024

Conversation

jan-elastic
Copy link
Contributor

No description provided.

@jan-elastic jan-elastic added >test Issues or PRs that are addressing/adding tests Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL labels Sep 18, 2024
@jan-elastic jan-elastic requested a review from a team as a code owner September 18, 2024 08:19
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@astefan astefan added the test-release Trigger CI checks against release build label Sep 18, 2024
/**
* Supported the text categorization function "CATEGORIZE".
*/
CATEGORIZE(true);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: does this piece of code work as intended?

If you add a non-existing required capability to a CSV test, it always passes.

Copy link
Contributor

@ivancea ivancea Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a capability presence check here:

if (Build.current().isSnapshot()) {
assertThat(
"Capability is not included in the enabled list capabilities on a snapshot build. Spelling mistake?",
testCase.requiredCapabilities,
everyItem(in(EsqlCapabilities.CAPABILITIES))
);
} else {

It was moved to snapshot-only for reasons I don't remember. It probably shouldn't.
I'm not sure if the CI runs a snapshot or not actually (I guess not); it feels weird that it didn't fail in any part of the process

Edit: Thinking about this again, the features are also automatically included as capabilities, so in snapshot mode, it maybe worked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the CI runs snapshot, and therefore it succeeded

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a tag you can add to make the tests run in release mode if you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run the test locally with and without snapshot and both work, so I think we're good.

@@ -1,13 +1,15 @@
categorize
required_capability: categorize

# Drop the category ID, because it's non-deterministic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not deterministic because of the order of the rows? What about sorting the data before it enters the STATS?
Dumping the tested function output is weird 👀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've re-added the category column and a SORT message inbetween. Thanks for the suggestion.

For my understanding: does the additional SORT step mean that every CATEGORIZE is happening on the coordinating node? I think it must be, otherwise this isn't working correctly, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding: does the additional SORT step mean that every CATEGORIZE is happening on the coordinating node? I think it must be, otherwise this isn't working correctly, right?

Yes. Your instincts for where we split are almost certainly right. We'll just run the agg after the sort. It's not ideal testing, but it's fine for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, the CSV tests are executed in different ways. The CsvTests class, I think it executes like an unit test, like a single node. There are also tests like https://github.com/elastic/elasticsearch/blob/c1daf18bf5f5178d43dc17d3a3d1f5db9773098e/x-pack/plugin/esql/qa/server/multi-clusters/src/javaRestTest/java/org/elasticsearch/xpack/esql/ccq/MultiClusterSpecIT.java, which use a full cluster.
But now thinking about the categorize function and how it works, I don't really know how is the sort executed. And specially as this is the first grouping function depending on order, maybe it makes sense in fact to not check the category column...

Sorry, I think the original solution was correct 🙏
Let's see if @nik9000 can confirm this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This failing test suggests that data does not arrive at the categorizer in sorted order.

@nik9000 do you understand why?

I'll leave the test muted for now. Once we replace the (arbitrary) IDs by regexes, the output will be deterministic.

/**
* Supported the text categorization function "CATEGORIZE".
*/
CATEGORIZE(true);

Copy link
Contributor

@ivancea ivancea Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a capability presence check here:

if (Build.current().isSnapshot()) {
assertThat(
"Capability is not included in the enabled list capabilities on a snapshot build. Spelling mistake?",
testCase.requiredCapabilities,
everyItem(in(EsqlCapabilities.CAPABILITIES))
);
} else {

It was moved to snapshot-only for reasons I don't remember. It probably shouldn't.
I'm not sure if the CI runs a snapshot or not actually (I guess not); it feels weird that it didn't fail in any part of the process

Edit: Thinking about this again, the features are also automatically included as capabilities, so in snapshot mode, it maybe worked?

/**
* Supported the text categorization function "CATEGORIZE".
*/
CATEGORIZE(true);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a tag you can add to make the tests run in release mode if you want.

/**
* Support categorize
*/
public static final NodeFeature CATEGORIZE = new NodeFeature("esql.categorize");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yikes. I hadn't noticed the feature vs capability. You can't really remove these once they've left snapshot-land. But this one stayed in snapshot-land forever so I think it's safe to remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I missed the pretty explicit:

Don't make more of those - add them to {@link EsqlCapabilities} instead.

@jan-elastic jan-elastic merged commit 47f62f5 into main Sep 19, 2024
17 of 18 checks passed
@jan-elastic jan-elastic deleted the fix-categorize-csv-test branch September 19, 2024 06:52
nik9000 pushed a commit to nik9000/elasticsearch that referenced this pull request Sep 19, 2024
* Move CATEGORIZE from EsqlFeatures to EsqlCapabilities

* Make CATEGORIZE csv test deterministic

* Unmute categorize test

* spotless

* Deterministic categorize csv test with category IDs
nik9000 pushed a commit to nik9000/elasticsearch that referenced this pull request Sep 19, 2024
* Move CATEGORIZE from EsqlFeatures to EsqlCapabilities

* Make CATEGORIZE csv test deterministic

* Unmute categorize test

* spotless

* Deterministic categorize csv test with category IDs
elasticsearchmachine pushed a commit that referenced this pull request Sep 19, 2024
* ES|QL categorize v1 (#112860)

* Prepare TokenListCategorizer for usage in ES|QL

* Expose text categorization from ML module

* Let esql plugin depend on ml plugin

* Fix/suppress this-escape

* Incomplete v1 of ES|QL Categorize

* Refactor / remove CategorizeInternal

* Fix categorize csv test (#113089)

* Move CATEGORIZE from EsqlFeatures to EsqlCapabilities

* Make CATEGORIZE csv test deterministic

* Unmute categorize test

* spotless

* Deterministic categorize csv test with category IDs

---------

Co-authored-by: Jan Kuipers <148754765+jan-elastic@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test Issues or PRs that are addressing/adding tests test-release Trigger CI checks against release build v8.16.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants