
[Pull-based Ingestion][WIP] Introduce the new pull-based ingestion engine, APIs, and Kafka plugin #16958

Status: Open. Wants to merge 24 commits into base: main.
Conversation

@yupeng9 commented Jan 6, 2025

Description

This PR implements the basics of the pull-based ingestion described in this RFC, including:

  1. The APIs for the pull-based ingestion source
  2. A Kafka plugin that implements the ingestion source API
  3. A new IngestionEngine that pulls data from the ingestion sources

This is currently a WIP; a few improvements remain to be made and test coverage needs to be increased.
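The three pieces above fit together as a pull loop: a poller repeatedly asks the ingestion source (e.g. the Kafka consumer) for a batch of messages and hands each one to the engine for indexing. The following is a conceptual, self-contained sketch of that loop only, not the PR's actual API; all names here are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class PullLoopSketch {
    // Drain the source by polling batches until an empty batch is returned,
    // passing every message to the indexer. Returns the number indexed.
    static int drain(Supplier<List<String>> source, Consumer<String> indexer) {
        int indexed = 0;
        List<String> batch;
        while (!(batch = source.get()).isEmpty()) {
            for (String doc : batch) {
                indexer.accept(doc);
                indexed++;
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        // Two non-empty batches followed by an empty batch that ends the loop.
        List<List<String>> batches = new ArrayList<>(List.of(
            List.of("{\"id\":1}", "{\"id\":2}"), List.of("{\"id\":3}"), List.<String>of()));
        int n = drain(() -> batches.remove(0), doc -> {});
        System.out.println(n); // 3
    }
}
```

In the real engine, the source would be a Kafka consumer and the indexer would write into Lucene via the engine, but the batch-poll-and-apply shape is the same.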

Related Issues

Resolves #16927 #16929 #16928

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing labels Jan 6, 2025
Contributor

github-actions bot commented Jan 6, 2025

❌ Gradle check result for 16dd9d0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Comment on lines +126 to +140
new Translog.Snapshot() {
@Override
public void close() {}

@Override
public int totalOperations() {
return 0;
}

@Override
public Translog.Operation next() {
return null;
}
}
);
Collaborator
Maybe create a static EMPTY_TRANSLOG_SNAPSHOT and reuse across this and NoOpEngine
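The suggestion is to share one immutable empty snapshot rather than building a fresh anonymous instance per engine. A minimal self-contained sketch of that pattern, using a stand-in `Snapshot` interface in place of OpenSearch's `Translog.Snapshot`:

```java
import java.io.Closeable;

public class EmptySnapshotSketch {
    // Stand-in for Translog.Snapshot; Object stands in for Translog.Operation.
    interface Snapshot extends Closeable {
        int totalOperations();
        Object next();
    }

    // A single stateless instance is safe to share across engines
    // (e.g. IngestionEngine and NoOpEngine), since it holds no resources.
    static final Snapshot EMPTY_TRANSLOG_SNAPSHOT = new Snapshot() {
        @Override public void close() {}
        @Override public int totalOperations() { return 0; }
        @Override public Object next() { return null; }
    };

    public static void main(String[] args) {
        System.out.println(EMPTY_TRANSLOG_SNAPSHOT.totalOperations()); // 0
    }
}
```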

Comment on lines +147 to +151
String clientId = engineConfig.getIndexSettings().getNodeName()
+ "-"
+ engineConfig.getIndexSettings().getIndex().getName()
+ "-"
+ engineConfig.getShardId().getId();
Collaborator

Should we use IDs instead of names, e.g. index UUID, node ID, etc.?
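The point of the comment is that stable identifiers (node ID, index UUID, shard ID) are safer than display names, which can collide or be reused. A hypothetical, self-contained sketch of building the client id that way (all values are made up for illustration):

```java
public class ClientIdSketch {
    // Build a Kafka client id from stable identifiers rather than the
    // node name and index name, which are not guaranteed unique over time.
    static String clientId(String nodeId, String indexUuid, int shardId) {
        return String.join("-", nodeId, indexUuid, Integer.toString(shardId));
    }

    public static void main(String[] args) {
        // Hypothetical values for illustration only.
        System.out.println(clientId("node-abc123", "Hk9QyTz1R0eX", 0));
    }
}
```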

Collaborator

@Bukhtawar left a comment

Curious how the FGAC security model would work, especially with the security plugin, which intercepts transport actions to validate whether authorized users can perform bulk actions on certain indices. Is the intent to handle permissions at the Kafka partition level?
Another aspect is maintaining Kafka checkpoints durably. I have yet to read that part, but it would be good to understand how failovers and recoveries are handled.

*
* @opensearch.api
*/
public interface IngestionConsumerPlugin {
Member

Let's put the @ExperimentalApi annotation on this as well
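The requested change is just to mark the interface experimental. A self-contained sketch of what that looks like, with a stand-in annotation in place of OpenSearch's `@ExperimentalApi` (whose real definition lives in the OpenSearch codebase):

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class ExperimentalApiSketch {
    // Stand-in for OpenSearch's @ExperimentalApi marker annotation.
    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.TYPE, ElementType.METHOD})
    @interface ExperimentalApi {}

    // The plugin interface from the PR, now marked experimental so
    // consumers know the contract may change without notice.
    @ExperimentalApi
    interface IngestionConsumerPlugin {
        // plugin methods elided
    }

    public static void main(String[] args) {
        System.out.println(
            IngestionConsumerPlugin.class.isAnnotationPresent(ExperimentalApi.class)); // true
    }
}
```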

*/

/** Indices ingestion module package. */
package org.opensearch.indices.ingest;
Member

The term "ingest" is definitely overloaded. _bulk is a type of ingestion, there are ingest pipelines, etc. I'd suggest using polling.ingest or pollingingest or anything else that helps disambiguate this area of the code from the ingest related pieces.

/**
 * Start the poller
 */
void start();
Member

We do have the LifecycleComponent interface (and an abstract implementation). I don't know if that would be useful here but please take a look if you hadn't considered extending it.
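For context, OpenSearch's `LifecycleComponent`/`AbstractLifecycleComponent` give components template methods for start/stop rather than ad-hoc lifecycle methods on each interface. A self-contained sketch of that shape applied to a poller, with stand-in classes (the real OpenSearch types also manage a close phase and lifecycle state checks):

```java
public class LifecycleSketch {
    // Minimal stand-in for AbstractLifecycleComponent's template pattern:
    // public final start/stop drive protected hooks that subclasses fill in.
    static abstract class AbstractLifecycle {
        private volatile boolean started;
        public final void start() { doStart(); started = true; }
        public final void stop() { doStop(); started = false; }
        public boolean isStarted() { return started; }
        protected abstract void doStart();
        protected abstract void doStop();
    }

    // A stream poller implements the hooks instead of declaring its own
    // start()/stop() on a bespoke interface.
    static class StreamPoller extends AbstractLifecycle {
        @Override protected void doStart() { /* begin polling loop */ }
        @Override protected void doStop() { /* signal loop to exit */ }
    }

    public static void main(String[] args) {
        StreamPoller poller = new StreamPoller();
        poller.start();
        System.out.println(poller.isStarted()); // true
        poller.stop();
    }
}
```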

private final TranslogManager translogManager;
private final DocumentMapperForType documentMapperForType;
private final IngestionConsumerFactory ingestionConsumerFactory;
protected StreamPoller streamPoller;
Member

It looks like streamPoller is assigned in the constructor and never accessed outside this class. Why is it not private final?

}

versions << [
'kafka': '2.8.2',
Member

This looks quite old (September 2022 according to https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients). Why not use the newest available?

Development

Successfully merging this pull request may close these issues.

[Feature Request] Pull-based ingestion source APIs
3 participants