
[Pull-based Ingestion][WIP] Introduce the new pull-based ingestion engine, APIs, and Kafka plugin #16958

Status: Open. Wants to merge 24 commits into base: main.
Conversation

@yupeng9 commented Jan 6, 2025

Description

This PR implements the basics of the pull-based ingestion described in this RFC, including:

  1. The APIs for the pull-based ingestion source
  2. A Kafka plugin that implements the ingestion source API
  3. A new IngestionEngine that pulls data from the ingestion sources

This is currently a WIP; a few improvements remain to be made and test coverage needs to be increased.
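The three pieces above fit together as a pull loop: a poller repeatedly asks the ingestion source (e.g. the Kafka consumer) for a batch of messages and hands each one to the engine for indexing. The following is a conceptual, self-contained sketch of that loop only, not the PR's actual API; all names here are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class PullLoopSketch {
    // Drain the source by polling batches until an empty batch is returned,
    // passing every message to the indexer. Returns the number indexed.
    static int drain(Supplier<List<String>> source, Consumer<String> indexer) {
        int indexed = 0;
        List<String> batch;
        while (!(batch = source.get()).isEmpty()) {
            for (String doc : batch) {
                indexer.accept(doc);
                indexed++;
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        // Two non-empty batches followed by an empty batch that ends the loop.
        List<List<String>> batches = new ArrayList<>(List.of(
            List.of("{\"id\":1}", "{\"id\":2}"), List.of("{\"id\":3}"), List.<String>of()));
        int n = drain(() -> batches.remove(0), doc -> {});
        System.out.println(n); // 3
    }
}
```

In the real engine, the source would be a Kafka consumer and the indexer would write into Lucene via the engine, but the batch-poll-and-apply shape is the same.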

Related Issues

Resolves #16927 #16929 #16928

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing labels Jan 6, 2025
Contributor

github-actions bot commented Jan 6, 2025

❌ Gradle check result for 16dd9d0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Comment on lines +126 to +140
new Translog.Snapshot() {
@Override
public void close() {}

@Override
public int totalOperations() {
return 0;
}

@Override
public Translog.Operation next() {
return null;
}
}
);
Collaborator
Maybe create a static EMPTY_TRANSLOG_SNAPSHOT and reuse across this and NoOpEngine
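The suggestion is to share one immutable empty snapshot rather than building a fresh anonymous instance per engine. A minimal self-contained sketch of that pattern, using a stand-in `Snapshot` interface in place of OpenSearch's `Translog.Snapshot`:

```java
import java.io.Closeable;

public class EmptySnapshotSketch {
    // Stand-in for Translog.Snapshot; Object stands in for Translog.Operation.
    interface Snapshot extends Closeable {
        int totalOperations();
        Object next();
    }

    // A single stateless instance is safe to share across engines
    // (e.g. IngestionEngine and NoOpEngine), since it holds no resources.
    static final Snapshot EMPTY_TRANSLOG_SNAPSHOT = new Snapshot() {
        @Override public void close() {}
        @Override public int totalOperations() { return 0; }
        @Override public Object next() { return null; }
    };

    public static void main(String[] args) {
        System.out.println(EMPTY_TRANSLOG_SNAPSHOT.totalOperations()); // 0
    }
}
```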

Comment on lines +147 to +151
String clientId = engineConfig.getIndexSettings().getNodeName()
+ "-"
+ engineConfig.getIndexSettings().getIndex().getName()
+ "-"
+ engineConfig.getShardId().getId();
Collaborator

Should we use IDs instead of names, e.g. index UUID, node ID, etc.?
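The point of the comment is that stable identifiers (node ID, index UUID, shard ID) are safer than display names, which can collide or be reused. A hypothetical, self-contained sketch of building the client id that way (all values are made up for illustration):

```java
public class ClientIdSketch {
    // Build a Kafka client id from stable identifiers rather than the
    // node name and index name, which are not guaranteed unique over time.
    static String clientId(String nodeId, String indexUuid, int shardId) {
        return String.join("-", nodeId, indexUuid, Integer.toString(shardId));
    }

    public static void main(String[] args) {
        // Hypothetical values for illustration only.
        System.out.println(clientId("node-abc123", "Hk9QyTz1R0eX", 0));
    }
}
```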

Collaborator

@Bukhtawar left a comment

Curious how the FGAC security model would work, especially with the security plugin, which intercepts transport actions to validate whether authorized users can perform bulk actions on certain indices. Is the intent to handle permissions at the Kafka partition level?
Another aspect is maintaining Kafka checkpoints durably. I have yet to read that part, but it would be good to understand how failovers and recoveries are handled.

*
* @opensearch.api
*/
public interface IngestionConsumerPlugin {
Member

Let's put the @ExperimentalApi annotation on this as well
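The requested change is just to mark the interface experimental. A self-contained sketch of what that looks like, with a stand-in annotation in place of OpenSearch's `@ExperimentalApi` (whose real definition lives in the OpenSearch codebase):

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class ExperimentalApiSketch {
    // Stand-in for OpenSearch's @ExperimentalApi marker annotation.
    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.TYPE, ElementType.METHOD})
    @interface ExperimentalApi {}

    // The plugin interface from the PR, now marked experimental so
    // consumers know the contract may change without notice.
    @ExperimentalApi
    interface IngestionConsumerPlugin {
        // plugin methods elided
    }

    public static void main(String[] args) {
        System.out.println(
            IngestionConsumerPlugin.class.isAnnotationPresent(ExperimentalApi.class)); // true
    }
}
```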

*/

/** Indices ingestion module package. */
package org.opensearch.indices.ingest;
Member

The term "ingest" is definitely overloaded. _bulk is a type of ingestion, there are ingest pipelines, etc. I'd suggest using polling.ingest or pollingingest or anything else that helps disambiguate this area of the code from the ingest related pieces.

/**
 * Start the poller
 */
void start();
Member

We do have the LifecycleComponent interface (and an abstract implementation). I don't know if that would be useful here but please take a look if you hadn't considered extending it.
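For context, OpenSearch's `LifecycleComponent`/`AbstractLifecycleComponent` give components template methods for start/stop rather than ad-hoc lifecycle methods on each interface. A self-contained sketch of that shape applied to a poller, with stand-in classes (the real OpenSearch types also manage a close phase and lifecycle state checks):

```java
public class LifecycleSketch {
    // Minimal stand-in for AbstractLifecycleComponent's template pattern:
    // public final start/stop drive protected hooks that subclasses fill in.
    static abstract class AbstractLifecycle {
        private volatile boolean started;
        public final void start() { doStart(); started = true; }
        public final void stop() { doStop(); started = false; }
        public boolean isStarted() { return started; }
        protected abstract void doStart();
        protected abstract void doStop();
    }

    // A stream poller implements the hooks instead of declaring its own
    // start()/stop() on a bespoke interface.
    static class StreamPoller extends AbstractLifecycle {
        @Override protected void doStart() { /* begin polling loop */ }
        @Override protected void doStop() { /* signal loop to exit */ }
    }

    public static void main(String[] args) {
        StreamPoller poller = new StreamPoller();
        poller.start();
        System.out.println(poller.isStarted()); // true
        poller.stop();
    }
}
```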

private final TranslogManager translogManager;
private final DocumentMapperForType documentMapperForType;
private final IngestionConsumerFactory ingestionConsumerFactory;
protected StreamPoller streamPoller;
Member

It looks like streamPoller is assigned in the constructor and never accessed outside this class. Why is it not private final?

}

versions << [
'kafka': '2.8.2',
Member

This looks quite old (September 2022 according to https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients). Why not use the newest available?

Development

Successfully merging this pull request may close these issues.

[Feature Request] Pull-based ingestion source APIs
3 participants