
[DISCUSS] Documentation: ingestion process #16914

Open
pkovac2 opened this issue Dec 27, 2024 · 1 comment
Labels
bug Something isn't working Other untriaged

Comments

@pkovac2

pkovac2 commented Dec 27, 2024

Ingest process information missing

Hey all,

we recently deployed multiple instances of OpenSearch via the K8s operator. We are fairly new to OpenSearch, so we're trying to understand how things work under the hood so that we can investigate properly in case of problems. At the moment we're trying to understand how the data ingestion process works internally in OpenSearch. Unfortunately, the OpenSearch documentation says literally nothing about how the data ingestion process is handled in detail. Our current setup looks like this (we don't use Data Prepper):

FluentBit -> OpenSearch ingest service LB -> Dedicated ingest nodes -> ?? (master-> data nodes)

What we are trying to understand is:

  1. Do we need dedicated ingest nodes if there's no ingest pipeline configured? Based on the docs, dedicated ingest nodes are only useful for running ingest pipelines; is there anything else dedicated ingest nodes do?
  2. How does the ingest process work in general? Could this be described in detail in the official documentation? We'd like to understand the whole process. Let's say we have an OS cluster with:

3 dedicated cluster manager nodes
3 dedicated ingest nodes
x dedicated data nodes

How does the ingest flow look once the data is received? I'd assume it's something like:
ingest node -> ask master node which data node(s) to use -> data node(s).

But this is not described in the documentation at all (nor is who decides which data nodes to use, or how).
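For what it's worth, the "which data node" decision is usually described as deterministic hash-based routing rather than a per-document query to the cluster manager. A minimal sketch, assuming the classic formula `shard = hash(_routing) % num_primary_shards` (the real implementation uses a Murmur3 hash of the `_routing` value, which defaults to the document `_id`; `hashlib.md5` here is just a stand-in):

```python
# Simplified sketch of how a document is mapped to a primary shard.
# NOTE: OpenSearch actually uses a Murmur3 hash of the _routing value;
# md5 is only a stand-in to keep this example dependency-free.
import hashlib

def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Return the primary shard index that would receive this document."""
    h = int.from_bytes(hashlib.md5(doc_id.encode()).digest()[:4], "big")
    return h % num_primary_shards

# Any node holding the cluster state can run this computation locally,
# so the coordinating/ingest node forwards the document directly to the
# data node hosting that primary shard; the cluster manager is not
# consulted per document.
for doc_id in ["log-1", "log-2", "log-3"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

The key point the sketch illustrates: routing is a pure function of the routing value and the shard count, which is also why the number of primary shards cannot be changed on an existing index without reindexing.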

Also, based on the official docs, every node is a coordinating node unless dedicated coordinating nodes are specified. How do we measure, or on what basis do we decide, whether dedicated coordinating nodes are necessary?

I think this issue can also be considered a documentation request.

Many thanks!

Related component

Other

To Reproduce

  1. Visit https://opensearch.org/docs/latest/
  2. Search for ingestion-process-related docs, e.g. https://opensearch.org/docs/latest/observing-your-data/log-ingestion/
  3. No detailed information is found

Expected behavior

  1. Visit https://opensearch.org/docs/latest/
  2. The ingestion process is documented in detail

Additional Details

No response

@pkovac2 pkovac2 added bug Something isn't working untriaged labels Dec 27, 2024
@github-actions github-actions bot added the Other label Dec 27, 2024
@kkewwei
Contributor

kkewwei commented Jan 5, 2025

@pkovac2.

  1. If there's no ingest pipeline configured, ingest nodes are not needed.

  2. The ingest process works as follows:

  • The coordinating node receives the document with a pipeline and forwards the request to an ingest node.
  • The ingest node resolves the pipeline and builds a new request.
  • The ingest node classifies the documents by shard ID and then sends the requests out.
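The last step above can be sketched roughly as follows: after the pipeline's processors have run, the documents in a bulk request are bucketed by target shard so each data node receives a single sub-request. This is an illustrative sketch, not OpenSearch code; the stand-in hash replaces the Murmur3 routing hash used in practice:

```python
# Sketch: classify documents by shard ID and batch them per shard,
# as an ingest/coordinating node does before fanning out a bulk request.
# The hash is a stand-in; OpenSearch uses Murmur3 on the _routing value.
from collections import defaultdict
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    h = int.from_bytes(hashlib.md5(doc_id.encode()).digest()[:4], "big")
    return h % num_shards

def group_by_shard(doc_ids, num_shards):
    """Bucket document ids into per-shard sub-requests."""
    groups = defaultdict(list)
    for doc_id in doc_ids:
        groups[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(groups)

batches = group_by_shard(["a", "b", "c", "d", "e"], num_shards=3)
# Each key is a shard id; each value is the list of docs sent to the
# data node holding that shard's primary.
print(batches)
```

One sub-request per shard is what keeps a bulk index efficient: each data node applies its batch to the primary shard locally and replicates it, without any per-document coordination.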
