
[Feature Request] Make batch ingestion automatic, not a parameter on _bulk #14283

Open
andrross opened this issue Jun 13, 2024 · 7 comments · Fixed by opensearch-project/neural-search#852
Labels: enhancement, Indexing

Comments

@andrross (Member) commented Jun 13, 2024

Is your feature request related to a problem? Please describe

A new batch method was added to the o.o.ingest.Processor interface in #12457 that allows ingest processors to operate on multiple documents simultaneously, instead of one by one. For certain processors, this can allow for much faster and more efficient processing. However, a new batch_size parameter was added to the _bulk API with a default value of 1. This means that in order to benefit from batch processing in any of my ingest processors, I have to do at minimum two things: determine how many documents to include in my _bulk request, and determine the optimal value to set for this batch_size parameter. I must also change all of my ingestion tooling to support this new batch_size parameter and to specify it.
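For readers unfamiliar with the change, here is a rough, self-contained sketch of the pattern introduced in #12457. The names and signatures below (Processor, Document, batchExecute) are simplified stand-ins for illustration, not the actual o.o.ingest API:

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for an ingest document; the real class carries the
// document source, metadata, etc.
class Document {
    private Exception failure;
    void markFailed(Exception e) { this.failure = e; }
    boolean failed() { return failure != null; }
}

// Simplified model of a batch-capable processor interface.
interface Processor {
    // Existing single-document path.
    Document execute(Document doc) throws Exception;

    // New batch path. The default implementation just falls back to calling
    // execute() one document at a time, so a processor that never overrides
    // this method behaves exactly as it did before batching existed.
    default void batchExecute(List<Document> docs, Consumer<List<Document>> handler) {
        for (Document doc : docs) {
            try {
                execute(doc);
            } catch (Exception e) {
                doc.markFailed(e);
            }
        }
        handler.accept(docs);
    }
}
```

The property that matters for this issue is the default fallback: only processors that actually override the batch path change behavior when they receive more than one document at once.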

Describe the solution you'd like

I want the developers of my ingestion processors to determine good defaults for how they want to handle batches of documents, and then I can see increased performance with no change to my ingestion tooling by simply updating to the latest software. I acknowledge I may still have to experiment with finding the optimal number of documents to include in each _bulk request (this is the status quo for this API and not specific to ingest processors). Also, certain ingest processors may define expert-level configuration options to further optimize if necessary, but I expect the defaults to work well most of the time and to almost always be better than the performance I saw before batching was implemented.

Related component

Indexing

Additional context

I believe this can be implemented as follows:

  • [required] Increase the default value of batch_size from 1 to Integer.MAX_VALUE. This means that by default the entire content of my bulk request will be passed to each ingest processor. However, the default implementation of batchExecute just operates on one document at a time, so unless my ingest processor is updated to leverage the new batchExecute method, I will see exactly the same behavior that existed previously (see the sketch after this list).
  • [optional] Emit a deprecation warning if batch_size is specified in the _bulk API
  • [optional] Remove the functionality of the batch_size parameter in the _bulk API on main
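As a rough illustration of the first bullet, a hypothetical partitioning helper (not the actual IngestService code) shows why a default of Integer.MAX_VALUE effectively means "hand the whole _bulk payload to the processor in one call", while an explicit batch_size still splits it:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class BatchPartitioner {

    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < docs.size(); from += batchSize) {
            int to = Math.min(from + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> bulk = List.of(1, 2, 3, 4, 5);
        // Default (effectively unbounded): one batch containing everything.
        System.out.println(partition(bulk, Integer.MAX_VALUE)); // [[1, 2, 3, 4, 5]]
        // Explicit batch_size=2: the request is cut into sub-batches.
        System.out.println(partition(bulk, 2));                 // [[1, 2], [3, 4], [5]]
    }
}
```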

Additional discussion exists on the original RFC starting here: #12457 (comment)

@andrross added the enhancement and untriaged labels Jun 13, 2024
@github-actions bot added the Indexing label Jun 13, 2024
@dbwiddis (Member) commented:

Makes a lot of sense, @andrross! I support this.

@navneet1v (Contributor) commented Jun 14, 2024

+1 on the feature request.

@dblock (Member) commented Jun 14, 2024

Semantically speaking, because it's not a max, I think batch_size should be unspecified (null) by default, not Integer.MAX_VALUE. If a user specifies it in the API call, then use that.

@andrross (Member, Author) commented:
> I think batch_size should be unspecified (null) by default

@dblock Agreed, Integer.MAX_VALUE is just a work-around as it is practically the same as "unspecified", but you're right.


I want to work through a use case to illustrate why I think this needs to change:

Let's say I'm the OpenSearch admin for an organization that uses OpenSearch for log analytics. I have about a dozen different teams writing logs into the cluster I manage. I'm using an ingestion feature that can benefit from batching. The experience I want is that I can test out the new changes, run some performance tests, and then set a "batch_size" parameter in my system configuration that optimizes throughput and latency for my use case. Then I can deploy these changes and all my users see a performance improvement.

I do not want to have to chase down a dozen teams, force them to deploy an update to the tool they use to send _bulk requests to the cluster, and get them to configure a good "batch_size". I do not want to repeat this exercise if I change the host type or anything else that results in a different optimal setting for "batch_size".

@dblock (Member) commented Jun 14, 2024

@andrross A different default batch_size per analyzer makes sense. It's also possible that you'll want a max value per analyzer as well.

@chishui (Contributor) commented Jun 21, 2024

Regarding a default batch size other than 1: different ML servers have different maximum batch size limitations. For example, Cohere requires the batch size to be less than 96 and OpenAI requires it to be less than 2048. Exceeding the limit could lead to data loss. I'm afraid that enabling a default batch size other than 1 can't suit all users' situations. Additionally, we have already added batchExecute support in two neural-search processors, text_embedding and sparse_encoding, so changing the default batch_size value alone could impact users immediately.

Based on your suggestion, I think we could do the following:

  1. Add a batch_size parameter to the text_embedding and sparse_encoding processors, and set it to 1 by default (it seems we can't make it a required parameter, as that would be backward incompatible). Another option is to set the parameter in AbstractProcessor, similar to tag and description; it would be ignored by most processors, but we could make the batch-cutting logic the default behavior for every ingest processor that has the batch_size parameter set.
  2. Change the default (or unspecified) behavior of the _bulk API's batch_size to include all bulk documents.
  3. Mark the batch_size parameter of the _bulk API as deprecated and start the deprecation process.

This approach won't give users a performance gain automatically either, due to the limitations mentioned in the first paragraph. For users who haven't started to use bulk with batch_size, they'll see no change. For users who have already started to use bulk with batch_size, they'll see a performance degradation (falling back to the behavior without batch_size), but there should be no data loss. And once they update their processors with batch_size, they can see the performance improvement again.
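A minimal sketch of step 1, assuming a hypothetical per-connector limit (the 96 and 2048 limits mentioned above are only examples); the config key handling and method names here are made up for illustration, not the actual processor factory code:

```java
import java.util.Map;

// Illustrative only: a per-processor "batch_size" option that defaults to 1
// (today's behavior) and is validated against whatever maximum the downstream
// ML connector imposes.
final class BatchSizeSetting {

    static int readBatchSize(Map<String, Object> processorConfig, int connectorMaxBatchSize) {
        Object raw = processorConfig.getOrDefault("batch_size", 1);
        int batchSize = ((Number) raw).intValue();
        if (batchSize < 1 || batchSize > connectorMaxBatchSize) {
            throw new IllegalArgumentException(
                "batch_size must be between 1 and " + connectorMaxBatchSize + ", got " + batchSize);
        }
        return batchSize;
    }

    public static void main(String[] args) {
        // Unset: falls back to 1, so existing pipelines behave as before.
        System.out.println(readBatchSize(Map.of(), 96));                  // 1
        // Explicitly configured by the pipeline author.
        System.out.println(readBatchSize(Map.of("batch_size", 32), 96));  // 32
    }
}
```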

@andrross (Member, Author) commented Jun 21, 2024

Thanks @chishui. Your suggestions above are exactly what I've had in mind here. I realize some of what I suggested above may be unclear. I never intended to suggest that the effective behavior of the sub-batching in the neural-search processors should change. The key point is that the sub-batching of documents into batches smaller than what was provided in the original _bulk request should happen inside the processors and not at the IngestService layer.

> Another option is to set the parameter in AbstractProcessor

If this sub-batching behavior is a common pattern, we could certainly create an AbstractBatchingProcessor that defines a batch size parameter and even implements the logic to break the list from _bulk into smaller batches. I'd recommend starting with this pattern inside the neural-search plugin to be shared across its processors, and then we can promote the base class up to a common location if it proves useful and reusable.
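A rough sketch of what such an AbstractBatchingProcessor could look like, reusing the simplified Processor/Document types from the sketch earlier in this thread; the real base class added in neural-search may differ substantially:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch only: cuts the documents handed over by the _bulk request into
// sub-batches of at most batchSize and delegates each one to the concrete
// processor.
abstract class AbstractBatchingProcessor implements Processor {

    private final int batchSize;

    protected AbstractBatchingProcessor(int batchSize) {
        this.batchSize = batchSize;
    }

    @Override
    public void batchExecute(List<Document> docs, Consumer<List<Document>> handler) {
        for (int from = 0; from < docs.size(); from += batchSize) {
            int to = Math.min(from + batchSize, docs.size());
            subBatchExecute(new ArrayList<>(docs.subList(from, to)));
        }
        handler.accept(docs);
    }

    // Concrete processors (e.g. text_embedding, sparse_encoding) implement the
    // batched work here, such as one inference call per sub-batch.
    protected abstract void subBatchExecute(List<Document> subBatch);

    @Override
    public Document execute(Document doc) {
        subBatchExecute(new ArrayList<>(List.of(doc)));
        return doc;
    }
}
```

This keeps the sub-batching entirely inside the processor layer, which is the key point of the comment above: the IngestService hands over everything it received and each processor decides how to chunk it.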
