Add Opensearch embedding store implementation #213

sebastienblanc · 2024-01-09T07:33:12Z

Maybe @sboeckelmann can also review this since you are the author of the opensearch extension (and also added support in langchain4j)

formatting and doc

geoand · 2024-01-09T07:37:17Z

...ch/runtime/src/main/java/io/quarkiverse/langchain4j/opensearch/OpenSearchEmbeddingStore.java

+import software.amazon.awssdk.http.SdkHttpClient;
+import software.amazon.awssdk.http.apache.ApacheHttpClient;


Do we really have to use these clients?

I would much rather use the Quarkus HTTP stack since it leads to lower resource usage while also guaranteeing things will work in native mode.

Another option would be the JDK's HttpClient

yeah I need to dive a bit into this, tbh, this class is just a copy/paste from langchain4j, I just replaced the logger for now.

Except that the JDK one can leak connection - so I would rather use the quarkus one.

So this is to connect to an AWS OpenSearch server, the opensearch extension also uses those https://github.com/quarkiverse/quarkus-opensearch/blob/main/opensearch-java-client/runtime/src/main/java/io/quarkiverse/opensearch/client/runtime/OpenSearchTransportHelper.java , this is where it would be nice to have @sboeckelmann feedback

Do we really have to use these clients?

I would much rather use the Quarkus HTTP stack since it leads to lower resource usage while also guaranteeing things will work in native mode.

If you mean it would lead to not duplicating thread pools, sure, but I'd like to mention Apache HTTP clients have an async/reactive implementation as well, so you don't necessarily need to rely on blocking code.

FWIW in Hibernate Search we just use the Elasticsearch low-level client (a thin wrapper on top of Apache HTTP client -- the async one) to connect to either Elasticsearch and OpenSearch, and have a custom Apache HTTP plugin based on the AWS SDK to support signing. That way we handle it all with a single impl: Elasticsearch or OpenSearch, AWS or non-AWS. Though I'll admit it feels a bit messy. Another advantage is that AWS request signing in that case is well-tested (it's harder than you'd think, especially with chunked requests).

I would recommend that you rely on the existing quarkus-elasticsearch-rest-client extension but... then you won't get AWS signing (since the only support we have right now is specific to Hibernate Search).

Eventually we'll need a more consolidated approach for the Elasticsearch low-level client. quarkusio/quarkus#26991 will get us one step closer, as it will force me to move AWS signing to something not specific to Hibernate Search.

We could also consider dropping the Elasticsearch low-level client and shipping a low-level client of our own based on the Quarkus HTTP stack, but I really would only do this if we can agree to use it on all extensions related to Elasticearch/OpenSearch, and if we can afford the API break.

If you mean it would lead to not duplicating thread pools, sure, but I'd like to mention Apache HTTP clients have an async/reactive implementation as well, so you don't necessarily need to rely on blocking code

I'm more thinking about the impact loading 100s of classes of yet another HTTP library, than the impact of another thread pool which is smaller (but of course ideally would not exist)

geoand · 2024-01-09T07:38:09Z

@yrodiere might also be interested in this :)

docs/modules/ROOT/pages/opensearch-store.adoc

opensearch/deployment/pom.xml

opensearch/runtime/pom.xml

cescoffier · 2024-01-09T07:39:34Z

...ch/runtime/src/main/java/io/quarkiverse/langchain4j/opensearch/OpenSearchEmbeddingStore.java

+import software.amazon.awssdk.http.SdkHttpClient;
+import software.amazon.awssdk.http.apache.ApacheHttpClient;


Except that the JDK one can leak connection - so I would rather use the quarkus one.

docs/modules/ROOT/pages/opensearch-store.adoc

Co-authored-by: Clement Escoffier <clement.escoffier@gmail.com>

yrodiere

Thanks for pinging me @geoand, here are my two cents.

As explained below, consolidating our approach to connecting to OpenSearch/Elasticsearch is certainly going to be an issue.

In the long run (not possible right now), I wonder if we wouldn't want to use Hibernate Search in standalone mode (not yet supported in Quarkus, see quarkusio/quarkus#26182 ) for this store and the Elasticsearch one?

Hibernate Search supports connecting to multiple versions of Elasticsearch/OpenSearch transparently, and it has built-in support for mass indexing, search (full-text, spatial, ... and vector in 7.1). Perhaps more interesting, it supports keeping the Elasticsearch/OpenSearch index in sync with a primary source of truth (relational database) without reindexing everything all the time. What it does not (and will not) support is transforming content into embeddings (vectors), so that part would certainly still be langchain4j's job.

Or perhaps we'd just want to keep this embedding store "pristine", and simply have the Hibernate Search extension provide an integration with Lang4j to transform. I.e. expect people to either use the langchain4j API directly without Hibernate Search, or the Hibernate Search API which would delegate some things to langchain4j? At this point I'm not sure what's best, we'd need to discuss it.

yrodiere · 2024-01-09T09:30:24Z

...ch/runtime/src/main/java/io/quarkiverse/langchain4j/opensearch/OpenSearchEmbeddingStore.java

+import software.amazon.awssdk.http.SdkHttpClient;
+import software.amazon.awssdk.http.apache.ApacheHttpClient;


Do we really have to use these clients?

I would much rather use the Quarkus HTTP stack since it leads to lower resource usage while also guaranteeing things will work in native mode.

If you mean it would lead to not duplicating thread pools, sure, but I'd like to mention Apache HTTP clients have an async/reactive implementation as well, so you don't necessarily need to rely on blocking code.

FWIW in Hibernate Search we just use the Elasticsearch low-level client (a thin wrapper on top of Apache HTTP client -- the async one) to connect to either Elasticsearch and OpenSearch, and have a custom Apache HTTP plugin based on the AWS SDK to support signing. That way we handle it all with a single impl: Elasticsearch or OpenSearch, AWS or non-AWS. Though I'll admit it feels a bit messy. Another advantage is that AWS request signing in that case is well-tested (it's harder than you'd think, especially with chunked requests).

I would recommend that you rely on the existing quarkus-elasticsearch-rest-client extension but... then you won't get AWS signing (since the only support we have right now is specific to Hibernate Search).

Eventually we'll need a more consolidated approach for the Elasticsearch low-level client. quarkusio/quarkus#26991 will get us one step closer, as it will force me to move AWS signing to something not specific to Hibernate Search.

We could also consider dropping the Elasticsearch low-level client and shipping a low-level client of our own based on the Quarkus HTTP stack, but I really would only do this if we can agree to use it on all extensions related to Elasticearch/OpenSearch, and if we can afford the API break.

sboeckelmann · 2024-01-09T09:47:52Z

don't use the old rest-client, those are deprecated.
The AWS2 SDK unfortunately needs to have the Apache HTTP Client stack being setup. Use the new Async Java Client

yrodiere · 2024-01-09T09:50:31Z

...ch/runtime/src/main/java/io/quarkiverse/langchain4j/opensearch/OpenSearchEmbeddingStore.java

+import dev.langchain4j.store.embedding.EmbeddingStore;
+import io.quarkus.logging.Log;
+import software.amazon.awssdk.http.SdkHttpClient;
+import software.amazon.awssdk.http.apache.ApacheHttpClient;


Ha, it just occurred to me that by bypassing the existing quarkus-elasticsearch-rest-client extension, you're also bypassing the dev service support for OpenSearch... Might be a bit annoying?

Indeed, very good point!

yeah IMO OpenSearch dev service is a "must have" (and it's also used by the integration test)

And @sboeckelmann mentioned that the rest client is deprecated.
So, what is the conclusion for this part ? Do we keep it as is for now ?

And @sboeckelmann mentioned that the rest client is deprecated.

Only the one being used in this PR. There's another one with async/reactive support in Apache HTTP Client, and I'd be surprised if the AWS SDK didn't provide a wrapper around that, too.

Anyway, here's my suggestion: I'd recommend relying on quarkus-elasticsearch-rest-client. It is quite low level (you'll have to write JSON), but it's consistent with what we expose in other Elasticsearch/OpenSearch extensions, and using it will get you dev services for (almost) free. Almost, because you'll probably still need to use a build item somewhere to let the Dev Services know you target OpenSearch, not Elasticsearch: see DevservicesElasticsearchBuildItem.

Users would have to add AWS request signing manually for now, though. We can add built-in support to that client, though ideally we'd put that in https://github.com/quarkiverse/quarkus-amazon-services somewhere.

FWIW, you can find a relatively simple implementation of that stuff here (the entry point is ElasticsearchAwsHttpClientConfigurer); it's LGPL, but I'm the author, and I'm hereby licensing this code to anyone interested as ASL2.

I'd personally suggest dropping AWS request signing for now and adding separately, but it's your call.

If nobody works on AWS request signing for quarkus-elasticsearch-rest-client, I'll eventually have to do it when dealing with quarkusio/quarkus#26991, but I can't guarantee this will happen soon.

@yrodiere Thanks for all the pointers ! TBH, I have currently no free cycles to refactor this to be based on quarkus-elasticsearch-rest-client , if anyone else wants to pick this up. In the mean time, this current impl is working so maybe we can open a issue to refactor this later? wdyt @geoand ?

I'm afraid that if we do leave it as is, we'll never get around to improving it :).

yrodiere · 2024-01-09T09:53:35Z

...ch/runtime/src/main/java/io/quarkiverse/langchain4j/opensearch/OpenSearchEmbeddingStore.java

+        List<String> ids = embeddings.stream()
+                .map(ignored -> randomUUID())
+                .collect(toList());
+        addAllInternal(ids, embeddings, embedded);


If it's not already handled by the caller, you might want to limit the number of documents you add to your bulk request.

I have no idea of the size and number of documents involved here, but I know that for more than a few hundred large documents on a low-RAM install of OpenSearch (e.g. what we provide with Dev Services, ~1GB heap), the request buffer of OpenSearch/Elasticsearch can easily get overloaded and then OpenSearch/Elasticsearch will start rejecting requests.

In Hibernate Search we have a configurable amount of parallel queues, each sending at most one bulk request at a time with a maximum number of documents in each bulk request (see https://github.com/quarkusio/search.quarkus.io/blob/8accfb39e06372805ce41d108a927e00828b1e25/src/main/resources/application.properties#L52C1-L56). This helps users tune their set up to avoid problems.

That being said, if you know your documents are small and the number of documents is reasonable (i.e. not millions of documents), you might be able to do away with such complexity.

Most of the time it should not be that many documents and their size should be pretty small but we can add maybe a "nice to have" for this.

cescoffier · 2024-11-17T15:22:40Z

@geoand what should we do about this one? The HTTP client issue is concerning.

geoand · 2024-11-17T16:10:56Z

I think we can close this

sboeckelmann · 2024-11-17T16:57:23Z

I was hoping for the apache httpclient 5 to become officially supported by Quarkus some day.
OpenSearch depends on httpclient5, but Quarkus only provides for version 4.

So in my opinion there are 2 options:

Have Opensearch use httpclient 4 (which is not possible)
Have general httpclient5 support within Quarkus

geoand · 2024-11-17T17:14:41Z

I was hoping for the apache httpclient 5 to become officially supported by Quarkus some day

That is highly unlikely unfortunately as it would be a suboptimal use of our (limited) resources

sboeckelmann · 2024-11-17T19:32:51Z

That is highly unlikely unfortunately as it would be a suboptimal use of our (limited) resources

Yes, exactly, in addition to that, I think having support for httpclient5 within Quarkus would not really make much of a difference either

I actually don't see such a big isse with the httpclient5 dependency within opensearch. It's not nice to have two artifacts doing the same thing - but they are totally separate and independent and except for bloating things up with additional packages, it's not really doing any harm:
the one included in Quarkus has the artifactId httpclient, the other one's artifactId is httpclient5

cescoffier · 2024-11-24T16:13:53Z

@sboeckelmann There is a bit more to the story.

We don't know if it works in native OOTB. Generally, when we have extensions, it's because we have to tune things.
Also, it would not follow the underlying network architecture of Quarkus, which will likely miss context propagation, security propagation, and so on.

sboeckelmann · 2024-11-24T16:30:57Z

@sboeckelmann There is a bit more to the story.

I really understand that. But like mentioned above: there's no httpcient 4 support in opensearch. The only option I can think of is moving the TranportProducer for httpclient5 out into an optional additional artifact. AWS should be fine.

cescoffier · 2024-11-24T17:03:07Z

Even httpclient4 is not great. Is there any possibility of plugging an HTTP client managed by the framework itself?

sboeckelmann · 2024-11-25T08:49:21Z

Even httpclient4 is not great. Is there any possibility of plugging an HTTP client managed by the framework itself?

Thank you @cescoffier ! I’m really happy to see that the discussion has been picked up again and is moving forward.

Regarding the possibility of plugging in an HTTP client managed by the framework: I can definitely look into this. It would be very helpful if you could provide a hint or reference to where this has already been implemented.
This would greatly assist in identifying what needs to be done and how best to support this feature in a clean and effective way.

geoand · 2024-11-25T08:53:56Z

What @cescoffier is referring is what for example what the Kubernetes Client or Testcontainers do, where the is an HTTP related SPI that various modules (like OkHttp, Vertx Http Client, JDK Http Client etc) can implement.

sboeckelmann · 2024-11-25T08:58:25Z

What @cescoffier is referring is what for example what the Kubernetes Client or Testcontainers do, where the is an HTTP related SPI that various modules (like OkHttp, Vertx Http Client, JDK Http Client etc) can implement.

Can you pinpoint me to where this was done in Kubernetes Client and what the SPI looks like over there? Would be nice to be able to follow some sort of blueprint.

geoand · 2024-11-25T09:01:19Z

You can start looking at io.fabric8.kubernetes.client.http.HttpClient . FWIW, the Kubernetes Client probably does more than most other such integrations would need.

cescoffier · 2024-12-09T07:45:21Z

@sboeckelmann, any update on your side?

sboeckelmann · 2024-12-09T08:01:29Z

@sboeckelmann, any update on your side?

I am planning on providing httpclient5 and aws as separate HttpTransport provider modules.
This should resolve the problem (or at least provide requried flexibitliy).
But I won't be able to work on it before the beginning of next year.

adding opensearch

2e88bf5

formatting and doc

sebastienblanc requested a review from a team as a code owner January 9, 2024 07:33

geoand changed the title ~~adding opensearch as embedding store~~ Add Opensearch embedding store implementation Jan 9, 2024

geoand reviewed Jan 9, 2024

View reviewed changes

cescoffier requested changes Jan 9, 2024

View reviewed changes

Apply suggestions from code review

c4dd60e

Co-authored-by: Clement Escoffier <clement.escoffier@gmail.com>

yrodiere reviewed Jan 9, 2024

View reviewed changes

extracting to property, doc update

b5afbd2

sebastienblanc mentioned this pull request Mar 10, 2024

Elasticsearch integration #357

Open

		import software.amazon.awssdk.http.SdkHttpClient;
		import software.amazon.awssdk.http.apache.ApacheHttpClient;

Add Opensearch embedding store implementation #213

Are you sure you want to change the base?

Add Opensearch embedding store implementation #213

Conversation

sebastienblanc commented Jan 9, 2024

Choose a reason for hiding this comment

geoand Jan 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yrodiere Jan 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

geoand commented Jan 9, 2024

Choose a reason for hiding this comment

yrodiere left a comment • edited Loading

Choose a reason for hiding this comment

yrodiere Jan 9, 2024 • edited Loading

Choose a reason for hiding this comment

sboeckelmann commented Jan 9, 2024

Choose a reason for hiding this comment

geoand Jan 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yrodiere Jan 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cescoffier commented Nov 17, 2024

geoand commented Nov 17, 2024

sboeckelmann commented Nov 17, 2024

geoand commented Nov 17, 2024

sboeckelmann commented Nov 17, 2024 • edited Loading

cescoffier commented Nov 24, 2024

sboeckelmann commented Nov 24, 2024

cescoffier commented Nov 24, 2024

sboeckelmann commented Nov 25, 2024

geoand commented Nov 25, 2024

sboeckelmann commented Nov 25, 2024

geoand commented Nov 25, 2024 • edited Loading

cescoffier commented Dec 9, 2024

sboeckelmann commented Dec 9, 2024

geoand Jan 9, 2024 •

edited

Loading

yrodiere Jan 9, 2024 •

edited

Loading

yrodiere left a comment •

edited

Loading

yrodiere Jan 9, 2024 •

edited

Loading

geoand Jan 9, 2024 •

edited

Loading

yrodiere Jan 9, 2024 •

edited

Loading

sboeckelmann commented Nov 17, 2024 •

edited

Loading

geoand commented Nov 25, 2024 •

edited

Loading