
[Change Proposal] Support Knowledge Base Packages #693

Open
spong opened this issue Jan 18, 2024 · 19 comments · Fixed by #807
Labels: discuss (Issue needs discussion), Team:Ecosystem (Label for the Packages Ecosystem team)

Comments

spong (Member) commented Jan 18, 2024

Summary

This is a proposal, much like #346 or #351, to enable bundling static data stream data within a package, such that when the package is installed, the data stream is created and the bundled data is ingested into it.

The specific use case here is for shipping 'Knowledge Base' content for use by the Elastic Assistants. For example, both Security and Observability Assistants are currently bundling our ES|QL docs with the Kibana distribution for each release. We then take this data, optionally chunk it, and then embed/ingest it using ELSER into a 'knowledge base' data stream so the assistants can query it for their ES|QL query generation features. Each release we'll need to update this content, and ship it as part of the Kibana distribution, with no ability to ship intermediate content updates outside of the Kibana release cycle.

Additionally, as mentioned in #346 (comment), this essentially provides us the ability to ship 'Custom GPTs' that can integrate with our assistants, and so opens up a world of possibilities for users to configure and expand the capabilities of the Security and Observability Assistants.

Requirement Details

Configuration

The core requirement here is for the ability to include the following when creating a package:

  • Any number of data streams to create, though realistically one is probably sufficient
  • An arbitrary number of documents, perhaps in JSON format, or zipped as detailed in #346 ([discuss] Support (fairly large) sample data set package)
  • Some configuration for the destination data stream of the bundled documents. If we include a raw dump of the documents from ES, perhaps we can use just the _index fields to route them accordingly? (A hypothetical layout sketch follows this list.)
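
As a purely illustrative sketch (none of these folder or file names come from the existing spec; they are assumptions for discussion), such a package might look like:

```
knowledge_base_esql/              # hypothetical package name
├── manifest.yml                  # package metadata (name, version, ...)
├── data_stream/
│   └── knowledge/
│       ├── manifest.yml          # destination data stream definition
│       └── fields/fields.yml     # mappings for the bundled documents
└── bundled_docs/
    └── documents.ndjson          # raw ES dump; each document's _index could route it
```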

Behavior

Upon installation, the package should install the included data streams, then ingest the bundled documents into their destination data stream. This initial data should stick around for as long as the package is installed. If the package is removed, the data stream + initial data should be removed as well. When the package is updated, it would be fine to wipe the data stream/initial data and treat it as a fresh install. Whatever is easiest/most resilient would be fine for the first iteration here. No need to worry about appending new data on upgrade or dealing with mapping changes; just delete the data streams and re-install/re-ingest the initial data.


The above would be sufficient for us to start bundling knowledge base documents in packages, at which point we could install them as needed in support of specific assistant features.
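
To make the intended lifecycle concrete, here is a minimal TypeScript sketch using the Elasticsearch JS client. The data stream name is a placeholder, it assumes the package's index template is already installed, and it is not existing Fleet code:

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical data stream name; an index template matching it is assumed
// to have been installed by the package already.
const DATA_STREAM = 'logs-knowledge_base.esql-default';

async function installPackage(client: Client, docs: object[]) {
  await client.indices.createDataStream({ name: DATA_STREAM });
  await client.helpers.bulk({
    datasource: docs,
    // Data streams only accept the `create` operation.
    onDocument: () => ({ create: { _index: DATA_STREAM } }),
  });
}

async function removePackage(client: Client) {
  // Deleting the data stream removes the initial data with it.
  await client.indices.deleteDataStream({ name: DATA_STREAM });
}

async function upgradePackage(client: Client, newDocs: object[]) {
  // Treat upgrade as a fresh install: wipe and re-ingest.
  await removePackage(client);
  await installPackage(client, newDocs);
}
```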

spong added the discuss and Team:Ecosystem labels on Jan 18, 2024
spong (Member, Author) commented Jan 26, 2024

I know it's only been a week, but would there be any way to expedite the assessment of this change proposal? We're extremely motivated for this effort on the @elastic/security-generative-ai team, and are happy to help provide resources in putting an MVP together; just let us know -- thanks! 🙂

@jen-huang (Contributor)

@jsoriano @kpollich The rationale behind this type of package seems sound to me. Anything the GenAI team should consider as part of an MVP?

@jsoriano (Member)

An exercise we can do is to create a potential example package and, from it, see what we would need to do to support this in our packages ecosystem. We may find that we can add this as a normal feature for packages in the package-spec and Fleet, without needing support for big content or "DLC" packages.

Later, if we find that the size of the data set is a blocker, then we would also need #346.
And if we find that for the same package we want to have additional knowledge bases, then we may need #351.

@jsoriano (Member)

@spong could you provide a potential example package for this?

spong (Member, Author) commented Jan 29, 2024

Absolutely @jsoriano! I'll get an example package put together today 🙂

spong (Member, Author) commented Jan 30, 2024

Got an initial pass up as a PR to the integrations repo here: elastic/integrations#9007. Still some more I need to read through/update, but this includes the data stream and the raw sample documents to be ingested at least.

I see inference ingest pipelines are supported in the spec, which could be nice for performing the embedding on initial ingest (and so enabling different embedding/chunking strategies); however, that would add overhead in dealing with the trained_model dependency (is there an ML node, enough memory, the correct model installed, etc.). Perhaps there's already support for these scenarios since ML modules are a supported asset type?
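
For context, such an inference pipeline might look roughly like the following (pipeline and field names are placeholders, and it assumes ELSER v2 is already deployed on an ML node):

```
PUT _ingest/pipeline/kb-elser-embeddings
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_2",
        "input_output": [
          { "input_field": "content", "output_field": "content_embedding" }
        ]
      }
    }
  ]
}
```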

jsoriano (Member) commented Jan 31, 2024

@spong and I met today over the example package in elastic/integrations#9007, and we have a proposal to move forward:

  • Package Spec:

    • We will add a new folder to the root of integration packages, called knowledge_base (see the layout sketch after this list).
    • This knowledge_base directory can contain multiple knowledge bases.
    • Each knowledge base will be defined by a set of documents and an optional custom mapping for them. The documents can be ready to be ingested, or in a raw form that needs to be "built" (more on this later).
    • This new feature will be experimental, with the intention to iterate on it.
  • Fleet:

    • When installing the package, Fleet will create a new index or data stream with the provided documents for each knowledge base. We still need to define the naming convention for these.
    • Fleet will have a base component template with mappings for these indexes. Each knowledge base can provide additional mappings.
    • Each knowledge base is tied to a model (elser_1, elser_2, ...). Fleet should be able to discover and install only the knowledge bases compatible with the models available in the deployment. Fleet will just install them, and assistants will take care of discovering the ones that work with the models they have.
    • Permissions should be limited as much as possible so nothing can write to these indexes.
    • Fleet will somehow let the assistant know about the knowledge bases installed, either explicitly via callbacks or by convention through known index patterns.
  • Elastic-Package:

    • At some point it would be nice to make it capable of building knowledge bases. For this, it will require a stack with ML capabilities. It will install a pipeline with inference capabilities and ingest the raw documents through it. It will then export the resulting knowledge base, ready to be installed. The input for this process could be included in knowledge_base/_dev.
  • Use cases:

    • Standalone knowledge bases. Installed just to increase the capabilities of the model, for example to teach it ES|QL. These will be distributed in integration packages with knowledge bases but without data streams, similar to some packages we have now with ML models.
    • Knowledge bases related to integrations. Installed to teach the model about specific solutions or services, for example to teach the model about Apache when installing the Apache integration. These will be distributed in existing integration packages, adding the knowledge bases alongside the existing integration content.
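
For illustration only, a knowledge_base folder following this proposal might look roughly like this (directory and file names are assumptions; the actual layout was still to be defined):

```
<package root>/
└── knowledge_base/
    ├── esql/
    │   ├── manifest.yml          # e.g. title, description, compatible model (elser_2)
    │   ├── fields/fields.yml     # optional additional mappings
    │   └── content.ndjson        # documents ready to be ingested
    └── _dev/                     # raw inputs used by elastic-package to build knowledge bases
```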

spong (Member, Author) commented Jan 31, 2024

That looks great @jsoriano! Thanks for meeting with me and distilling the above proposal.

Just want to make a couple of notes/clarifications:

Each knowledge base is tied to a model (elser_1, elser_2, ...). Fleet should be able to discover and install only the knowledge bases compatible with the models available in the deployment.

While true, I'm not sure we need this validation/gating within Fleet itself. Models can be uninstalled, upgraded, or re-installed after a KB package has been installed, so the assistants will already need to handle any missing or mismatched model scenarios when they check for available/compatible knowledge bases.

Fleet will somehow let the assistant know about the knowledge bases installed.

I don't think a callback on install is needed at this time since assistants will need to query for 'compatible' knowledge bases on their own (as above), but if it works with the existing registerExternalCallback interface then all the better 🙂

jsoriano (Member) commented Feb 1, 2024

Each knowledge base is tied to a model (elser_1, elser_2, ...). Fleet should be able to discover and install only the knowledge bases compatible with the models available in the deployment.

While true, I'm not sure we need this validation/gating within Fleet itself. Models can be uninstalled, upgraded, or re-installed after a KB package has been installed, so the assistants will already need to handle any missing or mismatched model scenarios when they check for available/compatible knowledge bases.

Ok, I guess this makes things easier for Fleet 🙂 It just installs the knowledge bases and assistants get the ones they can use.

Fleet will somehow let the assistant know about the knowledge bases installed.

I don't think a callback on install is needed at this time since assistants will need to query for 'compatible' knowledge bases on their own (as above), but if it works with the existing registerExternalCallback interface then all the better 🙂

So knowledge bases would be discovered by convention on some index pattern? I am OK with that; in any case, this is something we need to think about.

spong (Member, Author) commented Feb 1, 2024

So knowledge bases would be discovered by convention on some index pattern? I am OK with that; in any case, this is something we need to think about.

Yeah, I'm thinking for this first pass the assistants can do self-discovery based on an index naming convention, or by hitting the Fleet API for packages with the tag "Knowledge Base". Then they'd either read further metadata from the Fleet manifest like we discussed, or later push that metadata/descriptor state to a document in the knowledge base itself.

spong (Member, Author) commented Feb 14, 2024

Slight update here. Didn't have much bandwidth this past week, but I was able to put together a pretty rough POC following the above proposal.

If you're okay with it, I'm happy to round out this POC and push the Fleet, package-spec, integrations, and elastic-package changes as draft PRs for feedback/collaboration, and then we can go from there. If you prefer to manage any of these changes yourselves, though, let me know!

Quick demo video below. The user installs the package assets, Fleet code sets up the index and ingests the KB docs, and the functionality immediately becomes available within the Assistant.

kb-integrations-e2e.mov

@jsoriano (Member)

@spong wow, this looks great! Yes, please open PRs and we can continue the discussion there. Thanks!

pgayvallet (Contributor) commented Sep 10, 2024

With all the work around the "Unified AI Assistant", and the corresponding initiative for the unification of the knowledge base, the responsibility of maintaining and distributing knowledge base "bits" is somewhat moving to the newly formed @elastic/appex-ai-infra team.

I think @spong did a great job here with his requirements. The "revisited" version from our side is very similar, but just for clarity, I will write it down:

Context

For the Kibana AI assistant(s), what we call the "knowledge base" (or "KB") is, to simplify, a set of sources the assistant can use to retrieve documents related to its current context. For example, if the user asks the assistant a question about Kibana's configuration settings, the assistant can search its knowledge base and retrieve articles/documents related to this question/context.

What we want to do

We want to be able to ship KB sources as packages (more specifically, index sources, as there can be different kinds of KB sources, but I won't elaborate on that point given that only index sources are relevant here).

A KB source is composed of:

  • an index or data stream (and its mapping definition / index settings)
    • indices would be sufficient for our use case, so if data stream support is more complex (e.g. for uninstall), we can ditch them from the requirement
  • an arbitrary number of documents present in / bound to this index or data stream.
  • (indirectly, but coupled to) a model/inference endpoint ID that will be referenced by the semantic_text fields in the mapping (see next section for that point)

Installation behavior is straightforward:

  • create the index/datastream
  • then ingest/index the associated documents into the index

Uninstall behavior is too:

  • delete all the assets that were installed by this package

For updates, we would simply follow an "uninstall old => install new" workflow (it is OK to purge the indices).

Additional questions

System indices

For KB sources, we're (ideally) planning on using system indices. Would that be an issue with the way package installation works? Which user is being used under the hood, the Kibana internal user?

semantic_text referencing model ids

Just one technical detail worth mentioning: knowledge base ("KB") retrieval is based on semantic search. In practice, this means there is a strong coupling between the index (mapping) and a specific model/inference endpoint.

This raises the question of how to manage that coupling for packages. We are planning on pre-installing the model that will be used for all our KB sources, but I think we still need a way for the package installer to "check" some install condition, e.g. only allow installation if the model with this specific ID is present, or something similar (semantic search requires an ML node, meaning that not all clusters will be able to support it, and we should not allow the package to be installed in that case). I have no idea whether that kind of programmatic or scripted check is possible today, but we will likely need to find a solution.
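
For reference, the coupling comes from the inference_id parameter on semantic_text fields; a minimal sketch of the kind of mapping involved (index, field, and endpoint names are placeholders):

```
PUT kb-source-example
{
  "mappings": {
    "properties": {
      "title":   { "type": "text" },
      "content": { "type": "semantic_text", "inference_id": "my-elser-endpoint" }
    }
  }
}
```

If the referenced inference endpoint (or the ML node backing it) is not available, documents can't be ingested into that field, hence the desire for an install-time check.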

indexing documents in indices not "maintained" by the package

For our specific needs, we would ideally be able to create a document in another index (our KB source listing index) during package installation, to flag the source as being available. This means that during uninstall we would need to delete this specific document from that index, without purging the index (as it wasn't installed by the package).

That's not a strict requirement, though; we should be able to work around it if we don't have that.

Which approach should we take

from #693 (comment):

We will add a new folder to the root of integration packages, called knowledge_base

That's only my two cents, but I'm not sure I agree with the "specialized" approach that was discussed in that issue.

I really feel like the right approach would be to be as generic as possible and "simply" make the spec evolve to support adding ES documents bound to an index to packages. Not to create "content only" packages (#351), or to do something absolutely specific such as this knowledge_base folder that was discussed in that issue. "Just" to add ES documents as a supported type for packages.

Now, if the generic approach is significantly more work, I would be very fine with something more specific to our exact need here. I just feel like having content in packages is something that could benefit more than just this exact use case.

spong (Member, Author) commented Sep 10, 2024

That's a good summary of where we're at and what's needed here, thanks @pgayvallet!

I really feel like the right approach would be to be as generic as possible and "simply" make the spec evolve to be able to add ES documents bound to an index to packages. Not to create "content only" packages (#351), or to do something absolutely specific such as this knowledge_base folder that was discussed in that issue. "Just" to add ES documents as a supported type for packages.

I'm also in agreement on going with a more generalized solution. I started with that thought by trying to work within the existing 'sample data' issue, but ended up being directed to a more specialized initial implementation. So if we can make this work with the Content Packages RFC (internal), or something else more generic, all the better.

At the end of the day these packages are just data, with a pre-install requirement for a model/inference endpoint ID (though technically not even that if the data is already embedded and we're able to target the default deployed model). We don't even need an ingest pipeline anymore with semantic_text, so an MVP is pretty straightforward. That said, I think there are interesting questions to explore around managing chunking strategies, including 'serialized tools' as assets, and so forth, but I wouldn't let those get in the way of delivering a clean MVP so we can make progress and start getting feedback.

@jsoriano (Member)

I really feel like the right approach would be to be as generic as possible and "simply" make the spec evolve to be able to add ES documents bound to an index to packages.

The generic approach sounds good to me, but we still need a way to tell Fleet to run the pre-install steps, which may be different depending on the type of data. So maybe the approach could be something like having index_data/{name} directories, each one with the data to ingest, the field mapping definitions, and some metadata file indicating the type of data, whether it should use ES indexes or data streams, and so on.
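
For illustration only (directory and file names are assumptions, not an agreed spec), that could look something like:

```
<package root>/
└── index_data/
    └── kibana_docs/              # hypothetical data set name
        ├── manifest.yml          # type of data, index vs data stream, etc.
        ├── fields/fields.yml     # field mapping definitions
        └── content.ndjson        # documents to ingest
```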

Not to create "content only" packages (#351), or to do something absolutely specific such as this knowledge_base folder that was discussed in that issue.

"content only" packages are not specific to this use case, they are useful in other use cases. I think data distribution will fit better in this kind of package than in "integration" or "input" packages.

pgayvallet (Contributor) commented Sep 24, 2024

I started taking a look at the package-spec and integration repositories, and given what the spec currently supports, I'm more and more leaning toward doing something specific for knowledge bases, as @jsoriano proposed initially, rather than something fully generic that allows indexing and removing documents from any arbitrary index, as I suggested in my previous reply. I doubt we will really be able to do something generic enough to suit everybody's needs in terms of adding arbitrary documents, so it's probably better to stay humble and focus on our specific need here.

I have a few questions:

1. Storage format for knowledge base documents in the package

I see that Kibana entities (such as dashboards) are each stored in their own individual file, following a kibana/{entityType}/{id}.json filepath pattern.

For KB, we will have large numbers of documents (hundreds to thousands) per KB "source", so I'm not sure what the best option would be here:

  • doing the same as for Kibana entities and following a "1 file per document" approach, or
  • having a single file containing all the documents for a given index (= KB source) instead, either in ndjson or JSON array format.

One file per document would result in very large folder contents, but we're still far below the volume where it becomes a problem. One single file containing everything is imho more elegant, but then it may lead to other issues (parsing/loading the whole file in memory during installation could be problematic).

I know packages are zipped in the registry, but I'm starting to wonder if using an internal archive for such a large amount of documents wouldn't be a smart move. Compressed formats have an index, allowing entries to be loaded individually, which would get rid of the memory problems. The downside is that it fully kills diffs by introducing data in a binary format within the package's sources...

So yeah, really not sure what the best approach would be here, insights or opinions are very welcome.

2. Spec changes structure

I see we now have a spec/content folder, with the content type spec relying heavily on references to the integration type spec. Do we assume knowledge bases will only ever be used by content packages (in which case I should add what I need directly under spec/content), or should I instead allow integration packages to also support the feature (and do as is done for kibana at the moment, with spec/content/kibana referencing spec/integration/kibana)? Any preferences?

3. Package size

I did a quick test, and the KB source for Kibana 8.15 documentation is around 600 documents, for a total of 45 MB (uncompressed) and 12 MB (compressed); yeah, embeddings take a lot of space. And Kibana is one of the smallest sources (ES is twice that, Security almost 10 times that size, in terms of number of documents at least).

So the question is simple: are we fine adding such large packages to the integrations repository? If not, what would be our alternatives?

@jsoriano (Member)

Storage format for knowledge base documents in the package

We can have a mix of both: a directory of files, each file potentially containing many documents in ndjson format, with all documents in all files being ingested. For simple use cases a single file will be enough; for more complex use cases, multiple files might help to organize the data and with maintenance. Multiple files can also be useful to work around size limits in repositories.

Regarding memory usage, it doesn't necessarily have to be an issue, as the package could be downloaded to disk and the files streamed to ES as needed, avoiding having the package or the data in memory. This needs some work in Kibana/Fleet, but I think this is an effort we should make in any case to optimize resource usage.

In any case, I wouldn't go with the approach of a single document per file; I don't see any advantage in it. We don't need to follow the approach used for Kibana assets.
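
A minimal sketch of the streaming approach mentioned above, assuming the ndjson files have already been extracted to disk and using the Elasticsearch JS client's bulk helper (this is not the actual Fleet code):

```ts
import { createReadStream } from 'fs';
import { createInterface } from 'readline';
import { Client } from '@elastic/elasticsearch';

// Stream an ndjson file line by line and bulk-index it, so neither the
// package archive nor the full document set has to be held in memory.
async function ingestNdjson(client: Client, index: string, filePath: string) {
  async function* documents() {
    const lines = createInterface({
      input: createReadStream(filePath, 'utf8'),
      crlfDelay: Infinity,
    });
    for await (const line of lines) {
      if (line.trim()) yield JSON.parse(line); // one document per ndjson line
    }
  }

  // The bulk helper batches and flushes documents as they are produced.
  await client.helpers.bulk({
    datasource: documents(),
    onDocument: () => ({ index: { _index: index } }),
  });
}
```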

Spec changes structure

Don't worry too much about this. I would start by adding this only to content packages, and if needed in other packages in the future we can reorganize the files. We also plan to use content packages to test a new installation code path for big packages.

Package size

We have options here depending on how these packages are going to be managed. For example, if they have big files but they don't change a lot, I think they are fine in the integrations repository. If they have really big files, we might try Git LFS.

are we fine adding such large packages to the integrations repository? If not, what would be our alternatives?

Our tooling and infra support having packages in different repositories; this is something we have been doing mainly for organisational reasons. If we feel that these packages are going to have special needs, we could have a different repo for them, or even one repository per package.

@pgayvallet (Contributor)

I opened #807 with my spec update proposal.

@jsoriano (Member)

Reopening as this was reverted in #813

jsoriano reopened this on Nov 20, 2024