[Change Proposal] Support Knowledge Base Packages #693
I know it's only been a week, but would there be any way to expedite the assessment of this change proposal? We're extremely motivated for this effort on the @elastic/security-generative-ai team, and are happy to help provide resources in putting an MVP together, just let us know -- thanks! 🙂 |
An exercise we can do is to create a potential example package and, from it, see what we would need to support this in our packages ecosystem. We may find that we can add this as a normal feature for packages in the package-spec and Fleet, without needing support for big content or "DLC" packages. Later, if we find that the size of the data set is a blocker, then we would also need #346. |
@spong could you provide a potential example package for this? |
Absolutely @jsoriano! I'll get an example package put together today 🙂 |
Got an initial pass up as a PR to the integrations repo here: elastic/integrations#9007. There's still some more I need to read through/update, but this includes the data stream and the raw sample documents to be ingested, at least. I see inference ingest pipelines are supported in the spec, which could be nice for performing the embedding on initial ingest (and so enabling different embedding/chunking strategies); however, that would add overhead in dealing with the trained_model dependency (is there an ML node, enough memory, the correct model installed, etc.). Perhaps there's already support for these scenarios since ML modules are a supported asset type? |
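For context on what such an inference ingest pipeline could look like, here is a minimal sketch using the Elasticsearch JS client. The pipeline id, field names, and the assumption that an ELSER deployment (`.elser_model_2`) is available are all illustrative, not something the package-spec prescribes.

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function createKbEmbeddingPipeline() {
  // Hypothetical pipeline: runs the deployed ELSER model against each
  // document's `content` field at ingest time and stores the embedding.
  await client.ingest.putPipeline({
    id: 'security-ai-kb-embeddings', // illustrative pipeline id
    processors: [
      {
        inference: {
          model_id: '.elser_model_2', // assumes this trained model is deployed
          input_output: [
            { input_field: 'content', output_field: 'content_embedding' },
          ],
        },
      },
    ],
  });
}
```

As noted above, a pipeline like this ties the package to the trained_model dependency (ML node, memory, model installed), which is exactly the overhead being weighed here.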
@spong and I met today over the example package in elastic/integrations#9007, and we have a proposal to move forward:
|
That looks great @jsoriano! Thanks for meeting with me and distilling the above proposal. Just want to make a couple notes/clarifications:
While true, I'm not sure we need this validation/gating within fleet itself. Models can be uninstalled, upgraded, or re-installed after a KB package has been installed, so the assistants will already need to handle any missing or mismatched model scenarios when they check for available/compatible knowledge bases.
I don't think a callback on install is needed at this time since assistants will need to query for 'compatible' knowledge bases on their own (as above), but if it works with the existing |
Ok, I guess this makes things easier for Fleet 🙂 It just installs the knowledge bases and assistants get the ones they can use.
So knowledge bases would be discovered by convention on some index pattern? I am ok with that; in any case, this is something we need to think about. |
Yeah, I'm thinking for this first pass the assistants can do self discovery based on index naming convention, or hitting the fleet API for packages with the tag "Knowledge Base". Then either read further metadata from the fleet manifest like we discussed, or later push that metadata/descriptor state to a document in the knowledge base itself. |
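As a rough illustration of the self-discovery idea, a sketch along these lines would work; the `kb-*` pattern is purely hypothetical and stands in for whatever naming convention the packages settle on.

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Discover installed knowledge bases purely by index naming convention.
async function findKnowledgeBaseIndices(): Promise<string[]> {
  const indices = await client.indices.get({
    index: 'kb-*', // hypothetical convention for KB indices/data streams
    expand_wildcards: 'open',
  });
  return Object.keys(indices);
}
```

From there the assistant could read further metadata from the Fleet manifest, or from a descriptor document in the knowledge base itself, as discussed above.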
Slight update here. Didn't have much bandwidth this past week, but I was able to put together a pretty rough POC following the above proposal. If you're okay with it, I'm happy to round out this POC and push the PRs. Quick demo video below: the user installs package assets, fleet code sets up the index/ingests KB docs, and then the functionality immediately becomes available within the Assistant. kb-integrations-e2e.mov |
@spong wow, this looks great! Yes, please open PRs and we can continue the discussion there. Thanks! |
With all the work around the "Unified AI Assistant" and the corresponding initiative to unify the knowledge base, the responsibility of maintaining and distributing knowledge base "bits" is somewhat moving to the newly formed @elastic/appex-ai-infra team. I think @spong did a great job here with his requirements. The "revisited" version from our side is very similar, but just for clarity, I will write it down:

Context

For the Kibana AI assistant(s), what we call "Knowledge base" (or "KB") is, to simplify, a set of sources the assistant can use to retrieve documents related to its current context. For example, if the user asks the assistant a question about Kibana's configuration settings, the assistant can search its knowledge base and retrieve articles/documents related to this question/context.

What we want to do

We want to be able to ship KB sources as packages (more specifically, index sources, as there can be different kinds of KB sources, but I won't elaborate on that point given that only index sources are relevant here). A KB source is composed of:
Installation behavior is straightforward:
Uninstall behavior is too:
For updates, we would simply follow an uninstall old => install new workflow (it is ok to purge the indices).

Additional questions

System indices

For KB sources, we're (ideally) planning on using system indices. Would that be an issue with the way package installation works? Which user is being used under the hood, the Kibana internal user?

semantic_text referencing model ids

Just one technical detail worth mentioning: knowledge base ("KB") retrieval is based on semantic search. In practice, this means there is a strong coupling between the index (mapping) and a specific model / inference endpoint, which raises the question of how to manage that coupling for packages. We are planning on pre-installing the model that will be used for all our KB sources, but I think we still need a way for the package installer to "check" some install condition - e.g. only allow installation if the model with this specific ID is present, or something similar (as semantic search requires an ML node, meaning that not all clusters will be able to support it - and we should not allow installing the package in that case). I have no idea whether that kind of programmatic or scripted check is possible today, but we will likely need to find a solution. (A rough sketch of this coupling and a possible pre-install check follows this comment.)

Indexing documents in indices not "maintained" by the package

For our specific needs, we would ideally be able to create a document in another index (our KB source listing index) during package installation, to flag the source as being available. It means that during uninstall, we would need to delete this specific document from said index, without purging it (as it wasn't installed by the package). That's not a strict requirement though; we should be able to work around it if we don't have that.

Which approach should we take

From #693 (comment):
That's only my 2 cents; I'm not sure I agree with the "specialized" approach that was discussed in that issue. I really feel like the right approach would be to be as generic as possible and "simply" make the spec evolve to be able to add ES documents bound to an index to packages - not to create "content only" packages (#351), or to do something absolutely specific such as this. Now, if the generic approach is significantly more work, I would be very fine with something more specific to our exact need here. I just feel like having content in packages could be something that could benefit more than just this exact use case? |
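Regarding the `semantic_text` coupling and the pre-install check mentioned above, here is a minimal sketch of both ideas using the Elasticsearch JS client. The model id, inference endpoint id, and index name are assumptions for illustration; whatever the KB packages standardize on would be substituted.

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Possible pre-install condition: only proceed if the expected model exists.
async function modelIsAvailable(modelId = '.elser_model_2'): Promise<boolean> {
  try {
    await client.ml.getTrainedModels({ model_id: modelId });
    return true;
  } catch {
    return false;
  }
}

// The mapping is where the coupling lives: the semantic_text field points at
// a specific inference endpoint, so the index is only usable if that endpoint
// (and its underlying model) is present on the cluster.
async function createKbSourceIndex() {
  await client.indices.create({
    index: '.kibana-ai-kb-example', // hypothetical KB source index
    mappings: {
      properties: {
        content: {
          type: 'semantic_text',
          inference_id: 'elser-kb-endpoint', // assumed inference endpoint id
        },
      },
    },
  });
}
```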
That's a good summary of where we're at and what's needed here, thanks @pgayvallet!
I'm also in agreement for going with a more generalized solution. I started with that thought by trying to work with the existing 'sample data' issue, but ended up being directed to a more specialized initial implementation. So if we can make this work with the Content Packages RFC (internal), or something else more generic, all the better. At the end of the day these packages are just data, with a pre-install requirement for a model/inference endpoint ID (though technically not if the data is already embedded and we're able to target the default deployed model). We don't even need an ingest pipeline anymore either with |
The generic approach sounds good to me, but we still need a way to tell Fleet to run the pre-install steps, which may be different depending on the type of data. So maybe the approach could be something like having
"content only" packages are not specific to this use case, they are useful in other use cases. I think data distribution will fit better in this kind of package than in "integration" or "input" packages. |
I started taking a look at the

I have a few questions:

1. Storage format for knowledge base documents in the package

I see that Kibana entities (such as dashboards) are each stored in their own individual file, following a

For KB, we will have large numbers of documents (100s to 1000s) per KB "source", so I'm not sure what the best option would be here:
One file per document would result in very large folder contents, but we're still far below the volume where it becomes a problem. One single file containing everything is imho more elegant, but it may lead to other issues (parsing/loading the whole file in memory during installation could be problematic). I know packages are zipped in the registry, but I'm starting to wonder if using an internal archive for such a large amount of documents wouldn't be a smart move. Compressed formats have an index, allowing entries to be loaded individually, which would get rid of the memory problems. The downside is that it completely kills diffs by introducing data in a binary format within the package's sources... So yeah, I'm really not sure what the best approach would be here; insights or opinions are very welcome.

2. Spec changes structure

I see we now have a

3. Package size

I did a quick test, and the KB source for the Kibana 8.15 documentation is around 600 documents, for a total of 45mb (uncompressed) and 12mb (compressed) - yeah, embeddings take a lot of space. And Kibana is one of the smallest sources (ES is twice that, security almost 10 times that size - in terms of number of documents at least). So the question is simple: are we fine adding such large packages to the |
We can have a mix of both: a directory of files, each file potentially containing many documents in ndjson format, with all documents in all files being ingested. For simple use cases a single file will be enough; for more complex use cases, multiple files might help organize the data and help with maintenance. Multiple files can also be useful to work around size limits in repositories. Regarding memory usage, it doesn't necessarily have to be an issue, as the package could be downloaded to disk and the files streamed to ES as needed, avoiding having the package or the data in memory. This needs some work in Kibana/Fleet, but I think it is an effort we should do in any case to optimize resource usage. I wouldn't in any case go with the approach of a single document per file; I don't see any advantage to it. We don't need to follow the approach used for kibana assets.
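To make the streaming idea concrete, here is a rough sketch (not how Fleet actually does this today) of reading one ndjson content file line by line from disk and bulk-ingesting it without holding the whole file in memory; the file path and destination index are placeholders.

```typescript
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Stream an ndjson file of KB documents into an index/data stream without
// loading the entire file into memory.
async function ingestNdjsonFile(filePath: string, destIndex: string) {
  const lines = createInterface({ input: createReadStream(filePath) });

  async function* documents() {
    for await (const line of lines) {
      if (line.trim().length > 0) {
        yield JSON.parse(line);
      }
    }
  }

  await client.helpers.bulk({
    datasource: documents(),
    onDocument: () => ({ create: { _index: destIndex } }),
  });
}
```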
Don't worry too much about this. I would start by adding this only to content packages, and if needed in other packages in the future we can reorganize the files. We also plan to use content packages to test a new installation code path for big packages.
We have options here depending on how these packages are going to be managed.
Our tooling and infra support having packages in different repositories; this is something we have been doing mainly for organisational reasons. If we feel that these packages are going to have special needs, we could have a different repo for them, or even one repository per package. |
I opened #807 with my spec update proposal |
Reopening as this was reverted in #813 |
Summary
This is a proposal, much like #346 or #351, to enable bundling static data stream data within a package, such that when the package is installed, the data stream is created and the bundled data is ingested into it.
The specific use case here is for shipping 'Knowledge Base' content for use by the Elastic Assistants. For example, both Security and Observability Assistants are currently bundling our ES|QL docs with the Kibana distribution for each release. We then take this data, optionally chunk it, and then embed/ingest it using ELSER into a 'knowledge base' data stream so the assistants can query it for their ES|QL query generation features. Each release we'll need to update this content, and ship it as part of the Kibana distribution, with no ability to ship intermediate content updates outside of the Kibana release cycle.
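To illustrate the chunking step mentioned above (not the actual implementation the assistants use), a naive word-count-based chunker with overlap could look like the sketch below; the chunk size and overlap values are arbitrary.

```typescript
// Naive illustration of splitting a long doc into overlapping chunks before
// embedding/ingesting; real chunking strategies are usually token-aware.
function chunkText(text: string, chunkSize = 400, overlap = 50): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}
```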
Additionally, as mentioned in #346 (comment), this essentially provides us the ability to ship 'Custom GPTs' that can integrate with our assistants, and so opens up a world of possibilities for users to configure and expand the capabilities of the Security and Observability Assistants.
Requirement Details
Configuration
The core requirement here is for the ability to include the following when creating a package:
_index fields to route them accordingly?

Behavior
Upon installation, the package should install the included data streams, then ingest the bundled documents into their destination data stream. This initial data should stick around for as long as the package is installed. If the package is removed, the data stream + initial data should be removed as well. When the package is updated, it would be fine to wipe the data stream/initial data and treat it as a fresh install. Whatever is easiest/most resilient would be fine for the first iteration here. No need to worry about appending new data on upgrade or dealing with mapping changes - just delete the data streams and re-install/re-ingest the initial data.
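As a sketch of the behavior described above (assuming the Elasticsearch JS client and a hypothetical data stream name, and not representing Fleet's actual install code), the first iteration could be as blunt as this:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical data stream for a bundled knowledge base; its index template
// is assumed to have been installed along with the package assets.
const KB_DATA_STREAM = 'kb-security-esql-docs';

// Install: create the data stream, then ingest the bundled documents.
async function installKnowledgeBase(bundledDocs: object[]) {
  await client.indices.createDataStream({ name: KB_DATA_STREAM });
  await client.helpers.bulk({
    datasource: bundledDocs,
    onDocument: () => ({ create: { _index: KB_DATA_STREAM } }),
  });
}

// Uninstall (and first-iteration upgrade): drop the data stream and its data;
// a fresh install then re-creates and re-ingests everything.
async function uninstallKnowledgeBase() {
  await client.indices.deleteDataStream({ name: KB_DATA_STREAM });
}
```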
The above would be sufficient for us to start bundling knowledge base documents in packages, at which point we could install them as needed in support of specific assistant features.