Adds example ES|QL Knowledge Base integration with static data #9007

Closed
spong wants to merge 3 commits

Conversation

spong
Member

@spong spong commented Jan 30, 2024

Important

This is a work-in-progress example integration for proving out elastic/package-spec#693. Contents subject to change.

Proposed commit message

This is an example integration in support of the elastic/package-spec#693 change proposal for creating a 'Knowledge Base' integration that provides both data streams and the corresponding static content to be loaded into those data streams.

After discussion with @jsoriano, we have moved the content/mappings from the data_stream directory to a new knowledge_base directory. It contains one directory per knowledge base you want to install (similar to the directories under data_stream), and each of those holds both the fields and, within a documents folder, the static documents (as a JSON array in a *.json file). See spec to update: package-spec/spec/integration/data_stream/spec.yml

Example structure:

| esql_knowledge_base <-- package root
   | docs
   | img
   | knowledge_base
     | documents
        | esql-kb-docs.json
     | fields
       | base_fields.yml
       | fields.yml
   | changelog.yml
   | LICENSE.txt
   | manifest.yml

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.

Author's Checklist

  • [TBD]

How to test this PR locally

Related issues

elastic/package-spec#693

Screenshots

  • [TBD]

Member

@jsoriano jsoriano left a comment

Thanks for creating this package, this clarifies some things for me.

What is clear to me is that we need:

  • Support to install static documents.
  • Support for rank_features.

What is still not clear to me is how this interacts with other data, mainly:

  • Is this used to work with data in other indexes or data streams?
  • Does it use elastic-agent?

If these packages only manage one data stream, and the data is not collected by agents, I would say that we need a new specific package type.

  type: keyword
  description: Model used to generate the vector
- name: tokens
  type: rank_features
Member

I guess we need to add support for this type.
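
To make the request concrete, here is a rough sketch of what supporting the type could amount to on the mapping side: passing the declared `type` of each field through to the Elasticsearch mapping. `rank_features` is an existing Elasticsearch field type; the function `fields_to_mapping` and the field name `model_id` are assumptions made for illustration (the name of the first field is not visible in the snippet above), and this is not how Fleet or package-spec actually build mappings.

```python
from typing import Any, Dict, List


def fields_to_mapping(fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Turn flat {name, type} field definitions into mapping properties."""
    properties: Dict[str, Any] = {}
    for field in fields:
        # Only the type is carried over; descriptions are documentation only.
        properties[field["name"]] = {"type": field["type"]}
    return {"properties": properties}


example_fields = [
    # `model_id` is a hypothetical name for the keyword field.
    {"name": "model_id", "type": "keyword",
     "description": "Model used to generate the vector"},
    {"name": "tokens", "type": "rank_features"},
]
mapping = fields_to_mapping(example_fields)
```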

Comment on lines +7 to +8
* "Generate an ES|QL query for the top 10 countries with the most sales"
* "Generate an ES|QL query for my most recent open security detection alerts of high risk"
Member

Is the data queried supposed to be in data streams managed by this package, or in other data streams or indexes?

Member Author

@spong spong Jan 30, 2024

The assistant-generated query would target other data streams/indices that are not managed by this package. It can get a bit confusing, but the flow is as follows:

  • User prompts assistant to generate ES|QL query for fetching some data in their cluster as in examples above
  • Assistant identifies user wants to generate an ES|QL query, and so calls the custom ES|QL Query Generation Tool
  • ES|QL Query Generation Tool does a vector search against this package's data streams for relevant documents to aid in the generation of the query
  • Assistant uses documents as context and returns an ES|QL query to the user
  • User executes ES|QL query (which should be querying whatever index the model inferred from the original user request)
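
The retrieval and context-building steps of the flow above can be sketched roughly as follows. This is a toy illustration only: an in-memory cosine-similarity ranking stands in for the vector search against the package's data streams, the two-dimensional vectors are made up, and the names `retrieve` and `build_prompt` are hypothetical rather than part of the assistant's real tooling.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, kb: List[Tuple[str, Vector]], k: int = 2) -> List[str]:
    """Return the text of the k knowledge base documents closest to the query."""
    ranked = sorted(kb, key=lambda doc: cosine(query_vec, doc[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(user_request: str, context_docs: List[str]) -> str:
    """Assemble the context that would be handed to the model."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Context:\n{context}\n\nTask: {user_request}"


# Made-up documents and embeddings standing in for the package's static data.
kb = [
    ("STATS ... BY groups rows and aggregates", [1.0, 0.0]),
    ("SORT orders rows; LIMIT caps the result size", [0.8, 0.6]),
    ("unrelated release notes", [0.0, 1.0]),
]
prompt = build_prompt(
    "Generate an ES|QL query for the top 10 countries with the most sales",
    retrieve([1.0, 0.1], kb),
)
```

The model would then receive `prompt` and return the query; which index that query targets is inferred from the user's request, not from the knowledge base documents.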

Member Author

spong commented Jan 30, 2024

What is still not clear to me is how this interacts with other data, mainly:

  • Is this used to work with data in other indexes or data streams?

I'm not sure I understand the implications here, but generally speaking the static data in these packages is a standalone, independent resource for the assistant to reference when it needs to perform a specific task. Perhaps the ES|QL query generation example muddied the waters a bit, so a different example would be a Knowledge Base package containing embeddings for the entirety of a book or for transcripts from a podcast. On initialization the assistant would fetch existing KB indices/packages, parse each package description for what it contains/provides, then register a tool that says 'query these data streams when asked about this topic'. So just by installing the Lex Fridman Transcripts or The Great Gatsby knowledge base package, you could start asking questions about any of the content the package provides. This data might have utility to others, but it is intended to exist as a standalone resource that the assistant queries to aid in Q/A tasks.
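
As a rough sketch of the initialization step described above, assuming the assistant already has a list of installed knowledge base packages: every class, field, and data stream name below is hypothetical, and fetching the packages from the cluster is left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class KnowledgeBasePackage:
    name: str
    description: str
    data_streams: List[str]


@dataclass
class Tool:
    name: str
    instructions: str
    data_streams: List[str]


def register_tools(packages: List[KnowledgeBasePackage]) -> Dict[str, Tool]:
    """Create one retrieval tool per installed knowledge base package."""
    tools: Dict[str, Tool] = {}
    for pkg in packages:
        tools[pkg.name] = Tool(
            name=f"query_{pkg.name}",
            # The package description tells the assistant when to use the tool.
            instructions=f"Query these data streams when asked about: {pkg.description}",
            data_streams=list(pkg.data_streams),
        )
    return tools


installed = [
    KnowledgeBasePackage(
        name="great_gatsby",
        description="the full text of The Great Gatsby",
        data_streams=["kb-great_gatsby"],  # hypothetical data stream name
    ),
]
tools = register_tools(installed)
```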

  • Does it use elastic-agent?

No immediate need for an agent configuration to append to the static data, but I could definitely see future use cases where you might want to keep adding data to these knowledge base indices. Or put differently, I could see use cases for existing integrations to include static data like this. E.g. what if the Amazon RDS integration included embeddings of their documentation and API, thus allowing the assistant to answer questions about the user's infrastructure without further user input? They're using Amazon RDS, and now all of a sudden their assistants know this and can provide additional value without having to fetch/write additional information after the fact.

If these packages only manage one data stream, and the data is not collected by agents, I would say that we need a new specific package type.

From an architectural perspective, are you thinking of a separate package that is referenced by the 'main/ingest' package, as you outlined in elastic/package-spec#351? As a solutions dev, and from the user's perspective, I currently prefer having this functionality in the same integration, for the capability and flexibility mentioned above. I'm just coming back up to speed with the package-spec and surrounding infrastructure, so I totally understand if that goes against some explicit separation of concerns that we're trying to keep here... and in that case don't mind me 😅.


elasticmachine commented Jan 31, 2024

💔 Build Failed

Failed CI Steps

History

cc @spong


botelastic bot commented Mar 1, 2024

Hi! We just realized that we haven't looked into this PR in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Mar 1, 2024
Member Author

spong commented Mar 5, 2024

Hoping to pick this back up sometime next week. Will open corresponding package-spec and kibana PRs as noted here and we can continue collaboration from there 🙂

@botelastic botelastic bot removed the Stalled label Mar 5, 2024

botelastic bot commented Apr 5, 2024

Hi! We just realized that we haven't looked into this PR in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Apr 5, 2024
Member Author

spong commented Apr 5, 2024

I've had to re-focus on some immediate items for 8.14, so will need to see those through. Hopefully can get back to this shortly after feature freeze.

@botelastic botelastic bot removed the Stalled label Apr 5, 2024

botelastic bot commented May 5, 2024

Hi! We just realized that we haven't looked into this PR in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!


botelastic bot commented Jun 4, 2024

Hi! This PR has been stale for a while and we're going to close it as part of our cleanup procedure. We appreciate your contribution and would like to apologize if we have not been able to review it, due to the current heavy load of the team. Feel free to re-open this PR if you think it should stay open and is worth rebasing. Thank you for your contribution!

Labels
enhancement New feature or request Stalled
3 participants