Automatic metadata generation using genAI #1599
Design
User Experience
Table/Folder metadata generation
Dataset metadata generation
There will be a limit on the number of Generate Metadata API calls per day, or per day and team. If the limit is exceeded, a comprehensive error message will appear in the top banner.
Data analysis
For this use case, it is relevant to describe the different types of data and metadata that serve as input to metadata generation. Depending on the data, there will be different genAI workflows.
Data.all S3 Datasets: (S3 Bucket + Glue database)
Data.all Redshift Datasets [v.2.7.0]: We need to keep them in mind for the design, but the first release of the feature will not generate metadata for Redshift.
Data scenarios
For column metadata generation (column name and column description):
For Table and Folder metadata generation:
For Dataset metadata generation:
High Level Design
Problem statement
Is your feature request related to a problem? Please describe.
Current metadata creation processes in data.all are manual and time-consuming, leading to incomplete, inconsistent, and outdated metadata. Inconsistency in metadata across datasets makes it difficult to understand and compare the information. Incomplete metadata reduces the value and usability of the data, while outdated metadata can hinder the ability to properly utilize the datasets. Additionally, the quality of manual metadata can vary significantly from dataset to dataset, depending on the data producer's expertise and available time and resources. Crucially, the burden of this undifferentiated heavy lifting falls on data producers, who must spend valuable time and resources on manual metadata creation instead of focusing on their core business problems.
The automated metadata recommendation feature can address these challenges. By leveraging GenAI techniques, the metadata recommendation process can be streamlined, standardized, and kept up to date. The feature targets the inconsistent, incomplete, and outdated metadata that results from manual approaches. It aims to improve metadata quality and consistency across data.all while freeing producers to focus on their core competencies.
User Stories
Describe the solution you'd like
US1.
As a Data Producer, I want automated metadata recommendation for data.all datasets, including but not limited to dataset description, tags, topics, table description and column description, so that I can ensure datasets are discoverable and well-documented without manual effort.
Acceptance Criteria:
US2.
As a data producer, I want the ability to run the automated metadata recommendation feature on demand, so that I can keep the data catalog information up-to-date as my data assets evolve.
Acceptance Criteria:
US3.
As a Data Producer, I want the ability to review, edit, and annotate automatically recommended metadata, so that I can ensure its accuracy and relevance while leveraging the automated process.
Acceptance Criteria:
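One way to picture the review step in US3 is as a small state machine around each recommended value. The following is a minimal sketch only, not the data.all implementation: the names `MetadataSuggestion`, `ReviewStatus`, `values_to_persist` and the `"table:orders"` target format are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class ReviewStatus(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    EDITED = "edited"
    REJECTED = "rejected"


@dataclass
class MetadataSuggestion:
    """One generated value awaiting review (hypothetical structure)."""
    target: str                 # e.g. "table:orders" (assumed naming scheme)
    attribute: str              # e.g. "description", "tags"
    generated_value: str        # what the model recommended, kept for audit
    status: ReviewStatus = ReviewStatus.PENDING
    final_value: Optional[str] = None
    note: Optional[str] = None  # free-text annotation from the producer

    def accept(self) -> None:
        # Keep the generated value unchanged.
        self.status = ReviewStatus.ACCEPTED
        self.final_value = self.generated_value

    def edit(self, new_value: str, note: Optional[str] = None) -> None:
        # Replace the generated value with the producer's own wording.
        self.status = ReviewStatus.EDITED
        self.final_value = new_value
        self.note = note

    def reject(self, note: Optional[str] = None) -> None:
        # Discard the recommendation; nothing is persisted for it.
        self.status = ReviewStatus.REJECTED
        self.final_value = None
        self.note = note


def values_to_persist(
    suggestions: List[MetadataSuggestion],
) -> Dict[Tuple[str, str], str]:
    """Return only the values the producer accepted or edited."""
    approved = (ReviewStatus.ACCEPTED, ReviewStatus.EDITED)
    return {
        (s.target, s.attribute): s.final_value
        for s in suggestions
        if s.status in approved and s.final_value is not None
    }
```

In this sketch nothing reaches the catalog until a producer has explicitly accepted or edited it; pending and rejected suggestions are simply left out.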
US4.
As a Data Consumer, I want to use advanced search and filtering options based on enriched metadata to find relevant datasets quickly and efficiently.
Acceptance Criteria:
US5.
As a data.all developer and maintainer, I want the automated metadata recommendation feature to be secure and respect data governance access permissions.
Acceptance Criteria:
US6.
As a data.all developer and maintainer, I want the automated metadata recommendation feature to be configurable, scalable, reliable, and seamlessly integrated into the data.all platform, so that I can ensure a smooth and efficient user experience for all data.all users.
Acceptance Criteria:
US7.
As a data.all developer and maintainer, I want to be able to configure rate limits for the automated metadata recommendation feature so that I can prevent overuse and ensure responsible access to the feature.
Acceptance Criteria:
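As an illustration of the configurable limit in US7, a per-team daily counter could look roughly like the sketch below. It is an in-memory example under stated assumptions: `DailyRateLimiter`, `DailyLimitExceeded`, and `max_calls_per_day` are hypothetical names, and a real deployment would most likely keep the counters in a shared store rather than in process memory.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Optional, Tuple


class DailyLimitExceeded(Exception):
    """Raised when a team has used up its daily quota of generation calls."""


class DailyRateLimiter:
    """Counts Generate Metadata calls per (team, day); hypothetical helper."""

    def __init__(self, max_calls_per_day: int) -> None:
        self.max_calls_per_day = max_calls_per_day
        self._counts: Dict[Tuple[str, date], int] = defaultdict(int)

    def check_and_record(self, team: str, today: Optional[date] = None) -> int:
        """Record one call and return how many calls remain for the day."""
        today = today or date.today()
        key = (team, today)
        if self._counts[key] >= self.max_calls_per_day:
            # The message is what a UI banner could surface to the user.
            raise DailyLimitExceeded(
                f"Team '{team}' has reached the limit of "
                f"{self.max_calls_per_day} metadata generation calls for {today}."
            )
        self._counts[key] += 1
        return self.max_calls_per_day - self._counts[key]
```

Keying the counter on the calendar date means the quota resets implicitly at the start of each day, and each team is tracked independently.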
US8.
As a data.all developer and maintainer, I want the automated metadata recommendation feature to clearly display a disclaimer about the limitations and confidentiality of the responses, so that I understand the context and boundaries of the AI-generated information.
Acceptance Criteria:
US9.
As a data.all developer and maintainer, I want the automated metadata recommendation feature to provide feedback functionality so that users can easily indicate if the response was helpful or not, which can then be used to improve the quality of future responses.
Acceptance Criteria:
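The helpful / not helpful signal in US9 could be captured with a very small record and aggregated per target. This is only a sketch: `Feedback` and `helpful_ratio` are hypothetical names, and how the signal would feed back into generation quality is left open here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Feedback:
    """A single thumbs-up / thumbs-down on generated metadata (hypothetical)."""
    target: str    # which dataset, table, or column the response was for
    helpful: bool  # True if the user marked the response as helpful


def helpful_ratio(entries: List[Feedback]) -> Optional[float]:
    """Share of responses marked helpful, or None if there is no feedback yet."""
    if not entries:
        return None
    return sum(1 for e in entries if e.helpful) / len(entries)
```

Returning `None` for an empty list keeps "no feedback yet" distinct from "all feedback was negative".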
Scope
1/ Metadata Generation:
2/ Metadata Review and Acceptance:
3/ Backward Compatibility for Existing Datasets:
4/ On-demand Metadata Refresh:
5/ Metadata-driven Search and Filtering:
Out of Scope
Guardrails
Describe alternatives you've considered
See design below
Additional context
This feature will first be implemented as an MVP and then reworked to make it production-ready.