(EAI-1001) enable search by type #703
"tech-docs" indicates documents from the mongodb.com/docs site. SnootyDataSource has this type
"devcenter" indicates documents from the mongodb.com/developer site. DevCenterDataSource has this type
*/
pageType?: "tech-docs" | "devcenter";
A few things:
- Please rename this to `contentType`, which is more descriptive.
- Do not hard-code the values on the Page type; keep it as `string`. Remember, this is a flexible construct, and this type is also used by pages from other sources, such as web data sources.
- We're using contentType already. I ran an aggregation against the pages collection in prod to see what the values were: Article, article, News & Announcements, website, Event, Code Example, Tutorial, Podcast, Quickstart, Video. contentType is mostly used by the devcenter sources and the web scrape sources (the latter have "website" and "article" as contentTypes).
- I thought it was valuable to allow only specific pageTypes, to avoid something like "article" and "Article", which happens on contentType. We would only introduce a new pageType through a code change (it's not coming from page-scrape metadata or anything like that), so we can update the type then.
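The trade-off in this comment can be sketched as follows. This is a minimal illustration, not the code in this PR: the names `PAGE_TYPES`, `PageType`, and `isPageType` are hypothetical, and only the two literal values come from the diff. A string-literal union rejects casing variants at compile time, while a free-form string field can drift the way contentType has.

```typescript
// Hypothetical names for illustration; only the two literal values come from the diff.
const PAGE_TYPES = ["tech-docs", "devcenter"] as const;
type PageType = (typeof PAGE_TYPES)[number];

// Runtime guard for values that arrive as plain strings (for example, from a database).
function isPageType(value: string): value is PageType {
  return PAGE_TYPES.some((t) => t === value);
}

// Free-form values such as contentType can differ only in casing.
const contentTypes = ["Article", "article", "website"];

// Lowercase each value, then keep only the first occurrence of each.
const distinctIgnoringCase = contentTypes
  .map((v) => v.toLowerCase())
  .filter((v, i, all) => all.indexOf(v) === i);
```

With the union type, assigning "Devcenter" to a `PageType` field fails to compile, so the duplication seen on contentType cannot be introduced silently.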
Ah, OK. In that case, can you put the value inside the Page.metadata.page object? That way it's not included in the chunked data.
So the property would be `Page.metadata.page.type?: string`.
After some discussion, we decided to put the property at the root of the document so that it would not be chunked, but would still appear on the embedded_content document. We are also renaming the property to sourceType.
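A rough sketch of that decision, under stated assumptions: the `Page` and `EmbeddedContent` shapes below are simplified and hypothetical, and `chunkPage` is a naive fixed-size chunker standing in for the real chunking logic. The point is only that a root-level property stays out of the chunked text but can still be copied onto every embedded_content document.

```typescript
// Hypothetical, simplified shapes; the real types have more fields.
type Page = {
  url: string;
  body: string;
  sourceType?: string; // root-level, so it is not part of the chunked text
};

type EmbeddedContent = {
  url: string;
  text: string;
  sourceType?: string;
};

// Naive fixed-size chunker, standing in for the real chunking logic.
function chunkPage(page: Page, chunkSize: number): EmbeddedContent[] {
  const chunks: EmbeddedContent[] = [];
  for (let i = 0; i < page.body.length; i += chunkSize) {
    chunks.push({
      url: page.url,
      text: page.body.slice(i, i + chunkSize),
      sourceType: page.sourceType, // copied from the page root onto each chunk
    });
  }
  return chunks;
}
```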
@@ -88,6 +93,7 @@ export type QueryFilters = {
current?: boolean;
label?: string;
};
sourceType?: string;
Should this also be `Page["sourceType"]`?
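For illustration, an indexed access type keeps the filter in sync with the Page definition. The sketch below is an assumption-laden example rather than the PR's implementation: `Page` and `QueryFilters` are cut down to the one relevant field, and `toMongoFilter` is a hypothetical helper that shows how such a filter could become a MongoDB-style query document.

```typescript
// Simplified, hypothetical shape; only the relevant field is shown.
type Page = { sourceType?: string };

// Hypothetical subset of QueryFilters. The indexed access type means that
// a later change to Page["sourceType"] is picked up here automatically.
type QueryFilters = {
  sourceType?: Page["sourceType"];
};

// Hypothetical helper: build a MongoDB-style filter document from the filters.
function toMongoFilter(filters: QueryFilters): Record<string, unknown> {
  const query: Record<string, unknown> = {};
  if (filters.sourceType !== undefined) {
    query["sourceType"] = filters.sourceType;
  }
  return query;
}
```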
"tech-docs" indicates documents from the mongodb.com/docs site. SnootyDataSource has this type
"devcenter" indicates documents from the mongodb.com/developer site. DevCenterDataSource has this type
*/
sourceType?: "tech-docs" | "devcenter";
Again, do not put this typing here. Remember that the Page construct is more general. Leave it as `string`, and just apply the specific strings in the implementations.
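One way to read this request, sketched with hypothetical and simplified types (the `DataSource` interface, the constant, and both example sources below are illustrative, not the project's actual classes): the shared Page type keeps `sourceType` as a plain string, and each data source implementation supplies its own literal value.

```typescript
// Hypothetical, simplified shapes for illustration.
type Page = { url: string; sourceType?: string };

interface DataSource {
  name: string;
  fetchPages(): Page[];
}

// The literal lives with the implementation, not on the general Page type.
const SNOOTY_SOURCE_TYPE = "tech-docs";

const snootyLikeSource: DataSource = {
  name: "snooty-example",
  fetchPages: () => [
    { url: "https://mongodb.com/docs/", sourceType: SNOOTY_SOURCE_TYPE },
  ],
};

// Other sources, such as web data sources, can simply omit the field.
const webSource: DataSource = {
  name: "web-example",
  fetchPages: () => [{ url: "https://example.com/" }],
};
```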
lgtm w/ a few very small things to address before merge
* (EAI-968) ingest multiple versions (#683)
* (EAI-969) query multiple versions (#693)
* (EAI-1003) get available versions for data source (#696)
* (EAI-922) ensure only current version on hugging face dataset (#698)
* (EAI-1001) enable search by type (#703)
* add sourceType to pages and embedded_content and ability to filter by it
* change sourceRegex filter to sourceType filter
* (EAI-968) ingest multiple versions (#683)
* remove snooty prefix
* ingesting pages for all branches on each data source
* do not ingest (and delete if already exists) pages on inactive branches
* handle current version override
* cleanup unused code from previous version override implementation, tests
* update SnootyDataSource tests
* remove override for docs current version
* (EAI-969) query multiple versions (#693)
* nearest neighbor search accepts filters, defaults to current version
* parse filters to mdb query
* (EAI-1003) get available versions for data source (#696)
* get versions of a data source
* get versions for multiple data sources
* (EAI-922) ensure only current version on hugging face dataset (#698)
* exclude old versions from dataset
* add test case
* fix other tests
* fix return type of getDataSourceVersions - return object, not array
* move QueryFilters type def to embedded content store
* fix type
* (EAI-1001) enable search by type (#703)
* add sourceType to pages and embedded_content and ability to filter by it
* test case
* change sourceRegex filter to sourceType filter
* lint
Jira: https://jira.mongodb.org/browse/EAI-1001
Changes
"metadata.pageType": "tech-docs"
"metadata.pageType": "devcenter"
Notes