(EAI-1001) enable search by type #703

Merged
yakubova92 merged 5 commits into EAI-428-feature-versioned-docs from EAI-1001
May 9, 2025

Conversation

yakubova92
Collaborator

Jira: https://jira.mongodb.org/browse/EAI-1001

Changes

  • "metadata.pageType" field added
  • SnootyDataSource creates pages with "metadata.pageType": "tech-docs"
  • DevCenterDataSource creates pages with "metadata.pageType": "devcenter"
  • index created for new field, vector search index updated to include new field

Notes

  • This will allow us to change the filter here to find public datasets (snooty or devcenter) by filtering on the "metadata.pageType" field
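A minimal sketch of what such a filter might look like, assuming the field keeps the name "metadata.pageType" used in this description (it is renamed later in the review). The names `publicPageTypes` and `makePublicPagesFilter` are hypothetical and do not come from the codebase.

```typescript
// Hypothetical list of the page types that count as public datasets.
const publicPageTypes = ["tech-docs", "devcenter"] as const;

type PublicPageType = (typeof publicPageTypes)[number];

// Hypothetical helper: builds a MongoDB-style query filter that matches
// pages whose "metadata.pageType" is one of the given types.
function makePublicPagesFilter(
  types: readonly PublicPageType[] = publicPageTypes
): Record<string, { $in: PublicPageType[] }> {
  return { "metadata.pageType": { $in: Array.from(types) } };
}

console.log(JSON.stringify(makePublicPagesFilter()));
```

The resulting object is a plain query document, so it could be passed to a `find` call on the pages collection.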

@yakubova92 yakubova92 requested a review from mongodben May 7, 2025 21:33
"tech-docs" indicates documents from the mongodb.com/docs site. SnootyDataSource has this type
"devcenter" indicates documents from the mongodb.com/developer site. DevCenterDataSource has this type
*/
pageType?: "tech-docs" | "devcenter";
Collaborator

a few things:

  1. pls rename this to contentType. more descriptive
  2. do not hard code in the values on the Page type. keep as string. remember, this is a flexible construct. also this type is used by pages from other sources, such as web data sources, etc.

Collaborator Author

  1. We're already using contentType. I ran an aggregation against the pages collection in prod to see what the values were: Article, article, News & Announcements, website, Event, Code Example, Tutorial, Podcast, Quickstart, Video. contentType is mostly used by the devcenter sources and the web scrape sources (which have "website" and "article" as contentTypes).

  2. I thought it was valuable to allow only specific pageTypes to avoid something like "article" and "Article", which happens on contentType. We would only introduce a new pageType through a code change (it's not coming from metadata from a page scrape or anything), so we can update the type then.
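
To illustrate the point, here is a small sketch. The value list is the one quoted above; `findCaseVariants` and the `PageType` alias are hypothetical names used only for this example.

```typescript
// contentType values reported above for the pages collection.
const contentTypes: string[] = [
  "Article", "article", "News & Announcements", "website", "Event",
  "Code Example", "Tutorial", "Podcast", "Quickstart", "Video",
];

// Groups values that differ only by letter case and returns the groups
// that contain more than one spelling.
function findCaseVariants(values: string[]): string[][] {
  const groups = new Map<string, Set<string>>();
  for (const value of values) {
    const key = value.toLowerCase();
    const group = groups.get(key) ?? new Set<string>();
    group.add(value);
    groups.set(key, group);
  }
  return Array.from(groups.values())
    .filter((group) => group.size > 1)
    .map((group) => Array.from(group));
}

// A closed union makes such variants a compile-time error.
type PageType = "tech-docs" | "devcenter";
const example: PageType = "tech-docs";
// const bad: PageType = "Tech-Docs"; // would not compile

console.log(example, findCaseVariants(contentTypes)); // tech-docs [ [ 'Article', 'article' ] ]
```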

Collaborator

ah ok. in that case can you put the value inside of the Page.metadata.page object? this way it's not included in the chunked data.

so the property would be Page.metadata.page.type?: string

Collaborator Author

After some discussion, we decided to put the property at the root of the document so that it would not be chunked, but would appear on the embedded_content document. Renaming the property to sourceType.
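
A rough sketch of that arrangement, under stated assumptions: the `Page` and `EmbeddedContent` shapes below are simplified stand-ins, `toEmbeddedContent` is a hypothetical name, and fixed-size splitting is used only for illustration since the real chunking strategy is not shown in this thread.

```typescript
// Simplified, hypothetical stand-ins for the project's types.
interface Page {
  url: string;
  body: string;
  sourceType?: string; // at the document root, outside the chunked body
}

interface EmbeddedContent {
  url: string;
  text: string;
  chunkIndex: number;
  sourceType?: string;
}

// Splits only the body into fixed-size chunks (an assumption) and copies
// the root-level sourceType onto every resulting chunk document.
function toEmbeddedContent(page: Page, chunkSize: number): EmbeddedContent[] {
  const chunks: EmbeddedContent[] = [];
  for (let start = 0; start < page.body.length; start += chunkSize) {
    chunks.push({
      url: page.url,
      text: page.body.slice(start, start + chunkSize),
      chunkIndex: chunks.length,
      // Only set the field when the page actually has a sourceType.
      ...(page.sourceType !== undefined ? { sourceType: page.sourceType } : {}),
    });
  }
  return chunks;
}

const chunks = toEmbeddedContent(
  { url: "https://mongodb.com/docs/example", body: "abcdefgh", sourceType: "tech-docs" },
  3
);
console.log(chunks.map((chunk) => `${chunk.chunkIndex}:${chunk.text}:${chunk.sourceType}`));
```

Because the value lives beside the body rather than inside it, it never ends up in the chunk text but is still available for filtering on each embedded_content document.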

@yakubova92 yakubova92 requested a review from mongodben May 9, 2025 16:16
@@ -88,6 +93,7 @@ export type QueryFilters = {
current?: boolean;
label?: string;
};
sourceType?: string;
Collaborator

should this also be Page["sourceType"]?
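
For reference, a sketch of what the indexed access type would look like. The `Page` and `QueryFilters` shapes here are trimmed-down, hypothetical versions that show only the new field, and `matchesFilters` is an illustrative helper, not a function from the codebase.

```typescript
// Simplified, hypothetical stand-in for the project's Page type.
type Page = {
  url: string;
  sourceType?: string;
};

// Only the new field is shown; the real QueryFilters has other fields.
type QueryFilters = {
  // Indexed access keeps this in sync with whatever Page declares.
  sourceType?: Page["sourceType"];
};

// Illustrative helper: checks a page against the sourceType filter.
function matchesFilters(page: Page, filters: QueryFilters): boolean {
  // An absent filter matches every page.
  if (filters.sourceType === undefined) return true;
  return page.sourceType === filters.sourceType;
}

console.log(matchesFilters({ url: "a", sourceType: "devcenter" }, { sourceType: "devcenter" }));
```

If the type of `Page.sourceType` changes later, the filter type follows it without a second edit.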

"tech-docs" indicates documents from the mongodb.com/docs site. SnootyDataSource has this type
"devcenter" indicates documents from the mongodb.com/developer site. DevCenterDataSource has this type
*/
sourceType?: "tech-docs" | "devcenter";
Collaborator

again, do not put this typing here. remember that the page construct is more general. leave as string. just apply the strings in the implementations
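
One way to read this suggestion, as a sketch only: the shared `Page` type stays a plain string, and each data source supplies its own literal. The factory functions below are hypothetical; the real SnootyDataSource and DevCenterDataSource implementations are not shown in this thread.

```typescript
// General page construct: sourceType stays an open string.
type Page = {
  url: string;
  sourceType?: string;
};

// The literals live with the implementations, not on the Page type.
const TECH_DOCS_SOURCE_TYPE = "tech-docs";
const DEVCENTER_SOURCE_TYPE = "devcenter";

// Hypothetical stand-in for what SnootyDataSource would produce.
function makeSnootyPage(url: string): Page {
  return { url, sourceType: TECH_DOCS_SOURCE_TYPE };
}

// Hypothetical stand-in for what DevCenterDataSource would produce.
function makeDevCenterPage(url: string): Page {
  return { url, sourceType: DEVCENTER_SOURCE_TYPE };
}

// Other sources (for example web data sources) may omit the field.
function makeWebPage(url: string): Page {
  return { url };
}

console.log(
  makeSnootyPage("https://mongodb.com/docs/").sourceType,
  makeDevCenterPage("https://mongodb.com/developer/").sourceType,
  makeWebPage("https://example.com/").sourceType
);
```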

Collaborator

@mongodben mongodben left a comment

lgtm w/ a few very small things to address before merge

@yakubova92 yakubova92 merged commit 01800ab into EAI-428-feature-versioned-docs May 9, 2025
1 check passed
@yakubova92 yakubova92 deleted the EAI-1001 branch May 9, 2025 18:39
yakubova92 added a commit that referenced this pull request May 19, 2025
* (EAI-968) ingest multiple versions (#683)

* (EAI-969) query multiple versions (#693)

* (EAI-1003) get available versions for data source (#696)

* (EAI-922) ensure only current version on hugging face dataset (#698)

* (EAI-1001) enable search by type (#703)

* add sourceType to pages and embedded_content and ability to filter by it

* change sourceRegex filter to sourceType filter
mongodben pushed a commit that referenced this pull request May 20, 2025
* (EAI-968) ingest multiple versions (#683)

* remove snooty prefix

* ingesting pages for all branches on each data source

* do not ingest (and delete if already exists) pages on inactive branches

* handle current version override

* cleanup unused code from previous version override implementation, tests

* update SnootyDataSource tests

* remove override for docs current version

* (EAI-969) query multiple versions (#693)

* nearest neighbor search accepts filters, defaults to current version

* parse filters to mdb query

* (EAI-1003) get available versions for data source (#696)

* get versions of a data source

* get versions for multiple data sources

* (EAI-922) ensure only current version on hugging face dataset (#698)

* exclude old versions from dataset

* add test case

* fix other tests

* fix return type of getDataSourceVersions - return object, not array

* move QueryFilters type def to embedded content store

* fix type

* (EAI-1001) enable search by type (#703)

* add sourceType to pages and embedded_content and ability to filter by it

* test case

* change sourceRegex filter to sourceType filter

* lint