Feature: repository search
The purpose of this page is to document how our search works (at a high level).
We need to show the following materials in our blended search results:
- Regulations
- Public policy documents that we link to:
- Federal Register rules
- Public links: subregulatory guidance, technical assistance, etc.
- Internal policy documents (internal to CMCS):
- Internal files: uploaded files that are hosted within eRegulations
- Internal links: Box, Sharepoint, or other URLs hosted within other CMCS tools
- Return relevant results in less than 2 seconds.
- Return meaningful highlighting of where your query term shows up in the document.
- Provide a bit of interpretation instead of being 100% literal (such as applying stop words, stemming, etc.), but not so much that it prevents relevant results.
Helpful test queries:
- state
- At rank filter 0.05, this returns 2800+ results
- "state plan"
- At rank filter 0.05, this returns 1400+ results
- state plan amendment
- At rank filter 0.05, this returns 2000+ results
- Should show the stemmed word "amendment" in highlights
- Medicare
- At rank filter 0.05, this returns 2100+ results
- Shouldn't just return results for "medical" at the top
- personal care services
- At rank filter 0.05, this returns 1700+ results
- Should show the stemmed words "personal" and "services" in highlights
We use Postgres full-text search via Django's support for Postgres full-text search.
In our Postgres database we have:
- The full text of regulation sections in scope, imported via eCFR API
- Metadata about each document:
- Imported via Federal Register API for post-1994 rules (and hand-corrected as needed)
- Entered by hand for everything else
- The full text of most documents, extracted via our Text Extractor Lambda. See that page for details. It uses:
- Python Requests to grab content from URLs, respecting robots.txt and providing a custom user agent (CMCSeRegsTextExtractorBot/1.0)
- Google Magika to detect file types
- AWS Textract to process PDFs, including text detection for scanned documents (this is about $1.50 per 1000 pages)
- Several open source libraries to process a lot of file types
As of November 2024: For PDFs, our system only attempts to extract the first 50 pages of each document, to put a bound on resource usage. We need to revisit this.
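As a rough illustration of the fetch step above, here is a minimal sketch of checking robots.txt before fetching, using only the Python standard library (the real extractor uses the Requests library, and the robots.txt content and URLs below are hypothetical):

```python
from urllib import robotparser

# The custom user agent our Text Extractor Lambda presents.
USER_AGENT = "CMCSeRegsTextExtractorBot/1.0"

# Hypothetical robots.txt content, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

def allowed(url: str) -> bool:
    """Only fetch a URL if robots.txt allows our user agent to access it."""
    return parser.can_fetch(USER_AGENT, url)

print(allowed("https://example.com/guidance.pdf"))   # allowed by this robots.txt
print(allowed("https://example.com/private/x.pdf"))  # disallowed by this robots.txt
```

In the real system, a disallowed URL would simply be skipped rather than fetched.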
Our FR parser includes a special step to enable search indexing because the FR website does not allow scraping their normal URLs: we fetch their text-only URL via their API and give that Extract_URL to the Text Extractor Lambda instead of the normal URL.
The raw text content of an indexed FR link can be 2-3 MB or more.
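As a sketch of how that text-only URL lookup could work: the public FederalRegister.gov API exposes a `raw_text_url` field on document records, which we assume is the field used here (this page doesn't name it, so treat the field and the document number below as illustrative):

```python
from urllib.parse import urlencode

# Public FederalRegister.gov API, documents endpoint.
FR_API = "https://www.federalregister.gov/api/v1/documents"

def fr_text_only_lookup_url(document_number: str) -> str:
    """Build the API URL that returns a rule's text-only URL.

    The response's raw_text_url field (assumed here) points at a plain-text
    rendering of the document, which can be handed to the Text Extractor
    Lambda as the Extract URL instead of the normal, scrape-blocked page URL.
    """
    query = urlencode({"fields[]": "raw_text_url"})
    return f"{FR_API}/{document_number}.json?{query}"

# Hypothetical FR document number, for illustration only.
print(fr_text_only_lookup_url("94-28960"))
```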
We don't index pre-1994 FR links because we link to PDFs that we can't scrape:
- FederalRegister.gov "issue slice" links (we hand-generate these links by going to a URL like https://www.federalregister.gov/citation/55-FR-33907 and selecting "Document View")
- Example: https://archives.federalregister.gov/issue_slice/1990/8/20/33905-33909.pdf#page=3
- The FR site does not allow scraping them
- They usually include the document we want and several unrelated documents that were published on the same day by other agencies
- Usually less than 10 MB
- GovInfo.gov archival issues (table of contents)
- Example: https://www.govinfo.gov/content/pkg/FR-1976-04-13/pdf/FR-1976-04-13.pdf#page=142 (details page)
- Too big to scrape
- They include tons of unrelated documents that were published on the same day by other agencies
- Can be 50 to 150 MB or more
- LOC.gov archival issues (collection page)
- Example: https://tile.loc.gov/storage-services/service/ll/fedreg/fr043/fr043151/fr043151.pdf#page=246
- Similar to GovInfo archival issues
Searching the full text of internal links depends on whether we can programmatically access and index those documents through APIs or other tools, and on whether we've been able to build an integration. Our current infrastructure does not include any such integrations.
See Resources linking system for details about our metadata fields for public links, FR links, internal links, and internal files.
Example:
- FR links, public links, internal links, and internal files can be marked "approved" or not approved in the admin panel. Items that aren't approved are only visible in the admin panel (which is only available to logged-in users), never shown in search results or elsewhere on the site.
- If you're not logged in, you cannot see internal documents (internal files or internal links) in search results or elsewhere on the site.
In search results, we always show the following document metadata if available:
- Document category
- Date
- Subjects
- Related citations
If the desired keyword(s) exist only in the document metadata (FR link Document ID or Title, public link Document ID or Title, internal file Title or Summary, etc.), show that document metadata. This means:
- FR link: Document ID (grey metadata) and Title (blue link)
- Public link: Document ID (grey metadata) and Title (blue link)
- Internal link: Document ID (grey metadata) and Title (blue link)
- Internal files: Title (blue link) and Summary (black text)
If the desired keyword(s) also exist in the extracted document text, show the Document ID and Title (grey metadata and blue link) AND:
- For all types of documents: if we have a relevant excerpt (headline) from the full-text content, display it in black text. (For internal files, this headline replaces the summary.)
- The amount of text is configured via the variables SEARCH_HEADLINE_MAX_WORDS, SEARCH_HEADLINE_MIN_WORDS, and SEARCH_HEADLINE_MAX_FRAGMENTS. These correspond to MaxWords, MinWords, and MaxFragments as described in 12.3.4. Highlighting Results.
- Note that we show headlines from only the first 50k characters in a document, because otherwise search is really slow. If you imagine a plain document in 12 point font, 50k characters is about 10 pages. (This is configured via the SEARCH_HEADLINE_TEXT_MAX variable.)
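As a conceptual sketch of what those settings control (this is not the Postgres ts_headline implementation, and the values below are hypothetical, not our production configuration):

```python
from typing import Optional

SEARCH_HEADLINE_TEXT_MAX = 50_000   # only look at the first 50k characters
SEARCH_HEADLINE_MAX_WORDS = 15      # upper bound on words in one fragment

def naive_headline(text: str, term: str) -> Optional[str]:
    """Return a window of words around the first match, or None if no match."""
    words = text[:SEARCH_HEADLINE_TEXT_MAX].split()
    for i, word in enumerate(words):
        if term.lower() in word.lower():
            start = max(0, i - SEARCH_HEADLINE_MAX_WORDS // 2)
            return " ".join(words[start:start + SEARCH_HEADLINE_MAX_WORDS])
    return None

doc = "A state plan amendment lets a state change its Medicaid program details."
print(naive_headline(doc, "amendment"))  # prints a fragment around the match
```

Postgres does the real work (stemming, ranking fragments, honoring MinWords and MaxFragments); this only shows why bounding the scanned text and the fragment size keeps headline generation cheap.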
We highlight matching keywords in bold in the document Title and Summary or excerpt. We could highlight them in the Document ID as well.
For background, see Ranking Search Results in the Postgres docs.
When Django directs Postgres to provide results for a query, each potential result for a query gets a ts_rank score. See the definition of ts_rank: "Computes a score showing how well the vector matches the query."
A high score (such as 0.1) means very relevant, while a low score (such as 0.01) means not very relevant.
We have an environment variable that tells Postgres how to filter the results: should it show only the few most relevant results, or lots of results, including less relevant results at the end? A higher filter (like 0.1) means show fewer results, and a lower filter (like 0.01) means show lots of results.
The rank filter value for each environment is in our parameter store: BASIC_SEARCH_FILTER and QUOTED_SEARCH_FILTER.
Rank filter is 0.05 in all environments, for both basic (not quoted) and phrase (quoted) search queries.
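As a conceptual sketch of the rank filter's effect (this is plain Python, not our Django query, and the titles and scores are made up):

```python
# Each hit has a ts_rank-style score; the filter drops anything below the
# threshold. Scores here are invented for illustration.
hits = [
    ("State plan amendments overview", 0.61),
    ("Managed care rule",              0.12),
    ("Tangentially related guidance",  0.03),
]

def filter_by_rank(results, rank_filter):
    """Keep only results whose rank meets the threshold, best first."""
    kept = [(title, rank) for title, rank in results if rank >= rank_filter]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

print(len(filter_by_rank(hits, 0.05)))  # 2 results at our filter of 0.05
print(len(filter_by_rank(hits, 0.01)))  # 3 results with a looser filter
```

Raising the filter trades recall for precision: the same query returns fewer, more relevant results.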
We're not yet using any of the Postgres document length normalization options.
To make search faster, we create and automatically maintain a "vector_column" with a pre-processed version of each content item. We create the pre-processed version using "weight" values for various parts of the metadata and content for an item, so that (for example) a word in the title of a document counts more toward relevance than a word in the body of a document.
Context about decisions we made for weights (login required).
Weights for documents:
- (FR link) Document ID: A
- (Public link) Document ID: A
- (Internal file) Document ID: A
- (FR link) Title: A
- (Public link) Title: A
- (Internal file) summary: B
- (Internal file) filename: C
- Date: C
- Subjects (full names, short names, and abbreviations): D
- Content: D
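To see why these letters matter: Postgres's default ts_rank multipliers for weights {D, C, B, A} are {0.1, 0.2, 0.4, 1.0}. Here is a toy sketch (not how ts_rank actually computes scores, which also accounts for term frequency and proximity) showing how a title match outweighs a body match under those multipliers:

```python
# Postgres's default ts_rank weight multipliers, weakest to strongest.
WEIGHT_MULTIPLIERS = {"D": 0.1, "C": 0.2, "B": 0.4, "A": 1.0}

# A subset of our document field weights, per the list above.
FIELD_WEIGHTS = {"title": "A", "summary": "B", "filename": "C", "content": "D"}

def toy_score(matched_fields):
    """Very rough sketch: sum the multipliers of every field the term matched."""
    return sum(WEIGHT_MULTIPLIERS[FIELD_WEIGHTS[f]] for f in matched_fields)

# A match in the title alone beats a match in the content alone.
print(toy_score(["title"]))    # 1.0
print(toy_score(["content"]))  # 0.1
```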
We may want to:
- Add FR docket numbers to weight A
- Bump subjects up to weight C
- Add related regulation and statute citations to weight C
Weights for regulation text sections:
- Section number: A
- Section title: A
- Part title: A
- Content: B
We may want to:
- Add subpart title to weight B
- Reduce part title to weight B
- Reduce content to weight C
Please note that all pages on this GitHub wiki are draft working documents, not complete or polished.
Our software team puts non-sensitive technical documentation on this wiki to help us maintain a shared understanding of our work, including what we've done and why. As an open source project, this documentation is public in case anything in here is helpful to other teams, including anyone who may be interested in reusing our code for other projects.
For context, see the HHS Open Source Software plan (2016) and CMS Technical Reference Architecture section about Open Source Software, including Business Rule BR-OSS-13: "CMS-Released OSS Code Must Include Documentation Accessible to the Open Source Community".
For CMS staff and contractors: internal documentation on Enterprise Confluence (requires login).