Decision: How to handle resources search

Thing	Info
Relevant features	Full-text resources search
Date started	2024-04-11
Date finished
Decision status	Working on it
Summary of outcome

Background/context

Outside of the regulation text itself, CMCS policy staff need to find and reference Medicaid & CHIP policy information in a wide variety of materials that are hosted on many websites, including statute (uscode.house.gov; govinfo.gov; congress.gov), rules (federalregister.gov), subregulatory guidance and implementation resources (medicaid.gov), GAO reports (gao.gov), and other informational materials (cms.gov; hhs.gov). It can take a lot of experience and time to find what you need. People also sometimes use unreliable websites when they don't have experience with better options.

We're trying to reduce their burden by providing a "one stop shop" experience for Medicaid & CHIP policy information. We've already made a curated, annotated, cross-referenced list of about 2600 links to resources.

We need to improve our resources search system because it currently meets only some user needs.

We have two technologies we can use for resources search. Each has significant benefits and significant limitations.

Core questions

How do we provide a resources search experience that meets most of our user needs?

What we know

User needs

When searching guidance, rules, and other subregulatory or supplemental materials, policy researchers need relevant and comprehensive search results.

Relevant means: When a user enters a query and looks at the top results, those documents should be substantially related to the topic described in the query. Lower results should be moderately related to the query. In other words: if we have anything in our collection that answers their question or provides what they're looking for, it should be in the top results.

Comprehensive means: When a user enters a query, our search system returns results from across our entire collection of resources. Within our entire collection, the system looks for the query words in the entire document text. No documents in our collection are omitted from the search.

We don't have to provide a perfect search system. But before we change our search system, we need to be confident that the change makes it better for most users, most of the time.

Reference materials: supplemental content search stories on Dovetail, some quotes about related needs, comparison of results for recent searches.

In-house metadata search

Using Postgres search, we return results from our collection of document names and descriptions.

Pros:

Relevance:
- It produces highly relevant results for many organic searches, because if a keyword is in a document name or description, that's a very strong signal that the document is relevant to that keyword.
- We control the metadata that we're searching (by editing items in our admin panel), which enables us to improve relevance. Examples:
  - Some documents, such as older SMDLs, don't have titles in the document, so we write brief descriptions in our admin panel.
  - If a document title only uses an abbreviation for a term, we can write a modified description that spells out the term and includes the abbreviation as well.
  - We have many links to extremely long PDFs of old Federal Register documents, which were scanned and may not even be OCRed. We hand-wrote the descriptions for those documents into our database, and we index that metadata, so our search consistently returns those results when their descriptions match query keywords.
Comprehensiveness:
- It reliably produces results from our complete index of documents.
Cost:
- This is our existing low-cost solution.

Cons:

Relevance:
- None
Comprehensiveness:
- Because search is limited to document metadata, not the text of the documents, this search is not sufficiently comprehensive. Many organic queries return zero or few results, even when we have documents with contents that contain the query keywords, because the keywords are not in the document name or description. This challenge means that this search frequently does not produce the results that our users are looking for.
Cost:
- None
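
To make the comprehensiveness gap concrete, here is a minimal in-memory sketch in Python. It is not our Postgres implementation: the `Resource` fields, the sample document, and the simple word-matching rule are all assumptions made for illustration. It only shows why a keyword that appears in a document's text, but not in its name or description, produces zero results in a metadata-only search.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    # Hypothetical fields; the real schema is not described in this document.
    name: str
    description: str
    body: str  # full document text, which metadata search never sees


def matches(query: str, text: str) -> bool:
    """True if every query word appears as a word in the text (case-insensitive)."""
    words = text.lower().split()
    return all(term in words for term in query.lower().split())


def metadata_search(query: str, resources: list[Resource]) -> list[Resource]:
    # Looks only at the name and description, like a metadata-only search.
    return [r for r in resources if matches(query, f"{r.name} {r.description}")]


def full_text_search(query: str, resources: list[Resource]) -> list[Resource]:
    # Also looks at the document body.
    return [
        r for r in resources
        if matches(query, f"{r.name} {r.description} {r.body}")
    ]


# Made-up sample document for illustration only.
collection = [
    Resource(
        name="Letter on coverage extension",
        description="Guidance for states on extending coverage",
        body="This letter discusses postpartum coverage options for states",
    ),
]

print(len(metadata_search("postpartum", collection)))   # 0
print(len(full_text_search("postpartum", collection)))  # 1
```

In this toy example, adding "postpartum" to the description in the admin panel would make the metadata search find the document, which is the manual curation described under Pros.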

Search.gov full-text search

Search.gov tries to index the full text of our complete collection of document links.

Pros:

Relevance:
- It produces relevant results for many organic searches that produce 0 results with metadata search.
Comprehensiveness:
- It indexes the complete text of HTML pages, PDFs, Word docs, Excel sheets, and more.
Cost:
- This service is completely free to us, including crawling, document storage, and indexing.
The Search.gov team plans to improve this tool. They have their own engineering team with specialized skill in search engines.

Cons:

Relevance:
- For some queries, such as "postpartum" or "dental", the ranking order is confusing: the top result seems less relevant than the last result. (We estimated relevance by counting the number of times the term appears in a document relative to its number of pages.)
- For multi-word queries, the initial results can be mostly irrelevant, even though a quoted search for that phrase generates relevant results.
- It has a naive form of stemming that creates irrelevant matches. Example: search for "community first choice" (without quotes) and get results with the term "communications".
- It indexes navigation menus for the Federal Register and other websites, so it produces a lot of irrelevant results if your query word happens to be in the navigation menu.
Comprehensiveness:
- It is not able to consistently index our entire collection of documents. About half of our documents seem to be missing from its results. Many organic queries return incomplete results, missing many documents with descriptions and contents that contain the query keywords. We don't fully understand why this is happening. See "What we don't know" below for opportunities to learn more.
- It cannot index certain items hosted on sites that block its crawler, mainly MACPro training videos hosted on YouTube and Streamlined Modular Certification Word/Excel documents hosted on GitHub.
Cost:
- None
The Search.gov team does not plan to do substantial work on this cherry-picked index tool this quarter. Searching a cherry-picked list of URLs across many websites is a less common use case for them than "search all of the pages within this one website", so it's a lower priority. They do plan to improve its relevance calculations, such as by not indexing navigation menus.
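
The rough relevance estimate mentioned in the ranking-order item above (term occurrences relative to page count) could be computed along these lines. This is a sketch under assumptions: the function name `term_density` is hypothetical, it expects the document text to be already extracted, and whole-word, case-insensitive matching is one reasonable choice rather than the rule we actually used.

```python
import re


def term_density(text: str, term: str, page_count: int) -> float:
    """Occurrences of `term` per page, using whole-word, case-insensitive matching."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    pattern = r"\b" + re.escape(term) + r"\b"
    hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return hits / page_count


# Made-up sample text: 2 matches across 2 pages.
sample = "Dental benefits are covered. See the dental services section."
print(term_density(sample, "dental", page_count=2))  # 1.0
```

Comparing this number for the first and last results of a query gives a quick, if crude, check on whether the ranking order looks reasonable.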

Hypothetically developing our own custom full-text resources search system

If we wanted to crawl and index documents ourselves, we would need to estimate the potential AWS costs of that work and review it with DSG before proceeding. DSG is not likely to approve any significant AWS cost increases.

What we don't know

About Search.gov:

When Search.gov indexes Medicaid.gov pages, why does Medicaid.gov often return a 403 error even for pages that work fine otherwise?
- We're scheduled to learn more about this on Wednesday.
Why does Search.gov not index many of the pages in the RSS feed that we send them?
- We'll try a delete-and-reindex after we learn more about the 403 error issue. We've resolved a lot of issues with our RSS feed since their initial indexing of it!
How long will it take for Search.gov to improve the feature that we're using?
- Not this quarter, maybe next quarter, but they don't know for sure.
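
One way to track the indexing gap while we investigate is to compare the URLs in our collection against the URLs that actually appear in the index. The sketch below assumes both lists are already available as plain strings. How the indexed URLs are gathered is left open, and `coverage_report` is a hypothetical helper name.

```python
def coverage_report(collection_urls, indexed_urls):
    """Return (share of our collection found in the index, sorted missing URLs)."""
    collection = set(collection_urls)
    # Ignore indexed URLs that are not part of our collection.
    indexed = set(indexed_urls) & collection
    missing = sorted(collection - indexed)
    share = len(indexed) / len(collection) if collection else 0.0
    return share, missing


# Placeholder URLs for illustration only.
ours = [
    "https://example.gov/a",
    "https://example.gov/b",
    "https://example.gov/c",
    "https://example.gov/d",
]
theirs = ["https://example.gov/a", "https://example.gov/c", "https://example.gov/zzz"]

share, missing = coverage_report(ours, theirs)
print(share)    # 0.5
print(missing)  # ['https://example.gov/b', 'https://example.gov/d']
```

Running the same comparison before and after a delete-and-reindex would show whether the reindex helped. The sketch compares URLs as exact strings, so differences such as trailing slashes would need to be normalized first.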

About impacts on our system:

What would be the maintenance and technical debt implications of a hybrid option?
- We only have a prototype implementation right now; we haven't determined what a production implementation would look like.
- Would a hybrid implementation make it harder for us to apply routine updates to Postgres, Django, Vue, or any of our other components?
- If a hybrid implementation didn't end up working well for us, could we remove it relatively easily and revert to our metadata-only search?
What would be the efforts and costs involved in an alternate option?

Things we need to decide + options for them

How do we best meet our user needs?

These systems have complementary strengths and weaknesses. Theoretically, if you combine the results from both systems, you can get relevant and comprehensive results.

Could we produce an interim hybrid solution that would hold us over until Search.gov modernizes its system?
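
As a rough illustration of what combining the two result lists could look like, here is a minimal Python sketch. It does not describe our prototype. The function name `merge_results`, the use of URLs or IDs as result keys, and the "metadata hits first" ordering are all assumptions, chosen because a keyword match in a name or description is described above as a very strong relevance signal.

```python
def merge_results(metadata_hits, full_text_hits, limit=20):
    """Combine two ranked lists of result keys, dropping duplicates.

    Metadata hits come first, followed by full-text hits that were not
    already listed. Each input list keeps its own internal order.
    """
    merged = []
    seen = set()
    for key in list(metadata_hits) + list(full_text_hits):
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return merged[:limit]


# Placeholder result keys for illustration only.
from_metadata = ["doc-1", "doc-2"]
from_full_text = ["doc-2", "doc-3", "doc-1", "doc-4"]

print(merge_results(from_metadata, from_full_text))
# ['doc-1', 'doc-2', 'doc-3', 'doc-4']
```

Other policies are possible, such as interleaving the two lists or boosting documents that appear in both. Which policy serves users best would need to be checked against real queries.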

Decision

Consequences

Please note that all pages on this GitHub wiki are draft working documents, not complete or polished.

Our software team puts non-sensitive technical documentation on this wiki to help us maintain a shared understanding of our work, including what we've done and why. As an open source project, this documentation is public in case anything in here is helpful to other teams, including anyone who may be interested in reusing our code for other projects.

For context, see the HHS Open Source Software plan (2016) and CMS Technical Reference Architecture section about Open Source Software, including Business Rule BR-OSS-13: "CMS-Released OSS Code Must Include Documentation Accessible to the Open Source Community".

For CMS staff and contractors: internal documentation on Enterprise Confluence (requires login).