Decision: How to handle resources search
| Thing | Info |
|---|---|
| Relevant features | Full-text resources search |
| Date started | 2023-04-11 |
| Date finished | 2023-04-28 |
| Decision status | Done |
| Summary of outcome | We decided to use search.gov. |
Outside of the regulation text itself, CMCS policy staff need to find and reference Medicaid & CHIP policy information in a wide variety of materials that are hosted on many websites, including statute (uscode.house.gov; govinfo.gov; congress.gov), rules (federalregister.gov), subregulatory guidance and implementation resources (medicaid.gov), GAO reports (gao.gov), and other informational materials (cms.gov; hhs.gov). It can take a lot of experience and time to find what you need. People also sometimes use unreliable websites when they don't have experience with better options.
We're trying to reduce their burden by providing a "one stop shop" experience for Medicaid & CHIP policy information. We've already made a curated, annotated, cross-referenced list of about 2600 links to resources.
We need to improve our resources search system because it currently meets only some user needs.
We have two technologies we can use for resources search, and each has significant benefits but significant limitations.
How do we provide a resources search experience that meets most of our user needs?
When searching guidance, rules, and other subregulatory or supplemental materials, policy researchers need relevant and comprehensive search results.
Relevant means: When a user enters a query and looks at the top results, those documents should be substantially related to the topic described in the query. Lower results should be moderately related to the query. In other words: if we have anything in our collection that answers their question or provides what they're looking for, it should be in the top results.
Comprehensive means: When a user enters a query, our search system returns results from across our entire collection of resources. Within our entire collection, the system looks for the query words in the entire document text. No documents in our collection are omitted from the search.
We don't have to provide a perfect search system. But before we change our search system, we need to be confident that the change makes it better.
Reference materials: supplemental content search stories on Dovetail, some quotes about related needs, comparison of results for recent searches.
Using Postgres search, we return results from our collection of document names and descriptions. (A minimal code sketch of this approach follows the pros and cons below.)
Pros:
- Relevance:
  - It produces highly relevant results for many organic searches, because if a keyword is in a document name or description, that's a very strong signal that the document is relevant to that keyword.
  - We control the metadata that we're searching (by editing items in our admin panel), which enables us to improve relevance. Examples:
    - Some documents, such as older SMDLs, don't have titles in the document, so we write brief descriptions in our admin panel.
    - If a document title only uses an abbreviation for a term, we can write a modified description that spells out the term and includes the abbreviation as well.
    - We have many links to extremely long PDFs of old Federal Register documents, which were scanned and may not even be OCRed. We hand-wrote the descriptions for those documents into our database, and we index that metadata, so our search consistently returns those results when their descriptions match query keywords.
- Comprehensiveness:
  - It reliably produces results from our complete index of documents.
- Cost:
  - This is our existing low-cost solution.
Cons:
- Relevance:
  - None
- Comprehensiveness:
  - Because search is limited to document metadata rather than the full text of the documents, it is not sufficiently comprehensive. Many organic queries return zero or few results, even when we have documents whose contents contain the query keywords, because the keywords are not in the document name or description. As a result, this search frequently does not produce the results that our users are looking for.
- Cost:
  - None
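For illustration, here is a minimal sketch of the metadata-search approach using Django's Postgres full-text search support. The field names (`name`, `description`) and weights are assumptions for the example, not necessarily our production schema:

```python
# Minimal sketch: Postgres full-text search over item metadata,
# using Django's built-in support. Field names are assumptions.
from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

def search_metadata(queryset, terms):
    """Rank items by how well their name and description match the query."""
    vector = SearchVector("name", weight="A") + SearchVector("description", weight="B")
    query = SearchQuery(terms)
    return (
        queryset.annotate(rank=SearchRank(vector, query))
        .filter(rank__gt=0)
        .order_by("-rank")
    )
```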
Search.gov tries to index the full text of our complete collection of document links. (A sketch of querying its API follows the pros and cons below.)
Pros:
- Relevance:
  - Due to its comprehensiveness, it produces relevant results for many organic searches that return zero or few results with metadata search.
- Comprehensiveness:
  - It indexes the complete text of HTML pages, PDFs, Word docs, Excel sheets, and more. This comprehensiveness is extremely valuable for our users.
  - It serves as an automated link-checker for us, because it tells us when it can't access a URL. It has helped us find broken URLs that we needed to update.
- Cost:
  - This service is free to us, including crawling, document storage, and indexing.
  - The Search.gov team plans to improve this tool. They have their own engineering team with specialized skill in search engines.
Cons:
- Relevance:
  - For some queries, such as "postpartum" or "dental", the results have a confusing ranking order: the top result seems less relevant than the last result (estimated by counting the number of times the term appears in the document relative to its number of pages).
  - For multi-word queries, the initial results can be mostly irrelevant, even though a quoted search for the same phrase generates relevant results.
  - It has a naive form of stemming that creates irrelevant matches. Example: searching for "community first choice" (without quotes) returns results containing the term "communications".
  - It indexes navigation menus for the Federal Register and other websites, so it produces many irrelevant results if a query word happens to appear in a navigation menu.
- Comprehensiveness:
  - It is not able to consistently index our entire collection of documents. About half of our documents seem to be missing from its results. Many organic queries return incomplete results, missing many documents whose descriptions and contents contain the query keywords. We don't fully understand why this is happening. See "What we don't know" below for opportunities to learn more.
  - It cannot index certain items hosted on sites that block its crawler, mainly MACPro training videos hosted on YouTube and Streamlined Modular Certification Word/Excel documents hosted on GitHub.
  - It cannot index documents larger than 50 MB (for example, extremely long PDF scans of old Federal Register documents).
- Cost:
  - None
- The Search.gov team does not plan to do substantial work this quarter on the feature we use: indexing a cherry-picked list of URLs across many websites is a less common use case for them than "search all of the pages within this one website", so it's a lower priority. They do plan to improve its relevance calculations (such as by not indexing navigation menus).
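As a rough illustration, here is how a backend might query Search.gov for results. The endpoint, parameter names, and response shape here are assumptions based on our understanding of the Search.gov results API, so verify against their documentation before relying on this:

```python
# Hypothetical sketch of querying the Search.gov results API.
# Endpoint and parameter names are assumptions; check search.gov docs.
import requests

SEARCHGOV_ENDPOINT = "https://api.gsa.gov/technology/searchgov/v2/results/i14y"

def searchgov_results(query, affiliate, access_key, offset=0):
    """Fetch one page of Search.gov results for the given query."""
    response = requests.get(
        SEARCHGOV_ENDPOINT,
        params={
            "affiliate": affiliate,    # site handle registered with Search.gov
            "access_key": access_key,  # API key issued by Search.gov
            "query": query,
            "offset": offset,          # for paging through results
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```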
We have an experiment showing that if you combine the results from both systems, you can get relevant and comprehensive results. The concept (sketched in code after the pros and cons below):
- Our metadata results show up as the top items, because they always have the strongest relevance.
- After that, we display the search.gov results, because they add comprehensiveness. (We remove any items that already appeared in the metadata results, to avoid duplicates.)
Pros:
- Always delivers results that are at least as relevant and comprehensive as our current results, while meeting user needs for increased comprehensiveness.
Cons:
- Not best practice from an engineering perspective; it could be fragile and tricky to maintain (see "What we don't know" below).
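A minimal sketch of the merge concept, assuming each result is a dict with a `url` key (a simplification of whatever shape our real results take):

```python
# Hypothetical sketch of the hybrid merge: metadata results first (strongest
# relevance), then search.gov results with duplicates removed by URL.
def merge_results(metadata_results, searchgov_results):
    seen_urls = {item["url"] for item in metadata_results}
    merged = list(metadata_results)
    for item in searchgov_results:
        if item["url"] not in seen_urls:
            seen_urls.add(item["url"])
            merged.append(item)
    return merged
```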
If we wanted to crawl and index documents ourselves, we would need to estimate the potential AWS costs of that work and review it before proceeding. We are not likely to get approved for any non-trivial AWS cost increases.
What we don't know:
- When search.gov tries to index Medicaid.gov documents, why does Medicaid.gov often return a 403 error, even for documents available to the public, which prevents indexing of those documents?
  - We talked to the Medicaid.gov team to ensure they didn't have any issues with this experiment, and we tried a delete-and-reindex. This was apparently a temporary error and has not reappeared.
- Why does search.gov not index many of the pages in the RSS feed that we send them, especially in the second half of the feed? (A sketch of such a feed follows this list.)
  - It turned out they had a built-in limit of 1000 items. They increased our limit to 3000.
- If we get the comprehensiveness issues fixed, would the Search.gov method become sufficient?
  - We'll have to try this and see, but it'll probably still have relevancy downsides compared to our metadata search.
- How long will it take for search.gov to improve relevancy in the feature that we're using?
  - Not this quarter, maybe next quarter, but they don't know for sure.
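For context on the feed mentioned above, here is a hedged sketch of how a URL feed for Search.gov's crawler could be built with Django's syndication framework. `Resource`, its fields, and the import path are hypothetical names, not our actual models:

```python
# Hypothetical sketch of an RSS feed of resource URLs for Search.gov to crawl,
# using Django's syndication framework. Model and field names are assumptions.
from django.contrib.syndication.views import Feed

from resources.models import Resource  # hypothetical import path

class ResourceUrlFeed(Feed):
    title = "Supplemental resource links"
    link = "/resources/feed/"
    description = "URLs for Search.gov to crawl and index."

    def items(self):
        # Search.gov raised our item limit from 1000 to 3000.
        return Resource.objects.order_by("-updated_at")[:3000]

    def item_title(self, item):
        return item.name

    def item_description(self, item):
        return item.description

    def item_link(self, item):
        # Link directly to the externally hosted document.
        return item.url
```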
The goal of a hybrid solution would be to help eRegs meet most user needs for the next few months, until search.gov improves its relevance. We only have a prototype implementation right now; we haven't determined what a production implementation would look like.
- Could we create clear requirements and architectural design for this feature?
- Could we create a simple design that minimizes confusing edge cases, including in pagination of results?
- Could we maintain our coding quality standards?
- What would be the maintenance and technical debt implications of a hybrid option?
- Would a hybrid implementation make it harder for us to apply routine updates to Postgres, Django, Vue, or any of our other components?
- If a hybrid implementation didn't end up working well for us, how feasible would it be to remove the search.gov method and revert back to our metadata search?
- If search.gov improves its relevance sufficiently, how feasible would it be to remove the metadata method while leaving in place the search.gov method?
- Would it be buggy and hard to debug?
- Would it be hard to learn for new developers?
- What would be the efforts, time, costs, and opportunity costs involved in an alternate option?
  - We have a hypothesis that we could crawl our 2600+ URLs, scrape the text (from HTML, PDFs, Word docs, Excel sheets, etc.), store the text, and use Postgres full-text search on it, in a relatively low-cost way (see the sketch after this list).
  - There are other open source search solutions that we could run, such as Elasticsearch with Haystack for Django.
    - AWS has a managed service derived from Elasticsearch (Amazon OpenSearch Service).
- What would be the maintenance and technical debt implications of an alternate method?
  - What kinds of complexity would it add to our system and our technical work?
  - Would it require specialized expertise or onboarding for future new developers?
  - Could it result in more MACPRO ATO tickets for us?
- If search.gov improves its relevance and delivers better results, how feasible would it be to remove this component and use search.gov instead?
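To make the crawl-and-scrape hypothesis above concrete, here is a rough sketch of the text-extraction step (HTML only; PDFs, Word docs, and Excel sheets would need dedicated extractors, and everything here is an assumption rather than a built component):

```python
# Hypothetical sketch of the crawl-and-index idea: fetch a resource URL and
# extract plain text for storage and Postgres full-text indexing.
import requests
from bs4 import BeautifulSoup

def scrape_text(url, timeout=30):
    """Fetch a URL and return its visible text (HTML documents only)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script and style tags so only human-readable text is kept.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```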
How do we best meet our user needs?