[DRAFT] Quality assurance strategy for 3rd party API calls #224

Closed
oshoma opened this issue Nov 14, 2023 · 4 comments
Labels
draft-proposal A draft proposal for how to manage this repository

Comments


oshoma commented Nov 14, 2023

PROBLEM

Some of our tests call 3rd party APIs such as Serper (Google Search). This can give us non-deterministic results, slow tests, and tests that cost real money. It could also result in pushback from service providers due to bursty API call volumes. See background discussion.

SOLUTION

We need to define how we will test the quality of code that depends on 3rd party APIs (typically accessed via client libraries). Here's a draft proposal for comment:

Goals

  1. When run in its default mode, `make test` will operate offline. Tests will not perform network calls, including calls to 3rd party APIs. Furthermore, the tests will be crafted to be deterministic. This avoids the problems described above, and is a necessary stepping stone towards automatic testing on developer desktops and in GitHub Actions.
  2. A second, network-connected mode of `make test` will perform calls to 3rd party APIs. The purpose of this mode is to run end-to-end integration tests that demonstrate our system works in the real world. Since these tests are expected to be relatively slow and will cost real money, they should only be run occasionally, e.g. when we are preparing to deploy into production.

Developer Guidelines

  1. The job of our tests is to ensure our own code is high quality. We are not testing whether a 3rd party API works... that's the job of the 3rd party.
  2. Create stubs to mimic the results of 3rd party API calls.
  3. Occasionally refresh the stub values with new responses from the 3rd party service.
  4. Always use stubbed values in tests of code that relies on 3rd party APIs.
  5. Block network calls during default test runs. To achieve this we can integrate pytest-socket into the top level of our test configuration (see the sketch after this list).
  6. Create some integration tests that call 3rd party services, to be run selectively. These tests will run the real code (no stubs or mocks), resulting in actual network calls to 3rd party APIs. Design these tests so that they succeed when the 3rd party API returns successfully, and fail otherwise. Design these tests so that they do not issue a sustained high velocity burst of 3rd party API calls.
  7. Mark the tests that call 3rd party APIs, e.g. `@pytest.mark.external_api`. Modify pyproject.toml so these tests do not run by default; instead they can be run selectively by passing a flag to pytest, e.g. `pytest -v -m external_api` (see the sketch after this list).
  8. Run the `external_api` integration tests before deploying new code to production. Keep in mind that these tests can fail for unexpected reasons, including the 3rd party vendor being unavailable or crashing at the moment the tests run.
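
A minimal sketch of how guidelines 5 and 7 could work together, assuming pytest-socket is installed; the `external_api` marker name follows the rename in the commit below, and the conftest.py layout is illustrative:

```python
# conftest.py -- illustrative sketch; assumes pytest-socket is installed.
from pytest_socket import disable_socket, enable_socket


def pytest_runtest_setup(item):
    # Default runs are offline (guideline 5): block all socket use,
    # except for tests explicitly marked as calling external APIs.
    if "external_api" in item.keywords:
        enable_socket()
    else:
        disable_socket()
```

To exclude the marked tests by default (guideline 7), `[tool.pytest.ini_options]` in pyproject.toml can register the marker with `markers = ["external_api: calls real 3rd party APIs"]` and set `addopts = "-m 'not external_api'"`; running `pytest -v -m external_api` then opts back in.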

Implementation Thoughts

Here are additional thoughts from 2023-11-16 discussion with @20001LastOrder about how we might implement this:

  1. Set LLM temperature to 0 in tests
  2. Always use the same LLM when testing
  3. Use a special logger to capture all LLM calls and responses
    a) send to normal console output
    b) send to fixture files, to be used for tests
    ...based on a command line option
  4. Offline test mode (the default)
    • Use LangChain's Fake LLM to simulate LLM responses (see the sketch after this list)
    • load canned LLM responses from fixture files, previously captured from real LLM usage
    • tests should run quickly, e.g. < 15 seconds for the entire test suite
    • test the behavioral logic in our own code, which interacts with LLMs
    • test expectations about how many times our code calls LLMs, and with what parameters
  5. Online (network-connected) mode
    • tests call the real 3rd party APIs
    • we instantiate a real LLM when testing
    • test suite will run slowly
    • test suite will cost real money to run
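
As a concrete sketch of the offline mode, assuming a LangChain version that provides `FakeListLLM` and a hypothetical fixture file of previously captured responses (the path and test names are illustrative):

```python
# Offline-mode sketch. Assumes `from langchain.llms.fake import FakeListLLM`
# resolves (true for LangChain versions circa late 2023); the fixture path
# below is hypothetical.
import json
from pathlib import Path

from langchain.llms.fake import FakeListLLM

FIXTURE_PATH = Path("tests/fixtures/llm_responses.json")  # hypothetical


def load_canned_responses() -> list[str]:
    # Canned responses previously captured from real LLM usage (see point 3).
    return json.loads(FIXTURE_PATH.read_text())


def test_llm_logic_is_deterministic_offline():
    responses = load_canned_responses()
    llm = FakeListLLM(responses=responses)
    # FakeListLLM replays the canned responses in order, so the run is fast,
    # fully deterministic, and makes no network calls.
    assert llm.invoke("summarize the search results") == responses[0]
```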
@oshoma oshoma added the draft-proposal A draft proposal for how to manage this repository label Nov 14, 2023
@oshoma oshoma changed the title [PROPOSAL] Quality assurance strategy for 3rd party API calls [DRAFT] Quality assurance strategy for 3rd party API calls Nov 14, 2023
oshoma added a commit that referenced this issue Nov 16, 2023
Improve support for command line options

- Use ArgumentParser to make parsing of command line options more robust
- Add support for `gsite` option to scope search to a particular URL
- Rename `real` tests to `external_api` to indicate tests that call 3rd party APIs (see #224 for more info)
- Fix some broken tests

There are 7 failing tests. This is a known issue and will be fixed separately, as the test breaks were introduced by other commits.
Fixes #215

oshoma commented Nov 17, 2023

@20001LastOrder I added thoughts from our discussion today about making tests robust

@20001LastOrder commented

To move this forward, here are my recent thoughts:

  1. The content of external API calls is not really important when testing offline, unless of course the test is about features that use that API call directly. The content that ultimately matters is generated by the LLM, and LLM calls are already handled with caching.
  2. Based on the above observation, I think
    • If the API call is only incidental to the integration test (e.g. a QA agent using Google Search), we can mock it with anything that follows the same format as the API's response.
    • If the API call is an essential part of the integration test, we should mock it with meaningful content following the format of the API. (For example, testing number validation with Google Search should be mocked with sample Google Search results containing numbers.) A sketch follows below.
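
A self-contained sketch of the second case, with a canned response loosely shaped like Serper's JSON (`organic` results with snippets); the validator and all names here are illustrative, not the project's actual code:

```python
# Sketch: the search content matters, so the mock returns meaningful,
# format-faithful data rather than arbitrary placeholders.

SAMPLE_SERPER_RESPONSE = {
    "organic": [
        {
            "title": "World population 2023",
            "snippet": "The world population reached 8.0 billion in 2023.",
            "link": "https://example.com/population",
        }
    ]
}


def fake_google_search(query: str) -> dict:
    # Stand-in for the real Serper call; never touches the network.
    return SAMPLE_SERPER_RESPONSE


def contains_number(search_fn, query: str) -> bool:
    # Toy "number validation with Google Search": the logic under test is
    # ours, while the search results are canned.
    snippets = (result["snippet"] for result in search_fn(query)["organic"])
    return any(any(ch.isdigit() for ch in s) for s in snippets)


def test_number_validation_uses_canned_results():
    assert contains_number(fake_google_search, "world population")
```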


oshoma commented Feb 9, 2024 via email

@20001LastOrder commented

Closed by #309 and #202
