
External ID table cleanup #631

Open
Mr0grog opened this issue Apr 16, 2022 · 1 comment
Mr0grog commented Apr 16, 2022

We recently ran out of ID space on the external_ids table and solved the immediate problem by updating the id column’s type from integer to bigint, which won’t run out for ~4.3 billion years at the current usage level. However! It might be nice to solve the underlying issue and do some other cleanup on this table.

Some things that might be worth addressing here:

  • The cause of the above problem: we keep the table up to date with INSERT ... ON CONFLICT (provider_location_id, system, value) DO ..., sending every (provider_location_id, system, value) tuple we have for the location (see addExternalIds() in db.ts). That’s convenient: a quick, atomic update handled entirely in the database. However, it turns out this causes Postgres to increment the sequence backing the id column on every attempted insert, not just the ones that actually write new rows. It would be nice to avoid that by doing one of the following:

    1. Dropping the id column and using (provider_location_id, system, value) as a compound primary key. Since Support multiple external IDs with the same system #206, we never change or update rows in this table (only add or delete them), so we don’t get a whole lot out of having a separate id.

    2. Using a non-atomic query-and-insert SQL statement instead of the atomic INSERT ... ON CONFLICT ... statement. That is, use an INSERT ... SELECT ... WHERE NOT EXISTS (...) pattern:

      INSERT INTO "external_ids" ("provider_location_id", "system", "value")
          SELECT 'c2a16e1d-3a7d-461e-9a42-7a4f1e3a183c', 'cvs', '7156'
          WHERE NOT EXISTS (
              SELECT 1 FROM external_ids
              WHERE provider_location_id = 'c2a16e1d-3a7d-461e-9a42-7a4f1e3a183c'
                  AND system = 'cvs'
                  AND value = '7156'
          );

      This would be slightly slower to execute, makes for slightly messier code, requires a separate statement for each tuple (with the current approach we can do the whole batch in a single statement), and makes it possible for the statement to fail if two inserts happen in parallel (extremely unlikely in this particular system, but possible). On the other hand, it wouldn’t touch the id sequence unless we’re almost certain to actually insert new data.

    3. Compare the current IDs and the IDs we’re inserting in JS land and only try to insert the ones we don’t already have. That is, option 2 above but in JS instead of SQL.
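The diffing in option 3 is simple enough to sketch as a pure function (the names here are hypothetical, not from db.ts): fetch the tuples already in the table, then return only the candidate tuples that aren’t present yet.

```typescript
// Sketch of option 3 (hypothetical names; the real logic would live in
// addExternalIds() in db.ts). Compare existing tuples against desired ones
// in JS and return only the tuples that still need an INSERT.
type ExternalIdTuple = [providerLocationId: string, system: string, value: string];

function tuplesToInsert(
  existing: ExternalIdTuple[],
  desired: ExternalIdTuple[]
): ExternalIdTuple[] {
  // Key on a separator that can't appear in the data. (system can't contain
  // ":" per db.ts, but value is unconstrained, so use an unprintable byte.)
  const key = (t: ExternalIdTuple) => t.join("\x1f");
  const seen = new Set(existing.map(key));
  return desired.filter((t) => !seen.has(key(t)));
}
```

Only the returned tuples would then go through the existing batch insert, so the id sequence is touched only when a row is almost certainly new. Like option 2, this is not atomic: a concurrent insert between the read and the write can still conflict.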

  • We currently do INSERT ... ON CONFLICT DO UPDATE ..., but we only “update” the conflicting rows to the values they already have, which is pointless. We should do INSERT ... ON CONFLICT DO NOTHING instead (in Knex terms, .onConflict(...).ignore() rather than .merge()). This is probably an accidental holdover from before Support multiple external IDs with the same system #206, when we could only have one value per (provider_location_id, system) pair and so sometimes changed external IDs.

    univaf/server/src/db.ts

    Lines 132 to 149 in a95243a

    await dbConn("external_ids")
      .insert(
        toInsert.map(([system, value]: [string, string]) => {
          if (system.includes(":")) {
            throw new ValueError(
              `Provider location ${id} externalIds include ${system} \
    but ':' is not allowed`.replace(/\n\s*/g, " ")
            );
          }
          return {
            provider_location_id: id,
            system,
            value,
          };
        })
      )
      .onConflict(["provider_location_id", "system", "value"])
      .merge();

  • Drop the updated_at column, since we never do updates at this point (see above). We don’t even have lingering data from when we once did:

    > select count(*) from external_ids where created_at != updated_at;
     count
    -------
         0
  • Change the system and value columns from text to varchar(125). Our current longest values for these two are 43 and 102 characters, respectively, and they should probably have some reasonable size cap to keep them from getting out of hand (it doesn’t need to be 125; we just get slightly more efficient storage at that size or below). This probably matters more if we drop the id column and make these part of the primary key.

  • Make provider_location_id, system, and value not null. I’m not sure why they are currently nullable. 🤷
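Taken together, the schema changes above might look something like the following Postgres migration sketch. This is untested and makes two assumptions not stated in the issue: that nothing (foreign keys, views) depends on external_ids.id, and that the existing unique constraint backing the ON CONFLICT target is superseded by the new primary key.

```sql
-- Sketch only: consolidates the cleanup ideas above. Untested; assumes no
-- foreign keys or views depend on external_ids.id.
BEGIN;

ALTER TABLE external_ids
    DROP COLUMN updated_at,
    ALTER COLUMN provider_location_id SET NOT NULL,
    ALTER COLUMN system SET NOT NULL,
    ALTER COLUMN value SET NOT NULL,
    ALTER COLUMN system TYPE varchar(125),
    ALTER COLUMN value TYPE varchar(125);

-- Dropping id also drops the primary key constraint that depends on it;
-- the tuple then becomes the new primary key.
ALTER TABLE external_ids DROP COLUMN id;
ALTER TABLE external_ids
    ADD PRIMARY KEY (provider_location_id, system, value);

COMMIT;
```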

Mr0grog added the api label Apr 16, 2022
Mr0grog commented Jun 19, 2023

This project has been shut down. This issue has been left open as a guide for anyone forking this project — addressing this issue (or at least knowing about it!) is likely to be worthwhile for you if you are maintaining a running copy or fork of UNIVAF.
