
External ID table cleanup #631

Open
Mr0grog opened this issue Apr 16, 2022 · 1 comment
Mr0grog commented Apr 16, 2022

We recently ran out of ID space on the external_ids table and solved the immediate problem by updating the id column’s type from integer to bigint, which won’t run out for ~4.3 billion years at the current usage level. However! It might be nice to solve the underlying issue and do some other cleanup on this table.

Some things that might be worth addressing here:

  • The cause of the above problem: we keep the table up to date with INSERT ... ON CONFLICT (provider_location_id, system, value) DO ..., sending every (provider_location_id, system, value) tuple we have for the location (see addExternalIds() in db.ts). That’s convenient: a quick, atomic update handled entirely in the database. However, it turns out this causes Postgres to increment the sequence backing the id column on every attempted insert, not just the ones that actually write new rows. It would be nice to avoid that by doing one of the following:

    1. Dropping the id column and using (provider_location_id, system, value) as a compound primary key. Since Support multiple external IDs with the same system #206, we never change or update rows in this table (only add or delete them), so we don’t get a whole lot out of having a separate id.

    2. Using a non-atomic query-and-insert SQL statement instead of the atomic INSERT ... ON CONFLICT ... statement. That is, use an INSERT ... SELECT ... WHERE NOT EXISTS (...) pattern:

      INSERT INTO "external_ids" ("provider_location_id", "system", "value")
          SELECT 'c2a16e1d-3a7d-461e-9a42-7a4f1e3a183c', 'cvs', '7156'
          WHERE NOT EXISTS (
              SELECT 1 FROM external_ids
              WHERE provider_location_id = 'c2a16e1d-3a7d-461e-9a42-7a4f1e3a183c'
                  AND system = 'cvs'
                  AND value = '7156'
          );

      This would be slightly slower to execute, makes for slightly messier code, requires a separate statement for each tuple (with the current approach we can do the whole batch in a single statement), and makes it possible for the statement to fail if two inserts happen in parallel (extremely unlikely in this particular system, but possible). On the other hand, it wouldn’t touch the id sequence unless we’re almost certain to actually insert new data.

    3. Compare the current IDs and the IDs we’re inserting in JS land and only try to insert the ones we don’t already have. That is, option 2 above but in JS instead of SQL.
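The diffing in option 3 is simple enough to sketch as a pure function (the names here are hypothetical, not from db.ts): fetch the tuples already in the table, then return only the candidate tuples that aren’t present yet.

```typescript
// Sketch of option 3 (hypothetical names; the real logic would live in
// addExternalIds() in db.ts). Compare existing tuples against desired ones
// in JS and return only the tuples that still need an INSERT.
type ExternalIdTuple = [providerLocationId: string, system: string, value: string];

function tuplesToInsert(
  existing: ExternalIdTuple[],
  desired: ExternalIdTuple[]
): ExternalIdTuple[] {
  // Key on a separator that can't appear in the data. (system can't contain
  // ":" per db.ts, but value is unconstrained, so use an unprintable byte.)
  const key = (t: ExternalIdTuple) => t.join("\x1f");
  const seen = new Set(existing.map(key));
  return desired.filter((t) => !seen.has(key(t)));
}
```

Only the returned tuples would then go through the existing batch insert, so the id sequence is touched only when a row is almost certainly new. Like option 2, this is not atomic: a concurrent insert between the read and the write can still conflict.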

  • We currently do INSERT ... ON CONFLICT DO UPDATE ..., but we only “update” the conflicting rows to the values they already have, which is pointless. We should do INSERT ... ON CONFLICT DO NOTHING instead (in Knex terms, .onConflict(...).ignore() rather than .merge()). This is probably an accidental holdover from before Support multiple external IDs with the same system #206, when we could only have one value per (provider_location_id, system) pair and so sometimes changed external IDs.

    univaf/server/src/db.ts

    Lines 132 to 149 in a95243a

    await dbConn("external_ids")
      .insert(
        toInsert.map(([system, value]: [string, string]) => {
          if (system.includes(":")) {
            throw new ValueError(
              `Provider location ${id} externalIds include ${system} \
    but ':' is not allowed`.replace(/\n\s*/g, " ")
            );
          }
          return {
            provider_location_id: id,
            system,
            value,
          };
        })
      )
      .onConflict(["provider_location_id", "system", "value"])
      .merge();

  • Drop the updated_at column, since we never do updates at this point (see above). We don’t even have lingering data from when we once did:

    > select count(*) from external_ids where created_at != updated_at;
     count
    -------
         0
  • Change the system and value columns from text to varchar(125). Our current longest values for these two are 43 and 102 characters, respectively, and they should probably have some reasonable size cap to keep them from getting out of hand (it doesn’t need to be 125; we just get slightly more efficient storage at that size or below). This probably matters more if we drop the id column and make these part of the primary key.

  • Make provider_location_id, system, and value not null. I’m not sure why they are currently nullable. 🤷
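Taken together, the schema changes above might look something like the following Postgres migration sketch. This is untested and makes two assumptions not stated in the issue: that nothing (foreign keys, views) depends on external_ids.id, and that the existing unique constraint backing the ON CONFLICT target is superseded by the new primary key.

```sql
-- Sketch only: consolidates the cleanup ideas above. Untested; assumes no
-- foreign keys or views depend on external_ids.id.
BEGIN;

ALTER TABLE external_ids
    DROP COLUMN updated_at,
    ALTER COLUMN provider_location_id SET NOT NULL,
    ALTER COLUMN system SET NOT NULL,
    ALTER COLUMN value SET NOT NULL,
    ALTER COLUMN system TYPE varchar(125),
    ALTER COLUMN value TYPE varchar(125);

-- Dropping id also drops the primary key constraint that depends on it;
-- the tuple then becomes the new primary key.
ALTER TABLE external_ids DROP COLUMN id;
ALTER TABLE external_ids
    ADD PRIMARY KEY (provider_location_id, system, value);

COMMIT;
```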

Mr0grog added the api label Apr 16, 2022
Mr0grog commented Jun 19, 2023

This project has been shut down. This issue has been left open as a guide for anyone forking this project — addressing this issue (or at least knowing about it!) is likely to be worthwhile for you if you are maintaining a running copy or fork of UNIVAF.
