Skip to content

Conversation

@Jongmassey
Copy link
Contributor

@Jongmassey Jongmassey commented Nov 27, 2025

A Hierarchy contains a dict which contains collections of codes stored as sets.

We cache the hierarchy as a json string for performance reasons.

This cached version is deserialised and used for comparisons to ascertain whether the CodelistVersion it belongs to is compatible with various releases of the underlying coding system.
The standard json deserialisation uses lists not sets.
This makes the comparison of these collections of codes order-sensitive, which is not required or appropriate.

This change provides a custom deserialiser for cached hierarchies which uses sets in place of lists such that these comparisons are order-insensitive.

Fixes #2912

@Jongmassey Jongmassey force-pushed the Jongmassey/fix-version-compat branch 2 times, most recently from 84855de to 495c5b6 Compare November 27, 2025 15:24
Hierarchy uses sets to contain collections of codes.
When serialised then deserialised from json these become lists.
When comparing deserialised cached hierarchies either with other
deserialised cached hierarchies or freshly-constructed hierarchies the
order of these lists is not deterministic.

Therefore by deserialising as sets to match the original types we take
away concerns of order of these collections.
@Jongmassey Jongmassey force-pushed the Jongmassey/fix-version-compat branch from 495c5b6 to 870f492 Compare November 27, 2025 15:42
@evansd
Copy link
Contributor

evansd commented Nov 27, 2025

Arghh, such a subtle nasty bug! Nice work in tracking this down.

This is just a drive-by comment to say that an alternative fix would be to use pickle instead of JSON for this which will naturally preserve the native Python types. There are downsides to that of course in that it's not human readable and you cant' easily work with it outside of Python. But depending on exactly how these cached values are used it might be an option.

You could do an incremental migration here by checking the first character of the serialised data for { and using json or pickle to deserialize accordingly. And then update all the cached values to use pickle, and then remove the backwards compatibility code.

That's obviously more work than the approach you've taken here, but possibly solves the issue in a more general way.

Anyway, this was just to flag that as an option for consideration, not a recommendation.

Copy link
Contributor

@lucyb lucyb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a reasonable solution. I'm still learning about this area of the codebase though and I'm wondering how Hierarchy.from_cache and Hierarchy.data_for_cache fits into this? The former uses json.loads(data), should that be updated? The latter still uses lists, would it make sense to update that to use sets (it'll be clearer even if it works okay)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deserialise cached hierarchies with list not set to be order-invariant

4 participants