Add sorted_sequences as recommended non-inherent attribute #71

Closed
nsheff opened this issue Mar 5, 2024 · 4 comments
Labels
enhancement New feature or request schema-term Proposals for terms in the core schema

Comments

@nsheff
Member

nsheff commented Mar 5, 2024

Some feedback from the PRC was that we could think about another RECOMMENDED non-inherent attribute to live alongside sorted_name_length_pairs, that would be a digest for the sequences that does not respect order. So, something like: sorted_sequences.

This digest would allow you to easily assess order-invariant equivalence of sequences without having to use the comparison function, which would be useful for some use cases.

@nsheff nsheff added the enhancement New feature or request label Mar 5, 2024
@nsheff nsheff changed the title from "Add unordered seqcol as non-inherent attribute" to "Add sorted_sequences as recommended non-inherent attribute" Mar 5, 2024
@nsheff
Member Author

nsheff commented Mar 6, 2024

Here's some proposed text to add to the spec:


3.3 The sorted_sequences attribute (RECOMMENDED)

The sorted_sequences attribute is a non-inherent attribute of a sequence collection, with a formal definition.
We RECOMMEND all implementations provide this attribute.
When digested, this attribute provides a digest representing an order-invariant set of unnamed sequences.
It provides a way to check whether two sequence collections have identical sequence content, regardless of the order of the sequences.
Such a comparison can, of course, be made by the comparison function, so why do we recommend this attribute be included as well?
Simply that for some large-scale use cases, comparing the sequence content without considering order is something that needs to be done for
In these cases, using the comparison function could be computationally prohibitive. Because this digest is pre-computed, the comparison reduces to checking whether two digest strings are equal.
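
To illustrate the large-scale case: if every collection already stores its sorted_sequences digest, finding collections with the same sequence content becomes a grouping by string equality, with no pairwise calls to the comparison function. A minimal sketch, assuming hypothetical collection identifiers and placeholder digest strings (the function name `group_by_digest` is also made up for illustration, not part of the spec):

```python
from collections import defaultdict


def group_by_digest(stored_digests):
    """Group collection ids that share the same pre-computed digest.

    stored_digests maps a (hypothetical) collection id to its stored
    sorted_sequences digest string.
    """
    groups = defaultdict(list)
    for collection_id, digest in stored_digests.items():
        groups[digest].append(collection_id)
    # Keep only digests shared by more than one collection.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Placeholder values for illustration only; real digests would come from
# an implementation that computes sorted_sequences.
stored = {
    "collA": "digest-1",
    "collB": "digest-1",
    "collC": "digest-2",
}
matches = group_by_digest(stored)
```

This is a single pass over the stored digests, whereas running the comparison function on every pair of collections grows quadratically with the number of collections.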

Algorithm:

  1. Take the sequences attribute and canonicalize the JSON (using RFC-8785).
  2. Sort the resulting digests lexicographically.
  3. Add to the sequence collection object as the sorted_sequences attribute, non-inherent and non-collated.
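
The steps above could be sketched roughly as follows. This is one reading of the proposed text, not a reference implementation, and it rests on several assumptions: the sequences attribute is taken to be an array of sequence digest strings; the array is sorted before it is canonicalized, so that the canonical form reflects the sorted order (the list above names canonicalization first); the final digest uses a GA4GH-style sha512t24u (SHA-512 truncated to 24 bytes, base64url-encoded), which this thread does not state; and `json.dumps` with compact separators stands in for RFC-8785, which only matches for simple inputs such as arrays of ASCII strings. The function names are hypothetical.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    # Assumed digest: SHA-512, first 24 bytes, base64url-encoded.
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def canonical_json(obj) -> bytes:
    # Stand-in for RFC-8785 canonicalization. Adequate for an array of
    # ASCII strings; a full JCS library would be needed in general.
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False).encode("utf-8")


def sorted_sequences_digest(sequence_digests):
    # Sort the sequence digests lexicographically so order no longer matters.
    ordered = sorted(sequence_digests)
    # Canonicalize the sorted array and digest the resulting bytes.
    return sha512t24u(canonical_json(ordered))


# Two collections with the same sequences in a different order
# (placeholder digest strings, for illustration only).
first = sorted_sequences_digest(["SQ.bbb", "SQ.aaa", "SQ.ccc"])
second = sorted_sequences_digest(["SQ.ccc", "SQ.bbb", "SQ.aaa"])
same_content = first == second
```

Because the input is sorted before digesting, `same_content` is true whenever the two collections contain the same set of sequence digests, whatever their original order.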

@tcezard tcezard added the schema-term Proposals for terms in the core schema label Mar 6, 2024
@nsheff
Member Author

nsheff commented May 15, 2024

What was the decision on this? Add to the spec?

@nsheff
Member Author

nsheff commented May 16, 2024

Our decision was to make this an OPTIONAL attribute and, for now, to include it in the spec.

In the future if the number of proposed ancillary attributes grows, it could move to a separate document together with other ideas for ancillary attributes.

@nsheff
Member Author

nsheff commented May 17, 2024

ADR added, added to spec.

@nsheff nsheff closed this as completed May 17, 2024