Add sorted_sequences as recommended non-inherent attribute #71

nsheff · 2024-03-05T17:16:45Z

Some feedback from the PRC was that we could think about another RECOMMENDED non-inherent attribute to live alongside sorted_name_length_pairs, that would be a digest for the sequences that does not respect order. So, something like: sorted_sequences.

This digest would allow you to easily assess order-invariant equivalence of sequences without having to use the comparison function, which would be useful for some use cases.


nsheff · 2024-03-06T14:46:52Z

Here's some proposed text to add to the spec:

3.3 The `sorted_sequences` attribute (`RECOMMENDED`)

The sorted_sequences attribute is a non-inherent attribute of a sequence collection, with a formal definition.
We RECOMMEND all implementations provide this attribute.
When digested, this attribute provides a digest representing an order-invariant set of unnamed sequences.
It provides a way to compare two sequence collections to see whether their sequence content is identical, even if the sequences appear in a different order.
Such a comparison can, of course, be made by the comparison function, so why do we recommend this attribute be included as well?
Simply that for some large-scale use cases, comparing the sequence content without considering order is something that needs to be done for
In these cases, using the comparison function could be computationally prohibitive. This digest allows the order-invariant representation to be pre-computed, so that two collections can be checked with a simple digest comparison.

Algorithm:

Take the sequences attribute and canonicalize the JSON (using RFC-8785).
Sort the resulting digests lexicographically.
Add to the sequence collection object as the sorted_sequences attribute, non-inherent and non-collated.
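
To make the steps above concrete, here is a minimal sketch in Python. It is one possible reading of the algorithm, not a reference implementation: it sorts the array of sequence digests and then serializes it compactly before digesting. Several things here are assumptions: the function names `sha512t24u` and `sorted_sequences_digest` are hypothetical, the digest function (SHA-512 truncated to 24 bytes, base64url-encoded) is not stated in the text above, and `json.dumps` with compact separators stands in for RFC-8785, which gives the same output only for arrays of plain ASCII strings like these. The input digests are made-up placeholders.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """Assumed digest function: SHA-512, first 24 bytes, base64url-encoded."""
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def sorted_sequences_digest(sequences: list[str]) -> str:
    """Digest the sequences array in a way that ignores its order."""
    # Sort the sequence digests lexicographically.
    ordered = sorted(sequences)
    # Compact JSON; for an array of plain ASCII strings this matches
    # what RFC-8785 canonicalization would produce (assumption noted above).
    canonical = json.dumps(ordered, separators=(",", ":"), ensure_ascii=False)
    return sha512t24u(canonical.encode("utf-8"))


# Placeholder sequence digests, same content in two different orders.
collection_a = ["SQ.bbb", "SQ.aaa", "SQ.ccc"]
collection_b = ["SQ.ccc", "SQ.bbb", "SQ.aaa"]

# Identical sequence content gives the same digest regardless of order.
assert sorted_sequences_digest(collection_a) == sorted_sequences_digest(collection_b)
```

With a digest like this stored on each collection, checking order-invariant equivalence becomes a string comparison rather than a call to the comparison function.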

nsheff · 2024-05-15T12:38:25Z

What was the decision on this? Add to the spec?

nsheff · 2024-05-16T23:42:38Z

Our decision on this was to make this an OPTIONAL attribute and, for now, include it in the spec.

In the future if the number of proposed ancillary attributes grows, it could move to a separate document together with other ideas for ancillary attributes.

nsheff · 2024-05-17T01:20:02Z

ADR added, added to spec.

nsheff added the enhancement New feature or request label Mar 5, 2024

nsheff changed the title ~~Add unordered seqcol as non-inherent attribute~~ Add sorted_sequences as recommended non-inherent attribute Mar 5, 2024

tcezard added the schema-term Proposals for terms in the core schema label Mar 6, 2024

nsheff mentioned this issue May 16, 2024

Use case: a digest for a collection of sequences #76

Open

nsheff closed this as completed May 17, 2024
