Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ordered Maps: maps that retain insertion order #1651

Open
ChristianGruen opened this issue Dec 10, 2024 · 12 comments
Open

Ordered Maps: maps that retain insertion order #1651

ChristianGruen opened this issue Dec 10, 2024 · 12 comments
Labels
Feature A change that introduces a new feature XDM An issue related to the XPath Data Model XPath An issue related to XPath XQFO An issue related to Functions and Operators

Comments

@ChristianGruen
Copy link
Contributor

Currently, XDM maps are “unordered”: An implementation is allowed to organize entries in a way that optimizes lookup, not order. The entries do not have a predictable order unless they are explicitly sorted.

There are cases in which it is helpful if the “insertion order” is preserved – i.e., the order in which new map entries are added to a map. While the insertion order is not relevant if a map is exclusively used for lookups, it may be beneficial if the input includes deliberately sorted key/value pairs, such as (often) in JSON data, configurations or key/value sequences.

I created this issue because there was some confusion in #564, and on Slack, about this map flavor and “sorted maps”, which are discussed in issue #564: Sorted maps hold all map entries sorted by the key, using a comparator or (in its basic variant) fn:data#1.

PR #1609 attempts to solve both requirements at once.

@ChristianGruen ChristianGruen added XPath An issue related to XPath XQFO An issue related to Functions and Operators XDM An issue related to the XPath Data Model Feature A change that introduces a new feature labels Dec 10, 2024
@ChristianGruen
Copy link
Contributor Author

There are many ways in which maps are used in practice, but I think there we can define two main categories of use:

  1. Records: few entries, with a focus on organizing data (the map constructor map { } is often used for those cases).
  2. Dictionaries: many entries, with a focus on lookup performance (generated with map:merge and map:build).

For 1., ordering is helpful and sometimes essential. For 2., it does not matter.

I would like to warm up my idea to retain the order in some cases, but to drop it (by default) when updates are performed. In particular, I think we should:

  1. Make the order mandatory for the map constructor: It is often used to create maps with just few entries. In addition, its syntax does not allow us to supply arguments that might control the map generation.
  2. Make the order mandatory for records: It is confusing if the contents of a record are returned in a way that differs from the order of a record declaration.
  3. Provide options to map:build and map:merge (and possibly to update operations like map:put, map:remove) to retain the order: map:build(1 to 5, options := { 'ordered': true() }. In all cases, we can expect map:put to retain the order if the value of an existing entry is changed.

If we do so, it will be up to processors to provide support for an implementation of an ordered map that supports updates, or to use a simple map that transfers the entries to an unordered map once updates are performed.

Finally, it is important to remember that in many cases maps are never updated after the creation – which means there is no need to store such entries in an immutable implementation.

@michaelhkay
Copy link
Contributor

During discussion, @dnovatchev raised the question of defining ordered maps as a subtype of maps rather than as a property associated with the map instance. I'd like to consider the advantages and disadvantages of both approaches.

The most obvious benefit of using a subtype is that it enables functions to declare that they require an ordered map as the supplied argument, and to fail with a type error if supplied with an unordered map. Is that a feature we think we need? At the moment we aren't proposing any system functions or operators that require a map to be ordered.

Using a subtype would also obviate the need for an interrogative function to test whether a map is ordered, since this could be done using the "instance of" mechanism.

The main disadvantage of using subtyping is that it requires new syntax and semantics for a new item type, and new subtyping rules, and therefore more tests. There's potential complexity in deciding whether record types are always ordered, always unordered, or whether they can be either. Adding a property to maps, as currently proposed, is a lot simpler.

@Arithmeticus
Copy link
Contributor

I'm inclined to favor subtyping for ordered maps, based on experience with other PLs. Yes, it will require work to set up in the specs, but once that happens, we have the means to introduce other map subtypes. (In fact, I think once that mechanism is in place, the CG will be more favorable to allowing sorted maps than they currently are.)

Obviously (or maybe not), the map functions should remain in a single namespace. That is, functions that would work only on ordered maps would remain in the map namespace, but the signature would require an error to be raised if non-ordered maps were used as input instead.

@dnovatchev
Copy link
Contributor

dnovatchev commented Dec 11, 2024

There are many ways in which maps are used in practice, but I think there we can define two main categories of use:

  1. Records: few entries, with a focus on organizing data (the map constructor map { } is often used for those cases).
  2. Dictionaries: many entries, with a focus on lookup performance (generated with map:merge and map:build).

For 1., ordering is helpful and sometimes essential

Absolutely not.

A record is the closest thing we have to an interface, the order of attributes/properties is not significant - the same way the order of keys in a map is not significant.

We have had records for more than a year and this is the first and only time someone is saying the order of entries in a record is significant.

@michaelhkay
Copy link
Contributor

@Arithmeticus wrote:

I'm inclined to favor subtyping for ordered maps, based on experience with other PLs.

I think it's dangerous to apply that analogy too closely. The model used in Java and C# is to have a single interface for all maps, supported by different implementation classes. Our types in XDM are much more like interfaces: they define the set of operations available, not the implementation characteristics. This gives us many advantages that we don't want to lose, for example the ability to change the implementation characteristics of an existing value dynamically, behind the scenes, based on observed usage.

@michaelhkay
Copy link
Contributor

@dnovatchev wrote:

We have had records for more than a year and this is the first and only time someone is saying the order of entries in a record is significant.

There is ample user feedback that ordering is important for records - not to make it information-bearing, but for human comprehension. It's primarily important for the same kind of reason that indentation is important: a great strength of XML and JSON over formats such as ASN.1 and protocol buffers is that it is human readable. This is the reason that a lot of JSON-oriented software has moved in the direction of preserving order.

Incidentally, users also frequently ask this for XML attributes. It's not that they want to attach meaning to the order of attributes, it's that predictable order improves readability. If that's true for XML attributes where things are usually all on one line, it's vastly more true for JSON structures where rearranging the fields of a record in the serialized output might move one of the fields 10,000 lines away. Consistent ordering makes it easier to compare two structures, both for humans and for machines.

This whole exercise to define ordering for maps arose, if you recall, in response to feedback from users trying out early implementations.

@dnovatchev
Copy link
Contributor

dnovatchev commented Dec 12, 2024

There is ample user feedback that ordering is important for records - not to make it information-bearing, but for human comprehension.

It is a well-known antipattern to have classes (records) with too-many members (keys).

Users who have this problem are violating this principle and now we are trying to support them in this???

@michaelhkay
Copy link
Contributor

It is a well-known antipattern to have classes (records) with too-many members (keys).

You don't need many fields to encounter this problem. Consider a record representing a telephone bill with five fields: (name, address, phone number, itemised charges, call details). The last two fields might contain large amounts of data. If this is displayed in the order (address, call details, phone number, itemised charges, name), then it's going to be very difficult to find the name and phone number when looking at the data visually.

The user who designed this structure hasn't violated any design principles.

We've been experiencing this problem for years with some of our own data, for example SEF files, and the output of the Java parser that we use in one of our toolchains. It's certainly a problem our users have reported as well, and yes, it would be nice to help them solve it.

For another example, consider if we held the function catalog in JSON rather than XML, and every time we edited a function, the order of fields changed so that the examples might appear randomly before or after the function signature. Editing the catalog would become far more difficult.

@michaelhkay
Copy link
Contributor

@ChristianGruen wrote:

I would like to warm up my idea to retain the order in some cases, but to drop it (by default) when updates are performed. In particular, I think we should:

* Make the order mandatory for the map constructor: It is often used to create maps with just few entries. In addition, its syntax does not allow us to supply arguments that might control the map generation.
* Make the order mandatory for records: It is confusing if the contents of a record are returned in a way that differs from the order of a record declaration.
* Provide options to map:build and map:merge (and possibly to update operations like map:put, map:remove) to retain the order: map:build(1 to 5, options := { 'ordered': true() }. In all cases, we can expect map:put to retain the order if the value of an existing entry is changed.

This is reasonably well aligned with the current PR. There are a couple of differences of detail:

(a) should the result of a map constructor be ordered? I have suggested no, but there are arguments for and against. We could introduce a constructor ordered-map{ .... } so the user has both options. We could also make map{...} deliver an unordered map, and have the bare brace constructor {...} default to ordered. However, there's some merit in the argument that you only get an ordered map if you ask for it, and the default is always unordered.

(b) should map:put() and map:remove() retain order if the input is an ordered map? I think they probably should. But an alternative would be to have a map:put() that destroys order and a map:append() that retains it. From an implementation perspective, if the data is already held in a data structure that maintains an ordering, then keeping the ordering on put() and remove() operations is easier than dropping it.

@ChristianGruen
Copy link
Contributor Author

(a) should the result of a map constructor be ordered?

I think it should be. For users who care about order, it would be the obvious behavior. Others will only care if they get worse performance – which will only be the case for large maps (Category 2, as named above). While it’s possible to write down map constructors with millions of entries in XPath/XQuery code, it will certainly be an edge case (and possibly make parsing the query more expensive than building the map). Using fn:parse-json and fn:json-doc will probably be better options in practice (even for those, I would prefer to have orderedness as default).

We could introduce a constructor ordered-map{ .... } so the user has both options. We could also make map{...} deliver an unordered map, and have the bare brace constructor {...} default to ordered.

It might confront users with performance considerations that they are not aware of. In particular, it gets confusing for users who read existing code. I think the JSON syntax has benefited from being as simple as possible.

(b) should map:put() and map:remove() retain order if the input is an ordered map? I think they probably should. But an alternative would be to have a map:put() that destroys order and a map:append() that retains it.

This distinction would seem very sensible to me.

From an implementation perspective, if the data is already held in a data structure that maintains an ordering, then keeping the ordering on put() and remove() operations is easier than dropping it.

It may depend. For example, a compact and ordered map representation (in terms of memory consumption) could be used at creation time. Such a map could, for example, store all hash values – keys, buckets, etc. – in flat arrays, thus making the storage of explicit references to the next hash entry obsolete. If such a map is updated later on, it could be transformed to an unordered map (based, e.g., on an HAMT). Usually, this needs to be done only once in the lifetime of the map, and the complexity is similar to resize/rehash operations in conventional maps, i.e., an operation that is usually pretty efficient. In other cases, if such an ordered map is never updated, it could possibly consume less memory than an updatable and unordered map.

I had a look at the query sum(map:build(1 to 1000000)?*). It consumes ~70 MB with BaseX and ~86 MB with Saxon, although it contains only 1 million entries. This is certainly an artificial example, but I think that the memory consumption could generally be lowered in both implementations if we take into consideration that many maps will never be updated once they are created.

But more generally, it would give implementers more freedom on how to proceed. It should be perfectly fine to only have one map implementation, and retain the order (or some defined order) in a map if updates are performed.

@michaelhkay
Copy link
Contributor

michaelhkay commented Dec 12, 2024

I had a look at the query sum(map:build(1 to 1000000)?*). It consumes ~70 MB with BaseX and ~86 MB with Saxon, although it contains only 1 million entries. This is certainly an artificial example, but I think that the memory consumption could generally be lowered in both implementations if we take into consideration that many maps will never be updated once they are created.

Using a structure that doesn't require persistent update is one approach to improving things; the other is to avoid some of the overheads caused by type annotations and the equality matching semantics. We're currently using a map implementation that doesn't allow customized equality/hashcode computation except by wrapping each key value in a wrapper that has the required equals/hashCode methods, so there's a lot of waste.

We do a lot better for maps built with parse-json(), because we know all the keys will be xs:string instances and Java has the right equality semantics for that case; we also build a map in that case that isn't updateable without first copying to a different structure. But for map:build we currently construct a map that has overheads to support functionality that you're probably not going to use.

There's immense scope for improvements. For example, as I think you're suggesting, one can even initially build a map that doesn't have any "search by key" functionality, and add a hash index on first use. In many JSON transformations, it's likely that many maps will be copied directly from the input to the output without ever being queried or modified in any way.

@ChristianGruen
Copy link
Contributor Author

There's immense scope for improvements. For example, as I think you're suggesting, one can even initially build a map that doesn't have any "search by key" functionality, and add a hash index on first use. In many JSON transformations, it's likely that many maps will be copied directly from the input to the output without ever being queried or modified in any way.

I had an implementation in mind that stores all data in flat arrays. We use such sets/maps for all kinds of things in BaseX: One array contains all entries in the order in which they have initially been added; the other arrays organize hash table buckets and next-pointers for entries that cause collisions. Our corresponding set implementation for byte arrays consumes 25% less memory than Java’s HashSet. Such a map could also be enhanced to be immutable/persistent. Appends would be cheap, but value replacements and deletions may become expensive.

It’s an interesting idea to only generate the hash index if it’s required. It could work if we know in advance that all keys that are to be added are distinct (which, as you wrote, would be the case for valid JSON input).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature A change that introduces a new feature XDM An issue related to the XPath Data Model XPath An issue related to XPath XQFO An issue related to Functions and Operators
Projects
None yet
Development

No branches or pull requests

4 participants