-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ordered Maps: maps that retain insertion order #1651
Comments
There are many ways in which maps are used in practice, but I think there we can define two main categories of use:
For 1., ordering is helpful and sometimes essential. For 2., it does not matter. I would like to warm up my idea to retain the order in some cases, but to drop it (by default) when updates are performed. In particular, I think we should:
If we do so, it will be up to processors to provide support for an implementation of an ordered map that supports updates, or to use a simple map that transfers the entries to an unordered map once updates are performed. Finally, it is important to remember that in many cases maps are never updated after the creation – which means there is no need to store such entries in an immutable implementation. |
During discussion, @dnovatchev raised the question of defining ordered maps as a subtype of maps rather than as a property associated with the map instance. I'd like to consider the advantages and disadvantages of both approaches. The most obvious benefit of using a subtype is that it enables functions to declare that they require an ordered map as the supplied argument, and to fail with a type error if supplied with an unordered map. Is that a feature we think we need? At the moment we aren't proposing any system functions or operators that require a map to be ordered. Using a subtype would also obviate the need for an interrogative function to test whether a map is ordered, since this could be done using the "instance of" mechanism. The main disadvantage of using subtyping is that it requires new syntax and semantics for a new item type, and new subtyping rules, and therefore more tests. There's potential complexity in deciding whether record types are always ordered, always unordered, or whether they can be either. Adding a property to maps, as currently proposed, is a lot simpler. |
I'm inclined to favor subtyping for ordered maps, based on experience with other PLs. Yes, it will require work to set up in the specs, but once that happens, we have the means to introduce other map subtypes. (In fact, I think once that mechanism is in place, the CG will be more favorable to allowing sorted maps than they currently are.) Obviously (or maybe not), the map functions should remain in a single namespace. That is, functions that would work only on ordered maps would remain in the |
Absolutely not. A record is the closest thing we have to an interface, the order of attributes/properties is not significant - the same way the order of keys in a map is not significant. We have had records for more than a year and this is the first and only time someone is saying the order of entries in a record is significant. |
@Arithmeticus wrote:
I think it's dangerous to apply that analogy too closely. The model used in Java and C# is to have a single interface for all maps, supported by different implementation classes. Our types in XDM are much more like interfaces: they define the set of operations available, not the implementation characteristics. This gives us many advantages that we don't want to lose, for example the ability to change the implementation characteristics of an existing value dynamically, behind the scenes, based on observed usage. |
@dnovatchev wrote:
There is ample user feedback that ordering is important for records - not to make it information-bearing, but for human comprehension. It's primarily important for the same kind of reason that indentation is important: a great strength of XML and JSON over formats such as ASN.1 and protocol buffers is that it is human readable. This is the reason that a lot of JSON-oriented software has moved in the direction of preserving order. Incidentally, users also frequently ask this for XML attributes. It's not that they want to attach meaning to the order of attributes, it's that predictable order improves readability. If that's true for XML attributes where things are usually all on one line, it's vastly more true for JSON structures where rearranging the fields of a record in the serialized output might move one of the fields 10,000 lines away. Consistent ordering makes it easier to compare two structures, both for humans and for machines. This whole exercise to define ordering for maps arose, if you recall, in response to feedback from users trying out early implementations. |
It is a well-known antipattern to have classes (records) with too-many members (keys). Users who have this problem are violating this principle and now we are trying to support them in this??? |
You don't need many fields to encounter this problem. Consider a record representing a telephone bill with five fields: (name, address, phone number, itemised charges, call details). The last two fields might contain large amounts of data. If this is displayed in the order (address, call details, phone number, itemised charges, name), then it's going to be very difficult to find the name and phone number when looking at the data visually. The user who designed this structure hasn't violated any design principles. We've been experiencing this problem for years with some of our own data, for example SEF files, and the output of the Java parser that we use in one of our toolchains. It's certainly a problem our users have reported as well, and yes, it would be nice to help them solve it. For another example, consider if we held the function catalog in JSON rather than XML, and every time we edited a function, the order of fields changed so that the examples might appear randomly before or after the function signature. Editing the catalog would become far more difficult. |
@ChristianGruen wrote:
This is reasonably well aligned with the current PR. There are a couple of differences of detail: (a) should the result of a map constructor be ordered? I have suggested no, but there are arguments for and against. We could introduce a constructor (b) should |
I think it should be. For users who care about order, it would be the obvious behavior. Others will only care if they get worse performance – which will only be the case for large maps (Category 2, as named above). While it’s possible to write down map constructors with millions of entries in XPath/XQuery code, it will certainly be an edge case (and possibly make parsing the query more expensive than building the map). Using
It might confront users with performance considerations that they are not aware of. In particular, it gets confusing for users who read existing code. I think the JSON syntax has benefited from being as simple as possible.
This distinction would seem very sensible to me.
It may depend. For example, a compact and ordered map representation (in terms of memory consumption) could be used at creation time. Such a map could, for example, store all hash values – keys, buckets, etc. – in flat arrays, thus making the storage of explicit references to the next hash entry obsolete. If such a map is updated later on, it could be transformed to an unordered map (based, e.g., on an HAMT). Usually, this needs to be done only once in the lifetime of the map, and the complexity is similar to resize/rehash operations in conventional maps, i.e., an operation that is usually pretty efficient. In other cases, if such an ordered map is never updated, it could possibly consume less memory than an updatable and unordered map. I had a look at the query But more generally, it would give implementers more freedom on how to proceed. It should be perfectly fine to only have one map implementation, and retain the order (or some defined order) in a map if updates are performed. |
Using a structure that doesn't require persistent update is one approach to improving things; the other is to avoid some of the overheads caused by type annotations and the equality matching semantics. We're currently using a map implementation that doesn't allow customized equality/hashcode computation except by wrapping each key value in a wrapper that has the required equals/hashCode methods, so there's a lot of waste. We do a lot better for maps built with parse-json(), because we know all the keys will be xs:string instances and Java has the right equality semantics for that case; we also build a map in that case that isn't updateable without first copying to a different structure. But for map:build we currently construct a map that has overheads to support functionality that you're probably not going to use. There's immense scope for improvements. For example, as I think you're suggesting, one can even initially build a map that doesn't have any "search by key" functionality, and add a hash index on first use. In many JSON transformations, it's likely that many maps will be copied directly from the input to the output without ever being queried or modified in any way. |
I had an implementation in mind that stores all data in flat arrays. We use such sets/maps for all kinds of things in BaseX: One array contains all entries in the order in which they have initially been added; the other arrays organize hash table buckets and next-pointers for entries that cause collisions. Our corresponding set implementation for byte arrays consumes 25% less memory than Java’s HashSet. Such a map could also be enhanced to be immutable/persistent. Appends would be cheap, but value replacements and deletions may become expensive. It’s an interesting idea to only generate the hash index if it’s required. It could work if we know in advance that all keys that are to be added are distinct (which, as you wrote, would be the case for valid JSON input). |
Currently, XDM maps are “unordered”: An implementation is allowed to organize entries in a way that optimizes lookup, not order. The entries do not have a predictable order unless they are explicitly sorted.
There are cases in which it is helpful if the “insertion order” is preserved – i.e., the order in which new map entries are added to a map. While the insertion order is not relevant if a map is exclusively used for lookups, it may be beneficial if the input includes deliberately sorted key/value pairs, such as (often) in JSON data, configurations or key/value sequences.
I created this issue because there was some confusion in #564, and on Slack, about this map flavor and “sorted maps”, which are discussed in issue #564: Sorted maps hold all map entries sorted by the key, using a comparator or (in its basic variant)
fn:data#1
.PR #1609 attempts to solve both requirements at once.
The text was updated successfully, but these errors were encountered: