Map collections of primitive types to JSON column in relational database #29427
Comments
@ajcvickers @maumar is this a dup of #28688?
@roji No. #28688 is about collections of primitive types inside another type where the other type is mapped to a JSON document. This is about mapping a property that is itself a collection of primitive types. In my opinion, this should end up being the by-convention mapping for collections of primitive types, although I think some others favor making #25163 the default mapping.
Ah OK, in that case #29825 (which I opened yesterday) is a dup of this. Pasting some stuff I wrote there:
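For concreteness, a minimal sketch of the two shapes being contrasted; the type and property names here are hypothetical:

```csharp
public class Blog
{
    public int Id { get; set; }

    // #28688: a primitive collection nested inside a type that is itself mapped to JSON.
    public BlogDetails Details { get; set; } = new();

    // This issue: a property that is itself a collection of a primitive type.
    public List<string> Tags { get; set; } = new();
}

public class BlogDetails
{
    public List<string> Keywords { get; set; } = new();
}
```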
@ajcvickers mapping to a separate table feels overly heavy for something you are modelling using only a primitive type. The use cases here likely don't scale beyond a few values stored in the column. I would argue for a CSV format being the default; the JSON format ought to be opt-in. Npgsql has support for these types through native Postgres array columns.
What would be your argument in favor of CSV over JSON as the default mapping?
@roji It's the simplest possible format for the problem at hand. The usual use case I've personally seen these used for is tags or categories. Why add extra characters to denote that the data holds an array, and have escaping quotes around values, when commas are all that is needed to delineate them? (Correction: I made a mistake in my comment above about the exact CSV values.) Simplicity and storage performance. Opt in to extra complexity (i.e. JSON) only if and as you need it.
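For reference, the CSV-in-one-column approach argued for here is what developers typically hand-roll today with a value converter. A minimal sketch, assuming a comma delimiter and illustrative names:

```csharp
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.ChangeTracking;
using Microsoft.EntityFrameworkCore.Storage.ValueConversion;

// Store List<string> as a single comma-separated string column.
// Note: there is no escaping, so this breaks if a value itself contains a comma (discussed below).
var csvConverter = new ValueConverter<List<string>, string>(
    v => string.Join(",", v),
    v => v.Split(',', StringSplitOptions.RemoveEmptyEntries).ToList());

// A value comparer so change tracking compares list contents rather than references.
var listComparer = new ValueComparer<List<string>>(
    (l, r) => l.SequenceEqual(r),
    v => v.Aggregate(0, (hash, item) => HashCode.Combine(hash, item.GetHashCode())),
    v => v.ToList());

// Applied in OnModelCreating:
// modelBuilder.Entity<Blog>().Property(b => b.Tags).HasConversion(csvConverter, listComparer);
```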
For one thing, at least for some values (e.g. strings), quoting is just as much a problem for CSV encoding as for JSON. For example, try to use STRING_SPLIT to do querying over a CSV field which does require quoting (because e.g. a comma is present in one of the values) - that's not likely to work well. The point is that JSON is a format that's already recognized and somewhat well-supported by the database, so in that sense it's a superior option to CSV. The extra characters for quoting are unlikely to make a meaningful difference in real-world scenarios.
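To make the "recognized by the database" point concrete, a hedged sketch of querying into a JSON array column on SQL Server via raw SQL from EF; the `Blogs` table and `Tags` column are hypothetical, while OPENJSON and JSON_VALUE are built-in SQL Server functions:

```csharp
// Tags is assumed to be an nvarchar column holding a JSON array such as ["a","b","c"].
// JSON_VALUE(Tags, '$[0]') can address an element by index; OPENJSON expands the array into rows.
// There is no robust CSV equivalent once a value can contain (and must escape) the delimiter.
var taggedBlogs = await context.Blogs
    .FromSqlRaw(
        "SELECT b.* FROM Blogs AS b " +
        "WHERE EXISTS (SELECT 1 FROM OPENJSON(b.Tags) AS t WHERE t.[value] = {0})",
        "ef-core")
    .ToListAsync();
```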
In what scenarios does quoting actually come into play, though? I would imagine this would only be for user-inputted values, in which case you would likely configure it as its own separate related table (#25163), since your value set is going to grow with your user base (i.e. multi-tenancy scenarios) and will likely include more complex data structures than a simple primitive value (i.e. you'll at least have a Name and a Rank/Order column). If you have this scenario you would NOT be leveraging a collection of simple primitive types to model it.

On the flip side, here's an example of where we do use non-user-inputted tagging/categorization in practice (screenshots omitted): each of those columns is an array of pre-defined, semantic, well-defined enums.

When it comes to the DB, I would argue perf and storage optimization matter MOST; that's why DBs are DBs, and we use ORMs to abstract them, make them human-centric for developers to work with, and hide away the serialization optimization concerns. Why would we transport any amount of extra bytes/characters between the DB server and our API server? We already have areas where we need to include custom serialization formats for performance reasons to reduce the size of the payload transferred.

Instead of designing for the 1% scenario where someone might be persisting non-semantic data, we should optimize for the scenario that we encounter most often in practice. I believe this has been a recent design philosophy for C# 10/11 as well: simplify and make the common real-world scenarios most optimized, then have a fallback to opt in to the extra once-in-a-blue-moon scenarios if and only when you need it.
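A sketch of the pattern being described, with hypothetical names: each tag column holds a closed set of pre-defined enum values rather than arbitrary user input, so delimiters never appear in the data.

```csharp
public enum DietaryTag { Vegan, Vegetarian, GlutenFree, NutFree }

public class MenuItem
{
    public int Id { get; set; }

    // A closed, semantic set of values; no user-entered free text,
    // so a compact delimiter-separated encoding is safe and small.
    public List<DietaryTag> DietaryTags { get; set; } = new();
}
```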
That's very speculative - I'm sure many developers would be happy representing lists of user inputs in a simple column rather than a separate table - especially given how easy it is to do with JSON. Not to mention performance: the cost of doing an extra JOIN because you've decided to put your value list in another table is likely to be significantly more than the overhead of the quoting characters in a JSON document. I certainly don't think storing lists of arbitrary strings (which may contain commas) is a once-in-a-blue-moon scenario. I also think that before prematurely optimizing for the scenario where there are no commas, we'd want to see some hard proof that JSON really is significantly slower than CSV here - which I doubt. Bottom line, I don't think we'd want the default mapping strategy for primitive collections to be something that breaks down as soon as you stick a comma in one of the values.
Would it be that big a deal to add an explicit ToJson opt-in for this?
I don't think this is true for MS SQL: "JSON text is stored in VARCHAR or NVARCHAR columns and is indexed as plain text" (SQL Server 2022 docs). It seems like JSON is treated as just a raw string equivalent in the DB.

Here's another consideration: standardization across EF providers. Given that Postgres DOES support array types natively and would by default map collections to its native array type, defaulting to JSON only on MS SQL fragments the experience across providers.

I think the "bottom line" you stated falls exactly into theoretical / edge-case optimization rather than doing the best thing the large majority of the time (i.e. the 90%+ scenario). DBs should be optimized for performance, simplicity and the highest storage efficiency. OO is for humans, and that is the reason for Entity Framework to exist and hide away the optimization complexities around the impedance mismatch; it's its core value prop. If there's any bottom line or guiding principle here, it ought to be that.
The EF 7 JSON columns feature is an opt-in because there already is a default behavior when mapping owned entity types for relational databases: table splitting. In other words, by default the properties of the owned entity are simply mapped to columns in the same table as the owner, which seems like a very reasonable default that people have been happy with. However, with primitive collections there's no current default: just sticking a primitive collection property on your entity type isn't mapped at all today.

BTW, one possible reason to require an explicit ToJson would be to allow System.Text.Json to possibly get trimmed when ToJson isn't specified. We'd likely need to do considerable other work to make that happen.
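For reference, the EF7 opt-in being discussed looks like this for owned entity types (illustrative names); omitting ToJson gives the table-splitting default described above:

```csharp
modelBuilder.Entity<Author>().OwnsOne(
    author => author.ContactDetails,
    owned =>
    {
        // Without this call, the owned type's properties are table-split into
        // the Authors table; with it, they are stored in a single JSON column.
        owned.ToJson();
    });
```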
I never said SQL Server stored JSON in anything other than text (e.g. nvarchar). What I said was that the format is recognized and well-supported via various functions. For example, you can easily index into a JSON array via a built-in SQL Server function (JSON_VALUE), whereas you most certainly cannot do that with CSV, especially if escaping comes into play.
Different providers can certainly do different things - there's no reason to force an opt-in to JSON on the SQL Server provider just because the PG provider by default maps to its natively-supported array types. Just like on the PG side we don't require an explicit opt-in to map to arrays, even though other providers don't support arrays.
You seem to be saying that CSV is superior for over 90% of the scenarios out there; in other words, that over 90% don't need escaping/quoting, and that the perf difference between CSV and JSON is important enough to choose CSV over JSON (assuming it's better). That really is purely a subjective, unbacked intuition of yours which I don't share... If you'd like to substantiate your argument, I'd start by actually comparing performance here, posting an actual benchmark showing that CSV is really better.
Thanks @roji, it's great we're picking this apart.
Great points on the opt-in side. I fully agree the question is two-fold: should there be a default mapping, and what should that mapping be? In terms of the ability for EF to map a primitive collection to a JSON column, I fully agree that there should be one. We use both JSON columns and CSV arrays heavily in our DB schemas as a more modern and practical alternative to having tables for everything.
I think minimizing fragmentation across DB providers should be a significant design goal for EF, so taking into account the functionality of other databases (in particular Postgres), and not just applying an MS SQL mindset, ought to be a high consideration. NOTE: Oracle also supports arrays with its own native array types.
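For comparison, the existing Npgsql behaviour referenced here: arrays and lists of supported primitive types map to native PostgreSQL array columns by convention, with no opt-in. A sketch with illustrative names:

```csharp
public class Post
{
    public int Id { get; set; }

    // With the Npgsql EF Core provider this maps by convention to a text[] column.
    public List<string> Tags { get; set; } = new();

    // Maps by convention to an integer[] column.
    public int[] Scores { get; set; } = Array.Empty<int>();
}
```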
Yup, 100% intuitive and subjective, and I am not a core member of the EF team able to take this on as a deep-dive research initiative and turn it into pure data-driven science; you guys are. I draw my sense from a slew of architecture experiences across large enterprise, FAANG and early-stage startups, and from what comes up there in the wild. This is of course an informed/educated opinion, one which leans into favouring simplicity and pragmatism with incremental complexity (the YAGNI principle).

Based on this, CSV feels like a perfectly reasonable default with JSON as an opt-in, though I would say even having NO serialization as a default and having JSON as an opt-in would be better than having JSON as an arbitrary default on MS SQL. This would unify the functionality across providers: if they have a DB-backed array type they can map to it by default; if not, you can opt in to JSON explicitly and they would all model it as the best thing for that database. NOTE: The best JSON representation in Postgres is not a raw string, it's the binary jsonb type.

Option 1: CSV default (NEW), JSON opt-in (NEW)
Option 2: No default, JSON opt-in (NEW)
Option 3: JSON default (NEW)

Option 1 scores higher on provider fragmentation (or lack thereof), provides a unified experience across providers (the don't-surprise-the-user principle), has higher storage and payload efficiency, and has higher serialization simplicity on MS SQL. This is why it is my favoured and recommended approach; but Option 2 would still be better than Option 3, as it will at least allow JSON serialization to be leveraged by those few people who need it and don't want to add a custom CSV converter for all their semantic array value scenarios (the existing status quo).
The thing I'm still missing here is some sort of proof that JSON is significantly less performant, which is what you were claiming above.
Well, that wouldn't really be a consistent cross-database experience - on PG (and Oracle) you'd get implicit mapping to the DB native array type, whereas on SQL Server you wouldn't (that option doesn't exist). I'm not sure how forcing people to explicitly opt into JSON (or CSV) on SQL Server helps unify anything; the databases really are different in what they support, and I think the "unified" experience users expect is for the mapping to just work, regardless of the specific mechanism used to represent the array (native array type on PG, JSON or CSV).

Re your options, no option here provides a unified cross-database experience in terms of how the data is stored; that's going to be the case no matter what we do. The only unified experience we can possibly provide is making the mapping "just work" without any explicit opt-in, but with differing storage mechanisms. Regardless, in your comparison you've left out the fact that CSV breaks down (badly!) when someone inserts a comma. You may speculate that this is rare, but forcing people to discover this and go through the pain of possibly converting their data just doesn't seem justified - at least for string data. We may consider a different mapping based on the data type (e.g. CSV for ints, JSON for strings), though going into even more complexity here also has a price.

Finally, note that again, there's a chance we may want to avoid JSON as an implicit default anywhere in order to enable trimming of System.Text.Json. That really is the strongest argument I'm seeing here for avoiding this as a default.
Ahh, there's a misunderstanding here. What I mean by consistency across providers is that if you were to use an explicit opt-in to JSON (e.g. ToJson), each provider would map the collection to its best JSON representation, which is consistent behaviour.

If you were to NOT specify it, then each DB would do whatever it can to support the collection natively (again, consistent in behaviour), so that if a particular DB does have a native column format out of the box, it can be used; or, if there is a reasonable alternative (i.e. the delimiter-separated option above), then it can use that (either Option 1 or Option 2 on SQL Server).

One other thing to consider with the whole CSV approach is that if the comma delimiter is an issue, the delimiter could be configured explicitly (either by convention or at the property level); that way, if you want to use a different delimiter, you can.
Of course, but that doesn't really help with the fundamental problems in any way. First, it doesn't help where escaping is necessary (i.e. where there's no possible safe delimiter choice because the strings are arbitrary). More importantly, the user must be aware of this detail/complexity (discoverability). I'm thinking of the naïve user simply sticking a string array on their class and expecting EF to just "do the right thing" when mapping it, without even being aware of it being serialized as CSV. One of the values contains the delimiter character, and so when the column is read back, the user gets silent data corruption. That's a classic "pit of failure" with potentially really bad consequences; I'd never want that to be a default behavior.

To summarize my view so far... Although JSON seems like a very reasonable default to me, I'd probably want us to avoid it as an implicit default for trimming reasons. Doing a CSV mapping by default may be acceptable for non-strings, but for strings I'd require users to explicitly specify what mapping strategy is desired (and possibly allow CSV, but require explicit specification of the delimiter).
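A minimal illustration of the silent-corruption pitfall described above, using a naive join/split with no escaping (purely illustrative):

```csharp
var tags = new List<string> { "fruit", "red, seedless", "fresh" };

// Naive CSV encoding with no escaping:
var stored = string.Join(",", tags);            // "fruit,red, seedless,fresh"

// Reading it back yields four values instead of three: silent data corruption.
var roundTripped = stored.Split(',').ToList();  // ["fruit", "red", " seedless", "fresh"]
```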
One last note (thanks @ajcvickers): the actual character overhead of a JSON int array compared to CSV is just the two surrounding brackets.
Heh, if it was only 2 characters I don't think this push-back would have had much of an argument. The general case to support here, as I included with screenshots above, is sets of structured tags, modelled as enums in OO and encoded as strings in the DB. These require quotes in JSON, thus two extra characters per value. This means it's 2 chars + 2 chars * the size of the collection.
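(To make the arithmetic concrete: three string tags encoded as `["red","green","blue"]` take 22 characters, versus 14 for `red,green,blue`; the difference is 2 brackets plus 2 quotes per value, i.e. 2 + 2 × 3 = 8 characters, matching the formula above.)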
Not remotely the same. I would never want my DB/ORM to arbitrarily use such an inefficient encoding mechanism unless the complexity of the data structure warranted it. Given these are, by definition, primitive types we are discussing, JSON is overkill, and I would opt out of it each and every time if it was the default. Maybe a good argument for setting this up as a model building convention that can be configured!
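A sketch of what such a configurable model-building convention could look like today, using EF Core's pre-convention model configuration; CsvStringListConverter is a hypothetical converter along the lines of the one sketched earlier, not an existing EF type:

```csharp
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Storage.ValueConversion;

public class CsvStringListConverter : ValueConverter<List<string>, string>
{
    public CsvStringListConverter()
        : base(
            v => string.Join(",", v),
            v => v.Split(',', StringSplitOptions.RemoveEmptyEntries).ToList())
    {
    }
}

public class AppDbContext : DbContext
{
    protected override void ConfigureConventions(ModelConfigurationBuilder configurationBuilder)
    {
        // Apply the CSV conversion to every List<string> property in the model by convention.
        configurationBuilder
            .Properties<List<string>>()
            .HaveConversion<CsvStringListConverter>();
    }
}
```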
Closes dotnet#29427 Closes dotnet#30426 Closes dotnet#13617
Contrast to #25163 (map primitive collections to another table).
Contrast also to #28688, which is about allowing primitive collections inside JSON documents.
Contrast as well to #28871, which is about allowing any unmapped value in a JSON document, not just primitives.
This type of mapping doesn't involve a new table in the database but could still allow queries using the JSON query functionality of the relational database.
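For illustration only, a hypothetical query over such a mapped collection and the rough shape of SQL it could translate to on SQL Server; neither the mapping nor the translation exists at the time of this issue, and the names are illustrative:

```csharp
// Hypothetical: Tags is a List<string> mapped to a JSON column on the Blogs table.
var matching = await context.Blogs
    .Where(b => b.Tags.Contains("ef-core"))
    .ToListAsync();

// Could translate to something along the lines of:
// SELECT b.* FROM Blogs AS b
// WHERE EXISTS (SELECT 1 FROM OPENJSON(b.Tags) AS t WHERE t.[value] = N'ef-core')
```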
Label: area-primitive-collections