-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: add support for ARRAY type #2115
Comments
I've been thinking about how youtube could store all their user comments on cockroachdb. On youtube you can post a comment against a video. You can also reply to an existing comment about a video. So the comment space can be modeled in a table where every row represents a video, with a column containing an array of all the comments related to a video. In addition comments can reference other comments, so the comment type needs to be a tuple like <id, other-comment-id, comment> The array type should support fast append and update operations, to allow quick commenting and editing of comments. |
@vivekmenezes Your method would work for small comment sets but you would quickly run into an issue with contention on that array for videos with people commenting quickly and size problems with large amounts of comments. However I think that your suggestion of update operations is a great one. Having array operations built in would avoid having to read the value and then process it locally in a transaction before sending back the new array to the server. Using update commands could create much shorter transactions witch would obviously be good for performance. |
|
One of the reasons for me putting in a real world example, is to allow the Alternatively, propose a different schema for the domain. A table with each A key design consideration is the ability to read all the comments off a On Tue, Jan 26, 2016 at 11:33 AM Kevin Cox notifications@github.com wrote:
|
WWPGD (What would postgres do) seems like a pretty good starting point for db related questions (it has served sqlite well although very different from cockroach). I suspect that inline arrays should probably not go over a "modest" size. I'm not saying it should be forbidden but probably discouraged and not optimized for. At the very least there comes a point where returning the whole array becomes not very useful and that probably means that you need to filter, and once you filter you should probably be using a separate table. That's my 2 cents at least. |
We might also consider the Cassandra model of collections - list, set, and map. Sure, sometimes people actually do want mathematical arrays, matrices, and multi-dimensional arrays (ala Postgres), but sometimes they just want simple lists or are trying to simulate a set using an array, or similarly storing keyword/value pairs in two-dimensional arrays or parallel linear arrays. The map collection type is nice since it lets you store and query arbitrary semi-structured data, typically keyword/value pairs much more naturally. Cassandra has convenient operations to operate on lists, sets, and maps, such as append and prepend for list, removing list elements, removing all occurrences of a value from a list or set, removing from a map by keyword, etc. Doing all this with raw arrays would be... not so fun. And I would encourage the same philosophy that Cassandra adopted for lists, sets, and maps: "Collections are meant for storing/denormalizing relatively small amount of data. They work well for things like “the phone numbers of a given user”, “labels applied to an email”, etc. But when items are expected to grow unbounded (“all the messages sent by a given user”, “events registered by a sensor”, ...), then collections are not appropriate anymore and a specific table (with clustering columns) should be used." It all depends on the use case. See: |
The Cassandra collection model looks really nice. It also appears similar to what I was recommending. I have quoted some of what I deem as the important parts below:
That limit seems pretty high but I wouldn't be surprised if it is considered bad practice to get anywhere near that. Personally I would be thinking that embedded collections would be suited for "tens of elements" at most. Again, there are multiple right answers, just sharing what I'm thinking :) |
+1 for Cassandra collection model. please implement |
Any updates on this @petermattis ? |
@amangoeliitb |
@petermattis One of the best use cases for |
I have seen quite a few Array related PRs been merged lately and was wondering how far we are from basic support of array type. Thanks ! |
+1 re: @kulshekhar's comment "could also be more performant alternative to joins in several cases" |
What specific functionality do you need for arrays? Postgres exposes a lot of syntax and functions for arrays, so we've been limiting our work to what's needed to support Hibernate ORM. |
These are the most important, from my perspective. Additional nice to have functionality would be
|
Honestly, it would be already useful to be able to simply store and retrieve an array. Then comes the list @kulshekhar posted and I agree, being able to index the array and check if an array contains a given value is probably the most desired features afterward. |
@jordanlewis @nvanbenschoten could you guys update this issue? |
Another key functionality @bdarnell brought up is indexing of ARRAY columns. |
Hi @MehSha, @kulshekhar, @tlvenn, @jhartman86, @amangoeliitb, @kajmagnus, @beldpro-ci, @thearchitect, @yinhm, @nickjackson, @arypurnomoz, @mbertschler, @sudhakar, @tlvenn, @bkleef, @domnulnopcea, @iamcarbon, @muesli, @heiskr, @arduanov, @daliwali, @manigandham since you expressed interest in this issue, I was wondering if you had any insight you could share on how you plan to use arrays within Cockroach! I'm interested in a couple main questions:
Thanks so much! |
TL;DR I use arrays to contain "tags" which I almost exclusively query to see if some value is in the array of tags.
Most of my array usage is checking membership. For example I have short arrays of small data (numbers or small strings) and I do a query if x is a member of that array. Ocasionally I will return the whole set of array elements in a query. My second most common use case is a small set of data such as email addresses which I will almost always return in it's entirety. I almost never access by index.
Very small. Most are [0 2] elements, a couple get to around 20. Again, these are integers or short strings.
I ocasionally use a GIN index to find rows that contain a certian element in an array.
I expect that they will be stored closer to the row itself for quick checking as well as the ease of indexing and querying. Hope that helps explain my use case. Feel free to ask for clarification or anything else. |
I mainly use these fields with queries. When I have to use the data from an array field, I normally parse it in the application.
I have applications with Array field size ranging upto 20000 elements
GIN indexing does it for me
The main use case is application level ACL - arrays allow row level permissions to be managed in a flexible and performant manner. It would also work better than joins in a distributed setup, I imagine |
I'll join in, why not? :)
Querying using ANY.
Not unbounded, rarely more than a list of dozen IDs or so.
I'm not at the scale where that matters... yet ;)
I'm not a 3rd normal form queen, just as long as I don't have duplicate data. I'm actually transitioning from NoSQL, so even just having a schema is a step in the traditional route. In these tables I'm storing versions of things, and the relationships are part of the versioning, so having to join over the I would say though that |
I would agree with this. |
Arbitrarily large, perhaps up to the row size limit.
I haven't had a use case that would benefit from indexing array columns yet.
Better read performance and less joins, more flexible data modelling. |
Thanks so much all of you for your input! It's much appreciated. |
A simple, and probably common scenario, would be a standard MIME email that has been broken out into parts so it's indexed/searchable. Such a domain model would (usually) contain a small number of In this case, it's quite convenient to be able to have a native array type in a column that is indexed so you can search for a specific email address. In postgres, this is easily accomplished with a |
Going to close this since the only outstanding issue is not blocking array usage and has its own issue. |
It wasn't immediately obvious to me when checking on this issues updates to see that the feature was actually implemented and is part of 1.1 (currently beta).. just for anyone else looking here as well |
arrays in expressionssql: add support for ARRAY indexing #10518 sql: add support for ARRAY operators and functions #10519= ANY(array)
like forIN
expressions sql: optimize= ANY
likeIN
#13593The text was updated successfully, but these errors were encountered: