add shard_results and hashtext functions #2220

Closed
tychoish opened this issue Dec 7, 2023 · 0 comments
Labels
cloud Issues that affect cloud feat New feature or request

tychoish commented Dec 7, 2023

Description

I was working on something with GlareDB Cloud and wished I had these functions in GlareDB. We can discuss names, but basically the operations would be:

hashtext

This could just as easily be named fnv() or fnv1a(). It would take the raw data of a field and return the hash value as an unsigned 64-bit integer. If we differ from the Postgres function, we shouldn't use the same name. I'm partial to 64-bit FNV-1a, but any non-cryptographic hash is fine.
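To make the proposed behavior concrete, here is a minimal sketch of 64-bit FNV-1a in Python. It only illustrates the hash itself, not GlareDB's implementation; the function name `fnv1a_64` is a placeholder, and the constants are the standard 64-bit FNV offset basis and prime.

```python
# Standard 64-bit FNV parameters.
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF  # keeps the result in the unsigned 64-bit range


def fnv1a_64(data: bytes) -> int:
    """Return the 64-bit FNV-1a hash of a byte sequence (hypothetical helper)."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte                       # FNV-1a: XOR the byte in first...
        h = (h * FNV64_PRIME) & MASK64  # ...then multiply, wrapping at 64 bits
    return h


empty_hash = fnv1a_64(b"")
a_hash = fnv1a_64(b"a")
```

Hashing the empty input returns the offset basis unchanged, which is a quick sanity check for any implementation.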

shard_results

shard_results(<data>, <num_shards>, <shard_id>)

Data could be any type (we'll use its byte sequence, no need to cast), num_shards is a positive integer, and shard_id is a number within the [0,<num_shards>) range. The function would return a boolean and be used in a WHERE clause.

The operation would be, basically, hash(<data>) % <num_shards> == <shard_id>.

This function should be implemented in terms of the first.
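A rough sketch of that relationship in Python, assuming FNV-1a as the underlying hash (any non-cryptographic hash would do). The names `fnv1a_64` and `shard_results` and the argument validation are illustrative assumptions, not the eventual GlareDB signature.

```python
def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a, standing in for the proposed hash function."""
    h = 0xCBF29CE484222325
    for byte in data:
        h = ((h ^ byte) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


def shard_results(data: bytes, num_shards: int, shard_id: int) -> bool:
    """True when `data` hashes into shard `shard_id` out of `num_shards`."""
    if num_shards <= 0:
        raise ValueError("num_shards must be a positive integer")
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    return fnv1a_64(data) % num_shards == shard_id


# Each row should match exactly one shard id, so the shards partition the rows.
rows = [f"job-{i}".encode() for i in range(10)]
matches_per_row = [sum(shard_results(r, 4, s) for s in range(4)) for r in rows]
```

Because the predicate depends only on the row's bytes and the two integers, every caller that passes the same `num_shards` sees the same assignment without any coordination.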

As future work, it would be interesting to see whether, for Parquet data sources, this could end up working so that we'd pull the column in question, do the filtering, and then pull the remaining data out.

Use Case

If you have multiple stateless application servers and you want to divide the output of a query (which represents some work) into slices (shards), one for each application server, this function can help push that calculation into the database and reduce the amount of data that's sent to the application.
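As an illustration of that use case, each application server could build its own query from its shard id. This is only a sketch: the `jobs` table, the `id` column, and the helper `build_shard_query` are hypothetical, and `shard_results` is the name proposed in this issue, which may change.

```python
def build_shard_query(table: str, key_column: str, num_shards: int, shard_id: int) -> str:
    """Build the query one worker would run for its slice (hypothetical helper).

    Identifiers are interpolated directly, so they must come from trusted
    configuration, never from user input.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be a positive integer")
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    return (
        f"SELECT * FROM {table} "
        f"WHERE shard_results({key_column}, {num_shards}, {shard_id})"
    )


# Three stateless workers, each asking only for its own slice of the work.
queries = [build_shard_query("jobs", "id", 3, worker) for worker in range(3)]
```

Each worker only needs to know the total number of workers and its own index; the database does the filtering, so rows belonging to other shards never leave it.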

@tychoish tychoish added the feat New feature or request label Dec 7, 2023
@greyscaled greyscaled added the cloud Issues that affect cloud label Dec 7, 2023
tychoish added a commit that referenced this issue Dec 28, 2023
Adds partitioning (sharding) of result sets using the hashing method. 

Closes #2220