union with count limit #4937

Closed
allukaZod opened this issue Dec 14, 2023 · 4 comments · Fixed by #5069
Comments

@allukaZod

Hi,
Thank you for the great tool.
Here is my problem:

When dealing with a large data set, such as

{"tag": "1", "ip": "1.1.1.1", "category": "some_cat1"}
{"tag": "2", "ip": "1.1.1.2", "category": "some_cat1"}
{"tag": "3", "ip": "1.1.1.3", "category": "some_cat1"}
{"tag": "4", "ip": "1.1.1.4", "category": "some_cat2"}
...
{"tag": "100000", "ip": "11.1.1.1", "category": "some_cat2"}

When I use the union function to combine ip values by category or tag, the zq command is: union(ip) by category.
The result is a very large list of IPs, such as

{"union": ["1.1.1.1", "1.1.1.2", "1.1.1.3", ... "11.1.1.1"], "category": "some_cat1"}

What I want is a way to set a limit in union that caps the length of "union" and splits the result into multiple records, each holding at most that many values. For the example above, if I set the limit to 2, union(ip) by category limit 2, the result would be:

{"union": ["1.1.1.1", "1.1.1.2"], "category": "some_cat1"}
{"union": ["1.1.1.3", "1.1.1.4"], "category": "some_cat1"}
...
{"union": ["11.1.1.1", "11.1.1.2"], "category": "some_cat1"}
{"union": ["1.11.1.3", "11.1.1.4"], "category": "some_cat1"}

@philrz
Contributor

philrz commented Dec 14, 2023

Hi @allukaZod! Thanks for your interest in Zed.

In your final example output, I think you probably meant to have some entries with some_cat2? In any case, I think I understand the question, and if so, it's an interesting one. I'm looking into whether Zed already has the building blocks to do this. I'll circle back after I've done more research.

@philrz
Contributor

philrz commented Dec 17, 2023

Hello again @allukaZod. I've got something close to what you were seeking. Using the building blocks currently available in Zed, the approach creates the full sets via union() and then emits them in batches of the requested size after the fact. It's admittedly a little hacky, and we might design a more direct way to achieve this in the future. For now I've wrapped the functionality in a User-Defined Operator so you don't have to deal with it in the main part of your Zed pipeline. It's also an opportunity to understand some of the more advanced parts of the language, like the spread operator and lateral subqueries.

Here's the user-defined operator in a file called batches.zed:

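// Split complex_val (a set or array) into batches of at most batch_size
// values, tagging each batch with the group it came from.
op emit_batches(complex_val, batch_size, group):
(
  // Spread the values into an array and iterate over them one at a time,
  // carrying group into the lateral scope.
  over [...complex_val] with group =>
  (
    // count() acts as a running counter here, so integer division assigns
    // batch ids 0, 1, 2, ... to consecutive runs of batch_size values.
    {id:(count()-1)/batch_size,val:this}
    | collect(val) by id
    | yield {group:group, batch:collect}
  )
)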
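
To make the id arithmetic in the operator easier to follow, here is a small Python sketch of the same idea, outside of Zed. It assumes the counter starts at 1 and that the division is integer division, which is how the expression (count()-1)/batch_size is being used above. The function name batch_ids is hypothetical and exists only for this illustration.

```python
def batch_ids(n_items, batch_size):
    """Return the batch id given to each of n_items values, in arrival order."""
    # The running counter is 1-based, so subtract 1 before the integer
    # division; the first batch_size values then all share id 0.
    return [(count - 1) // batch_size for count in range(1, n_items + 1)]


print(batch_ids(5, 2))  # [0, 0, 1, 1, 2]
```

Grouping the values by that id (the collect(val) by id step) is what turns the flat sequence into batches.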

And some sample data in a file data.json similar to what you showed:

{"tag": "1", "ip": "1.1.1.1", "category": "some_cat1"}
{"tag": "2", "ip": "1.1.1.2", "category": "some_cat1"}
{"tag": "3", "ip": "1.1.1.3", "category": "some_cat1"}
{"tag": "4", "ip": "1.1.1.4", "category": "some_cat1"}
{"tag": "5", "ip": "1.1.1.5", "category": "some_cat1"}
{"tag": "6", "ip": "1.1.1.6", "category": "some_cat1"}
{"tag": "7", "ip": "1.1.1.7", "category": "some_cat1"}
{"tag": "8", "ip": "1.1.1.8", "category": "some_cat1"}
{"tag": "9", "ip": "1.1.1.9", "category": "some_cat1"}
{"tag": "10", "ip": "1.1.1.10", "category": "some_cat1"}
{"tag": "11", "ip": "1.1.1.11", "category": "some_cat2"}
{"tag": "12", "ip": "1.1.1.12", "category": "some_cat2"}
{"tag": "13", "ip": "1.1.1.13", "category": "some_cat2"}
{"tag": "14", "ip": "1.1.1.14", "category": "some_cat2"}
{"tag": "15", "ip": "1.1.1.15", "category": "some_cat2"}
{"tag": "16", "ip": "1.1.1.16", "category": "some_cat2"}
{"tag": "17", "ip": "1.1.1.17", "category": "some_cat2"}
{"tag": "18", "ip": "1.1.1.18", "category": "some_cat2"}
{"tag": "19", "ip": "1.1.1.19", "category": "some_cat2"}
{"tag": "20", "ip": "1.1.1.20", "category": "some_cat2"}

And an example that ties it all together:

$ zq -I batches.zed 'union(ip) by category | emit_batches(union, 2, category)' data.json
{group:"some_cat2",batch:["1.1.1.11","1.1.1.12"]}
{group:"some_cat2",batch:["1.1.1.13","1.1.1.14"]}
{group:"some_cat2",batch:["1.1.1.15","1.1.1.16"]}
{group:"some_cat2",batch:["1.1.1.17","1.1.1.18"]}
{group:"some_cat2",batch:["1.1.1.19","1.1.1.20"]}
{group:"some_cat1",batch:["1.1.1.3","1.1.1.4"]}
{group:"some_cat1",batch:["1.1.1.5","1.1.1.6"]}
{group:"some_cat1",batch:["1.1.1.7","1.1.1.8"]}
{group:"some_cat1",batch:["1.1.1.9","1.1.1.10"]}
{group:"some_cat1",batch:["1.1.1.1","1.1.1.2"]}
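
For readers who want to see the "build the full sets, then batch them" approach without Zed, the following Python sketch mirrors the two steps on equivalent data. It is only an illustration under stated assumptions: the function name union_in_batches is hypothetical, and the insertion-ordered output is a choice made in this sketch, not something zq guarantees (the zq output above lists some_cat2 first).

```python
from collections import defaultdict


def union_in_batches(records, key, field, batch_size):
    """Collect distinct `field` values per `key`, then emit them in batches."""
    # Step 1: build the full set of values per group, like union(ip) by category.
    # A dict serves as an insertion-ordered set so the result is deterministic.
    groups = defaultdict(dict)
    for rec in records:
        groups[rec[key]][rec[field]] = None

    # Step 2: slice each full set into batches of at most batch_size values.
    out = []
    for group, values in groups.items():
        vals = list(values)
        for start in range(0, len(vals), batch_size):
            out.append({"group": group, "batch": vals[start:start + batch_size]})
    return out


# Same shape as data.json: tags 1-10 in some_cat1, tags 11-20 in some_cat2.
records = [
    {"tag": str(i), "ip": f"1.1.1.{i}",
     "category": "some_cat1" if i <= 10 else "some_cat2"}
    for i in range(1, 21)
]

for row in union_in_batches(records, "category", "ip", 2):
    print(row)
```

As in the operator, every full set is held in memory before any batch is emitted, so this does not reduce peak memory for very large groups; it only bounds the size of each output record.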

However, I did bump into a new bug #4943 while working on this. The effects are evident if we batch the same input data into groups of three.

$ zq -I batches.zed 'union(ip) by category | emit_batches(union, 3, category)' data.json
{group:"some_cat1",batch:["1.1.1.1","1.1.1.2","1.1.1.3"]}
{group:"some_cat1",batch:["1.1.1.4","1.1.1.5","1.1.1.6"]}
{group:"some_cat1",batch:["1.1.1.7","1.1.1.8","1.1.1.9"]}
{group:"some_cat1",batch:["1.1.1.10"]}
{group:"some_cat2",batch:["1.1.1.11","1.1.1.12"]}
{group:"some_cat2",batch:["1.1.1.13","1.1.1.14","1.1.1.15"]}
{group:"some_cat2",batch:["1.1.1.16","1.1.1.17","1.1.1.18"]}
{group:"some_cat2",batch:["1.1.1.19","1.1.1.20"]}

That is, for some_cat2 we should have gotten three batches of 3 and one batch of 1, just as we did for some_cat1.
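
The expected batch sizes can be checked with a little arithmetic. This Python sketch (the helper name expected_batch_sizes is hypothetical) computes them for a group of 10 values batched in threes, which is the situation for each category in data.json:

```python
def expected_batch_sizes(n_values, batch_size):
    """Sizes of the batches when n_values items are split into batch_size chunks."""
    full, remainder = divmod(n_values, batch_size)
    # All batches are full except possibly a final, shorter one.
    return [batch_size] * full + ([remainder] if remainder else [])


print(expected_batch_sizes(10, 3))  # [3, 3, 3, 1]
```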

Anyway, I figured I'd share what I've got thus far in case you can make use of it despite that bug. I'll update again when we have that fix for #4943. Let me know if you have any other questions in the meantime.

@allukaZod
Author

Thanks a lot for the suggestion, that's exactly what I'm looking for!

And yes, some_cat2 should be the other group.

@philrz
Contributor

philrz commented Jun 7, 2024

@allukaZod: Not sure if you're still watching this issue, but FYI, the issue #4943 I mentioned above has been fixed, so the last example shown previously now generates the expected output.

$ zq -I batches.zed 'union(ip) by category | emit_batches(union, 3, category)' data.json
{group:"some_cat1",batch:["1.1.1.1","1.1.1.2","1.1.1.3"]}
{group:"some_cat1",batch:["1.1.1.4","1.1.1.5","1.1.1.6"]}
{group:"some_cat1",batch:["1.1.1.7","1.1.1.8","1.1.1.9"]}
{group:"some_cat1",batch:["1.1.1.10"]}
{group:"some_cat2",batch:["1.1.1.11","1.1.1.12","1.1.1.13"]}
{group:"some_cat2",batch:["1.1.1.14","1.1.1.15","1.1.1.16"]}
{group:"some_cat2",batch:["1.1.1.17","1.1.1.18","1.1.1.19"]}
{group:"some_cat2",batch:["1.1.1.20"]}

This fix is currently in Zed's tip of main and will be included in the next GA release, which I estimate will come out next week.

@philrz philrz closed this as completed Jun 7, 2024