
groupby-agg all columns based on one column #828

Open
markbarna opened this issue Jun 12, 2020 · 9 comments

Comments


markbarna commented Jun 12, 2020

Hello,
I have a dataset that looks roughly like this:

#    category    date        x                     y
0    c           2020-01-20  0.6985333564957753    61
1    b           2020-01-19  0.011782836532168281  63
2    c           2020-01-18  -1.5929651389254533   16
3    c           2020-01-17  -1.614997608433403    97
4    b           2020-01-16  0.5646888996607152    23
...  ...         ...         ...                   ...
15   c           2020-01-05  -0.3777229791754252   55
16   b           2020-01-04  -1.8989497908039141   44
17   c           2020-01-03  0.5762175737242692    99
18   c           2020-01-02  1.4724334531958192    22
19   a           2020-01-01  1.1491668587221784    2

I need to group by category and then select the values for x and y that correspond to the maximum date within each group. With Pandas, I would do:

df.loc[df.groupby('category')['date'].idxmax()]

Is there a way to achieve this using Vaex?
I was able to select the minimum date within each group using:

df.groupby('category').agg({'x': vx.agg.first('x', 'date'),
                            'y': vx.agg.first('y', 'date')})

The documentation for vaex.agg.first doesn't indicate exactly how to use the order_expression parameter, so I wasn't sure whether there is a way to reverse-order the date expression. Is this possible?

Thank you

@maartenbreddels (Member) commented

Hi,

Yes, first would be a good way to do this. Initially I didn't think we'd need 'last', but negating a date isn't something numpy likes. Maybe you can try:

dfg = df.groupby('category').agg({'x': vx.agg.first('x', -df.date.astype('int64')), ..

Indices are a bit problematic, as explained in #579.

Regards,

Maarten

@markbarna (Author) commented

Hi Maarten,
Thanks for your help. Your suggestion worked with some changes. First, I had to enclose the entire order_expression in quotes; otherwise I get: TypeError: unhashable type: 'Expression'. Also, it seems that the datatype the date column is cast to must match the datatype of the column you're selecting from: float for column x and int for column y, in this case. Here is what worked:

dfg = df.groupby('category').agg({'x': vx.agg.first('x', '-date.astype("float")'),
                                  'y': vx.agg.first('y', '-date.astype("int")')})

Alternatively, this also worked to accomplish the same thing:

dfg = df.groupby('category').agg({'date': 'max'})
df = df.join(dfg, on='category', rsuffix='_max')
df = df[df['date'] == df['date_max']].drop(['date_max', 'category_max'])

Obviously, it's not as concise, but I found it to still be much faster than using Pandas.
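For reference, here is a self-contained version of the agg-based approach on toy data (a sketch only, not verified against any particular vaex version; the quoted order_expression and the matching casts follow the observations above):

import numpy as np
import vaex as vx

# Toy data in the same shape as the original example.
df = vx.from_arrays(
    category=np.array(['c', 'b', 'c', 'c', 'b']),
    date=np.arange('2020-01-16', '2020-01-21',
                   dtype='datetime64[D]')[::-1].astype('datetime64[ns]'),
    x=np.random.randn(5),
    y=np.random.randint(0, 100, 5).astype('int64'),
)

# For each category, take x and y at the maximum date by reverse-ordering
# the cast date; the cast dtype matches the value column's dtype.
dfg = df.groupby('category').agg({
    'x': vx.agg.first('x', '-date.astype("float")'),
    'y': vx.agg.first('y', '-date.astype("int")'),
})
print(dfg)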
Thanks


Penacillin commented May 28, 2021

Yeah, there's definitely an issue around types in aggregates.
Example:

import numpy as np
import vaex

num = 59
adf = vaex.from_arrays(
    ts=np.arange(1, num + 1),
    ts2=np.arange('2005-02-01T00:00:00', '2005-02-01T00:00:59',
                  dtype='datetime64[s]'),
    x=np.random.randint(1, 1000, num),
    y=np.random.randint(1, 1000, num).astype(np.int64),
)
print(adf.dtypes)
print(adf)
print(adf.first(adf.x, adf.ts))
print(adf.first(adf.x, adf.ts2))
print(adf.first(adf.y, adf.ts))
print(adf.first(adf.y, adf.ts2))

Output:

ts             int32
ts2    datetime64[s]
x              int32
y              int64
dtype: object
#    ts    ts2                  x    y
0    1     2005-02-01 00:00:00  493  196
1    2     2005-02-01 00:00:01  924  204
2    3     2005-02-01 00:00:02  402  227
3    4     2005-02-01 00:00:03  156  378
4    5     2005-02-01 00:00:04  140  178
...  ...   ...                  ...  ...
54   55    2005-02-01 00:00:54  999  11
55   56    2005-02-01 00:00:55  10   940
56   57    2005-02-01 00:00:56  706  801
57   58    2005-02-01 00:00:57  680  511
58   59    2005-02-01 00:00:58  556  35
493
924
905
196
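The expected result in all four cases is the first row's value (493 for x, 196 for y), but the two calls where the value and ordering columns have different widths (x ordered by ts2, and y ordered by ts) return values from other rows. A possible workaround, following markbarna's observation above, is to cast the ordering expression to the value column's dtype (a sketch, not verified against a particular vaex version):

# Hypothetical workaround: make the ordering expression's dtype match
# the value column's dtype before aggregating.
print(adf.first(adf.x, adf.ts2.astype('int32')))  # x is int32
print(adf.first(adf.y, adf.ts.astype('int64')))   # y is int64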


matthiasho commented Feb 20, 2022

Hi Maarten,

Yes, we definitely need 'last'. It is actually pretty important. Please consider implementing it in the next version; I don't suppose it will have much memory impact.

I am unable to reverse-order the date by changing the order_expression as suggested by markbarna. Apart from groupby, I am also using BinnerTime: I am hoping to get the temperature reading at the start (first) and end (last) of every hourly interval, roughly as sketched below.
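
For reference, a sketch of the pattern being attempted here (the column names timestamp and temperature are hypothetical, and whether the reverse-order trick composes with BinnerTime is exactly what is in question):

import numpy as np
import vaex

# One reading per minute over four hours (toy data).
df = vaex.from_arrays(
    timestamp=np.arange('2022-02-20T00:00', '2022-02-20T04:00',
                        np.timedelta64(1, 'm'), dtype='datetime64[s]'),
    temperature=np.random.uniform(15.0, 25.0, 240),
)

hourly = df.groupby(vaex.BinnerTime(df.timestamp, resolution='h')).agg({
    # first reading of each hour (ascending time order)
    'temp_first': vaex.agg.first('temperature', 'timestamp.astype("float")'),
    # attempted "last": reverse the order by negating the cast timestamp
    'temp_last': vaex.agg.first('temperature', '-timestamp.astype("float")'),
})
print(hourly)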

Thank you

@yohplala (Contributor) commented

> Yes, we definitely need 'last'. It is actually pretty important.

While I am not sure such a message is very constructive, please be aware that @maartenbreddels has opened a PR for this: #1848.
He is requesting some testing, which I have unfortunately not been able to provide.
If you would like to give it a try, that may speed things up.

Best,


matthiasho commented Feb 21, 2022 via email

@JovanVeljanoski (Member) commented

> He is requesting some testing, which I have unfortunately not been able to provide.

It is on my to-do list, but other things had priority. Perhaps this coming week if I have the time.

@maartenbreddels (Member) commented

Note that we merged #1848, so in the next release (4.9) we should have a better first/last aggregator.
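
For anyone landing here later, a sketch of what this enables (assuming vaex >= 4.9 exposes vaex.agg.last alongside vaex.agg.first, per the merged PR; not verified here):

import numpy as np
import vaex

df = vaex.from_arrays(
    category=np.array(['a', 'b', 'a', 'b']),
    date=np.arange('2020-01-01', '2020-01-05',
                   dtype='datetime64[D]').astype('datetime64[ns]'),
    x=np.array([1.0, 2.0, 3.0, 4.0]),
)

# With a proper last aggregator, the negated-timestamp workaround from
# earlier in this thread should no longer be needed.
dfg = df.groupby('category').agg({
    'x_first': vaex.agg.first('x', order_expression='date'),
    'x_last': vaex.agg.last('x', order_expression='date'),
})
print(dfg)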


matthiasho commented Mar 8, 2022 via email
