
groupby-agg all columns based on one column #828

Open
markbarna opened this issue Jun 12, 2020 · 9 comments

Comments


markbarna commented Jun 12, 2020

Hello,
I have a dataset that looks roughly like this:

#    category    date        x                     y
0    c           2020-01-20  0.6985333564957753    61
1    b           2020-01-19  0.011782836532168281  63
2    c           2020-01-18  -1.5929651389254533   16
3    c           2020-01-17  -1.614997608433403    97
4    b           2020-01-16  0.5646888996607152    23
...  ...         ...         ...                   ...
15   c           2020-01-05  -0.3777229791754252   55
16   b           2020-01-04  -1.8989497908039141   44
17   c           2020-01-03  0.5762175737242692    99
18   c           2020-01-02  1.4724334531958192    22
19   a           2020-01-01  1.1491668587221784    2

I need to group by category and then select the values for x and y that correspond to the maximum date within each group. With Pandas, I would do:

df.loc[df.groupby('category')['date'].idxmax()]

Is there a way to achieve this using Vaex?
I was able to select the minimum date within each group using:

df.groupby('category').agg({'x': vx.agg.first('x', 'date'),
                            'y': vx.agg.first('y', 'date')})

The documentation for vaex.agg.first doesn't indicate exactly how to use the order_expression parameter, so I wasn't sure whether there is a way to reverse-order the date expression. Is this possible?

Thank you

@maartenbreddels (Member) commented

Hi,

Yes, first would be a good way to do this. Initially I didn't think we'd need 'last', but negating a date isn't something numpy likes. Maybe you can try:

dfg = df.groupby('category').agg({'x': vx.agg.first('x', -df.date.astype('int64')), ..

Indices are a bit problematic, as explained in #579.

Regards,

Maarten

@markbarna (Author) commented

Hi Maarten,
Thanks for your help. Your suggestion worked with some changes. First, I had to enclose the entire order_expression in quotes; otherwise I get: TypeError: unhashable type: 'Expression'. Also, it seems that the datatype the date column is cast to must match the datatype of the column you're selecting from: float for column x and int for column y, in this case. Here is what worked:

dfg = df.groupby('category').agg({'x': vx.agg.first('x', '-date.astype("float")'),
                                  'y': vx.agg.first('y', '-date.astype("int")')})

Alternatively, this also worked to accomplish the same thing:

dfg = df.groupby('category').agg({'date': 'max'})
df = df.join(dfg, on='category', rsuffix='_max')
df = df[df['date'] == df['date_max']].drop(['date_max', 'category_max'])

Obviously, it's not as concise, but I found it to still be much faster than using Pandas.
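For reference, here is a self-contained version of the agg-based approach on toy data (a sketch only, not verified against any particular vaex version; the quoted order_expression and the matching casts follow the observations above):

import numpy as np
import vaex as vx

# Toy data in the same shape as the original example.
df = vx.from_arrays(
    category=np.array(['c', 'b', 'c', 'c', 'b']),
    date=np.arange('2020-01-16', '2020-01-21',
                   dtype='datetime64[D]')[::-1].astype('datetime64[ns]'),
    x=np.random.randn(5),
    y=np.random.randint(0, 100, 5).astype('int64'),
)

# For each category, take x and y at the maximum date by reverse-ordering
# the cast date; the cast dtype matches the value column's dtype.
dfg = df.groupby('category').agg({
    'x': vx.agg.first('x', '-date.astype("float")'),
    'y': vx.agg.first('y', '-date.astype("int")'),
})
print(dfg)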
Thanks


Penacillin commented May 28, 2021

Yeah, there's definitely an issue around types in aggregates.
Example:

import numpy as np
import vaex

num = 59
adf = vaex.from_arrays(
    ts=np.arange(1, num + 1),
    ts2=np.arange('2005-02-01T00:00:00', '2005-02-01T00:00:59',
                  dtype='datetime64[s]'),
    x=np.random.randint(1, 1000, num),
    y=np.random.randint(1, 1000, num).astype(np.int64),
)
print(adf.dtypes)
print(adf)
print(adf.first(adf.x, adf.ts))
print(adf.first(adf.x, adf.ts2))
print(adf.first(adf.y, adf.ts))
print(adf.first(adf.y, adf.ts2))

Output:

ts             int32
ts2    datetime64[s]
x              int32
y              int64
dtype: object
#    ts    ts2                  x    y
0    1     2005-02-01 00:00:00  493  196
1    2     2005-02-01 00:00:01  924  204
2    3     2005-02-01 00:00:02  402  227
3    4     2005-02-01 00:00:03  156  378
4    5     2005-02-01 00:00:04  140  178
...  ...   ...                  ...  ...
54   55    2005-02-01 00:00:54  999  11
55   56    2005-02-01 00:00:55  10   940
56   57    2005-02-01 00:00:56  706  801
57   58    2005-02-01 00:00:57  680  511
58   59    2005-02-01 00:00:58  556  35
493
924
905
196
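The expected result in all four cases is the first row's value (493 for x, 196 for y), but the two calls where the value and ordering columns have different widths (x ordered by ts2, and y ordered by ts) return values from other rows. A possible workaround, following markbarna's observation above, is to cast the ordering expression to the value column's dtype (a sketch, not verified against a particular vaex version):

# Hypothetical workaround: make the ordering expression's dtype match
# the value column's dtype before aggregating.
print(adf.first(adf.x, adf.ts2.astype('int32')))  # x is int32
print(adf.first(adf.y, adf.ts.astype('int64')))   # y is int64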


matthiasho commented Feb 20, 2022

Hi Maarten,

Yes, we definitely need 'last'. It is actually pretty important. Please consider implementing it in the next version; I don't suppose it will have much memory impact.

I am unable to reverse-order the date by changing the order_expression as suggested by markbarna. Apart from groupby, I am also using BinnerTime: I am hoping to get the temperature reading at the start (first) and end (last) of every hourly interval, roughly as sketched below.
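
For reference, a sketch of the pattern being attempted here (the column names timestamp and temperature are hypothetical, and whether the reverse-order trick composes with BinnerTime is exactly what is in question):

import numpy as np
import vaex

# One reading per minute over four hours (toy data).
df = vaex.from_arrays(
    timestamp=np.arange('2022-02-20T00:00', '2022-02-20T04:00',
                        np.timedelta64(1, 'm'), dtype='datetime64[s]'),
    temperature=np.random.uniform(15.0, 25.0, 240),
)

hourly = df.groupby(vaex.BinnerTime(df.timestamp, resolution='h')).agg({
    # first reading of each hour (ascending time order)
    'temp_first': vaex.agg.first('temperature', 'timestamp.astype("float")'),
    # attempted "last": reverse the order by negating the cast timestamp
    'temp_last': vaex.agg.first('temperature', '-timestamp.astype("float")'),
})
print(hourly)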

Thank you

@yohplala (Contributor) commented

> Yes, we definitely need 'last'. It is actually pretty important.

While I am not sure such a message is very constructive, please be aware that @maartenbreddels has opened a PR for this: #1848.
He is requesting some testing, which I have unfortunately not been able to provide.
If you would like to give it a try, that may speed things up.

Best,


matthiasho commented Feb 21, 2022 via email

@JovanVeljanoski (Member) commented

> He is requesting some testing, which I have unfortunately not been able to provide.

It is on my to-do list, but other things had priority. Perhaps this coming week if I have the time.

@maartenbreddels (Member) commented

Note that we merged #1848, so in the next release (4.9) we should have a better first/last aggregator.
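
For anyone landing here later, a sketch of what this enables (assuming vaex >= 4.9 exposes vaex.agg.last alongside vaex.agg.first, per the merged PR; not verified here):

import numpy as np
import vaex

df = vaex.from_arrays(
    category=np.array(['a', 'b', 'a', 'b']),
    date=np.arange('2020-01-01', '2020-01-05',
                   dtype='datetime64[D]').astype('datetime64[ns]'),
    x=np.array([1.0, 2.0, 3.0, 4.0]),
)

# With a proper last aggregator, the negated-timestamp workaround from
# earlier in this thread should no longer be needed.
dfg = df.groupby('category').agg({
    'x_first': vaex.agg.first('x', order_expression='date'),
    'x_last': vaex.agg.last('x', order_expression='date'),
})
print(dfg)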


matthiasho commented Mar 8, 2022 via email
