-
Notifications
You must be signed in to change notification settings - Fork 598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ last aggregor, and first and last now can have mixed types #1848
Conversation
0611a3d
to
7fe0475
Compare
Hi @maartenbreddels , yes I can try to have a look this week-end, but not sure to understand everything. Thanks a lot for taking these points about |
Hi @maartenbreddels , Doing a test with this branch, there are still things that I find surprising. 1/ Testing inversion of row index, as posted in #1378 import vaex as vx
import pandas as pd
# Test 1
vdf = vx.from_pandas(pd.DataFrame({'a':[10,10,20], 'b':[21,11,12]}))
vdf["row_index"] = vx.vrange(0, len(vdf))
vdf
Out[1]:
# a b row_index
0 10 21 0
1 10 11 1
2 20 12 2
# Setup 'agg_last', data is ordered according row index.
agg_last = {col: vx.agg.first(col, "-row_index") for col in vdf.get_column_names()}
# Drop duplicates based on 'a', keeping last.
vdf.groupby('a', agg=agg_last)
Out[2]:
# a b row_index
0 20 12 2
1 10 11 1 Values are correct, but I am expecting the # a b row_index
0 10 11 1
1 20 12 2 Is the behavior noticed normal? 2/ Testing last # Test 2
vdf = vx.from_pandas(pd.DataFrame({'a':[22,10,10,22,31], 'b':[8,21,11,12,43]}))
vdf["row_index"] = vx.vrange(0, len(vdf))
vdf
Out[3]:
# a b row_index
0 22 8 0
1 10 21 1
2 10 11 2
3 22 12 3
4 31 43 4
vdf.groupby('a', agg={'c': vx.agg.last('b')})
Out[4]:
# a c
0 31 43
1 22 12
2 10 11 Values are correct, but I am expecting the # a c
0 10 11
1 22 12
2 31 43 Is the behavior noticed normal? 3/ Willing to test Without using those 2 parameters (and without selection for now), I get this result I consider quite weird. Please, can you help me to understand? vdf = vx.from_pandas(pd.DataFrame({'a':[22,10,10,22,31], 'b':[8,21,11,12,43]}))
vdf["row_index"] = vx.vrange(0, len(vdf))
vdf.last('b', 'row_index', binby=['a'])
Out[5]
array([ 11, 94914693182688, 311,
7318345117270239, -4281860073, -280710472531969,
28148060314075378, 27303412049444890, -280757702098694,
61924589365559453, 1970788713431039, 16607461712592895,
34902892829671423, 41659228561080319, -280624573054935,
59391215794389094, 53762601038577663, 12103423998558207,
-280981039677170, 82191492082237439, 5348741817040895,
63051146402398244, -280358273679361, -280341094203271,
27022920614084857, 1970320546005199, -280744832335651,
-281011120308206, -4279697174, -280238026194795,
77124143618719743, 23362418791415855, -280736225754837,
56859225195741461, 76561597405462527, 37155328285999103,
-281354708451070, 12666373951914050, -281200098869158,
62768915528482914, 52636610918809735, 48695514571145322,
-281264513810433, -280409817087985, -4289462113,
-280293860769490, -280444184624905, 35747317754626086,
63895133260742844, 14919114363437275, 56294995342065896,
77688356304257023, 58265315894952001, -65522,
-281457796841473, 39125167810216239, -4286185259,
-280637450092456, 43911363399057663, -4285923282,
-4285071361, 69244085527445503, -1,
19422245839372505, -280916630044673, 74309389562282113,
-281062659850241, -65227, 16607023625928703,
64457765376622654, -1, -65248,
36310271995609177, 12, -4287758058,
32933225120923819, 33778019424075775, 7600116446461963,
-280182188343297, -4287692510, 57420916723810303,
21111323332968447, -4289527616, 19704296345370743,
29836347531329535, 23080943815426047, 36028797018898617,
21956250788233215, -281328947822593, 83880079944319213,
-65301, 3659174697173203, 50665491528417367,
-280903735574267, 21955043897704480, -4292542313,
32088143060205567, -281440600915788, -280641753055233,
2815192166367237, 68398784414482472, -280349676994260,
-4293525323, -65342, 52072866411511924,
-281341821976577, 57140008701919231, 41377478720618495,
-281389059932161, 85851049029533800, 14636698788888636,
-4274716673, -280985332809729, 16044756572307455,
-280143530295297, -281307462434752, 40532392355823615,
-65536, 82473246216290428, 5066545289560063,
-281466386841386, 79375943432339465, 53480245574959344,
40814064901357567, 80220368362471478, -65465,
-4286054169, 51229291869831323]) |
groupy doesn't sort by default, if you need that, pass sort=True
Yes, these are uninitialized values, meaning there is no data and you get garbage it seems. Focus on the groupby+aggregator if you are confused about df.first/df.last Regards, Maarten |
Thanks Maarten for the quick feedback.
I don't request sorting (I just tested, and it sorts according 'a'). Now, for my own targeted use case, it happens that the datasets I am handling are ordered according 'a', so using
Ok, so focusing on groupby+aggregator, it would indeeed seems that selection does not work. # Test 3
vdf = vx.from_pandas(pd.DataFrame({'a':[22,10,10,22,31], 'b':[8,21,11,12,43]}))
vdf.select(vdf.a > 11)
vdf.groupby('a', agg={'c': vx.agg.last('b', selection=True)})
Out[48]:
# a c
0 31 43
1 22 12
2 10 11 With selection, I would expect: Out[48]:
# a c
0 31 43
1 22 12 Is this related to my setup? |
Vaex cannot hold the order of appearance, but it can sort. You can read how selections work in the docs, for instance the tutorial. |
I am sorry @maartenbreddels, I have not been a user of selection so far.
Which was basically the idea of my previous code snippet. # Test 3
vdf = vx.from_pandas(pd.DataFrame({'a':[22,10,10,22,31], 'b':[8,21,11,12,43]}))
vdf.select(vdf.a > 11)
vdf.groupby(vdf.evaluate('a', selection=True), agg={'c': vx.agg.last('b', selection=True)})
File "/home/yoh/Documents/code/vaex/packages/vaex-core/vaex/utils.py", line 640, in _ensure_string_from_expression
raise ValueError('%r is not of string or Expression type, but %r' % (expression, type(expression)))
ValueError: 22 is not of string or Expression type, but <class 'numpy.int64'> @maartenbreddels I feel I am making you lose your time here. Please, ignore my message. I am posting it so as to show that I do acknowledge your proposal for me to check this item and I thank you for that. However, it appears I am still too inexperienced in this to be of any help. I am sorry. |
I started making unit-tests for this PR. I stared by doing a test for
sometimes returns incorrect result. The expected result should be To reproduce: import sys
import os
sys.path.append(os.path.abspath("/home/user/vaex/tests"))
from common import *
ds = create_filtered()
for i in range(1000):
x = ds.last(ds.y, -ds.x, binby=[ds.x, ds.x+5], limits=[[0, 10], [5, 15]], shape=[2, 1]).tolist()
try:
np.testing.assert_array_almost_equal(
x,
[[0.], [0.]],
decimal=2)
except:
print(x) |
That's not unexpected I think, that's why I asked for tests, if it fails, great. |
Just added some additional tests. Some notes:
|
Also, if no order argument is given, the row number is used.
7e022c4
to
0d371b0
Compare
I've disabled the strings tests, but we should be able to put them back in once we implement those, lets's first get this in ok? |
Hi @maartenbreddels import vaex
import pandas as pd
start = "2022/03/01 2:00"
ts = pd.date_range(start=start, periods=4, freq='3H')
len_ts = len(ts)
pdf = pd.DataFrame({'val':range(len_ts), 'ts':ts})
vdf = vaex.from_pandas(pdf)
# '4h' binning
vdf.groupby(vaex.BinnerTime(vdf.ts, resolution='h', every=4), agg={'last': vaex.agg.last("val")}) Error File "/home/yoh/Documents/code/vaex/packages/vaex-core/vaex/dataframe.py", line 5282, in __getitem__
self.validate_expression(item)
File "/home/yoh/Documents/code/vaex/packages/vaex-core/vaex/dataframe.py", line 3386, in validate_expression
raise NameError(str(e)) from None
NameError: Column or variable 'v' does not exist. I get the same error with vdf.groupby(vaex.BinnerTime(vdf.ts, resolution='h', every=4), agg={'sum': vaex.agg.sum("val")})
Out[35]:
# ts sum
0 2022-03-01 02 1
1 2022-03-01 06 2
2 2022-03-01 10 3 |
Hmm, thanks for testing this, I think the error is on https://github.com/vaexio/vaex/pull/1848/files#diff-5d0bf7451a5252cb31d40c56df00367031e28f097cd7273e71376d50c2fc7b0fR487 Do you mind adding a unittests, and opening a PR, or if you spot the error you can try to fix it yourself (it's missing a list |
Also, if no order argument is given, the row number is used.
After #1847 it's easier to add new aggregators, so I took the time to fix first, and add last.
Replaces #1834
Closes #1378
cc @yohplala
TODO:
Either @JovanVeljanoski or @yohplala would you like to work on these last two points? If this shows up issues (I know selections are not implemented), I can fix those.