✨ last aggregor, and first and last now can have mixed types #1848

maartenbreddels · 2022-01-20T15:04:01Z

Also, if no order argument is given, the row number is used.

After #1847 it's easier to add new aggregators, so I took the time to fix first, and add last.

Replaces #1834
Closes #1378
cc @yohplala

TODO:

improve docstrings explaining this
test and implement selections, test missing/masked data

Either @JovanVeljanoski or @yohplala would you like to work on these last two points? If this shows up issues (I know selections are not implemented), I can fix those.

yohplala · 2022-01-21T13:30:48Z

Hi @maartenbreddels , yes I can try to have a look this week-end, but not sure to understand everything.
But this will definitely be a good exercise to learn new things about vaex :).
Regarding docstring, please, which functions are you expecting them for?
I can see that both last and first already have some. I will double-check them, not sure there is much more to say though. I will check deeper this week-end.

Thanks a lot for taking these points about last and mixed types @maartenbreddels!

yohplala · 2022-01-26T18:14:17Z

Hi @maartenbreddels ,

Doing a test with this branch, there are still things that I find surprising.

1/ Testing inversion of row index, as posted in #1378
-> still a question remaining.

import vaex as vx
import pandas as pd

# Test 1
vdf = vx.from_pandas(pd.DataFrame({'a':[10,10,20], 'b':[21,11,12]}))
vdf["row_index"] = vx.vrange(0, len(vdf))
vdf
Out[1]: 
  #    a    b    row_index
  0   10   21            0
  1   10   11            1
  2   20   12            2

# Setup 'agg_last', data is ordered according row index.
agg_last = {col: vx.agg.first(col, "-row_index") for col in vdf.get_column_names()}
# Drop duplicates based on 'a', keeping last.
vdf.groupby('a', agg=agg_last)
Out[2]: 
  #    a    b    row_index
  0   20   12            2
  1   10   11            1

Values are correct, but I am expecting the groupby to keep inital row order, meaning I am expecting:

  #    a    b    row_index
  0   10   11            1
  1   20   12            2

Is the behavior noticed normal?

2/ Testing last
-> same question than previous paragraph.

# Test 2
vdf = vx.from_pandas(pd.DataFrame({'a':[22,10,10,22,31], 'b':[8,21,11,12,43]}))
vdf["row_index"] = vx.vrange(0, len(vdf))
vdf
Out[3]: 
  #    a    b    row_index
  0   22    8            0
  1   10   21            1
  2   10   11            2
  3   22   12            3
  4   31   43            4

vdf.groupby('a', agg={'c': vx.agg.last('b')})
Out[4]: 
  #    a    c
  0   31   43
  1   22   12
  2   10   11

Values are correct, but I am expecting the groupby to keep inital row order, meaning I am expecting:

  #    a    c
  0   10   11
  1   22   12
  2   31   43

Is the behavior noticed normal?

3/ Willing to test select but....
not understanding parameters limits and shape for first or last aggregations and possibly resulting from that, not understanding obtained results.

Without using those 2 parameters (and without selection for now), I get this result I consider quite weird. Please, can you help me to understand?

vdf = vx.from_pandas(pd.DataFrame({'a':[22,10,10,22,31], 'b':[8,21,11,12,43]}))
vdf["row_index"] = vx.vrange(0, len(vdf))
vdf.last('b', 'row_index', binby=['a'])
Out[5]
array([               11,    94914693182688,               311,
        7318345117270239,       -4281860073,  -280710472531969,
       28148060314075378, 27303412049444890,  -280757702098694,
       61924589365559453,  1970788713431039, 16607461712592895,
       34902892829671423, 41659228561080319,  -280624573054935,
       59391215794389094, 53762601038577663, 12103423998558207,
        -280981039677170, 82191492082237439,  5348741817040895,
       63051146402398244,  -280358273679361,  -280341094203271,
       27022920614084857,  1970320546005199,  -280744832335651,
        -281011120308206,       -4279697174,  -280238026194795,
       77124143618719743, 23362418791415855,  -280736225754837,
       56859225195741461, 76561597405462527, 37155328285999103,
        -281354708451070, 12666373951914050,  -281200098869158,
       62768915528482914, 52636610918809735, 48695514571145322,
        -281264513810433,  -280409817087985,       -4289462113,
        -280293860769490,  -280444184624905, 35747317754626086,
       63895133260742844, 14919114363437275, 56294995342065896,
       77688356304257023, 58265315894952001,            -65522,
        -281457796841473, 39125167810216239,       -4286185259,
        -280637450092456, 43911363399057663,       -4285923282,
             -4285071361, 69244085527445503,                -1,
       19422245839372505,  -280916630044673, 74309389562282113,
        -281062659850241,            -65227, 16607023625928703,
       64457765376622654,                -1,            -65248,
       36310271995609177,                12,       -4287758058,
       32933225120923819, 33778019424075775,  7600116446461963,
        -280182188343297,       -4287692510, 57420916723810303,
       21111323332968447,       -4289527616, 19704296345370743,
       29836347531329535, 23080943815426047, 36028797018898617,
       21956250788233215,  -281328947822593, 83880079944319213,
                  -65301,  3659174697173203, 50665491528417367,
        -280903735574267, 21955043897704480,       -4292542313,
       32088143060205567,  -281440600915788,  -280641753055233,
        2815192166367237, 68398784414482472,  -280349676994260,
             -4293525323,            -65342, 52072866411511924,
        -281341821976577, 57140008701919231, 41377478720618495,
        -281389059932161, 85851049029533800, 14636698788888636,
             -4274716673,  -280985332809729, 16044756572307455,
        -280143530295297,  -281307462434752, 40532392355823615,
                  -65536, 82473246216290428,  5066545289560063,
        -281466386841386, 79375943432339465, 53480245574959344,
       40814064901357567, 80220368362471478,            -65465,
             -4286054169, 51229291869831323])

maartenbreddels · 2022-01-26T20:49:46Z

Values are correct, but I am expecting the groupby to keep inital row order, meaning I am expecting:

groupy doesn't sort by default, if you need that, pass sort=True

I get this result I consider quite weird. Please, can you help me to understand?

Yes, these are uninitialized values, meaning there is no data and you get garbage it seems. Focus on the groupby+aggregator if you are confused about df.first/df.last

Regards,

Maarten

yohplala · 2022-01-26T21:05:45Z

Thanks Maarten for the quick feedback.

groupy doesn't sort by default, if you need that, pass sort=True

I don't request sorting (I just tested, and it sorts according 'a').
I have in mind that 'expected behavior' would be that initial order of rows (as they appear in the initial dataframe) is preserved.

Now, for my own targeted use case, it happens that the datasets I am handling are ordered according 'a', so using sort=True is fine, but means as I understand additional processing, while initial dataframe is anyway already sorted.

Yes, these are uninitialized values, meaning there is no data and you get garbage it seems. Focus on the groupby+aggregator if you are confused about df.first/df.last

Ok, so focusing on groupby+aggregator, it would indeeed seems that selection does not work.
But here again, I am not so sure about my setup to use selection, though I have no error message.

# Test 3
vdf = vx.from_pandas(pd.DataFrame({'a':[22,10,10,22,31], 'b':[8,21,11,12,43]}))
vdf.select(vdf.a > 11)
vdf.groupby('a', agg={'c': vx.agg.last('b', selection=True)})
Out[48]: 
  #    a    c
  0   31   43
  1   22   12
  2   10   11

With selection, I would expect:

Out[48]: 
  #    a    c
  0   31   43
  1   22   12

Is this related to my setup?

maartenbreddels · 2022-01-31T18:40:44Z

I don't request sorting (I just tested, and it sorts according 'a').

Vaex cannot hold the order of appearance, but it can sort.

You can read how selections work in the docs, for instance the tutorial.

yohplala · 2022-01-31T19:37:39Z

You can read how selections work in the docs, for instance the tutorial.

I am sorry @maartenbreddels, I have not been a user of selection so far.
What I understood from the selection tutorial is that selection is managed in 2 steps:

1st you define it,
then you can take it into account if the method you are using next has the parameter selection that can be set to True.

Which was basically the idea of my previous code snippet.
Now, if I copy more precisely the methods used in the tutorial, this means I need to use evaluate ?
I tried so, but this is obviously not the right step.

# Test 3
vdf = vx.from_pandas(pd.DataFrame({'a':[22,10,10,22,31], 'b':[8,21,11,12,43]}))
vdf.select(vdf.a > 11)
vdf.groupby(vdf.evaluate('a', selection=True), agg={'c': vx.agg.last('b', selection=True)})


  File "/home/yoh/Documents/code/vaex/packages/vaex-core/vaex/utils.py", line 640, in _ensure_string_from_expression
    raise ValueError('%r is not of string or Expression type, but %r' % (expression, type(expression)))

ValueError: 22 is not of string or Expression type, but <class 'numpy.int64'>

@maartenbreddels I feel I am making you lose your time here. Please, ignore my message. I am posting it so as to show that I do acknowledge your proposal for me to check this item and I thank you for that. However, it appears I am still too inexperienced in this to be of any help. I am sorry.
Bests,

JovanVeljanoski · 2022-02-21T17:03:23Z

@maartenbreddels

I started making unit-tests for this PR. I stared by doing a test for last, which is just the basic test for first modified. I noticed that this line:

ds.last(ds.y, -ds.x, binby=[ds.x, ds.x+5], limits=[[0, 10], [5, 15]], shape=[2, 1]).tolist()

sometimes returns incorrect result. The expected result should be [[0], [0]] but sometimes [[nan], [0]] is returned, and other times other very large non-zero numbers (or very small, as in large but negative).

To reproduce:

import sys 
import os
sys.path.append(os.path.abspath("/home/user/vaex/tests"))
from common import *

ds = create_filtered()

for i in range(1000):
    x = ds.last(ds.y, -ds.x, binby=[ds.x, ds.x+5], limits=[[0, 10], [5, 15]], shape=[2, 1]).tolist()
    try:
        np.testing.assert_array_almost_equal(
            x,
            [[0.], [0.]], 
            decimal=2)
    except:
        print(x)

maartenbreddels · 2022-02-21T17:27:24Z

That's not unexpected I think, that's why I asked for tests, if it fails, great.

JovanVeljanoski · 2022-02-21T23:23:36Z

Just added some additional tests. Some notes:

would be nice to support ordering of string columns (added a test for it).
Selections do not seem to work (added tests)
in dataframe.py first() contains some code that is not being executed
Improved / added some docstrings.

Also, if no order argument is given, the row number is used.

maartenbreddels · 2022-03-03T12:11:11Z

I've disabled the strings tests, but we should be able to put them back in once we implement those, lets's first get this in ok?

yohplala · 2022-03-05T21:05:41Z

Hi @maartenbreddels
Having installed last update of master (integrating your work), I have a weird behavior: name of column to aggregate is reduced to a single letter when initiating the aggregation, and as a result, the column is not found (if I understand correctly the error message)

import vaex
import pandas as pd

start = "2022/03/01 2:00"
ts = pd.date_range(start=start, periods=4, freq='3H')
len_ts = len(ts)
pdf = pd.DataFrame({'val':range(len_ts), 'ts':ts})
vdf = vaex.from_pandas(pdf)

# '4h' binning
vdf.groupby(vaex.BinnerTime(vdf.ts, resolution='h', every=4), agg={'last': vaex.agg.last("val")})

Error

  File "/home/yoh/Documents/code/vaex/packages/vaex-core/vaex/dataframe.py", line 5282, in __getitem__
    self.validate_expression(item)

  File "/home/yoh/Documents/code/vaex/packages/vaex-core/vaex/dataframe.py", line 3386, in validate_expression
    raise NameError(str(e)) from None

NameError: Column or variable 'v' does not exist.

I get the same error with vaex.agg.first.
However, if I switch to vaex.agg.sum for instance, it is running ok.

vdf.groupby(vaex.BinnerTime(vdf.ts, resolution='h', every=4), agg={'sum': vaex.agg.sum("val")})
Out[35]: 
  #  ts               sum
  0  2022-03-01 02      1
  1  2022-03-01 06      2
  2  2022-03-01 10      3

maartenbreddels · 2022-03-07T15:18:28Z

Hmm, thanks for testing this, I think the error is on https://github.com/vaexio/vaex/pull/1848/files#diff-5d0bf7451a5252cb31d40c56df00367031e28f097cd7273e71376d50c2fc7b0fR487

Do you mind adding a unittests, and opening a PR, or if you spot the error you can try to fix it yourself (it's missing a list [...]).

maartenbreddels force-pushed the feat_agg_first_last_mixed branch from 0611a3d to 7fe0475 Compare January 20, 2022 15:40

yohplala mentioned this pull request Jan 21, 2022

[NOT TO BE MERGED] Agg first / cast or raise error in case expression & order_expression are not the same dtypes #1834

Closed

yohplala mentioned this pull request Feb 20, 2022

groupby-agg all columns based on one column #828

Open

JovanVeljanoski mentioned this pull request Feb 22, 2022

Feat: add skew and kurtosis aggregators. #1946

Merged

4 tasks

✨ last aggregor, and first and last now can have mixed types

0d371b0

Also, if no order argument is given, the row number is used.

maartenbreddels force-pushed the feat_agg_first_last_mixed branch from 7e022c4 to 0d371b0 Compare March 3, 2022 10:38

maartenbreddels merged commit 7b3d3cf into master Mar 3, 2022

JovanVeljanoski mentioned this pull request Mar 8, 2022

Fix for first/last agg #1962

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

✨ last aggregor, and first and last now can have mixed types #1848

✨ last aggregor, and first and last now can have mixed types #1848

maartenbreddels commented Jan 20, 2022 •

edited

Loading

yohplala commented Jan 21, 2022

yohplala commented Jan 26, 2022

maartenbreddels commented Jan 26, 2022

yohplala commented Jan 26, 2022 •

edited

Loading

maartenbreddels commented Jan 31, 2022

yohplala commented Jan 31, 2022 •

edited

Loading

JovanVeljanoski commented Feb 21, 2022

maartenbreddels commented Feb 21, 2022

JovanVeljanoski commented Feb 21, 2022

maartenbreddels commented Mar 3, 2022

yohplala commented Mar 5, 2022

maartenbreddels commented Mar 7, 2022

✨ last aggregor, and first and last now can have mixed types #1848

✨ last aggregor, and first and last now can have mixed types #1848

Conversation

maartenbreddels commented Jan 20, 2022 • edited Loading

yohplala commented Jan 21, 2022

yohplala commented Jan 26, 2022

maartenbreddels commented Jan 26, 2022

yohplala commented Jan 26, 2022 • edited Loading

maartenbreddels commented Jan 31, 2022

yohplala commented Jan 31, 2022 • edited Loading

JovanVeljanoski commented Feb 21, 2022

maartenbreddels commented Feb 21, 2022

JovanVeljanoski commented Feb 21, 2022

maartenbreddels commented Mar 3, 2022

yohplala commented Mar 5, 2022

maartenbreddels commented Mar 7, 2022

maartenbreddels commented Jan 20, 2022 •

edited

Loading

yohplala commented Jan 26, 2022 •

edited

Loading

yohplala commented Jan 31, 2022 •

edited

Loading