Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ last aggregor, and first and last now can have mixed types #1848

Merged
merged 1 commit into from
Mar 3, 2022

Conversation

maartenbreddels
Copy link
Member

@maartenbreddels maartenbreddels commented Jan 20, 2022

Also, if no order argument is given, the row number is used.

After #1847 it's easier to add new aggregators, so I took the time to fix first, and add last.

Replaces #1834
Closes #1378
cc @yohplala

TODO:

  1. improve docstrings explaining this
  2. test and implement selections, test missing/masked data

Either @JovanVeljanoski or @yohplala would you like to work on these last two points? If this shows up issues (I know selections are not implemented), I can fix those.

@yohplala
Copy link
Contributor

Hi @maartenbreddels , yes I can try to have a look this week-end, but not sure to understand everything.
But this will definitely be a good exercise to learn new things about vaex :).
Regarding docstring, please, which functions are you expecting them for?
I can see that both last and first already have some. I will double-check them, not sure there is much more to say though. I will check deeper this week-end.

Thanks a lot for taking these points about last and mixed types @maartenbreddels!

@yohplala
Copy link
Contributor

Hi @maartenbreddels ,

Doing a test with this branch, there are still things that I find surprising.

1/ Testing inversion of row index, as posted in #1378
-> still a question remaining.

import vaex as vx
import pandas as pd

# Test 1
vdf = vx.from_pandas(pd.DataFrame({'a':[10,10,20], 'b':[21,11,12]}))
vdf["row_index"] = vx.vrange(0, len(vdf))
vdf
Out[1]: 
  #    a    b    row_index
  0   10   21            0
  1   10   11            1
  2   20   12            2

# Setup 'agg_last', data is ordered according row index.
agg_last = {col: vx.agg.first(col, "-row_index") for col in vdf.get_column_names()}
# Drop duplicates based on 'a', keeping last.
vdf.groupby('a', agg=agg_last)
Out[2]: 
  #    a    b    row_index
  0   20   12            2
  1   10   11            1

Values are correct, but I am expecting the groupby to keep inital row order, meaning I am expecting:

  #    a    b    row_index
  0   10   11            1
  1   20   12            2

Is the behavior noticed normal?

2/ Testing last
-> same question than previous paragraph.

# Test 2
vdf = vx.from_pandas(pd.DataFrame({'a':[22,10,10,22,31], 'b':[8,21,11,12,43]}))
vdf["row_index"] = vx.vrange(0, len(vdf))
vdf
Out[3]: 
  #    a    b    row_index
  0   22    8            0
  1   10   21            1
  2   10   11            2
  3   22   12            3
  4   31   43            4

vdf.groupby('a', agg={'c': vx.agg.last('b')})
Out[4]: 
  #    a    c
  0   31   43
  1   22   12
  2   10   11

Values are correct, but I am expecting the groupby to keep inital row order, meaning I am expecting:

  #    a    c
  0   10   11
  1   22   12
  2   31   43

Is the behavior noticed normal?

3/ Willing to test select but....
not understanding parameters limits and shape for first or last aggregations ​and possibly resulting from that, not understanding obtained results.

Without using those 2 parameters (and without selection for now), I get this result I consider quite weird. Please, can you help me to understand?

vdf = vx.from_pandas(pd.DataFrame({'a':[22,10,10,22,31], 'b':[8,21,11,12,43]}))
vdf["row_index"] = vx.vrange(0, len(vdf))
vdf.last('b', 'row_index', binby=['a'])
Out[5]
array([               11,    94914693182688,               311,
        7318345117270239,       -4281860073,  -280710472531969,
       28148060314075378, 27303412049444890,  -280757702098694,
       61924589365559453,  1970788713431039, 16607461712592895,
       34902892829671423, 41659228561080319,  -280624573054935,
       59391215794389094, 53762601038577663, 12103423998558207,
        -280981039677170, 82191492082237439,  5348741817040895,
       63051146402398244,  -280358273679361,  -280341094203271,
       27022920614084857,  1970320546005199,  -280744832335651,
        -281011120308206,       -4279697174,  -280238026194795,
       77124143618719743, 23362418791415855,  -280736225754837,
       56859225195741461, 76561597405462527, 37155328285999103,
        -281354708451070, 12666373951914050,  -281200098869158,
       62768915528482914, 52636610918809735, 48695514571145322,
        -281264513810433,  -280409817087985,       -4289462113,
        -280293860769490,  -280444184624905, 35747317754626086,
       63895133260742844, 14919114363437275, 56294995342065896,
       77688356304257023, 58265315894952001,            -65522,
        -281457796841473, 39125167810216239,       -4286185259,
        -280637450092456, 43911363399057663,       -4285923282,
             -4285071361, 69244085527445503,                -1,
       19422245839372505,  -280916630044673, 74309389562282113,
        -281062659850241,            -65227, 16607023625928703,
       64457765376622654,                -1,            -65248,
       36310271995609177,                12,       -4287758058,
       32933225120923819, 33778019424075775,  7600116446461963,
        -280182188343297,       -4287692510, 57420916723810303,
       21111323332968447,       -4289527616, 19704296345370743,
       29836347531329535, 23080943815426047, 36028797018898617,
       21956250788233215,  -281328947822593, 83880079944319213,
                  -65301,  3659174697173203, 50665491528417367,
        -280903735574267, 21955043897704480,       -4292542313,
       32088143060205567,  -281440600915788,  -280641753055233,
        2815192166367237, 68398784414482472,  -280349676994260,
             -4293525323,            -65342, 52072866411511924,
        -281341821976577, 57140008701919231, 41377478720618495,
        -281389059932161, 85851049029533800, 14636698788888636,
             -4274716673,  -280985332809729, 16044756572307455,
        -280143530295297,  -281307462434752, 40532392355823615,
                  -65536, 82473246216290428,  5066545289560063,
        -281466386841386, 79375943432339465, 53480245574959344,
       40814064901357567, 80220368362471478,            -65465,
             -4286054169, 51229291869831323])

@maartenbreddels
Copy link
Member Author

Values are correct, but I am expecting the groupby to keep inital row order, meaning I am expecting:

groupy doesn't sort by default, if you need that, pass sort=True

I get this result I consider quite weird. Please, can you help me to understand?

Yes, these are uninitialized values, meaning there is no data and you get garbage it seems. Focus on the groupby+aggregator if you are confused about df.first/df.last

Regards,

Maarten

@yohplala
Copy link
Contributor

yohplala commented Jan 26, 2022

Thanks Maarten for the quick feedback.

groupy doesn't sort by default, if you need that, pass sort=True

I don't request sorting (I just tested, and it sorts according 'a').
I have in mind that 'expected behavior' would be that initial order of rows (as they appear in the initial dataframe) is preserved.

Now, for my own targeted use case, it happens that the datasets I am handling are ordered according 'a', so using sort=True is fine, but means as I understand additional processing, while initial dataframe is anyway already sorted.

Yes, these are uninitialized values, meaning there is no data and you get garbage it seems. Focus on the groupby+aggregator if you are confused about df.first/df.last

Ok, so focusing on groupby+aggregator, it would indeeed seems that selection does not work.
But here again, I am not so sure about my setup to use selection, though I have no error message.

# Test 3
vdf = vx.from_pandas(pd.DataFrame({'a':[22,10,10,22,31], 'b':[8,21,11,12,43]}))
vdf.select(vdf.a > 11)
vdf.groupby('a', agg={'c': vx.agg.last('b', selection=True)})
Out[48]: 
  #    a    c
  0   31   43
  1   22   12
  2   10   11

With selection, I would expect:

Out[48]: 
  #    a    c
  0   31   43
  1   22   12

Is this related to my setup?

@maartenbreddels
Copy link
Member Author

I don't request sorting (I just tested, and it sorts according 'a').

Vaex cannot hold the order of appearance, but it can sort.

You can read how selections work in the docs, for instance the tutorial.

@yohplala
Copy link
Contributor

yohplala commented Jan 31, 2022

You can read how selections work in the docs, for instance the tutorial.

I am sorry @maartenbreddels, I have not been a user of selection so far.
What I understood from the selection tutorial is that selection is managed in 2 steps:

  • 1st you define it,
  • then you can take it into account if the method you are using next has the parameter selection that can be set to True.

Which was basically the idea of my previous code snippet.
Now, if I copy more precisely the methods used in the tutorial, this means I need to use evaluate ?
I tried so, but this is obviously not the right step.

# Test 3
vdf = vx.from_pandas(pd.DataFrame({'a':[22,10,10,22,31], 'b':[8,21,11,12,43]}))
vdf.select(vdf.a > 11)
vdf.groupby(vdf.evaluate('a', selection=True), agg={'c': vx.agg.last('b', selection=True)})


  File "/home/yoh/Documents/code/vaex/packages/vaex-core/vaex/utils.py", line 640, in _ensure_string_from_expression
    raise ValueError('%r is not of string or Expression type, but %r' % (expression, type(expression)))

ValueError: 22 is not of string or Expression type, but <class 'numpy.int64'>

@maartenbreddels I feel I am making you lose your time here. Please, ignore my message. I am posting it so as to show that I do acknowledge your proposal for me to check this item and I thank you for that. However, it appears I am still too inexperienced in this to be of any help. I am sorry.
Bests,

@JovanVeljanoski
Copy link
Member

@maartenbreddels

I started making unit-tests for this PR. I stared by doing a test for last, which is just the basic test for first modified. I noticed that this line:

ds.last(ds.y, -ds.x, binby=[ds.x, ds.x+5], limits=[[0, 10], [5, 15]], shape=[2, 1]).tolist()

sometimes returns incorrect result. The expected result should be [[0], [0]] but sometimes [[nan], [0]] is returned, and other times other very large non-zero numbers (or very small, as in large but negative).

To reproduce:

import sys 
import os
sys.path.append(os.path.abspath("/home/user/vaex/tests"))
from common import *

ds = create_filtered()

for i in range(1000):
    x = ds.last(ds.y, -ds.x, binby=[ds.x, ds.x+5], limits=[[0, 10], [5, 15]], shape=[2, 1]).tolist()
    try:
        np.testing.assert_array_almost_equal(
            x,
            [[0.], [0.]], 
            decimal=2)
    except:
        print(x)

@maartenbreddels
Copy link
Member Author

That's not unexpected I think, that's why I asked for tests, if it fails, great.

@JovanVeljanoski
Copy link
Member

Just added some additional tests. Some notes:

  • would be nice to support ordering of string columns (added a test for it).
  • Selections do not seem to work (added tests)
  • in dataframe.py first() contains some code that is not being executed
  • Improved / added some docstrings.

Also, if no order argument is given, the row number is used.
@maartenbreddels maartenbreddels force-pushed the feat_agg_first_last_mixed branch from 7e022c4 to 0d371b0 Compare March 3, 2022 10:38
@maartenbreddels
Copy link
Member Author

I've disabled the strings tests, but we should be able to put them back in once we implement those, lets's first get this in ok?

@maartenbreddels maartenbreddels merged commit 7b3d3cf into master Mar 3, 2022
@yohplala
Copy link
Contributor

yohplala commented Mar 5, 2022

Hi @maartenbreddels
Having installed last update of master (integrating your work), I have a weird behavior: name of column to aggregate is reduced to a single letter when initiating the aggregation, and as a result, the column is not found (if I understand correctly the error message)

import vaex
import pandas as pd

start = "2022/03/01 2:00"
ts = pd.date_range(start=start, periods=4, freq='3H')
len_ts = len(ts)
pdf = pd.DataFrame({'val':range(len_ts), 'ts':ts})
vdf = vaex.from_pandas(pdf)

# '4h' binning
vdf.groupby(vaex.BinnerTime(vdf.ts, resolution='h', every=4), agg={'last': vaex.agg.last("val")})

Error

  File "/home/yoh/Documents/code/vaex/packages/vaex-core/vaex/dataframe.py", line 5282, in __getitem__
    self.validate_expression(item)

  File "/home/yoh/Documents/code/vaex/packages/vaex-core/vaex/dataframe.py", line 3386, in validate_expression
    raise NameError(str(e)) from None

NameError: Column or variable 'v' does not exist.

I get the same error with vaex.agg.first.
However, if I switch to vaex.agg.sum for instance, it is running ok.

vdf.groupby(vaex.BinnerTime(vdf.ts, resolution='h', every=4), agg={'sum': vaex.agg.sum("val")})
Out[35]: 
  #  ts               sum
  0  2022-03-01 02      1
  1  2022-03-01 06      2
  2  2022-03-01 10      3

@maartenbreddels
Copy link
Member Author

Hmm, thanks for testing this, I think the error is on https://github.com/vaexio/vaex/pull/1848/files#diff-5d0bf7451a5252cb31d40c56df00367031e28f097cd7273e71376d50c2fc7b0fR487

Do you mind adding a unittests, and opening a PR, or if you spot the error you can try to fix it yourself (it's missing a list [...]).

@JovanVeljanoski JovanVeljanoski mentioned this pull request Mar 8, 2022
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE-REQUEST] Create OHLC from tick data
3 participants