Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Reimplement and undeprecate DataFrame.lookup #39171

Closed
impredicative opened this issue Jan 14, 2021 · 26 comments
Closed

ENH: Reimplement and undeprecate DataFrame.lookup #39171

impredicative opened this issue Jan 14, 2021 · 26 comments
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@impredicative
Copy link

impredicative commented Jan 14, 2021

Is your feature request related to a problem?

I seriously get the impression that Pandas is actively being sabotaged. As a case in point, DataFrame.lookup was deprecated in v1.2. The problems with this are:

  1. The doc for DataFrame.lookup says to see itself for an example. Well, there is no example in the doc page. The only example I'm aware of is here instead which is a different page.
  2. The changelog for this deprecation points to GH18682 which looks to be wholly irrelevant to this change. A lot has gone wrong here.
  3. There is significant user code that already uses lookup. Why break it? If the current implementation of lookup is suboptimal, shouldn't it be optimized instead?
  4. It is extremely complicated to use melt when lookup works quite simply. For example, compare this simple answer using lookup with this complicated answer using melt.

Does nobody review changes, docs, and release notes anymore prior to the release? It looks this way.

Describe the solution you'd like

  1. Optimize lookup if attainable using melt or otherwise.
  2. Undeprecate lookup.
@impredicative impredicative added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 14, 2021
@jreback
Copy link
Contributor

jreback commented Jan 14, 2021

Does nobody review changes, docs, and release notes anymore prior to the release? It looks this way.

this is not a friendly way to ask for things.

DataFrame.lookup has seen NO love at all since its initial implemention. It is a duplicative and unmaintained function. This is also polluting the namespace and is not in any way integrated to all of the other indexers. We do not need N ways to do the same thing.

@jreback jreback closed this as completed Jan 14, 2021
@jreback
Copy link
Contributor

jreback commented Jan 14, 2021

#35224 is the PR and #18262 is the issue

@jreback
Copy link
Contributor

jreback commented Jan 14, 2021

@impredicative you are welcome to contribute patches, e.g. doc fixes or other things.

@erfannariman
Copy link
Member

erfannariman commented Jan 25, 2021

@impredicative out of curiosity, what is exactly difficult about the new proposed method, which is also significantly faster since DataFrame.lookup was a for-loop on the background.

I agree that we can fix the documentation in the deprecated message and in the whatsnew though.

@jreback Would you be open to un-deprecate DataFrame.lookup and implement the melt + loc method on the background? I don't say it was an extremely popular method, but on SO I saw it being used sometimes.

@impredicative
Copy link
Author

impredicative commented Jan 25, 2021

@erfannariman What's difficult, for example, is how complicated this answer using melt is relative to how simple this answer using lookup is. The answer using melt requires three complicated method calls relative to the answer using lookup which is a single simple method call. In summary, just as you're saying, DataFrame.lookup could've been reimplemented using melt, thereby satisfying everyone.

@erfannariman
Copy link
Member

@erfannariman What's difficult, for example, is how complicated this answer using melt is relative to how simple this answer using lookup is. The answer using melt requires three complicated method calls relative to the answer using lookup which is a single simple method call. In summary, just as you're saying, DataFrame.lookup could've been reimplemented using melt, thereby satisfying everyone.

I don't disagree with your point, although I don't think the proposed method is persé difficult, but I might be biased. I would not be against un-deprecating DataFrame.lookup. I will await @jreback response.

@impredicative

This comment has been minimized.

@jreback
Copy link
Contributor

jreback commented Jan 25, 2021

@impredicative there is almost no usage of lookup AFAICT. very very few issues / tests / SO entries. If you have a valid, real work very common then am happy to show a doc example / recipe. Having a method must be a high bar.

@impredicative
Copy link
Author

impredicative commented Jan 26, 2021

@jreback A google.com search for "df.lookup" site:stackoverflow.com lists 462 hits. That's more than for df.bfill (208) and df.ffill (375). Will you now be deprecating these two methods as well? What is your cutoff?

@jreback
Copy link
Contributor

jreback commented Jan 26, 2021

@impredicative you are not comparing apples to apples here. try with indexing operators, we already have .loc, .iloc, .at, and .iat, not to mention partial slicing and of course []. which one shall we cut to keep .lookup????

I in fact have an issue to remove .iat and .at. We simply do not need this much choice.

@impredicative
Copy link
Author

impredicative commented Jan 26, 2021

@jreback I don't use .iat or .at.

In general, I would ask the question: "Is the alternative more simple or more difficult to use?" If a method simplifies code, making it less verbose and prevents users from reimplementing common patterns, then it's valuable, even if you personally think it's a duplicate, proportional to how well it accomplishes this goal.

@jreback
Copy link
Contributor

jreback commented Jan 27, 2021

@jreback I don't use .iat or .at.

In general, I would ask the question: "Is the alternative more simple or more difficult to use?" If a method simplifies code, making it less verbose and prevents users from reimplementing common patterns, then it's valuable, even if you personally think it's a duplicate, proportional to how well it accomplishes this goal.

this is balanced against an already huge api. There are alternatives from this. If you have use case where this is especially awkward / burdensome I would encourage you to open an issue with a reproducible example and show why a new method makes sense.

@quanghm
Copy link

quanghm commented Feb 4, 2021

@erfannariman

out of curiosity, what is exactly difficult about the new proposed method, which is also significantly faster since DataFrame.lookup was a for-loop on the background.

I learned about the deprecation of lookup when answering this SO question. And this is my thought on the proposed alternative for lookup. Of course, if current lookup is a for loop, this method would be faster for small dataset. But for bigger dataset, things would change due to potentially larger memory allocation, especially when column names are long strings and data to be looked up are homogeneous.

@impredicative impredicative changed the title ENH: Undeprecate and reimplement DataFrame.lookup ENH: Reimplement and undeprecate DataFrame.lookup Feb 4, 2021
@erfannariman
Copy link
Member

@erfannariman

out of curiosity, what is exactly difficult about the new proposed method, which is also significantly faster since DataFrame.lookup was a for-loop on the background.

I learned about the deprecation of lookup when answering this SO question. And this is my thought on the proposed alternative for lookup. Of course, if current lookup is a for loop, this method would be faster for small dataset. But for bigger dataset, things would change due to potentially larger memory allocation, especially when column names are long strings and data to be looked up are homogeneous.

Sure I would be happy to make a PR with a new proposed more efficient method and there we can discuss into more details and the core devs can share their thoughts as well. What would be your suggestion, the answer you gave on SO? Could you maybe share a reproducible example in a new ticket and mention this issue? @quanghm

@quanghm
Copy link

quanghm commented Feb 4, 2021

@erfannariman

out of curiosity, what is exactly difficult about the new proposed method, which is also significantly faster since DataFrame.lookup was a for-loop on the background.

I ran a quick test and the results said the reverse:

np.random.seed(43)
# also tested for n = 1000, 10_000, 100_000
n=1_000_000
cols = list('abcdef')
df = pd.DataFrame(np.random.randint(0, 10, size=(n,len(cols))), columns=cols)
df['col'] = np.random.choice(cols, n)

# the proposed method
%%timeit -n 10
melt = df.melt('col')
melt = melt.loc[lambda x: x['col']==x['variable'], 'value']
melt = melt.reset_index(drop=True)
# 623 ms ± 6.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# What I would do without `lookup`
%%timeit -n 10
cols, idx = np.unique(df['col'], return_inverse=True)
df.reindex(cols,axis=1).to_numpy()[np.arange(df.shape[0]), idx]
# 430 ms ± 4.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Current lookup on 1.1.4
%%timeit -n 10
df.lookup(df.index, df['col'])
# 321 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The proposed method always comes last by a big margin in terms of speed. This only gets worse if the column names get longer.

@impredicative
Copy link
Author

lambda x: x['col']==x['variable']

Using lambdas is a non-starter. It shouldn't even be suggested unless it is the last option on earth.

cols, idx = np.unique(df['col'], return_inverse=True)

np.unique obviously will support only a very limited number of Pandas datatypes, so it's probably a non-starter too. Numerical programming using Numpy is different from more generic dataframe programming using Pandas.

@quanghm
Copy link

quanghm commented Feb 4, 2021

@erfannariman

Regarding the lambda function, that was originally an attempt to chain the three lines to detect possible speed improvement. Answer was not.

Using .loc[melt['col']==melt['variable'],...] wouldn't improve/change anything.

np.unique is used on the column names, so most likely will be strings. In general case, we can use instead, pd.Series.unique() and pd.Series.factorize() for that single line. But as far as I know, those two also returns np.array.

Also, can you name some Pandas datatypes that wouldn't work with np.unique, especially those one would use as column names?

All those aside, the message here is that the proposed method is slower than current lookup method by factor of 2.

@erfannariman
Copy link
Member

Thanks for the extensive example and speedtests @quanghm . I did a quick check and seems like lookup wasn't a for loop actually, only for columns with mixed types. So the in terms of speed it looks okay. Only thing is that I remember it being quite slow for certain cases, but those must have been mixed type columns in my cases I think. See here for source code

pandas/pandas/core/frame.py

Lines 3848 to 3861 in b5958ee

if not self._is_mixed_type or n > thresh:
values = self.values
ridx = self.index.get_indexer(row_labels)
cidx = self.columns.get_indexer(col_labels)
if (ridx == -1).any():
raise KeyError("One or more row labels was not found")
if (cidx == -1).any():
raise KeyError("One or more column labels was not found")
flat_index = ridx * len(self.columns) + cidx
result = values.flat[flat_index]
else:
result = np.empty(n, dtype="O")
for i, (r, c) in enumerate(zip(row_labels, col_labels)):
result[i] = self._get_value(r, c)

For your approach, I agree that it's more efficient both in speed and memory allocation, but I remember writing multiple methods to replace lookup and I decided to go with melt since it was most "readable", without making it too complex for the documentation.

So I think it all boils down to @jreback and the argument that DataFrame.lookup has not been used as much, which is being discussed by @impredicative

@quanghm
Copy link

quanghm commented Feb 4, 2021

@erfannariman Thanks. The code for lookup indeed reflects what I imagine, that is resolving to numpy advance indexing.

However, in the case that col is a column in the data (e.g. in my test or in the proposed solution) it's almost certainly that self._is_mixed_type == True. So you are right that it is just a for loop as of now. Still, it's faster than the proposed melt method.

On another note, this is certainly new for me

np.unique obviously will support only a very limited number of Pandas datatypes

can you provide some details?

@quanghm
Copy link

quanghm commented Feb 4, 2021

np.unique obviously will support only a very limited number of Pandas datatypes, so it's probably a non-starter too. Numerical programming using Numpy is different from more generic dataframe programming using Pandas.

I confirmed that pd.factorize() works and is much faster than np.unique, at least in my test above:

%%timeit -n 10
idx, cols = pd.factorize(df['col'])
df.reindex(cols,axis=1).to_numpy()[np.arange(df.shape[0]), idx]
# 48.4 ms ± 716 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@espdev
Copy link

espdev commented Feb 13, 2021

I think it is premature to deprecate lookup function. At the same time, the documentation for melt function is too superficial. For example when I read the documentation (1, 2) I do not understand how I need to port my code from lookup to melt.

I do simple things. Here is a simple example:

>>> index = pd.Series(['2020-01-04', '2020-01-03'], index=['A', 'B'])
>>> df = pd.DataFrame({'A': [1,2,3], 'B': [4,5,6]}, index=['2020-01-03', '2020-01-04', '2020-01-05'])

>>> index
A    2020-01-04
B    2020-01-03
dtype: object

>>> df
            A  B
2020-01-03  1  4
2020-01-04  2  5
2020-01-05  3  6

>>> values = df.lookup(index.values, index.index)
>>> values
array([2, 4], dtype=int64)

>>> pd.Series(values, index=index.index)
A    2
B    4
dtype: int64

What should I do to use melt instead of lookup? And why can't I use high-level lookup API like before? Maybe I just stupid, but I do not want to think about it and dive into ornate concepts, I just want to implement the algorithm in high-level clean code. I want to keep my code simple and the code should clearly express my intention.

The code example from SO is tricky.

df['new_col'] = df.melt(id_vars='names', value_vars=['a', 'b'], ignore_index=False).query('names == variable').loc[df.index, 'value']

Ask yourself what intention it expresses. This is some kind of esoteric gibberish without reference to the subject area. The code is too low-level and verbose compared to the simple and concise lookup(rows, cols).

@quanghm
Copy link

quanghm commented Feb 15, 2021

@impredicative you are not comparing apples to apples here. try with indexing operators, we already have .loc, .iloc, .at, and .iat, not to mention partial slicing and of course []. which one shall we cut to keep .lookup????

@jreback I think it's you that not comparing apples to apples here: none of those APIs you listed offer lookup functionality (otherwise, you wouldn't have used melt as an alternative). lookup is a direct translation of arr[list_1, list_2] from Numpy to Pandas. The Numpy API is there and it looks like it will stay for a while, why remove Pandas' equivalent.

The example that @espdev listed in his comment above is a more common case for the use of lookup. And from that example, I agree that it's really is unnatural to think melt at all, let alone using it.

@MarcoGorelli
Copy link
Member

First off, I can confirm I can reproduce the performance difference from #39171 (comment), even without the lambda function (this is all on the v1.1.4 tag):

In [58]: %%timeit -n 10
    ...: df.melt('col', ignore_index=False).query('col==variable')['value'].rein
    ...: dex(df.index).to_numpy()
    ...: 
    ...: 
671 ms ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [59]: %%timeit -n 10
    ...: df.lookup(df.index, df['col'])
    ...: 
    ...: 

290 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Second, is there a case when the solution in #39171 (comment) would not work well enough? If not, should the user guide just be updated to use that and we can move on from this issue?

Third - @impredicative please consider your tone

@impredicative
Copy link
Author

Third - @impredicative please consider your tone

My tone is justifiable. I have had enough of Pandas being unprofessionally developed. Panda works only for simple manipulations of small dataframes, and scales extremely poorly or not at all. I have spent weeks and weeks trying to work around its various issues, and the best thing for me to do right now is to move to a real package like Dask, PySpark, etc. I'm sure that many other users here feel similarly.

Fourth - @MarcoGorelli please consider how Pandas is actually developed.

@MarcoGorelli
Copy link
Member

If anyone wants to take this forward with a respectful tone (e.g. constructive comments like #39171 (comment) ), then feel free to open a new issue - closing and locking as this is not the way to have a productive discussion and several comments have been off-topic

@pandas-dev pandas-dev locked as too heated and limited conversation to collaborators Feb 17, 2021
@jreback
Copy link
Contributor

jreback commented Feb 17, 2021

thanks @MarcoGorelli

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
Development

No branches or pull requests

6 participants