Skip to content

TYP: libgroupby int64->intp #40635

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 26, 2021
Merged

TYP: libgroupby int64->intp #40635

merged 3 commits into from
Mar 26, 2021

Conversation

jbrockmendel
Copy link
Member

@realead pushing this any further requires getting a HashTable for intp_t. Would that be difficult?

@WillAyd
Copy link
Member

WillAyd commented Mar 25, 2021

Just out of curiosity what is the motivation behind this change? I think ideally would have been a better starting point, but feels a bit strange to go back from 64 bit integers to 32 bit integers on 32 bit platforms with some of these

@WillAyd
Copy link
Member

WillAyd commented Mar 25, 2021

Hmm I see your response already here:

#40623 (comment)

Do we have tests that run on 32 bit platforms that use groups exceeding 2 ** 32 -1 in size? IIUC this might introduce a regression in that case, though I may also not be thinking through these sizes and how they are mapped in the entire ecosystem of things correctly

@jbrockmendel
Copy link
Member Author

Do we have tests that run on 32 bit platforms that use groups exceeding 2 ** 32 -1 in size?

IIUC a 32bit system should have no more than 4GB of memory, so shouldn't be able to create such a test

@WillAyd
Copy link
Member

WillAyd commented Mar 25, 2021

32 bit systems can still represent 64 bit integers. The compiler will just have to generate extra instructions that makes it inefficient compared to using a 64 bit system for the same calculations, but it is still something you can do (much like performing complex128 math on a 64 bit system)

https://stackoverflow.com/questions/28977587/is-it-ok-to-use-64bit-integers-in-a-32bit-application

@WillAyd
Copy link
Member

WillAyd commented Mar 25, 2021

From some of these changes I don't think it's the ngroups actually that matters as much as the values being stored. Moving these from 64 down to 32 I think could introduce undefined behavior on those platforms

@jbrockmendel
Copy link
Member Author

32 bit systems can still represent 64 bit integers

Its not about being able to represent 64 bit integers. Its about how many groups you can possibly have. max(labels) is going to match len(unique(values)) - 1, which should be bounded by len(values). My claim is that you shouldn't be able to get len(values) > 2**32-1

@WillAyd
Copy link
Member

WillAyd commented Mar 25, 2021

Oh OK I see that this is just labels that you've changed this for

@realead
Copy link
Contributor

realead commented Mar 26, 2021

@realead pushing this any further requires getting a HashTable for intp_t. Would that be difficult?

It wouldn't be a big deal to introduce an intp_t-HashTable, in a similar way to all other HashTables.

But the question is, is it really needed? One could just use int64_t-HashTable on 64bit systems and int32_t-HashTable on 32bit system. That would have the advantage of so/pyd being smaller.

If you think, that having a dedicated intp_t-HashTable is a cleaner way (which is probably true), I'll add it.

@jreback jreback added 32bit 32-bit systems Groupby Typing type annotations, mypy/pyright type checking labels Mar 26, 2021
@jreback jreback added this to the 1.3 milestone Mar 26, 2021
@jreback jreback merged commit 8fb6164 into pandas-dev:master Mar 26, 2021
@jreback
Copy link
Contributor

jreback commented Mar 26, 2021

thanks @jbrockmendel

if you can do an asv since before you started moving things to intp to make sure things are about the same (maybe slightly better).

@jbrockmendel jbrockmendel deleted the typ-libgb branch March 26, 2021 15:24
@jbrockmendel
Copy link
Member Author

if you can do an asv since before you started moving things to intp to make sure things are about the same (maybe slightly better).

i'll do that instead of turning on the heater this morning. i expect it to be swamped by unrelated changes in the interim

@jbrockmendel
Copy link
Member Author

But the question is, is it really needed? One could just use int64_t-HashTable on 64bit systems and int32_t-HashTable on 32bit system. That would have the advantage of so/pyd being smaller.

This makes sense. Would that just be a matter of:

if np.dtype(np.intp) == np.dtype(np.int64):
    IntpHashtable = Int64HashTable
elif np.dtype(np.intp) == np.dtype(np.int32):
    IntpHashTable = Int32Hashtable
else:
    raise ValueError(np.dtype(np.intp))

?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
32bit 32-bit systems Groupby Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants