SAS7BDAT parser: Fast byteswap #47403

Merged
merged 27 commits into from
Oct 5, 2022

Contributor

@jonashaag jonashaag commented Jun 17, 2022

Speed up SAS7BDAT int/float reading.

This is an order of magnitude faster than using struct.unpack(fmt, data) or precompiled_unpacker = struct.Struct(fmt).unpack; ...; precompiled_unpacker(data).

Unfortunately Python does not expose a low-level interface to struct or a byteswapping interface. The byteswap implementation in this change is from pyreadstat.

Today this brings a modest 10-20% performance improvement. But once the other changes I will be proposing are in, int/float reading would otherwise become a major bottleneck.
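
For reference, the two struct-based baselines named in the description can be sketched as follows. This is only an illustration of the slower approaches, not code from the PR; the format string `">d"` and the helper names `read_with_unpack` and `read_with_precompiled` are assumptions.

```python
import struct

# Assumed format for illustration: one big-endian 8-byte float.
FMT = ">d"


def read_with_unpack(data: bytes) -> float:
    # Module-level call: the format string is handled on every call.
    return struct.unpack(FMT, data)[0]


# Compile the format once and keep a reference to the bound unpack method.
precompiled_unpacker = struct.Struct(FMT).unpack


def read_with_precompiled(data: bytes) -> float:
    # Same result as above, but without re-processing the format string.
    return precompiled_unpacker(data)[0]


sample = struct.pack(FMT, 3.5)
print(read_with_unpack(sample), read_with_precompiled(sample))
```

Both variants still go through Python-level tuple construction for every value, which is the overhead the Cython byteswap code avoids.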

  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@jonashaag jonashaag changed the title Fast byteswap SAS7BDAT parser: Fast byteswap Jun 17, 2022
uint8_t,
uint16_t,
uint32_t,
uint64_t,
Member

just curious, are these interchangeable with the versions of these we cimport from numpy?

Contributor Author

@jonashaag jonashaag Jun 20, 2022

Member

Doesn't make a huge difference. On the one hand it'd be nice to avoid a dependency on numpy when possible; on the other, I'm inevitably going to forget and ask again in 6 months if we don't use the numpy versions.


cdef inline float _byteswap_float(float num):
    cdef uint32_t answer = 0
    memcpy(&answer, &num, 4)
Member

Can you use sizeof instead of hard coding the size?

Contributor Author

This is a verbatim copy from ReadStat, do you still want me to make that modification?

Member

Ah ok. I think fine to keep as is then
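
To make the `sizeof` suggestion in this thread concrete, here is a rough Python model of the memcpy-based float swap using `ctypes`. It is a sketch under assumptions, not the ReadStat or pandas code: the function name `byteswap_float` is hypothetical, and the integer byte reversal uses `int.from_bytes` rather than the shift-and-mask logic of the original.

```python
import ctypes


def byteswap_float(num: float) -> float:
    # Hypothetical Python model of the Cython helper discussed above.
    src = ctypes.c_float(num)
    bits = ctypes.c_uint32(0)
    # Copy the raw float bits into an integer; sizeof replaces the literal 4.
    ctypes.memmove(ctypes.byref(bits), ctypes.byref(src), ctypes.sizeof(src))
    # Reverse the four bytes of the integer value.
    swapped = ctypes.c_uint32(
        int.from_bytes(bits.value.to_bytes(4, "little"), "big")
    )
    out = ctypes.c_float(0.0)
    # Copy the swapped bits back into a float of the same size.
    ctypes.memmove(ctypes.byref(out), ctypes.byref(swapped), ctypes.sizeof(out))
    return out.value
```

Going through an integer of the same width avoids reinterpreting the float by pointer cast, which is the reason the original copies the bytes instead.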

@jonashaag
Contributor Author

jonashaag commented Jul 4, 2022

One thing I realized is that we could also use NumPy's byteswapping, at the cost of around 10% performance relative to this impl.

@jbrockmendel
Member

One thing I realized is that we could also use NumPy's byteswapping, at the cost of around 10% performance relative to this impl.

How much of this could be replaced with that?

@jonashaag
Contributor Author

jonashaag commented Jul 7, 2022

How much of this could be replaced with that?

Everything copied from ReadStat, so ~50 lines of Cython code.
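
The NumPy alternative mentioned in this exchange could look roughly like the sketch below. The function name `read_uint32_numpy` is hypothetical and this is not what the PR ended up doing; it only shows that `ndarray.byteswap()` can stand in for a hand-written swap.

```python
import sys

import numpy as np


def read_uint32_numpy(data: bytes, byteswap: bool) -> int:
    # Interpret the first 4 bytes as a native-order unsigned 32-bit integer.
    arr = np.frombuffer(data, dtype=np.uint32, count=1)
    if byteswap:
        # byteswap() returns a new array with the bytes of each element reversed.
        arr = arr.byteswap()
    return int(arr[0])


# Bytes written in the opposite order to the host, so a swap is required.
foreign_order = "big" if sys.byteorder == "little" else "little"
raw = (0x01020304).to_bytes(4, foreign_order)
print(hex(read_uint32_numpy(raw, byteswap=True)))
```

The trade-off discussed above is that this removes the copied code at the cost of some per-value overhead.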

@jreback jreback added the Performance (Memory or execution speed performance) and IO SAS (SAS: read_sas) labels Jul 8, 2022
@jreback jreback added this to the 1.5 milestone Jul 8, 2022
Contributor

@jreback jreback left a comment

asv's that cover this case?

can you add a whatsnew note

@@ -433,3 +439,73 @@ cdef class Parser:
        self.current_row_on_page_index += 1
        self.current_row_in_chunk_index += 1
        self.current_row_in_file_index += 1


def read_float_with_byteswap(const uint8_t *data, bint byteswap):
Contributor

can you add some comments here on what and why you are doing this.
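
As an illustration of the kind of "what and why" comments being requested, here is a pure-Python model of a function with the same name and flag. It is a sketch only: the real function is Cython operating on a `const uint8_t *`, and the body below (using `struct`) is an assumption about its behavior, not a copy of it.

```python
import struct
import sys


def read_float_with_byteswap(data: bytes, byteswap: bool) -> float:
    # What: decode the first 4 bytes of `data` as a single-precision float.
    # Why the flag: the file may have been written with a byte order that
    # differs from the host's, in which case the bytes must be reversed
    # before they can be interpreted.
    (bits,) = struct.unpack("=I", data[:4])  # raw bits in native order
    if byteswap:
        # Reverse the four bytes of the integer representation.
        bits = int.from_bytes(bits.to_bytes(4, "little"), "big")
    # Reinterpret the (possibly swapped) bits as a float.
    return struct.unpack("=f", struct.pack("=I", bits))[0]


# Example: bytes in the opposite order to the host need byteswap=True.
foreign = ">" if sys.byteorder == "little" else "<"
print(read_float_with_byteswap(struct.pack(foreign + "f", 1.5), True))
```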

@jonashaag
Contributor Author

ASV from #47405:

       before           after         ratio
     [e915b0a4]       [435a003c]
     <main>           <sas/byteswap~1>
-      81.9±0.7ms       73.7±0.3ms     0.90  io.sas.SAS.time_read_sas7bdat_2_chunked
-      78.7±0.7ms       69.5±0.6ms     0.88  io.sas.SAS.time_read_sas7bdat_2

@jonashaag
Contributor Author

Test failure seems unrelated

@jonashaag jonashaag requested a review from jreback July 16, 2022 12:28
@jonashaag
Contributor Author

jonashaag commented Jul 21, 2022

I'm trying an intrinsics-based version: Fewer SLOC and faster.

@datapythonista
Member

Do you want me to add unit tests?

Personally I think it would be useful. Probably a single parametrized test that covers the function for every type is enough.
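
A table-driven check of this kind might look like the sketch below. It uses only the standard library; with pytest the same table would typically be fed to `pytest.mark.parametrize`. The function `read_uint_with_byteswap` here is a hypothetical pure-Python stand-in, not the Cython function from the PR.

```python
import struct
import sys


def read_uint_with_byteswap(data: bytes, size: int, byteswap: bool) -> int:
    # Hypothetical stand-in: read `size` bytes in native order, or in the
    # opposite order when `byteswap` is set.
    order = sys.byteorder
    if byteswap:
        order = "big" if order == "little" else "little"
    return int.from_bytes(data[:size], order)


# One row per integer width: (size in bytes, struct code, test value).
CASES = [
    (1, "B", 0x7F),
    (2, "H", 0x0102),
    (4, "I", 0x01020304),
    (8, "Q", 0x0102030405060708),
]


def run_cases() -> int:
    native = "<" if sys.byteorder == "little" else ">"
    foreign = ">" if sys.byteorder == "little" else "<"
    checked = 0
    for size, code, value in CASES:
        # Native-order bytes must round-trip without swapping.
        assert read_uint_with_byteswap(struct.pack(native + code, value), size, False) == value
        # Foreign-order bytes must round-trip with swapping.
        assert read_uint_with_byteswap(struct.pack(foreign + code, value), size, True) == value
        checked += 2
    return checked
```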

@datapythonista
Member

The release note in the other PR is not accurate now. The changes in this PR won't be available until pandas 1.6/2.0, but from what you mention, the release note in the other PR lists this issue/PR under 1.5. Probably no big deal, but if you want to open a small PR to update that release note so it only includes the issues/PRs that have already been merged, that would make things more accurate (ping me in that PR, so I can backport it to 1.5). And then you can add another release note for 1.6 here.

@jonashaag
Contributor Author

Or we merge the PRs noted in those release notes to 1.5? The PRs have been ready for review with no code changes for a long time.

@datapythonista
Member

We already released the pandas 1.5 release candidate, and we are only backporting regression fixes to it, not new features or performance improvements. It's also unclear to me whether this will be merged before the release.

I know this has been forgotten for a long time, sorry about that. But that's unfortunately part of how open source development works in a project like pandas.

@jonashaag
Contributor Author

I added tests using Hypothesis and moved the byteswapping code to a module. The byteswapping code could also be moved somewhere else outside the SAS stuff.

@jonashaag
Contributor Author

Updated release notes. I took the liberty of including #47656 already.

@mroeschke mroeschke added this to the 1.6 milestone Oct 3, 2022
@mroeschke mroeschke merged commit c855be8 into pandas-dev:main Oct 5, 2022
@mroeschke
Member

Thanks @jonashaag

@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* Fast byteswap

* Add types

* Review feedback

* Slightly faster variant (1 less bytes obj construction)

* Make MyPy happy?

* Update sas7bdat.py

* Use intrinsics

* Lint

* Add tests + move byteswap to module

* Add float tests + refactoring

* Undo unrelated changes

* Undo unrelated changes

* Lint

* Update v1.6.0.rst

* read_int -> read_uint

* Lint

* Update sas7bdat.py
Labels
IO SAS SAS: read_sas Performance Memory or execution speed performance
6 participants