SAS7BDAT parser: Fast byteswap #47403

jonashaag · 2022-06-17T11:15:19Z

Speed up SAS7BDAT int/float reading.

This is an order of magnitude faster than using struct.unpack(fmt, data) or a precompiled unpacker (precompiled_unpacker = struct.Struct(fmt).unpack; ...; precompiled_unpacker(data)).
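For reference, the two struct-based baselines mentioned above look roughly like this in pure Python. This is only a sketch: the 8-byte big-endian payload is made up for illustration, and no timings are claimed here.

```python
import struct

# Hypothetical payload: one big-endian double, as a file written on a
# big-endian machine might contain.
data = struct.pack(">d", 1.5)

# Baseline 1: the format string is looked up on every call.
value_1 = struct.unpack(">d", data)[0]

# Baseline 2: compile the format once and reuse the bound unpack method.
precompiled_unpacker = struct.Struct(">d").unpack
value_2 = precompiled_unpacker(data)[0]

print(value_1, value_2)  # 1.5 1.5
```

Both variants still go through a Python-level call and build a result tuple for every value, which is the overhead a Cython-level byteswap avoids.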

Unfortunately Python does not expose a low-level interface to struct or a byteswapping interface. The byteswap implementation in this change is from pyreadstat.

Today this brings a modest 10-20% performance improvement. But once the other changes I will be proposing are in, int/float reading would otherwise become a major bottleneck.

closes #xxxx (Replace xxxx with the Github issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jbrockmendel · 2022-06-17T17:54:03Z

pandas/io/sas/sas.pyx

+    uint8_t,
+    uint16_t,
+    uint32_t,
+    uint64_t,


just curious, are these interchangeable with the versions of these we cimport from numpy?

jonashaag · 2022-06-20

Interesting, I didn't know those existed.

size_t and friends:

https://github.com/cython/cython/blob/f753deecd09e011a1bc276b78ccc0f1c0ad67f09/Cython/Includes/numpy/__init__.pxd#L27-L32

uint64_t and friends:

https://github.com/cython/cython/blob/f753deecd09e011a1bc276b78ccc0f1c0ad67f09/Cython/Includes/numpy/__init__.pxd#L746 -> https://github.com/cython/cython/blob/f753deecd09e011a1bc276b78ccc0f1c0ad67f09/Cython/Includes/numpy/__init__.pxd#L325

So this looks identical in both cases, but I'm happy to import from NumPy if that's preferred.

jbrockmendel · 2022-07-06

Doesn't make a huge difference. On the one hand it'd be nice to avoid a dependency on numpy when possible; on the other, I'm inevitably going to forget and ask again in 6 months if we don't use the numpy versions.

WillAyd · 2022-06-29T03:41:53Z

pandas/io/sas/sas.pyx

+
+cdef inline float _byteswap_float(float num):
+    cdef uint32_t answer = 0
+    memcpy(&answer, &num, 4)


Can you use sizeof instead of hard coding the size?

jonashaag · 2022-06-29

This is a verbatim copy from ReadStat. Do you still want me to make that modification?

WillAyd · 2022-06-29

Ah, OK. I think it's fine to keep as is, then.
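To illustrate what the memcpy in `_byteswap_float` is for, here is a pure-Python sketch of the same idea: copy the float's raw bits into an unsigned 32-bit integer, swap the bytes of that integer, and reinterpret the result as a float. The helper names `byteswap_u32` and `byteswap_float` are hypothetical and this is not the ReadStat or pandas code. `struct.calcsize` plays the role of `sizeof` here.

```python
import struct

FLOAT_SIZE = struct.calcsize("=f")  # 4, the Python analogue of sizeof(float)


def byteswap_u32(num: int) -> int:
    """Reverse the byte order of an unsigned 32-bit integer."""
    return (
        ((num & 0x000000FF) << 24)
        | ((num & 0x0000FF00) << 8)
        | ((num & 0x00FF0000) >> 8)
        | ((num & 0xFF000000) >> 24)
    )


def byteswap_float(num: float) -> float:
    """Byteswap a 32-bit float via its raw bit pattern."""
    # Reinterpret the bits (not the numeric value) as a uint32,
    # like memcpy(&answer, &num, sizeof(float)) does in C.
    (bits,) = struct.unpack("=I", struct.pack("=f", num))
    (result,) = struct.unpack("=f", struct.pack("=I", byteswap_u32(bits)))
    return result


print(hex(byteswap_u32(0x11223344)))  # 0x44332211
```

Going through the integer bit pattern matters because swapping has to act on the bytes, not on the numeric value of the float.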

jonashaag · 2022-07-04T14:13:43Z

One thing I realized is that we could also use NumPy's byteswapping, at the cost of around 10% performance relative to this impl.

jbrockmendel · 2022-07-06T21:00:21Z

One thing I realized is that we could also use NumPy's byteswapping, at the cost of around 10% performance relative to this impl.

How much of this could be replaced with that?

jonashaag · 2022-07-07T06:25:44Z

How much of this could be replaced with that?

Everything copied from ReadStat, so about 50 lines of Cython code.

jreback

asv's that cover this case?

can you add a whatsnew note

jreback · 2022-07-08T23:04:38Z

pandas/io/sas/sas.pyx

@@ -433,3 +439,73 @@ cdef class Parser:
        self.current_row_on_page_index += 1
        self.current_row_in_chunk_index += 1
        self.current_row_in_file_index += 1
+
+
+def read_float_with_byteswap(const uint8_t *data, bint byteswap):


can you add some comments here on what and why you are doing this.
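As a rough illustration of the "what and why": a reader like this interprets raw bytes in the host's native byte order and reverses them first when the file was written with the opposite endianness. The sketch below is pure Python. The name `read_double_with_byteswap` and the `offset` parameter are assumptions made for the example, not the signature in this PR.

```python
import struct


def read_double_with_byteswap(data: bytes, offset: int, byteswap: bool) -> float:
    """Read an 8-byte float at `offset`, optionally reversing its bytes."""
    raw = data[offset : offset + 8]
    if byteswap:
        # The file's byte order differs from the host's: reverse the bytes
        # so that a native-order unpack yields the intended value.
        raw = raw[::-1]
    return struct.unpack("=d", raw)[0]


native = struct.pack("=d", 2.5)
print(read_double_with_byteswap(native, 0, False))       # 2.5
print(read_double_with_byteswap(native[::-1], 0, True))  # 2.5
```

In a parser, the `byteswap` flag would typically be computed once per file by comparing the file's declared byte order with `sys.byteorder`.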

jonashaag · 2022-07-09T07:51:23Z

ASV from #47405:

       before           after         ratio
     [e915b0a4]       [435a003c]
     <main>           <sas/byteswap~1>
-      81.9±0.7ms       73.7±0.3ms     0.90  io.sas.SAS.time_read_sas7bdat_2_chunked
-      78.7±0.7ms       69.5±0.6ms     0.88  io.sas.SAS.time_read_sas7bdat_2
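For anyone reading the table: asv's ratio column is after divided by before, so values below 1 mean the branch is faster. A quick check of the two rows, using only the mean timings shown above and ignoring the ± spread:

```python
# Mean timings in milliseconds, copied from the ASV output: (before, after).
timings_ms = {
    "time_read_sas7bdat_2_chunked": (81.9, 73.7),
    "time_read_sas7bdat_2": (78.7, 69.5),
}

summary = {}
for name, (before, after) in timings_ms.items():
    ratio = after / before
    summary[name] = (round(ratio, 2), round((1 - ratio) * 100))
    print(f"{name}: ratio={ratio:.2f}, time saved={(1 - ratio) * 100:.0f}%")
```

That works out to roughly 10% and 12% less time, which is in line with the "modest 10-20%" estimate in the description.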

jonashaag · 2022-07-11T17:23:55Z

Test failure seems unrelated

jonashaag · 2022-07-21T07:10:48Z

I'm trying an intrinsics-based version: Fewer SLOC and faster.

datapythonista · 2022-09-10T15:31:03Z

Do you want me to add unit tests?

Personally I think it would be useful. A single parametrized test that covers the function for every type is probably enough.
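A sketch of what such a per-type check could look like, using only the standard library so it runs standalone. The reader `read_with_byteswap` is a hypothetical pure-Python stand-in for the Cython functions, and the real tests would more likely use pytest parametrization or Hypothesis.

```python
import random
import struct
import sys


def read_with_byteswap(data: bytes, fmt: str, byteswap: bool):
    """Hypothetical reader: native-order unpack, reversing bytes if asked."""
    raw = data[::-1] if byteswap else data
    return struct.unpack("=" + fmt, raw)[0]


def check_type(fmt: str, value) -> None:
    """Compare both flag settings against struct's explicit byte orders."""
    native, foreign = ("<", ">") if sys.byteorder == "little" else (">", "<")
    # Round-trip once so float32 rounding is applied to the expected value.
    expected = struct.unpack("=" + fmt, struct.pack("=" + fmt, value))[0]
    assert read_with_byteswap(struct.pack(native + fmt, value), fmt, False) == expected
    assert read_with_byteswap(struct.pack(foreign + fmt, value), fmt, True) == expected


rng = random.Random(0)
for fmt, bits in [("H", 16), ("I", 32), ("Q", 64)]:  # uint16, uint32, uint64
    for _ in range(100):
        check_type(fmt, rng.getrandbits(bits))
for fmt in ["f", "d"]:  # float32, float64
    for _ in range(100):
        check_type(fmt, rng.uniform(-1e6, 1e6))
```

Float inputs are generated as values rather than random bytes so that NaN bit patterns, which never compare equal, do not cause spurious failures.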

datapythonista · 2022-09-10T15:37:00Z

The release note in the other PR is no longer accurate. The changes in this PR won't be available until pandas 1.6/2.0, but that release note lists this issue/PR under 1.5. It's probably no big deal, but if you want to open a small PR that updates the 1.5 release note to include only the issues/PRs that have already been merged, that would be more accurate (ping me in that PR so I can backport it to 1.5). You can then add a separate release note for 1.6 here.

jonashaag · 2022-09-10T15:49:36Z

Or we merge the PRs noted in those release notes to 1.5? The PRs have been ready for review with no code changes for a long time.

datapythonista · 2022-09-10T15:57:34Z

We already released the pandas 1.5 release candidate, and we are only backporting regression fixes to it, not new features or performance improvements. It's also unclear to me whether this will be merged before the release.

I know this has been forgotten for a long time, sorry about that. But that's unfortunately part of how open source development works in a project like pandas.

jonashaag · 2022-09-10T22:35:29Z

I added tests using Hypothesis and moved the byteswapping code to a module. The byteswapping code could also be moved somewhere else outside the SAS stuff.

jonashaag · 2022-09-15T09:41:09Z

Updated release notes. I took the liberty of including #47656 already.

pandas/io/sas/sas7bdat.py

mroeschke · 2022-10-05T16:03:07Z

Thanks @jonashaag

* Fast byteswap
* Add types
* Review feedback
* Slightly faster variant (1 less bytes obj construction)
* Make MyPy happy?
* Update sas7bdat.py
* Use intrinsics
* Lint
* Add tests + move byteswap to module
* Add float tests + refactoring
* Undo unrelated changes
* Undo unrelated changes
* Lint
* Update v1.6.0.rst
* read_int -> read_uint
* Lint
* Update sas7bdat.py

jonashaag changed the title from "Fast byteswap" to "SAS7BDAT parser: Fast byteswap" Jun 17, 2022

jonashaag mentioned this pull request Jun 17, 2022

Meta issue: SAS7BDAT parser improvements #47339

Open

jbrockmendel reviewed Jun 17, 2022

jonashaag force-pushed the sas/byteswap branch from dbddc93 to a63f9b7 Compare June 25, 2022 14:05

jonashaag requested a review from jbrockmendel June 27, 2022 18:55

WillAyd reviewed Jun 29, 2022

jonashaag added 2 commits July 2, 2022 18:17

Fast byteswap

5b9cd4b

Add types

17c965f

jonashaag force-pushed the sas/byteswap branch from a63f9b7 to 17c965f Compare July 2, 2022 16:17

jreback added the Performance (memory or execution speed performance) and IO SAS (SAS: read_sas) labels Jul 8, 2022

jreback added this to the 1.5 milestone Jul 8, 2022

jreback requested changes Jul 8, 2022

jonashaag added 2 commits July 9, 2022 09:37

Merge branch 'main' into sas/byteswap

51499fb

Review feedback

435a003

jonashaag force-pushed the sas/byteswap branch from 96b9ea0 to d29d3a9 Compare July 9, 2022 21:32

Slightly faster variant (1 less bytes obj construction)

10ab87f

jonashaag force-pushed the sas/byteswap branch from d29d3a9 to 10ab87f Compare July 9, 2022 21:33

jonashaag added 3 commits July 10, 2022 22:22

Make MyPy happy?

ad74f5c

Update sas7bdat.py

9c5b4b3

Merge branch 'main' into sas/byteswap

21c364c

Merge branch 'main' into sas/byteswap

148fa75

jonashaag requested a review from jreback July 16, 2022 12:28

Use intrinsics

f3c63f0

Lint

c310c0d

Add tests + move byteswap to module

3b7ba83

jonashaag added 6 commits September 11, 2022 01:02

Add float tests + refactoring

53fbce2

Undo unrelated changes

9cbc5be

Undo unrelated changes

4802848

Lint

41abe02

Merge branch 'main' into sas/byteswap

2abd8e0

Update v1.6.0.rst

bf0976a

datapythonista mentioned this pull request Sep 15, 2022

DOC: Fix read_sas 1.5 release notes #48563

Merged

Merge branch 'main' into sas/byteswap

c725d49

mroeschke added this to the 1.6 milestone Oct 3, 2022

mroeschke reviewed Oct 3, 2022

pandas/io/sas/sas7bdat.py (outdated, resolved)

jonashaag added 6 commits October 4, 2022 11:46

read_int -> read_uint

c7c1a2f

Lint

6a4a556

Merge branch 'main' into sas/byteswap

9f5ba3f

Update sas7bdat.py

a439434

Merge branch 'main' into sas/byteswap

55bd863

Merge branch 'main' into sas/byteswap

bdf8203

mroeschke approved these changes Oct 5, 2022

mroeschke merged commit c855be8 into pandas-dev:main Oct 5, 2022

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
