
bpo-47177: Replace f_lasti with prev_instr #32208


Merged · 14 commits · Apr 7, 2022

Conversation

@brandtbucher (Member) commented Mar 31, 2022:

A nice 2% improvement:

Slower (2):
- telco: 38.6 ms +- 2.7 ms -> 40.9 ms +- 2.6 ms: 1.06x slower
- regex_dna: 1.33 sec +- 0.01 sec -> 1.34 sec +- 0.01 sec: 1.01x slower

Faster (32):
- logging_silent: 640 ns +- 27 ns -> 582 ns +- 35 ns: 1.10x faster
- scimark_sor: 736 ms +- 16 ms -> 689 ms +- 13 ms: 1.07x faster
- pycparser: 7.52 sec +- 0.10 sec -> 7.12 sec +- 0.10 sec: 1.06x faster
- chameleon: 39.8 ms +- 2.7 ms -> 37.8 ms +- 2.4 ms: 1.05x faster
- html5lib: 417 ms +- 35 ms -> 397 ms +- 31 ms: 1.05x faster
- scimark_monte_carlo: 422 ms +- 13 ms -> 402 ms +- 11 ms: 1.05x faster
- pyflate: 2.65 sec +- 0.02 sec -> 2.53 sec +- 0.02 sec: 1.05x faster
- deltablue: 22.5 ms +- 1.3 ms -> 21.6 ms +- 1.4 ms: 1.05x faster
- unpickle_pure_python: 1.43 ms +- 0.08 ms -> 1.37 ms +- 0.07 ms: 1.04x faster
- spectral_norm: 628 ms +- 10 ms -> 604 ms +- 13 ms: 1.04x faster
- hexiom: 41.0 ms +- 3.1 ms -> 39.5 ms +- 2.4 ms: 1.04x faster
- raytrace: 1.84 sec +- 0.02 sec -> 1.78 sec +- 0.02 sec: 1.03x faster
- scimark_lu: 669 ms +- 20 ms -> 648 ms +- 20 ms: 1.03x faster
- xml_etree_generate: 474 ms +- 11 ms -> 459 ms +- 13 ms: 1.03x faster
- richards: 296 ms +- 12 ms -> 287 ms +- 14 ms: 1.03x faster
- nbody: 555 ms +- 17 ms -> 540 ms +- 15 ms: 1.03x faster
- pickle_dict: 170 us +- 8 us -> 165 us +- 7 us: 1.03x faster
- crypto_pyaes: 510 ms +- 12 ms -> 497 ms +- 13 ms: 1.03x faster
- scimark_fft: 1.99 sec +- 0.03 sec -> 1.95 sec +- 0.02 sec: 1.02x faster
- regex_compile: 819 ms +- 11 ms -> 800 ms +- 10 ms: 1.02x faster
- chaos: 423 ms +- 12 ms -> 414 ms +- 10 ms: 1.02x faster
- meteor_contest: 643 ms +- 13 ms -> 630 ms +- 12 ms: 1.02x faster
- pidigits: 1.18 sec +- 0.01 sec -> 1.16 sec +- 0.01 sec: 1.02x faster
- sympy_expand: 2.86 sec +- 0.02 sec -> 2.81 sec +- 0.02 sec: 1.02x faster
- nqueens: 498 ms +- 13 ms -> 490 ms +- 13 ms: 1.02x faster
- xml_etree_parse: 953 ms +- 20 ms -> 938 ms +- 16 ms: 1.02x faster
- 2to3: 1.60 sec +- 0.01 sec -> 1.58 sec +- 0.01 sec: 1.02x faster
- xml_etree_process: 331 ms +- 12 ms -> 326 ms +- 11 ms: 1.02x faster
- sympy_str: 1.71 sec +- 0.02 sec -> 1.69 sec +- 0.02 sec: 1.01x faster
- fannkuch: 2.42 sec +- 0.02 sec -> 2.39 sec +- 0.02 sec: 1.01x faster
- dulwich_log: 381 ms +- 11 ms -> 377 ms +- 11 ms: 1.01x faster
- xml_etree_iterparse: 624 ms +- 14 ms -> 618 ms +- 13 ms: 1.01x faster

Benchmark hidden because not significant (28): django_template, float, go, json, json_dumps, json_loads, logging_format, logging_simple, mako, pathlib, pickle, pickle_list, pickle_pure_python, python_startup, python_startup_no_site, regex_effbot, regex_v8, scimark_sparse_mat_mult, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_integrate, sympy_sum, thrift, tornado_http, unpack_sequence, unpickle, unpickle_list

Geometric mean: 1.02x faster

https://bugs.python.org/issue47177

@markshannon (Member) left a comment:

Looks very promising.

I think there is an off-by-one error, and I have a few suggestions and questions.

Code under review:

```c
code->co_filename,
PyCode_Addr2Line(frame->f_code, frame->f_lasti*sizeof(_Py_CODEUNIT)),
```

Suggested change:

```c
int addr = _PyInterpreterFrame_LASTI(frame) * sizeof(_Py_CODEUNIT);
int line = PyCode_Addr2Line(frame->f_code, addr);
```
@markshannon:

Slightly off-topic, but I wonder why we are going to the expense of computing the line here, rather than just storing the instruction offset?
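To make that question concrete, here is a hypothetical sketch of the alternative being hinted at: record only the cheap instruction offset at collection time and resolve it to a line number lazily. The `record` struct and its fields are illustrative only, not from the PR:

```c
/* Hypothetical sketch: store the cheap instruction offset (plus the
 * code object, which is needed later) at collection time. */
record->code = (PyCodeObject *)Py_NewRef(frame->f_code);
record->addr = _PyInterpreterFrame_LASTI(frame) * sizeof(_Py_CODEUNIT);

/* Resolve to a line number only when a report is actually rendered: */
int line = PyCode_Addr2Line(record->code, record->addr);
```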

Python/ceval.c (outdated):

```diff
     OPCODE_EXE_INC(op); \
     _py_stats.opcode_stats[lastopcode].pair_count[op]++; \
     lastopcode = op; \
 } while (0)
 #else
-#define INSTRUCTION_START(op) frame->f_lasti = INSTR_OFFSET(); next_instr++
+#define INSTRUCTION_START(op) (frame->next_instr = ++next_instr)
```
@markshannon:

Here's something to consider.

Rather than storing next_instr, we could store last_instr (strictly, next_instr_less_one), as it would break the dependency of the store on the addition:

```c
frame->next_instr_less_one = next_instr; next_instr++;
```

The compiler can then merge the next_instr++ with any subsequent JUMPBY(INLINE_CACHE_ENTRIES_...).

It would also simplify the changes, as you wouldn't need to change all the offsets by one.

Thoughts?
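To make the trade-off concrete, here is a sketch of the variants under discussion, shown side by side for comparison (only one would be defined at a time; the old macro is parenthesized here for uniformity, and `prev_instr` is the name the PR eventually settled on, per the title change below):

```c
/* Old scheme: compute and store an instruction offset on every
 * instruction start; the store must wait for INSTR_OFFSET(). */
#define INSTRUCTION_START(op) (frame->f_lasti = INSTR_OFFSET(), next_instr++)

/* This PR's first version: store the already-incremented pointer.
 * The store depends on the increment completing first. */
#define INSTRUCTION_START(op) (frame->next_instr = ++next_instr)

/* Suggested variant: store the pre-increment value. The store and the
 * increment are independent, and the compiler is free to fold the ++
 * into a later JUMPBY(INLINE_CACHE_ENTRIES_...). */
#define INSTRUCTION_START(op) (frame->prev_instr = next_instr++)
```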

@brandtbucher (Author):

Interesting. I'll try that out in another branch.


@brandtbucher (Author):

The performance numbers show no difference between that branch and this one. The other branch does result in a net reduction of "adjust by 1" moves, so I think I slightly prefer it.

Thoughts?

@markshannon commented Apr 4, 2022:

I prefer the other version. It should be quicker, and feels less intrusive. I'm not surprised that there is no measurable difference in performance; I would expect it to be very slight.

@bedevere-bot:
When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@markshannon (Member) commented Mar 31, 2022:

My performance test shows the same 2% speedup but with less variation, and no 6% slowdown for telco.

@brandtbucher (Author) commented:

> My performance test shows the same 2% but with less variation, and no 6% slowdown for telco.

I was initially suspicious of the broad improvement too, but I was able to reproduce it with two fresh runs:

Slower (5):
- pidigits: 1.14 sec +- 0.01 sec -> 1.21 sec +- 0.01 sec: 1.07x slower
- nbody: 552 ms +- 15 ms -> 577 ms +- 22 ms: 1.05x slower
- pycparser: 7.12 sec +- 0.10 sec -> 7.41 sec +- 0.09 sec: 1.04x slower
- pickle_dict: 166 us +- 8 us -> 169 us +- 9 us: 1.02x slower
- xml_etree_parse: 927 ms +- 15 ms -> 942 ms +- 19 ms: 1.02x slower

Faster (31):
- richards: 301 ms +- 11 ms -> 276 ms +- 12 ms: 1.09x faster
- scimark_sor: 739 ms +- 13 ms -> 682 ms +- 14 ms: 1.08x faster
- nqueens: 523 ms +- 14 ms -> 484 ms +- 12 ms: 1.08x faster
- logging_silent: 647 ns +- 25 ns -> 604 ns +- 40 ns: 1.07x faster
- spectral_norm: 633 ms +- 14 ms -> 592 ms +- 14 ms: 1.07x faster
- unpack_sequence: 338 ns +- 28 ns -> 318 ns +- 28 ns: 1.06x faster
- go: 873 ms +- 14 ms -> 827 ms +- 13 ms: 1.05x faster
- deltablue: 22.7 ms +- 1.1 ms -> 21.6 ms +- 1.3 ms: 1.05x faster
- raytrace: 1.85 sec +- 0.01 sec -> 1.77 sec +- 0.02 sec: 1.05x faster
- html5lib: 415 ms +- 31 ms -> 396 ms +- 30 ms: 1.05x faster
- pickle_pure_python: 1.90 ms +- 0.13 ms -> 1.82 ms +- 0.15 ms: 1.05x faster
- pyflate: 2.65 sec +- 0.02 sec -> 2.54 sec +- 0.02 sec: 1.05x faster
- scimark_fft: 2.03 sec +- 0.02 sec -> 1.95 sec +- 0.04 sec: 1.04x faster
- chaos: 423 ms +- 10 ms -> 407 ms +- 10 ms: 1.04x faster
- unpickle_pure_python: 1.42 ms +- 0.07 ms -> 1.37 ms +- 0.06 ms: 1.04x faster
- json_loads: 172 us +- 19 us -> 166 us +- 8 us: 1.04x faster
- hexiom: 41.3 ms +- 2.7 ms -> 39.8 ms +- 2.5 ms: 1.04x faster
- scimark_lu: 672 ms +- 15 ms -> 649 ms +- 15 ms: 1.03x faster
- float: 460 ms +- 11 ms -> 446 ms +- 10 ms: 1.03x faster
- scimark_monte_carlo: 417 ms +- 12 ms -> 404 ms +- 12 ms: 1.03x faster
- chameleon: 38.8 ms +- 2.6 ms -> 37.6 ms +- 3.1 ms: 1.03x faster
- regex_compile: 824 ms +- 13 ms -> 800 ms +- 14 ms: 1.03x faster
- xml_etree_process: 333 ms +- 14 ms -> 323 ms +- 14 ms: 1.03x faster
- sympy_expand: 2.88 sec +- 0.02 sec -> 2.81 sec +- 0.04 sec: 1.02x faster
- sympy_str: 1.72 sec +- 0.01 sec -> 1.69 sec +- 0.02 sec: 1.02x faster
- sympy_sum: 969 ms +- 13 ms -> 951 ms +- 14 ms: 1.02x faster
- dulwich_log: 383 ms +- 10 ms -> 377 ms +- 10 ms: 1.02x faster
- meteor_contest: 641 ms +- 14 ms -> 634 ms +- 16 ms: 1.01x faster
- 2to3: 1.60 sec +- 0.01 sec -> 1.58 sec +- 0.01 sec: 1.01x faster
- crypto_pyaes: 505 ms +- 14 ms -> 500 ms +- 11 ms: 1.01x faster
- sqlalchemy_declarative: 841 ms +- 15 ms -> 834 ms +- 23 ms: 1.01x faster

Benchmark hidden because not significant (26): django_template, fannkuch, json, json_dumps, logging_format, logging_simple, mako, pathlib, pickle, pickle_list, python_startup, python_startup_no_site, regex_dna, regex_effbot, regex_v8, scimark_sparse_mat_mult, sqlalchemy_imperative, sqlite_synth, sympy_integrate, telco, thrift, tornado_http, unpickle, unpickle_list, xml_etree_iterparse, xml_etree_generate

Geometric mean: 1.02x faster

So now we're triply sure.

@brandtbucher requested a review from @markshannon on April 1, 2022.
@markshannon (Member):
Did you get the version with prev_instr ready for review/merging?

@brandtbucher (Author):
It's ready, just slipped off my radar. Would you prefer a separate PR, or just to bring those commits over to this branch?

@markshannon (Member):

Whatever is easier, I don't mind.

@brandtbucher changed the title from "bpo-47177: Replace f_lasti with next_instr" to "bpo-47177: Replace f_lasti with prev_instr" on Apr 6, 2022.
@brandtbucher (Author):
I brought the commits over here.

@markshannon added the 🔨 test-with-buildbots label on Apr 7, 2022.
@bedevere-bot:
🤖 New build scheduled with the buildbot fleet by @markshannon for commit 670b94b 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot removed the 🔨 test-with-buildbots label on Apr 7, 2022.
@markshannon (Member):
Looks good.
This is quite an intrusive change, so I'm running the buildbots before merging.

@stratakis (Contributor):
It seems there are many reference leaks caught by various buildbots. e.g.: https://buildbot.python.org/all/#/builders/766/builds/117

@brandtbucher (Author):
Looking into it.

@brandtbucher (Author):
Oh wait, it looks like those are failing on other PRs, too.

@markshannon (Member):
Feel free to merge once you are confident that all the failures are unrelated. I think they are unrelated, but I've not checked them individually.

@brandtbucher (Author):
The refleaks disappear when I apply the fix in #32403. So not my fault. 🙂

@vstinner (Member) commented Apr 7, 2022:

The coverage project uses the PyFrameObject.f_frame.f_lasti field and so is broken by this PR. I wrote nedbat/coveragepy#1331 to read the Python attribute instead, but the coverage maintainer is worried about worse performance. Maybe adding a new PyFrame_GetLastInstr() getter function would help.
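As an illustration only, here is a minimal sketch of what such a getter might look like under this PR's internals. PyFrame_GetLastInstr is vstinner's proposed name, not an existing CPython API, and the body assumes this PR's layout (a prev_instr pointer inside _PyInterpreterFrame, reachable via PyFrameObject.f_frame):

```c
/* Hypothetical sketch of the proposed getter; not the actual CPython API. */
int
PyFrame_GetLastInstr(PyFrameObject *frame)
{
    assert(frame != NULL);
    int lasti = _PyInterpreterFrame_LASTI(frame->f_frame);
    if (lasti < 0) {
        /* No instruction has started yet (e.g. a freshly created frame). */
        return -1;
    }
    /* Scale the instruction index to a byte offset, matching the
     * Python-level frame.f_lasti attribute. */
    return lasti * (int)sizeof(_Py_CODEUNIT);
}
```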

@brandtbucher (Author):
The PR adds a _PyInterpreterFrame_LASTI macro. Does that help?
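For context, the idea behind that macro is to recover the old f_lasti index from the new prev_instr pointer via pointer arithmetic against the start of the bytecode. A sketch under this PR's design (the exact definition in the PR may differ):

```c
/* Sketch: recover the last instruction index from prev_instr.
 * _PyCode_CODE() is assumed to yield the first _Py_CODEUNIT of the
 * code object's bytecode; prev_instr points at the most recently
 * started instruction, so the difference is its index. */
static inline int
_PyInterpreterFrame_LASTI(_PyInterpreterFrame *frame)
{
    return (int)(frame->prev_instr - _PyCode_CODE(frame->f_code));
}
```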

@markshannon (Member):
#32413

@brandtbucher deleted the lasti branch on July 21, 2022.
Labels: performance (Performance or resource usage) · 9 participants