perf(gcm): shrink Shoup table and tune GCM loop (IDFGH-13409) #14314

bryghtlabs-richard · 2024-08-06T16:08:58Z

Profiling showed a lot of time in gcm_mult() during downloads.

Tune GCM loop for pure 32-bit processors like Xtensa and RV32.

With ESP32-S3-GCC 12.2.0, -O2:

| Item | Before | After | Notes |
|---|---|---|---|
| len(.rodata.last4) | 128B | 64B | |
| len(.text.gcm_mult) | 328B | 368B | |
| gcm_mult() cycles | ~1200 | ~930 | IRAM/DRAM + xthal_get_ccount() |

github-actions · 2024-08-06T16:09:42Z

Warnings

⚠️ Some issues found for the commit messages in this PR:
- the commit message `"change(mbedtls/port): optimize gcm_mult()"`:
  - summary looks too short

Please fix these commit messages - here are some basic tips:
- follow Conventional Commits style
- correct format of commit message should be: `<type/action>(<scope/component>): <summary>`, for example `fix(esp32): Fixed startup timeout issue`
- allowed types are: `change,ci,docs,feat,fix,refactor,remove,revert,test`
- sufficiently descriptive message summary should be between 20 to 72 characters and start with upper case letter
- avoid Jira references in commit messages (unavailable/irrelevant for our customers)

`TIP:` Install pre-commit hooks and run this check when committing (uses the Conventional Precommit Linter).

👋 Hello bryghtlabs-richard, we appreciate your contribution to this project!

📘 Please review the project's Contributions Guide for key guidelines on code, documentation, testing, and more.

🖊️ Please also make sure you have read and signed the Contributor License Agreement for this project.

This automated output is generated by the PR linter DangerJS, which checks if your Pull Request meets the project's requirements and helps you fix potential issues.

DangerJS is triggered with each push event to a Pull Request and modifies the contents of this comment.

Please consider the following:
- Danger mainly focuses on the PR structure and formatting and can't understand the meaning behind your code or changes.
- Danger is not a substitute for human code reviews; it's still important to request a code review from your colleagues.
- Resolve all warnings (⚠️ ) before requesting a review from human reviewers - they will appreciate it.
- To manually retry these Danger checks, please navigate to the Actions tab and re-run last Danger workflow.

Review and merge process you can expect ...

We do welcome contributions in the form of bug reports, feature requests and pull requests via this public GitHub repository.

This GitHub project is a public mirror of our internal git repository.

1. An internal issue has been created for the PR, and we assign it to the relevant engineer.
2. They review the PR and either approve it or ask you for changes or clarifications.
3. Once the GitHub PR is approved, we synchronize it into our internal git repository.
4. In the internal git repository we do the final review, collect approvals from core owners and make sure all the automated tests are passing.
- At this point we may make some adjustments to the proposed change, or extend it by adding tests or documentation.
5. If the change is approved and passes the tests, it is merged into the default branch.
6. On the next sync from the internal git repository, the merged change will appear in this public GitHub repository.

Generated by 🚫 dangerJS against 1bb9db8

mahavirj · 2024-08-07T05:15:02Z

@bryghtlabs-richard

I see some recent improvements in the upstream code too: Mbed-TLS/mbedtls@0767fda. We will check, might as well align to the upstream version. Just fyi.

bryghtlabs-richard · 2024-08-07T18:25:26Z

It's certainly worth testing the upstream approach. It seems upstream assumes unaligned access is possible. On ESP32 it is not, so we'd spend more time doing XOR, though I haven't measured it.
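To illustrate the alignment concern, here is a minimal sketch of two ways to XOR a 16-byte block. It is not taken from upstream mbedTLS or from this PR; the function names `xor16_bytes`, `xor16_words` and `xor_variants_agree` are hypothetical. The byte-wise version works at any alignment but issues 16 load/XOR/store triples. The word-wise version goes through `memcpy`, which keeps it well-defined for unaligned pointers; whether a compiler turns each `memcpy` into a single 32-bit load depends on what the target allows, which is the cost being guessed at above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise XOR: valid for any pointer alignment, 16 iterations per block. */
static void xor16_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < 16; i++) {
        dst[i] = a[i] ^ b[i];
    }
}

/* Word-wise XOR: 4 iterations per block. memcpy avoids undefined behaviour
 * on unaligned pointers; the generated loads depend on the target. */
static void xor16_words(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < 16; i += 4) {
        uint32_t x, y;
        memcpy(&x, a + i, sizeof x);
        memcpy(&y, b + i, sizeof y);
        x ^= y;
        memcpy(dst + i, &x, sizeof x);
    }
}

/* Returns 1 if both variants agree on deliberately misaligned buffers. */
static int xor_variants_agree(void)
{
    uint8_t a[20], b[20], r1[20] = {0}, r2[20] = {0};
    for (size_t i = 0; i < sizeof a; i++) {
        a[i] = (uint8_t)(i * 37u + 1u);
        b[i] = (uint8_t)(i * 101u + 7u);
    }
    xor16_bytes(r1 + 1, a + 1, b + 3);
    xor16_words(r2 + 1, a + 1, b + 3);
    return memcmp(r1, r2, sizeof r1) == 0;
}
```

Comparing the disassembly of both functions for the target in question would show whether the word-wise form actually saves instructions there.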

bryghtlabs-richard · 2024-08-07T19:25:58Z

New MbedTLS version is slower. Each function was built with IRAM_ATTR, last4 with DRAM_ATTR, and cycles were counted with xthal_get_ccount():

| Implementation | Cycles/Block | Cycles/Byte |
|---|---|---|
| OldMbed/EspUpstream | 1214-1219 | 75.9 |
| NewMbedSmall | 4139-4141 | 258.7 |
| NewMbedLarge | 2168 | 135.5 |
| ThisPatch | 917-920 | 57.3 |

Edit: added Mbed's New, LargeTable approach, same test setup. Function runtimes depend slightly on caller, and slightly on instruction alignment in memory.
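For readers who want to reproduce numbers of this kind, the following is a rough sketch of a cycle-counting helper, not the harness used for the table above (that one is linked later in the thread). All names here (`cycle_counter_fn`, `min_cycles`, `cycles_per_byte`, `fake_ccount`, `fake_work`) are hypothetical, and taking the minimum over several runs is an assumption of this sketch. On target, a small wrapper around `xthal_get_ccount()` would be passed as the counter; the fake counter below only exists so the sketch runs on a host. The Cycles/Byte column is consistent with dividing the lower cycle count by the 16-byte GCM block size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GCM_BLOCK_BYTES 16u

/* Shape of a free-running 32-bit cycle counter (hypothetical typedef). */
typedef uint32_t (*cycle_counter_fn)(void);

/* Run fn(arg) `runs` times and return the smallest elapsed cycle count. */
static uint32_t min_cycles(cycle_counter_fn now, void (*fn)(void *), void *arg,
                           unsigned runs)
{
    assert(now != NULL && fn != NULL && runs > 0);
    uint32_t best = UINT32_MAX;
    for (unsigned i = 0; i < runs; i++) {
        uint32_t start = now();
        fn(arg);
        /* Unsigned subtraction stays correct if the counter wraps once. */
        uint32_t elapsed = now() - start;
        if (elapsed < best) {
            best = elapsed;
        }
    }
    return best;
}

/* Convert cycles per 16-byte GCM block into cycles per byte. */
static double cycles_per_byte(uint32_t cycles_per_block)
{
    return (double)cycles_per_block / GCM_BLOCK_BYTES;
}

/* Host-side stand-ins so the sketch runs off-target: a fake counter that
 * starts close to wraparound and a fake workload that "costs" 917 cycles. */
static uint32_t fake_ccount_value = 0xFFFFFF00u;
static uint32_t fake_ccount(void) { return fake_ccount_value; }
static void fake_work(void *arg) { (void)arg; fake_ccount_value += 917u; }
```

For example, `cycles_per_byte(2168)` gives 135.5, matching the NewMbedLarge row.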

KaeLL · 2024-08-08T01:26:22Z

@bryghtlabs-richard do you mind sharing the benchmark setup?

rsaxvc · 2024-08-08T04:55:29Z

I should also include the large table mbedtls approach

bryghtlabs-richard · 2024-08-08T14:25:49Z

@KaeLL, I've put my cycle-counting code into https://github.com/bryghtlabs-richard/esp-gcm-bench

@mahavirj, I've also tested the mbedTLS new large-table function, but it's worse than the old mbedTLS / current ESP-IDF approach. My preshift-unroll approach still seems to be the best for Xtensa.

AdityaHPatwardhan · 2024-08-13T04:11:51Z

Hi @bryghtlabs-richard Thanks for the PR.

The changes look good to me.

AdityaHPatwardhan · 2024-08-13T07:12:40Z

@bryghtlabs-richard Can you please squash all the commits into one commit?

bryghtlabs-richard · 2024-08-13T14:25:35Z

@AdityaHPatwardhan done. I think #14317 should go in first though.

AdityaHPatwardhan · 2024-08-14T05:32:23Z

Okay, #14317 has been merged into the internal codebase. The GitHub PR should be updated once the code is available on GitHub.

AdityaHPatwardhan · 2024-08-14T05:34:51Z

sha=9b6dab9edb71290061e7f718ba48d76a0dd93e13

1) pre-shift GCM last4 to use 32-bit shift

On 32-bit architectures like Aarch32, RV32, Xtensa, shifting a 64-bit variable by 32-bits is free, since it changes the register representing half of the 64-bit var. Pre-shift the last4 array to take advantage of this.

2) unroll first GCM iteration

The first loop of gcm_mult() is different from the others. By unrolling it separately from the others, the other iterations may take advantage of the zero-overhead loop construct, in addition to saving a conditional branch in the loop.
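The two changes described in this commit message can be sketched as follows. This is one plausible reading of the message, not the merged code: `gcm_tables`, `gcm_mult_ref`, `gcm_mult_peeled`, `step_pre` and `variants_agree` are hypothetical names, and the baseline loop is written from memory of the classic 4-bit table GHASH multiply. The reduction constants are stored either as 64-bit values shifted by 48 at use (128 bytes) or pre-shifted by 16 as 32-bit values shifted by 32 at use (64 bytes); on a 32-bit core the latter shift amounts to placing the constant in the high register. The self-check fills the H tables with arbitrary pseudo-random values, since the two loop shapes agree for any table contents.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the HL/HH fields of mbedtls_gcm_context. */
typedef struct {
    uint64_t HL[16];
    uint64_t HH[16];
} gcm_tables;

/* Baseline reduction constants: 16 x 64-bit, shifted left by 48 at use. */
static const uint64_t last4_wide[16] = {
    0x0000, 0x1c20, 0x3840, 0x2460, 0x7080, 0x6ca0, 0x48c0, 0x54e0,
    0xe100, 0xfd20, 0xd940, 0xc560, 0x9180, 0x8da0, 0xa9c0, 0xb5e0
};

/* Same constants pre-shifted by 16: 16 x 32-bit, shifted by 32 at use. */
static const uint32_t last4_pre[16] = {
    0x00000000, 0x1c200000, 0x38400000, 0x24600000,
    0x70800000, 0x6ca00000, 0x48c00000, 0x54e00000,
    0xe1000000, 0xfd200000, 0xd9400000, 0xc5600000,
    0x91800000, 0x8da00000, 0xa9c00000, 0xb5e00000
};

/* Baseline shape: the i != 15 test is evaluated on every iteration. */
static void gcm_mult_ref(const gcm_tables *t, const uint8_t x[16],
                         uint64_t *out_h, uint64_t *out_l)
{
    uint8_t lo = x[15] & 0xf, hi, rem;
    uint64_t zh = t->HH[lo], zl = t->HL[lo];
    for (int i = 15; i >= 0; i--) {
        lo = x[i] & 0xf;
        hi = x[i] >> 4;
        if (i != 15) {
            rem = (uint8_t)(zl & 0xf);
            zl = (zh << 60) | (zl >> 4);
            zh = (zh >> 4) ^ (last4_wide[rem] << 48) ^ t->HH[lo];
            zl ^= t->HL[lo];
        }
        rem = (uint8_t)(zl & 0xf);
        zl = (zh << 60) | (zl >> 4);
        zh = (zh >> 4) ^ (last4_wide[rem] << 48) ^ t->HH[hi];
        zl ^= t->HL[hi];
    }
    *out_h = zh;
    *out_l = zl;
}

/* One 4-bit step using the pre-shifted 32-bit constants. */
static void step_pre(const gcm_tables *t, unsigned nib,
                     uint64_t *zh, uint64_t *zl)
{
    unsigned rem = (unsigned)(*zl & 0xf);
    *zl = (*zh << 60) | (*zl >> 4);
    *zh = (*zh >> 4) ^ ((uint64_t)last4_pre[rem] << 32) ^ t->HH[nib];
    *zl ^= t->HL[nib];
}

/* First iteration peeled off, so the remaining loop body has no branch. */
static void gcm_mult_peeled(const gcm_tables *t, const uint8_t x[16],
                            uint64_t *out_h, uint64_t *out_l)
{
    uint64_t zh = t->HH[x[15] & 0xf], zl = t->HL[x[15] & 0xf];
    step_pre(t, x[15] >> 4, &zh, &zl);
    for (int i = 14; i >= 0; i--) {
        step_pre(t, x[i] & 0xf, &zh, &zl);
        step_pre(t, x[i] >> 4, &zh, &zl);
    }
    *out_h = zh;
    *out_l = zl;
}

/* xorshift64 generator, used only to produce arbitrary test data. */
static uint64_t next_rand(uint64_t *s)
{
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
    return *s;
}

/* Returns 1 when both variants agree on 100 arbitrary tables and inputs. */
static int variants_agree(void)
{
    assert(sizeof(last4_pre) * 2 == sizeof(last4_wide));
    gcm_tables t;
    uint8_t x[16];
    uint64_t s = 0x9e3779b97f4a7c15ull, ah, al, bh, bl;
    for (int round = 0; round < 100; round++) {
        for (int i = 0; i < 16; i++) {
            t.HH[i] = next_rand(&s);
            t.HL[i] = next_rand(&s);
            x[i] = (uint8_t)next_rand(&s);
        }
        gcm_mult_ref(&t, x, &ah, &al);
        gcm_mult_peeled(&t, x, &bh, &bl);
        if (ah != bh || al != bl) {
            return 0;
        }
    }
    return 1;
}
```

Whether the peeled loop actually maps onto a zero-overhead loop is up to the compiler and target, so the generated assembly is worth checking.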

AdityaHPatwardhan · 2024-08-16T04:33:51Z

sha=1bb9db875896da2605cf96bc0fd29b0111af2283

espressif-bot added the Status: Opened Issue is new label Aug 6, 2024

github-actions bot changed the title ~~perf(gcm): shrink Shoup table and tune GCM loop~~ perf(gcm): shrink Shoup table and tune GCM loop (IDFGH-13409) Aug 6, 2024

bryghtlabs-richard force-pushed the perf/gcm branch from 93cfd23 to 1abf631 Compare August 6, 2024 20:00

espressif-bot assigned AdityaHPatwardhan Aug 7, 2024

AdityaHPatwardhan approved these changes Aug 13, 2024

AdityaHPatwardhan added the PR-Sync-Merge Pull request sync as merge commit label Aug 13, 2024

bryghtlabs-richard force-pushed the perf/gcm branch from 1abf631 to 9b6dab9 Compare August 13, 2024 14:25

bryghtlabs-richard force-pushed the perf/gcm branch from 9b6dab9 to 1bb9db8 Compare August 14, 2024 18:40

AdityaHPatwardhan added PR-Sync-Update Pull request sync fetch new changes PR-Sync-Merge Pull request sync as merge commit and removed PR-Sync-Merge Pull request sync as merge commit PR-Sync-Update Pull request sync fetch new changes labels Aug 18, 2024

espressif-bot added Status: Done Issue is done internally Resolution: NA Issue resolution is unavailable and removed Status: Opened Issue is new labels Aug 21, 2024

espressif-bot closed this in ad3a257 Aug 26, 2024

bryghtlabs-richard deleted the perf/gcm branch September 13, 2024 18:16
