Improve email regexp on edge cases #10601

Merged
sydney-runkle merged 3 commits into pydantic:main from AlekseyLobanov:fix.email-regex
Oct 11, 2024

Conversation

AlekseyLobanov
Contributor

@AlekseyLobanov AlekseyLobanov commented Oct 10, 2024

  • Drastically improves performance on inputs like "<" + " " * N
  • The trailing spaces do not need to be matched separately because this group is stripped later. Spaces are also matched by . anyway.

Change Summary

I found that a single change in the email regexp fixes the slowdowns on certain invalid email strings. See the related issue for details.
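To illustrate why such inputs can be slow, here is a minimal sketch. It assumes two simplified stand-in patterns for the `<...>` group only; `SLOW`, `FAST`, `extract`, and `time_invalid` are hypothetical names, and neither pattern is claimed to be pydantic's actual regex. In `SLOW`, three adjacent quantifiers (`\s*`, `.+`, `\s*`) can all consume spaces, so on an input with no closing `>` a backtracking engine may try many ways of splitting the run of spaces before giving up. In `FAST`, only `.+` consumes them and the result is stripped afterwards.

```python
import re
import time
from typing import Optional

# Simplified stand-ins, NOT pydantic's real pattern: only the "<...>" group.
SLOW = re.compile(r"<\s*(.+)\s*>")  # whitespace matched explicitly around the group
FAST = re.compile(r"<(.+)>")        # whitespace left to `.` and stripped afterwards


def extract(pattern: re.Pattern, value: str) -> Optional[str]:
    """Return the stripped contents of <...>, or None if the value does not match."""
    m = pattern.fullmatch(value)
    return m.group(1).strip() if m else None


def time_invalid(pattern: re.Pattern, n: int) -> float:
    """Seconds spent rejecting the invalid input "<" + " " * n."""
    value = "<" + " " * n
    start = time.perf_counter()
    extract(pattern, value)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Print timings for growing n; absolute numbers depend on the machine.
    for n in (100, 200, 400):
        print(n, time_invalid(SLOW, n), time_invalid(FAST, n))
```

Both patterns accept the same valid inputs once the group is stripped, which is why dropping the explicit whitespace matching should not change results for well-formed addresses in this simplified setting.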

Related issue number

Fixes #10600

Checklist

  • The pull request title is a good summary of the changes - it will be used in the changelog
  • Unit tests for the changes exist
  • Tests pass on CI
  • Documentation reflects the changes where applicable
  • My PR is ready to review, please add a comment including the phrase "please review" to assign reviewers

Selected Reviewer: @sydney-runkle

github-actions bot added the relnotes-fix label (used for bugfixes) on Oct 10, 2024
@AlekseyLobanov
Contributor Author

please review

codspeed-hq bot commented Oct 10, 2024

CodSpeed Performance Report

Merging #10601 will not alter performance

Comparing AlekseyLobanov:fix.email-regex (e35c507) with main (c772b43)

Summary

✅ 38 untouched benchmarks

@AlekseyLobanov
Contributor Author

AlekseyLobanov commented Oct 10, 2024

How does performance change?
I used my own POC from #10600 and ran it as /usr/bin/time python pydantic-poc.py 500:

  • Before: 5.35user 0.01system 0:05.37elapsed 99%CPU (0avgtext+0avgdata 36840maxresident)k
  • After: 0.20user 0.01system 0:00.22elapsed 99%CPU (0avgtext+0avgdata 37004maxresident)k

About 25x speed improvement.

Contributor

github-actions bot commented Oct 10, 2024

Coverage report

This PR does not seem to contain any modification to coverable code.

Member

@Viicos Viicos left a comment

Thanks, this looks reasonable but to be extra careful we'll wait for other reviews as well.

Could you add the following to the test_address_valid test?:

        ('Samuel Colvin < s@muelcolvin.com>', 'Samuel Colvin', 's@muelcolvin.com'),
        ('Samuel Colvin <s@muelcolvin.com >', 'Samuel Colvin', 's@muelcolvin.com'),
        ('Samuel Colvin < s@muelcolvin.com >', 'Samuel Colvin', 's@muelcolvin.com'),

pydantic/networks.py (review thread outdated, resolved)
Co-authored-by: Victorien <65306057+Viicos@users.noreply.github.com>
@AlekseyLobanov
Contributor Author

> Thanks, this looks reasonable but to be extra careful we'll wait for other reviews as well.

According to Wikipedia, this is a valid DoS attack vector. And at least some rate limiters I know of only kick in after the validation step.

> Could you add the following to the test_address_valid test?

I think the existing tests already cover these edge cases (spaces before/after the group). Should I still add yours?

        ('foo BAR <foobar@example.com >', 'foo BAR', 'foobar@example.com'),
        ('FOO bar   <foobar@example.com> ', 'FOO bar', 'foobar@example.com'),
        ('Whatever < foobar@example.com>', 'Whatever', 'foobar@example.com'),

sydney-runkle added the relnotes-performance label (used for performance improvements) on Oct 11, 2024
@sydney-runkle
Member

Yep let's add those extra tests and fix the lints, but otherwise LGTM.

@Viicos
Member

Viicos commented Oct 11, 2024

> According to Wikipedia, this is a valid DoS attack vector. And at least some rate limiters I know of only kick in after the validation step.

I agree, I just wanted to be careful, as changing a regex can be a source of breaking changes.

> I think the existing tests already cover these edge cases (spaces before/after the group). Should I still add yours?

I missed those. Then maybe only add these after them:

        ('Whatever <foobar@example.com >', 'Whatever', 'foobar@example.com'),
        ('Whatever < foobar@example.com >', 'Whatever', 'foobar@example.com'),

Covering name + email case with spaces surrounding the email
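
As a rough illustration of how `(value, name, email)` tuples like these can be checked, here is a self-contained sketch. It assumes a hypothetical, simplified `Name <email>` pattern (`PRETTY`) and a hypothetical helper `split_name_email`; it is not pydantic's parser or its `test_address_valid` test, and the pattern is for illustration only, not hardened against slow inputs.

```python
import re
from typing import List, Tuple

# Hypothetical, simplified "Name <email>" pattern; NOT pydantic's real one.
PRETTY = re.compile(r"\s*([^<]*?)\s*<(.+)>\s*")


def split_name_email(value: str) -> Tuple[str, str]:
    """Split 'Name <email>' into (name, email), stripping surrounding spaces."""
    m = PRETTY.fullmatch(value)
    if m is None:
        raise ValueError(f"not a 'Name <email>' string: {value!r}")
    name, email = m.groups()
    return name.strip(), email.strip()


# The cases proposed in the review: spaces after, and on both sides of, the email.
CASES: List[Tuple[str, str, str]] = [
    ("Whatever <foobar@example.com >", "Whatever", "foobar@example.com"),
    ("Whatever < foobar@example.com >", "Whatever", "foobar@example.com"),
]


def check_cases() -> None:
    """Assert that every case parses to the expected (name, email) pair."""
    for value, name, email in CASES:
        assert split_name_email(value) == (name, email)


if __name__ == "__main__":
    check_cases()
```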
@AlekseyLobanov
Contributor Author

> I missed those. Then maybe only add these after them:

Done.

Member

@sydney-runkle sydney-runkle left a comment

Nice, thanks for the help here! We appreciate the thorough explanations / refs :).

@sydney-runkle sydney-runkle enabled auto-merge (squash) October 11, 2024 14:15
@sydney-runkle sydney-runkle merged commit 37d98a8 into pydantic:main Oct 11, 2024
57 checks passed
Labels
ready for review, relnotes-fix (used for bugfixes), relnotes-performance (used for performance improvements)
Development

Successfully merging this pull request may close these issues.

Email parsing slowdown on edgecases
3 participants