gh-92081: Fix for email.generator.Generator with whitespace between encoded words. #92281

abadger · 2022-05-04T00:57:38Z

email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines. The
current generator creates an encoded word for each line. If the end
of a line happens to coincide with the end of a real word in the
plaintext, the generator places an unencoded space at the start of
the subsequent line to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words, which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words. This
fix places the space inside of the second encoded word.
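
A minimal sketch of the difference, not the generator's actual folding code: `fold_words` and `decode` are hypothetical helpers introduced only for illustration, and the two-word input is an assumption chosen to match the test data. Each word becomes its own encoded word on its own line; the only thing that varies is whether the separating space is left to the folding whitespace or carried inside the second encoded word.

```python
from email.header import decode_header, make_header
from email.quoprimime import header_encode


def decode(raw):
    """Decode a raw (possibly folded) header value to text."""
    return str(make_header(decode_header(raw)))


def fold_words(words, space_inside, charset="utf-8"):
    """Hypothetical helper: one encoded word per word, one per line."""
    lines = []
    for index, word in enumerate(words):
        if space_inside and index > 0:
            # Carry the separating space inside this encoded word ("_" in Q encoding).
            word = " " + word
        lines.append(header_encode(word.encode(charset), charset))
    # "\n " is only folding whitespace; a decoder drops it between encoded words.
    return "\n ".join(lines)


words = ["cônsectetuer", "adipiscing."]
old_style = fold_words(words, space_inside=False)
new_style = fold_words(words, space_inside=True)

print(decode(old_style))  # cônsectetueradipiscing.
print(decode(new_style))  # cônsectetuer adipiscing.
```

The second form survives decoding because the space is part of the encoded payload rather than whitespace between two encoded words.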

Test case from #92081

Issue: BytesGenerator breaks UTF8 string #92081

abadger · 2022-05-05T15:58:01Z

This is a work in progress because #92081 has one other issue that also needs to be fixed: whitespace at the start of the Subject is being omitted as well. I'm leaning towards fixing that one in the decoder, but I'm still reading through the RFCs to see if they have anything to say about that.

Note: the other issue has been fixed here as well and this is ready for review.

abadger · 2023-04-26T22:55:36Z

Lib/test/test_email/test_headerregistry.py

@@ -1628,7 +1629,7 @@ def test_address_display_names(self):
                    'Lôrem ipsum dôlôr sit amet, cônsectetuer adipiscing. '
                    'Suspendisse pôtenti. Aliquam nibh. Suspendisse pôtenti.',
                    '=?utf-8?q?L=C3=B4rem_ipsum_d=C3=B4l=C3=B4r_sit_amet=2C_c'
-                    '=C3=B4nsectetuer?=\n =?utf-8?q?adipiscing=2E_Suspendisse'
+                    '=C3=B4nsectetuer?=\n =?utf-8?q?_adipiscing=2E_Suspendisse'


Note: the data in the unit test was buggy. If you run the original output through decode_header(), you'll find that it is missing the space between cônsectetuer and adipiscing.
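
The check described above can be reproduced roughly as follows. This is a sketch under one assumption: the diff shows only the start of the second encoded word, so the values here are shortened and a closing `?=` is added right after `adipiscing=2E`. `OLD`, `NEW`, and `decode` are names made up for this example.

```python
from email.header import decode_header, make_header

# Shortened from the test data above; the closing "?=" after "adipiscing=2E"
# is added for this example and is not where the real expected value ends.
OLD = ("=?utf-8?q?L=C3=B4rem_ipsum_d=C3=B4l=C3=B4r_sit_amet=2C_c"
       "=C3=B4nsectetuer?=\n =?utf-8?q?adipiscing=2E?=")
NEW = ("=?utf-8?q?L=C3=B4rem_ipsum_d=C3=B4l=C3=B4r_sit_amet=2C_c"
       "=C3=B4nsectetuer?=\n =?utf-8?q?_adipiscing=2E?=")


def decode(raw):
    """Decode a raw header value with decode_header() and join the parts."""
    return str(make_header(decode_header(raw)))


print(decode(OLD))  # Lôrem ipsum dôlôr sit amet, cônsectetueradipiscing.
print(decode(NEW))  # Lôrem ipsum dôlôr sit amet, cônsectetuer adipiscing.
```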

abadger · 2023-04-26T22:59:08Z

This fix is now ready to be reviewed.

abadger · 2023-07-21T21:41:51Z

Hey @warsaw, this fix is ready for review if you have some cycles to spare for thinking about email and the stdlib.

alex-pobeditel-2004 · 2023-12-12T16:12:44Z

@abadger I tested this fix on my project and it worked like a charm. Inexpressible thanks! I will use your patch until this is merged into CPython :)

email.generator.Generator currently does not handle whitespace between encoded words correctly when the encoded words span multiple lines. The current generator will create an encoded word for each line. If the end of a line happens to coincide with the end of a real word in the plaintext, the generator will place an unencoded space at the start of the subsequent line to represent the whitespace between the plaintext words.

A compliant decoder will strip all the whitespace from between two encoded words, which leads to missing spaces in the round-tripped output.

The fix for this is to make sure that whitespace between two encoded words ends up inside of one or the other of the encoded words. This fix places the space inside of the second encoded word.

A second problem happens with continuation lines. A continuation line that starts with whitespace and is followed by a non-encoded word is fine because the newline between such continuation lines is defined as condensing to a single space character. When the continuation line starts with whitespace followed by an encoded word, however, the RFCs specify that the encoded word is run together with the encoded word on the previous line. This is because normal words are folded on syntactic breaks but encoded words are not.

The solution to this is to add the whitespace to the start of the encoded word on the continuation line.

Test cases are from python#92081
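
One way to observe the behavior described in this commit message is a round trip through the generator and parser. The sketch below is an illustration, not one of the PR's test cases: `roundtrip_subject` is a hypothetical helper, and using the sentence from the test data as a Subject is an assumption. No expected output is shown because the result depends on whether the running interpreter includes this fix.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Sentence borrowed from the test data; using it as a Subject is an assumption.
SUBJECT = ("Lôrem ipsum dôlôr sit amet, cônsectetuer adipiscing. "
           "Suspendisse pôtenti. Aliquam nibh. Suspendisse pôtenti.")


def roundtrip_subject(subject):
    """Serialize a message with the given Subject and parse it back."""
    msg = EmailMessage()
    msg["Subject"] = subject
    raw = msg.as_bytes()  # folds and encodes the header via BytesGenerator
    parsed = message_from_bytes(raw, policy=policy.default)
    return str(parsed["Subject"])


result = roundtrip_subject(SUBJECT)
print(repr(result))
# With the fix, the spaces between words are expected to survive the round trip;
# without it, some of them may be missing.
print("spaces preserved:", result == SUBJECT)
```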

warsaw

Thanks for your contribution to Python!

miss-islington-app · 2024-05-20T19:10:51Z

Thanks @abadger for the PR, and @warsaw for merging it 🌮🎉. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

…ween encoded words. (pythonGH-92281)

* Fix for email.generator.Generator with whitespace between encoded words.
* Rename a variable so it's not confused with the final variable.

(cherry picked from commit a6fdb31)

Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>

bedevere-app · 2024-05-20T19:11:03Z

GH-119245 is a backport of this pull request to the 3.13 branch.

bedevere-app · 2024-05-20T19:11:08Z

GH-119246 is a backport of this pull request to the 3.12 branch.

abadger requested a review from a team as a code owner May 4, 2022 00:57

bedevere-bot added the awaiting review label May 4, 2022

abadger marked this pull request as draft May 4, 2022 00:57

AlexWaygood added the topic-email label May 4, 2022

abadger force-pushed the email-bytes-generator-breakage branch from bc7a42d to 80f5cfa Compare April 26, 2023 22:03

bedevere-bot mentioned this pull request Apr 26, 2023

BytesGenerator breaks UTF8 string #92081

Closed

abadger marked this pull request as ready for review April 26, 2023 22:24

abadger force-pushed the email-bytes-generator-breakage branch from bec90d8 to 69b205d Compare April 26, 2023 22:48

abadger force-pushed the email-bytes-generator-breakage branch from 69b205d to 4104b8e Compare July 21, 2023 14:47

abadger force-pushed the email-bytes-generator-breakage branch from 4104b8e to 5071b52 Compare May 20, 2024 17:40

Rename a variable so it's not confused with the final variable.

12e9e11

warsaw self-assigned this May 20, 2024

warsaw added 3.12 3.13 needs backport to 3.12 needs backport to 3.13 labels May 20, 2024

warsaw approved these changes May 20, 2024

bedevere-app bot added awaiting merge and removed awaiting review labels May 20, 2024

warsaw enabled auto-merge (squash) May 20, 2024 18:07

warsaw merged commit a6fdb31 into python:main May 20, 2024
39 checks passed

bedevere-app bot removed the awaiting merge label May 20, 2024

abadger deleted the email-bytes-generator-breakage branch May 20, 2024 19:11

bedevere-app bot removed the needs backport to 3.13 label May 20, 2024

bedevere-app bot removed the needs backport to 3.12 label May 20, 2024

JelleZijlstra mentioned this pull request May 28, 2024

backport a9a74da 3.13 #119642

Closed
