Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect missing of trimming all-space text events when trim_text_start = false and trim_text_end = true #755

Merged
merged 13 commits into from
Jun 14, 2024

Conversation

Mingun
Copy link
Collaborator

@Mingun Mingun commented Jun 12, 2024

I was refactoring for another task and accidentally found 2 errors:

  • in ReaderState::emit_text there was a misprint: if string does not contain non-space symbols it remains unchanged instead of making empty

  • This error was actually masked by the previous. After fixing it ReaderState::emit_text could make empty slice which then translated to Event::Eof. However, this function is called in two cases:

    • when we found < which switches parser from InsideText to InsideMarkup state
    • when we found EOF while parser in the InsideText state

    Event::Eof should be generated only in the latter case.

The changes also shows small performance improvements:

> critcmp master trim -t 5
group                                                          master                                 fix-trim-end
-----                                                          ------                                 ------------
One event/Comment                                              1.08    157.2±2.92ns        ? ?/sec    1.00    146.0±2.42ns        ? ?/sec
decode_and_parse_document/rpm_filelists.xml                    1.06     68.4±1.24µs   160.6 MB/sec    1.00     64.5±1.93µs   170.3 MB/sec
decode_and_parse_document_with_namespaces/rpm_filelists.xml    1.07     95.7±1.62µs   114.7 MB/sec    1.00     89.3±1.63µs   122.9 MB/sec
decode_and_parse_document_with_namespaces/rpm_primary.xml      1.06    205.8±4.17µs    98.5 MB/sec    1.00    193.6±3.81µs   104.7 MB/sec
decode_and_parse_document_with_namespaces/rpm_primary2.xml     1.06     66.9±1.38µs   107.1 MB/sec    1.00     63.4±1.34µs   113.1 MB/sec
parse_document_nocopy/rpm_filelists.xml                        1.09     63.1±1.27µs   174.0 MB/sec    1.00     58.1±1.06µs   189.1 MB/sec
parse_document_nocopy/rpm_primary2.xml                         1.05     45.1±1.34µs   159.0 MB/sec    1.00     42.8±0.88µs   167.6 MB/sec
read_event/trim_text = false                                   1.27    206.1±3.85µs        ? ?/sec    1.00    162.9±2.66µs        ? ?/sec
read_event/trim_text = true                                    1.15    199.5±3.72µs        ? ?/sec    1.00    172.8±3.04µs        ? ?/sec
unescape_text/mixed                                            1.05    384.1±6.97ns        ? ?/sec    1.00    364.9±6.66ns        ? ?/sec

@Mingun Mingun added the bug label Jun 12, 2024
@Mingun Mingun requested a review from dralley June 12, 2024 10:09
@codecov-commenter
Copy link

codecov-commenter commented Jun 12, 2024

Codecov Report

Attention: Patch coverage is 88.67925% with 12 lines in your changes missing coverage. Please review.

Project coverage is 61.81%. Comparing base (5d76174) to head (28795e1).
Report is 40 commits behind head on master.

Files Patch % Lines
src/reader/buffered_reader.rs 83.33% 12 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #755      +/-   ##
==========================================
+ Coverage   61.24%   61.81%   +0.57%     
==========================================
  Files          39       41       +2     
  Lines       16277    16798     +521     
==========================================
+ Hits         9969    10384     +415     
- Misses       6308     6414     +106     
Flag Coverage Δ
unittests 61.81% <88.67%> (+0.57%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -211,7 +211,7 @@ macro_rules! read_event_impl {
) => {{
let event = loop {
break match $self.state.state {
ParseState::Init => { // Go to OpenedTag state
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message says "renamed for brevity". I'm not sure that brevity is the correct word since these names are longer than the original ones.

I'm fine with the new names, though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, I confused it with "clarity"

Copy link
Collaborator Author

@Mingun Mingun Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed commit message in force push

@@ -14,11 +14,15 @@

### Bug Fixes

- [#755]: Fix incorrect missing of trimming all-space text events when
`trim_text_start = false` and `trim_text_end = true`.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: " events should not be trimmed at boundary of text / CDATA, or text / PI, or text / comment in some cases", is there an issue filed for that with some examples of such boundary cases?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, such issue does not exist. I made investigation in #520 for serde deserializer and implement reading text properly. I cross-checked that using Java XmlBeans library.

- OpenedTag -> InsideMarkup
- ClosedTag -> InsideText
- Empty     -> InsideEmpty
- Exit      -> Done
…t_start = false and trim_text_end = true

This is still not complete fix, because we will generate empty Event::Text although we should not do that,
but it is hard to prevent generation of such event. Moreover it would be better to remove ability of
automatic trimming completely, because it is anyway does not work correctly -- events should not be
trimmed at boundary of text / CDATA, or text / PI, or text / comment in some cases
@Mingun Mingun merged commit 65bf651 into tafia:master Jun 14, 2024
6 checks passed
@Mingun Mingun deleted the fix-trim-end branch June 14, 2024 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants