Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[work-in-progress]: Perform all decoding internally #420

Closed
wants to merge 5 commits into from

Conversation

dralley
Copy link
Collaborator

@dralley dralley commented Jul 15, 2022

  • The data being streamed from the reader would be converted from whichever encoding it is currently in to UTF-8 when copied to a buffer, or else validated as UTF-8 if that is the appropriate encoding, using streaming APIs
  • UTF-8 would be the only encoding used within the library itself
  • All decoding functions would be removed from the API
  • quick-xml would never have to allocate during parsing except from resizing a buffer or doing (un)escaping
  • All API functions would be changed to return &str / String / Cow<str> rather than raw bytes
  • Known character boundaries can be used to ensure the safety of unsafe fn std::str::from_utf8_unchecked(), which would not incur any overhead

@dralley
Copy link
Collaborator Author

dralley commented Jul 15, 2022

@Mingun seriously I need some feedback on the plan. It would conflict with or render pointless a lot of other changes.

@dralley dralley changed the title WIP: Perform all decoding internally, pass utf-8 types [work-in-progress]: Perform all decoding internally, pass utf-8 types Jul 15, 2022
@dralley dralley force-pushed the buffer-decode branch 3 times, most recently from ac63646 to a676d30 Compare July 15, 2022 18:42
@Mingun
Copy link
Collaborator

Mingun commented Jul 15, 2022

I'm going to sleep now, I'll give my thoughts later. General note: I prefer moving by small steps, first preparing the necessary infrastructure. To begin with, I suggest looking at #421 first

@codecov-commenter
Copy link

codecov-commenter commented Jul 15, 2022

Codecov Report

Merging #420 (0bb9922) into master (d34b779) will decrease coverage by 1.37%.
The diff coverage is 74.03%.

@@            Coverage Diff             @@
##           master     #420      +/-   ##
==========================================
- Coverage   49.53%   48.16%   -1.38%     
==========================================
  Files          22       22              
  Lines       13856    13673     -183     
==========================================
- Hits         6864     6586     -278     
- Misses       6992     7087      +95     
Flag Coverage Δ
unittests 48.16% <74.03%> (-1.38%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
benches/macrobenches.rs 0.00% <0.00%> (ø)
benches/microbenches.rs 0.00% <0.00%> (ø)
examples/custom_entities.rs 0.00% <0.00%> (ø)
src/lib.rs 9.14% <0.00%> (-3.19%) ⬇️
src/de/escape.rs 19.84% <33.33%> (-1.21%) ⬇️
src/de/var.rs 81.63% <33.33%> (-3.48%) ⬇️
src/reader.rs 57.69% <65.00%> (-2.44%) ⬇️
src/de/mod.rs 77.00% <71.87%> (-1.33%) ⬇️
src/de/map.rs 73.01% <75.00%> (+0.45%) ⬆️
src/events/mod.rs 70.15% <76.47%> (+1.48%) ⬆️
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d34b779...0bb9922. Read the comment docs.

@dralley dralley force-pushed the buffer-decode branch 6 times, most recently from 2a9803c to 6b4a32b Compare July 16, 2022 04:51
@dralley
Copy link
Collaborator Author

dralley commented Jul 16, 2022

I would be fine with merging the commits incrementally.

@dralley dralley marked this pull request as ready for review July 16, 2022 20:09
@dralley
Copy link
Collaborator Author

dralley commented Jul 16, 2022

@Mingun These 5 commits are "ready", if you would like to review them now. There is a lot of temporary hacks in here that are already gone in the full patch series, but it would be an extra 1000+ lines added to this, and big PRs suck to review.

There are more patches here which mostly complete the transition to utf8 types, however, BytesTextStart is a little problematic and I'm not sure what to do with it yet. If you have any ideas, let me know.

false,
))
)) // TODO(dalley): this is temporary
Copy link
Collaborator Author

@dralley dralley Jul 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these temporaries are already gone later in the patch series, but as I said, it's another 1000 lines of changes

@dralley dralley changed the title [work-in-progress]: Perform all decoding internally, pass utf-8 types [work-in-progress]: Perform all decoding internally Jul 17, 2022
@dralley dralley marked this pull request as draft July 17, 2022 04:17
@dralley
Copy link
Collaborator Author

dralley commented Jul 17, 2022

Changing approach, working on top of the first half of the async PR.

@dralley dralley closed this Jul 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants