Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backcompat parsing changes original document #104

Closed
kartikprabhu opened this issue May 26, 2018 · 3 comments
Closed

backcompat parsing changes original document #104

kartikprabhu opened this issue May 26, 2018 · 3 comments

Comments

@kartikprabhu
Copy link
Member

kartikprabhu commented May 26, 2018

Since mf2py does class substitutions for backcompat parsing it changes the original BeautifulSoup document given to parse. Not sure this is a bug for usage yet, but is a "quirk" for sure.

cc: @kevinmarks @snarfed @sknebel @bear

Example

The following is an example, with the html variable being the following HTML as a string

<article class="hentry">
    <section class="entry-content">
        <p class="entry-summary">This is a summary</p> 
        <p>This is <a href="/tags/mytag" rel="tag">mytag</a> inside content. </p>
    </section>
</article>

Now run the following in python

>>> from bs4 import BeautifulSoup
>>> from mf2py import parse
>>> bs = BeautifulSoup(html)
>>> bs.article

This will output the original HTML

<article class="hentry">\n    <section class="entry-content">\n        <p class="entry-summary">This is a summary</p> \n        <p>This is <a href="/tags/mytag" rel="tag">mytag</a> inside content. </p>\n    </section>\n</article>

Now run

>>> parse(bs)
>>> bs.article

This will output the "modified" HTML

<article class="hentry h-entry">\n    <section class="entry-content">\n        <p class="entry-summary">This is a summary</p> \n        <p>This is <a href="/tags/mytag" rel="tag">mytag</a> inside content. </p>\n    </section>\n</article>

Note the following:

  1. article gets an additional h-entry class from backcompat
  2. None of the children get the corresponding mf2 class transformations
  3. using >>> parse(bs) again will give erroneous results as it will skip all the properties!

Problem code

This happens because of https://github.com/microformats/mf2py/blob/master/mf2py/backcompat.py#L112 in backcompat. This creates a shallow copy of the element to apply the backcompat rules (BeautifulSoup does not support deepcopy yet.) But this does not affect the children of the element somehow.

Possible solutions

  1. This is not a problem and leave it as is.
  2. Change the entire document to mf2 equivalent i.e. don't make shallow copies while applying mf1 to mf2 conversion rules. (This was the original behaviour which I changed! my bad.)
  3. Implement some workaround to make a deep copy and work on that for parsing to completely preserve the original document.
@snarfed
Copy link
Member

snarfed commented May 26, 2018

great capturing and write-up!

seems very low priority to me, especially given other higher priority spec bug fixes and features on tap (whitespace, alt text, etc). i think just documenting it in the docstring is fine.

@kartikprabhu
Copy link
Member Author

@snarfed I captured this because I have already fixed the other things in my fork ;)

@kartikprabhu
Copy link
Member Author

fixed in experimental version by making a deepcopy of the BS doc given by user. The original doc is never changed by mf2py now.

https://github.com/kartikprabhu/mf2py/blob/42364eab436cfd6f9f4c2f3f93c5c740fcb85cfe/mf2py/parser.py#L111-L113

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants