backcompat parsing changes original document #104

kartikprabhu · 2018-05-26T21:14:38Z

Since mf2py does class substitutions for backcompat parsing, it modifies the original BeautifulSoup document passed to parse. I'm not sure this is a bug in practice yet, but it is a "quirk" for sure.

cc: @kevinmarks @snarfed @sknebel @bear

Example

In this example, the html variable holds the following HTML as a string:

<article class="hentry">
    <section class="entry-content">
        <p class="entry-summary">This is a summary</p> 
        <p>This is <a href="/tags/mytag" rel="tag">mytag</a> inside content. </p>
    </section>
</article>

Now run the following in Python:

>>> from bs4 import BeautifulSoup
>>> from mf2py import parse
>>> bs = BeautifulSoup(html)
>>> bs.article

This will output the original HTML

<article class="hentry">\n    <section class="entry-content">\n        <p class="entry-summary">This is a summary</p> \n        <p>This is <a href="/tags/mytag" rel="tag">mytag</a> inside content. </p>\n    </section>\n</article>

Now run

>>> parse(bs)
>>> bs.article

This will output the "modified" HTML

<article class="hentry h-entry">\n    <section class="entry-content">\n        <p class="entry-summary">This is a summary</p> \n        <p>This is <a href="/tags/mytag" rel="tag">mytag</a> inside content. </p>\n    </section>\n</article>

Note the following:

article gets an additional h-entry class from backcompat
None of the children get the corresponding mf2 class transformations
Calling parse(bs) again will give erroneous results, as it will skip all the properties!
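A quick way to check for this kind of side effect is to compare the serialized document before and after the call. The sketch below is only an illustration: mutates_document and add_h_entry are hypothetical names, and add_h_entry is a stand-in that mimics the observed symptom (an extra h-entry class) rather than mf2py's actual backcompat code.

```python
from bs4 import BeautifulSoup


def mutates_document(func, soup):
    """Return True if calling func(soup) changes the serialized document."""
    before = str(soup)
    func(soup)
    return str(soup) != before


def add_h_entry(soup):
    # Hypothetical stand-in for the backcompat step: it only reproduces the
    # observed side effect of adding "h-entry" next to "hentry".
    for el in soup.find_all(class_="hentry"):
        if "h-entry" not in el["class"]:
            el["class"] = el["class"] + ["h-entry"]


html = '<article class="hentry"><p class="entry-summary">This is a summary</p></article>'
bs = BeautifulSoup(html, "html.parser")
changed = mutates_document(add_h_entry, bs)
```

With mf2py installed, passing its parse function in place of add_h_entry should show whether a given version still modifies the document.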

Problem code

This happens because of https://github.com/microformats/mf2py/blob/master/mf2py/backcompat.py#L112 in backcompat, which creates a shallow copy of the element to apply the backcompat rules (BeautifulSoup does not support deepcopy yet). But somehow this does not affect the children of the element.

Possible solutions

Decide this is not a problem and leave it as is.
Change the entire document to its mf2 equivalent, i.e. don't make shallow copies while applying the mf1 to mf2 conversion rules. (This was the original behaviour, which I changed! My bad.)
Implement some workaround to make a deep copy and work on that for parsing, to completely preserve the original document.
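The last option could look roughly like the sketch below, which runs a function on a copy made with copy.deepcopy so the caller's document is never handed to it. run_on_copy and add_h_entry are hypothetical names used only for illustration, and add_h_entry is again a stand-in for the backcompat step, not mf2py's code. Whether deepcopy behaves well on a BeautifulSoup object can depend on the installed bs4 version, so treat this as an assumption to verify.

```python
import copy

from bs4 import BeautifulSoup


def run_on_copy(func, soup):
    """Call func on a deep copy so the caller's soup is left untouched."""
    working_copy = copy.deepcopy(soup)
    return func(working_copy)


def add_h_entry(soup):
    # Hypothetical stand-in for the backcompat step: adds "h-entry" to every
    # element that has the mf1 "hentry" class, then returns the document.
    for el in soup.find_all(class_="hentry"):
        if "h-entry" not in el["class"]:
            el["class"] = el["class"] + ["h-entry"]
    return soup


html = '<article class="hentry"><p class="entry-summary">This is a summary</p></article>'
bs = BeautifulSoup(html, "html.parser")
before = str(bs)
result = run_on_copy(add_h_entry, bs)
```

The trade-off is an extra copy of the whole tree on every parse, which is probably acceptable for typical page sizes.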


snarfed · 2018-05-26T21:29:28Z

great capturing and write-up!

seems very low priority to me, especially given other higher priority spec bug fixes and features on tap (whitespace, alt text, etc). i think just documenting it in the docstring is fine.

kartikprabhu · 2018-05-26T21:30:15Z

@snarfed I captured this because I have already fixed the other things in my fork ;)

kartikprabhu · 2018-05-27T18:51:27Z

fixed in experimental version by making a deepcopy of the BS doc given by user. The original doc is never changed by mf2py now.

https://github.com/kartikprabhu/mf2py/blob/42364eab436cfd6f9f4c2f3f93c5c740fcb85cfe/mf2py/parser.py#L111-L113

Zegnat mentioned this issue May 26, 2018

Do we change the DOMDocument instance that get passed in, and is this an issue? microformats/php-mf2#174

Closed

kartikprabhu added the backcompat label Jun 8, 2018

kartikprabhu mentioned this issue Jun 16, 2018

new version 1.1.1 #106

Merged

kartikprabhu closed this as completed Jul 7, 2018
