Allow better form handling and tag merging #6

jsonn · 2018-03-27T19:24:49Z

This is based on the discussion in #5 and #3. I'll leave further improvement of the nested-paragraph cleanup for now; it would be nice to have, but it is not entirely trivial to do.

matthiask · 2018-03-27T19:39:29Z

Looks mostly good except for that p-in-p thing. I haven't looked deeply into it yet, but I'm a bit surprised: we have tests covering p-in-p which still pass, so maybe some reordering of the cleaning steps would be required? Or start another pass?

jsonn · 2018-03-27T20:00:14Z

That works because lxml.html.fromstring splits it up already. I'm not sure we really want to go via the full HTML chain again?

matthiask · 2018-03-27T20:51:38Z

The p-in-p test works, but for a different reason than I thought. I assumed it worked because lxml's cleaner somehow understands that p-in-p does not make sense, but the nested paragraphs are actually removed only because there is no text between the adjacent opening p tags...

A possible fix looks like this, and changes to the test suite are minimal:

~/Projects/html-sanitizer$ git diff
diff --git a/html_sanitizer/sanitizer.py b/html_sanitizer/sanitizer.py
index a8436f0..9e2931e 100644
--- a/html_sanitizer/sanitizer.py
+++ b/html_sanitizer/sanitizer.py
@@ -204,7 +204,7 @@ class Sanitizer(object):
                 element.drop_tree()
                 continue
 
-            if element.tag == 'li':
+            if element.tag in {'li', 'p'}:
                 # remove p-in-li tags
                 for p in element.findall('p'):
                     if getattr(p, 'text', None):
diff --git a/html_sanitizer/tests.py b/html_sanitizer/tests.py
index 3e1e8d6..c2f3ac4 100644
--- a/html_sanitizer/tests.py
+++ b/html_sanitizer/tests.py
@@ -65,7 +65,7 @@ class SanitizerTestCase(TestCase):
             # Suboptimal, should be cleaned further
             (
                 '<form><p>Zeile 2</p></form>',
-                '<p><p>Zeile 2</p></p>',
+                '<p> Zeile 2 </p>',
             ),
         ]
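To make the effect of the one-line change concrete, here is a rough standard-library sketch of "splice child paragraphs into their parent". It is not the html_sanitizer code, which operates on lxml elements; the helper name `unwrap_child_paragraphs` is hypothetical, and the space padding is an assumption inferred from the expected test output above.

```python
import xml.etree.ElementTree as ET


def unwrap_child_paragraphs(parent):
    """Splice every direct <p> child into parent, keeping text and children."""
    for p in list(parent.findall("p")):
        if p.text:
            # Assumption: pad with spaces so adjacent paragraphs do not fuse.
            p.text = " " + p.text + " "
        children = list(parent)
        index = children.index(p)
        prev = children[index - 1] if index > 0 else None

        # The paragraph's text moves to whatever precedes the paragraph.
        if prev is None:
            parent.text = (parent.text or "") + (p.text or "")
        else:
            prev.tail = (prev.tail or "") + (p.text or "")

        # The paragraph's tail moves behind its last child, or behind its text.
        grandchildren = list(p)
        if grandchildren:
            last = grandchildren[-1]
            last.tail = (last.tail or "") + (p.tail or "")
        elif prev is None:
            parent.text = (parent.text or "") + (p.tail or "")
        else:
            prev.tail = (prev.tail or "") + (p.tail or "")

        # Replace the paragraph by its own children, preserving order.
        parent.remove(p)
        for offset, child in enumerate(grandchildren):
            parent.insert(index + offset, child)


root = ET.fromstring("<p><p>Zeile 2</p></p>")
unwrap_child_paragraphs(root)
print(ET.tostring(root, encoding="unicode"))  # <p> Zeile 2 </p>
```

Applied to an `li` parent, the same helper gives the existing p-in-li behavior, which is why extending the tag check to `{'li', 'p'}` is enough.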

jsonn · 2018-03-27T21:26:43Z

Looking at the behavior of web browsers, I'm not sure the current <p>-in-<li> handling is even desirable. Having paragraph breaks in <li> makes semantic sense, just like having paragraphs in a table makes sense. For example:

    <p>
      foo:
    <ul>
      <li><p> foo </p> <p> bar </p> </li>
    </ul>
      bar
    </p>

currently gives:

   <p> foo: </p>
   <ul>
      <li>  foo     bar   </li>
   </ul>
   bar

which seems suboptimal. This is slightly different if the <p> is the only child of the <li>. I wonder if a better approach would be to inline <p> if it is the only child of another block tag and also implement #4 by ensuring that if there is one block-level child of a node, all of them are block level, adding <p> as necessary.
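A minimal sketch of the two rules proposed here, using the standard library rather than lxml. Everything in it is an assumption for illustration: the function name `normalize_block_children`, the contents of `BLOCK_TAGS`, and the choice to drop whitespace-only text between blocks. The caller is assumed to pass a block-level parent.

```python
import xml.etree.ElementTree as ET

# Assumed, incomplete set of block-level tags for the sketch.
BLOCK_TAGS = {"p", "ul", "ol", "li", "table", "blockquote", "div"}


def normalize_block_children(parent):
    children = list(parent)

    # Rule 1: a <p> that is the only content of its parent is inlined.
    if (len(children) == 1 and children[0].tag == "p"
            and not (parent.text or "").strip()
            and not (children[0].tail or "").strip()):
        only = children[0]
        parent.remove(only)
        parent.text = only.text
        parent.extend(list(only))
        return

    # Rule 2: if any child is block level, wrap loose inline runs in <p>.
    if not any(child.tag in BLOCK_TAGS for child in children):
        return

    rebuilt = []

    def open_wrapper(text):
        p = ET.Element("p")
        p.text = text
        rebuilt.append(p)
        return p

    wrapper = None
    if (parent.text or "").strip():
        wrapper = open_wrapper(parent.text)
    parent.text = None

    for child in children:
        parent.remove(child)
        if child.tag in BLOCK_TAGS:
            # A block child ends the current inline run.
            wrapper = None
            tail, child.tail = child.tail, None
            rebuilt.append(child)
            if (tail or "").strip():
                wrapper = open_wrapper(tail)
        else:
            # Inline child: keep it (and its tail) inside the current run.
            if wrapper is None:
                wrapper = open_wrapper(None)
            wrapper.append(child)

    parent.extend(rebuilt)
```

With these rules, `<li><p>foo</p></li>` would collapse to `<li>foo</li>`, while an `li` holding two paragraphs would keep both, which avoids the flattened output shown above.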

matthiask · 2018-03-29T14:17:10Z

Well, my CSS code can be made simpler by just prohibiting paragraphs inside list elements because there will be no double margins. I also think that one could argue that lists with long elements look bad, so I'd rather just avoid those.

jsonn added 2 commits March 27, 2018 21:02

Translate <form> into <p> and disable LXML's form cleanup.

8ef89a8

Add hook for deciding whether two tags can be merged

bfde931

matthiask merged commit bfde931 into matthiask:master Mar 27, 2018

matthiask mentioned this pull request Mar 27, 2018

Allow Cleaner use with forms=False #5

Closed
