Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow better form handling and tag merging #6

Merged
merged 2 commits into from
Mar 27, 2018

Conversation

jsonn
Copy link
Contributor

@jsonn jsonn commented Mar 27, 2018

This is based on the discussion in #5 and #3. I'll leave further improving the cleanup of nested paragraphs for now, it would be nice to have, but it is not entirely trivial to do.

@matthiask matthiask merged commit bfde931 into matthiask:master Mar 27, 2018
@matthiask
Copy link
Owner

Looks mostly good except for that p-in-p thing. I'm a bit surprised, but haven't looked deeply into it yet, but we have tests covering p-in-p which still pass, so maybe some reordering of the cleaning steps would be required? Or start another pass?

@jsonn
Copy link
Contributor Author

jsonn commented Mar 27, 2018

That works because lxml.html.fromstring splits it up already. I'm not sure we really want to go via the full HTML chain again?

@matthiask
Copy link
Owner

matthiask commented Mar 27, 2018

The p-in-p test works, but for a different reason than I thought. I thought it would work because lxml's cleaner somehow understands that p-in-p does not make sense, but p-in-p's are removed because there was no text between adjacent opening p tags...

A possible fix looks like this, and changes to the test suite are minimal:

~/Projects/html-sanitizer$ git diff
diff --git a/html_sanitizer/sanitizer.py b/html_sanitizer/sanitizer.py
index a8436f0..9e2931e 100644
--- a/html_sanitizer/sanitizer.py
+++ b/html_sanitizer/sanitizer.py
@@ -204,7 +204,7 @@ class Sanitizer(object):
                 element.drop_tree()
                 continue
 
-            if element.tag == 'li':
+            if element.tag in {'li', 'p'}:
                 # remove p-in-li tags
                 for p in element.findall('p'):
                     if getattr(p, 'text', None):
diff --git a/html_sanitizer/tests.py b/html_sanitizer/tests.py
index 3e1e8d6..c2f3ac4 100644
--- a/html_sanitizer/tests.py
+++ b/html_sanitizer/tests.py
@@ -65,7 +65,7 @@ class SanitizerTestCase(TestCase):
             # Suboptimal, should be cleaned further
             (
                 '<form><p>Zeile 2</p></form>',
-                '<p><p>Zeile 2</p></p>',
+                '<p> Zeile 2 </p>',
             ),
         ] 

@jsonn
Copy link
Contributor Author

jsonn commented Mar 27, 2018

Looking at the behavior of web browsers, I'm not sure the current <p>-in-<li> is even desirable. Having paragraph breaks in <li> makes semantically sense, just like having paragraphs in a table makes sense. For example:

    <p>
      foo:
    <ul>
      <li><p> foo </p> <p> bar </p> </li>
    </ul>
      bar
    </p>

currently gives:

   <p> foo: </p>
   <ul>
      <li>  foo     bar   </li>
   </ul>
   bar

which seems suboptimal. This is slightly different if the <p> is the only child of the <li>. I wonder if a better approach would be to inline <p> if it is the only child of another block tag and also implement #4 by ensuring that if there is one block level child of a node, all of them are block level by adding <p> as necessary.

@matthiask
Copy link
Owner

Well, my CSS code can be made simpler by just prohibiting paragraphs inside list elements because there will be no double margins. I also think that one could argue that lists with long elements look bad, so I'd rather just avoid those.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants