Bug: outputs "corrected" HTML, even if input was sloppy #53

Open
XLTechie opened this issue Aug 27, 2022 · 1 comment

XLTechie commented Aug 27, 2022

htmlq "purifies" loosely written HTML, even when that isn't desirable.

Example input:

<h3 class=subhead>Some Heading</h3>

When the .subhead class is selected, the heading is returned as:

<h3 class="subhead">Some Heading</h3>

That's fine if you want to render the result in a browser. But if you're using it as a search-and-replace pattern for awk, sed, or fsed, as I am, the pattern fails to match the source because of the quotes that htmlq added.

In short, htmlq reconstructs the HTML to be more spec-correct, and in doing so it breaks character-for-character matches between the input and the otherwise unchanged parts of the output.

N.B. While it would still be a problem, I wouldn't care about this so much if #36 were implemented.

Maybe htmlq needs a --purify or --no-purify option?
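
Independent of any such option, the mismatch described above can be worked around on the consuming side. The following is only a minimal sketch in Python, not anything htmlq provides: the names `QUOTED_VALUE` and `quote_tolerant_pattern` are hypothetical, and it assumes the only difference between source and output is the double quotes added around attribute values.

```python
import re

# An attribute value that was wrapped in double quotes but could also
# have been written unquoted (no whitespace, quotes, =, <, > or backtick).
QUOTED_VALUE = re.compile(r'''="([^\s"'=<>`]+)"''')


def quote_tolerant_pattern(fragment):
    """Build a regex that matches `fragment` whether its attribute values
    are double-quoted, single-quoted, or unquoted in the source."""
    parts, pos = [], 0
    for m in QUOTED_VALUE.finditer(fragment):
        value = re.escape(m.group(1))
        # Everything before the attribute value is matched literally.
        parts.append(re.escape(fragment[pos:m.start()]))
        # The value itself may appear in any of the three quoting styles.
        parts.append('=(?:"%s"|\'%s\'|%s)' % (value, value, value))
        pos = m.end()
    parts.append(re.escape(fragment[pos:]))
    return re.compile("".join(parts))


source = '<p>intro</p><h3 class=subhead>Some Heading</h3>'
htmlq_output = '<h3 class="subhead">Some Heading</h3>'

# The literal output is not a substring of the source...
assert htmlq_output not in source
# ...but the tolerant pattern finds the original, unquoted text.
match = quote_tolerant_pattern(htmlq_output).search(source)
assert match.group(0) == '<h3 class=subhead>Some Heading</h3>'
```

This does not cover other kinds of rewriting, such as inserted tags or changed whitespace, so it is a stopgap rather than a replacement for a flag.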

@BobBorges

I also get extra tags in my output:

<table><tr>

becomes

<table><tbody><tr>

I'm using this tool to look at source documents (think sloppy HTML running to tens or hundreds of thousands of characters, with no whitespace, line breaks, or indentation) in order to figure out the structure and extract the contents in a reasonable way. More than once I've pulled my hair out wondering why I can't find the tbody element, only to discover that it isn't in the source.

+1 for a flag option
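
For checking which tags are literally present in a source document, one option is a tokenizer that does not build a tree. The sketch below uses `HTMLParser` from Python's standard library, which reports tags as it encounters them and does not add implied elements such as tbody. The names `SourceTagLister` and `source_tags` are made up for illustration and have nothing to do with htmlq.

```python
from html.parser import HTMLParser


class SourceTagLister(HTMLParser):
    """Record start tags in the order they appear in the source text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names but does not invent any tags.
        self.tags.append(tag)


def source_tags(html):
    """Return the list of start tags actually written in `html`."""
    lister = SourceTagLister()
    lister.feed(html)
    lister.close()
    return lister.tags


tags = source_tags('<table><tr><td>x</td></tr></table>')
assert tags == ['table', 'tr', 'td']
assert 'tbody' not in tags
```

If tbody shows up in htmlq's output but not in this list, it was added during parsing rather than written in the source.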
