Improperly closed tags break output #152

Closed
AlenToma opened this issue Sep 15, 2021 · 5 comments

Comments

@AlenToma

AlenToma commented Sep 15, 2021

For some reason, I am unable to parse the following HTML:

<div>
<div id="chr-content">
<span>
  lkjasdkjasdkljakldj
</div>
</div>

My current Code is

  const validate =()=> {
       // Brackets are escaped so the literal placeholders [[class]] and [[id]] are matched
       html = html.replace(/<!DOCTYPE html>/g, "").replace(/\[\[class\]\]/g, "").replace(/\[\[id\]\]/g, "")
       var container = parse("<div>" + html + "</div>");
       var content = container.querySelector("#chr-content");
       console.log(content != null ? "found" : "not found")
  }
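One thing to watch in `replace` calls like these: square brackets are regex metacharacters, so a pattern has to escape them to match literal text. The sketch below assumes the intent is to strip literal `[[class]]` and `[[id]]` placeholders (the thread does not confirm this), and the `sample` string is made up for illustration.

```javascript
// Hypothetical input containing literal [[class]] and [[id]] placeholders.
const sample = '<div class="[[class]]" id="[[id]]">text</div>';

// Unescaped: "[[class]" is a character class (one of "[", "c", "l", "a", "s")
// followed by a literal "]", so only the "s]" pair is removed.
const unescaped = sample.replace(/[[class]]/g, "");

// Escaped: each bracket is matched literally, so the whole placeholder is removed.
const escaped = sample
  .replace(/\[\[class\]\]/g, "")
  .replace(/\[\[id\]\]/g, "");

console.log(unescaped); // <div class="[[clas]" id="[[id]]">text</div>
console.log(escaped);   // <div class="" id="">text</div>
```

This is separate from the missing-element problem reported here, but it can make the preprocessing step silently do something other than intended.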

This is on Node.js with the latest version 4.1.4

Here is a Snack example I created that reproduces the problem, if you would like to test it:

snack

Right now I am using react-native-html-parser together with this library to repair the incorrect HTML that is missing end tags.

It seems that node-html-parser simply ignores the broken markup, rewrites the HTML, and for some reason removes <div id="chr-content">.


Contributor's Note

Although this library was built with the known limitation of requiring proper HTML, we are looking at revising the logic in a way which will not impact performance but will be able to more reasonably handle issues of unmatched open and close tags.

This issue will be left open until that has been addressed.

@nonara

@taoqf
Owner

taoqf commented Sep 17, 2021

This lib is not supposed to deal with incorrect HTML. I am sorry about that. If you can fix this, I am happy to merge your PR.

taoqf closed this as completed on Sep 17, 2021
taoqf reopened this on Sep 18, 2021
nonara changed the title from "Unable to rightly parse html when it contain some html tags with no ends" to "Improperly closed tags break output" on Sep 29, 2021
@AlenToma
Author

AlenToma commented Feb 1, 2022

I have checked the code and noticed something that causes the output to break.

Check this line

// Single error  <div> <h3> </div> handle: Just removes <h3>
oneBefore.removeChild(last);

This removes the child and its content. Why don't we just close it?
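As a rough illustration of what "just close it" could mean, here is a toy tokenizer with a stack of open elements that implicitly closes unclosed tags when an ancestor's closing tag arrives. This is a sketch, not node-html-parser's implementation: the function name `serializeWithImplicitClose` is hypothetical, and it ignores void elements, comments, and tag-name case.

```javascript
// Toy recovery strategy: close unclosed elements instead of dropping them.
// NOT node-html-parser's algorithm; all names here are hypothetical.
function serializeWithImplicitClose(html) {
  // Matches either a tag (optional slash, name, attributes) or a run of text.
  const tokenRe = /<(\/?)([a-zA-Z][\w-]*)([^>]*)>|([^<]+)/g;
  const open = []; // names of the currently open elements
  let out = "";
  let m;
  while ((m = tokenRe.exec(html)) !== null) {
    const [, slash, name, attrs, text] = m;
    if (text !== undefined) {
      out += text;
    } else if (!slash) {
      open.push(name);
      out += `<${name}${attrs}>`;
    } else {
      const idx = open.lastIndexOf(name);
      if (idx === -1) continue; // stray closing tag: ignore it
      // Close everything opened after the matching element, innermost first,
      // then the matching element itself.
      while (open.length > idx) {
        out += `</${open.pop()}>`;
      }
    }
  }
  // Close anything still open at the end of the input.
  while (open.length) out += `</${open.pop()}>`;
  return out;
}

const fixed = serializeWithImplicitClose(
  '<div><div id="chr-content"><span>text</div></div>'
);
console.log(fixed);
// <div><div id="chr-content"><span>text</span></div></div>
```

With this approach the unclosed `<span>` is closed as soon as its parent's `</div>` is seen, so `<div id="chr-content">` stays in the output.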

@AlenToma
Author

AlenToma commented Feb 1, 2022

Hi again.
I inspected the code above and ran some tests.

I do not understand why the last element is removed, since the unclosed tags have already been found at that point.

Anyway, here is a possible solution that worked for me.

I added a parseNoneClosedTags option

and changed the code to the following:

export function parse(data: string, options = { lowerCaseTagName: false, comment: false } as Partial<Options>) {
	const stack = base_parse(data, options);
	const [root] = stack;
	while (stack.length > 1) {
		// Handle each error elements.
		const last = stack.pop();
		const oneBefore = arr_back(stack);
		if (last.parentNode && last.parentNode.parentNode) {
			if (last.parentNode === oneBefore && last.tagName === oneBefore.tagName) {
				// Pair error case <h3> <h3> handle: Fixes to <h3> </h3>
				// This is wrong, because it moves the <h3> out of its correct position, which should be inside the current HTML element. See issue 152 for more info.
				if (options.parseNoneClosedTags !== true) {
					oneBefore.removeChild(last);
					last.childNodes.forEach((child) => {
						oneBefore.parentNode.appendChild(child);
					});
					stack.pop();
				} 
		
			} else {
			
				// Single error <div> <h3> </div> handle: Just removes <h3>
				// Why remove it? It is already an HTMLElement, and the missing <H3> is already added in this case. See issue 152 for more info.
				if (options.parseNoneClosedTags !== true) {
					oneBefore.removeChild(last);
					last.childNodes.forEach((child) => {
						oneBefore.appendChild(child);
					});
				}
			}
		} else {
			// If it's final element just skip.
		}
	}

	return root;
}
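To make the effect of the flag easier to see, the cleanup loop can be replayed on plain objects. This is only a sketch: `makeNode`, `removeChild`, `appendChild`, `cleanup`, `buildStack`, and `hasId` are stand-ins written for illustration, and the starting stack assumes that `base_parse` left all three unclosed elements on it, which this thread does not show.

```javascript
// Plain-object stand-ins for the parser's nodes (hypothetical, illustration only).
function makeNode(tagName, id, parentNode) {
  const node = { tagName, id, parentNode: parentNode || null, childNodes: [] };
  if (parentNode) parentNode.childNodes.push(node);
  return node;
}

function removeChild(parent, child) {
  parent.childNodes = parent.childNodes.filter((c) => c !== child);
}

function appendChild(parent, child) {
  child.parentNode = parent;
  parent.childNodes.push(child);
}

// Mirrors the cleanup loop from the patched parse(), on the stand-in nodes.
function cleanup(stack, options = {}) {
  const [root] = stack;
  while (stack.length > 1) {
    const last = stack.pop();
    const oneBefore = stack[stack.length - 1];
    if (!(last.parentNode && last.parentNode.parentNode)) continue;
    if (options.parseNoneClosedTags === true) continue; // keep unclosed elements
    if (last.parentNode === oneBefore && last.tagName === oneBefore.tagName) {
      // Pair case: drop `last` and hoist its children to the grandparent.
      removeChild(oneBefore, last);
      last.childNodes.forEach((child) => appendChild(oneBefore.parentNode, child));
      stack.pop();
    } else {
      // Single case: drop `last` and hoist its children into the parent.
      removeChild(oneBefore, last);
      last.childNodes.forEach((child) => appendChild(oneBefore, child));
    }
  }
  return root;
}

// Assumed stack for <div><div id="chr-content"><span> with nothing closed.
function buildStack() {
  const root = makeNode(null, null, null);
  const outer = makeNode("DIV", null, root);
  const content = makeNode("DIV", "chr-content", outer);
  const span = makeNode("SPAN", null, content);
  return [root, outer, content, span];
}

function hasId(node, id) {
  return node.id === id || node.childNodes.some((c) => hasId(c, id));
}

const withoutFlag = hasId(cleanup(buildStack()), "chr-content");
const withFlag = hasId(cleanup(buildStack(), { parseNoneClosedTags: true }), "chr-content");
console.log(withoutFlag, withFlag); // false true
```

Under that assumption, the default path removes the `#chr-content` element in the pair case, which is consistent with the report, while the flag leaves the tree untouched.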

And here is the test for this issue, which passes.

const { parse } = require('@test/test-target');

describe('issue 152', function () {
	it('should parse attributes right', function () {
		const html = `<div>
<div id="chr-content">
<span>
  lkjasdkjasdkljakldj
</div>
</div>`;
		const expected = `<div>
<div id="chr-content">
<span>
  lkjasdkjasdkljakldj

</span></div></div>`;

		const root = parse(html, { parseNoneClosedTags: true });
		root.toString().should.eql(expected);
		// const div = root.firstChild;
		// div.getAttribute('#input').should.eql('');
		// div.getAttribute('(keyup)').should.eql('applyFilter($event)');
		// div.getAttribute('placeholder').should.eql('Ex. IMEI');
		// root.innerHTML.should.eql(html);
	});
});

Could you please have a look and let me know whether this could work? If it does, please merge it and publish it on npm so we can use it.

@taoqf
Owner

taoqf commented Mar 12, 2022

Merged your code in v5.2.2.

taoqf closed this as completed on Mar 12, 2022
@VityaSchel

Please tell me if there is a library that can parse malformed HTML.
