Invalid HTML while parsing inline HTML string #13
Comments
@sir4ju1 Hmm so the parse with options isn't really any different because IIRC the plain
<!DOCTYPE html>
<html>
<body>
<div>
<p>1 <a> some link </a></p>
<p>2</p>
<p>3</p>
<p>4</p>
<p>5</p>
<p>6</p>
</div>
</body>
</html>
Perhaps my handling of null/empty root node(s) by throwing an exception is not appropriate. My use case for this library was parsing "in the wild" HTML straight off websites for a content filter. This library builds a static tree/index of everything queryable with selectors for fast matching, and starts out at the root node working downward. |
Full HTML code is not required if I use
Now my initial code works fine and it can also fetch all 3 nodes. Thanks for the library! |
@sir4ju1 Ahh, indeed, locale construction under Linux. I accepted a PR lately where I think this was mentioned, but I have not tested it there. I'll have a look at the gumbo_parse issue you mention. There might be some extra restrictions imposed on the validity of the input HTML due to the way my version of the selector engine works. The original gumbo-query simply created and searched things dynamically. My project builds a one-time index in order to make selection as fast as possible. But I'll check it out and report back, thanks. |
I also had this problem. I uncommented HTML5's default character encoding is UTF-8, so I think @sir4ju1's modification is the right solution. Thanks! |
I forgot about this issue so thanks for raising it here again. Been overwhelmed with work. |
I get
The exception comes from
Then I tried to compile a new simple program to check the command:
It also throws the same exception. I do not know anything about locales but I searched a bit and I suspect that instead of "en_US", one should use one of the lines of the file
It does not contain "en_US" but "en_US.UTF8". The following program runs without problems:
So maybe one should change |
I'll look at how Boost handles this. The problem IIRC is that on Linux / standard C++ there's no default constructor for locale. Boost, however, does provide a default initialization of std::locale per platform IIRC, so I'll just copy whatever they're doing. This whole naming convention stuff is ridiculous. |
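For anyone hitting the same locale exception, here is a minimal sketch of a fallback, not the library's actual fix. The helper name first_available_locale is hypothetical, and the idea of trying several spellings before falling back to the classic "C" locale is an assumption about what a portable workaround could look like. std::locale throws std::runtime_error when a name is not installed on the system, which is what the sketch relies on.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of this library): returns the first locale
// from `candidates` that the system can construct. If none of the names is
// available, it falls back to the classic "C" locale instead of throwing.
std::locale first_available_locale(const std::vector<std::string>& candidates)
{
    for (const auto& name : candidates) {
        try {
            // Throws std::runtime_error if `name` is not a valid locale here.
            return std::locale(name);
        } catch (const std::runtime_error&) {
            // This spelling is not installed; try the next candidate.
        }
    }
    return std::locale::classic();
}
```

A caller could pass something like {"en_US.UTF8", "en_US.UTF-8", "en_US"} so that whichever spelling the platform knows is picked up. Which names are actually installed differs between systems.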
It works for me, if I use "en_US.UTF8" instead of "en_US" in the constructor of Parser in |
Here is my code:
Error Thrown: In Document::Parse(const std::string&) - Failed to generate any HTML nodes from parsing process. The supplied string is most likely invalid HTML.
It used to parse correctly before on gumbo-query with selector mistake.
I can see that Document.cpp uses gumbo_parse_with_options, whereas the previous repo was using gumbo_parse. Not sure if this has anything to do with it?