Speed really isn't that great #16
Comments
Any progress on this? |
Nope and this probably won't be getting attention for a long time as I have nothing driving that change. |
The parsing is HORRIBLY slow: I have packed gqsrc.zip so that it is reproducible. The only thing I changed in the source is to use "en_US.UTF8" instead of "en_US" (the locale fix). Simply unpack everything and compile. The program is very simple and is based on the example code:
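Roughly, it has the following shape (this is not the actual code from gqsrc.zip, just a minimal sketch; the gq::Document::Create / Parse / Find / GetNodeCount calls are written from memory of GQ's example code and may not match the current headers exactly):

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

#include <Document.hpp> // GQ header; the include path here is an assumption

int main(int argc, char* argv[])
{
    if (argc < 2) { return 1; }

    // Read the whole HTML file (passed as the first argument) into a string.
    std::ifstream in(argv[1], std::ios::binary);
    std::stringstream buffer;
    buffer << in.rdbuf();
    std::string html = buffer.str();

    auto t0 = std::chrono::steady_clock::now();
    auto doc = gq::Document::Create(); // assumed factory, per GQ's example code
    doc->Parse(html);                  // the step that takes >100 seconds here
    auto t1 = std::chrono::steady_clock::now();

    auto results = doc->Find("a");     // a trivial selector, just to touch the query path
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "parse: " << ms(t1 - t0).count() << " ms, "
              << "find: "  << ms(t2 - t1).count() << " ms, "
              << "matches: " << results.GetNodeCount() << std::endl;
    return 0;
}
```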
You can compile it in the unpacked directory
And yes, that is more than 100 seconds using all my CPU resources on an i7-7700HQ!!! If I turn on heavy optimization with
which is still a lot. For comparison, the following Julia code is doing the same (use
and takes this amount of time on a cold start (Julia is JIT-compiled):
and this amount of time on repeated evaluations:
This is less than a second. I wanted to switch to C++ for more speed and lower resource usage... |
I did the same for gumbo-query and packed gumboquerysrc.zip. With the program
I get the following results.
and from
I have to admit that I like the GQ interface with namespaces more than that of gumbo-query, but without a performance improvement, GQ is not competitive. |
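For reference, the gumbo-query side of the test follows that library's README-style usage, roughly like the sketch below (class and method names are written from memory and are assumptions; they may differ in the checked-out version):

```cpp
#include <string>

#include "Document.h" // gumbo-query headers; include paths assumed
#include "Node.h"

// Minimal gumbo-query usage as recalled from its README example.
void RunGumboQuery(const std::string& page)
{
    CDocument doc;
    doc.parse(page.c_str());

    CSelection sel = doc.find("a");
    if (sel.nodeNum() > 0)
    {
        CNode node = sel.nodeAt(0);
        (void)node.text(); // pull out some text just to exercise the node API
    }
}
```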
Please parse with the raw gumbo interface for comparison, and then pass the constructed gumbo output to the appropriate constructor for the GQ document. Please benchmark the raw gumbo parsing and the construction of the GQ document separately. Please also post the results on your machine of the exact example program without modifications (except the locale fix, I guess). Parsing takes a hit because of the tree I build for rapid selector execution, but whatever is happening here obviously isn't right, and I've never seen this before. Also, you said GQ isn't competitive with gumbo-query. It was never meant to be a competition. gumbo-query is so full of bugs I can't even tell you; just try to run some of the tests I've made here with it and they'll fail entirely. |
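A sketch of the split measurement being asked for here, using gumbo-parser's plain C API (gumbo_parse / gumbo_destroy_output); the hand-off to the GQ document constructor is left as a placeholder, since its exact signature isn't quoted in this thread:

```cpp
#include <chrono>
#include <string>

#include <gumbo.h> // raw gumbo-parser C API

// Times only the raw gumbo parse of an HTML string, so its cost can be
// compared against GQ's post-parse document/tree construction.
double TimeRawGumboParseMs(const std::string& html)
{
    auto start = std::chrono::steady_clock::now();
    GumboOutput* output = gumbo_parse(html.c_str());
    auto end = std::chrono::steady_clock::now();

    // To time GQ's side separately, hand `output` to GQ's document
    // constructor here and wrap that call in its own timing checkpoints
    // (left as a placeholder; the exact constructor/factory is not shown
    // in this thread).

    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```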
Using a modified
The new
Note that I cannot use the exact example program because it does not provide the functionality to load a string from a file and it does not have timing checkpoints. But you can see that the program I use is basically the exact example program plus time-measuring functionality and the ability to load a string from a file, so nothing special. I packed it in gqsrc.zip with three test HTML files, which can be passed as the first command line argument to the executable to be generated with
For a medium-sized HTML file (160 KB)
For a small HTML file (75 KB)
I think the algorithm used in |
@phinzphinz Ah, OK, so there it is: it's the size variation. That whole tree system I made was sort of whipped up in the moment without any planning. I'm thinking I need to just blow that away and let the queries execute dynamically like gumbo-query does. Thanks for the reports and for doing the testing. I'm pretty piled up with work right now, but I'll see if I can't get around to looking at this. Pretty sure my goal at this point is going to be to delete the Tree class files and fix things until it compiles and works again. |
@TechnikEmpire Here's code:
Here's the log: the HTML is about 442 KB. It takes 10 seconds to parse the HTML and 2.4 seconds to extract the information. This is too slow for my application. |
OK, I'm going to actually look at this today because that's brutal. I merged in a PR or two without testing; that's about all I can figure in terms of what went wrong. I've been looking at myhtml, and it has a selector engine in Modest. This project might get killed and become a wrapper around that if this issue is that serious. Last time I benchmarked this code, it took 83 msec to parse Yahoo's landing page and build the selector tree/cache. So I dunno what's happened, but I'll take some action on this issue at some point today. |
Is everyone here testing on Linux? |
I'm using GQ on Windows. |
@sinall Thanks. |
OK folks, well, you have to weigh in. I can bring parsing back to basically raw gumbo parsing speed, but the consequence is that the super fast selector system is RIP in peace. Benchmark output:
Time taken to run 27646 selectors against the document: 75036.2 ms producing 1685 total matches.
Processed at a rate of 2.71418 milliseconds per selector or 0.368436 selectors per millisecond.
Benchmarking mutation.
Time taken to run 27646 selectors against the document while serializing with mutations 1 times: 74865.3 ms.
Time per cycle 74865.3 ms.
Processed at a rate of 2.708 milliseconds per selector or 0.369277 selectors per millisecond.
Not included here: the parsing stage was ~180 msec IIRC. Those are the stats using the huge >2 MB HTML sample @phinzphinz submitted. It takes nearly 3 msec to run a single selector through the entire parsed body. I guess it's up to what people want. To me, this makes the whole project garbage and a waste of time, but that's just my opinion. |
@sinall Oh, definitely don't use debug lol, debug is absolutely horridly slow. Test speeds in release. Anyway I'm working on this. |
@TechnikEmpire I tried release mode, here's the log:
This is acceptable for me at the moment, and I'm looking forward to your fast selector system in the future. |
@sinall There is definitely an issue that is killing performance, and I have identified it. I'm working on rewriting this stuff, but it's core functionality, so it's extensive work. In the meantime, smaller HTML shouldn't be a big deal. |
My advice to everyone here is to switch to myhtml (or whichever bundle of his has the selectors). The design of Gumbo is such that internals are kept away from the user entirely, and to make this work properly, without a slowdown as HTML size increases, I would need to make invasive changes to Gumbo. Frankly, I can't decide which is more painful: writing my own compliant parser in C++ based on the current spec, or being forced to maintain an abandoned C library written with precisely zero comments detailing anything. |
It would seem that our tree building process is quite expensive. IIRC, when I looked into this some time ago, it wasn't gumbo's parsing speed really dragging things down, but rather the tree building we do post-gumbo parsing.
Should be able to speed that up.