-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Any reason to make parser methods private? #56
Comments
Thanks for your interest in the project. I have no qualms with making private methods protected. Send me a pull request and I’ll merge it. As for the streaming parser (still not a real streaming parser as it will only work for streams to files – which are rewindable and advancable at will): that would be a wonderful addition and a very welcome change. I see two main obstacles (as the project currently stands):
On a related note: What this project currently does not do is the parsing of selectors and the like. It even breaks fundamentally on something like |
Re the streaming parser: you are actually incorrect about the need for the stream to be a file resource. We use PHP temporary streams all over the place. Tremendously useful for things like reading data from an IMAP server. Since mail attachments can, and frequently are, large data chunks - and often base64 encoded - they would normally take 10-20 MB to store in a PHP string and then another 10-20 MB to decode using the built-in PHP base 64 decoding function. By instead feeding the IMAP server output directly into a PHP temporary stream, we guarantee that no more than 2 MB of memory is ever used. And the base64 decoding is done on-the-fly, so there is no need to duplicate data. Thus we can handle 10+MB attachments with only a 2MB hit to memory - useful since we often have 1000's of connections a second on a single machine. Now for CSS parsing, the odds that someone has a 2MB+ CSS data string is very small so memory savings is not a killer feature in this regard. The flip side is that you are treating a string like a stream currently anyway - doing tokenization character by character - so using streams just makes sense from a coding perspective. So there's two things that could be done:
However, the problem with 2 is that, by allowing non-ASCII input, you do have to write your stream to correctly parse multibyte characters. This is annoying. (Aside: we do string tokenization for IMAP data, but the nice thing about IMAP data is that it is ASCII only so I don't have to worry about this). It shouldn't be that hard to find a stream class/code example that handles UTF-8 stream data. But this becomes progressively more difficult if you want to support other multibyte charsets. From our project's perspective, we don't need any of the multibyte stuff (or, at the most, anything outside of UTF-8) since we require UTF-8 for everything and don't use non-ASCII stuff in any of our CSS files. So if this is going to be too complicated to add upstream, I'll stick with the conversion of private -> protected and do what we need in our code. However, since you have shown interest in having stream handling code here, it is worth my time to work with you and see if we can come up with something. |
There are different definitions of what a streaming parser is but I was thinking of a stream that possibly may not yet be read till the very end. Such parsers may stop at any given moment and wait for more data to become available. They can either do this using asynchronous I/O or on a thread by waiting. Since neither is possible with PHP, we can’t do this kind of streaming parser. But of course you’re right: streams whose data is available at any given time are not limited to point to files. I think if we’re going to implement the changes necessary to deal with a stream instead of a string, we might as well do it in the base class. |
I've submitted a PR for the |
A bit of background (if you are wondering why I've suddenly shown up and made a bunch of pull requests :) ):
I'm a dev on the Horde project. We finally got fed up with our csstidy parser module and decided to look for code that could actually parse CSS3, and do it in a coherent fashion (i.e. stateful parsing, not preg crap). Found this project and am happy. We are going to package the parser in a PEAR package to be able to be used in our framework (i.e. autoloading is setup, etc.).
As such, one of the things I would like to see is stream support instead of reading all data into memory. We use streams all over the place to reduce memory usage - since we have installations that have millions of users, we need to cut down memory usage as much as possible.
One way to do that is to open CSS files with fopen() and then pass the stream resource to the CSS parser instead of reading the whole file into memory (which can be 200KB+ of data). I would then write a local Horde wrapper which extends Parser to overwrite the strlen, substr (and new strpos method I just added a pull request for) to access the stream instead of making them string operations.
However, with those methods being marked as private, there is no way to do that. As a project, we mark everything as protected for precisely this reason - so end users can extend our classes as they see fit.
Any chance of making this change upstream? Appreciate the requests you've already processed from me - it would be great to keep our copy as pristine as possible in order to be able to update from the main source as needed, thus the reason for the request.
The text was updated successfully, but these errors were encountered: