DetectReader might OOM #479
Comments
This argument can go both ways: if detection on the first chunk is bad and detection on the first+second chunks is OK, then we performed two detections when we could have done just one. So which operation brings more overhead, reading one big chunk or performing multiple detections? I don't know, I never benchmarked this or even thought about it, but I am curious which is more efficient. I guess this depends a lot on the reader […]

There is also the problem of figuring out when to stop reading chunks. Take for example […]

I agree that reading a […]
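For context, here is a rough caller-side sketch of the chunked approach being discussed: grow the buffer step by step and re-run detection after each read, stopping once the result stabilizes. The step size and the "two consecutive detections agree" stop rule are made-up heuristics for illustration; only `mimetype.Detect` is the library's real API, everything else is an assumption.

```go
package main

import (
	"fmt"
	"io"
	"os"

	"github.com/gabriel-vasile/mimetype"
)

// detectInChunks grows the buffer step by step and re-runs detection,
// stopping as soon as two consecutive detections agree. The step size
// and the stop rule are arbitrary heuristics picked for illustration;
// they are not part of the mimetype API.
func detectInChunks(r io.Reader, step, max int) (string, error) {
	buf := make([]byte, 0, max)
	prev := ""
	for len(buf) < max {
		chunk := make([]byte, step)
		n, err := r.Read(chunk)
		buf = append(buf, chunk[:n]...)
		cur := mimetype.Detect(buf).String() // one detection per chunk read
		if cur == prev {
			return cur, nil // detection stabilized, stop reading
		}
		prev = cur
		if err == io.EOF {
			return cur, nil
		}
		if err != nil {
			return cur, err
		}
	}
	return prev, nil
}

func main() {
	f, err := os.Open("testdata/sample.pdf") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer f.Close()
	m, err := detectInChunks(f, 512, 1<<20)
	fmt.Println(m, err)
}
```

Whether this beats one big read depends, as noted above, on how expensive reads from the underlying reader are relative to repeated detections.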
Exactly why support for streams (#480) is needed.
Anything can happen, and I do not see how it is not bad code to even have the option to bring the entire system down.
I understand the point of streaming/buffering the input, and I will look at how I can make it work with all the existing code and whether it's worth bothering. This library initially started as an alternative to http.DetectContentType, but with support for more formats.

There are many places where you can bring your system down: you can access out-of-bounds array indexes, you can call os.ReadFile on a file that does not fit in memory. For now, I will make some changes to the docs to make it clear that the input is never buffered and is only read in one chunk.
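For callers worried about this today, a minimal sketch of capping the input yourself with io.LimitReader, so DetectReader can never pull more than a fixed amount regardless of how the library's read limit is configured. The 1 MB bound and the file name are arbitrary choices for illustration; `mimetype.DetectReader` is the library's real entry point.

```go
package main

import (
	"fmt"
	"io"
	"os"

	"github.com/gabriel-vasile/mimetype"
)

func main() {
	f, err := os.Open("big-file.bin") // hypothetical multi-GB file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Even if the internal readLimit were set to 0 ("read everything"),
	// LimitReader guarantees DetectReader can never read more than 1 MB here.
	capped := io.LimitReader(f, 1<<20)

	m, err := mimetype.DetectReader(capped)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.String(), m.Extension())
}
```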
I did not say you should do that. The point was that, as you have said yourself, you might not know where the magic bytes are located, so you keep reading into the buffer until you detect some and then proceed from there. You might not need the buffer at all; that is beside the point. It is more about cases where the magic bytes are not at the beginning of the file, or possibly not even at the end but spread out (you might detect a MIME type right from the get-go but need some more information at the end of the file to get the full picture; that does not mean you have to scan the entire file into the buffer).

Essentially this library should behave just like a hashing function, which does exactly that: it has a fixed buffer and at the end produces a hash of all the data, without loading the entire data stream into the buffer. So the internal implementation is irrelevant, you know what you need to do, but it is about adding io.Reader and io.Writer MIME detectors to support streams and also getting rid of that fixed max-read limit. Essentially io.Reader/io.Writer should be the primary methods and Detect([]byte) should only wrap them, the same way encoders like encoding/json do internally. So, exactly the opposite of the current design.
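A sketch of the hash.Hash-like design described above. None of these types exist in mimetype; this is a hypothetical wrapper built on the current `mimetype.Detect`, only meant to show how an io.Writer-style detector with a fixed buffer could be the primary entry point, with a []byte Detect reduced to a thin wrapper.

```go
package sniff // hypothetical package, not part of mimetype

import "github.com/gabriel-vasile/mimetype"

// Detector accumulates at most cap(buf) bytes of whatever is written to it,
// mirroring how a hash.Hash consumes a stream through fixed-size state.
type Detector struct {
	buf []byte
}

func NewDetector(limit int) *Detector {
	return &Detector{buf: make([]byte, 0, limit)}
}

// Write implements io.Writer. Bytes beyond the limit are accepted but
// discarded, so arbitrarily large streams never grow memory use.
func (d *Detector) Write(p []byte) (int, error) {
	if room := cap(d.buf) - len(d.buf); room > 0 {
		if room > len(p) {
			room = len(p)
		}
		d.buf = append(d.buf, p[:room]...)
	}
	return len(p), nil
}

// MIME runs detection over whatever prefix has been buffered so far.
func (d *Detector) MIME() string {
	return mimetype.Detect(d.buf).String()
}

// Detect shows how a []byte entry point could just wrap the streaming one.
func Detect(data []byte) string {
	d := NewDetector(3072) // 3072 matches mimetype's default read limit
	d.Write(data)
	return d.MIME()
}
```

Because the detector is an io.Writer, a caller could feed it with io.Copy or io.MultiWriter while the stream is being processed for other purposes, which is the stream-friendly usage the comment is asking for.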
If for some reason the readLimit gets set to 0, DetectReader will read the entire file into memory. That can mean many GBs being read into memory for no valid reason.

There should be a condition that never allows reading more than, say, a MB, whatever the readLimit ends up being. I would personally also read in chunks up to that limit and try to detect the MIME type on the data read so far after each chunk, instead of ingesting the entire source []byte, because you might need only 100 bytes while reading 3072.
Thanks for this great library though! Something like this should be built-in. Good job.
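To illustrate the guard condition suggested in this report: a hypothetical clamp on the configured read limit. The helper, its name, and the 1 MB ceiling are all made up for illustration and are not part of mimetype's code.

```go
package mimeguard // hypothetical helper, not part of mimetype itself

// hardCap is an arbitrary ceiling (1 MB) chosen for illustration; the report
// suggests some upper bound should always apply.
const hardCap uint32 = 1 << 20

// effectiveLimit clamps a configured readLimit so that 0 (or anything huge)
// can never be interpreted as "read the whole input into memory".
func effectiveLimit(readLimit uint32) uint32 {
	if readLimit == 0 || readLimit > hardCap {
		return hardCap
	}
	return readLimit
}
```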