DetectReader might OOM #479
Comments
This argument can go both ways: if detection on the first chunk is bad and detection on the first+second chunks is OK, then we performed two detections when we could have done just one. So which operation brings more overhead, reading one big chunk or performing multiple detections? I don't know, I never benchmarked this or even thought about it, but I am curious which is more efficient. I guess this depends a lot on the reader […]

There is also the problem of figuring out when to stop reading chunks. Take for example […]

I agree that reading a […]
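For context, here is a rough caller-side sketch of the chunked approach being discussed: grow the buffer step by step and re-run detection after each read, stopping once the result stabilizes. The step size and the "two consecutive detections agree" stop rule are made-up heuristics for illustration; only `mimetype.Detect` is the library's real API, everything else is an assumption.

```go
package main

import (
	"fmt"
	"io"
	"os"

	"github.com/gabriel-vasile/mimetype"
)

// detectInChunks grows the buffer step by step and re-runs detection,
// stopping as soon as two consecutive detections agree. The step size
// and the stop rule are arbitrary heuristics picked for illustration;
// they are not part of the mimetype API.
func detectInChunks(r io.Reader, step, max int) (string, error) {
	buf := make([]byte, 0, max)
	prev := ""
	for len(buf) < max {
		chunk := make([]byte, step)
		n, err := r.Read(chunk)
		buf = append(buf, chunk[:n]...)
		cur := mimetype.Detect(buf).String() // one detection per chunk read
		if cur == prev {
			return cur, nil // detection stabilized, stop reading
		}
		prev = cur
		if err == io.EOF {
			return cur, nil
		}
		if err != nil {
			return cur, err
		}
	}
	return prev, nil
}

func main() {
	f, err := os.Open("testdata/sample.pdf") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer f.Close()
	m, err := detectInChunks(f, 512, 1<<20)
	fmt.Println(m, err)
}
```

Whether this beats one big read depends, as noted above, on how expensive reads from the underlying reader are relative to repeated detections.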
Exactly why support for streams (#480) is needed.
Anything can happen, and I do not see how it is not bad code to even have the option to bring the entire system down.
I understand the point of streaming/buffering the input, and I will look at how I can make it work with all the existing code and whether it's worth bothering. This library initially started as an alternative to http.DetectContentType, but with support for more formats.

There are many places where you can bring your system down: you can access out-of-bounds array indexes, you can call os.ReadFile on a file that does not fit in memory. For now, I will make some changes to the docs to make it clear that the input is never buffered and is only read in one chunk.
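For callers worried about this today, a minimal sketch of capping the input yourself with io.LimitReader, so DetectReader can never pull more than a fixed amount regardless of how the library's read limit is configured. The 1 MB bound and the file name are arbitrary choices for illustration; `mimetype.DetectReader` is the library's real entry point.

```go
package main

import (
	"fmt"
	"io"
	"os"

	"github.com/gabriel-vasile/mimetype"
)

func main() {
	f, err := os.Open("big-file.bin") // hypothetical multi-GB file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Even if the internal readLimit were set to 0 ("read everything"),
	// LimitReader guarantees DetectReader can never read more than 1 MB here.
	capped := io.LimitReader(f, 1<<20)

	m, err := mimetype.DetectReader(capped)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.String(), m.Extension())
}
```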
I did not say you should do that. The point was that, as you have said yourself, you might not know where the magic bytes are located, so you keep reading into the buffer until you detect some and then proceed from there. You might not need the buffer at all; that is beside the point. It is more about cases where the magic bytes are not at the beginning of the file, or possibly not even at the end but spread out (you might detect a MIME type right from the get-go but need some more information at the end of the file to get the full picture; that does not mean you have to scan the entire file into the buffer).

Essentially this library should behave just like a hashing function, which does exactly that: it has a fixed buffer and at the end produces a hash of all the data, without loading the entire data stream into the buffer. So the internal implementation is irrelevant, you know what you need to do, but it is about adding io.Reader and io.Writer MIME detectors to support streams and also getting rid of that fixed max-read limit. Essentially io.Reader/io.Writer should be the primary methods and Detect([]byte) should only wrap them, the same way encoders like encoding/json do internally. So, exactly the opposite of the current design.
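A sketch of the hash.Hash-like design described above. None of these types exist in mimetype; this is a hypothetical wrapper built on the current `mimetype.Detect`, only meant to show how an io.Writer-style detector with a fixed buffer could be the primary entry point, with a []byte Detect reduced to a thin wrapper.

```go
package sniff // hypothetical package, not part of mimetype

import "github.com/gabriel-vasile/mimetype"

// Detector accumulates at most cap(buf) bytes of whatever is written to it,
// mirroring how a hash.Hash consumes a stream through fixed-size state.
type Detector struct {
	buf []byte
}

func NewDetector(limit int) *Detector {
	return &Detector{buf: make([]byte, 0, limit)}
}

// Write implements io.Writer. Bytes beyond the limit are accepted but
// discarded, so arbitrarily large streams never grow memory use.
func (d *Detector) Write(p []byte) (int, error) {
	if room := cap(d.buf) - len(d.buf); room > 0 {
		if room > len(p) {
			room = len(p)
		}
		d.buf = append(d.buf, p[:room]...)
	}
	return len(p), nil
}

// MIME runs detection over whatever prefix has been buffered so far.
func (d *Detector) MIME() string {
	return mimetype.Detect(d.buf).String()
}

// Detect shows how a []byte entry point could just wrap the streaming one.
func Detect(data []byte) string {
	d := NewDetector(3072) // 3072 matches mimetype's default read limit
	d.Write(data)
	return d.MIME()
}
```

Because the detector is an io.Writer, a caller could feed it with io.Copy or io.MultiWriter while the stream is being processed for other purposes, which is the stream-friendly usage the comment is asking for.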
If for some reason the readLimit gets set to 0, DetectReader will read the entire file into memory. That can mean many GBs being read into memory for no valid reason.

There should be a condition that never allows reading more than, say, a MB, whatever the readLimit ends up being. I would personally also read in chunks up to that limit and try to detect the MIME type on the data read so far after each chunk, instead of ingesting the entire source []byte, because you might need only 100 bytes while reading 3072.
Thanks for this great library though! Something like this should be built-in. Good job.
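To illustrate the guard condition suggested in this report: a hypothetical clamp on the configured read limit. The helper, its name, and the 1 MB ceiling are all made up for illustration and are not part of mimetype's code.

```go
package mimeguard // hypothetical helper, not part of mimetype itself

// hardCap is an arbitrary ceiling (1 MB) chosen for illustration; the report
// suggests some upper bound should always apply.
const hardCap uint32 = 1 << 20

// effectiveLimit clamps a configured readLimit so that 0 (or anything huge)
// can never be interpreted as "read the whole input into memory".
func effectiveLimit(readLimit uint32) uint32 {
	if readLimit == 0 || readLimit > hardCap {
		return hardCap
	}
	return readLimit
}
```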