encoding/csv: add a way to limit bytes read per field/record #20169
Comments
/cc @bradfitz @josharian
How about just hard-coding a limit, instead of adding API? I seriously doubt any reasonable csv file will contain a single field bigger than, say, 64k. We do something similar for, say, bufio.Scanner, with a note saying that those who find it problematic should switch to lower-level reading.
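For reference, this is roughly how the bufio.Scanner precedent behaves today: there is a built-in cap (bufio.MaxScanTokenSize), callers can change it with Buffer, and an oversized token is reported as bufio.ErrTooLong instead of being buffered without bound. A small illustration (the 1 KiB cap is arbitrary):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	// A single 4 KiB "line" with no newline, larger than the cap below.
	input := strings.NewReader(strings.Repeat("x", 4096))

	s := bufio.NewScanner(input)
	// Lower the token-size cap from the default bufio.MaxScanTokenSize to 1 KiB.
	s.Buffer(make([]byte, 0, 1024), 1024)

	for s.Scan() {
		fmt.Println("token of", len(s.Bytes()), "bytes")
	}
	// The oversized token surfaces as an error rather than unbounded buffering.
	fmt.Println(s.Err()) // bufio.Scanner: token too long
}
```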
That also works, but could break existing users (although I agree that there probably won't be any breaking code). In that case a per-record limit would make even less sense (IMO), as CSV files can often have many fields (I have worked with files having hundreds or even thousands of fields per record), and then even small fields could trigger the limit. With a hard-coded limit it's important to pick the value carefully: too high and memory usage can still become "excessive", too low and existing programs could break. I would guess that in most cases when reading CSV the memory usage isn't really important or restricted, and a limit of 32k or 64k would not be too high and would still prevent most (if not all) of these problems.

A different variant of this could make the Reader only limit fields when LazyQuotes is set and a lazy quote was actually found. That would still solve the problem for the "common" case and not break users who either don't set LazyQuotes to true or don't have CSV files that trigger this behaviour.
This makes sense. Related: #19019
I have a small patch ready that adds a limit for the LazyQuotes == true case when there is at least one lazy quote in a field, but although I like the idea I'm still not 100% sure this is the best approach. I'd like to wait for a third opinion on this.
It seems OK to have a limit of some kind, but someone needs to propose a concrete API.
On hold for concrete API.
We already have `type Reader struct { ... }`; the proposal is to add a field to it:

```go
	// BytesPerField optionally specifies the maximum allowed number of bytes
	// per field. Zero means no limit.
	BytesPerField int
```
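To make the semantics concrete, a hedged sketch of the check such a field implies; appendField, errFieldTooLong, and package sketch are made-up names, not encoding/csv API:

```go
// Hypothetical sketch only: none of these names exist in encoding/csv.
package sketch

import "errors"

var errFieldTooLong = errors.New("csv: field length exceeds BytesPerField")

// appendField grows a field buffer only while it stays within bytesPerField;
// zero means no limit, matching the wording of the proposed field.
func appendField(field, chunk []byte, bytesPerField int) ([]byte, error) {
	if bytesPerField > 0 && len(field)+len(chunk) > bytesPerField {
		return nil, errFieldTooLong
	}
	return append(field, chunk...), nil
}
```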
I'm in favor of Brad's proposal. It's probably the simplest solution and can be used even for cases where LazyQuotes is off (basically the case I always want and hope for...). I can take this once the Go 1.10 tree is open. (And sorry for the delay, I just found this in my list of issues to look at. I will set myself a reminder for after the Go 1.9 release...)
Updated the issue comment with what @bradfitz proposed.
This class of issue is not specific to encoding/csv.
What about something like a new encoding/limit package?
This approach means that a client by default will read all data. A server can call into the limit package to restrict how much data will be read.
I think this is too loose, especially since "the amount of data we should read for one reader" depends on what you are working with. I imagine the Reader function would be used multiple times, depending on the package (e.g. for each record in a CSV file), and its behaviour would need to be documented for each package that uses it (e.g. does it limit the combined length of each CSV record or of each field?). Also, I don't like that this adds new global state that changes the behaviour of whole packages or types instead of only a single object. Maybe we can just define a new interface that all the concerned types can implement, which can then document their way of limiting / their behaviour on the method(s).
New global state is awful, but my guess is that having to add new API for every affected package is worse. Ultimately the issue is that servers need to limit the amount of memory they are going to allocate when reading potentially very large data. If the only way to limit that is to pass some argument, then that argument needs to be exposed at every API layer back up to the server.

If you don't like the reader idea then another approach would be to use a limited buffer to hold the incoming data. On the base assumption that this only applies to objects that are essentially []byte, we could invent a limited buffer that will start dropping bytes past some limit. But, again, using this implies global state or massive API changes.
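As a sketch of that "limited buffer" idea (hypothetical type and names, just to make the trade-off concrete): a buffer that keeps at most max bytes and drops the rest still has to reach the decoder somehow, which is where the global state or API change comes back in.

```go
// Hypothetical sketch of a "limited buffer": it keeps at most max bytes and
// silently drops the remainder, recording that truncation happened.
package sketch

type cappedBuffer struct {
	buf       []byte
	max       int
	truncated bool
}

// Write appends bytes up to the cap and drops the rest, so a decoder writing
// field data into this buffer can never hold more than max bytes of it.
func (b *cappedBuffer) Write(p []byte) (int, error) {
	n := len(p)
	if room := b.max - len(b.buf); len(p) > room {
		p, b.truncated = p[:room], true
	}
	b.buf = append(b.buf, p...)
	return n, nil // report the full length so the producer keeps going
}
```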
I think addressing this somewhere on the io.Reader level makes more sense. I wonder if the approach is to define a special interface that satisfies io.Reader:

```go
type ReadLimiter interface {
	io.Reader

	// StartLimiter informs the Reader that the user intends to perform an operation
	// that may read an unbounded amount of memory from the Reader.
	StartLimiter()

	// StopLimiter informs the Reader that the potentially unbounded operation
	// has come to completion.
	StopLimiter()
}
```

Thus, the csv.Reader could wrap the reading of each record in a StartLimiter/StopLimiter pair. Similarly, the tar.Reader could do the same around each call to Next.
@dsnet I'm not quite seeing it yet. Tell me more. What would an implementation of ReadLimiter look like?
Consider the following simple implementation with the exported API:

```go
package ioquota

// QuotaReader is an interface that consumers of io.Reader can type assert to
// in order to declare intent to perform read operations that may require
// an unbounded amount of resources.
type QuotaReader interface {
	io.Reader

	// WithQuota informs an io.Reader that the consumer intends to perform
	// an operation that may read an unbounded amount of memory from the reader.
	// It is the responsibility of the consumer to call done when the operation
	// is complete.
	//
	// When WithQuota is called, the io.Reader may impose some restriction on
	// the number of bytes that may be read until done is called.
	// When the quota is exceeded, ErrQuota is returned from Read.
	// Calling done lifts the quota restriction, even if ErrQuota was reached.
	WithQuota() (done func())
}

func New(r io.Reader, quota int) QuotaReader
```

End users can do the following:

```go
r := csv.NewReader(ioquota.New(r, 64*kibiByte))
rec, err := r.Read()
_ = err // err may be ErrQuota if the record is larger than 64KiB
```

```go
r, err := gzip.NewReader(ioquota.New(r, 64*kibiByte))
_ = err // err may be ErrTruncatedHeader, however r can still be used
```

```go
r := tar.NewReader(ioquota.New(r, 64*kibiByte))
h, err := r.Next()
_ = err // err may be ErrQuota if the record is larger than 64KiB
io.Copy(ioutil.Discard, r) // This has no read limit, since the quota only applies to Next
```

The implementation within each standard library package would be as follows:

```go
package csv

func (r *Reader) readRecord(dst []string) ([]string, error) {
	// Every call to readRecord will be within the scope of a single WithQuota call
	// to ensure that no single record exceeds the given quota.
	if qr, ok := r.r.(ioquota.QuotaReader); ok {
		done := qr.WithQuota()
		defer done()
	}
	...
}
```

```go
package tar

func (tr *Reader) Next() (*Header, error) {
	// Every call to Next will be within the scope of a single WithQuota call
	// to ensure that no tar.Header exceeds the given quota.
	if qr, ok := tr.r.(ioquota.QuotaReader); ok {
		done := qr.WithQuota()
		defer done()
	}
	...
}
```

```go
package gzip

// ErrTruncatedHeader is returned by NewReader and Reader.Reset when the string fields in
// Header are truncated. Even when this error is returned, the Reader may still be used.
var ErrTruncatedHeader = errors.New("Header.Name or Header.Comment field is truncated")

// If the underlying reader is a QuotaReader, readString calls WithQuota prior to reading
// the string. If the quota is reached, the string field is truncated and the remainder
// of the field is discarded.
func (z *Reader) readString() (s string, err error) {
	var done func()
	if qr, ok := z.r.(ioquota.QuotaReader); ok {
		done = qr.WithQuota()
	}
	var truncated bool
	for {
		b, err := z.r.ReadByte()
		if err == ioquota.ErrQuota {
			err = nil
			truncated = true
			done() // Allow future calls to ReadByte to succeed
			done = nil
		}
		if err != nil {
			...
		}
		if b == 0 {
			break // Null terminator
		}
		if truncated {
			continue // Avoid processing this byte
		}
		...
	}
	if done != nil {
		done()
	}
	...
}
```

This is the rough sketch I have; there are still a number of open questions. See the TODOs in my playground example. For example, the …
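For concreteness, one way the ioquota.New wrapper above could be implemented (a hedged sketch, not part of any real package; it repeats the interface and ErrQuota so it compiles on its own, and it only enforces the byte budget between WithQuota and its done func):

```go
package ioquota

import (
	"errors"
	"io"
)

// ErrQuota is returned by Read when a quota-limited operation has consumed
// more than its byte budget.
var ErrQuota = errors.New("ioquota: read quota exceeded")

// QuotaReader mirrors the interface sketched in the comment above.
type QuotaReader interface {
	io.Reader
	WithQuota() (done func())
}

// New wraps r so that at most quota bytes may be read inside each
// WithQuota/done scope; reads outside a scope are not limited.
func New(r io.Reader, quota int) QuotaReader {
	return &quotaReader{r: r, quota: quota}
}

type quotaReader struct {
	r         io.Reader
	quota     int  // byte budget per WithQuota scope
	remaining int  // bytes left in the current scope
	active    bool // true between WithQuota and done
}

func (q *quotaReader) WithQuota() (done func()) {
	q.active, q.remaining = true, q.quota
	return func() { q.active = false }
}

func (q *quotaReader) Read(p []byte) (int, error) {
	if q.active {
		if q.remaining <= 0 {
			return 0, ErrQuota
		}
		if len(p) > q.remaining {
			p = p[:q.remaining]
		}
	}
	n, err := q.r.Read(p)
	if q.active {
		q.remaining -= n
	}
	return n, err
}
```

A full version would also need to forward io.ByteReader for the gzip readString sketch above to work as written.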
Seems promising. But also seems a bit awkward to use. And it's hard to translate the …
This issue has since expanded beyond just encoding/csv.
What is the status of this issue? Will this be fixed in csv?
The status is that we haven't decided what to do here.
Considering that a file is buffered a record at a time, it seems that a useful control might be to limit the maximum record length. Has the following, or similar, been considered?

```go
type Reader struct {
	...
	// MaxBytesPerRecord optionally specifies the maximum allowed number of bytes
	// per record. Zero means no limit.
	MaxBytesPerRecord int
}
```

It is easy to implement, easy to understand as a package consumer, and it would mitigate the worst-case scenarios for processing large, malformed files.
This is a very serious issue that leads to OOM if an application is using LazyQuotes.
The csv.Reader has no way to limit the size of the fields read. In combination with LazyQuotes, this can lead to cases where a simple syntax error can make the Reader read everything until io.EOF into a single field.
Probably the simplest case (with LazyQuotes == true) is a quoted field where there is some other rune between the closing quote and the comma, e.g. `a,"b" ,c`. In this case the second field will contain all bytes until either a single quote followed by a comma or EOF is found (see my comment in #8059). This behaviour can lead to excessive memory usage or even OOM when reading broken CSV files.
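To make the reported behaviour easy to reproduce, a small self-contained program along these lines could be used (the exact parsing outcome depends on the Go version, so treat it as an illustration of the report rather than a guaranteed result):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	// A stray space after the closing quote of the second field, followed by a
	// large tail that never contains a quote directly followed by a comma:
	// per the report, with LazyQuotes the second field keeps growing.
	input := `a,"b" ,c` + "\n" + strings.Repeat("x", 1<<20)

	r := csv.NewReader(strings.NewReader(input))
	r.LazyQuotes = true
	r.FieldsPerRecord = -1

	for {
		rec, err := r.Read()
		if err != nil {
			fmt.Println("read error:", err)
			break
		}
		for i, f := range rec {
			fmt.Printf("field %d: %d bytes\n", i, len(f))
		}
	}
}
```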
To avoid this I propose to add a new, optional field to Reader to allow limiting the size of each field. When set, the Reader would return an error as soon as it hits the limit.

Alternatively, the limit could be per record instead. This could help especially with situations where FieldsPerRecord == -1, because there can be a different, basically unlimited number of fields per record, so a limit of 100 bytes per field doesn't help with e.g. missing newlines leading to records with many fields.

The new field would be specified as proposed by @bradfitz (the BytesPerField field shown above).