Avoid Storing Large Strings in Memory #194
Conversation
Can we use a fastparse.ReaderParserInput or …?
I tried using that one, but it doesn't support backtracking, which is used at some point when dealing with syntax errors.
On further thought, I wonder if using a custom …
I don't see any code that would make it work for Reader even if we changed checkTraceable. I think we'd need to call …
On this point, we actually have 100MB-1GB input JSONs that take up a huge amount of memory. Depending on the order in which Bazel requests a particular compile, you could easily OOM if you concurrently buffer and parse several of these. I figured the easiest thing here was just to avoid the in-memory buffering. There are further issues with 100+MB arrays: they result in heap fragmentation and large-object pinning, so even with plenty of free space the GC may still declare OOM, since it can't move the huge objects and can lose the ability to allocate new ones.
Got it. The first parse should be fine, but the call to …
I think this looks fine. @szeiger, leaving this open for a bit for you to take a look before we merge it.
```scala
private[this] def readPath(path: Path): Option[ResolvedFile] = {
  val osPath = path.asInstanceOf[OsPath].p
  if (os.exists(osPath) && os.isFile(osPath)) {
    Some(new CachedResolvedFile(path.asInstanceOf[OsPath], memoryLimitBytes = 2048L * 1024L * 1024L))
```
What's the significance of this number? Should it just be `Int.MaxValue.toLong` or `Int.MaxValue.toLong + 1`?
Changed to `Int.MaxValue.toLong` and added scaladoc:

```scala
/**
 * @param memoryLimitBytes The maximum size of a file that we will resolve. This is not the size of
 *   the buffer, but a mechanism to fail when being asked to resolve (and downstream parse) a file
 *   that is beyond this limit.
 */
```
Basically, we have some pathological imports (1GB+) which I eventually want to ban (all the ones I found could be trivially modified upstream to not produce such huge files). In a follow-up we can make this param configurable.
cc @carl-db
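For illustration, a rough sketch of the kind of size guard described above, assuming a simple resolver that checks the file size before reading; the names and structure here are illustrative, not the actual Sjsonnet code:

```scala
import java.nio.file.{Files, Paths}

// Illustrative only: fail fast when asked to resolve a file larger than memoryLimitBytes,
// instead of buffering it and OOMing later during parsing.
def resolveWithLimit(pathStr: String, memoryLimitBytes: Long = Int.MaxValue.toLong): Array[Byte] = {
  val p = Paths.get(pathStr)
  val size = Files.size(p)
  require(size <= memoryLimitBytes, s"$pathStr is $size bytes, which exceeds the $memoryLimitBytes byte limit")
  Files.readAllBytes(p)
}
```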
Why are we keeping the parsed source files at all? AFAIR it's just for cache invalidation. Maybe we could replace them with hashes. Parsing could be done with a standard streaming parse, or we could even load the entire file into memory temporarily.
@szeiger This PR does replace the cache entry with a CRC hash (which was much faster than MD5 and made a huge difference for ~1GB files). Right now the PR will keep files up to 1MB in memory, but I can change that so that we always use a streaming parse.
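As a point of reference, a streaming CRC32 can be computed with the JDK's java.util.zip.CRC32 without ever holding the file in memory; this is a generic sketch, not the exact hashing code from the PR:

```scala
import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.CRC32

// Generic sketch: hash the file in fixed-size chunks so ~1GB inputs never need a single large buffer.
def crcOfFile(path: String): Long = {
  val crc = new CRC32
  val in = new BufferedInputStream(new FileInputStream(path))
  try {
    val buf = new Array[Byte](64 * 1024)
    var n = in.read(buf)
    while (n >= 0) {
      crc.update(buf, 0, n)
      n = in.read(buf)
    }
  } finally in.close()
  crc.getValue
}
```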
```diff
-  parseCache.getOrElseUpdate((path, txt), {
-    val parsed = fastparse.parse(txt, new Parser(path, strictImportSyntax).document(_)) match {
+  def parse(path: Path, content: ResolvedFile)(implicit ev: EvalErrorScope): Either[Error, (Expr, FileScope)] = {
+    parseCache.getOrElseUpdate((path, content.contentHash.toString), {
```
Oh, I see, we're storing the hash as a string which explains why the ParseCache is unchanged.
We may want to skip Jsonnet parsing for these files altogether. If the file name ends with .json we could simply use uJSON and have it generate an Sjsonnet AST consisting of static objects.
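A possible shape for that fast path, sketched with hypothetical AST node names (`StaticExpr` and its cases are placeholders, not Sjsonnet's real Expr types); only the ujson calls are the real library API:

```scala
import ujson.Value

// Placeholder static AST; Sjsonnet's actual Expr nodes would be used instead.
sealed trait StaticExpr
object StaticExpr {
  case class Str(value: String) extends StaticExpr
  case class Num(value: Double) extends StaticExpr
  case class Bool(value: Boolean) extends StaticExpr
  case object Null extends StaticExpr
  case class Arr(items: Seq[StaticExpr]) extends StaticExpr
  case class Obj(fields: Map[String, StaticExpr]) extends StaticExpr
}

// Convert a parsed JSON value into static nodes, skipping the Jsonnet parser entirely.
def fromJson(v: Value): StaticExpr = v match {
  case ujson.Str(s)      => StaticExpr.Str(s)
  case ujson.Num(d)      => StaticExpr.Num(d)
  case ujson.Bool(b)     => StaticExpr.Bool(b)
  case ujson.Null        => StaticExpr.Null
  case ujson.Arr(items)  => StaticExpr.Arr(items.toSeq.map(fromJson))
  case ujson.Obj(fields) => StaticExpr.Obj(fields.map { case (k, x) => k -> fromJson(x) }.toMap)
}

// Usage (hypothetical): if the import path ends in .json, build the AST via
//   fromJson(ujson.read(os.read(path)))
// instead of running the Jsonnet parser.
```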
Sorry to haunt an old issue. Am I right in concluding that this PR breaks scala.js compatibility, given these … in Importer? As in, not all of these are supported in the scala.js javalib.
- `ResolvedFile` interface that encapsulates all file reads (import, importstr, etc.). Behind this, a file can be backed by a string in memory (old behavior) or by a file on disk
- `ParserInput`s for each: the in-memory one is trivial and the on-disk one is backed by a `RandomAccessFile` (with buffered reads around it for performance; see the sketch after this list)
- Misc
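To illustrate the buffered-random-access idea from the second bullet, here is a minimal sketch of a `RandomAccessFile` wrapped with a small read buffer; it is illustrative only and not the PR's actual ParserInput implementation:

```scala
import java.io.RandomAccessFile

// Illustrative sketch: serve byte-at-a-time reads from a small in-memory window over the file,
// refilling the window from disk only when a requested offset falls outside it.
class BufferedRandomAccess(path: String, bufferSize: Int = 64 * 1024) extends AutoCloseable {
  private val raf = new RandomAccessFile(path, "r")
  private var bufStart: Long = -1L
  private var buf: Array[Byte] = Array.emptyByteArray

  def byteAt(offset: Long): Byte = {
    if (bufStart < 0 || offset < bufStart || offset >= bufStart + buf.length) {
      raf.seek(offset)
      val toRead = math.min(bufferSize.toLong, raf.length() - offset).toInt
      val chunk = new Array[Byte](toRead)
      raf.readFully(chunk)
      bufStart = offset
      buf = chunk
    }
    buf((offset - bufStart).toInt)
  }

  override def close(): Unit = raf.close()
}
```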