Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce memory usage when reading large text files #24

Merged
merged 1 commit into from
Oct 4, 2018

Conversation

melissalinkert
Copy link
Member

Partial backport of a private PR.

Test data is in data_repo/inbox/ome-common-java-24, and includes a 66MB test.txt and the following test class:

import loci.common.DataTools;

public class Test {
    public static void main(String[] args) throws Exception {
      String file = args[0];
      int readCount = Integer.parseInt(args[1]);
      int successful = 0;

      try {
      for (int i=0; i<readCount; i++) {
        DataTools.readFile(file);
        successful++;
      }
      }
      catch (OutOfMemoryError e) {
        /* debug */ System.out.println("* stopped early on OOM");
      }
      /* debug */ System.out.println("successfully read " + file + " " + successful + " times");
    }
}

Without this change:

$ java -mx400m Test test.txt 10
...
successfully read test.txt 10 times
$ java -mx390m Test test.txt 10
...
* stopped early on OOM
successfully read test.txt 0 times

With this change:

$ java -mx300m Test test.txt 10
...
successfully read test.txt 10 times
$ java -mx290m Test test.txt 10
...
* stopped early on OOM
successfully read test.txt 0 times

i.e. there should be a reproducible 25% reduction in the minimum amount of memory required to read the test file. The original use case was large Micromanager acquisitions, where the *_metadata.txt file is tens or even >100 MB.

@joshmoore
Copy link
Member

Outwith this PR: Having little experience with readFile I took a look at https://github.com/openmicroscopy/bioformats/search?utf8=%E2%9C%93&q=readFile&type= -- it would be interesting to know what part of our usage is per line. If large, perhaps a method returning a closeable-java.util.Scanner rather than java.lang.String could reduce usage significantly.

@melissalinkert
Copy link
Member Author

git grep DataTools.readFile on ome/bioformats@dd94cee (current bioformats develop) indicates 51 uses. Of those, 12 split on a CR and/or LF in the same line of code and 16 split on CR and/or LF at some point after reading, either explicitly or implicitly by being passed to IniParser. Carded here for future consideration: https://trello.com/c/Zi5ySAw7/231-ome-common-consider-adding-a-datatoolsreadfile-signature-that-returns-a-javautilscanner

@dgault
Copy link
Member

dgault commented Oct 2, 2018

Using the test info provided this does appear to reduce the required memory. Im kind of surprised that it has as much of an effect as readString does pretty much the same thing with an extra check for the available size:

    int avail = available();
    if (n > avail) n = avail;
    byte[] b = new byte[n];
    readFully(b);
    return new String(b, encoding);

Given that we are reading the entire length of the file though that check is unnecessary and using readFully is more performant. PR is good to merge.

@sbesson sbesson added this to the 5.3.7 milestone Oct 4, 2018
@sbesson sbesson merged commit c27245e into ome:master Oct 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants