-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stack overflow fix - eliminate recursive implementation #4997
Conversation
I'd suggest doing a bug fix release @lbergelson once this fix goes in. |
Tested with 10K intervals and 100 WES samples |
Codecov Report
@@ Coverage Diff @@
## master #4997 +/- ##
==============================================
+ Coverage 86.367% 86.397% +0.03%
- Complexity 28832 28945 +113
==============================================
Files 1791 1791
Lines 133603 134348 +745
Branches 14920 15063 +143
==============================================
+ Hits 115389 116073 +684
- Misses 12810 12858 +48
- Partials 5404 5417 +13
|
@lbergelson Do you want to look at this ? I can do a release tomorrow if we can get it in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One request for a test.
How do you want to test this? The error was triggered only if a large number of intervals (~1000) was imported by the tool. |
@kguraj It looks like there are existing integration tests that use intervals that cover a pretty wide genomic range. It should be easy to write a test that programmatically generates a large set of (10000) or so very small intervals (1bp) with small (1bp) gaps between them (the gaps are necessary since otherwise the intervals will be merged together by the engine) that fails without this change and passes with it. It doesn't necessarily have to verify the results, just successfully complete. |
FYI: the test will take a long time to run. Added the requested test. |
@kgururaj Thanks for adding the test. Running it locally on my laptop (without your fix) succeeds though - I have to bump it up from 1000 intervals to 9000 to reproduce the stack overflow. But if I do that, it takes a long time to run, since it appears to be creating lots of small partitions. Is there any way to get it to use fewer partitions in a case like this where there are lots of intervals ? Somewhat more concerning is that when with 8000 intervals, I see a different failure mode. First I see lots (thousands) of these messages:
followed by a failure that ends like this:
Can you reproduce that ? |
Is there any way to get it to use fewer partitions in a case like this where there are lots of intervals ? The multi-interval support in GenomicsDBImport tool is purely for convenience. For scalability with a large number of intervals and samples, you should use multiple processes, each writing to a small (1?) number of intervals. Somewhat more concerning is that when with 8000 intervals, I see a different failure mode. First I see lots (thousands) of these messages: The second set of messages are a result of too many file handles open per process - your system is limiting the number of file handles opened by a single process. Again, this goes back to the previous statement. |
@kgururaj Do those first messages originate here ? I'm not sure what GenomicsDB code path leads there, but it looks like TileDB considers them to be an error that results in a short-circuit return. I'd be concerned that masking them would hide some underlying error. Are there any tests that verify that data round-trips though GDB when an interval list large enough to trigger these messages is encountered ? |
Yes, that's the error message. You should see this part of the GenomicsDB code that retries the function when an error is detected. So, you don't need to be concerned by those error messages. They are annoying though and I will disable them from being printed. |
@kguraj Thanks for the response(s). The test in the PR was super useful as a temporary test, but as you mentioned it runs pretty slowly, and as it stands the test passes on current master anyway. It seems to require on the order of 9000-1000 intervals instead rather than 1000 to actually hit stack overflow. Since that would be a very slow running test, I'm inclined to back it out. Also, the user who originally reported the issue was using 11k intervals, and it seems that the stack overflow fix is unlikely to help in that case. Is there any guidance for users on what is a reasonable number of intervals per process ? It sounds like the intention was that it be used with pretty small intervals. Should we issue a warning message in GenomicsDBImport at some threshold number of intervals ? Are you planning to produce a jar with the error messages suppressed for this PR? |
It's your call whether you want the test or not Yeah, it would be good to put an advisory message if the number of intervals is more than 100. I uploaded a jar yesterday, but it's not showed up in Maven central |
New jar without spurious debug messages |
@kgururaj Thx for the updated jar. Can you remove the test commit now, and then we can run this on travis once more and then we can merge ? Thx. |
Use jar without TileDB verbose logging - error messages are printed out by GenomicsDB when required
Done |
@kgururaj It looks like the test is still included in the PR - I think it should come out since it doesn't use enough intervals to test the fix, and it would be too slow if it did. |
It's no longer a test - just a private function. I just kept it in case somebody wishes to make it a good test in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kgururaj!
Fix for #4994
The recursive implementation should not have been accepted - my mistake.