Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in updateBundleStack() when trying to run tests in parallel #1334

Closed
oharboe opened this issue Feb 11, 2020 · 23 comments
Closed

Crash in updateBundleStack() when trying to run tests in parallel #1334

oharboe opened this issue Feb 11, 2020 · 23 comments

Comments

@oharboe
Copy link
Contributor

oharboe commented Feb 11, 2020

@ucbjrl I realize this isn't a great bug report.... I'll try to flesh it out as I learn more.

Type of issue: bug report

Impact: no functional change

Development Phase: request

Other information

exceprt from my build.sbt:

val defaultVersions = Map(
  "chisel3" -> "3.2.2",
  "chisel-iotesters" -> "1.3.0",
  "chisel-testers2" -> "0.1.2"
)

// run in parallel with forking
fork := true
Test / testForkedParallel := true
Test / parallelExecution := true
// Avoid running out of memory when creating Verilog
javaOptions += "-Xmx32768m"
javaOptions += "-Xss128m"

If the current behavior is a bug, please provide the steps to reproduce the problem:

I can't provide the source code, but I'm trying to run tests in paralllel(it's going to cut down running time from 1h20 to 20m...).

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/chisel-users/F8PWmR79djE/W270FB1FFgAJ

What is the current behavior?

Exception.

java.lang.IndexOutOfBoundsException: at 0 deleting 3
at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:162)
at scala.collection.mutable.BufferLike.trimEnd(BufferLike.scala:185)
at scala.collection.mutable.BufferLike.trimEnd$(BufferLike.scala:185)
at scala.collection.mutable.AbstractBuffer.trimEnd(Buffer.scala:50)
at chisel3.internal.Builder$.updateBundleStack(Builder.scala:356)
at chisel3.Bundle.<init>(Aggregate.scala:781)

What is the expected behavior?

No exception.

Please tell us about your environment:

$ uname -a
Linux gale 5.3.0-28-generic #30~18.04.1-Ubuntu SMP Fri Jan 17 06:14:09 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ java --version
openjdk 11.0.6 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1)
OpenJDK 64-Bit Server VM (build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1, mixed mode, sharing)

What is the use case for changing the behavior?

@oharboe
Copy link
Contributor Author

oharboe commented Feb 11, 2020

I tried upgrading to the latest snapshot, somewhat different error message:

val defaultVersions = Map(
  "chisel3" -> "3.3-20191218-SNAPSHOT",
  "chisel-iotesters" -> "1.4-20191218-SNAPSHOT",
  "chisel-testers2" -> "0.2-20191218-SNAPSHOT"
)


[info] ascenium.tiles.DressRehearsalChiselNoCompressionTests *** ABORTED *** (14 milliseconds)
[info]   scala.MatchError: null
[info]   at chisel3.internal.Builder$.$anonfun$updateBundleStack$2(Builder.scala:353)
[info]   at chisel3.internal.Builder$.$anonfun$updateBundleStack$2$adapted(Builder.scala:353)
[info]   at scala.collection.IndexedSeqOptimized.segmentLength(IndexedSeqOptimized.scala:198)
[info]   at scala.collection.IndexedSeqOptimized.segmentLength$(IndexedSeqOptimized.scala:195)
[info]   at scala.collection.mutable.ArrayBuffer.segmentLength(ArrayBuffer.scala:49)
[info]   at scala.collection.GenSeqLike.prefixLength(GenSeqLike.scala:98)
[info]   at scala.collection.GenSeqLike.prefixLength$(GenSeqLike.scala:98)
[info]   at scala.collection.AbstractSeq.prefixLength(Seq.scala:45)
[info]   at chisel3.internal.Builder$.updateBundleStack(Builder.scala:353)
[info]   at chisel3.Bundle.<init>(Aggregate.scala:781)
[info]   ...
[info] ascenium.tiles.DressRehearsalChiselTests *** ABORTED *** (14 milliseconds)
[info]   scala.MatchError: null
[info]   at chisel3.internal.Builder$.$anonfun$updateBundleStack$2(Builder.scala:353)
[info]   at chisel3.internal.Builder$.$anonfun$updateBundleStack$2$adapted(Builder.scala:353)
[info]   at scala.collection.IndexedSeqOptimized.segmentLength(IndexedSeqOptimized.scala:198)
[info]   at scala.collection.IndexedSeqOptimized.segmentLength$(IndexedSeqOptimized.scala:195)
[info]   at scala.collection.mutable.ArrayBuffer.segmentLength(ArrayBuffer.scala:49)
[info]   at scala.collection.GenSeqLike.prefixLength(GenSeqLike.scala:98)
[info]   at scala.collection.GenSeqLike.prefixLength$(GenSeqLike.scala:98)
[info]   at scala.collection.AbstractSeq.prefixLength(Seq.scala:45)
[info]   at chisel3.internal.Builder$.updateBundleStack(Builder.scala:353)
[info]   at chisel3.Bundle.<init>(Aggregate.scala:781)
[info]   ...

@ucbjrl
Copy link
Contributor

ucbjrl commented Feb 12, 2020

Thanks for moving to the 20191218-SNAPSHOTs to explore this.

I am unable to duplicate this using our tests. I suspect it has something to do with the way a specific Bundle is being defined. Could we determine the definition that is provoking the error?

@oharboe
Copy link
Contributor Author

oharboe commented Feb 13, 2020

I tried, but I haven't figured out anything so far.

@oharboe
Copy link
Contributor Author

oharboe commented Feb 13, 2020

I found out something, but I don't know what it means:

If I create multiple classes (test suites), then I get the failure.

If I put all the tests into a single class and use with ParallelTestExecution, then tests are run in parallel. However, all the tests within that class is then run in parallel, which doesn't optimize ccache and verilator interaction...

@ucbjrl
Copy link
Contributor

ucbjrl commented Feb 14, 2020

Could you provide your build.sbt and project/ files? This could be an issue with the test environment. It seems that sbt may provide multiple class loaders to isolate parallel test executions. This can prove problematic with singleton management and thread-based state. Unfortunately, guide-lines on managing this scenario are few and far between.

@oharboe
Copy link
Contributor Author

oharboe commented Feb 16, 2020

Here you go:

issue1334.zip

@ucbjrl
Copy link
Contributor

ucbjrl commented Mar 3, 2020

Thanks @oharboe. I've tried duplicating your environment, but I'm unable to reproduce this with our code. Mind you, we only have one example (chisel-template) that uses both iotesters and testers2. This does have two test classes, but it doesn't look like they are running in parallel to me.

I've created a bundleStackCrash branch in this repository with my attempt to apply your sbt changes. Could you give this a try and see if it is capable of reproducing your problem?
It could well be that this isn't sufficient to provoke the bug, but if you can get it to display the problem, we should be able to diagnose it.

@oharboe
Copy link
Contributor Author

oharboe commented Mar 4, 2020

@ucbjrl I've just upgraded to 3.3-20200227-SNAPSHOT and the problem is still there.

@oharboe
Copy link
Contributor Author

oharboe commented Mar 4, 2020

@ucbjrl I'm not sure what you're asking me to test. What are the steps?

I've just upgraded to 3.3-20200227-SNAPSHOT, the problem is still there.

@ucbjrl
Copy link
Contributor

ucbjrl commented Mar 4, 2020

Sorry, @oharboe, I should have been more specific. I believe there's an environmental component to this issue, and we need to be able to duplicate enough of your environment to reproduce and ultimately address this issue.

chisel-template is the simplest publicly available code body that uses both testers and testers2 the inclusion of which may be instrumental in reproducing this issue.

The test I'd like your input on consists of:

$ git clone https://github.com/freechipsproject/chisel-template.git
$ cd chisel-template/
$ git checkout bundleStackCrash
$ git pull
$ sbt clean test

This should duplicate at least some of the sbt test environment under which you experienced the issue. If it doesn't demonstrate the issue, could you inspect the sbt settings (build.sbt, project/*) and update them if necessary. If you do manage to duplicate the issue, but it requires updates to the sbt settings, could you submit those updates as a pull request to the bundleStackCrash branch? I'm particularly concerned that the Verilator tests don't seem to be running in parallel for me.

I realize this may be a diversion for you and i understand there is a limit to the resources you have available to devote to this task. Your help is appreciated.

@oharboe
Copy link
Contributor Author

oharboe commented Mar 4, 2020

Excellent!

Each class, which can have multiple tests, is executed in serial, so you need many classes to get tests in paralllel.

Run git apply crash.txt on your bundleStackCrash branch and you can reproduce the crash:

 $ sbt test
[deleted]
test DecoupledGcd Success: 0 tests passed in 1 cycles in 0.016794 seconds 59.55 Hz
[info] GcdTesters9:
[info] - Gcd should calculate proper greatest common denominator *** FAILED *** (543 milliseconds)
[info]   scala.MatchError: null
[info]   at chisel3.internal.Builder$.$anonfun$updateBundleStack$2(Builder.scala:353)
[info]   at chisel3.internal.Builder$.$anonfun$updateBundleStack$2$adapted(Builder.scala:353)
[info]   at scala.collection.IndexedSeqOptimized.segmentLength(IndexedSeqOptimized.scala:198)
[info]   at scala.collection.IndexedSeqOptimized.segmentLength$(IndexedSeqOptimized.scala:195)
[info]   at scala.collection.mutable.ArrayBuffer.segmentLength(ArrayBuffer.scala:49)
[info]   at scala.collection.GenSeqLike.prefixLength(GenSeqLike.scala:98)
[info]   at scala.collection.GenSeqLike.prefixLength$(GenSeqLike.scala:98)
[info]   at scala.collection.AbstractSeq.prefixLength(Seq.scala:45)
[info]   at chisel3.internal.Builder$.updateBundleStack(Builder.scala:353)
[info]   at chisel3.Bundle.<init>(Aggregate.scala:781)
[info]   ...
[info] [0.000] Elaborating design...
[info] [0.024] Done elaborating.
[deleted]

crash.txt

@oharboe
Copy link
Contributor Author

oharboe commented Mar 4, 2020

@ucbjrl By the way... I've managed to break the back of the running time problems I had with my tests, I'm down from 1h15 to 20 minutes, which is manageable, but there's no reason why some of these iteration times on Chisel tests couldn't be in the sub-minute turnaround time. Especially if the Verilator binaries are cached/reused across runs(saves elaboration AND Verilator AND C/C++ compilation) and one is iterating on the Scala peek/poke code.

@ucbjrl
Copy link
Contributor

ucbjrl commented Mar 4, 2020

Thanks @oharboe. Yes, with your updates, I can reproduce the issue. I've updated the chisel-template bundleStackCrash branch.

And yes, there is clear consensus that caching the DUT simulation while iterating on tester development would be a significant improvement. It's largely a question of priority, and soliciting contributions from the community-at-large. BTW, you're welcome to attend the weekly developer meetings (Chisel/FIRRTL development meetings happen every Monday from 1100–1300 PT). I realize this may not be a convenient time for you.

@oharboe
Copy link
Contributor Author

oharboe commented Mar 4, 2020

@ucbjrl Thanks! 1100am PDT isn't impossible, just slightly incovenient. I may join in to hear what's going on.

@oharboe
Copy link
Contributor Author

oharboe commented Mar 4, 2020

@ucbjrl I'm travelling to the Bay Area in Monday May 11., might it be possible to join in person for one of these meetings? I wasn't able to make the CCC20 and it would be good to put some faces to the github handles :-)

@ucbjrl
Copy link
Contributor

ucbjrl commented Mar 4, 2020

Absolutely, but let's continue this discussion via email.

@phhorrein
Copy link

Hello !

Any news on this ? I ran in the same problem. I sometime have another error:

[info] - should process 200 packets *** FAILED ***
[info]   java.lang.reflect.InvocationTargetException:
[info]   at jdk.internal.reflect.GeneratedConstructorAccessor37.newInstance(Unknown Source)
[info]   at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[info]   at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
[info]   at chisel3.Bundle.cloneType(Aggregate.scala:879)
[info]   at chisel3.Bundle.cloneType(Aggregate.scala:707)
[info]   at chisel3.Record._makeLit(Aggregate.scala:522)
[info]   at chisel3.experimental.package$BundleLiterals$AddBundleLiteralConstructor.Lit(package.scala:80)

I'm using snapshot versions also of chisel3 and testers2 (no iotesters in my tests).

@ucbjrl
Copy link
Contributor

ucbjrl commented Apr 5, 2020

Unfortunately, this appears to be an architectural issue - multiple threads end up using the default Chisel context during parallel tests. I've put it on the priority PR discussion list for tomorrow's developer meeting.

@ucbjrl
Copy link
Contributor

ucbjrl commented Apr 7, 2020

After discussion at the chisel-dev meeting on April 6, 2020, we've decided to investigate a short-term solution which will attempt to ensure each thread has its own Chisel context at the place the collision typically occurs. Note: this is a whack-a-mole type of solution. The long term solution is to reduce the dependency on and use of global mutable structures.

@ucbjrl
Copy link
Contributor

ucbjrl commented Apr 10, 2020

Fixed by #1404

@ucbjrl ucbjrl closed this as completed Apr 10, 2020
@ucbjrl
Copy link
Contributor

ucbjrl commented Apr 13, 2020

@oharboe, I hope I didn't close this issue prematurely. Could you verify 3.3.0-RC2 (with #1404) does fix this issue?

@oharboe
Copy link
Contributor Author

oharboe commented Apr 13, 2020

@ucbjrl I have removed my workaround, seems to work fine. I will open a new issue if the problem resurfaces.

Thanks!

@phhorrein
Copy link

Thanks, it works for me also.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants