stuarthalloway edited this page Dec 19, 2012 · 12 revisions

Solved!!

This issue appears to be solved, here is the key commit.

Summary

Take the Hornet ping example. Configure it for SSL. Put a loop around it so that it runs forever. Create and close the HornetQ session around each ping.

Within a few seconds, the Producer will hang, and eventually time out and fail. If you change any of the variables above, the problem disappears.

This problem is easy to reproduce. If Maven has already installed the Internet on your local machine, you should be able to see the problem yourself within a few minutes by following the instructions in the next section. The problem is consistent across a range of Hornet releases, from 2.2 to the present.

Suspicion: there is some timing issue around shutdown of HornetQ sessions that use SSL.
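The shape of the loop described above can be sketched as follows. `Session` here is a hypothetical stand-in for HornetQ's `ClientSession` (the real programs create it from a `ClientSessionFactory`); the only point of the sketch is that the session is created and closed inside the loop, once per ping, which is where the suspected SSL shutdown race would live:

```java
public class PingLoopShape {
    // hypothetical stand-in for HornetQ's ClientSession; the real API also
    // creates producers and consumers from the session
    static class Session implements AutoCloseable {
        void send(String msg) { /* deliver one ping */ }
        @Override public void close() { /* SSL close_notify would happen here */ }
    }

    public static void main(String[] args) throws Exception {
        int pings = 0;
        while (pings < 3) {                       // the real loop runs forever
            try (Session session = new Session()) { // new session per ping...
                session.send("ping " + pings);
                pings++;
            }                                     // ...closed per ping
        }
        System.out.println("sent " + pings + " pings");
    }
}
```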

Reproduce the Problem

The Consumer and Producer programs are adapted from the ping example that ships with Hornet, with three key modifications:

  • two separate application mains
  • infinite loop of messages, instead of 1
  • SSL plus key and trust stores used by Datomic
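The SSL settings are plain string keys in the connector's parameter map. A minimal sketch of what the two programs configure — only the `"ssl-enabled"` key appears verbatim later in this page; the store paths and passwords below are hypothetical placeholders for the Datomic key and trust stores:

```java
import java.util.HashMap;
import java.util.Map;

public class SslParams {
    static Map<String, Object> sslParams() {
        Map<String, Object> m = new HashMap<>();
        // the flag the non-SSL comparison run below flips to false
        m.put("ssl-enabled", true);
        // hypothetical store locations -- substitute the key and trust
        // stores used by Datomic
        m.put("key-store-path", "path/to/keystore");
        m.put("key-store-password", "changeit");
        m.put("trust-store-path", "path/to/truststore");
        m.put("trust-store-password", "changeit");
        return m;
    }

    public static void main(String[] args) {
        System.out.println(sslParams().get("ssl-enabled"));
    }
}
```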

You can run the consumer and producer as follows:

git clone git://github.com/Datomic/datomic-java-examples.git
cd datomic-java-examples
mvn compile

# from two different shells
mvn exec:java -Dexec.mainClass=hornet.samples.PingConsumer
mvn exec:java -Dexec.mainClass=hornet.samples.PingProducer

If you run first the consumer, and then the producer, they will attempt to loop forever, with the consumer consuming messages from the producer. In my tests, this loop fails in under a minute, after fewer than a few hundred pings.

Two errors jump out at me from the logs:

2012-12-02 09:33:02.630 DEBUG default    o.jboss.netty.handler.ssl.SslHandler - Failed to clean up SSLEngine.
javax.net.ssl.SSLException: Inbound closed before receiving peer's close_notify: possible truncation attack?

2012-12-02 09:33:02.645 DEBUG default    o.jboss.netty.handler.ssl.SslHandler - Swallowing an exception raised while writing non-app data
java.nio.channels.ClosedChannelException: null

These errors suggest some kind of lifecycle bug. Interestingly, the errors disappear if you run the same test without SSL:

Hornet Ping, SSL false runs fine

If you run the same Consumer and Producer above, but change each file's m.put("ssl-enabled", true); to false, the example no longer fails. The attached log shows a run of more than 10K pings, two orders of magnitude more than the number of pings needed to make the SSL example above fall over.

Hornet Ping, SSL true, Detailed SSL logging runs fine

If I return to the SSL-enabled code and turn on detailed SSL logging with -Djavax.net.debug=all, the problem seems to go away. This furthers my impression that there is some kind of timing problem around closing SSL sessions, and that the additional overhead of detailed SSL logging is enough to mask the problem.

Testing with different versions of Hornet

In pom.xml: Exclude hornetq from Datomic dep, and add some other hornetq version as top level dep. The excerpt below shows where the two changes are:

<dependency>
  <groupId>com.datomic</groupId>
  <artifactId>datomic-free</artifactId>
  <version>0.8.3627</version>
  <exclusions>
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
    <!-- add the exclusion below -->
    <exclusion>
      <groupId>org.hornetq</groupId>
      <artifactId>hornetq-core</artifactId>
    </exclusion>
    <exclusion>
     <groupId>org.slf4j</groupId>
     <artifactId>slf4j-nop</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>ch.qos.logback</groupId>
  <artifactId>logback-classic</artifactId>
  <version>1.0.1</version>
</dependency>
<!-- add this dependency, set version to taste  -->
<dependency>
  <groupId>org.hornetq</groupId>
  <artifactId>hornetq-core</artifactId>
  <version>2.2.21.Final</version>
</dependency>
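After editing the pom, it is worth confirming which hornetq-core version actually resolved. The standard maven-dependency-plugin can show this (the includes filter is groupId:artifactId):

```shell
mvn dependency:tree -Dincludes=org.hornetq:hornetq-core
```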

Evidence that this is a timing problem

  • mean time to fail differs widely across machines, but seems consistent on a particular machine. While all of my (Stu's) machines hit the failure fairly quickly, a few people have never been able to hit it.
  • if you turn on detailed logging with -Djavax.net.debug=all, it takes many more pings to encounter the failure.

Other nonfactors

  • reproduced on single machine when running wired, wireless, and no network
  • reproduced on a bunch of different JDK versions
  • reproduced on a bunch of different netty versions up to 3.5.7.Final
  • reproduced on a bunch of different HornetQ versions, up to 2.3.0.Alpha

Anybody else hitting this?

This question on StackOverflow might be related. The poster sees the same ClosedChannelException, only under heavy load. No resolution. :-(

Pairhost reproduction

On Dec 7 2012, Alex Redington reproduced the problem on ec2 pairhosts using Java version:

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.4) (6b24-1.11.4-1ubuntu0.12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

The OS was Ubuntu 12.04. Over the course of 5 trials, the pair timed out after the following counts of HornetQ txes:

  • 44
  • 44
  • 283
  • 44
  • 76
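For a rough sense of scale, the mean and median of those five counts (a trivial computation over the data listed above, not from the original trials):

```java
import java.util.Arrays;

public class TrialStats {
    public static void main(String[] args) {
        int[] txes = {44, 44, 283, 44, 76};  // counts from the five trials above
        int sum = Arrays.stream(txes).sum();
        int[] sorted = txes.clone();
        Arrays.sort(sorted);
        System.out.println("mean=" + (sum / (double) txes.length)
                + " median=" + sorted[sorted.length / 2]);
    }
}
```

The low median relative to the mean shows how skewed the failure times are, which fits the timing-problem hypothesis.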