Reproduction Harness
The Artio reproduction harness provides a way to reproduce sequences of events that happen through your Artio Gateway. The aim is to be able to reproduce problematic scenarios that lead to bugs within Artio or your own code. It requires cooperation from user code in order to be used effectively.
In order to reproduce a previous run of Artio you need to keep your Artio logFileDir() and Aeron Archive instance from the previous run. These directories are used as the source of data for the reproduction. To use reproduction mode, your system must respond deterministically to every callback event from Artio, in the same way that it did on the original run. For example, if you received a NewOrderSingle message and replied with an ExecutionReport, then you must do so with the exact same values and fields as before. If you requested that a session be owned by a specific library on the original run, you must do so on the reproduction, and so on.
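One way to meet the determinism requirement is to derive every value in a reply from the inbound message, or from state that only changes in response to inbound messages, and never from wall-clock time or random sources. The following is a minimal sketch under that assumption; NewOrder, ExecReport and DeterministicOrderHandler are hypothetical stand-ins for illustration and are not Artio types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for decoded FIX messages; these are not Artio types.
record NewOrder(String clOrdId, String symbol, long quantity, long sendingTimeNs) {}

record ExecReport(String execId, String clOrdId, String symbol, long quantity, long transactTimeNs) {}

final class DeterministicOrderHandler
{
    // This counter advances only in response to inbound messages, so it replays identically.
    private long nextExecId = 1;

    ExecReport onNewOrder(final NewOrder order)
    {
        // Avoid UUID.randomUUID() and System.nanoTime() here: both differ between runs.
        final String execId = "E" + nextExecId++;
        return new ExecReport(
            execId, order.clOrdId(), order.symbol(), order.quantity(), order.sendingTimeNs());
    }

    // Feeds a sequence of inbound orders through a fresh handler and collects the replies.
    static List<ExecReport> run(final List<NewOrder> inbound)
    {
        final DeterministicOrderHandler handler = new DeterministicOrderHandler();
        final List<ExecReport> replies = new ArrayList<>();
        for (final NewOrder order : inbound)
        {
            replies.add(handler.onNewOrder(order));
        }
        return replies;
    }
}
```

Running the same inbound sequence through the handler twice yields equal replies, which is the property a reproduction run relies on.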
When configuring the FixEngine for reproduction mode you need to invoke EngineConfiguration.reproduceInbound() with the start and end times of your reproduction. Then, after you have launched your FixEngine and your system is started up and ready to go, you can call startReproduction(). The reply returned from this method can be checked in order to know when the reproduction has completed.
The reproduction operation entails replaying messages from the Archive using Aeron's IPC_CHANNEL. If the default stream id clashes with an existing Aeron IPC stream, EngineConfiguration.reproductionLogStream() can be used to configure the stream id.
To make a reproduction run more accurate, EngineConfiguration.writeReproductionLog(true) can be configured. By default Artio doesn't record when TCP channels get back-pressured, beyond noting when slow consumer mode switches on and off, because it's not normally useful information. Enabling the reproduction log records when back-pressure happens, and that information can be used during reproduction runs to control the order of events. Take care when enabling this flag. Normally it shouldn't generate too many events, but if a gateway is under constant back-pressure then it can be spammy. The stream used to record and replay the reproduction log can be configured using EngineConfiguration.reproductionLogStream().
When configuring the FixLibrary for reproduction mode you need to invoke LibraryConfiguration.reproduceInbound() with the start and end times of your reproduction. Each library instance needs to use the same library id as on the original run. To facilitate this, the LibraryConfiguration.libraryId() method has been added; it sets a library id explicitly, rather than ids always being randomly generated. Take care to ensure that libraries are given distinct library ids from each other. Giving multiple libraries the same id within the same run isn't supported and can lead to bugs.
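Because library ids must be stable across runs and distinct within a run, it can help to allocate them from fixed configuration and fail fast on a clash. The following is a minimal sketch of that idea; LibraryIdRegistry and the library names are hypothetical, and the id is assumed to be an int that you would then pass to LibraryConfiguration.libraryId().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Artio: tracks fixed library ids by library name.
final class LibraryIdRegistry
{
    private final Map<Integer, String> nameById = new HashMap<>();

    // Registers a fixed id for a named library and rejects an id already claimed by another library.
    int register(final String libraryName, final int libraryId)
    {
        final String existing = nameById.putIfAbsent(libraryId, libraryName);
        if (existing != null && !existing.equals(libraryName))
        {
            throw new IllegalArgumentException(
                "library id " + libraryId + " already used by " + existing);
        }
        return libraryId;
    }
}
```

Keeping the name-to-id mapping in version-controlled configuration means the original run and the reproduction run pick up the same ids without extra bookkeeping.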
- In reproduction mode Artio ignores bind operations.
- You cannot use a custom TcpChannelSupplier, because reproduction mode creates fake TCP channels for the purpose of replaying the reproduction.
- You should not set a custom clock with reproduction mode, because it uses timestamps from the event stream to create a fake clock that triggers events.
- At the moment reproduction mode doesn't support initiated connections, only acceptor connections.
- You also have to have a way to generate your stream of outbound events that comes from your internal matching engines, with timestamps that advance at the rate of the fake clock.
- Reproducing complex interleaving bugs requires the exact order of events, and reproduction mode can't guarantee it: the outbound events and the replayed events may not have the exact same interleaving as on the original run. In other words, it isn't a 100% deterministic reproduction.
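The requirement that outbound events carry timestamps advancing at the rate of the fake clock can be sketched as a queue of recorded events that are released only once the replayed time reaches them. OutboundEventFeed and its methods are hypothetical, and how your application observes the fake clock's current time is an assumption here rather than a documented Artio API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of Artio: releases recorded outbound events in timestamp order.
final class OutboundEventFeed
{
    record Event(long timestampNs, String payload) {}

    private final Deque<Event> pending = new ArrayDeque<>();

    // Events must be added in non-decreasing timestamp order, as recorded on the original run.
    void add(final long timestampNs, final String payload)
    {
        final Event last = pending.peekLast();
        if (last != null && timestampNs < last.timestampNs())
        {
            throw new IllegalArgumentException("events must be added in timestamp order");
        }
        pending.addLast(new Event(timestampNs, payload));
    }

    // Called whenever the replayed (fake) time advances; returns the payloads that are now due.
    List<String> poll(final long fakeClockNs)
    {
        final List<String> due = new ArrayList<>();
        while (!pending.isEmpty() && pending.peekFirst().timestampNs() <= fakeClockNs)
        {
            due.add(pending.pollFirst().payload());
        }
        return due;
    }
}
```

This keeps outbound events from running ahead of the replayed inbound stream, although, as noted in the list above, it does not make the interleaving fully deterministic.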
The aim of the system is to keep the required changes minimally invasive, and to keep behaviour as similar to production as possible when running in reproduction mode.
In reproduction mode Artio takes the inbound events of the system from the archive. It creates "fake" TCP connections at the points in time where a TCP connection was received on the original run; these also let you see the outbound messages that go back out, which helps in reproducing the problem. It then replays the inbound messages, connection creations and disconnects into your system. It creates a fake clock, driven by timestamps from the stream, in order to trigger time-driven events emitted from the Session logic, such as heartbeats.
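The fake clock idea can be illustrated with a clock that only moves when a replayed event supplies a timestamp, plus a timer that fires heartbeats against it. This is a simplified sketch of the concept and not Artio's implementation; ReplayClock and HeartbeatTimer are hypothetical names.

```java
// Hypothetical sketch of a stream-driven clock; not Artio's implementation.
final class ReplayClock
{
    private long nowNs;

    // Advance to the timestamp of the event being replayed; time never goes backwards.
    void advanceTo(final long eventTimestampNs)
    {
        nowNs = Math.max(nowNs, eventTimestampNs);
    }

    long nanoTime()
    {
        return nowNs;
    }
}

// Hypothetical time-driven component that reads the replay clock instead of the system clock.
final class HeartbeatTimer
{
    private final ReplayClock clock;
    private final long intervalNs;
    private long nextDeadlineNs;

    HeartbeatTimer(final ReplayClock clock, final long intervalNs)
    {
        this.clock = clock;
        this.intervalNs = intervalNs;
        this.nextDeadlineNs = clock.nanoTime() + intervalNs;
    }

    // Returns how many heartbeats became due since the last poll.
    int poll()
    {
        int fired = 0;
        while (clock.nanoTime() >= nextDeadlineNs)
        {
            fired++;
            nextDeadlineNs += intervalNs;
        }
        return fired;
    }
}
```

Because the timer never reads the system clock, heartbeats fire at the same points in the event stream on every run, regardless of how fast the replay itself executes.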