Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Racecondition: Different requests get same response #583

Closed
dbaumann2 opened this issue May 3, 2023 · 13 comments
Closed

Racecondition: Different requests get same response #583

dbaumann2 opened this issue May 3, 2023 · 13 comments
Labels
Milestone

Comments

@dbaumann2
Copy link

In the IgActorSimulator and IgxActorSimulator an AdhocQueryResponseGenerator is set to a field called ahqrg. This can lead to mixed up responses when multiple requests are handled in parallel.

dbaumann2 pushed a commit to dbaumann2/iheos-toolkit2 that referenced this issue May 3, 2023
@dbaumann2
Copy link
Author

see Pull Request #584

@skbhaskarla
Copy link
Collaborator

@dbaumann2, could you please upload some test artifacts that demonstrate the race condition? Thanks

@dbaumann2
Copy link
Author

dbaumann2 commented May 4, 2023

Unfortunately I have no automated test demonstrating the issue. Here is some logging output from our application. Response message with ID 21 is expected to have Message ID "urn:uuid:0e5b9323-7076-47ca-b0f6-9d34c17d6eaf" and therefore cannot be correlated by our application. This led to us investigating the code of the toolkit. Hope this helps to understand the issue.

Message ID 21: Request, AdhocQueryRequest, MessageId urn:uuid:0e5b9323-7076-47ca-b0f6-9d34c17d6eaf
Message ID 20: Request, AdhocQueryRequest, MessageId urn:uuid:452d35f4-ed04-4f6c-8230-bd70083b3e45
Message ID 19: Response...
Message ID 22: Response...
Message ID 20: Response, AdhocQueryResponse, RelatesTo urn:uuid:452d35f4-ed04-4f6c-8230-bd70083b3e45
Message ID 21: Response, AdhocQueryResponse, RelatesTo urn:uuid:452d35f4-ed04-4f6c-8230-bd70083b3e45

@skbhaskarla
Copy link
Collaborator

I understand the issue but without a test procedure that can consistently reproduce the problem, I wouldn't be able to confirm the bug fix.

@dbaumann2
Copy link
Author

Is a simple unit test sufficient or do you need an integration test?

@skbhaskarla
Copy link
Collaborator

Is a simple unit test sufficient or do you need an integration test?

Any of those type of tests is fine as long as it shows the toolkit generated MessageID is incorrect.

@dbaumann2
Copy link
Author

I've written a SoapUI Performance Test that can detect the error:
image

ToolKit-Test-soapui-project.xml.zip

@skbhaskarla
Copy link
Collaborator

I built XDS Toolkit from your pull request, ran the load tests but according to the results in the screenshot below, the error condition still exists.

image

image

The good news is that your test is able to confirm the issue you reported but I'm afraid the pull request code did not solve the issue.

I am adding @andrewmccaffreynist to help look into the root cause of this issue.

@dbaumann2
Copy link
Author

We did not take a deeper look at the code. It just seemed to be the relevant place, but we unfortunately did not have time to compile and deploy it yet.
Thank you very much for further investigating this issue!

@mratzenb
Copy link

We did some more investigation and found the problematic place. It is the creation of event directories, which are then used for saving the request, response and some other stuff.

When handling a request in gov/nist/toolkit/fhir/simulators/servlet/SimServlet.java a new gov/nist/toolkit/simcommon/server/SimDb.groovy instance is created every time. In the constructor, an event Directory with the current Date as name. This is already the problem, because if two SimDb instances get created with the same Date (which can happen, since the timestamp is made up of year.month.day.hours.minutes.seconds.milliseconds), everything serialized into the event directory will be overwritten. Hence, explaining the mismatch of messageID and relatesTo.

We "fixed" this issue by adding a new UUID to gov/nist/toolkit/installation/server/Installation.java#asFilenameBase like so:

    // Installation.java#asFilenameBase
    return val + "-"+ UUID.randomUUID();

Using the provided SoapUI project, we can see no errors anymore. However, we don't know the side effects this will have on the entire Toolkit.

An additional improvement would be to not rely on I/O operations for the request/response handling. For example, ThreadLocals could be used for storing the data within the current thread.

@mratzenb
Copy link

mratzenb commented Jun 15, 2023

After some more digging around, a more elegant solution would be to synchronize during the file event creation inside SimDb like so:

public class SimDb {
	// ...
	private static final Object fileLock = new Object();
	//...
	SimDb(SimId simId, String actor, String transaction, boolean openToLastTransaction) {
		//...
		if (openToLastTransaction) {
			openMostRecentEvent(actor, transaction)
		} else {
			// SimDb Line: 115
                        File eventDir;
			synchronized (fileLock) {
				eventDate = new Date();
				eventDir = mkEventDir(eventDate);
				eventDir.mkdirs();
			}
			Serialize.out(new File(eventDir, "date.ser"), eventDate);
		}
	}
	// ...
}

@skbhaskarla
Copy link
Collaborator

@mratzenb Thanks for your investigation, I believe you have found the root cause of the reported issue.

skbhaskarla pushed a commit that referenced this issue Jun 16, 2023
@skbhaskarla skbhaskarla added this to the v7.10.0 milestone Jun 16, 2023
@skbhaskarla
Copy link
Collaborator

Additional details based on the investigation above:
The problem has two parts: 1) Clobbered SimDB Event folder by the same current timestamp creation and 2) Unreliable response message relatesTo/messageId correlation.
For the first part, the contributed code seems to mitigate the issue. I have run the SoapUI tests against the updated code, and looking at the SimDB Event folders created, the race condition mitigation is evident when creating two separate event folders based on the same timestamp. Example: 2023_06_16_11_08_42_622 and 2023_06_16_11_08_42_622_1. In this case, the appended integer counter (i.e., "_1") to the folder name is made unique by the mkEventDir method once it is given a chance. This was how the fix was verified in my local environment.

Second, as it was suggested as an "additional improvement", I agree that the request message validation can use an in-memory content reference rather than re-loading from it from the SimDB Event folder on disk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants