sql: panic: indexed var linked to different container #16388
Dumping my thoughts after I looked at this with jordan:
So what we have at hand is an impossible stack trace, which in turn indicates something went very wrong in the Go runtime.
Another very suspicious thing is that the error message (AssertSameContainer) tells us the IndexedVar was migrated from one renderNode to another (the error only occurs if the addresses of the two renderNodes differ in memory, i.e. they are different objects). We do have logic that migrates expressions from one node to another; still, this suggests to me the rogue IndexedVar has "happened" into this AST by means of reusing a stale object from somewhere else in the program state.
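To make the assertion concrete, here is a minimal Go sketch of the kind of check this panic comes from; the type and function names (IndexedVarContainer, assertSameContainer) are illustrative stand-ins, not the actual CockroachDB definitions. The check is essentially a pointer comparison: a variable already bound to one container object trips it when it is later asserted against a different container object.

```go
package main

import "fmt"

// IndexedVarContainer stands in for whatever node (e.g. a renderNode)
// owns the variables an expression refers to. Hypothetical type for
// illustration only.
type IndexedVarContainer struct {
	name string
}

// IndexedVar remembers which container it was bound to.
type IndexedVar struct {
	Idx       int
	container *IndexedVarContainer
}

// assertSameContainer mimics the kind of check the panic message comes from:
// if the variable is already bound to a different container object (a
// different address), something reused a stale expression.
func assertSameContainer(v *IndexedVar, c *IndexedVarContainer) {
	if v.container != nil && v.container != c {
		panic(fmt.Sprintf("indexed var linked to different container: %p vs %p",
			v.container, c))
	}
	v.container = c
}

func main() {
	oldNode := &IndexedVarContainer{name: "renderNode A"}
	newNode := &IndexedVarContainer{name: "renderNode B"}

	v := &IndexedVar{Idx: 0}
	assertSameContainer(v, oldNode) // first binding: fine

	// Reusing the same (stale) IndexedVar against a different node trips the
	// assertion, because the two containers are distinct objects in memory.
	assertSameContainer(v, newNode) // panics
}
```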
There are two
I created the table and ran the query (without the WHERE) locally and I don't see the crash.
Ah, my bad, this is a 1.0.x version; it doesn't have the recent rewrite.
Perhaps this has to do with the fact that it's a prepared statement (notice it uses a placeholder) - maybe one of the
Indeed!
Logs:
Although the above was produced on a
I can't reproduce this on build 2cc11e7, which the user reported the crash with. This should be fixed regardless.
Just to make clear... the statement generally executes fine. Only during a stress test with 3 concurrent workers on a 9 node cluster does execution sometimes (1 out of 1000) fail (see the problem description at the top). Up to now I have seen the problem only on nodes where the data was not inserted; I will do some more tests to confirm.
Thanks @HeikoOnnebrink. The related problem on
IndexedVars are modified in-place, I believe (BindIfUnbound).
But in normal operation that should be fine. However, there was a bug: Line 142 in 5370448. That line mutates the original parsed AST, causing reuse of that AST to fail. The line is not necessary and should be removed; I will be submitting a PR to do this shortly.
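As a rough illustration of why that mutation matters (a sketch only; planNode, bindIfUnbound, and execute are hypothetical stand-ins, not the real code paths): if the parsed AST is cached, for instance for a prepared statement, and binding writes the container pointer into the AST itself, then a second execution builds a fresh container but finds the variable still bound to the first one, which is exactly the "different container" condition.

```go
package main

import "fmt"

type planNode struct{ id int } // a fresh container is built for each execution

type indexedVar struct {
	idx       int
	container *planNode // written into the shared AST on first bind
}

// bindIfUnbound binds the var only if it has no container yet; on a cached
// AST the var therefore stays bound to the container from the first execution.
func bindIfUnbound(v *indexedVar, p *planNode) {
	if v.container == nil {
		v.container = p
	}
}

func execute(ast *indexedVar, execID int) {
	p := &planNode{id: execID} // every execution gets its own plan node
	bindIfUnbound(ast, p)
	if ast.container != p {
		panic(fmt.Sprintf("indexed var linked to different container (exec %d)", execID))
	}
	fmt.Printf("exec %d ok\n", execID)
}

func main() {
	sharedAST := &indexedVar{idx: 0} // the prepared statement caches this AST

	execute(sharedAST, 1) // ok: binds to plan node 1
	execute(sharedAST, 2) // panics: still bound to plan node 1
}
```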
Nice find!
#16434 is a good fix for the problem identified in #16388 (comment) but not for this issue.
Hmm, yeah, I certainly can't prove that this issue is fully solved by #16434. But the two issues are suspiciously similar, so I was planning on optimistically closing this. I wonder if there is some place where we prepared an AST more than once before the prepare caching was introduced.
Not the same AST, no.
Another case when mutation of the parsed AST is a problem is when retrying statements on the server side. |
(You could instrument
From Gitter:
I created a custom build from release-1.0 |
@HeikoOnnebrink I am delighted that Jordan's fix has alleviated your problem. The fix will be released in CockroachDB 1.0.3. The precise release date is not known yet, but we usually do stable releases every 2-3 weeks. We will update the Docker images at the time of release. @jordanlewis color me impressed. I still have zero idea where the AST gets reused in 1.0.
@knz Just to understand it right... when 1.0.3 is released, which Docker image will be updated?
I'm not sure. @benesch can you chime in? |
Each new release will push to |
Report by @HeikoOnnebrink on Gitter.
Today I was prototyping some SQL as part of a REST API.
The idea is to have a table that stores requests inserted from different sources, where RequestKey is the primary key and is of type SERIAL to ensure global ordering.
CREATE TABLE IF NOT EXISTS DbRequests (
    RequestKey SERIAL PRIMARY KEY,
    RequestType STRING NOT NULL,
    RequestLock BOOL NOT NULL DEFAULT false,
    ID STRING NOT NULL,
    etc..
There is a set of consumers (worker processes) that grab requests and process them in some sorted fashion by querying the DbRequests table.
(I hope that in the future, when NOTIFY makes its way to Roach, I can use a smarter approach than querying the table on an interval..)
In order to feed the workers I use some SQL that, on each execution, returns the oldest request that is not yet processed.
The worker then tries to lock the request (UPDATE DbRequests SET RequestLock = true WHERE RequestKey = ... AND RequestLock = false).
As there can be several requests for each resource (the ID column stores the UUID of each resource), the query suppresses RequestKeys for which at least one record with the same ID has RequestLock = true.
In this way workers process requests fully in parallel for different IDs but sequentially for requests belonging to the same ID.
The related SQL that does the job is:
SELECT min(requestkey) AS requestkey FROM dbrequests WHERE dcname = $1 GROUP BY id HAVING count(CASE WHEN requestlock = true THEN 1 END) = 0 ORDER BY 1 LIMIT 1;
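For context, here is a minimal Go sketch of the worker-side polling pattern described above (the connection string, database name, and the "dc-a" argument are assumptions for illustration): the statement is prepared once with the $1 placeholder and then re-executed on an interval, which is the prepared-statement reuse pattern discussed in the earlier comments.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Connection string is illustrative; adjust host, port and database.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Prepare once, execute repeatedly with the $1 placeholder.
	stmt, err := db.Prepare(`SELECT min(requestkey) AS requestkey
		FROM dbrequests WHERE dcname = $1 GROUP BY id
		HAVING count(CASE WHEN requestlock = true THEN 1 END) = 0
		ORDER BY 1 LIMIT 1`)
	if err != nil {
		log.Fatal(err)
	}
	defer stmt.Close()

	for {
		var requestKey int64
		err := stmt.QueryRow("dc-a").Scan(&requestKey)
		switch {
		case err == sql.ErrNoRows:
			// nothing to process right now
		case err != nil:
			log.Printf("poll failed: %v", err)
		default:
			fmt.Printf("next request to lock: %d\n", requestKey)
		}
		time.Sleep(2 * time.Second) // the reported 2 second poll interval
	}
}
```

A worker that gets a key back would then attempt the UPDATE described above to claim the request before processing it.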
My environment is a 9 node cluster spanning 3 DCs with 3 nodes per DC, all replication settings left at their defaults.
I ran some tests by ingesting a few hundred requests at DC 1 and letting workers poll the DB at a 2 second interval from DCs A, B & C.
Basically all works fine as designed.. but after some time remote nodes in DC B and C fail with a PANIC in the log.
I uploaded log cockroachnodeFailSql.log to https://bitbucket.org/honnebrink/public-file-share/src
@HeikoOnnebrink 07:14
I repeated the tests and could reproduce the problem each time.. one test run got 2 FATALs in DC B and 1 FATAL in DC C.. the second test run failed 3 times in DC B only.. The test was executed with 4444 records..