distributed write with minimum quorum kills database if one node offline #2346

Closed
alexpmorris opened this issue May 16, 2014 · 4 comments

@alexpmorris

I've tried this on both Ubuntu and Windows with the latest 1.7 snapshot, and the same problem happens on both.

Start two nodes with write quorum = 2 and failureAvailableNodesLessQuorum = true.

At this point, replication works fine when a record is updated from Studio.

Shut down either node, then save a record:

"com.orientechnologies.orient.server.distributed.ODistributedException: Quorum cannot be reached because it is major than available nodes and 'failureAvailableNodesLessQuorum' settings is true"

The correct error is returned, but the new record version is saved to the database anyway! Shouldn't the write be rejected outright? And, upon restarting the second node:

2014-05-16 09:20:55:674 WARN [node1400243530171] found 1 previous messages in queue orientdb.node.node1400243530171.testdb.request, aligning the database... [OHazelcastDistributedMessageService]
2014-05-16 09:20:59:921 WARN segment file 'database.ocf' was not closed correctly last time [OSingleFileSegment]
[node1400243530171]<-[node1400243439918] error on reading distributed request: record_update(#9:0 v.57)
Error on creation of shared resource
    -> com.orientechnologies.common.concur.resource.OSharedContainerImpl.getResource(OSharedContainerImpl.java:55)
    -> com.orientechnologies.orient.server.distributed.ODistributedStorage.getResource(ODistributedStorage.java:516)
    -> com.orientechnologies.orient.core.metadata.OMetadataDefault.init(OMetadataDefault.java:110)
    -> com.orientechnologies.orient.core.metadata.OMetadataDefault.load(OMetadataDefault.java:68)
    -> com.orientechnologies.orient.core.db.record.ODatabaseRecordAbstract.open(ODatabaseRecordAbstract.java:291)
    -> com.orientechnologies.orient.core.db.ODatabaseWrapperAbstract.open(ODatabaseWrapperAbstract.java:49)
    -> com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:555)
    -> com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase.initDatabaseInstance(OHazelcastDistributedDatabase.java:268)
    -> com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase.onMessage(OHazelcastDistributedDatabase.java:458)
    -> com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase$1.run(OHazelcastDistributedDatabase.java:231)
    -> java.lang.Thread.run(Thread.java:679)
The record with id '#0:1' not found
    -> com.orientechnologies.common.concur.resource.OSharedContainerImpl.getResource(OSharedContainerImpl.java:55)
    -> com.orientechnologies.orient.server.distributed.ODistributedStorage.getResource(ODistributedStorage.java:516)
    -> com.orientechnologies.orient.core.metadata.OMetadataDefault.init(OMetadataDefault.java:110)
    -> com.orientechnologies.orient.core.metadata.OMetadataDefault.load(OMetadataDefault.java:68)
    -> com.orientechnologies.orient.core.db.record.ODatabaseRecordAbstract.open(ODatabaseRecordAbstract.java:291)
    -> com.orientechnologies.orient.core.db.ODatabaseWrapperAbstract.open(ODatabaseWrapperAbstract.java:49)
    -> com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:555)
    -> com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase.initDatabaseInstance(OHazelcastDistributedDatabase.java:268)
    -> com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase.onMessage(OHazelcastDistributedDatabase.java:458)
    -> com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase$1.run(OHazelcastDistributedDatabase.java:231)
    -> java.lang.Thread.run(Thread.java:679)
Storage testdb is not opened.
    -> com.orientechnologies.common.concur.resource.OSharedContainerImpl.getResource(OSharedContainerImpl.java:55)
    -> com.orientechnologies.orient.server.distributed.ODistributedStorage.getResource(ODistributedStorage.java:516)
    -> com.orientechnologies.orient.core.metadata.OMetadataDefault.init(OMetadataDefault.java:110)
    -> com.orientechnologies.orient.core.metadata.OMetadataDefault.load(OMetadataDefault.java:68)
    -> com.orientechnologies.orient.core.db.record.ODatabaseRecordAbstract.open(ODatabaseRecordAbstract.java:291)
    -> com.orientechnologies.orient.core.db.ODatabaseWrapperAbstract.open(ODatabaseWrapperAbstract.java:49)
    -> com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:555)
    -> com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase.initDatabaseInstance(OHazelcastDistributedDatabase.java:268)
    -> com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase.onMessage(OHazelcastDistributedDatabase.java:458)
    -> com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase$1.run(OHazelcastDistributedDatabase.java:231)
    -> java.lang.Thread.run(Thread.java:679)
Exception in thread "main" com.orientechnologies.common.exception.OException: File flush was abnormally terminated
    at com.orientechnologies.orient.core.index.hashindex.local.cache.OWOWCache.flush(OWOWCache.java:593)
    at com.orientechnologies.orient.core.index.hashindex.local.cache.OReadWriteDiskCache.flushFile(OReadWriteDiskCache.java:228)
    at com.orientechnologies.orient.core.storage.impl.local.paginated.OClusterPositionMap.flush(OClusterPositionMap.java:90)
    at com.orientechnologies.orient.core.storage.impl.local.paginated.OPaginatedCluster.synch(OPaginatedCluster.java:1447)
    at com.orientechnologies.orient.core.storage.impl.local.paginated.OPaginatedCluster.close(OPaginatedCluster.java:219)
    at com.orientechnologies.orient.core.storage.impl.local.paginated.OLocalPaginatedStorage.doClose(OLocalPaginatedStorage.java:1876)
    at com.orientechnologies.orient.core.storage.impl.local.paginated.OLocalPaginatedStorage.close(OLocalPaginatedStorage.java:304)
    at com.orientechnologies.orient.core.storage.impl.local.paginated.OLocalPaginatedStorage.open(OLocalPaginatedStorage.java:220)
    at com.orientechnologies.orient.core.db.raw.ODatabaseRaw.open(ODatabaseRaw.java:101)
    at com.orientechnologies.orient.core.db.ODatabaseWrapperAbstract.open(ODatabaseWrapperAbstract.java:49)
    at com.orientechnologies.orient.core.db.record.ODatabaseRecordAbstract.open(ODatabaseRecordAbstract.java:268)
    at com.orientechnologies.orient.core.db.ODatabaseWrapperAbstract.open(ODatabaseWrapperAbstract.java:49)
    at com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:555)
    at com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase.initDatabaseInstance(OHazelcastDistributedDatabase.java:268)
    at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.loadDistributedDatabases(OHazelcastPlugin.java:713)
    at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.startup(OHazelcastPlugin.java:188)
    at com.orientechnologies.orient.server.OServer.registerPlugins(OServer.java:718)
    at com.orientechnologies.orient.server.OServer.activate(OServer.java:239)
    at com.orientechnologies.orient.server.OServerMain.main(OServerMain.java:32)
Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
    at java.util.concurrent.FutureTask.get(FutureTask.java:111)
    at com.orientechnologies.orient.core.index.hashindex.local.cache.OWOWCache.flush(OWOWCache.java:588)
    ... 18 more
Caused by: java.lang.NullPointerException
    at com.orientechnologies.orient.core.index.hashindex.local.cache.OWOWCache$FileFlushTask.call(OWOWCache.java:373)
    at com.orientechnologies.orient.core.index.hashindex.local.cache.OWOWCache$FileFlushTask.call(OWOWCache.java:322)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
Exception in thread "OrientDB Node Request orientdb.node.node1400243530171.testdb.request" com.orientechnologies.orient.core.exception.OStorageException: Storage testdb is not opened.
    at com.orientechnologies.orient.core.storage.OStorageEmbedded.checkOpeness(OStorageEmbedded.java:242)
    at com.orientechnologies.orient.core.storage.impl.local.paginated.OLocalPaginatedStorage.getClusterIdByName(OLocalPaginatedStorage.java:993)
    at com.orientechnologies.orient.server.distributed.ODistributedStorage.getClusterIdByName(ODistributedStorage.java:728)
    at com.orientechnologies.orient.core.db.raw.ODatabaseRaw.getClusterIdByName(ODatabaseRaw.java:388)
    at com.orientechnologies.orient.core.db.ODatabaseWrapperAbstract.getClusterIdByName(ODatabaseWrapperAbstract.java:186)
    at com.orientechnologies.orient.core.cache.OLevel1RecordCache.startup(OLevel1RecordCache.java:52)
    at com.orientechnologies.orient.core.db.record.ODatabaseRecordAbstract.open(ODatabaseRecordAbstract.java:288)
    at com.orientechnologies.orient.core.db.ODatabaseWrapperAbstract.open(ODatabaseWrapperAbstract.java:49)
    at com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:555)
    at com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase.initDatabaseInstance(OHazelcastDistributedDatabase.java:268)
    at com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase.setOnline(OHazelcastDistributedDatabase.java:280)
    at com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase$1.run(OHazelcastDistributedDatabase.java:218)
    at java.lang.Thread.run(Thread.java:679)
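
For reference, the behaviour asked for above (reject the write outright when the quorum cannot be reached, and leave the local record untouched) can be sketched roughly as follows. This is only an illustration under stated assumptions, not OrientDB's implementation: `QuorumGuardedStore`, `QuorumNotReachedException`, `setAvailableNodes` and the in-memory version map are all hypothetical names made up for this sketch. The point is simply that the quorum check runs before any local state is changed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these types are OrientDB classes.
final class QuorumGuardedStore {

    /** Thrown when fewer nodes are available than the write quorum requires. */
    static final class QuorumNotReachedException extends RuntimeException {
        QuorumNotReachedException(String message) {
            super(message);
        }
    }

    private final int writeQuorum;
    // Stands in for the 'failureAvailableNodesLessQuorum' setting from the report.
    private final boolean failIfNodesLessThanQuorum;
    private int availableNodes;

    // Record id -> current version; a stand-in for the real local storage.
    private final Map<String, Integer> versions = new HashMap<>();

    QuorumGuardedStore(int writeQuorum, boolean failIfNodesLessThanQuorum, int availableNodes) {
        this.writeQuorum = writeQuorum;
        this.failIfNodesLessThanQuorum = failIfNodesLessThanQuorum;
        this.availableNodes = availableNodes;
    }

    /** Simulates a node joining or leaving the cluster. */
    void setAvailableNodes(int availableNodes) {
        this.availableNodes = availableNodes;
    }

    /** Bumps the record version and returns it, or throws without changing anything. */
    int update(String rid) {
        // Check the quorum BEFORE touching local state, so a rejected write leaves no trace.
        if (failIfNodesLessThanQuorum && availableNodes < writeQuorum) {
            throw new QuorumNotReachedException(
                "write quorum " + writeQuorum + " > available nodes " + availableNodes);
        }
        return versions.merge(rid, 1, Integer::sum);
    }

    /** Returns the stored version of the record, or 0 if it was never written. */
    int versionOf(String rid) {
        return versions.getOrDefault(rid, 0);
    }
}
```

With two nodes available and a write quorum of 2, `update("#9:0")` succeeds. After `setAvailableNodes(1)` the same call throws, and `versionOf("#9:0")` still reports the version from before the failed write.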

@lvca lvca self-assigned this May 29, 2014
@lvca lvca modified the milestones: 1.7.1, 1.7.2 May 29, 2014
@lvca lvca modified the milestones: 1.7.3, 1.7.2 Jun 7, 2014
@andrii0lomakin andrii0lomakin modified the milestones: 1.7.4, 1.7.3 Jun 12, 2014
@lvca lvca modified the milestones: 1.7.5, 1.7.4 Jun 23, 2014
@lvca lvca modified the milestones: 1.7.6, 1.7.5, 1.7.7 Jul 10, 2014
@lvca lvca modified the milestones: 1.7.8, 1.7.7 Jul 23, 2014
@lvca lvca removed the 1 - Next label Aug 1, 2014
@lvca
Member

lvca commented Aug 13, 2014

Can you retry with last 1.7.8-SNAPSHOT?

@lvca lvca modified the milestones: 1.7.8, 1.7.9 Aug 13, 2014
@alexpmorris
Author

I banged away at it from studio, killed each orientdb process back and forth, and it never corrupted or threw the database out of sync at all, so great work and kudos!

Just two minor possible issues...

  1. The message "com.orientechnologies.orient.server.distributed.ODistributedException: Quorum cannot be reached because it is major than available nodes and 'failureAvailableNodesLessQuorum' settings is true" only showed up as the second node was coming back online, and NOT while it was down completely. Intermittently, after a long wait, I did sometimes see this message: "com.orientechnologies.orient.core.exception.OStorageException: Cannot route UPDATE_RECORD operation against #11:0 to the distributed node\r\n--> com.orientechnologies.orient.server.distributed.ODistributedException: Error on sending distributed request ... " Not to get too picky, but I would expect to receive the same error message consistently for the entire time the quorum cannot be reached.

  2. (Side Issue) One time, I had this happen after both were running again:
    Link: http://127.0.0.1:2480/studio/index.html#/database/mydb/browse/edit/11:2
    Would jump to: http://127.0.0.1:2480/studio/index.html#/404
    With json: {"data":"","status":0,"config":{"method":"GET","transformRequest":[null],"transformResponse":[null],"url":"/document/mydb/11:2","headers":{"Accept":"application/json, text/plain, */*"}},"statusText":""}

    This seems to be some sort of cache-related issue though (even though all the studio requests in chrome were set to no-cache), because when I tried the link in another chrome tab it worked fine on the same database.

@lvca
Member

lvca commented Aug 15, 2014

@qd01a Ok, I'm closing this issue. For the other 2 things, could you open new issues so they can be tracked better?

@lvca lvca closed this as completed Aug 15, 2014
@alexpmorris
Author

will do

3 participants