Suggestion: Initial sync #47
Comments
Hello, I agree. If this could be exposed through an API or some other means (config file, etc.), that would be really great. For stale collections (at least in my use case) I've made a patch, but I don't like the idea of running personally forked code, suited only to my use case, in production. Regards,
Yes, I'd also love to see this feature implemented. After working with the MySQL river (which slurps in the table initially), I thought I was doing something wrong when my collection wasn't being slurped. If there is no plan to implement this, it might be worth mentioning in the wiki.
+1. Once we have changed the mapping, we need to clean the old index and would like a "re-pull" function that pulls data from MongoDB into Elasticsearch. Hope to see this feature.
+1. Would love to see this feature supported
+1 would love this too!
+1 for this!
A good workaround for this would be to simply do a bulk update on the collection after the mongo rivers are set up. I use this for millions of records and it works great.
+1 Very useful!
+1
subratbasnet: Then set up the river on collectionB?
Yevesx: What I meant was: when you have a stale collection in Mongo, first you would set up the river for that collection. This will NOT automatically start moving the data from the stale collection to Elasticsearch. To trigger that, you could simply perform a bulk update on the collection with a condition that matches all the records. For example, in my case, I simply change the "updated" field inside all the documents in my collection, and this triggers the river and moves the affected documents to Elasticsearch.
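For illustration, a minimal sketch of that bulk-update trigger, assuming pymongo and a collection that already has an "updated" field (the connection string and names are placeholders, not from the river itself):

```python
from datetime import datetime, timezone
from pymongo import MongoClient

# Placeholder connection and names; adjust for your deployment.
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

# Touch every document so each one appears in the oplog and gets
# picked up by the river.
result = collection.update_many({}, {"$set": {"updated": datetime.now(timezone.utc)}})
print("touched", result.modified_count, "documents")
```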
I see! This is a clever idea.
Updating every document in MongoDB to get them to appear in the oplog and be copied by the river is very clever. Unfortunately, I believe this will make the initial import much slower. Writes to MongoDB are much slower for me than writes to Elasticsearch (because MongoDB stores data less efficiently than ES and because MongoDB has its unfortunate DB-level lock). Do you think it would work if we copied over all documents from MongoDB and then iterated over the oplog? I think that's what this issue is requesting, and it doesn't sound much more difficult than what we have today. A nice optimization would be to read the latest oplog timestamp, import the collection, then import the oplog only from that start timestamp.
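If it helps, a rough sketch of that approach, assuming a replica set and pymongo; all names are placeholders, and the actual indexing and oplog-apply steps are reduced to print statements:

```python
import pymongo
from pymongo import MongoClient

# Illustrative connection and names, not the river's configuration.
client = MongoClient("mongodb://localhost:27017")
source = client["mydb"]["mycollection"]
oplog = client["local"]["oplog.rs"]   # only exists on a replica set member

# 1. Remember the current end of the oplog before copying anything.
start_ts = next(oplog.find().sort("$natural", pymongo.DESCENDING).limit(1))["ts"]

# 2. Bulk-copy the existing documents (indexing into Elasticsearch would go here).
for doc in source.find():
    print("index", doc["_id"])

# 3. Replay only the oplog entries written since the copy started
#    (the tailable cursor keeps waiting for new entries).
tail = oplog.find({"ts": {"$gte": start_ts}},
                  cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
for entry in tail:
    print("apply", entry["op"], entry.get("ns"))
```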
There are a few challenges with your suggestion:
[1] - http://docs.mongodb.org/manual/reference/method/db.fsyncLock/
Thanks for the feedback. Could you clarify? I'm not sure that those things are true. E.g. why would the collection need to be locked? Yes, copying without locking could result in an inconsistent state, but then once the oplog is applied wouldn't that fix it?
The main reason for #47 is to synchronize data that is not available in oplog.rs.
In step 1 we will need to ensure no new data is imported into the collection. If you import without locking you could get an inconsistent state / data, and there is no guarantee that this will be fixed when processing the oplog.
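As an illustration of the locking step referenced in [1], a minimal sketch assuming pymongo and a reasonably recent MongoDB (fsyncLock/fsyncUnlock are the shell helpers for these commands):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Flush pending writes and block further writes, like db.fsyncLock() in the shell.
client.admin.command("fsync", lock=True)
try:
    # ... run the full collection copy here while writes are blocked ...
    pass
finally:
    # Release the write lock, like db.fsyncUnlock() in the shell.
    client.admin.command("fsyncUnlock")
```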
What if we just make it so that you can only run the initial import on a replica set?
+1
I have posted the question here [1]. Let's see what the MongoDB experts will answer...
[1] - https://groups.google.com/forum/#!topic/mongodb-user/sOKlhD_E2ns
Response from William Zola at 10gen for @richardwilly98's question:
- TODO: Initial import with GridFS still needs to be optimized
- Use DBObject directly instead of Map
- Clean up the logic with GridFS enabled
- New unit test for initial import with GridFS
- Reduce unit-test wait time from 6 sec to 2 sec
- Script filter is not used anymore when GridFS is enabled
@richardwilly98 Awesome! I'm really excited about the ability to do an initial import! One thing I'm not very sure about is how to handle the river being stopped and then started again during the initial import. We have to restart the initial import in that case. Should we drop the index and start the initial import again? I'm hesitant to drop an index, though. Maybe we should just stop the river from doing anything and post a warning to the admin UI and logs that the index needs to be dropped?
@benmccann I agree dropping the index is not a good option. We need a flag to indicate that the initial import is in progress. If the flag is not cleared and the timestamp is null, then stop the river and send a warning as you suggested.
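To make the intended startup check concrete, a small illustrative sketch (the function and parameter names are hypothetical and do not reflect the river's actual internals):

```python
def should_start_river(import_in_progress: bool, last_oplog_timestamp) -> bool:
    """Hypothetical startup check, not the river's real API."""
    if import_in_progress and last_oplog_timestamp is None:
        # The initial import was interrupted before any oplog position was
        # recorded: stop the river and warn that the index must be dropped
        # and the import restarted.
        print("WARNING: initial import incomplete; drop the index and restart the river")
        return False
    return True
```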
+1
@cggaurav This is already implemented and released; this issue should probably be closed.
Hey Richard,
I would like to suggest some sort of initial sync functionality (optional).
Something like: when you create the river via the PUT API, some additional options specifying how the user would like to perform the initial sync.
This would be a "one time" operation. I don't even know if it is possible...
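For instance, a hypothetical way this could be expressed when registering the river (the "initial_sync" block does not exist in the current plugin and is purely illustrative; the rest loosely follows the river's _meta document format, with placeholder names):

```python
import json
import requests

river_config = {
    "type": "mongodb",
    "mongodb": {"db": "mydb", "collection": "mycollection"},
    "index": {"name": "myindex", "type": "mycollection"},
    # Hypothetical option, not part of the current river:
    "initial_sync": {"enabled": True, "strategy": "full_collection"},
}

# Register the river; host, river name, and config values are placeholders.
requests.put(
    "http://localhost:9200/_river/mongodb_river/_meta",
    data=json.dumps(river_config),
    headers={"Content-Type": "application/json"},
)
```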
The main issue is that not everything is on the oplog, especially for really large and stale collections...
So it would be nice to implement a set of options that would allow the user to tell the river to pull all data from mongo (much like a GetAll operation).
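A rough sketch of what such a one-time pull could look like outside the river, assuming pymongo and the official elasticsearch Python client (all names are placeholders, and BSON-specific types other than _id would need extra conversion):

```python
from pymongo import MongoClient
from elasticsearch import Elasticsearch, helpers

mongo = MongoClient("mongodb://localhost:27017")
es = Elasticsearch(["http://localhost:9200"])

def actions():
    # Stream every document out of MongoDB as an Elasticsearch bulk action.
    for doc in mongo["mydb"]["mycollection"].find():
        doc_id = str(doc.pop("_id"))
        yield {"_index": "myindex", "_id": doc_id, "_source": doc}

# One-time "GetAll": bulk-index the whole collection.
helpers.bulk(es, actions())
```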
Of course we could discuss different strategies for pulling the data, such as:
It would be nice to support different import strategies, much like plugins for this river.
Keep up the good work :)