Suggestion: Initial sync #47
Comments
Hello, I agree. If this could be exposed through an API or some other means (config file, etc.), that would be really great. For stale collections (at least in my use case) I've made a patch, but I don't like the idea of running personally forked code, suited only to my use case, in production. Regards,
Yes, I'd also love to see this feature implemented. After working with the MySQL river (which slurps in the table initially), I thought I was doing something wrong when my collection wasn't being slurped. If there is no plan to implement this, it might be worth mentioning in the wiki.
+1. Once we have changed the mapping, we need to clean the old index and would like a "re-pull" function that pulls data from MongoDB into Elasticsearch. Hope to see this feature.
+1. Would love to see this feature supported
+1 would love this too!
+1 for this!
A good workaround for this would be to simply do a bulk update on the collection after the mongo rivers are set up. I use this for millions of records and it works great.
+1 Very useful!
+1
subratbasnet: Then set up the river on collectionB?
Yevesx: What I meant was: when you have a stale collection in Mongo, first you would set up the river for that collection. This will NOT automatically start moving the data from the stale collection to Elasticsearch. To trigger that, you could simply perform a bulk update on the collection with a condition that matches all the records. For example, in my case, I simply change the "updated" field inside all the documents in my collection, and this triggers the river and moves the affected documents to Elasticsearch.
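For illustration, a minimal sketch of that bulk-update trigger, assuming pymongo and a collection that already has an "updated" field (the connection string and names are placeholders, not from the river itself):

```python
from datetime import datetime, timezone
from pymongo import MongoClient

# Placeholder connection and names; adjust for your deployment.
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

# Touch every document so each one appears in the oplog and gets
# picked up by the river.
result = collection.update_many({}, {"$set": {"updated": datetime.now(timezone.utc)}})
print("touched", result.modified_count, "documents")
```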
I see! This is a clever idea.
Updating every document in MongoDB to get them to appear in the oplog and be copied by the river is very clever. Unfortunately, I believe this will make the initial import much slower. Writes to MongoDB are much slower for me than writes to Elasticsearch (because MongoDB stores data less efficiently than ES and because MongoDB has its unfortunate DB-level lock). Do you think it would work if we copied over all documents from MongoDB and then iterated over the oplog? I think that's what this issue is requesting, and it doesn't sound much more difficult than what we have today. A nice optimization would be to read the latest oplog timestamp, import the collection, then import the oplog only from that start timestamp.
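If it helps, a rough sketch of that approach, assuming a replica set and pymongo; all names are placeholders, and the actual indexing and oplog-apply steps are reduced to print statements:

```python
import pymongo
from pymongo import MongoClient

# Illustrative connection and names, not the river's configuration.
client = MongoClient("mongodb://localhost:27017")
source = client["mydb"]["mycollection"]
oplog = client["local"]["oplog.rs"]   # only exists on a replica set member

# 1. Remember the current end of the oplog before copying anything.
start_ts = next(oplog.find().sort("$natural", pymongo.DESCENDING).limit(1))["ts"]

# 2. Bulk-copy the existing documents (indexing into Elasticsearch would go here).
for doc in source.find():
    print("index", doc["_id"])

# 3. Replay only the oplog entries written since the copy started
#    (the tailable cursor keeps waiting for new entries).
tail = oplog.find({"ts": {"$gte": start_ts}},
                  cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
for entry in tail:
    print("apply", entry["op"], entry.get("ns"))
```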
There are a few challenges with your suggestion:
[1] - http://docs.mongodb.org/manual/reference/method/db.fsyncLock/
Thanks for the feedback. Could you clarify? I'm not sure that those things are true. E.g. why would the collection need to be locked? Yes, copying without locking could result in an inconsistent state, but then once the oplog is applied wouldn't that fix it?
The main reason for #47 is to synchronize data that is not available in oplog.rs.
In step 1 we will need to ensure no new data is imported into the collection. If you import without locking you could get an inconsistent state / data, and there is no guarantee that this will be fixed when processing the oplog.
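As an illustration of the locking step referenced in [1], a minimal sketch assuming pymongo and a reasonably recent MongoDB (fsyncLock/fsyncUnlock are the shell helpers for these commands):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Flush pending writes and block further writes, like db.fsyncLock() in the shell.
client.admin.command("fsync", lock=True)
try:
    # ... run the full collection copy here while writes are blocked ...
    pass
finally:
    # Release the write lock, like db.fsyncUnlock() in the shell.
    client.admin.command("fsyncUnlock")
```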
What if we just make it so that you can only run the initial import on a replica set?
+1
I have posted the question here [1]. Let's see what the MongoDB experts will answer...
[1] - https://groups.google.com/forum/#!topic/mongodb-user/sOKlhD_E2ns
Response from William Zola at 10gen for @richardwilly98's question:
- TODO: Initial import with GridFS still needs to be optimized
- Use DBObject directly instead of Map
- Clean up the logic with GridFS enabled
- New unit test for initial import with GridFS
- Reduce unit-test wait time from 6 sec to 2 sec
- Script filter is not used anymore when GridFS is enabled
@richardwilly98 Awesome! I'm really excited about the ability to do an initial import! One thing I'm not very sure about is how to handle the river being stopped and then started again during the initial import. We have to restart the initial import in that case. Should we drop the index and start the initial import again? I'm hesitant to drop an index, though. Maybe we should just stop the river from doing anything and post a warning to the admin UI and logs that the index needs to be dropped?
@benmccann I agree dropping the index is not a good option. We need a flag to indicate that the initial import is in progress. If the flag is not cleared and the timestamp is null, then stop the river and send a warning as you suggested.
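To make the intended startup check concrete, a small illustrative sketch (the function and parameter names are hypothetical and do not reflect the river's actual internals):

```python
def should_start_river(import_in_progress: bool, last_oplog_timestamp) -> bool:
    """Hypothetical startup check, not the river's real API."""
    if import_in_progress and last_oplog_timestamp is None:
        # The initial import was interrupted before any oplog position was
        # recorded: stop the river and warn that the index must be dropped
        # and the import restarted.
        print("WARNING: initial import incomplete; drop the index and restart the river")
        return False
    return True
```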
+1
@cggaurav This is already implemented and released; this issue should probably be closed.
Hey Richard,
I would like to suggest some sort of initial sync functionality (optional).
Something like: when you create the river via the PUT API, some additional options specifying how the user would like to perform the initial sync.
This would be a "one time" operation. I don't even know if it is possible...
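For instance, a hypothetical way this could be expressed when registering the river (the "initial_sync" block does not exist in the current plugin and is purely illustrative; the rest loosely follows the river's _meta document format, with placeholder names):

```python
import json
import requests

river_config = {
    "type": "mongodb",
    "mongodb": {"db": "mydb", "collection": "mycollection"},
    "index": {"name": "myindex", "type": "mycollection"},
    # Hypothetical option, not part of the current river:
    "initial_sync": {"enabled": True, "strategy": "full_collection"},
}

# Register the river; host, river name, and config values are placeholders.
requests.put(
    "http://localhost:9200/_river/mongodb_river/_meta",
    data=json.dumps(river_config),
    headers={"Content-Type": "application/json"},
)
```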
The main issue is that not everything is on the oplog, especially for really large and stale collections...
So it would be nice to implement a set of options that would allow the user to tell the river to pull all data from mongo (much like a GetAll operation).
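A rough sketch of what such a one-time pull could look like outside the river, assuming pymongo and the official elasticsearch Python client (all names are placeholders, and BSON-specific types other than _id would need extra conversion):

```python
from pymongo import MongoClient
from elasticsearch import Elasticsearch, helpers

mongo = MongoClient("mongodb://localhost:27017")
es = Elasticsearch(["http://localhost:9200"])

def actions():
    # Stream every document out of MongoDB as an Elasticsearch bulk action.
    for doc in mongo["mydb"]["mycollection"].find():
        doc_id = str(doc.pop("_id"))
        yield {"_index": "myindex", "_id": doc_id, "_source": doc}

# One-time "GetAll": bulk-index the whole collection.
helpers.bulk(es, actions())
```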
Of course we could discuss different strategies for pulling the data, such as:
It would be nice to support different import strategies, much like plugins for this river.
Keep up the good work :)