Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to pass in version argument to ElasticsearchTypeMixin.bulk_index method #20

Open
avelis opened this issue Aug 22, 2017 · 7 comments

Comments

@avelis
Copy link
Contributor

avelis commented Aug 22, 2017

The idea is to pass in a version argument in order to alter the results of get_document, get_es, & possibly get_type_name. The best way I see this working out is to add the argument with default set to None.

Adding such an enhancement would allow anyone to facilitate an ES transition/migration as they needed it.

@jaddison
Copy link
Owner

Sorry, but I don't understand the use case for this? Can you elaborate?

(Apologies for the late reply!)

@avelis
Copy link
Contributor Author

avelis commented Oct 30, 2017

@jaddison Specifying a version allows for the flexibility to sync to potentially more than one endpoint with different mappings. The goal is to reduce boilerplate needed facilitate such a feature.

@jaddison
Copy link
Owner

@avelis I still don't understand the feature - but feel free to submit a PR demonstrating the feature and its use case. The PR wouldn't need to be 'complete' or fleshed out, but enough to demonstrate your intentions. Hopefully it wouldn't take more than a few minutes to put it together. If it makes sense (ie we can agree on a path forward) then it can be fleshed out more.

As you can tell, I'm hesitant to add anything that increases complexity or falls out of the 'natural' scope for the project. If it is indeed within scope and is (or could be?) a common usage, then by all means I would merge as long as it had appropriate docs and tests.

@avelis
Copy link
Contributor Author

avelis commented Oct 30, 2017

@jaddison I can see what I can do. To better illustrate where I am coming from I am linking a blog post that highlights their similar approach to re-indexing with zero downtime. This is what is driving the inspiration in this ticket.

ES almost always requires a re-index if a mapping change removes any fields in a given document. This proves challenging if your index is large and can impact a user experience if down. The article above essentially suggested a parallel syncing strategy. The DSE library provides the structure for get_mapping and the serialization method get_document. The challenge is that if two indexes have two different mappings how do I know which document dictionary to generate for which mapping without having some indicator as to what endpoint the document is going to. That is at the point of my suggestion for a version method argument.

However, I do understand that it moves out of the territory of "simple". Thanks again for making this library.

@jaddison
Copy link
Owner

jaddison commented Oct 30, 2017

Ok, so I'll paraphrase spell out in great detail what I think you're suggesting.

Even though DSE supports alias-based 'zero' downtime (indexing all data to a new index while the old one is running, then switching the alias to the new index name), there are still problems.

  • new codebase is running with the old index mappings/data (while the new index mappings/data are being indexed)
    • this is not really something that can be solved within DSE as it's much too application specific
  • data might change (created/changed/deleted) after reindexing data into a new index has started; currently it is possible for these changes to get missed
    • a crude/simplistic attempt to mitigate this exists, but a) it puts the onus on the developer to order their results accordingly and b), it won't catch deletes unless they're 'soft' deletes

You see the ability to write a single Model's queryset in parallel to two separate indexes (types, maybe) (through parameter-versioned get_mapping/get_document methods) as a way around these issues. I think I get it.

I think parameter-versioning get_mapping/get_document is one way to do it, but I'm not entirely sure how the system would know to reference both when indexing data normally.

In the long run, I think maybe a better approach is to use a 'router' paradigm:

  • a central registry (ie. the router) of what ElasticsearchTypeMixin-based class handlers map to which Django model
    • it would replace the ELASTICSEARCH_TYPE_CLASSES simplistic setting (maybe the logic around this setting can simply be modified...)
    • this would allow us to register multiple class handlers for each Django model

In a complex project migration scenario:

  1. make model specific changes (additions/changes, only - no deletes)
    a. no functional template/view/etc related changes would be included at this point that rely on these changes
  2. add new ElasticsearchTypeMixin-based class handler that maps/indexes data in the new way
    a. might need to modify the original ElasticsearchTypeMixin class handler depending on the model changes to maintain original index mapping/data consistency
    b. add the new ElasticsearchTypeMixin class handler to the DSE router for the model in question
  3. probably need to manually create the new index/mapping
  4. deploy this
    a. at this point, the old and new ElasticsearchTypeMixin class handlers should be adding, updating and deleting as necessary
  5. run the reindex operation, targeting ONLY the index storing data for the new ElasticsearchTypeMixin class handler
  6. update the rest of the code base to use the data/structure stored from new ElasticsearchTypeMixin class handler mapping
  7. deploy this (only when the reindex is complete, obviously)
  8. remove the old ElasticsearchTypeMixin based class handler from the router
  9. make any final model changes to remove fields or other destructive changes that couldn't be done because the old ElasticsearchTypeMixin class handler required it
  10. deploy this

I think between the blog article you linked and your descriptions, this covers it?

@avelis
Copy link
Contributor Author

avelis commented Oct 30, 2017

@jaddison Looks pretty thorough to me. I really appreciate you taking the time to digest the article and hear out my scenario. Not many libraries in the Django+ES space are tackling the project migration lifecycle for Python+ES.

With that said I realize this is not a small undertaking. I have greatly benefited from DSE for it's simplicity and continued upkeep. Thanks so much!

@jaddison
Copy link
Owner

You're welcome, although I'll admit time doesn't always allow for timely responsiveness.

It's kind of a tough problem to solve in such a generic fashion - migration processes are usually quite specific to the platform/app/system. So while what is suggested above could be implemented, it's a big investment of time. I'm not certain of how much it would be used.

If you want to take a stab at a revamp to support this, I'd be game to include it, particularly if it was backward compatible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants