
RFC 42: Data transfer between Wagtail installations #42

Merged

Conversation

@gasman (Contributor) commented Oct 1, 2019

@gasman force-pushed the rfc/data-transfer-between-installations branch from 1a3df8a to 81060c3 on Oct 1, 2019
@gasman changed the title from "RFC xx: Data transfer between Wagtail installations" to "RFC 42: Data transfer between Wagtail installations" on Oct 1, 2019
The list of instances offered as import source sites will be configurable through the Django settings file. In cases where sites are locked down by basic authentication, IP range or similar, it is the implementor's responsibility to ensure that each instance is able to make HTTP / HTTPS requests to its 'sibling' sites.

WAGTAIL_IMPORT_SOURCE_INSTANCES = [
{
@kaedroho (Contributor) commented Oct 1, 2019

Any reason to not require an identifier for each source? (similar to how we identify database and search backends)

WAGTAIL_IMPORT_SOURCE_INSTANCES = {
    'staging': {
        'label': "Staging",   # Could default to the identifier.title()
    }
}


Another review comment (Contributor) on the WAGTAIL_IMPORT_SOURCE_INSTANCES setting:
We tend to omit the underscore between WAGTAIL and the module name; for example, WAGTAILSEARCH_BACKENDS.

Also, maybe _SOURCE_INSTANCES could be replaced with _SOURCES?
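
Putting the two suggestions together, the setting might look something like this (a sketch only: the WAGTAILTRANSFER_ prefix guesses at the eventual module name, and the base_url key is an assumption, not part of the RFC):

WAGTAILTRANSFER_SOURCES = {
    'staging': {
        'label': "Staging",  # display name; could default to 'staging'.title()
        'base_url': "https://staging.example.com/",  # assumed key for the sibling instance's URL
    },
}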


To avoid the need to modify Wagtail's core data model, the import-export app will define a new model / database table to store the mappings from UIDs to local installation-specific IDs:

class IDMapping(models.Model):
Another review comment (Contributor), on the IDMapping model:

Should we add a unique_together constraint on content_type and local_id?
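
A minimal sketch of the model with that constraint applied, assuming UUID-based UIDs and text-serialised local IDs (the field names other than content_type and local_id are guesses, not from the RFC):

from django.contrib.contenttypes.models import ContentType
from django.db import models

class IDMapping(models.Model):
    # Globally unique identifier shared between installations
    uid = models.UUIDField(primary_key=True)
    # The object's identity on this particular installation
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    local_id = models.CharField(max_length=255)

    class Meta:
        # The suggested constraint: at most one mapping per local object
        unique_together = [('content_type', 'local_id')]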

@cyface commented Oct 1, 2019

In the spirit of the use case I have for this, I think the pull approach should be fine, but the push approach is very much preferred. :)

In cases where the reason for doing the data transfer is security related, it may not be possible to log in directly to 'publish' instances as opposed to 'author' instances. UX-wise, the push approach also saves the author from having to log in to multiple instances and work out which instance they are on.

Perhaps a menu item in the author instance that jumps you over to the publish instance and initiates the import for that page?

@kaedroho kaedroho added the 1:New label Oct 23, 2019
@chosak (Member) commented Oct 30, 2019

This would be a great feature to have and should definitely be discussed as an RFC here.

> In cases where the reason for doing the data transfer is security related, it may not be possible to log in directly to 'publish' instances as opposed to 'author' instances.

+1 @cyface. A reasonable and common pattern is to have the production environment in a network purposefully isolated from staging and development environments. You might not want to expose those environments to the production one because that opens the possibility of external requests making it down to an internal network.

In those cases it would be preferable to be able to do some kind of file-based export, where users on the "source" Wagtail can generate a transfer artifact which can be imported into the "destination" Wagtail. This would be akin to a super dumpdata which dumped only the desired models, along with a super loaddata which figured out how to smartly ingest them.
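
For reference, a minimal sketch of the stock Django commands this analogy builds on (using a hypothetical blog app label). Plain dumpdata / loaddata preserve raw primary keys and do no dependency discovery, which is exactly what the "super" versions would need to add:

from django.core.management import call_command

# On the source installation: dump the desired models to a file
# (equivalent to `manage.py dumpdata blog --indent 2`)
with open('transfer.json', 'w') as f:
    call_command('dumpdata', 'blog', indent=2, stdout=f)

# On the destination installation: ingest the dump as-is, with no
# remapping of IDs (equivalent to `manage.py loaddata transfer.json`)
call_command('loaddata', 'transfer.json')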

This file-based export could potentially even live alongside a direct connection-based import/export, sharing the same logic underneath. This might depend a bit on how the logic works: if an import is "make a bunch of API calls, then manipulate and ingest the data", that would work just as well with an interim file-based step. If instead the import logic is iterative (meaning a query is made to the API, then further queries are made depending on how the destination is set up), this wouldn't fit so well with the proposed approach.

@gasman (Contributor, Author) commented Oct 30, 2019

@chosak I think it is an iterative process, and one that can potentially vary depending on what objects exist at the destination. For any item that has been explicitly chosen to be imported, there's a tree of related items to be imported along with it, potentially of any depth (for example, a page may reference a snippet which references an image).

For every such relation in that tree, there's a business rule (i.e. one of the bullet points given under https://github.com/gasman/weps/blob/81060c34c190482e262be6b7595ac4d213276704/text/042-data-transfer-between-installations.md#configuration) which determines whether the related object will be imported or skipped. Some of these business rules have 'get-or-create' behaviour, meaning they may or may not end up importing the related object (and possibly pulling in even more related objects associated with it), depending on whether it already exists. This means the dataset exported from the source site has to be larger in the 'create' scenario than in the 'get' scenario.

As a concrete example: suppose we're importing a blog section, and the blog posts have categories, identified by a slug. These categories are snippet/modeladmin models, and should be created at the destination whenever we encounter one that doesn't exist. Now, suppose that a category object isn't just a plain slug + label, but includes some 'rich' fields, such as an image or a rich-text description. The import logic has to consider the object references in these fields if, and only if, the category doesn't exist already at the destination - which is something we can't know if we're preparing the export dataset up-front.
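
A hypothetical sketch of such a get-or-create rule (BlogCategory and import_object_references are illustrative names invented for this example, not part of the RFC):

def import_category(data):
    # 'Get' path: the category already exists at the destination, so
    # any object references inside its rich fields can be skipped.
    try:
        return BlogCategory.objects.get(slug=data['slug'])
    except BlogCategory.DoesNotExist:
        pass
    # 'Create' path: only now do the image / rich-text references in
    # the category need resolving, possibly importing further objects.
    # This is why the full export dataset can't be known up-front.
    import_object_references(data)
    return BlogCategory.objects.create(**data)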

This isn't insurmountable - the exporter logic could take a "pessimistic" view and always output the extra records whenever there's a chance of needing them - but it's added complexity that can be left until later.

@kaedroho kaedroho added 3:Review and removed 1:New labels Oct 30, 2019
@kaedroho (Contributor) commented:
Discussed in the Wagtail core meeting. Everyone is in favour of the approach as it stands, as it could be extended later on to support a push workflow and file-based transfer.

We also discussed implementing a mechanism in Wagtail core to discover all objects that a page references. This would scrape foreign keys, chooser blocks and rich text features to find all linked pages, images, documents and snippets. Everyone was in favour of this change too.
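
A rough sketch of how the foreign-key part of that discovery could work (get_object_references is a made-up name; rich text fields and StreamField chooser blocks would each need their own walker):

from django.db import models

def get_object_references(obj):
    # Collect model instances referenced by foreign keys on obj
    refs = set()
    for field in obj._meta.get_fields():
        if isinstance(field, models.ForeignKey):
            value = getattr(obj, field.name)
            if value is not None:
                refs.add(value)
    # Rich text fields and chooser blocks inside StreamFields would be
    # scraped here as well, to pick up linked pages, images, documents
    # and snippets.
    return refs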

@zerolab (Contributor) commented Jan 24, 2020

@kaedroho kaedroho merged commit 6bdb994 into wagtail:master Aug 4, 2020
@kaedroho (Contributor) commented Aug 4, 2020

Merged, since this is implemented now and the RFC generally reflects what made it into Wagtail Transfer.

@SidharajYadav commented

I want to start issues
