This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

Rewrite the WXR package to store data in IndexedDB #23

Merged
merged 24 commits into from
Mar 10, 2021

Conversation

pento
Owner

@pento pento commented Feb 10, 2021

The initial WXR package was hacked together quite quickly, for the purposes of providing a proof of concept. In order to allow us to grow quickly, it needs to be largely rewritten. This PR addresses the following:

  • It allows for new versions of WXR to be defined in the future.
  • It writes the export data to IndexedDB before generating the WXR file, which gives us the ability to handle arbitrarily large exports.
  • It can generate the WXR file as a stream (allowing arbitrarily large WXR files to be created).
  • It adds support for all of the missing features listed in Full WXR Support #3.
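
The "store first, then generate" design in the list above can be illustrated with a minimal sketch. Every name here (`escapeXML`, `generateWXR`, `createMemoryStore`, `readAuthors`, `readPosts`) is hypothetical and not taken from the package; the in-memory store merely stands in for IndexedDB, and the XML is heavily simplified compared with a real WXR 1.2 file.

```javascript
// Hypothetical sketch: none of these names come from the WXR package itself.

// Escape the characters that are unsafe inside XML text nodes.
function escapeXML( value ) {
	return String( value )
		.replace( /&/g, '&amp;' )
		.replace( /</g, '&lt;' )
		.replace( />/g, '&gt;' );
}

// Yield the document one fragment at a time. `readAuthors` and `readPosts`
// are assumed to return async iterables, for example thin wrappers around
// IndexedDB cursors, so only one record needs to be in memory at once.
async function* generateWXR( { readAuthors, readPosts } ) {
	yield '<?xml version="1.0" encoding="UTF-8"?>\n' +
		'<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">\n<channel>\n';

	for await ( const author of readAuthors() ) {
		yield `<wp:author><wp:author_login>${ escapeXML( author.login ) }</wp:author_login></wp:author>\n`;
	}

	for await ( const post of readPosts() ) {
		yield `<item><title>${ escapeXML( post.title ) }</title></item>\n`;
	}

	yield '</channel>\n</rss>\n';
}

// In-memory stand-in for an IndexedDB-backed store (assumption for the demo).
function createMemoryStore( records ) {
	return async function* () {
		yield* records;
	};
}
```

Because the generator only asks for the next record when the consumer asks for the next fragment, the same function could feed a file download, a stream, or a test that simply concatenates the fragments.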

TODO

Features

  • Finish documenting the WXRDriver class.
  • Add comment support.
  • Add comment meta support.
  • Investigate if we actually need auto-generating post IDs.

Tests

  • Add tests for post meta.
  • Expand the posts test library.

Fixes #3.

@pento pento added the [Package] WXR, [Status] In Progress, and [Type] Enhancement labels Feb 10, 2021
@pento pento self-assigned this Feb 10, 2021
@pento pento force-pushed the try/idb-based-wxr branch 2 times, most recently from 1f013a1 to 4ef1ab6 Compare February 24, 2021 06:11
@pento pento requested a review from akirk February 24, 2021 07:45
@pento
Owner Author

pento commented Feb 24, 2021

@akirk: I haven't yet managed to address the final step (actually downloading the WXR file), but the rest of the PR is in a reviewable state, if you're able to take some time to look at it.

@akirk
Collaborator

akirk commented Feb 24, 2021

For now I've only done a high-level review, and it already looks great, with tests covering a good range of cases.

One thing I don't fully grasp is the streaming aspect. In my experience, the benefit of "Streaming X" is that it's possible to consume data before all of the input has been fed into the "Streaming X". Is this eventually the plan?

I am not sure I am seeing this concept in the PR yet, since there are awaits until the data is stored in IndexedDB and awaits until the stream has been written. Of course, the data would need to be streamed in a certain pattern (the terms and authors can only be streamed when we're sure that there are no posts left).

@pento
Owner Author

pento commented Feb 25, 2021

Thanks for checking it out, @akirk!

The streaming part of it is not focussed on the idea of "downloading the WXR while the data is still being gathered", though that's something which we could certainly explore. As you noted, there would likely be some gotchas to watch out for.

The goal of streaming is to download the WXR file as it's being generated, rather than generating the entire WXR file and then downloading it. This allows us to generate arbitrarily large WXR files, without running into memory limits. For WXR 1.2, this is a fairly academic distinction, since the files would very rarely be large enough to cause these problems. This feature will come into its own in WXR 2.0, when we're adding media files to the download.
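
To make the memory argument concrete, here is a rough sketch of pull-based streaming. It assumes an environment with a global WHATWG `ReadableStream` (modern browsers, Node 18+); `toReadableStream`, `fakeChunks`, and `countCharacters` are illustrative names, not part of this PR.

```javascript
// Hypothetical sketch: these helpers are not part of the WXR package.

// Wrap any async iterator of string chunks in a WHATWG ReadableStream.
// `pull` is only called when the consumer has room for more data, so the
// next chunk is generated on demand instead of all chunks up front.
function toReadableStream( chunkIterator ) {
	return new ReadableStream( {
		async pull( controller ) {
			const { value, done } = await chunkIterator.next();
			if ( done ) {
				controller.close();
			} else {
				controller.enqueue( value );
			}
		},
	} );
}

// Stand-in for a real WXR fragment generator.
async function* fakeChunks( count ) {
	for ( let i = 0; i < count; i++ ) {
		yield `<item>${ i }</item>\n`;
	}
}

// Consume the stream chunk by chunk, keeping only a running total, which
// mimics writing each chunk to disk and then discarding it.
async function countCharacters( stream ) {
	const reader = stream.getReader();
	let total = 0;
	while ( true ) {
		const { value, done } = await reader.read();
		if ( done ) {
			return total;
		}
		total += value.length;
	}
}
```

With this shape, the stream's internal queue holds only a small, bounded number of chunks at any moment, regardless of how large the finished file would be.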

@akirk
Collaborator

akirk commented Mar 1, 2021

@pento I've added a commit to remove momentjs, see #27

@akirk akirk force-pushed the try/idb-based-wxr branch 3 times, most recently from 2058282 to 86b104d Compare March 1, 2021 11:42
@akirk
Collaborator

akirk commented Mar 2, 2021

Are you planning to add nav_menu support to this or is there already a way to save menus?

@pento pento force-pushed the try/idb-based-wxr branch from 86b104d to 0b699df Compare March 5, 2021 02:45
@pento
Owner Author

pento commented Mar 5, 2021

Sadly, I couldn't get the streaming working on Firefox, so I've switched over to using the WebExtensions downloads API. It has to generate the full WXR file, so we don't get the memory benefits of streaming, but we can leave that until we're generating WXR files big enough to cause such problems.

This PR is far too big to be properly reviewable, but hopefully the test coverage makes up for it. 🙂

It's the kind of thing we're going to need to iterate on, so I'm not super concerned about it landing in main now. It's better for us to be able to move on to all the other things.

Collaborator

@akirk akirk left a comment


I had some weirdness resolving the dependencies and had to remove my node_modules directory and install it again.

In my review I manipulated the tests a little, since right now they follow the "golden path." So, for example, I added the same author twice, and it is put into the WXR twice. This is by design, and I think it makes sense not to burden the WXR library with data integrity checks, but we need to be clear about that, which I referred to in my review comment about README.md.


wxr.addAuthor( {
	login: 'someone-else',
} );
Collaborator


If I add another author pento, it will be added to the WXR, resulting in a duplicate author.

@@ -10,4 +10,39 @@ Install the module
npm install @wordpress/wxr --save-dev
```

## Usage
Collaborator


This should include a section about the "mission," i.e. creating a compatible WXR and not being responsible for data integrity checks.

@pento
Owner Author

pento commented Mar 9, 2021

> I had some weirdness resolving the dependencies and had to remove my node_modules directory and install it again.

I'm not sure why, but there have been some oddities with node_modules recently. It might be associated with how our package.json is configured, or there might be upstream issues.

> In my review I manipulated the tests a little, since right now they follow the "golden path." So, for example, I added the same author twice, and it is put into the WXR twice. This is by design, and I think it makes sense not to burden the WXR library with data integrity checks, but we need to be clear about that, which I referred to in my review comment about README.md.

Aye, there are certainly issues along these lines.

Given that the library already does some type checking, I think it'd be good to handle some data integrity checks: fields like the user login, or the post slug, could have a unique flag set on them, which would provide a simple check to catch these kinds of issues. I'm inclined to handle this in a follow-up PR.

I think there's likely to be value in improving these checks over time, too: more comprehensive integrity checks mean better results for end users, as we're more likely to catch data weirdness before they get to importing their WXR into WordPress.
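
As a rough illustration of the unique-flag idea, here is a minimal sketch. The schema shape, the `unique` property name, and `createUniqueChecker` are all assumptions made for this example; the library's actual schema format may look quite different.

```javascript
// Hypothetical sketch: the schema format and helper name are assumptions.

// A field marked `unique: true` may not repeat across records.
const authorSchema = {
	login: { type: 'string', unique: true },
	email: { type: 'string' },
};

// Build a checker that remembers every value seen for each unique field
// and throws when a record repeats one of them.
function createUniqueChecker( schema ) {
	const seen = new Map();
	for ( const [ field, rules ] of Object.entries( schema ) ) {
		if ( rules.unique ) {
			seen.set( field, new Set() );
		}
	}

	return function check( record ) {
		// Validate every unique field first, so that a rejected record
		// does not leave some of its values registered.
		for ( const [ field, values ] of seen ) {
			if ( record[ field ] !== undefined && values.has( record[ field ] ) ) {
				throw new Error(
					`Duplicate value for unique field "${ field }": ${ record[ field ] }`
				);
			}
		}
		for ( const [ field, values ] of seen ) {
			if ( record[ field ] !== undefined ) {
				values.add( record[ field ] );
			}
		}
	};
}
```

A checker like this could run before each record is written, so adding the author pento twice would fail at the second call rather than producing a duplicate entry in the export.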

@pento pento merged commit 3c17d5a into main Mar 10, 2021
@pento pento deleted the try/idb-based-wxr branch March 10, 2021 03:27
@akirk akirk mentioned this pull request Mar 11, 2021