Rewrite the WXR package to store data in IndexedDB #23

pento · 2021-02-10T05:42:37Z

The initial WXR package was hacked together quite quickly, for the purposes of providing a proof of concept. In order to allow us to grow quickly, it needs to be largely rewritten. This PR addresses the following:

- It allows for new versions of WXR to be defined in the future.
- It writes the export data to IndexedDB before generating the WXR file, which gives us the ability to handle arbitrarily large exports.
- It can generate the WXR file as a stream (allowing arbitrarily large WXR files to be created).
- It adds support for all of the missing features listed in Full WXR Support #3.
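
The two-phase flow described in the list above (persist the records first, generate the file afterwards) might look roughly like the following. This is a minimal sketch, not the package's actual code: `RecordStore`, `addRecord`, `readRecords`, and `listPostTitles` are hypothetical names, and a `Map` stands in for IndexedDB so the example runs outside a browser.

```javascript
// Hypothetical stand-in for an IndexedDB-backed store. A Map of arrays
// is used here only so the sketch runs without a browser.
class RecordStore {
	constructor() {
		this.stores = new Map();
	}

	// Phase 1: persist each record as soon as it has been gathered.
	async addRecord( storeName, record ) {
		if ( ! this.stores.has( storeName ) ) {
			this.stores.set( storeName, [] );
		}
		this.stores.get( storeName ).push( { ...record } );
	}

	// Phase 2: read records back one at a time, much like a cursor would.
	async *readRecords( storeName ) {
		for ( const record of this.stores.get( storeName ) || [] ) {
			yield record;
		}
	}
}

// Stands in for the generation phase: it only ever holds one record at a time.
async function listPostTitles( store ) {
	const titles = [];
	for await ( const post of store.readRecords( 'posts' ) ) {
		titles.push( post.title );
	}
	return titles;
}
```

Because the generation phase only talks to the store, the code that gathers data never needs to hold the whole export in memory.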

TODO

Features

Finish documenting the WXRDriver class.
Add comment support.
Add comment meta support.
Investigate if we actually need auto-generating post IDs.

Tests

Add tests for post meta.
Expand the posts test library.

Fixes #3.

pento · 2021-02-24T07:48:08Z

@akirk: I haven't yet managed to address the final step (actually downloading the WXR file), but the rest of the PR is in a reviewable state, if you're able to take some time to look at it.

akirk · 2021-02-24T13:31:42Z

For now I've only done a high-level review, and it already looks great, with tests covering a good range of cases.

One thing I don't fully grasp is the streaming aspect. In my experience, the benefit of "Streaming X" is that it's possible to consume data before all of the input has been fed into the "Streaming X". Is this eventually the plan?

I am not sure I am seeing this concept in the PR yet, since there are `await`s until the data is stored in IndexedDB and `await`s until the stream has been written. Of course, the data would need to be streamed in a certain pattern (the terms and authors can only be streamed once we're sure that there are no posts left).

pento · 2021-02-25T05:45:46Z

Thanks for checking it out, @akirk!

The streaming part of it is not focussed on the idea of "downloading the WXR while the data is still being gathered", though that's something which we could certainly explore. As you noted, there would likely be some gotchas to watch out for.

The goal of streaming is to download the WXR file as it's being generated, rather than generating the entire WXR file and then downloading it. This allows us to generate arbitrarily large WXR files without running into memory limits. For WXR 1.2, this is a fairly academic distinction, since the files would very rarely be large enough to cause these problems. This feature will come into its own in WXR 2.0, when we're adding media files to the download.
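
To illustrate the distinction, here is a minimal sketch of generating a file chunk by chunk and handing each chunk to a consumer as soon as it exists. It is not the PR's implementation: `escapeXML`, `generateChunks`, and `streamTo` are hypothetical names, and the markup is heavily simplified compared to real WXR.

```javascript
// Escape the characters that are unsafe in XML text content.
function escapeXML( text ) {
	return String( text )
		.replace( /&/g, '&amp;' )
		.replace( /</g, '&lt;' )
		.replace( />/g, '&gt;' );
}

// Yield the document piece by piece. `posts` can be any (async) iterable,
// for example a cursor over stored records, so only one post is held at a time.
async function* generateChunks( posts ) {
	yield '<?xml version="1.0" encoding="UTF-8"?>\n<rss version="2.0">\n<channel>\n';
	for await ( const post of posts ) {
		yield `<item><title>${ escapeXML( post.title ) }</title></item>\n`;
	}
	yield '</channel>\n</rss>\n';
}

// Hand each chunk to `write` as soon as it is produced, instead of
// concatenating the whole file first. Returns the number of characters written.
async function streamTo( chunks, write ) {
	let written = 0;
	for await ( const chunk of chunks ) {
		await write( chunk );
		written += chunk.length;
	}
	return written;
}
```

With this shape, memory use depends on the largest single chunk rather than on the size of the finished file.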

akirk · 2021-03-01T11:36:19Z

@pento I've added a commit to remove momentjs, see #27

akirk · 2021-03-02T13:51:36Z

Are you planning to add nav_menu support to this or is there already a way to save menus?

pento · 2021-03-05T06:09:36Z

Sadly, I couldn't get the streaming working on Firefox, so I've switched over to using the webextensions downloads API: it has to generate the full WXR file, so we don't get the memory benefits of streaming, but we can leave that until we're generating WXR files big enough to cause such problems.
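
For reference, a download through the WebExtensions downloads API might be sketched as below. This assumes an extension context with the `downloads` permission; `downloadWXR` is a hypothetical helper, and the `downloads` and `urls` parameters are injected (they would be `browser.downloads` and `URL` in an extension) so the sketch can be exercised outside a browser.

```javascript
// Assemble the complete file and pass it to the downloads API.
// `downloads` is expected to behave like `browser.downloads`, and `urls`
// like the global `URL` object; both are injected for this sketch.
async function downloadWXR( chunks, filename, downloads, urls ) {
	// The downloads API takes a URL, so the whole file is built up front.
	// This is where the memory benefit of streaming is lost.
	const blob = new Blob( chunks, { type: 'text/xml' } );
	const url = urls.createObjectURL( blob );

	// Resolves with the download's ID once the download has started.
	// The object URL should only be revoked after the download has
	// completed (e.g. from a downloads.onChanged listener), so it is
	// deliberately not revoked here.
	return downloads.download( { url, filename, saveAs: true } );
}
```

In an extension this would be called as `downloadWXR( chunks, 'export.wxr', browser.downloads, URL )`, where the file name is just an example.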

This PR is far too big to be properly reviewable, but hopefully the test coverage makes up for it. 🙂

It's the kind of thing we're going to need to iterate on, so I'm not super concerned about it landing in main now. It's better for us to be able to move on to all the other things.

akirk

I had some weirdness resolving the dependencies and had to remove my node_modules directory and install it again.

In my review I manipulated the tests a little, since right now they follow the "golden path." For example, I added the same author twice, and it is put into the WXR twice. This is by design, and I think it makes sense not to burden the WXR library with data integrity checks, but we need to be clear about that, as I noted in my review comment on README.md.

akirk · 2021-03-08T10:37:32Z

packages/wxr/src/1.2/test/authors.js

+
+		wxr.addAuthor( {
+			login: 'someone-else',
+		} );


If I add another author with the login `pento`, it is added to the WXR again, resulting in a duplicate author.

akirk · 2021-03-08T10:40:32Z

packages/wxr/README.md

@@ -10,4 +10,39 @@ Install the module
 npm install @wordpress/wxr --save-dev
 ```

+## Usage


This should include a section about the "mission," i.e. creating a compatible WXR and not being responsible for data integrity checks.

pento · 2021-03-09T04:10:45Z

> I had some weirdness resolving the dependencies and had to remove my node_modules directory and install it again.

I'm not sure why, but there have been some oddities with node_modules recently. I'm not sure if it's associated with how our package.json is configured, or there are upstream issues.

> In my review I manipulated the tests a little, since right now they follow the "golden path." For example, I added the same author twice, and it is put into the WXR twice. This is by design, and I think it makes sense not to burden the WXR library with data integrity checks, but we need to be clear about that, as I noted in my review comment on README.md.

Aye, there are certainly issues along these lines.

Given that the library already does some type checking, I think it'd be good to handle some data integrity checks: fields like the user login, or the post slug, could have a unique flag set on them, which would provide a simple check to catch these kinds of issues. I'm inclined to handle this in a follow-up PR.
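
As a rough illustration of the idea, a `unique` flag on a field definition could be enforced along these lines. This is only a sketch of one possible approach: `authorSchema` and `createUniqueChecker` are hypothetical and do not reflect the library's current schema format.

```javascript
// Hypothetical schema: any field may carry a `unique` flag.
const authorSchema = {
	login: { type: 'string', unique: true },
	email: { type: 'string' },
};

// Build a checker that remembers the unique values seen so far and
// throws when a record repeats one of them.
function createUniqueChecker( schema ) {
	const seen = new Map();

	return function check( record ) {
		const accepted = [];
		for ( const [ field, rules ] of Object.entries( schema ) ) {
			if ( ! rules.unique || record[ field ] === undefined ) {
				continue;
			}
			if ( ! seen.has( field ) ) {
				seen.set( field, new Set() );
			}
			if ( seen.get( field ).has( record[ field ] ) ) {
				throw new Error(
					`Duplicate value for unique field "${ field }": ${ record[ field ] }`
				);
			}
			accepted.push( field );
		}
		// Only remember the values once the whole record has passed.
		for ( const field of accepted ) {
			seen.get( field ).add( record[ field ] );
		}
	};
}
```

With this checker, adding `{ login: 'pento' }` and then `{ login: 'someone-else' }` passes, while adding `{ login: 'pento' }` a second time throws. If the data lives in IndexedDB, an index created with `{ unique: true }` could enforce the same constraint at the storage layer.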

I think there's likely to be value in improving these checks over time, too: more comprehensive integrity checks mean better results for end users, as we're more likely to catch data weirdness before they get to importing their WXR into WordPress.

pento added the [Package] WXR, [Status] In Progress, and [Type] Enhancement labels Feb 10, 2021

pento self-assigned this Feb 10, 2021

akirk mentioned this pull request Feb 17, 2021

Enable development without running the browser extension #20

Merged

pento force-pushed the try/idb-based-wxr branch 2 times, most recently from 1f013a1 to 4ef1ab6 Compare February 24, 2021 06:11

pento requested a review from akirk February 24, 2021 07:45

akirk force-pushed the try/idb-based-wxr branch 3 times, most recently from 2058282 to 86b104d Compare March 1, 2021 11:42

pento added 14 commits March 5, 2021 13:43

First working-ish version.

235d9f7

Try triggering the download from the background script.

c27f0f8

Add a bunch of documentation.

5177a21

Add meta support.

e805af3

Add a bunch of snapshot tests.

da55372

Remove a weird package.json linting rule.

9e879ea

Don't force a post_id to be added to WXR if none is passed.

e0cd550

Add comment support.

6a5f69b

Add a bunch documentation to the WXRDriver class.

61959d1

Add some API documentation.

598fb97

Linting.

d9eda7f

Add tests for getWXRDriver().

09f7a96

Add a bunch of tests, fix post taxonomy formatting.

4063e31

Don't add the wp:attachment_url for non-attachment post types.

35813b6

pento and others added 6 commits March 5, 2021 13:44

Update the Wix HAR integration test WXR fixture.

fa7886b

Switch the integration tests to using snapshots, as they're easier to…

7e5dc03

… maintain.

Clean up some bits missed in 82eb5c5.

200add7

Fix downloading support (in Chrome, at least).

1f042ae

Fix various failing tests.

a5f4b7b

Remove momentjs

0b699df

pento force-pushed the try/idb-based-wxr branch from 86b104d to 0b699df Compare March 5, 2021 02:45

pento added 4 commits March 5, 2021 13:52

Reimplement snapshots in integration tests after rebase.

22cb4cb

Switch to using the extension downloads API.

246801e

Try updating dependencies.

34941b8

Include the missing dependency manually, maybe?

e691d28

akirk reviewed Mar 8, 2021

pento merged commit 3c17d5a into main Mar 10, 2021

pento deleted the try/idb-based-wxr branch March 10, 2021 03:27

akirk mentioned this pull request Mar 11, 2021

Wix: Static Pages #28

Merged
