
Publish pre-built bundles to npm #7398

Merged · 13 commits · Apr 15, 2021

Conversation

@bershanskiy (Contributor) commented Nov 18, 2020

This is a simple implementation of publishing "bundled" releases to npm, which are much smaller and load faster than the current releases. Background: #7374.

Benchmark (updated)

$ npm publish dist/ --dry-run --access public
npm notice
npm notice 📦  @mdn/browser-compat-data@2.0.6
npm notice === Tarball Contents ===
npm notice 6.6kB   LICENSE
npm notice 6.7kB   README.md
npm notice 440.2kB data.json.gz
npm notice 337B    index.d.ts
npm notice 163B    index.js
npm notice 604B    package.json
npm notice 11.2kB  types.d.ts
npm notice === Tarball Details ===
npm notice name:          @mdn/browser-compat-data
npm notice version:       2.0.6
npm notice filename:      @mdn/browser-compat-data-2.0.6.tgz
npm notice package size:  450.1 kB
npm notice unpacked size: 465.8 kB
npm notice shasum:        7698d4a9ec5137486e11062d1eae762528066e81
npm notice integrity:     sha512-RRTkhnEmu3DuA[...]GeIUgaXxjZ6uQ==
npm notice total files:   7
npm notice
+ @mdn/browser-compat-data@2.0.6

Release contents

  • The following files are released verbatim because dependents might actually want to read them:
    • LICENSE
    • README.md
    • index.d.ts and types.d.ts
  • package.json is stripped of entries not needed by dependents and npm (e.g., scripts, which references script files that are already omitted from npm releases)
  • index.js is "bundled"

A checklist to help your pull request get merged faster:

  • Summarize your changes
  • Data: link to resources that verify support information (such as browser's docs, changelogs, source control, bug trackers, and tests)
  • Data: if you tested something, describe how you tested with details like browser and version
  • Review the results of the linter and fix problems reported (If you need help, please ask in a comment!)
  • Link to related issues or pull requests, if any

@bershanskiy bershanskiy requested a review from ddbeck as a code owner November 18, 2020 13:58
@github-actions github-actions bot added dependencies Pull requests that update a dependency package or file. infra Infrastructure issues (npm, GitHub Actions, releases) of this project scripts Issues or pull requests regarding the scripts in scripts/. labels Nov 18, 2020
@queengooborg (Contributor) commented Nov 19, 2020

I'm loving how this is looking, great work! I love the idea of a smaller and faster package, not to mention that it would remove the need for additional dependencies or any worry about the Node.js version at all. I spent some time reviewing this code, and I've got a few suggestions:

  • We should move all of the preparation work into the script, so that the copying of files like the README and others is all performed in the same script. To me, it seems more reasonable to keep the process in one place from start to finish.
  • Did you mean "v2.1.0" rather than "v2.1.6"? Regardless, I'm thinking that we should keep the version bumping out of this PR and let it be performed as a part of the release process.
  • Perhaps we should separate the JSON data into its own file, such as data.json? Organizationally, I feel that makes a little more sense.
    • Extending on this, it might be nice to compress the bundle even further with a little gzip compression. I was doing some experimentation and research, and by compressing the JSON, the file size drops down to 440 KB (vs 9 MB). What's more, I'm seeing no performance drawbacks (about 0.22s to load), and zlib functions are available since NodeJS v0.10 (confirmed it works in NodeJS v4)!

@queengooborg (Contributor)

@ddbeck, what do you think?

@bershanskiy (Contributor, Author) commented Nov 19, 2020

@vinyldarkscratch Thanks for your comments!

We should move all of the preparation work into the script, so that the copying of files like the README and others is all performed in the same script. To me, it seems more reasonable to keep the process in one place from start to finish.

Ok, will do.

Perhaps we should separate the JSON data into its own file, such as data.json? Organizationally, I feel that makes a little more sense.

As is, I use the index.js file name just to match the previous name. I think we could change it to anything (e.g., data.json) and would just need to set the package.json key main to that name.

Extending on this, it might be nice to compress the bundle even further with a little gzip compression. I was doing some experimentation and research, and by compressing the JSON, the file size drops down to 440 KB (vs 9 MB). What's more, I'm seeing no performance drawbacks (about 0.22s to load), and zlib functions are available since NodeJS v0.10 (confirmed it works in NodeJS v4)!

Are you suggesting that to reduce download size or at-rest size in node_modules?

As far as I know, npm already gzips packages in transit (and uncompresses them automatically). According to the log of the dry npm publish run above, npm compresses the whole package to 449.1 KB (vs. the 440 KB you saw for the data). Since npm reports the whole package size, the extra 9.1 KB likely came from LICENSE, README.md, etc.

Storing the package in compressed format in node_modules and then uncompressing it on every run would indeed save a lot of space. That's actually the case for most packages on npm, so I wonder why npm/node does not do it automatically. I'll implement this suggestion to try it out.

Edit: grammar.

@bershanskiy bershanskiy marked this pull request as draft November 19, 2020 09:30
@queengooborg (Contributor)

As is, I use the index.js file name just to match the previous name. I think we could change it to anything (e.g., data.json) and would just need to set the package.json key main to that name.

I think index.js is a good name for an entry point! I did some experimentation and unfortunately found that we are not able to specify a JSON file as the main key; it has to be a script.

Are you suggesting that to reduce download size or at-rest size in node_modules?

I'm referring to the at-rest size when the package is downloaded and installed.

@bershanskiy (Contributor, Author)

@vinyldarkscratch I added gzip, moved the file-copy code, and changed around a few things. Most notably, the bundling script now uses async I/O for file copying and checks for errors. Also, the bundle now requires (and was tested on) Node.js 4.2.0, released in 2015: the earliest LTS release in the 4.x branch, as far as I can see.

@bershanskiy bershanskiy marked this pull request as ready for review November 20, 2020 11:38
@bershanskiy (Contributor, Author)

Is there anything that I should do to advance this PR?

@ddbeck (Collaborator) commented Nov 25, 2020

I plan to review this (and the original issue) more thoroughly, but it may take some time (with some time off at the end of this week). My very quick impression here:

  • This is a clever approach and solves some real problems. I want to explore the ideas more thoroughly.
  • There are some risks here that I haven't thought about long enough to assess properly, but they concern me nonetheless. A couple off the top of my head:
    • It looks like our tests never actually run against the code that ships to npm. It's not clear to me how I would test the shipping code. That's… worrying, to say the least.
    • This introduces multiple new kinds of complexity (building and compressing). I suspect I'd want to defer the compression stuff until later, but I haven't read closely enough to see if it's essential.
  • This uses GitHub Actions to do some of the publishing work here, which makes local testing difficult. It seems like we should use idiomatic npm life cycle scripts instead.

@ddbeck (Collaborator) commented Nov 25, 2020

Oh, one more thought: I would also like to look at some of the existing tooling for building distributables (e.g., webpack) and make sure we're not reinventing something.

@bershanskiy (Contributor, Author)

@ddbeck I'll split out the compression part to simplify review.

It looks like our tests never actually run against the code that ships to npm. It's not clear to me how I would test the shipping code. That's… worrying, to say the least.

The current script assumes that the current system (index.js with extend) works. It only checks that all the file I/O operations returned no errors.

We could test the following:

  1. Do a npm publish dry run and check that npm actually picks up all files we care about and that their hashes are what we expect.
    Fortunately, most files (README, LICENSE) don't have to change frequently (and the bundled index.js changes only in its version, so hashes can be pre-computed in advance for, e.g., 100 future releases). data.json is the only file that has to change every release.
  2. Add a test that ensures that the bundle is really equivalent to the current system. For example, something trivial like:
    const assert = require('assert');
    const real = require('./index');
    const bundle = require('./dist/index');
    assert.deepStrictEqual(bundle, real);

This introduces multiple new kinds of complexity (building and compressing). I suspect I'd want to defer the compression stuff until later, but I haven't read closely enough to see if it's essential.

Compression is not essential (I actually excluded it initially to simplify review). I would be glad to defer compression for later.

Oh one more thought: I would also like to look at some of the existing tooling for building distributables (e.g., webpack) and make sure we're not reinventing something.

I don't think anyone has bothered to write a tool for a use case as simple as this one. Most bundling tools do a lot of things like dependency resolution, tree shaking, transpiling, etc. In our case, none of that is needed.

@bershanskiy bershanskiy changed the title Publish pre-build bundles to npm Publish pre-built bundles to npm Nov 27, 2020
@bershanskiy (Contributor, Author) commented Nov 27, 2020

I re-wrote the PR from scratch and made everything much simpler.

Regarding earlier comments by @ddbeck :

It looks like our tests never actually run against the code that ships to npm. It's not clear to me how I would test the shipping code. That's… worrying, to say the least.

I can create the following process with an end-to-end test (that could be run as an Action or locally):

  1. Create a bundle, checking that its content matches the current require('index.js')
  2. Run npm publish --dry-run to ensure it picks up all the desired files
  3. Push that bundle to npm as a "private" release
  4. Download that private pre-release and ensure it matches the current require('index.js')
  5. Only after that, change the visibility of the already-published tag to "public"

Unfortunately, only paying npm users can push private releases.

we should use idiomatic npm life cycle scripts

I experimented with lifecycle scripts in #7513 and found them too inflexible. I found two main inconveniences:

  • having to use the same package.json, which means lots of unnecessary entries (scripts, devDependencies, etc.) and the same main (which, among other things, means index.js has to detect whether it's installed as a dependency or in a development environment, to ensure tests load and test the real data and not a previously built bundle)
  • NOT being able to run build steps and tests selectively in any order

@ddbeck (Collaborator) commented Dec 9, 2020

@bershanskiy Thank you for the updates on this. I'm sorry it's taken me so long to come back to this. I do plan to return to it, but I'm reluctant to merge a change like this so close to the launch of Yari (our most visible dependent). Watch this space. 😄

@bershanskiy
Copy link
Contributor Author

@ddbeck Of course, take your time! A bug-free launch is more exciting than a rushed feature. :)

@ddbeck ddbeck removed their request for review December 17, 2020 16:26
Base automatically changed from master to main March 24, 2021 12:54
@foolip (Contributor) left a comment

Very excited about this direction! I tried it locally and made a bunch of suggestions.

@bershanskiy (Contributor, Author)

We never test this in CI, so it could be entirely broken and tests would still pass. A simple kind of test for this would be to build the package, then require both the built package and the regular index.js, and use assert.deepEquals.

I added a Mocha test for this in scripts/release-build.test.js. The only reservation I have is that I have to specify a timeout, since the bundling script now takes over 2 seconds on GitHub Actions.

@bershanskiy (Contributor, Author)

@vinyldarkscratch A while ago you wrote:

I think index.js is a good name for an entry point! I did some experimentation and unfortunately found that we are not able to specify a JSON file as the main key; it has to be a script.

Please note that main is now 'data.json' and it works for me, for @foolip, and on GitHub Actions. Could you check whether this PR works for you locally now? I would hate to break something accidentally. Thank you.

Review thread on:

  await fs
    .rmdir(directory, {
      force: true,
      recursive: true,

Collaborator:

By switching Node versions (with nodenv), I discovered that the recursive option doesn't work on Node 10. But it's EOL in ~2 weeks. I think we can tolerate the (unlikely) annoyance for that long.

Collaborator:

Actually, now that I think on it some more: I guess it doesn't matter at all. If all we publish is a .json file, then we can adopt whatever version of Node we want for development.

(Though this also suggests that, should we desire to expose utilities for working with compat data, it would need to be as a separate package.)

@bershanskiy (Author):

By switching Node versions (with nodenv), I discovered that the recursive option doesn't work on Node 10.

Yes, the docs say recursive was introduced in Node 12.

If all we publish is a .json file, then we can adopt whatever version of Node we want for development.

I'll confess: it was my secret plan all along. :) More recent Node.js versions have nice features, so I wanted to simplify the transition to modern Node.js should you ever choose to make it.

(Though this also suggests that, should we desire to expose utilities for working with compat data, it would need to be as a separate package.)

Does BCD expose any utilities now?

Collaborator:

I'll confess: it was my secret plan all along :)

😄 Yeah, there's a ton of hidden benefits to this. For example, we could add internal-only fields to the schema that might make working with the data more ergonomic, but filter it out before building the final data for consumers.

Does BCD expose any utilities now?

No, not yet. It's something we've maybe inched toward (see #9441), but have been sort of cautious about it. I rather like the idea of decoupling the data from such utilities anyway and this PR would sort of force that decision.

@ddbeck (Collaborator) commented Apr 14, 2021

We never test this in CI, so it could be entirely broken and tests would still pass. A simple kind of test for this would be to build the package, then require both the built package and the regular index.js, and use assert.deepEquals.

I added a Mocha test for this in scripts/release-build.test.js. The only reservation I have is that I have to specify timeout, since bundling script now takes over 2 seconds on GitHub Actions.

Since this test is slow, perhaps we can put mocha after the linting, for the convenience of routine edits?

// in package.json
"test": "npm run lint && npm run mocha"

We can probably do something to improve the performance of index.js to properly fix this, though.

@bershanskiy (Contributor, Author) commented Apr 14, 2021

Since this test is slow, perhaps we can put mocha after the linting, for the convenience of routine edits?

Done!

We can probably do something to improve the performance of index.js to properly fix this, though.

Yes, index.js can be made more async by using generic file I/O and JSON.parse(). The problem is, require() is synchronous, so BCD dependents cannot require() and get the object immediately. They'd have to wait for a promise or provide a callback (yuck!) just to get the data. I actually started this PR partially because I could not find a way to improve index.js. If we publish JSON bundles, then we can make all building steps async and arbitrarily complex!

However, the tests themselves can be improved (and made more async).

@foolip (Contributor) commented Apr 15, 2021

@bershanskiy I think you're on the right track with the performance. Right now, we do I/O (fs.readFileSync) and parsing (JSON.parse) in tandem and thus aren't reading from disk as fast as we could.

I did an experiment to first create promises for all the file reads, to read everything in parallel. That can bring down the (uncached) time by perhaps half a second, but it doesn't do much when cached.

I think that loading all files in parallel isn't optimal, though; one could probably do better by reading one file at a time but never stopping for JSON.parse. But since this only affects contributors, not users of BCD, there's a limit to how much complexity this is worth.

@foolip
Copy link
Contributor

foolip commented Apr 15, 2021

Here's the async variant of load() that I tried:

// Note: assumes const fs = require('fs').promises, along with the
// path and extend requires that index.js already has.
async function load(...dirs) {
  const result = {};

  const stack = dirs.map(dir => path.resolve(__dirname, dir));
  const promises = [];

  while (stack.length) {
    const dir = stack.pop();
    const entries = await fs.readdir(dir, {withFileTypes: true});
    for (const ent of entries) {
      const entpath = path.join(dir, ent.name);
      if (ent.isDirectory()) {
        stack.push(entpath);
      } else if (path.extname(entpath) === '.json') {
        promises.push(fs.readFile(entpath).then(JSON.parse));
      }
    }
  }

  for (const p of promises) {
    const data = await p;
    extend(result, data);
  }

  return result;
}

A remaining problem is that it changes the order of the keys in the objects, which isn't even going to be deterministic on all platforms unless the fs.readdir results are sorted.

@foolip (Contributor) left a comment

Looking good! Leaving final review for @ddbeck though.

@ddbeck (Collaborator) commented Apr 15, 2021

OK, I've read back through the thread and I have one question remaining.

If I install this package from source (i.e., npm install git@github.com:bershanskiy/browser-compat-data.git#bundle), I get the development package (with extend as a dependency), instead of what we'll actually ship to npm.

(Compare this to #7513, where running npm install git@github.com:bershanskiy/browser-compat-data.git#bundle-json gets you the dist-style package, without any dependencies.)

Is that weird or surprising? This is an honest question. I'm legitimately unsure how significant that is.

@foolip (Contributor) commented Apr 15, 2021

@ddbeck I'm no expert maintainer of npm packages, but an "installing from source should equal dist" rule doesn't seem like something anyone can safely assume. Take https://github.com/w3c/webref, which is the source of two packages: there isn't any URL at all one could install to get the equivalent of the @webref/idl package there.

So this seems fine to me.

Using lifecycle scripts might still be a good idea, however. The tradeoff is that any build step will require contributors to run the build script manually after any change, or risk being confused when their changes don't seem to work. Lots of packages work like that, so it's not unworkable, but does seem avoidable in this case.

@ddbeck (Collaborator) commented Apr 15, 2021

@ddbeck I'm no expert maintainer of npm packages, but an "installing from source should equal dist" rule doesn't seem like something anyone can safely assume. Take https://github.com/w3c/webref, which is the source of two packages: there isn't any URL at all one could install to get the equivalent of the @webref/idl package there.

@foolip having an existing, real-world example to consider is really helpful. This works for me. Thank you!

@ddbeck (Collaborator) left a comment

OK, I'm going to merge this now. Outstanding and thank you for putting this together and for your patience, @bershanskiy! And thank you @foolip, for giving this a really excellent review too.

I've written a rather lengthy release note for this PR in #9865. If either of you has a chance to take a look at that before today's release, that would be especially welcome (though strictly optional).

ddbeck added a commit to ddbeck/browser-compat-data that referenced this pull request Apr 15, 2021
@ddbeck ddbeck merged commit 00eb7a0 into mdn:main Apr 15, 2021
ddbeck added a commit that referenced this pull request Apr 15, 2021
* Start preparing for April 15, 2021 release
* Add release note for #9842
* Add release note for #7398
* Add stats
@bershanskiy bershanskiy deleted the bundle branch April 15, 2021 18:19
@bershanskiy bershanskiy mentioned this pull request Aug 25, 2021
@bershanskiy bershanskiy mentioned this pull request Apr 20, 2022