Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

App bundle size inflated by matchPath data on 25k page site #21701

Closed
ConorLindsay94 opened this issue Feb 24, 2020 · 31 comments
Closed

App bundle size inflated by matchPath data on 25k page site #21701

ConorLindsay94 opened this issue Feb 24, 2020 · 31 comments
Assignees
Labels
topic: webpack/babel Webpack or babel

Comments

@ConorLindsay94
Copy link

Summary

We have recently noticed our app.js bundle has shot up in size, and after closer inspection it looks like the majority of the code is data relating to matchPaths.

{
   "path":"/example/path",
   "matchPath":"/example/*"
}

Relevant information

I think I may have tracked down the code that does this to getMatchPaths in gatsby/src/bootstrap/requires-writer.js. When I debug, matchPathPages has a length of 25237, and then I think this matchPath data is written to match-paths.json. Would this have an effect on the app.js bundle size?

Environment (if relevant)

System:
    OS: macOS High Sierra 10.13.6
    CPU: (12) x64 Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
    Shell: 3.2.57 - /bin/bash
  Binaries:
    Node: 10.16.3 - /usr/local/bin/node
    Yarn: 1.17.3 - /usr/local/bin/yarn
    npm: 6.9.0 - /usr/local/bin/npm
  Languages:
    Python: 2.7.16 - /usr/local/bin/python
  Browsers:
    Chrome: 80.0.3987.116
    Firefox: 72.0.2
    Safari: 13.0.5
@LekoArts LekoArts added the type: question or discussion Issue discussing or asking a question about Gatsby label Feb 25, 2020
@github-actions
Copy link

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.
If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!
As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks for being a part of the Gatsby community! 💪💜

@github-actions github-actions bot added the stale? Issue that may be closed soon due to the original author not responding any more. label Mar 16, 2020
@github-actions
Copy link

Hey again!

It’s been 30 days since anything happened on this issue, so our friendly neighborhood robot (that’s me!) is going to close it.
Please keep in mind that I’m only a robot, so if I’ve closed this issue in error, I’m HUMAN_EMOTION_SORRY. Please feel free to reopen this issue or create a new one if you need anything else.
As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks again for being part of the Gatsby community! 💪💜

@KyleAMathews
Copy link
Contributor

Hey sorry this never got responded to. Could you create a small reproduction of the problem? That'd be the best next step towards investigating the bug.

@KyleAMathews KyleAMathews added not stale and removed stale? Issue that may be closed soon due to the original author not responding any more. labels Mar 27, 2020
@ConorLindsay94
Copy link
Author

ConorLindsay94 commented Apr 15, 2020

Sorry for the late reply on this @KyleAMathews

I have created a codesandbox with my approach here.

The app.js file is here.

As you can see there are unneccessary matchPath entries as you get towards the bottom of the file. I only have four matchPaths set up in gatsby-node.js, but there is an entry for every 'static' page.

@ConorLindsay94
Copy link
Author

Hi @wardpeet, sorry for mentioning you directly but I think the bit of code which is causing me an issue here is from one of your PRs - #17412.

If I comment out the following code, the matchPath array has the correct amount of elements:

  // Pages can live in matchPaths, to keep them working without doing another network request
  // we save them in matchPath. We use `@reach/router` path ranking to score paths/matchPaths
  // and sort them so more specific paths are before less specific paths.
  // More info in https://github.com/gatsbyjs/gatsby/issues/16097
  // small speedup: don't bother traversing when no matchPaths found.
  if (matchPathPages.length) {
    const newMatches = []
    pages.forEach((page, index) => {
      const isInsideMatchPath = !!matchPathPages.find(
        pageWithMatchPath =>
          !page.matchPath && match(pageWithMatchPath.matchPath, page.path)
      )

      if (isInsideMatchPath) {
        newMatches.push(
          createMatchPathEntry(
            {
              ...page,
              matchPath: page.path,
            },
            index
          )
        )
      }
    })
    // Add afterwards because the new matches are not relevant for the existing search
    matchPathPages.push(...newMatches)
  }

I understand that code is there for a reason as it was part of the fix for #16097, but I just want to get some understanding of whether the solution should be on my end or yours.

Our app has a requirement to have static pages which can have client-only sub-routes, which is why we have pages created through createPage but then also a matchPath set up for them. You can see an example of our approach on the codesandbox in my comment above.

Correct me if I'm wrong - I think that code checks all pages to see if their URL matches any of the matchPaths that are set up, and if there is a match adds them to the matchPath array which is in turn added to the app.js bundle. This is why we are seeing such an inflated bundle size, because we have ~25k pages and there is an matchPath entry for all of these.

Any help would be greatly appreciated.

@dominicfallows
Copy link
Contributor

@KyleAMathews @wardpeet - as far as I can tell, we should be able to trust that Gatsby has generated all the required assets for statically generated pages (their index.html, page-data.json, etc) so the fix (#17412) for #16097 shouldn't need to include static pages in matchPaths (which it seems it does). Therefore, could matchPaths be solely for client-side dynamic only routes?

@wardpeet
Copy link
Contributor

Let me sketch the problem here a bit. When gatsby loads on page, the next navigation will be a client side navigation. So, no page refresh. We try to load a page-data.json file for each URL. If it doesn't exist it will return a 404. When people use matchpaths, this fails and we need to know what the rootPath is. That's why matchpaths exists.

If we kept doing a page-data fetch that could result in many 404s which wouldn't be ideal. We had this logic for a while but a lot of complains happened so that's why we reverted to this pattern.

We're probably able to improve the algorithm a bit.

@herecydev
Copy link
Contributor

Hi @wardpeet, also suffering from this issue.

I would like to put a few select client only routes on the index (e.g. /pay), but also have plenty of static pages (e.g. /shiny-new-toy, /buy-buy-buy). Gatsby understands those static pages exist and can be routed to, but the number of the static pages has brought back the "page manifest problem":

[
{
    "path": "/buy-buy-buy",
    "matchPath": "/buy-buy-buy"
 }, //x 10's of thousands of these static page mappings
 {
    "path": "/",
    "matchPath": "/*"
 }
]

With a finite amount of client only pages, could the problem not be avoided by permitting a simple mapping alongside the globs that are currently supported, i.e.

//gatsby-node.js

exports.onCreatePage = async ({ page, actions }) => {
  const { createPage } = actions
  if (page.path === "/") {
    page.matchPaths = ["/pay", "myotherclientsideroute"]
    createPage(page)
  }
}

Which then produces

[
 {
    "path": "/",
    "matchPaths": ["/pay", "myotherclientsideroute"]
 }
]

And then changing find-path to check accordingly as well. What do you think?

@wanjas
Copy link

wanjas commented May 26, 2020

I also bumped into this problem having 25k+ pages. As @ConorLindsay94 suggested commenting out the mentioned block of code solved it to me too.

Update [2020-10-06]: No, this move has fixed my issues only partially. In the end, I rollbacked to 2. 15.22 (Sep. 2019 version) to make localizations work without having huge match-paths.json.

@LekoArts LekoArts added type: feature or enhancement Issue that is not a bug and requests the addition of a new feature or enhancement. and removed not stale type: question or discussion Issue discussing or asking a question about Gatsby labels May 26, 2020
@danabrit danabrit added the topic: webpack/babel Webpack or babel label May 29, 2020
@dominicfallows
Copy link
Contributor

@KyleAMathews @wardpeet is this something that you would be happy having opensource contributions for? Or is it something you're already working on? If we were to contribute, are there any specific areas to look out for, or has our (@ConorLindsay94) initial work highlighted the main area?

@dominicfallows
Copy link
Contributor

@pieh @wardpeet we have noticed that as a result of changes in #25057 that the matchPaths are no longer included in app-[hash].js files. This has a positive side-effect for this issue. We haven't done thorough testing yet of our client-only nested routers, which we will do shortly, but I wanted to ask if this side-effect (which is positive so far) was intended?

@pieh
Copy link
Contributor

pieh commented Jul 10, 2020

@pieh @wardpeet we have noticed that as a result of changes in #25057 that the matchPaths are no longer included in app-[hash].js files. This has a positive side-effect for this issue. We haven't done thorough testing yet of our client-only nested routers, which we will do shortly, but I wanted to ask if this side-effect (which is positive so far) was intended?

Hmmm, this PR shouldn't have such impact - what this PR should have done is just change from using .cache/match-paths.json to $virtual/match-paths.json, but the size of match-paths shouldn't be impacted at all. This might be regression TBH, will drill a bit into it

@pieh
Copy link
Contributor

pieh commented Jul 10, 2020

Using default it seems like match-paths.json are still bundled into app chunk for me (at least using default start) - and this is expected (as I mentioned in comment above)

I wonder - this might be result of changes to webpack chunking setup maybe that we done some time ago? If that would be the case then match-paths might not end up in app chunk, it might just end up somewhere else (especially in case when this array is a large one).

Other option is that there is regression, but it's conditional - there might be some specific setup that cause this - removing match-paths from bundle would/could cause issues with handling of client side routes. I guess in some cases the tradeoff could be worth it to decrease bundle size at the cost of some runtime problems? (i.e. what potentially could happen is that navigating to a path handled by matchPath but not a path might cause browser reload -> loss of state etc, but the site/app generally could still "work" in general)

@dominicfallows
Copy link
Contributor

dominicfallows commented Jul 10, 2020

@pieh apologies, it was a false positive (true negative for this issue, I suppose). I have run various tests and it does seem that webpack is now putting matchPaths array elsewhere. Thank you for your help though.

I've run tests using different examples sites, slowly increasing the number of dynamic pages, using something like:

exports.createPages = ({ actions }) => {
  const { createPage } = actions
  for (let i = 0; i < 2000; i++) {
    createPage({
      path: `/${i}`,
      matchPath: `/${i}/*`,
      component: require.resolve(`./src/pages/index.js`),
    })
  }
}

When the array gets to a certain size, webpack is creating an independent chunk for the matchPaths, called something like public/188075ab-f28a602826f6270b17df.js (hash changes each build of course). I guess this would be a filesize related decision, where webpack is deciding to chunk.

It seems the change in webpack behavior is between 2.23.20 and 2.23.21, which is where webpack-virtual-modules is introduced, so could be related to those changes, or could be elsewhere, I'd have to dig further - but I don't think that is needed at this point. This issue and the original problem still exists for larger sites with mixed static and dynamic routes, that require matchPath registrations.

In fact, now having multiple large .js files on first load might in fact make the problem exponentially worse, for site load performance and for Google vitals scores. Before we had just a super large app-[hash].js file - which was problem enough, now we have a still-quite-large app-[hash].js file and the addition of a super large [matchPaths].js file.

@pieh
Copy link
Contributor

pieh commented Jul 13, 2020

Ok, so this seems like combination of things here. Introduction of virtual modules (at least the way they were implemented) cause match-paths to have different handling for chunk splitting (as it virtually lands in node_modules, there are some node_modules rules for chunking). Will be working on adjusting that this change to get back to behaviour it had before introduction of virtual modules

@pieh
Copy link
Contributor

pieh commented Jul 13, 2020

#25720 is PR that should fix the chunk splitting change (regression?)

@herecydev
Copy link
Contributor

@pieh what's the latest on this? This still feels very problematic for me (and others); the effect of dynamic/templated page creation results in a ballooning of matchpaths.

I've suggested in another thread about possibly relinquishing control of matchPath to a function that we (consumers of gatsby) could provide. That would allow far more succint code to be generated.

@kszot-ref
Copy link

kszot-ref commented Apr 2, 2021

We've hit this issue with 65k pages, where the bundle size increased to 8MB. The other thing about this issue is that it basically breaks incremental builds. A change in page path, or creating/deleting new pages changes the list in app-hash.js, which will always result in a full rebuild of all pages (the list changes, so the hash in app-hash.js filename will change, which is then included in all html pages...). Is there some workaround for this other than completely getting rid of matchPath usage in a project?

@KyleAMathews
Copy link
Contributor

@kszot-ref curious what you're doing that results in so many matchPaths? Perhaps there's a cheaper way to accomplish the same goal.

@kszot-ref
Copy link

@KyleAMathews I basically create 65k pages through "createPage" method inside gatsby-node.js file. If I pass a matchPath to at least one of those pages during their creation, my app.js balloons up to 8MB from 300kb because it generates the matchPath array that contains every single page and places it inside the app-hash.js. In general I have only 8 matchPaths set - for some specific pages that use reach/router for dynamic views.

@KyleAMathews
Copy link
Contributor

Wait... the array includes every page not just the few matchPath records? What does the data look like?

@kszot-ref
Copy link

@KyleAMathews Yes, every page is in there. The array looks like this:

V=JSON.parse('[{"path":"/path1","matchPath":"/path1"},{"path":"/path2","matchPath":"/path2"},{"path":"/path3","matchPath":"/path3"},...

and it goes on to include every single page. So, basically each "path" gets assigned its own "matchPath" with the same value as "path". The few matchPaths that were set manually are in there too of course.

@lucascaro
Copy link

I see this too. For added context, I see this when using gatsby-plugin-intl.
I've tracked it down to the following code:

      languages.forEach(function (language) {
      var localePage = generatePage(true, language);
      var regexp = new RegExp("/404/?$");
      
      if (regexp.test(localePage.path)) {
         localePage.matchPath = "/" + language + "/*";
      }
      
      createPage(localePage);

This is adding matchPath to 404 pages, but results in gatsby adding every single route to match-paths.json.
If I comment out localePage.matchPath = "/" + language + "/*"; then match-paths.json is empty.

This creates a single page for each language: /language/404/ that has a match path of /language/*.
Instead of adding an entry to match-paths.json per locale this adds every single path to it.

@lucascaro
Copy link

looks like it's intended:

exports[requires-writer matchPath should have static pages that live inside a matchPath 1] = `
https://github.com/gatsbyjs/gatsby/pull/17412/files#diff-78b69af59651dca44c9ba730572b9395d7da73cee3997b7e8eb76f2688d85ae4R24

Is this really intended or a bug?
If intended, is there any way to make it possible to have localized 404 pages, or any pages which require a matchPath, without adding every single page to match-paths.json?

@esped-dfds
Copy link

esped-dfds commented Apr 26, 2021

Any updates on this? our bundle contains 2M of slugs because we have 404 page in multiple locales, and this is quite bad for web-vitals.

If we leave out the matchPath the 404s still work except it will reset the url to e.g /da-dk/404.

The source of this problem is matchPath when creating the 404s: Basicly:

Is there not a less expensive way to not reset the urls on 404?

//create all the "normal" pages

//foreach locale
createPage(
        path: `/${locale.toLowerCase()}/404`,
        component: resolve(`./src/templates/NotFoundPage.js`),
        matchPath: `/${locale.toLowerCase()}/*`,
        context: {} // etc 
)

@kdichev
Copy link
Contributor

kdichev commented Apr 26, 2021

here is a small reproduction of the use case for localised 404 pages:

5 locales, 1 404 page with matchPath /${locale}/* and 1000 normal pages per locale.
That results in around 5000 website pages where match-paths.json takes around 60% of the app bundle.

When I build 10000 pages per locale the resulting match-paths.json takes 90% of the bundle, so the more pages and locales we have the bigger the bundle will be

code: https://github.com/kdichev/gatsby-match-paths-issue
bundle: https://nostalgic-goldberg-a19418.netlify.app/report.html

PS. If I build the project to output 100k pages the cold builds are slower with matchPath

no matchPath: 82s
success Building static HTML for pages - 11.533s - 100003/100003 8670.83/s
with matchPath: 130s
success Building static HTML for pages - 32.988s - 100003/100003 3031.54/s

@lucascaro
Copy link

I can confirm that adding a single page with matchPath (for example a 404 page with match of /*) will cause the build to embed the full list of all site pages on every pages' javascript. Apart from the obvious bloating, this certainly breaks conditional builds since all pages depend on this list now, has anyone found any workarounds?

@KyleAMathews
Copy link
Contributor

If anyone faces would like to brainstorm some possible fixes for this & work towards a solution, grab some time to meet — https://calendly.com/kyle-gatsby/30min?back=1&month=2021-05

@tlgimenes
Copy link
Contributor

I'm having the same problem. I'm creating an ecommerce website, where we pre-render some products and leave others out. Our folder structure looks something like:

└── src
    └── pages
        ├── [:slug]
        │   └── p.tsx
        └── {Product.slug}
            └── p.tsx

The problem I'm having is that the list of all products are being added to app.js entrypoint.

I personally don't mind receiving a 404 for page-data.json on client-side routes for this case in specific. Also, receiving a request for a non pre-rendered page is nice because my servers can add logic to enqueue a new build adding this extra page to my source plugin

@tlgimenes
Copy link
Contributor

For those who are still having this problem, I created a plugin addressing this problem and other performance improvements. The idea is to use the server for routing. It createRedirect for the page-data.json so we don't need to ship all of these paths to the frontend. You can check this out in here:

https://github.com/vtex/faststore/tree/master/packages/gatsby-plugin-performance.

@TymofiiVdovichenkoW
Copy link

For those who are still having this problem, I created a plugin addressing this problem and other performance improvements. The idea is to use the server for routing. It createRedirect for the page-data.json so we don't need to ship all of these paths to the frontend. You can check this out in here:

https://github.com/vtex/faststore/tree/master/packages/gatsby-plugin-performance.

This plugin has a conflict with 'gatsby-plugin-meta-redirect'.

"gatsby-plugin-meta-redirect" threw an error while running the onPostBuild lifecycle:

ENOTDIR: not a directory, open '/public/page-data/en/nz/page-data.json/index.html'

@LekoArts LekoArts removed the type: feature or enhancement Issue that is not a bug and requests the addition of a new feature or enhancement. label Aug 9, 2021
@gatsbyjs gatsbyjs locked and limited conversation to collaborators Aug 9, 2021
@LekoArts LekoArts closed this as completed Aug 9, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
topic: webpack/babel Webpack or babel
Projects
None yet
Development

No branches or pull requests