Stop generating sitemap.xml.gz (#6561) #6562

reebalazs · 2025-01-02T13:18:18Z

We generate sitemap-index.xml, which is also correctly listed in robots.txt. However, we still provide the old sitemap.xml.gz.

While the old file serves no purpose alongside the index, it can cause problems if Google somehow parses it (for example, if it has been submitted to the Google Search Console). For this reason, we stop providing sitemap.xml.gz.

I signed and returned the Plone Contributor Agreement, and received and accepted an invitation to join a team in the Plone GitHub organization.
I verified there aren't other open pull requests for the same change.
I followed the guidelines in Contributing to Volto.
I successfully ran code linting checks on my changes locally.
I successfully ran unit tests on my changes locally.
I successfully ran acceptance tests on my changes locally.
If needed, I added new tests for my changes.
If needed, I added documentation for my changes, either in the Storybook or narrative documentation.
I included a change log entry in my commits.

Closes #6561

netlify · 2025-01-02T13:18:38Z

✅ Deploy Preview for plone-components canceled.

🔨 Latest commit: `13ba183`
🔍 Latest deploy log: https://app.netlify.com/sites/plone-components/deploys/67ad835c3d07600008789f9f

sneridagh

@plone/volto-team does anybody else have feedback?

ichim-david · 2025-01-09T11:48:46Z

@sneridagh I asked at Edw, and the preliminary reply was that it doesn't affect us, as we also check for sitemap-index.xml. So for us it seems fine. We are still waiting to hear what the other companies have to say, but in the meantime I wanted to comment so you know that we've discussed this and are OK with these changes.

erral · 2025-01-09T12:47:05Z

I would add a redirect from sitemap.xml.gz to sitemap-index.xml.gz to gracefully handle the change.

reebalazs · 2025-01-09T15:02:39Z

> I would add a redirect from sitemap.xml.gz to sitemap-index.xml.gz to gracefully handle the change.

I think that a redirect, while technically possible, is problematic because sitemap-index.xml is not gzipped and the original one is. So we would serve a file whose name ends in .gz but that is not actually gzipped. Whether Google accepts this cannot be determined from the documentation. We could find out experimentally, but then what if they change things?

If we don't redirect, then Google either finds the sitemap from robots.txt, or it has to be changed once in the console. Either way it's explicit, and there is no magic behind it that might break in the future.

(It is not gzipped because we followed the Google documentation. It seemed simpler to do the same than to start experimenting with what works and what doesn't.)

erral · 2025-01-09T15:34:49Z

Sorry, I wrote it incorrectly. I meant a redirect from sitemap.xml.gz to sitemap-index.xml.

sneridagh · 2025-01-21T08:42:28Z

@erral Maybe we are not talking about the same thing, but a redirect from sitemap.xml.gz to sitemap-index.xml can bring other problems, starting with the fact that the target is not a .gz file.

@reebalazs I understand that people whose consoles point to the old file can have problems and will need to take action, so it's not an easy one (and it would have to go in as a breaking change).

The main problem here is that we are providing pointers to both, which could lead to duplicated indexes. Not an easy one. We'll have to test whether issuing a 301 would work for the console, and make sure we issue the correct MIME type too.

erral · 2025-01-21T09:19:05Z

Then, if we are removing an existing URL, I think it should be considered a breaking change.

reebalazs · 2025-01-21T09:44:24Z

> @erral Maybe we are not talking about the same thing, but a redirect from sitemap.xml.gz to sitemap-index.xml can bring other problems, starting with the fact that the target is not a .gz file.
>
> @reebalazs I understand that people whose consoles point to the old file can have problems and will need to take action, so it's not an easy one (and it would have to go in as a breaking change).
>
> The main problem here is that we are providing pointers to both, which could lead to duplicated indexes. Not an easy one. We'll have to test whether issuing a 301 would work for the console, and make sure we issue the correct MIME type too.

The problems are:

- the duplication
- the old one only works until a site has grown so big that Google rejects the sitemap.

reebalazs · 2025-01-21T09:50:16Z

> Then, if we are removing an existing URL, I think it should be considered a breaking change.

@erral:

Yes it's a breaking change, but we need this change if we want to fix the situation. The alternative is not doing anything. In this case the old index will continue to work, until the site grows large enough and it breaks. Or even worse, there will be a duplicate that might spoil the SEO. By providing the indexed version by default, we can avoid this.

Also, "breaking" here means that Google won't be able to access the old index any more. So one should either go to Google Search Console and add the new index manually, OR Google can pick up the new index from robots.txt, which it SHOULD be smart enough to do, but I'm not 100% sure that's the case.

If someone could confirm that Google picks up the new index from robots.txt, then there is no breakage at all. If this is not the case, then "breaking" means that manual intervention is needed in the search console. I believe that if we put a loud enough message in the changelog to "CHECK YOUR SEO AND UPDATE IF NECESSARY", this should be enough.
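Whether Google acts on it is a separate question, but checking that a deployed robots.txt actually advertises the new index is easy to script. A minimal sketch, assuming Node.js; `extractSitemapUrls` is a hypothetical helper name and the example URL is made up. It only reads the standard `Sitemap:` lines from the robots.txt text:

```javascript
// Hypothetical helper, not part of Volto: list the URLs advertised
// through "Sitemap:" lines in a robots.txt body.
function extractSitemapUrls(robotsTxt) {
  return robotsTxt
    .split(/\r?\n/)
    .map((line) => line.trim())
    // Match the field name case-insensitively (a lenient assumption).
    .filter((line) => /^sitemap:/i.test(line))
    .map((line) => line.slice('sitemap:'.length).trim());
}

// Example input (made-up domain):
// extractSitemapUrls('User-agent: *\nSitemap: https://example.com/sitemap-index.xml\n');
```

If the returned list contains sitemap-index.xml, the index is at least discoverable. It does not prove that a crawler has picked it up.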

erral · 2025-01-21T09:59:49Z

If we have hit the maximum number of URLs that Google accepts in a single sitemap.xml.gz, we can set a limit on the number of URLs generated in that file, and document that in the Upgrade guide.

This way, although it is still a breaking change, the URL will keep working, and small sites that don't reach the limit can keep working.

We can use the same limit that you have set for building each of the files referenced in sitemap-index.xml.

tisto · 2025-01-21T12:11:37Z

@erral setting a limit is very dangerous, IMO, since the admins of a site won't get notified if they hit the limit. I agree that this is a breaking change. However, I would fix this once and for all with a breaking change release and just support the index sitemap. Anything else will lead to confusion and make the situation worse than it is already (we lost our entire SEO rankings for plone.de thanks to this problem; I don't want to imagine what would happen if that happened in a client project).

erral · 2025-01-22T07:25:39Z

Don't get me wrong, I have also faced picky SEO things, and I understand how hard and important it is.

I understand that if we have hit the limit we need to provide a way to fix it, and we have done so using the index thing.

I only say that just removing the sitemap.xml.gz can be a "hard" thing, so to say. That's why I proposed 2 possible solutions (redirect and limit the amount of the URLs).

If neither of them is valid, or if the solution is worse than the problem, fair enough. I just wanted to raise the concern about removing a URL.

reebalazs · 2025-01-22T09:26:23Z

> Don't get me wrong, I have also faced picky SEO things, and I understand how hard and important it is.
>
> I understand that if we have hit the limit we need to provide a way to fix it, and we have done so using the index thing.
>
> I only say that just removing the sitemap.xml.gz can be a "hard" thing, so to say. That's why I proposed 2 possible solutions (redirect and limit the amount of the URLs).
>
> If neither of them is valid, or if the solution is worse than the problem, fair enough. I just wanted to raise the concern about removing a URL.

Ok so if we want to do the redirect, we have to test it out. There is no other way.

- Does Google follow the redirect at all?
- Is it a problem for Google that we redirect a .gz file to an uncompressed one? In other words, does Google process a file whose name ends in .gz but is actually plain XML?

If Google follows the redirect and accepts the plain XML, then yes, we can do the redirect. If not, it does not work.

EDIT actually we can eliminate the redirect and simply serve the index file also with the old name. (Even simpler to do from the middleware.) Then only the second question remains, is it a problem for Google that the extension is .gz but it's actually a plain xml? If Google is opportunistic about this, then this might just work. (If not, we'd need to gzip it, which is also not too hard.)

Does anyone have the capacity to test this out with the Google console, see if a file is actually picked up? I'm flooded right now so I can't make any promises.

erral · 2025-01-28T10:44:53Z

I will test the redirect thing between sitemap.xml.gz and sitemap-index.xml and report back

sneridagh · 2025-01-28T10:46:30Z

We have to document the sitemap-index.xml / sitemap.xml.gz nuances.

stevepiercy · 2025-01-28T12:29:34Z

> We have to document the sitemap-index.xml / sitemap.xml.gz nuances.

Where should the docs go? I'd say at least a mention in the Upgrade Guide. Is any of this configurable or changeable? If so, then it's probably a how to guide, else it's probably explanation.

erral · 2025-01-28T13:41:32Z

After talking with the guy that handles Search Console in our company, the solution may be easier than we thought:

The sitemap-index.xml file is itself a sitemap file, so instead of creating a new URL for the index, we can use the sitemap.xml.gz URL and serve the index file there.

This way, we don't need to change anything; we just change the contents of sitemap.xml.gz.

My colleague says it doesn't matter whether the file is served compressed or uncompressed. He even suggests compressing every sitemap file, including the index.

reebalazs · 2025-01-28T13:46:24Z

> After talking with the guy that handles Search Console in our company, the solution may be easier than we thought:
>
> The sitemap-index.xml file is itself a sitemap file, so instead of creating a new URL for the index, we can use the sitemap.xml.gz URL and serve the index file there.
>
> This way, we don't need to change anything; we just change the contents of sitemap.xml.gz.
>
> My colleague says it doesn't matter whether the file is served compressed or uncompressed. He even suggests compressing every sitemap file, including the index.

That should be pretty easy then. I also suspected that Google does not care if the file is compressed or uncompressed.

Still, for future compatibility I suggest:

- We continue serving sitemap-index.xml, and we also list it in robots.txt, just like in the current PR.
- We add back sitemap.xml.gz, but under this name we actually serve the same content as sitemap-index.xml.

That way every new site setup would see sitemap-index.xml which indicates that this is, in fact, a batched index. Also this is available for picking up from robots.txt. But all old deployments will continue to work even if the old sitemap.xml.gz is added to the console.
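For illustration, here is a rough sketch of what such a batched index contains, modelled on the `sitemapN.xml.gz` entries that show up in the index output later in this thread. The helper name `buildSitemapIndex` and the batch size are assumptions made for this example; this is not Volto's actual code.

```javascript
// Hypothetical helper, not Volto's implementation: build a sitemap index
// that points to one batched sitemap file per `batchSize` items.
// Assumes baseUrl contains no characters that need XML escaping.
function buildSitemapIndex(baseUrl, totalItems, batchSize) {
  // Always emit at least one batch, even for an empty site.
  const batches = Math.max(1, Math.ceil(totalItems / batchSize));
  const entries = [];
  for (let i = 1; i <= batches; i++) {
    entries.push(
      `  <sitemap>\n    <loc>${baseUrl}/sitemap${i}.xml.gz</loc>\n  </sitemap>`,
    );
  }
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</sitemapindex>',
  ].join('\n');
}
```

Under the suggestion above, the same string would be the payload for both /sitemap-index.xml and /sitemap.xml.gz, so old console entries and new robots.txt lookups end up at the same index.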

erral · 2025-02-11T08:56:45Z

We have tested this approach (serve the index file gzipped in the sitemap.xml.gz file) in 2 sites of ours and Search Console reports that the files are OK and the reported number of URLs is correct.

reebalazs · 2025-02-11T09:14:19Z

> We have tested this approach (serve the index file gzipped in the sitemap.xml.gz file) in 2 sites of ours and Search Console reports that the files are OK and the reported number of URLs is correct.

@erral I've added this, can you please test it out and approve? Thank you!

erral · 2025-02-11T09:29:39Z

I will try to deploy these changes in a Volto 18 site.

erral

I have tested it locally before deploying it, and I see that the sitemap.xml.gz contents are not gzipped, it directly contains the index file.

reebalazs · 2025-02-11T09:41:01Z

> I have tested it locally before deploying it, and I see that the sitemap.xml.gz contents are not gzipped, it directly contains the index file.

I've left out some part and force pushed later, can you please double check that you have the final?

It has res.set('Content-Type', 'application/x-gzip'); which causes the gzipping in the same way as originally or with the currently batched files.

erral · 2025-02-11T09:48:37Z

Mmm, strange.

The headers say that the content is gzipped, but the transferred file is not actually a gzipped file; it is plain text:

erral@lindari:/tmp$ wget -S http://localhost:3000/sitemap.xml.gz
--2025-02-11 10:46:09--  http://localhost:3000/sitemap.xml.gz
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:3000... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Content-Type: application/x-gzip; charset=utf-8
  Content-Disposition: attachment; filename="sitemap.xml.gz"
  Content-Length: 199
  ETag: W/"c7-Jl9aJBoMZB2Ag5+54aah1NO77Sk"
  Date: Tue, 11 Feb 2025 09:46:09 GMT
  Connection: keep-alive
  Keep-Alive: timeout=5
Length: 199 [application/x-gzip]
Saving to: 'sitemap.xml.gz'

sitemap.xml.gz                              100%[========================================================================================>]     199  --.-KB/s    in 0s      

2025-02-11 10:46:09 (20.7 MB/s) - 'sitemap.xml.gz' saved [199/199]

And then:

erral@lindari:/tmp$ more sitemap.xml.gz 
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://localhost:3000/sitemap1.xml.gz</loc>
  </sitemap>
</sitemapindex>

I would expect that not only the transfer is gzipped, but the file itself also, right?

This is the example with the current https://demo.plone.org/sitemap.xml.gz

erral@lindari:/tmp$ wget https://demo.plone.org/sitemap.xml.gz
--2025-02-11 10:47:24--  https://demo.plone.org/sitemap.xml.gz
Resolving demo.plone.org (demo.plone.org)... 104.25.83.118, 172.67.80.190, 104.25.84.118, ...
Connecting to demo.plone.org (demo.plone.org)|104.25.83.118|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 927 [application/x-gzip]
Saving to: 'sitemap.xml.gz'

sitemap.xml.gz                              100%[========================================================================================>]     927  --.-KB/s    in 0s      

2025-02-11 10:47:24 (9.97 MB/s) - 'sitemap.xml.gz' saved [927/927]

And the contents:

erral@lindari:/tmp$ more sitemap.xml.gz 
�o�0���+P���m 	Oiz۩;-�v�<p�7�v��O��
aFQ���ы�H���`t��;�qg�1
�y#3�����<ԪhX�y�x�.I\������
��b��+��Y6B��	����:��W�׉��f�ׁ�Hp�sP�,ұ��x��KV�������5��$A<(g�E���� YN�L5Zp~DJ�x��A����b�(şH�Y�
R��O
erral@lindari:/tmp$
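The manual check above (downloading the file and looking at it with `more`) can also be scripted. A small sketch, assuming Node.js; `isGzipped` is a hypothetical helper name. It relies only on the fact that a gzip stream starts with the two bytes 0x1f 0x8b:

```javascript
// Hypothetical helper: report whether a downloaded payload is really gzip
// data by looking at the two-byte gzip magic number (0x1f, 0x8b).
function isGzipped(buffer) {
  return buffer.length >= 2 && buffer[0] === 0x1f && buffer[1] === 0x8b;
}

// Possible usage with the file saved by wget:
// const fs = require('node:fs');
// console.log(isGzipped(fs.readFileSync('sitemap.xml.gz')));
```

For the first download in this comment the check would report false, because the body starts with `<?xml`. For a genuinely compressed file it reports true.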

reebalazs · 2025-02-11T09:58:13Z

@erral, sorry, my bad. I'll ping you when I've updated it.
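For context: a Content-Type header only labels the response, it does not compress the body. The thread does not show the updated code, so the following is only a sketch of one way to do it, assuming an Express-style `res` object and Node's built-in zlib module. The helper name `sendGzippedIndex` is hypothetical.

```javascript
const zlib = require('node:zlib');

// Hypothetical helper: compress the index XML explicitly, then label it.
// `res` is assumed to offer Express-style set() and send() methods.
function sendGzippedIndex(res, xml) {
  // The compression happens here, not through the header.
  const body = zlib.gzipSync(Buffer.from(xml, 'utf-8'));
  res.set('Content-Type', 'application/x-gzip');
  res.set('Content-Disposition', 'attachment; filename="sitemap.xml.gz"');
  res.send(body);
}
```

With this, the bytes saved by wget would be an actual gzip file, which is what the comparison with demo.plone.org above expects.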

reebalazs · 2025-02-11T10:11:29Z

@erral please check it out now!

erral

Yes, now it works as expected

We generate sitemap-index.xml which is also put correctly into robots.txt. However we still provide the old sitemap.xml.gz. But, while this serves no purpose in addition to the index, it can cause problems if Google somehow parses it (for example it's submitted to the Google Search console). For this reason, we stop providing sitemap.xml.gz.

- also serve the batched sitemap under the old name sitemap.xml.gz
- update comment in robots.txt

sneridagh

LGTM! @plone/volto-team please another review?

packages/volto/news/6561.bugfix

davisagli

LGTM

reebalazs requested review from sneridagh and tisto January 2, 2025 13:18

sneridagh approved these changes Jan 8, 2025

sneridagh requested a review from a team January 8, 2025 11:21

erral mentioned this pull request Jan 28, 2025

sitemap.xml.gz has limit of 50.000 items plone/plone.app.layout#381

Closed

reebalazs force-pushed the ree-remove-old-sitemap branch from 3ccc9b3 to 6dc9010 Compare February 11, 2025 09:12

reebalazs requested a review from erral February 11, 2025 09:13

reebalazs force-pushed the ree-remove-old-sitemap branch from 6dc9010 to c1e3d8c Compare February 11, 2025 09:22

erral requested changes Feb 11, 2025

reebalazs changed the title ~~Stop generating sitemap.xml.gz (#6561)~~ WIP Stop generating sitemap.xml.gz (#6561) Feb 11, 2025

reebalazs force-pushed the ree-remove-old-sitemap branch from c1e3d8c to 4212296 Compare February 11, 2025 10:11

erral approved these changes Feb 11, 2025

reebalazs changed the title ~~WIP Stop generating sitemap.xml.gz (#6561)~~ Stop generating sitemap.xml.gz (#6561) Feb 11, 2025

reebalazs force-pushed the ree-remove-old-sitemap branch from 4212296 to 17491ce Compare February 12, 2025 14:38

sneridagh approved these changes Feb 12, 2025

davisagli reviewed Feb 13, 2025

packages/volto/news/6561.bugfix (outdated, resolved)

Update packages/volto/news/6561.bugfix

13ba183

davisagli approved these changes Feb 13, 2025

davisagli added the 24 status: ready label Feb 13, 2025

sneridagh merged commit 2599c72 into main Feb 13, 2025
80 checks passed

sneridagh deleted the ree-remove-old-sitemap branch February 13, 2025 07:38
