Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop generating sitemap.xml.gz (#6561) #6562

Merged
merged 2 commits into from
Feb 13, 2025
Merged

Conversation

reebalazs
Copy link
Member

@reebalazs reebalazs commented Jan 2, 2025

We generate sitemap-index.xml which is also put correctly into robots.txt. However we still provide the old sitemap.xml.gz.

But, while this serves no purpose in addition to the index, it can cause problems if Google somehow parses it (for example it's submitted to the Google Search console). For this reason, we stop providing sitemap.xml.gz.


Closes #6561

@reebalazs reebalazs requested review from sneridagh and tisto January 2, 2025 13:18
Copy link

netlify bot commented Jan 2, 2025

Deploy Preview for plone-components canceled.

Name Link
🔨 Latest commit 13ba183
🔍 Latest deploy log https://app.netlify.com/sites/plone-components/deploys/67ad835c3d07600008789f9f

Copy link
Member

@sneridagh sneridagh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plone/volto-team anybody else feedback?

@sneridagh sneridagh requested a review from a team January 8, 2025 11:21
@ichim-david
Copy link
Member

@sneridagh I asked at Edw and at least as a preliminary reply was that it doesn't affect us as we check also for sitemap-index.xml. So at least for us it seems fine. Waiting to see what the other companies have to say about this but in the meanwhile I wanted to write to comment to know that we've discussed about this and we are ok with these changes

@erral
Copy link
Member

erral commented Jan 9, 2025

I would add a redirect from sitemap.xml.gz to sitemap-index.xml.gz to gracefully handle the change.

@reebalazs
Copy link
Member Author

I would add a redirect from sitemap.xml.gz to sitemap-index.xml.gz to gracefully handle the change.

I think that a redirect, while technically possible, is problematic because sitemap-index.xml is not gzipped and the original one is. So we would serve a file ending with .gz but actually not gzipped. Whether or not google eats this or not, is not possible to find out from the documentation. We could figure out experimentally... but then what if they change things.

So if we don't redirect, then Google either finds the sitemap from robots.txt. or it has to be changed once in the console, but it's explicit and there is no magic behind that might break things in the future.

(The reason for it being not gzipped is because of following the google documentation. It seemed simpler to do the same then to start experimenting what can work and what not.)

@erral
Copy link
Member

erral commented Jan 9, 2025

Sorry I have written it incorrectly, I meant a redirect between sitemap.xml.gz and sitemap-index.xml

@sneridagh
Copy link
Member

sneridagh commented Jan 21, 2025

@erral Maybe we are not talking about the same, but if we create a redirect from sitemap.xml.gz to sitemap-index.xml it can bring some other problems, starting with the fact that it's not a .gz file.

@reebalazs I can understand that people that have their consoles pointing to the old, can have problems and will require action, so it's not an easy one (and then, it would require to add it in a breaking change).

The main problem here is that we are providing pointers to both... and could lead to duplicated indexes. Not an easy one. We'll have to test if a issuing a 301 would work for the console and make sure we issue the correct mimetype too.

@erral
Copy link
Member

erral commented Jan 21, 2025

Then, if we are removing an existing URL, I think it should be considered a breaking change.

@reebalazs
Copy link
Member Author

@erral Maybe we are not talking about the same, but if we create a redirect from sitemap.xml.gz to sitemap-index.xml it can bring some other problems, starting with the fact that it's not a .gz file.

@reebalazs I can understand that people that have their consoles pointing to the old, can have problems and will require action, so it's not an easy one (and then, it would require to add it in a breaking change).

The main problem here is that we are providing pointers to both... and could lead to duplicated indexes. Not an easy one. We'll have to test if a issuing a 301 would work for the console and make sure we issue the correct mimetype too.

The problems are

  1. the duplication
  2. the old one is only working until a site has grown as big that Google will reject the sitemap.

@reebalazs
Copy link
Member Author

reebalazs commented Jan 21, 2025

Then, if we are removing an existing URL, I think it should be considered a breaking change.

@erral:

Yes it's a breaking change, but we need this change if we want to fix the situation. The alternative is not doing anything. In this case the old index will continue to work, until the site grows large enough and it breaks. Or even worse, there will be a duplicate that might spoil the SEO. By providing the indexed version by default, we can avoid this.

Also, by "breaking" it means that Google won't be able to access the old index any more. So one should either go to Google search console and add the new index manually, OR Google can pick up the new index from robots.txt, which it SHOULD be smart enough to do, but I'm not 100% sure if that's the case.

If someone could confirm that Google picks up the new index from robots.txt. then there is no breaking at all. If this is not the case then "breaking" means that manual intervention is needed in the search console. I believe if we put a loud enough message in the changelog to "CHECK YOUR SEO AND UPDATE IF NECESSARY", then this should be enough.

@erral
Copy link
Member

erral commented Jan 21, 2025

If we have hit a maximum number of URLs that Google can consider in a single sitemap.xml.gz we can set a limit in the URLs generated in that file, and document that in the Upgrade guide.

This way, although still being a breaking change, the url will keep working, and those small sites that don't reach the limit can still keep working.

We can set the limit to the limit you have set to build each of the files referenced in the sitemap-index.xml

@tisto
Copy link
Member

tisto commented Jan 21, 2025

@erral setting a limit is very dangerous, IMO, since the admins of a site won't get notified if they hit the limit. I agree that this is a breaking change. However, I would fix this once and for all with a breaking change release and just support the index sitemap. Anything else will lead to confusion and make the situation worse than it is already (we lost our entire SEO rankings for plone.de thanks to this problem; I don't want to imagine what would happen if that happened in a client project).

@erral
Copy link
Member

erral commented Jan 22, 2025

Don't get me wrong, I have also faced picky SEO things, and I understand how hard and important is.

I understand that if we have hit the limit we need to provide a way to fix it, and we have done so using the index thing.

I only say that just removing the sitemap.xml.gz can be a "hard" thing, so to say. That's why I proposed 2 possible solutions (redirect and limit the amount of the URLs).

If none of them is valid, or if the solution is worst than the problem, fair enough. I just wanted to raise the concern of removing a URL.

@reebalazs
Copy link
Member Author

reebalazs commented Jan 22, 2025

Don't get me wrong, I have also faced picky SEO things, and I understand how hard and important is.

I understand that if we have hit the limit we need to provide a way to fix it, and we have done so using the index thing.

I only say that just removing the sitemap.xml.gz can be a "hard" thing, so to say. That's why I proposed 2 possible solutions (redirect and limit the amount of the URLs).

If none of them is valid, or if the solution is worst than the problem, fair enough. I just wanted to raise the concern of removing a URL.

Ok so if we want to do the redirect, we have to test it out. There is no other way.

  • Does Google follow the redirect at all?
  • Is it a problem for Google that we redirect a .gz file to an unzipped one, iow does Google process the file that ends with .gz but it's actually a plain xml?

If the answer to these is true then yes, we can do the redirect. If no, then no it does not work.

EDIT actually we can eliminate the redirect and simply serve the index file also with the old name. (Even simpler to do from the middleware.) Then only the second question remains, is it a problem for Google that the extension is .gz but it's actually a plain xml? If Google is opportunistic about this, then this might just work. (If not, we'd need to gzip it, which is also not too hard.)

Does anyone have the capacity to test this out with the Google console, see if a file is actually picked up? I'm flooded right now so I can't make any promises.

@erral
Copy link
Member

erral commented Jan 28, 2025

I will test the redirect thing between sitemap.xml.gz and sitemap-index.xml and report back

@sneridagh
Copy link
Member

We have to document the issues sitemap-index.html / sitemap.xml.gz nuances.

@stevepiercy
Copy link
Collaborator

We have to document the issues sitemap-index.html / sitemap.xml.gz nuances.

Where should the docs go? I'd say at least a mention in the Upgrade Guide. Is any of this configurable or changeable? If so, then it's probably a how to guide, else it's probably explanation.

@erral
Copy link
Member

erral commented Jan 28, 2025

After talking with the guy that handles Search Console in our company, the solution may be easier than we thought:

The sitemap-index.xml file it's in itself a sitemap file, so instead of creating a new URL with the index, we can use the sitemap.xml.gz URL and serve the index file there.

This way, we don't need to change anything, we just change the contents of the sitemap.xml.gz.

My colleague says it doesn't matter to serve the file compress or uncompressed, he even suggests to compress every sitemap file, even the index one.

@reebalazs
Copy link
Member Author

After talking with the guy that handles Search Console in our company, the solution may be easier than we thought:

The sitemap-index.xml file it's in itself a sitemap file, so instead of creating a new URL with the index, we can use the sitemap.xml.gz URL and serve the index file there.

This way, we don't need to change anything, we just change the contents of the sitemap.xml.gz.

My colleague says it doesn't matter to serve the file compress or uncompressed, he even suggests to compress every sitemap file, even the index one.

That should be pretty easy then. I also suspected that Google does not care if the file is compressed or uncompressed.

Still for future compatibility I suggest:

  • We continue serving sitemap-index.xml, and we also put it to robot.txt just like in the current PR
  • We add back serving sitemap.xml.gz but we actually serve the same content as sitemap-index.xml with this name.

That way every new site setup would see sitemap-index.xml which indicates that this is, in fact, a batched index. Also this is available for picking up from robots.txt. But all old deployments will continue to work even if the old sitemap.xml.gz is added to the console.

@erral
Copy link
Member

erral commented Feb 11, 2025

We have tested this approach (serve the index file gzipped in the sitemap.xml.gz file) in 2 sites of ours and Search Console reports that the files are OK and the reported number of URLs is correct.

@reebalazs reebalazs force-pushed the ree-remove-old-sitemap branch from 3ccc9b3 to 6dc9010 Compare February 11, 2025 09:12
@reebalazs reebalazs requested a review from erral February 11, 2025 09:13
@reebalazs
Copy link
Member Author

We have tested this approach (serve the index file gzipped in the sitemap.xml.gz file) in 2 sites of ours and Search Console reports that the files are OK and the reported number of URLs is correct.

@erral I've added this, can you please test it out and approve? Thank you!

@reebalazs reebalazs force-pushed the ree-remove-old-sitemap branch from 6dc9010 to c1e3d8c Compare February 11, 2025 09:22
@erral
Copy link
Member

erral commented Feb 11, 2025

I will try to deploy these changes in a Volto 18 site.

Copy link
Member

@erral erral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested it locally before deploying it, and I see that the sitemap.xml.gz contents are not gzipped, it directly contains the index file.

@reebalazs
Copy link
Member Author

I have tested it locally before deploying it, and I see that the sitemap.xml.gz contents are not gzipped, it directly contains the index file.

I've left out some part and force pushed later, can you please double check that you have the final?

It has res.set('Content-Type', 'application/x-gzip'); which causes the gzipping in the same way as originally or with the currently batched files.

@erral
Copy link
Member

erral commented Feb 11, 2025

Mmm, strange.

The headers mark that the content is sent gzipped, which it happens, but the transferred file is not a gzipped file, but a text/plain:

erral@lindari:/tmp$ wget -S http://localhost:3000/sitemap.xml.gz
--2025-02-11 10:46:09--  http://localhost:3000/sitemap.xml.gz
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:3000... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Content-Type: application/x-gzip; charset=utf-8
  Content-Disposition: attachment; filename="sitemap.xml.gz"
  Content-Length: 199
  ETag: W/"c7-Jl9aJBoMZB2Ag5+54aah1NO77Sk"
  Date: Tue, 11 Feb 2025 09:46:09 GMT
  Connection: keep-alive
  Keep-Alive: timeout=5
Length: 199 [application/x-gzip]
Saving to: 'sitemap.xml.gz'

sitemap.xml.gz                              100%[========================================================================================>]     199  --.-KB/s    in 0s      

2025-02-11 10:46:09 (20.7 MB/s) - 'sitemap.xml.gz' saved [199/199]

And then:

erral@lindari:/tmp$ more sitemap.xml.gz 
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://localhost:3000/sitemap1.xml.gz</loc>
  </sitemap>
</sitemapindex>

I would expect that not only the transfer is gzipped, but the file itself also, right?

This is the example with the current https://demo.plone.org/sitemap.xml.gz

erral@lindari:/tmp$ wget https://demo.plone.org/sitemap.xml.gz
--2025-02-11 10:47:24--  https://demo.plone.org/sitemap.xml.gz
Resolving demo.plone.org (demo.plone.org)... 104.25.83.118, 172.67.80.190, 104.25.84.118, ...
Connecting to demo.plone.org (demo.plone.org)|104.25.83.118|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 927 [application/x-gzip]
Saving to: 'sitemap.xml.gz'

sitemap.xml.gz                              100%[========================================================================================>]     927  --.-KB/s    in 0s      

2025-02-11 10:47:24 (9.97 MB/s) - 'sitemap.xml.gz' saved [927/927]

And the contents:

erral@lindari:/tmp$ more sitemap.xml.gz 
�o�0���+P���m 	Oiz۩;-�v�<p�7�v��O��
aFQ���ы�H���`t��;�qg�1
�y#3�����<ԪhX�y�x�.I\������
��b��+��Y6B��	����:��W�׉��f�ׁ�Hp�sP�,ұ��x��KV�������5��$A<(g�E���� YN�L5Zp~DJ�x��A����b�(şH�Y�
R��O
erral@lindari:/tmp$ 

@reebalazs reebalazs changed the title Stop generating sitemap.xml.gz (#6561) WIP Stop generating sitemap.xml.gz (#6561) Feb 11, 2025
@reebalazs
Copy link
Member Author

@erral, sorry, my bad. I'll ping you when I've updated it.

@reebalazs reebalazs force-pushed the ree-remove-old-sitemap branch from c1e3d8c to 4212296 Compare February 11, 2025 10:11
@reebalazs
Copy link
Member Author

@erral please check it out now!

Copy link
Member

@erral erral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, now it works as expected

@reebalazs reebalazs changed the title WIP Stop generating sitemap.xml.gz (#6561) Stop generating sitemap.xml.gz (#6561) Feb 11, 2025
We generate sitemap-index.xml which is also put correctly into robots.txt. However we still provide the old sitemap.xml.gz.

But, while this serves no purpose in addition to the index, it can cause problems if Google somehow parses it (for example it's submitted to the Google Search console). For this reason, we  stop providing sitemap.xml.gz.

- also serve the batched sitemap under the old name sitemap.xml.gz
- update comment in robots.txt
@reebalazs reebalazs force-pushed the ree-remove-old-sitemap branch from 4212296 to 17491ce Compare February 12, 2025 14:38
Copy link
Member

@sneridagh sneridagh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @plone/volto-team please another review?

Copy link
Member

@davisagli davisagli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sneridagh sneridagh merged commit 2599c72 into main Feb 13, 2025
80 checks passed
@sneridagh sneridagh deleted the ree-remove-old-sitemap branch February 13, 2025 07:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Do not deliver the old sitemap.xml.gz
7 participants