-
Notifications
You must be signed in to change notification settings - Fork 212
Split single sitemap into index and sub-sitemaps by starting letter in crate-name #1222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks awesome, thanks so much ❤️
@Nemo157 may want to take a look at the GROUP BY changes but they seem fine to me.
src/web/sitemap.rs
Outdated
rustdoc_status = true AND | ||
( | ||
crates.name like $1 OR | ||
crates.name like $2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use ILIKE here to avoid needing two parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that was intentional, ILIKE
only can use a BTREE
index when the search doesn't start with non-alphabetic characters. I thought it was a little overkill to create a new index (GIN
or BTREE
on lower(name)
) for that.
(see also the postgres docs on index types)
Another alternative would be to use the db proc normalize_crate_name
, which already has a specific index that is used in web::match_version
. But that also seemed a little off, but I'm happy to change it if you think that fits the project more.
In any case I see this is definitely worth a comment :)
@@ -0,0 +1,8 @@ | |||
<?xml version="1.0" encoding="UTF-8"?> | |||
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> | |||
{% for which in sitemaps -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kixiron do you know if tera happens to have a "character range" operator? I found range(end=5)
but I'm not sure if it works for characters instead.
Templates render fast enough I don't think we need to worry about this (especially for non-interactive requests like a sitemap). The database will take way longer than going through tera.
If you can figure it out it sounds neat, but I wouldn't spend too much time on it. We're hoping to switch away from iron sometime soon (#747). |
@Nemo157 I've seen postgres ignoring indexes when the seq scan is faster, |
Yeah, I doubt it matters much, 14ms on a dump from production from ~Feb seems fine, if necessary when the db gets bigger we can rebuild the index with C locale explicitly so it is usable. |
that being true, I want to understand why it's not being used in this specific case :) (unrelated this this PR) |
yep, it's about the index locale:
as we see, with the current 50k records it doesn't help much. |
now the question remains if I should simplify the code by using Or do we want to quickly create the index later and the code should stay like this? |
I think if the difference is only 2 ms on a production dump with 200k crates, we should just use ILIKE - there are much slower queries that will need to be fixed if crates.io grows by an order of magnitude.
Not sure which that counts as? But I don't think it's worth messing with for a query that's already very fast. :) Here's the same
|
That counts as
@jyn514 here I'm confused, you prefer the current solution where the index would be used in production? |
( for example, the locale in the docker-container for the local database is set as |
I would change to ILIKE because it's easier to read and understand.
For next time, I prefer rebasing to merging, but it's not a big deal. |
will do
good to know, but also easy to change for this PR. I'm used to doing squash-merging on the final merge to master, sorry about that :) (could have asked) |
Co-authored-by: Joshua Nelson <joshua@yottadb.com>
Co-authored-by: Joshua Nelson <joshua@yottadb.com>
also the force-push after rebase breaks the history in the PR :) (the changes coming after comments) |
8eef24d
to
62dcdef
Compare
@jyn514 rebased and force-pushed. Tests work locally, also did another manual test after the changes. |
This is awesome, thanks so much! |
I'm happy to help. After seeing you are doing rebate-merges I'll squash some commits next time 😀 |
This is a first draft of an implementation for #1174.
My idea was to split the sitemap by starting letter, since
(crate-count per letter below)
Also I changed the query from
DISTINCT ON
to usingGROUP BY
andMAX
.Since we don't do
ORDER BY
,DISTINCT ON
relies on the order the columns have on disk. If you don't do anything apart from working on single records, it doesn't matter and the first records (which are picked byDISTINCT ON
) are the newest records. This only can be different with different data loading techniques, or when manually handling many records. I think theGROUP BY
andMAX
is more explicit, so better in that case.But I'm also happy to revert that part if you don't think it's a good idea.
While I have a (long) history in software-dev and databases, I'm relatively new to rust. Also this is my first contribution to this project. So I'm happy to implement any changes / improvements.
Things that I could see, but I'm not sure if necessary:
robots.txt
reference the sub-sitemaps, not the index? (so, also use a template?)sitemap.a.xml
, but I couldn't find an easy way to get router to handle this, and it's tricky to find much information about iron).the data (tm)
starting letters in crates.io index
starting letters in crates.io index (convert to lowercase)
code used