Split single sitemap into index and sub-sitemaps by starting letter in crate-name #1222

Merged
8 commits merged on Dec 27, 2020

syphar (Member) commented Dec 26, 2020

This is a first draft of an implementation for #1174.

My idea was to split the sitemap by starting letter, since

  • this is rather static on the index-side,
  • allows an easy and indexed filtering by postgres when generating the sub-sitemaps,
  • and doesn't change the result when re-requesting a site-map.
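
To make the bucketing idea concrete, here is a minimal std-only sketch of how crate names could be assigned to a sub-sitemap by their first letter. The helper names `sitemap_bucket` and `bucket_counts` are hypothetical and not taken from this PR; the sample crate names are only illustrative.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: the sub-sitemap bucket for a crate name is its
/// first character, lowercased. Returns `None` for an empty name.
fn sitemap_bucket(name: &str) -> Option<char> {
    name.chars().next().map(|c| c.to_ascii_lowercase())
}

/// Counts how many names fall into each bucket (sorted by letter).
fn bucket_counts<'a>(names: impl IntoIterator<Item = &'a str>) -> BTreeMap<char, usize> {
    let mut counts = BTreeMap::new();
    for name in names {
        if let Some(bucket) = sitemap_bucket(name) {
            *counts.entry(bucket).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    // Illustrative input only, not real index data.
    let names = ["serde", "Syn", "rand", "regex", "tokio"];
    for (letter, count) in bucket_counts(names) {
        println!("{}\t{}", letter, count);
    }
}
```

Because the bucket depends only on the name, the same crate always lands in the same sub-sitemap, which matches the "doesn't change the result when re-requesting" point above.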

(crate-count per letter below)

Also, I changed the query from DISTINCT ON to GROUP BY with MAX.
Since we don't use ORDER BY, DISTINCT ON relies on the order the rows have on disk. As long as records are only ever written one at a time, that doesn't matter, and the first records (the ones DISTINCT ON picks) are the newest. The result can differ with other data-loading techniques, or when many records are handled manually. I think GROUP BY with MAX is more explicit, and therefore better in this case.
But I'm also happy to revert that part if you don't think it's a good idea.
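
The difference can be sketched outside the database. The following is a simplified model, not the PR's query: `first_seen` stands in for DISTINCT ON without ORDER BY (modelled here as "first row encountered wins", although Postgres makes no promise about which row that is), and `newest` stands in for GROUP BY with MAX. Both function names and the `(name, timestamp)` row shape are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Illustrative row: (crate name, release timestamp).
type Row = (&'static str, u64);

/// Models `DISTINCT ON (name)` without `ORDER BY`:
/// whichever row is encountered first for a name wins.
fn first_seen(rows: &[Row]) -> HashMap<&'static str, u64> {
    let mut out = HashMap::new();
    for &(name, ts) in rows {
        out.entry(name).or_insert(ts);
    }
    out
}

/// Models `GROUP BY name` with `MAX(timestamp)`:
/// the newest timestamp wins, independent of row order.
fn newest(rows: &[Row]) -> HashMap<&'static str, u64> {
    let mut out = HashMap::new();
    for &(name, ts) in rows {
        let entry = out.entry(name).or_insert(ts);
        *entry = (*entry).max(ts);
    }
    out
}

fn main() {
    // Same data in two different storage orders.
    let newest_first: [Row; 3] = [("foo", 30), ("foo", 10), ("bar", 20)];
    let oldest_first: [Row; 3] = [("foo", 10), ("foo", 30), ("bar", 20)];

    println!("first_seen: {} vs {}", first_seen(&newest_first)["foo"], first_seen(&oldest_first)["foo"]);
    println!("newest:     {} vs {}", newest(&newest_first)["foo"], newest(&oldest_first)["foo"]);
}
```

Only the MAX-based variant gives the same answer for both orders, which is the "more explicit" behaviour argued for above.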

While I have a (long) history in software-dev and databases, I'm relatively new to Rust. Also, this is my first contribution to this project, so I'm happy to implement any changes / improvements.

Things that I could see, but I'm not sure if necessary:

  • different tests? more tests? (I tried to follow the existing pattern / depth)
  • make sitemap-index completely static? (though that would have quite some repetition)
  • make robots.txt reference the sub-sitemaps, not the index? (so, also use a template?)
  • different URL-pattern? (the thing I originally wanted was a file-name with a variable element like sitemap.a.xml, but I couldn't find an easy way to get router to handle this, and it's tricky to find much information about iron).

the data (tm)

starting letters in crates.io index
s       5406
r       4376
c       4305
a       3080
t       3042
p       2983
m       2931
g       2582
l       2500
d       2259
b       2226
f       2116
e       1783
i       1576
n       1570
w       1426
h       1398
o       1225
u       942
k       818
v       760
j       636
x       454
q       374
y       370
z       354
G       13
R       13
C       11
H       10
A       9
I       8
M       8
P       8
S       8
N       7
D       5
F       5
L       5
T       5
Q       4
Y       4
B       3
E       3
K       3
U       3
X       3
W       2
Z       2
J       1
O       1

starting letters in crates.io index (converted to lowercase)
s       5414
r       4389
c       4316
a       3089
t       3047
p       2991
m       2939
g       2595
l       2505
d       2264
b       2229
f       2121
e       1786
i       1584
n       1577
w       1428
h       1408
o       1226
u       945
k       821
v       760
j       637
x       457
q       378
y       374
z       356
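
code used:
use counter::Counter;

fn main() {
    // Open the local crates.io index checkout, cloning it on first use.
    let index = crates_index::BareIndex::new_cargo_default();
    let repo = index.open_or_clone().unwrap();

    // Count crates by the first character of the lowercased name.
    // The unwrap assumes every crate name is non-empty.
    let counter = repo
        .crates()
        .map(|c| c.name().to_lowercase().chars().next().unwrap())
        .collect::<Counter<_>>();

    // Print the buckets, most common first.
    for (elem, c) in counter.most_common_ordered() {
        println!("{}\t{}", elem, c);
    }
}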

jyn514 (Member) left a comment

This looks awesome, thanks so much ❤️

@Nemo157 may want to take a look at the GROUP BY changes but they seem fine to me.

rustdoc_status = true AND
(
crates.name like $1 OR
crates.name like $2
Review comment (Member):

I think you can use ILIKE here to avoid needing two parameters.

Reply (Member, Author):

That was intentional: ILIKE can only use a BTREE index when the pattern starts with non-alphabetic characters (ones that are not affected by case conversion), so it couldn't use the existing index here. I thought it was a little overkill to create a new index (GIN or BTREE on lower(name)) for that.

(see also the postgres docs on index types)

Another alternative would be to use the db proc normalize_crate_name, which already has a specific index that is used in web::match_version. That also seemed a little off, but I'm happy to change it if you think that fits the project better.

In any case I see this is definitely worth a comment :)
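
For readers following the trade-off, here is a small sketch of the two approaches in plain Rust. It only models single-letter prefix patterns of the form `x%`; the helpers `like_patterns`, `like_prefix` and `ilike_prefix` are hypothetical names, and real LIKE/ILIKE semantics (such as `_` or a `%` in the middle) are deliberately not covered.

```rust
/// Hypothetical helper: the two case-sensitive patterns that would be bound
/// as `$1` and `$2` for one letter, e.g. 'r' -> ("r%", "R%").
fn like_patterns(letter: char) -> (String, String) {
    (
        format!("{}%", letter.to_ascii_lowercase()),
        format!("{}%", letter.to_ascii_uppercase()),
    )
}

/// Models `name LIKE pattern` for patterns with at most one trailing `%`.
fn like_prefix(name: &str, pattern: &str) -> bool {
    match pattern.strip_suffix('%') {
        Some(prefix) => name.starts_with(prefix),
        None => name == pattern,
    }
}

/// Models `name ILIKE pattern` (ASCII only) by lowercasing both sides.
fn ilike_prefix(name: &str, pattern: &str) -> bool {
    like_prefix(&name.to_ascii_lowercase(), &pattern.to_ascii_lowercase())
}

fn main() {
    let (lower, upper) = like_patterns('r');
    for name in ["rand", "Rocket", "serde"] {
        // Two LIKE parameters versus one ILIKE parameter.
        let two_params = like_prefix(name, &lower) || like_prefix(name, &upper);
        let one_param = ilike_prefix(name, &lower);
        println!("{}: LIKE x2 = {}, ILIKE = {}", name, two_params, one_param);
    }
}
```

Both variants select the same names; the difference discussed here is only whether the database can serve the predicate from an existing index.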

@@ -0,0 +1,8 @@
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{% for which in sitemaps -%}
Review comment (Member):

@Kixiron do you know if tera happens to have a "character range" operator? I found range(end=5) but I'm not sure if it works for characters instead.
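
If the template engine turns out not to support it, one option is to build the letter list on the Rust side, where an inclusive `char` range is available in std. The sketch below renders a sitemap index directly as a string just to show the range; the function name `render_sitemap_index`, the base URL and the `/sitemap/{letter}/sitemap.xml` path layout are assumptions for illustration, not the routes of this project.

```rust
/// Renders a sitemap index with one entry per letter `a..=z`.
/// `base_url` and the path layout are illustrative assumptions.
fn render_sitemap_index(base_url: &str) -> String {
    let mut xml = String::from("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    xml.push_str("<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
    // `'a'..='z'` is an inclusive character range that can be iterated directly.
    for letter in 'a'..='z' {
        xml.push_str(&format!(
            "  <sitemap><loc>{}/sitemap/{}/sitemap.xml</loc></sitemap>\n",
            base_url, letter
        ));
    }
    xml.push_str("</sitemapindex>\n");
    xml
}

fn main() {
    print!("{}", render_sitemap_index("https://example.invalid"));
}
```

The same range could also be collected into a `Vec<char>` and handed to the template as the `sitemaps` variable, which keeps the template loop unchanged.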

jyn514 (Member) commented Dec 27, 2020

> make sitemap-index completely static? (though that would have quite some repetition)

Templates render fast enough that I don't think we need to worry about this (especially for non-interactive requests like a sitemap). The database will take way longer than going through tera.

> (the thing I originally wanted was a file-name with a variable element like sitemap.a.xml, but I couldn't find an easy way to get router to handle this, and it's tricky to find much information about iron).

If you can figure it out it sounds neat, but I wouldn't spend too much time on it. We're hoping to switch away from iron sometime soon (#747).

syphar (Member, Author) commented Dec 27, 2020

Thank you for checking the code @jyn514 :)
I added some commits for the requested changes, I hope they satisfy you.

The remaining open questions seem to be for @Kixiron and @Nemo157.

Nemo157 (Member) commented Dec 27, 2020

GROUP BY looks fine, and it seems better to be explicit about it. Analyzing the query, the GROUP BY version actually appears to be faster. It's not using the index for the prefix test, probably because my server is using UTF8 instead of C locale; not sure what's used in production.

image

syphar (Member, Author) commented Dec 27, 2020

@Nemo157 I've seen postgres ignoring indexes when the seq scan is faster,
but I'll create a bigger dataset and validate.

Nemo157 (Member) commented Dec 27, 2020

Yeah, I doubt it matters much; 14ms on a dump from production from ~Feb seems fine. If necessary, when the db gets bigger we can rebuild the index with C locale explicitly so it is usable.

syphar (Member, Author) commented Dec 27, 2020

That being true, I want to understand why it's not being used in this specific case :) (unrelated to this PR)

syphar (Member, Author) commented Dec 27, 2020

yep, it's about the index locale:

cratesfyi=# explain analyze select name from crates where name like 'r%';
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                QUERY PLAN                                                 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Seq Scan on crates  (cost=0.00..1047.90 rows=4176 width=11) (actual time=1.032..35.844 rows=4379 loops=1) │
│   Filter: ((name)::text ~~ 'r%'::text)                                                                    │
│   Rows Removed by Filter: 47293                                                                           │
│ Planning Time: 0.293 ms                                                                                   │
│ Execution Time: 65.746 ms                                                                                 │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(5 rows)

cratesfyi=# create index testindex on crates using btree (name collate pg_catalog."default" varchar_pattern_ops);
CREATE INDEX
cratesfyi=# analyze crates;
ANALYZE
cratesfyi=# explain analyze select name from crates where name like 'r%';
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                           QUERY PLAN                                                            │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Only Scan using testindex on crates  (cost=0.29..175.88 rows=4697 width=11) (actual time=0.032..31.779 rows=4379 loops=1) │
│   Index Cond: ((name ~>=~ 'r'::text) AND (name ~<~ 's'::text))                                                                  │
│   Filter: ((name)::text ~~ 'r%'::text)                                                                                          │
│   Heap Fetches: 0                                                                                                               │
│ Planning Time: 0.551 ms                                                                                                         │
│ Execution Time: 60.490 ms                                                                                                       │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)

as we see, with the current 50k records it doesn't help much.
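
The `Index Cond` in the second plan shows the idea behind the pattern-ops index: a prefix match `'r%'` becomes a range scan from `'r'` up to (but excluding) `'s'` over byte-wise sorted names. A rough analogy in plain Rust, using a `BTreeSet<String>` (which also orders strings byte-wise) in place of the database index; `prefix_scan` is a hypothetical helper and the sample names are illustrative only.

```rust
use std::collections::BTreeSet;

/// Returns all names starting with the single-character `prefix` by scanning
/// the half-open range [prefix, next character) of a sorted set.
/// Intended for ASCII letters; panics if `prefix` has no successor code point.
fn prefix_scan(index: &BTreeSet<String>, prefix: char) -> Vec<&str> {
    // Upper bound: the next code point after `prefix`, e.g. 'r' -> 's'.
    let upper = char::from_u32(prefix as u32 + 1).expect("prefix has a successor");
    index
        .range(prefix.to_string()..upper.to_string())
        .map(String::as_str)
        .collect()
}

fn main() {
    // Illustrative data only.
    let index: BTreeSet<String> = ["rand", "regex", "serde", "Rocket", "quote"]
        .iter()
        .map(|s| s.to_string())
        .collect();

    // Only the lowercase 'r' names are in range; "Rocket" sorts before "quote"
    // in byte order, which is why a case-insensitive match cannot use this scan.
    println!("{:?}", prefix_scan(&index, 'r'));
}
```

This also hints at why the range trick depends on the ordering: with a locale-aware collation the names are not sorted byte-wise, so a simple "from 'r' to 's'" range is no longer guaranteed to contain exactly the matching rows.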

syphar (Member, Author) commented Dec 27, 2020

now the question remains if I should simplify the code by using ILIKE and removing this column?

Or do we want to quickly create the index later and the code should stay like this?

jyn514 (Member) commented Dec 27, 2020

I think if the difference is only 2 ms on a production dump with 200k crates, we should just use ILIKE - there are much slower queries that will need to be fixed if crates.io grows by an order of magnitude.

> It's not using the index for the prefix test, probably because my server is using UTF8 instead of C locale, not sure what's used in production.

$ locale
LANG=C.UTF-8

Not sure which that counts as? But I don't think it's worth messing with for a query that's already very fast. :) Here's the same explain analyze command in prod:

cratesfyi=>  explain analyze select name from crates where name like 'r%';
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Seq Scan on crates  (cost=0.00..2561.46 rows=4767 width=11) (actual time=0.010..17.264 rows=4376 loops=1)
   Filter: ((name)::text ~~ 'r%'::text)
   Rows Removed by Filter: 48236
 Planning time: 0.601 ms
 Execution time: 17.555 ms
(5 rows)

syphar (Member, Author) commented Dec 27, 2020

  • I added the proposed changes,
  • fixed the test,
  • changed to using the character range in the sitemap-index-handler
  • merged the new master and fixed the conflicts

> Not sure which that counts as? But I don't think it's worth messing with for a query that's already very fast. :) Here's the same explain analyze command in prod:

That counts as C locale, and the index is used for the LIKE query (and wouldn't be for ILIKE)

> I think if the difference is only 2 ms on a production dump with 200k crates, we should just use ILIKE - there are much slower queries that will need to be fixed if crates.io grows by an order of magnitude.

@jyn514 here I'm confused: do you prefer the current solution, where the index would be used in production?
Or should I change to ILIKE because it's easier to read and understand?

syphar (Member, Author) commented Dec 27, 2020

(For example, the locale in the docker-container for the local database is set to en_US.utf8.)

jyn514 (Member) commented Dec 27, 2020

I would change to ILIKE because it's easier to read and understand.

> merged the new master and fixed the conflicts

For next time, I prefer rebasing to merging, but it's not a big deal.

syphar (Member, Author) commented Dec 27, 2020

> I would change to ILIKE because it's easier to read and understand.

will do

> > merged the new master and fixed the conflicts
>
> For next time, I prefer rebasing to merging, but it's not a big deal.

good to know, but also easy to change for this PR.

I'm used to doing squash-merging on the final merge to master, sorry about that :)

(could have asked)

syphar (Member, Author) commented Dec 27, 2020

also the force-push after rebase breaks the history in the PR :)

(the changes coming after comments)

syphar (Member, Author) commented Dec 27, 2020

@jyn514 rebased and force-pushed.

Tests work locally, also did another manual test after the changes.

@jyn514 jyn514 merged commit 279752c into rust-lang:master Dec 27, 2020
jyn514 (Member) commented Dec 27, 2020

This is awesome, thanks so much!

@syphar syphar deleted the sitemap-split branch December 27, 2020 16:45
syphar (Member, Author) commented Dec 27, 2020

I'm happy to help.

After seeing that you are doing rebase-merges, I'll squash some commits next time 😀

@jyn514 jyn514 added S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it and removed S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it labels Dec 28, 2020