Split single sitemap into index and sub-sitemaps by starting letter in crate-name #1222

syphar · 2020-12-26T11:26:41Z

This is a first draft of an implementation for #1174.

My idea was to split the sitemap by starting letter, since

this is rather static on the index-side,
allows an easy and indexed filtering by postgres when generating the sub-sitemaps,
and doesn't change the result when re-requesting a site-map.

(crate-count per letter below)
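To make the idea concrete, here is a minimal sketch of an index that lists one sub-sitemap per starting letter. The function names, the `base_url` parameter and the `/-/sitemap/{letter}/sitemap.xml` path are assumptions made for illustration only; the PR renders the index through a tera template instead.

```rust
/// Letters used to split the sitemap, one sub-sitemap per starting letter.
fn sitemap_letters() -> Vec<char> {
    ('a'..='z').collect()
}

/// Render a sitemap index as a string.
/// `base_url` and the URL path below are hypothetical.
fn render_sitemap_index(base_url: &str) -> String {
    let mut xml = String::from("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    xml.push_str("<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
    for letter in sitemap_letters() {
        xml.push_str(&format!(
            "    <sitemap><loc>{}/-/sitemap/{}/sitemap.xml</loc></sitemap>\n",
            base_url, letter
        ));
    }
    xml.push_str("</sitemapindex>\n");
    xml
}
```

Because the list of letters is fixed, the index stays the same no matter how many crates are added.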

Also I changed the query from DISTINCT ON to using GROUP BY and MAX.
Since we don't do ORDER BY, DISTINCT ON relies on the order the rows have on disk. As long as you only ever work on single records, this doesn't matter, because the first records (which DISTINCT ON picks) are the newest ones. That can change with different data-loading techniques or when many records are handled manually. I think GROUP BY with MAX is more explicit, and so the better choice here.
But I'm also happy to revert that part if you don't think it's a good idea.
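The difference can be illustrated in memory. This is only a sketch: the `Release` struct and its field names are made up for the example and are not the real schema. One function mimics GROUP BY with MAX, the other mimics an unordered DISTINCT ON by keeping whichever row shows up first.

```rust
use std::collections::HashMap;

/// A release row; the field names are illustrative, not the real schema.
struct Release {
    crate_name: &'static str,
    release_time: u64, // e.g. a Unix timestamp
}

/// Similar in spirit to `SELECT name, MAX(release_time) ... GROUP BY name`:
/// the result does not depend on the order in which rows arrive.
fn latest_release_per_crate(rows: &[Release]) -> HashMap<&'static str, u64> {
    let mut latest = HashMap::new();
    for row in rows {
        let entry = latest.entry(row.crate_name).or_insert(row.release_time);
        if row.release_time > *entry {
            *entry = row.release_time;
        }
    }
    latest
}

/// Roughly what `DISTINCT ON (name)` without ORDER BY does:
/// keep whichever row for a name happens to be seen first.
fn first_seen_per_crate(rows: &[Release]) -> HashMap<&'static str, u64> {
    let mut first = HashMap::new();
    for row in rows {
        first.entry(row.crate_name).or_insert(row.release_time);
    }
    first
}
```

With rows stored newest-first both functions agree; once the storage order changes, only the GROUP BY variant still returns the newest release.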

While I have a (long) history in software-dev and databases, I'm relatively new to rust. Also this is my first contribution to this project. So I'm happy to implement any changes / improvements.

Things that I could see, but I'm not sure if necessary:

different tests? more tests? (I tried to follow the existing pattern / depth)
make sitemap-index completely static? (though that would have quite some repetition)
make robots.txt reference the sub-sitemaps, not the index? (so, also use a template?)
different URL-pattern? (the thing I originally wanted was a file-name with a variable element like sitemap.a.xml, but I couldn't find an easy way to get router to handle this, and it's tricky to find much information about iron).
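For the file-name variant, one possible approach (a sketch only, not tied to iron's router API) would be to accept the whole file name as a single route parameter and validate it in the handler. The helper name below is hypothetical.

```rust
/// Extract the letter from a file name such as `sitemap.a.xml`.
/// Returns `None` unless the variable part is exactly one ASCII lowercase letter.
fn sitemap_letter_from_filename(filename: &str) -> Option<char> {
    // Strip the fixed prefix and suffix; bail out if either is missing.
    let middle = filename.strip_prefix("sitemap.")?.strip_suffix(".xml")?;
    let mut chars = middle.chars();
    match (chars.next(), chars.next()) {
        (Some(letter), None) if letter.is_ascii_lowercase() => Some(letter),
        _ => None,
    }
}
```

A handler could answer with a 404 whenever this returns `None`.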

the data (tm)

starting letters in crates.io index

s       5406
r       4376
c       4305
a       3080
t       3042
p       2983
m       2931
g       2582
l       2500
d       2259
b       2226
f       2116
e       1783
i       1576
n       1570
w       1426
h       1398
o       1225
u       942
k       818
v       760
j       636
x       454
q       374
y       370
z       354
G       13
R       13
C       11
H       10
A       9
I       8
M       8
P       8
S       8
N       7
D       5
F       5
L       5
T       5
Q       4
Y       4
B       3
E       3
K       3
U       3
X       3
W       2
Z       2
J       1
O       1

starting letters in crates.io index (convert to lowercase)

s       5414
r       4389
c       4316
a       3089
t       3047
p       2991
m       2939
g       2595
l       2505
d       2264
b       2229
f       2121
e       1786
i       1584
n       1577
w       1428
h       1408
o       1226
u       945
k       821
v       760
j       637
x       457
q       378
y       374
z       356

code used

use counter::Counter;

fn main() {
    // Open (or clone) the local copy of the crates.io index.
    let index = crates_index::BareIndex::new_cargo_default();
    let repo = index.open_or_clone().unwrap();

    // Count crates by the lowercased first character of their name.
    let counter = repo
        .crates()
        .map(|c| c.name().to_lowercase().chars().next().unwrap())
        .collect::<Counter<_>>();

    // Print the characters from most to least common.
    for (elem, c) in counter.most_common_ordered() {
        println!("{}\t{}", elem, c);
    }
}

jyn514

This looks awesome, thanks so much ❤️

@Nemo157 may want to take a look at the GROUP BY changes but they seem fine to me.

jyn514 · 2020-12-27T05:43:25Z

src/web/sitemap.rs

+                rustdoc_status = true AND 
+                ( 
+                    crates.name like $1 OR 
+                    crates.name like $2


I think you can use ILIKE here to avoid needing two parameters.

That was intentional: ILIKE can only use a BTREE index when the pattern starts with non-alphabetic characters, i.e. ones that are unaffected by case conversion, which is not the case for a letter prefix. I thought it was a little overkill to create a new index (GIN or BTREE on lower(name)) for that.

(see also the postgres docs on index types)

Another alternative would be to use the db proc normalize_crate_name, which already has a specific index that is used in web::match_version. That also seemed a little off to me, but I'm happy to change it if you think it fits the project better.

In any case I see this is definitely worth a comment :)
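As a rough illustration of the two options discussed here: the first helper builds the two case-sensitive patterns that would be bound as `$1` and `$2`, the second builds the single pattern an ILIKE query would need. The function names are hypothetical, and `matches_letter` is just an ASCII-only in-memory stand-in for the filter, not what postgres does internally.

```rust
/// Build the two case-sensitive `LIKE` patterns for one sitemap letter,
/// e.g. 'r' -> ("r%", "R%").
fn like_patterns(letter: char) -> (String, String) {
    (
        format!("{}%", letter.to_ascii_lowercase()),
        format!("{}%", letter.to_ascii_uppercase()),
    )
}

/// Build the single `ILIKE` pattern that replaces both of the above.
fn ilike_pattern(letter: char) -> String {
    format!("{}%", letter.to_ascii_lowercase())
}

/// In-memory stand-in for `name ILIKE 'x%'` (ASCII only).
fn matches_letter(name: &str, letter: char) -> bool {
    name.chars()
        .next()
        .map_or(false, |first| first.eq_ignore_ascii_case(&letter))
}
```

Both variants select the same crates; the trade-off is between possible index usage (two LIKE patterns) and readability (one ILIKE pattern).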

jyn514 · 2020-12-27T05:47:10Z

templates/core/sitemapindex.xml

@@ -0,0 +1,8 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+    {% for which in sitemaps -%}


@Kixiron do you know if tera happens to have a "character range" operator? I found range(end=5) but I'm not sure if it works for characters instead.

jyn514 · 2020-12-27T05:52:22Z

make sitemap-index completely static? (though that would have quite some repetition)

Templates render fast enough I don't think we need to worry about this (especially for non-interactive requests like a sitemap). The database will take way longer than going through tera.

different URL-pattern? (the thing I originally wanted was a file-name with a variable element like sitemap.a.xml, but I couldn't find an easy way to get router to handle this, and it's tricky to find much information about iron).

If you can figure it out it sounds neat, but I wouldn't spend too much time on it. We're hoping to switch away from iron sometime soon (#747).

syphar · 2020-12-27T08:14:40Z

Thank you for checking the code @jyn514 :)
I added some commits for the requested changes, I hope they satisfy you.

The remaining open questions seem to be for @Kixiron and @Nemo157.

Nemo157 · 2020-12-27T09:14:42Z

GROUP BY looks fine, and it seems better to be explicit about it. Analyzing the query the GROUP BY actually appears to be faster. It's not using the index for the prefix test, probably because my server is using UTF8 instead of C locale, not sure what's used in production.

syphar · 2020-12-27T10:21:12Z

@Nemo157 I've seen postgres ignoring indexes when the seq scan is faster,
but I'll create a bigger dataset and validate.

Nemo157 · 2020-12-27T10:36:41Z

Yeah, I doubt it matters much, 14ms on a dump from production from ~Feb seems fine, if necessary when the db gets bigger we can rebuild the index with C locale explicitly so it is usable.

syphar · 2020-12-27T10:42:59Z

that being true, I want to understand why it's not being used in this specific case :) (unrelated to this PR)

syphar · 2020-12-27T11:00:15Z

yep, it's about the index locale:

cratesfyi=# explain analyze select name from crates where name like 'r%';
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                QUERY PLAN                                                 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Seq Scan on crates  (cost=0.00..1047.90 rows=4176 width=11) (actual time=1.032..35.844 rows=4379 loops=1) │
│   Filter: ((name)::text ~~ 'r%'::text)                                                                    │
│   Rows Removed by Filter: 47293                                                                           │
│ Planning Time: 0.293 ms                                                                                   │
│ Execution Time: 65.746 ms                                                                                 │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(5 rows)

cratesfyi=# create index testindex on crates using btree (name collate pg_catalog."default" varchar_pattern_ops);
CREATE INDEX
cratesfyi=# analyze crates;
ANALYZE
cratesfyi=# explain analyze select name from crates where name like 'r%';
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                           QUERY PLAN                                                            │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Only Scan using testindex on crates  (cost=0.29..175.88 rows=4697 width=11) (actual time=0.032..31.779 rows=4379 loops=1) │
│   Index Cond: ((name ~>=~ 'r'::text) AND (name ~<~ 's'::text))                                                                  │
│   Filter: ((name)::text ~~ 'r%'::text)                                                                                          │
│   Heap Fetches: 0                                                                                                               │
│ Planning Time: 0.551 ms                                                                                                         │
│ Execution Time: 60.490 ms                                                                                                       │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)

as we see, with the current 50k records it doesn't help much.
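The Index Cond in the second plan (`name ~>=~ 'r' AND name ~<~ 's'`) shows how a prefix test becomes a range scan once the index compares names byte by byte. The following sketch mimics that with a `BTreeSet`, whose `String` ordering is also byte-wise. It is an analogy under that assumption, not a description of postgres internals, and the helper name is made up.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Return all names starting with `prefix` by scanning only the key range
/// [prefix, next_prefix). Assumes an ASCII prefix whose last byte is below 0x7f.
fn names_with_prefix<'a>(names: &'a BTreeSet<String>, prefix: &str) -> Vec<&'a str> {
    if prefix.is_empty() {
        // An empty prefix matches everything.
        return names.iter().map(String::as_str).collect();
    }
    // Upper bound: bump the last byte, so "r" becomes "s".
    let mut upper = prefix.as_bytes().to_vec();
    if let Some(last) = upper.last_mut() {
        *last += 1;
    }
    let upper = String::from_utf8(upper).expect("ASCII prefix stays valid UTF-8");

    names
        .range::<str, _>((Bound::Included(prefix), Bound::Excluded(upper.as_str())))
        .map(String::as_str)
        .collect()
}
```

Note that in byte order uppercase letters sort before lowercase ones, so a scan for "r" does not see names starting with "R".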

syphar · 2020-12-27T11:03:18Z

now the question remains if I should simplify the code by using ILIKE and removing this column?

Or do we want to quickly create the index later and the code should stay like this?

jyn514 · 2020-12-27T13:22:09Z

I think if the difference is only 2 ms on a production dump with 200k crates, we should just use ILIKE - there are much slower queries that will need to be fixed if crates.io grows by an order of magnitude.

It's not using the index for the prefix test, probably because my server is using UTF8 instead of C locale, not sure what's used in production.

$ locale
LANG=C.UTF-8

Not sure which that counts as? But I don't think it's worth messing with for a query that's already very fast. :) Here's the same explain analyze command in prod:

cratesfyi=>  explain analyze select name from crates where name like 'r%';
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Seq Scan on crates  (cost=0.00..2561.46 rows=4767 width=11) (actual time=0.010..17.264 rows=4376 loops=1)
   Filter: ((name)::text ~~ 'r%'::text)
   Rows Removed by Filter: 48236
 Planning time: 0.601 ms
 Execution time: 17.555 ms
(5 rows)

syphar · 2020-12-27T15:48:56Z

I added the proposed changes,
fixed the test,
changed to using the character range in the sitemap-index-handler
merged the new master and fixed the conflicts

Not sure which that counts as? But I don't think it's worth messing with for a query that's already very fast. :) Here's the same explain analyze command in prod:

That counts as C locale, and the index is used for the LIKE query (and wouldn't be for ILIKE)

I think if the difference is only 2 ms on a production dump with 200k crates, we should just use ILIKE - there are much slower queries that will need to be fixed if crates.io grows by an order of magnitude.

@jyn514 here I'm confused, you prefer the current solution where the index would be used in production?
or should I change to ILIKE because it's easier to read and understand?

syphar · 2020-12-27T15:49:46Z

(for example, the locale in the docker-container for the local database is set to en_US.utf8)

jyn514 · 2020-12-27T15:51:38Z

I would change to ILIKE because it's easier to read and understand.

merged the new master and fixed the conflicts

For next time, I prefer rebasing to merging, but it's not a big deal.

syphar · 2020-12-27T15:53:24Z

I would change to ILIKE because it's easier to read and understand.

will do

merged the new master and fixed the conflicts

For next time, I prefer rebasing to merging, but it's not a big deal.

good to know, but also easy to change for this PR.

I'm used to doing squash-merging on the final merge to master, sorry about that :)

(could have asked)

syphar · 2020-12-27T15:59:13Z

also the force-push after rebase breaks the history in the PR :)

(the changes coming after comments)

syphar · 2020-12-27T16:00:54Z

@jyn514 rebased and force-pushed.

Tests work locally, also did another manual test after the changes.

jyn514 · 2020-12-27T16:12:56Z

This is awesome, thanks so much!

syphar · 2020-12-27T16:46:14Z

I'm happy to help.

After seeing you are doing rebase-merges I'll squash some commits next time 😀

syphar mentioned this pull request Dec 26, 2020

Add support for multiple sitemaps #1174

Closed

jyn514 reviewed Dec 27, 2020

syphar and others added 8 commits December 27, 2020 16:57

Split single sitemap into index and sub-sitemaps per starting character

53b1092

rename route parameter name

8ee50fd

add comment about psql LIKE index usage

906720a

add more tests, fail more often

e293a50

Update src/web/sitemap.rs

ccc59c2

Co-authored-by: Joshua Nelson <joshua@yottadb.com>

Update src/web/sitemap.rs

0126fc8

Co-authored-by: Joshua Nelson <joshua@yottadb.com>

use char range where possible

40c10bf

switch from LIKE to ILIKE for better readability

62dcdef

syphar force-pushed the sitemap-split branch from 8eef24d to 62dcdef Compare December 27, 2020 16:00

jyn514 approved these changes Dec 27, 2020

jyn514 merged commit 279752c into rust-lang:master Dec 27, 2020

syphar deleted the sitemap-split branch December 27, 2020 16:45

jyn514 added and then removed the S-waiting-on-deploy label (This PR is ready to be merged, but is waiting for an admin to have time to deploy it) Dec 28, 2020
