Proof of concept implementation of get_run via api (RFC). #1993
base: master
Serializing a run with 500 tasks only takes a couple of ms, but I am worried the middleware will wreak havoc with the timings. It would be nice to have a more direct mechanism for communicating between instances than going through nginx + middleware...
redis or memcached, but that's another tool to set up and maintain.

I was thinking more of something basic like shared memory.

Instead of returning the serialized run in the body, the primary instance could write it to shared memory where the secondary instance would find it. But it is not so obvious what the cleanest way is to make this thread-safe. What you would actually want is an in-memory mongodb collection (i.e. a collection which is never written to disk).

Another option is a named pipe. The primary instance creates it, writes the serialized run to it and communicates the name to the secondary instance. The secondary instance reads it, and then removes it. Not sure this is all possible.

Of course we may just as well use a real temporary file for this then.
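For illustration, a minimal sketch of the shared-memory idea, assuming Python's stdlib `multiprocessing.shared_memory` (the names are made up, and the thread-safety question above is left open):

```python
# Minimal sketch of the shared-memory idea, assuming Python's stdlib
# multiprocessing.shared_memory; thread safety is not addressed here.
from multiprocessing import shared_memory


def publish_run(name: str, payload: bytes) -> None:
    # Primary instance: create a named segment and write the serialized run.
    # A real implementation would also record len(payload), since the
    # segment may be rounded up to a whole number of pages.
    shm = shared_memory.SharedMemory(name=name, create=True, size=len(payload))
    shm.buf[: len(payload)] = payload
    shm.close()  # the segment survives until the reader unlinks it


def consume_run(name: str) -> bytes:
    # Secondary instance: read the serialized run, then remove the segment.
    shm = shared_memory.SharedMemory(name=name)
    payload = bytes(shm.buf)
    shm.close()
    shm.unlink()
    return payload
```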
Right now PROD is able to withstand 60k cores, or 100k cores with #1983 (the usual contribution is 3-5k cores).
Snippet of the nginx config:

```nginx
server {
    listen 80;
    listen [::]:80;
    server_name dfts-0.pigazzini.it;
    location / {
        return 301 https://$host/tests;
    }
    location /api/ {
        proxy_set_header Connection "";
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Port $server_port;
        proxy_set_header X-Country-Code $region;
        client_max_body_size 120m;
        client_body_buffer_size 128k;
        proxy_connect_timeout 5s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
        proxy_max_temp_file_size 0;
        proxy_redirect off;
        proxy_http_version 1.1;
        proxy_pass http://$backends;
        location /api/serialize_run/ {
            allow 127.0.0.1;
            deny all;
            proxy_pass http://$backends;
        }
    }
}

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2 ipv6only=on;
    ...
    location / {
        rewrite ^/$ /tests permanent;
        ...
        proxy_pass http://$backends;
        location /api/serialize_run/ {
            deny all;
        }
    }
}
```
Well #1983 is something else. This PR exists because in #1985 and #1992 the run read from the database may be missing some tasks compared to what's in the cache. This could lead to confusing behavior. This PR seems conceptually to be the most elegant way to fix that, but I am not sure if it scales...

EDIT: Note that this PR may be optimized a little bit by skipping the base64 encoding and just returning the gzipped data as is.
I was talking about the named pipe / shared memory ideas, not this PR. I'm not a fan of adding another tool and hitting its rough corners, but redis is the standard tool for caching the db in memory.

```bash
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt update
sudo apt -y install redis
sudo systemctl status redis-server
```

On the Python side, something like this:

```python
# pip install redis
import json
import time

import redis
from bson.objectid import ObjectId

r = redis.Redis(host="localhost", port=6379, db=0)


def get_run(self, r_id):
    r_id = str(r_id)
    run_cache_item = r.get(r_id)
    if run_cache_item is not None:
        # Cache hit: refresh the read time and return the cached run.
        run_cache_item = json.loads(run_cache_item)
        run_cache_item["rtime"] = time.time()
        r.set(r_id, json.dumps(run_cache_item))
        return run_cache_item["run"]
    # Cache miss: load from mongodb and populate the cache.
    run = self.runs.find_one({"_id": ObjectId(r_id)})
    if run is not None:
        run_cache_item = {
            "rtime": time.time(),
            "ftime": time.time(),
            "run": run,
            "dirty": False,
        }
        r.set(r_id, json.dumps(run_cache_item))
    return run


def buffer(self, run, flush):
    r_id = str(run["_id"])
    if flush:
        # Write through: update the cache and persist to mongodb.
        run_cache_item = {
            "dirty": False,
            "rtime": time.time(),
            "ftime": time.time(),
            "run": run,
        }
        r.set(r_id, json.dumps(run_cache_item))
        self.runs.replace_one({"_id": ObjectId(r_id)}, run)
    else:
        # Mark the cache entry dirty, keeping the last flush time if present.
        run_cache_item = r.get(r_id)
        if run_cache_item is not None:
            run_cache_item = json.loads(run_cache_item)
            ftime = run_cache_item["ftime"]
        else:
            ftime = time.time()
        run_cache_item = {
            "dirty": True,
            "rtime": time.time(),
            "ftime": ftime,
            "run": run,
        }
        r.set(r_id, json.dumps(run_cache_item))
```

But datetime and id must be converted before the dump:

```python
run["start_time"] = run["start_time"].isoformat()
run["_id"] = str(run["_id"])
```
I don't understand. This still seems to return the run from the db, which is not what you want...

Written on the fly, code fixed.

One can use bson instead of json. bson doesn't have this issue.
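A minimal sketch of the difference, assuming pymongo's `bson` package:

```python
# Unlike json, bson round-trips ObjectId and datetime values without
# manual conversion (sketch assuming pymongo's bson package).
from datetime import datetime, timezone

import bson
from bson.objectid import ObjectId

doc = {"_id": ObjectId(), "start_time": datetime.now(timezone.utc)}
data = bson.encode(doc)  # bytes, usable directly as a redis value
decoded = bson.decode(data, codec_options=bson.CodecOptions(tz_aware=True))
assert decoded["_id"] == doc["_id"]  # still an ObjectId, not a string
```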
@ppigazzini Your code seems to be much more ambitious than what I am trying to do in this PR.
In the latest push I skip the base64 encoding. This makes the payload smaller.

EDIT: I should perhaps point out that runs compress extremely well (obviously because there is so much redundancy). I did an experiment with a typical run (500 tasks, 50 bad tasks and 101 SPSA history entries) and the compression ratio was 77:1, i.e. 425k vs 5.5k.
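For illustration, the round trip might look like this. The function names are made up for this sketch; the payload format (bson, gzipped, no base64) is as described in this thread:

```python
# Illustrative round trip for a serialized run: bson + gzip, no base64.
import gzip

import bson


def serialize_run(run: dict) -> bytes:
    # Runs compress extremely well (~77:1 in the experiment above).
    return gzip.compress(bson.encode(run))


def deserialize_run(payload: bytes) -> dict:
    return bson.decode(
        gzip.decompress(payload),
        codec_options=bson.CodecOptions(tz_aware=True),
    )
```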
(force-pushed from ae56deb to f7e42b0)
In the nginx config we set a few rate limits at the http level, along with the cookie module with a JS challenge. Those rate limits should apply to this API as well. I must replicate the filtering on DEV to verify the hypothesis and test a new configuration to whitelist localhost. BTW: DEV is running with the PR.
From api.log:

On PROD:

PROD nginx is running without limits in place right now.
You mean this PR could work if I rebase it?

It worked on DEV, so I expect that it should work on PROD as well now with a simpler nginx configuration.

Ok, tonight I will have a look...

Rebased. I hope I did not mess up too much.

DEV is running with the PR; logs are clean right now.
It seems more expensive. This is an extract of …
PROD has been running with the PR for some time now. Timings before the change:

```
# logging from [28/Jun/2024:17:36:27 +0000] to [28/Jun/2024:18:51:30 +0000] ########
# duration (seconds)            : 4503
# calls in total                : 10000
# calls per second              : 2
# calls not reaching the backend: 42
# calls handled by the backend  : 9958
# route                  calls     total  average  minimum  maximum
/tests                     251    84.421    0.336    0.008    0.469
/tests/user/                99    22.129    0.224    0.035    0.275
/api/update_task          4759    17.804    0.004    0.001    0.042
/api/actions              1350    12.130    0.009    0.002    0.115
/tests/run                   8    10.708    1.339    0.490    2.408
/api/beat                 2447     7.666    0.003    0.001    0.105
/tests/finished             61     5.852    0.096    0.027    0.143
/api/upload_pgn            153     3.825    0.025    0.008    0.109
/api/active_runs           121     3.714    0.031    0.004    2.615
/actions                     3     3.468    1.156    0.829    1.332
/api/get_elo/              122     3.414    0.028    0.008    0.048
/tests/machines              9     1.931    0.215    0.178    0.250
/tests/live_elo/            70     1.389    0.020    0.003    0.034
/tests/view/                50     1.307    0.026    0.013    0.054
/contributors                4     1.037    0.259    0.130    0.377
/api/request_task          173     0.775    0.004    0.002    0.029
/tests/tasks/               11     0.593    0.054    0.022    0.190
/nns                        19     0.430    0.023    0.009    0.121
/api/request_version       181     0.290    0.002    0.001    0.006
/api/nn/                    20     0.108    0.005    0.003    0.006
/api/request_spsa           35     0.101    0.003    0.002    0.004
/tests/approve               2     0.040    0.020    0.013    0.027
/contributors/monthly        1     0.034    0.034    0.034    0.034
/tests/delete                1     0.025    0.025    0.025    0.025
/login                       3     0.023    0.008    0.004    0.010
/api/failed_task             2     0.018    0.009    0.008    0.010
/                            1     0.009    0.009    0.009    0.009
/sprt_calc                   1     0.008    0.008    0.008    0.008
/workers/                    1     0.003    0.003    0.003    0.003
```

Timings after the change:

```
# logging from [28/Jun/2024:18:59:01 +0000] to [28/Jun/2024:20:08:07 +0000] ########
# duration (seconds)            : 4146
# calls in total                : 10000
# calls per second              : 2
# calls not reaching the backend: 37
# calls handled by the backend  : 9963
# route                  calls     total  average  minimum  maximum
/tests                     215    63.553    0.296    0.001    0.516
/api/update_task          5047    19.137    0.004    0.001    0.041
/tests/user/                52    11.593    0.223    0.026    0.279
/api/actions              1135     9.883    0.009    0.002    0.050
/api/upload_pgn            172     8.999    0.052    0.007    0.171
/api/beat                 2369     7.177    0.003    0.001    0.031
/tests/run                   4     5.402    1.351    0.627    2.269
/tests/finished             45     4.634    0.103    0.026    0.133
/actions                     7     4.419    0.631    0.080    0.947
/api/serialize_run/        172     4.306    0.025    0.001    0.108
/api/active_runs           106     3.344    0.032    0.004    2.456
/contributors                5     1.842    0.368    0.220    0.516
/tests/machines              6     1.327    0.221    0.182    0.251
/api/get_elo/               54     1.321    0.024    0.008    0.045
/tests/view/                43     1.197    0.028    0.002    0.087
/api/request_task          200     0.855    0.004    0.002    0.028
/tests/live_elo/            33     0.462    0.014    0.002    0.027
/api/request_version       228     0.366    0.002    0.001    0.004
/tests/tasks/                6     0.324    0.054    0.030    0.147
/api/pgn/                    3     0.231    0.077    0.041    0.124
/nns                        11     0.195    0.018    0.014    0.024
/                            4     0.131    0.033    0.001    0.113
/api/nn/                    22     0.116    0.005    0.004    0.006
/api/request_spsa           20     0.055    0.003    0.002    0.004
/tests/stop                  1     0.047    0.047    0.047    0.047
/tests/approve               2     0.027    0.013    0.013    0.014
/api/failed_task             1     0.008    0.008    0.008    0.008
```
For validating the task id, it is probably overkill to ask the primary instance for the whole run.
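A hypothetical sketch of such a lighter-weight check (the endpoint, the `request.rundb` wiring, and the task fields are illustrative assumptions, not code from this PR):

```python
# Hypothetical lighter-weight endpoint: answer "is this task still valid?"
# with a boolean instead of shipping the whole serialized run.
# All names below are illustrative.
def validate_task(request):
    run_id = request.matchdict["run_id"]
    task_id = int(request.matchdict["task_id"])
    run = request.rundb.get_run(run_id)
    if run is None or not 0 <= task_id < len(run["tasks"]):
        return {"valid": False}
    return {"valid": bool(run["tasks"][task_id].get("active", False))}
```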
By the way, logs are clean.
Monitoring PROD I saw many nginx warnings about request bodies being buffered to disk during the pgn upload, so I raised the nginx buffer to lower the number of warnings:

```diff
- client_body_buffer_size 128k;
+ client_body_buffer_size 1m;
```

```
2024/06/28 19:23:44 [warn] 94515#94515: *88 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000001, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 19:26:19 [warn] 94516#94516: *429 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000002, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 19:43:38 [warn] 94516#94516: *2606 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000003, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 19:50:55 [warn] 94515#94515: *3541 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000004, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 19:56:54 [warn] 94516#94516: *4245 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000005, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 20:00:08 [warn] 94516#94516: *4679 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000006, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 20:02:35 [warn] 94515#94515: *5023 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000007, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 20:09:42 [warn] 94515#94515: *5898 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000008, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
```
Before & after (the second row is from the "after" table above):

```
/api/upload_pgn            153     3.825    0.025    0.008    0.109
/api/upload_pgn            172     8.999    0.052    0.007    0.171
```

This confirms what I thought.
8 hours of timings with the PR using the new nginx config (no cache on disk during the pgn upload):

```
# logging from [29/Jun/2024:00:00:26 +0000] to [29/Jun/2024:08:53:53 +0000] ########
# duration (seconds)            : 32007
# calls in total                : 100000
# calls per second              : 3
# calls not reaching the backend: 330
# calls handled by the backend  : 99670
# route                  calls     total  average  minimum  maximum
/tests                    1467   504.141    0.344    0.001    0.576
/tests/user/               656   149.462    0.228    0.013    0.991
/api/update_task         35768   135.750    0.004    0.001    0.173
/api/actions              9269    95.834    0.010    0.001    0.142
/api/get_elo/             2563    73.265    0.029    0.007    0.134
/actions                   102    62.721    0.615    0.038    3.352
/api/upload_pgn           1160    60.732    0.052    0.008    0.238
/api/beat                16149    49.470    0.003    0.001    0.109
/api/run_pgns/               6    41.123    6.854    2.346   15.032
/contributors               46    34.405    0.748    0.000   23.490
/tests/run                  44    29.028    0.660    0.001    2.626
/tests/finished            280    28.911    0.103    0.018    0.181
/api/serialize_run/       1160    27.212    0.023    0.001    0.111
/tests/view/               779    18.880    0.024    0.001    0.549
/tests/live_elo/           889    17.377    0.020    0.001    0.294
/tests/machines             37     7.921    0.214    0.000    0.276
/api/active_runs           803     7.429    0.009    0.004    0.092
/api/request_task         1353     6.145    0.005    0.002    0.131
/tests/tasks/              207     4.577    0.022    0.002    0.138
/api/request_version      1428     2.366    0.002    0.000    0.032
/nns                        81     1.695    0.021    0.007    0.100
/user_management            16     1.577    0.099    0.000    0.222
/api/request_spsa          410     1.213    0.003    0.001    0.038
/                          137     0.787    0.006    0.001    0.014
/api/nn/                   147     0.773    0.005    0.002    0.016
/tests/stats/               13     0.711    0.055    0.020    0.073
/contributors/monthly       23     0.500    0.022    0.011    0.033
/login                      43     0.329    0.008    0.001    0.013
/tests/approve              10     0.250    0.025    0.016    0.028
/workers/                   10     0.158    0.016    0.003    0.060
/user                        8     0.141    0.018    0.008    0.026
/signup                      8     0.127    0.016    0.007    0.061
/user/                       6     0.099    0.017    0.002    0.028
/tests/modify                4     0.087    0.022    0.010    0.035
/api/failed_task             5     0.043    0.009    0.002    0.016
/tests/stop                  2     0.041    0.021    0.014    0.027
/sprt_calc                   2     0.022    0.011    0.008    0.014
/api/finished_runs           2     0.018    0.009    0.006    0.012
/api/get_run/                1     0.014    0.014    0.014    0.014
/tests/delete                1     0.003    0.003    0.003    0.003
```
It seems the timings for […] I think the […] I will try to cook something up.
12 hours of timings with master and the updated nginx configuration (it solves the warnings in the log, but there is no improvement in the average timing of the pgn upload):

```
# logging from [29/Jun/2024:10:08:36 +0000] to [29/Jun/2024:22:28:46 +0000] ########
# duration (seconds)            : 44410
# calls in total                : 100000
# calls per second              : 2
# calls not reaching the backend: 456
# calls handled by the backend  : 99544
# route                  calls     total  average  minimum  maximum
/tests                    3442  1156.082    0.336    0.000    0.898
/api/update_task         47904   181.824    0.004    0.000    0.921
/tests/user/               694   163.920    0.236    0.023    0.645
/api/run_pgns/              12   152.795   12.733    0.768   30.307
/api/actions             12483   132.893    0.011    0.002    0.856
/api/beat                22386    71.249    0.003    0.000    1.799
/tests/run                  55    61.134    1.112    0.002    2.685
/actions                    90    58.371    0.649    0.031    2.754
/tests/view/               875    57.171    0.065    0.002    2.370
/tests/finished            471    52.820    0.112    0.018    3.009
/api/get_elo/             1629    45.579    0.028    0.007    0.138
/api/upload_pgn           1683    43.686    0.026    0.005    0.175
/tests/machines            102    23.422    0.230    0.148    0.281
/api/active_runs          1177    20.413    0.017    0.001    2.652
/contributors               59    16.381    0.278    0.000    0.725
/tests/live_elo/           789    14.208    0.018    0.002    0.125
/tests/tasks/              169    10.483    0.062    0.001    0.216
/api/request_task         1925     8.604    0.004    0.002    0.047
/api/nn/                   646     3.432    0.005    0.000    0.066
/api/request_version      2004     3.385    0.002    0.001    0.109
/nns                       142     3.047    0.021    0.002    0.178
/contributors/monthly       61     1.515    0.025    0.008    0.157
/api/request_spsa          453     1.230    0.003    0.001    0.018
/                          122     0.925    0.008    0.001    0.080
/tests/stats/               11     0.619    0.056    0.006    0.106
/tests/approve              20     0.503    0.025    0.010    0.031
/login                      44     0.389    0.009    0.002    0.015
/signup                     22     0.330    0.015    0.002    0.098
/tests/modify               13     0.313    0.024    0.015    0.030
/user_management             9     0.311    0.035    0.001    0.168
/tests/stop                  7     0.250    0.036    0.025    0.047
/api/failed_task            16     0.157    0.010    0.004    0.022
/workers/                   16     0.139    0.009    0.002    0.029
/user/                       6     0.103    0.017    0.003    0.036
/sprt_calc                   3     0.033    0.011    0.009    0.014
/api/finished_runs           2     0.022    0.011    0.005    0.017
/user                        1     0.007    0.007    0.007    0.007
/api/get_run/                1     0.006    0.006    0.006    0.006
```
The convention throughout the code base is that run_id is a string. Since a run_id and the corresponding ObjectId print in exactly the same way, violating this convention may lead to hard-to-understand bugs such as this one https://tests.stockfishchess.org/actions?action=log_message&user=&text=internal which was introduced in #1993.
With this PR `/api/serialize_run`, hosted by the main instance, can be invoked by a secondary instance to receive a copy of the run from the cache. In this way there is never an issue that the run might not be up to date.

Sadly I don't know how feasible this is. A typical serialized run is 5.5k. Is this too much?

Note: `/api/serialize_run` restricts access to `localhost` over `http`. Some form of access control is necessary, but it can also be done externally.

I have tested that the PR works. The deserialized runs pass validation (for this I needed to supply the `tz_aware` codec option to `bson.decode()`).
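For illustration, a sketch of the secondary instance's side under the assumptions above (an HTTP GET over localhost and a gzipped bson payload; the exact URL shape and function name are assumptions, not the PR's code):

```python
import gzip

import bson
import requests


def get_run_from_primary(run_id: str) -> dict:
    # The path is modeled on the nginx "location /api/serialize_run/" above.
    resp = requests.get(f"http://127.0.0.1/api/serialize_run/{run_id}", timeout=5)
    resp.raise_for_status()
    return bson.decode(
        gzip.decompress(resp.content),
        codec_options=bson.CodecOptions(tz_aware=True),
    )
```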