Why are some diff uploads slow? #1710
Can we get a slowlog published?
It also seems we can have transactions of up to 1M seconds, or roughly 12 days. It could also be a read-only transaction, say a forgotten open psql session on a terminal.
To get a distribution of queries and keep your sanity while reading endless Postgres logfiles: https://github.com/dalibo/pgbadger |
Please don't confuse things, @Komzpa - as you yourself admit, there are lots of reasons for long Postgres transactions that have nothing to do with changeset uploads. A slow log isn't going to help unless the delay is down to a single statement, and in any case trying to sanitise a slow log for publication is not a reasonable request. I'm already working on @zerebubuth's request, so how about we cut out the back-seat debugging long enough to let me investigate... |
So to get back to the point, the longest upload I can see from yesterday was about 503 seconds, or just over 8 minutes. That was changeset 54833679, which had a total elapsed time of 503868ms and an activerecord time of 237165.8ms. |
I've now gone back over all the logs we have, which is basically a week, and the longest I see is changeset 54716666, which has an elapsed time of 1028520ms (around 17 minutes) and an activerecord time of 524142.0ms. |
My analysis last year (drolbr/Overpass-API#346 (comment)) was based on the number of minutely state.txt files in whose txnActiveList a particular transaction number appeared. It could be that some other transaction that is unrelated to the upload process also shows up in that txnActiveList. I think it would probably be a good idea to cross-check those results by going through all minutely diff file contents, to see how much of a delay we experience in the worst case. Let's see if I can get those figures. |
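For anyone who wants to repeat that kind of count, here is a rough sketch of the idea. It is not the original script: it assumes a local mirror of the minutely replication state files and that txnActiveList holds a comma-separated list of transaction ids.

```python
# Count in how many minutely state.txt files each transaction id appears in
# txnActiveList; transactions that stay listed for many minutes correspond to
# long-running uploads (or other long transactions, as noted in this thread).
import glob
from collections import Counter

counts = Counter()
for path in sorted(glob.glob("replication/minute/**/*.state.txt", recursive=True)):
    with open(path) as f:
        for line in f:
            if line.startswith("txnActiveList="):
                ids = line.partition("=")[2].strip()
                if ids:
                    counts.update(ids.split(","))

# Transactions that stayed active across the most minutes:
for txn, minutes in counts.most_common(30):
    print(txn, minutes)
```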
Well, every week you will get a day-long transaction when the backup is running, just for a start. |
I think the background autovacuum jobs can also cause random, very long transactions. |
Thanks @tomhughes! Changeset 54833679 has 1465 ways and 3103 nodes, which is large but doesn't seem excessive. Do you know how the 237s of activerecord time breaks down? Can we assume that's all database and network time, or is activerecord doing any significant processing? It looks, from the timestamps of the nodes and ways in that changeset, that they span the whole 8 minutes, which would seem to imply the other 266s is spent in per-element processing on the node. I'm not aware of any external requests happening (i.e. non-activerecord ones), so the implication appears to be that it's in the parsing code? The next step will be for me, or anyone else who wants to help, to reproduce this locally as part of a test or other automated process so that we can analyse it more thoroughly. |
There's no breakdown - Rails just logs a value which I assume is the total time it spent executing statements. I've just pushed openstreetmap/chef@0d5f853 to enable basic logging of any statement taking over a second. If that isn't too noisy we can think about reducing the time, and if that isn't enough we can think about enabling |
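For context, the standard PostgreSQL knob behind this sort of logging is log_min_duration_statement (a threshold in milliseconds; -1 disables it, 1000 logs any statement taking longer than one second). Below is a minimal sketch for checking the current value, assuming a local database and the psycopg2 driver; the value on the production servers is whatever the chef commit above actually configures.

```python
# Minimal sketch: inspect PostgreSQL's slow-statement logging threshold.
import psycopg2

conn = psycopg2.connect("dbname=openstreetmap")  # hypothetical local database
with conn.cursor() as cur:
    cur.execute("SHOW log_min_duration_statement")
    print("log_min_duration_statement =", cur.fetchone()[0])
conn.close()
```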
Complete analysis for all 765000 minutely diffs in the directory. Top 30:
(reminder to self: changeset limit was changed around Jan 31st 2017 from 50k to 10k)
Full results: diff_ts_minmax.zip (10 MB compressed, 67 MB uncompressed CSV). Overlapping diffs (delta of the max timestamp of the previous diff vs. the min timestamp of the current diff), based on 999907 minutely diffs:
|
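For anyone wanting to reproduce this kind of measurement, here is a rough sketch of the approach. It is not the script used for the numbers above; it assumes a local mirror of the minutely .osc.gz replication files.

```python
# Sketch: for each minutely .osc.gz, record the min and max element timestamp
# and compare the current diff's minimum against the previous diff's maximum.
import glob
import gzip
import xml.etree.ElementTree as ET

prev_max = None
for path in sorted(glob.glob("replication/minute/**/*.osc.gz", recursive=True)):
    with gzip.open(path) as f:
        stamps = [elem.get("timestamp")
                  for _, elem in ET.iterparse(f)
                  if elem.tag in ("node", "way", "relation") and elem.get("timestamp")]
    if not stamps:
        continue
    lo, hi = min(stamps), max(stamps)  # ISO 8601 strings sort chronologically
    if prev_max is not None and lo < prev_max:
        print(f"{path}: min timestamp {lo} lies before previous max {prev_max}")
    prev_max = hi
```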
I ran a very small local test based on changeset 53407241, which took around 15 minutes via JOSM upload. Most CPU time was spent in the puma 001 process (Rails). On the console I noticed a rather large number of single-row INSERT/DELETE/UPDATE operations:
Overall, I also suspect shared/exclusive locks contribute to the larger overall processing time on the server. With long-running transactions, the chances increase that other people happen to update the same objects. One easy way to reproduce this locking situation is to start an upload in JOSM, get bored after 5 minutes, cancel it, and then start the upload again with the same changeset. The second upload will then hang until the "cancelled" upload has also finished on the server, waiting on a lock. Long-running transactions can also impact other transactions that want to acquire a shared lock (see the sketch below). |
Discussion on the redesign has moved to a separate issue. |
Wouldn't that be a good fit for cgimap? |
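A minimal sketch of the blocking behaviour described in the locking comment above, using two database sessions; psycopg2, the test database name and the literal changeset id are illustrative assumptions, not the production setup.

```python
# Sketch: a row locked by one open transaction blocks a second transaction
# that tries to lock the same row - the effect a cancelled-but-still-running
# upload has on a retry that touches the same changeset.
import psycopg2
from psycopg2 import errors  # psycopg2 2.8+

conn_a = psycopg2.connect("dbname=osm_test")  # hypothetical test database
conn_b = psycopg2.connect("dbname=osm_test")

cur_a = conn_a.cursor()
cur_a.execute("SELECT id FROM changesets WHERE id = %s FOR UPDATE", (1,))
# conn_a now holds a row lock that is only released on COMMIT or ROLLBACK.

cur_b = conn_b.cursor()
try:
    # Without NOWAIT this statement would simply hang, which is what the
    # retried JOSM upload experiences; NOWAIT turns the wait into an error.
    cur_b.execute("SELECT id FROM changesets WHERE id = %s FOR UPDATE NOWAIT", (1,))
except errors.LockNotAvailable:
    print("row is still locked by the first, still-open transaction")

conn_a.rollback()  # releasing the lock lets a blocked retry proceed
conn_b.rollback()
conn_a.close()
conn_b.close()
```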
Possibly. I've added an issue so that we can track it there. If anyone would like to help out with the implementation, please do! |
BTW: There's a test version available on a dedicated server now, see this diary post for details: https://www.openstreetmap.org/user/mmd/diary/44318 |
Code is also deployed on https://master.apis.dev.openstreetmap.org for further testing. |
Issue can be closed, see https://operations.osmfoundation.org/2019/05/31/may.html |
Some stats with cgimap in place (starting in April 2020). Overlapping minutely diffs, files 004/000/000.osc.gz - 004/452/716.osc.gz (delta of max timestamp of previous diff vs. min timestamp of current diff):
Minutely diffs with largest overlap
Frequently, the top 40 minutely diffs include changesets with a large number of relation members, such as https://www.openstreetmap.org/changeset/100474253 in 004/442/132.osc.gz, with 197591 relation members in total. |
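A rough sketch of how one could count relation members per changeset in a single minutely diff (illustrative only; it relies on the relation and member elements of the OsmChange format):

```python
# Sketch: sum <member> entries of <relation> elements per changeset id in one
# minutely .osc.gz file, to spot changesets like the one linked above.
import gzip
import xml.etree.ElementTree as ET
from collections import Counter

members_per_changeset = Counter()
with gzip.open("004/442/132.osc.gz") as f:
    for _, elem in ET.iterparse(f):
        if elem.tag == "relation":
            members_per_changeset[elem.get("changeset")] += len(elem.findall("member"))
            elem.clear()  # drop child elements we no longer need

for changeset, members in members_per_changeset.most_common(5):
    print(changeset, members)
```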
As pointed out in this comment by @mmd-osm, diff uploads are buffered in memory before being processed. However, another comment points out that some transactions can take up to 2 hours to process. This seems strange, given that it should just be parsing the XML (which is slow, but not that slow) and talking to the nearby database (which is also slow, but not that slow).
To investigate further, we need to find a recent diff upload that took a long time to process. That should give us some clues to start with.