Missing upgrade script for 0.3.2 to 0.5.0 #274
Comments
Hi Tom, which one of the packages did you use to install the extension? The SQL script is not present in the repository, because it’s generated in the build process.
The deb from the timescaledb repository at https://packagecloud.io/timescale/timescaledb/ubuntu.
The GitHub repository doesn't seem to have it either:
The file layout is correct. There's no direct path between 0.3.2 and 0.5.0.
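As a side note, the upgrade paths that PostgreSQL actually knows about can be inspected from SQL; this uses a standard catalog function and assumes the extension is named promscale:

-- list the registered upgrade path for the extension (assumes the name is promscale)
SELECT source, target, path
FROM pg_extension_update_paths('promscale')
WHERE source = '0.3.2' AND target = '0.5.0';

A NULL path in the output means there is no way to get from the source to the target version with the scripts installed on disk.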
Ah, I only looked at the last line of the log, where it said the version was wrong, and then went and tried to upgrade it manually, but I see that there was a different error on the line before:
So it seems the real problem was a lock:
@tomhughes that would explain it. Was the error message clear enough for you to continue with the upgrade? Was it successful?
I had to significantly increase the value of
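The exact setting isn't captured above. For lock exhaustion during an extension upgrade the limit that usually has to be raised is max_locks_per_transaction, so, purely as an assumption, the change would look something like this (the value is illustrative, and the parameter only takes effect after a full server restart):

-- assumption: the limit that was exhausted is max_locks_per_transaction
ALTER SYSTEM SET max_locks_per_transaction = 1024;  -- illustrative value
-- this parameter is only picked up after a full server restart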
Thank you for the feedback! Do you have any more visibility into which part of the upgrade took an hour? It would be very valuable to better understand this.
I'm not sure there's much to go on... This was the log:
So it first installed the base version, which took about half an hour (restarting the server to increase the locks seemed to make the extension disappear completely, possibly because the old .so was no longer present), and then it took about another half hour to upgrade it. As far as I know that second part at least was just it executing an ALTER to update the extension, but the CPUs were fairly flat out the whole hour.
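For reference, an extension update of this kind normally comes down to a single statement; a minimal sketch, assuming the extension in question is named promscale:

-- run in the database the connector uses; assumes the extension is named promscale
ALTER EXTENSION promscale UPDATE TO '0.5.0';
-- or, to move to the default version shipped with the package:
ALTER EXTENSION promscale UPDATE;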
Thank you, that is incredibly helpful. Could you perhaps also provide the postgres logs for the same time interval?
I think this is everything from when I restarted the server after increasing the lock count the second time until the upgrade completed, but with the routine autovacuum noise stripped out:
That is amazing, thank you again for providing us with this!
@tomhughes we've had a chance to take a closer look at the logs that you've provided and have some follow-up questions. The first question relates to this line:
Did you manually cancel the background worker here using something like
The second question relates to these lines:
Do you know which application this was? I assume that there are applications other than Promscale connected to your DB, perhaps Grafana? Any others which might be relevant? This is mostly so that we can understand what the overall environment looks like. Thanks in advance!
I have no idea what caused the first message, I'm afraid. I got rid of our scheduled running of the maintenance task a while ago, as Timescale now takes care of that, so I assumed it had started a background job which got cancelled. As to the others, I guess it was likely Grafana or Alertmanager, though I don't think we have anything there which routinely hits Promscale rather than hitting Prometheus directly. The setup is Prometheus feeding into Promscale (which is still experimental, as we haven't got it to run reliably yet), plus Grafana, Alertmanager and Karma to provide a web interface to alerts. It could maybe be the postgres collector hitting the database to collect statistics? I also ran some selects in a console session to try and see what it was doing and had
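For the record, the kind of console inspection mentioned above, and the question of which application a given connection belongs to, is usually answered from pg_stat_activity; the exact statements used weren't captured, but a typical query looks like this:

-- show non-idle sessions, the application they report, and what they are waiting on
SELECT pid, application_name, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY backend_start;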
@tomhughes thanks again for the invaluable feedback.
The focus of our team over the next few months will be on improving reliability. Could you perhaps provide us with some insight into the types of problems that you've been having?
The principal problem we've been struggling with over the last year is getting autovacuum to keep up with the sheer number of tables and to analyse everything in time to prevent things freezing up due to transaction ID wraparound. It has improved recently, to the extent that the time we can run before things collapse is now a matter of weeks rather than days. Most recently we were able to run from mid April until last week, when we hit the wall again; that was what led me to try updating to the latest Promscale and extension to see if there were any further improvements.

That said, upgrading from 0.10.0/0.3.2 to 0.11.0/0.5.0 seems to have its own issues, in that there has been a significant increase in the CPU used by the ingestor, or rather by postgres on the ingestor's behalf I think, although it's all a bit confusing as you can see here:

When I first started 0.11.0 successfully on the 20th it ran flat out for a few hours. I then reduced the number of autovacuum workers from 56 to 28 (it's a 28 core/56 thread machine) and that improved things for a few days, until it hit transaction wraparound issues again on the 22nd, at which point I stopped the ingestor while it caught up with autovacuum. Since I restarted the ingestor on the 23rd it's just been going nuts with the CPUs all flat out...

The good news is that the I/O utilisation has been way down since that restart and it has been keeping the unfrozen IDs under control:

Whether that is good overall is unclear, because I don't really understand why the behaviour changed so dramatically after that restart...
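As an aside, the wraparound pressure described above is normally tracked with a standard catalog query (not taken from this thread, just the usual check):

-- how close each database is to transaction ID wraparound
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

Autovacuum has to freeze tuples fast enough to keep xid_age well below the roughly two billion hard limit.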
Thank you for these insights, we will take a look at them in detail and get back to you with further questions. In the meantime, could you tell us what the volume of metrics is that you're currently ingesting? The Promscale connector has log outputs like:
which can help to identify this. Could you perhaps also provide the output of the following queries:

SELECT count(*) FROM timescaledb_information.chunks;

SELECT count(*)
FROM pg_class
JOIN pg_namespace n ON n.oid = pg_class.relnamespace
WHERE n.nspname <> 'pg_catalog'
AND n.nspname !~ '^pg_toast'
AND n.nspname <> 'information_schema';
Sure, here's a snapshot of the throughput:
and the first query:
and the second:
This is incredibly valuable feedback! Thank you @tomhughes!
@tomhughes sorry to keep bothering you for more information. Could you provide us with some information about the server that you ran the upgrade on? I'm primarily interested in understanding the performance characteristics of the hardware.
If you're using a cloud hosting provider (AWS/GCP/Hetzner etc.) you can just mention the type of instance that you're using.
It's this machine, which is bare metal with 2 x 14-core Xeons, 192 GB of RAM and 4 x 1 TB SSDs in RAID5. That wasn't necessarily intended as the final home for this, but part of what we were trying to figure out was what resources (mostly how much disk) we would need before we bought a machine for it, and this was a machine we had available to trial it.
Promscale 0.11.0 is refusing to start, claiming the extension is too old even though I have 0.5.0 installed. Trying to update it manually fails:
and the package does indeed appear to be missing an SQL script to do the update, as does this repository.
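For anyone debugging the same thing, the versions and update scripts that the installed package actually ships can be listed from SQL; this assumes the extension is named promscale:

-- versions available on disk for the extension (assumes the name is promscale)
SELECT name, version, installed
FROM pg_available_extension_versions
WHERE name = 'promscale'
ORDER BY version;

If 0.5.0 shows up here but no 0.3.2-to-0.5.0 upgrade script is shipped, a manual ALTER EXTENSION ... UPDATE fails in exactly the way described above.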