obtaining certificate: context canceled, but does not restart #3202
Just so you know, configs are saved to the file system regardless:

So if your API key is in the config, it will be in that file. You can turn this off, but be aware of the implications: if Caddy can't save the last active config, it can't resume it, so any config changes made via the API won't be persisted if the server is restarted.

Is your machine really so loosely secured that other users/untrusted code can read files with 0600 permissions, i.e. they share the same user account? That's probably what you need to fix instead.

Anyway, I'd like to help you, but I'll need to reproduce the behavior you're seeing. Please provide a full and minimal config file needed to reproduce the problem, along with the exact steps to take. Ideally, we need to be able to reproduce the bug in the most minimal way possible. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.

I've attached a template below that will help make this easier and faster! It will ask for some information you've already provided; that's OK, just fill it out the best you can. 👍 I've also included some helpful tips below the template. Feel free to let me know if you have any questions! Thank you again for your report, we look forward to resolving it!

Template
Helpful tips
Example of a tutorial: Create a config file:
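As a quick aside on the 0600-permissions point above, one way to verify it is a sketch like the following; the path is the one that appears in the autosave log lines later in this thread, and it depends on the user Caddy runs as:

```sh
# The autosaved config is written with 0600 permissions (-rw-------), so only
# the user Caddy runs as can read the API key stored inside it.
ls -l /var/lib/caddy/.config/caddy/autosave.json
```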
oops. I'm sorry I didn't put the configuration in. That was my bad. Here you go.

```json
{
"apps": {
"http": {
"servers": {
"ssh-proxy": {
"automatic_https": {
"@id": "automatic_https",
"disable_redirects": true,
"skip": [
]
},
"listen": [
":80",
":443"
],
"routes": [
{
"@id": "shine",
"group": "shine",
"handle": [
{
"@id": "proxy_shine",
"handler": "reverse_proxy",
"upstreams": [
{
"dial": "localhost"
}
]
}
],
"match": [
{
"host": [
"*.shine.caddy.test.shinenelson.xyz"
]
}
],
"terminal": true
},
{
"@id": "mark",
"group": "mark",
"handle": [
{
"@id": "proxy_mark",
"handler": "reverse_proxy",
"upstreams": [
{
"dial": "localhost"
}
]
}
],
"match": [
{
"host": [
"*.mark.caddy.test.shinenelson.xyz"
]
}
],
"terminal": true
},
{
"@id": "lola",
"group": "lola",
"handle": [
{
"@id": "proxy_lola",
"handler": "reverse_proxy",
"upstreams": [
{
"dial": "localhost"
}
]
}
],
"match": [
{
"host": [
"*.lola.caddy.test.shinenelson.xyz"
]
}
],
"terminal": true
}
],
"tls_connection_policies": [
{
"match": {
"sni": [
"*.shine.caddy.test.shinenelson.xyz",
"*.mark.caddy.test.shinenelson.xyz",
"*.lola.caddy.test.shinenelson.xyz"
]
}
}
]
}
}
},
"tls": {
"automation": {
"policies": [
{
"issuer": {
"ca" : "https://acme-staging-v02.api.letsencrypt.org/directory",
"challenges": {
"dns": {
"@id": "dns-challenge",
"auth_token": "dummmydigitaloceantoken",
"provider": "digitalocean"
}
},
"module": "acme"
},
"subjects": [
"*.shine.caddy.test.shinenelson.xyz",
"*.mark.caddy.test.shinenelson.xyz",
"*.lola.caddy.test.shinenelson.xyz"
]
}
]
}
}
}
}
```
I knew this, but I didn't bother too much since I could possibly move the running config also to within
I'm not 100% sure, but for me, using the systemd init script, the configuration was never persisted. It always started afresh from the running config. I'm guessing that it is the
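If it helps to see why that can happen, here is a rough sketch (the Exec lines shown in the comments are assumptions, not a verbatim copy of the packaged service file):

```sh
# Show the Exec lines of whatever caddy unit is installed; the point is that
# both hard-code a --config file, so `systemctl reload caddy` re-reads that
# file instead of resuming from autosave.json (as noted later in this thread).
systemctl cat caddy | grep -E '^Exec'
# Typical (assumed) output:
#   ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/caddy.json
#   ExecReload=/usr/bin/caddy reload --config /etc/caddy/caddy.json
```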
Thanks, but what about the rest of the template? What curl and caddy commands do I run, what should I see, what do you expect, etc?
Steps to Reproduce

log after this command
log after this command
A new certificate maintenance routine is not created for at least a minute.
log after this command

Then the new certificates are provisioned and everything goes about as expected.

PS : systemctl reload caddy would still not reload from the autosave.json in the default caddy.service unless the --config flag is removed.

Expected Behaviour

After the previous context is killed, a new context for the current configuration should start.

What Happens Instead

The new context for the current configuration never starts.
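A hedged sketch of what this workflow looks like in commands, pieced together from the config above and the description later in the thread; the file name, admin port, and token value are assumptions:

```sh
# Step 1: start Caddy with the config above (it still contains the dummy
# DigitalOcean token, so the first obtain attempt fails with HTTP 401).
caddy start --config caddy.json

# Step 2: swap in the real registrar token through the admin API, using the
# "@id": "dns-challenge" handle from the config. Caddy reloads; the log shows
# "obtaining certificate: context canceled", but no new obtain job starts.
curl -X PATCH "http://localhost:2019/id/dns-challenge/auth_token" \
  -H "Content-Type: application/json" \
  -d '"<real-digitalocean-token>"'

# Step 3: only after forcing yet another reload (see the /load example
# further down the thread) are the certificates actually obtained.
```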
Thanks for the explanation. I still don't quite follow what is going on. Can you reduce the problem space down more minimally? Simplify the configs, the commands to run, remove systemd from the equation, etc -- I know it's work but please put some effort into it and follow the issue template's suggestions, it will help speed things up incredibly.

> A new certificate maintenance routine is not created for at least a minute.

Are you sure?

```
Mar 28 22:07:50 caddy[21276]: 2020/03/28 22:07:50 [INFO][cache:0xc000660be0] Stopped certificate maintenance routine
Mar 28 22:07:50 caddy[21276]: {"level":"info","ts":1585433270.1380074,"msg":"autosaved config","file":"/var/lib/caddy/.config/caddy/autosave.json"}
Mar 28 22:07:50 caddy[21276]: {"level":"info","ts":1585433270.1384685,"logger":"admin","msg":"stopped previous server"}
Mar 28 22:07:50 caddy[21276]: 2020/03/28 22:07:50 [INFO][cache:0xc0007fca50] Started certificate maintenance routine
```

^ same second.

> PS : systemctl reload caddy would still not reload from the autosave.json in the default caddy.service unless the --config flag is removed.

Hmm, that doesn't sound right... how can I reproduce this? Having trouble seeing that.
I'm sorry I couldn't get this tested sooner. I'm currently without a test environment, so I will not be able to give you more detailed test cases for reproduction.
I will get back to this as soon as I have my test environment back online.
~ shine
Okay. Will close in the meantime, until the issue becomes actionable then.
I know I messed up the issue template the last time ( unfortunately, it somehow doesn't exist in the repository or come up while creating a new issue. I had to copy-paste it from an earlier comment ) and went all over the place, so this time, I'm going to be a little bit more diligent with the issue template at least. I'm going to try and collate everything I said previously into this one comment so that you don't have to scroll back and forth for context.

1. Environment

1a. Operating system and version
1b. Caddy version (run
Ah, interesting. I think the template is on the v1 branch; we'll need to add it to the new v2/master branch now that we migrated the default branch. Thanks for pointing that out.
@shinenelson Thanks! That's much easier for me to follow -- I assume no log lines have been omitted.

By this:

What do you mean by "it just stays there" - and, separate question: what exactly is not working?
That appears to be happening.

Edit: @francislavoie I think maybe he is referring to the revised unit file that we worked on last week?
Yes, he was using
Yes, I think so; caddy-api.service didn't exist at the time.

Edit: To clarify, @shinenelson, I'm still not quite sure that anything is wrong; I am trying to figure out what the disconnect is between what you expect and what reality is. Did the wrong command / missing CLI flag resolve the problem then?
I mentioned the service file again only because I mentioned it earlier ( to reply to a previous question for clarification ). This time around, I took that out of the equation while testing as requested. The service file does not have anything to do with the issue.
Other than a few
That acquiring lock happens if I trigger another reload of the server. If I didn't reload, the new context would probably never start. Let me try and put it in text for you :
At this point, shouldn't there be another new

That doesn't seem to be happening, which is what I'm pointing to as the problem. That's what I meant by "it just stays there".

I do the next step only to test whether
My expectation was that
To summarize even further :

Step 1 :

Ideally, my workflow should have ended in Step 2, but I didn't have any certificates for my hosts at that point; so, I had to go to step 3.

So, basically, what happens is that if a new
Interesting... I think I see what you're talking about now. I have a hunch, I'll need more time to mull it over though and try some things. If I'm right, the good news is that we're not leaking resources -- the bad news is that we're cleaning up too many resources. Either way, as you've noted, a practical workaround in the meantime is to trigger another config reload. You can do this by POSTing your config to the /load endpoint with
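A minimal sketch of that workaround, assuming the default admin endpoint on localhost:2019 and a config file saved as caddy.json:

```sh
# Re-POST the current config to the admin API's /load endpoint to force
# another config reload, which queues the certificate "obtain" jobs again.
curl -X POST "http://localhost:2019/load" \
  -H "Content-Type: application/json" \
  -d @caddy.json
```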
I was able to reproduce this. My hunch was correct. I'll explain more in a follow-up. @shinenelson if I open a PR in the next few minutes, would you be able to test the CI artifact today? If so, it's a very minor change so it should be able to go out with 2.0.
Sure; it'll take me a bit to get my test instance ready. Let me go and get that instance up by the time you merge that change in.
@shinenelson Okay, I just pushed it to master since it's just an update of go.mod. Fixed in 2609a72 -- I verified it repeatedly on my machine, please confirm when you can!
On it. Side note : I don't know why I come up with issues that should be reported at caddyserver/certmagic and report them here. I'll try to remember that the certmagic repository also exists next time I have an issue with TLS certificates.
It's fine, this was a weird case that is caused by the interplay between the two. CertMagic would have been a more confusing place to report this bug, but the fix was much, much easier in CertMagic.
All good! I too tested it multiple times. I broke the DNS auth a couple of times to make sure it would still generate the new certificates. This is good to go 🚀
Great to hear it! Here's what was happening:

When a Caddy config is loaded and started, it initializes all the apps, like the

Under the hood, CertMagic works on "obtain" and "renew" jobs asynchronously (in the background). When Caddy asks to obtain certificates, CertMagic creates a series of jobs, one for each name. Each job had a unique name/ID so that jobs wouldn't be duplicated. In other words, you don't need 3 jobs that all attempt to renew the same certificate. If a job with the same name was already queued, CertMagic would ignore the new job so as to not duplicate it and clog up the queue.

The problem is that Caddy calls

The change I made removes job de-duplication for "obtain" jobs -- i.e. gives them an empty name. This is because we now assume that whoever is calling

In other words, the fix is to allow multiple/overlapping "obtain" jobs, so for a brief time, yes there are two jobs that try to obtain the same certificate, but they are synced by a lock mechanism, and one of them is canceled quickly anyway.

So, I think this is a nice and simple solution to a tricky and obscure problem. Thanks for your patience and diligence in getting it sorted out!
That certainly was a weird and obscure problem. The quirks of having inter-connected applications making asynchronous calls - multiple points of failure. I love the way you handled the fix. Like you said - it's a simple fix to a kind of complicated problem. Thanks again.
I'm trying to automate a caddy deployment and since I'm using multiple sub-domains, I'd like to get a wildcard TLS certificate.

In order to provision wildcard certificates, I need to use the DNS-01 challenge. And I did not want to put the API keys to my domain registrar in a static file sitting on the file system of the server. Hence, I put in a dummy key and then loaded the actual key manually via the caddy API.

However, the problem is that as soon as the caddy server starts, it starts the certificate maintenance routine, which would fail with acme: error presenting token: digitalocean: HTTP 401: unauthorized. Now, after I put the correct API key for the registrar in via the caddy API, the caddy server reloads again. And this time, when the certificate maintenance routine starts, it notices that there is already another obtaining certificate context and kills it.

The problem is that the certificate maintenance routine does not retry obtaining the certificate again. I'll have to reload the caddy server again before it'll try again and generate the TLS certificates ( if the API keys are correct ) using the DNS challenge.