Mongo upgrade 4.2 to 4.4: Windows docker volume lock file issue #435

Closed
fdubois1 opened this issue Nov 17, 2020 · 8 comments

Comments

@fdubois1

fdubois1 commented Nov 17, 2020

Something similar to #385

I'm using "library/mongo:4.2-windowsservercore-1809" on Windows Server 2019 with a docker volume. Now I want to move to version 4.4. I stopped my container, removed it, and ran docker run pointing at "library/mongo:4.4-windowsservercore-1809".

It failed to start with this log:
{"t":{"$date":"2020-11-17T20:45:31.948+00:00"},"s":"I", "c":"CONTROL", "id":23285, "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2020-11-17T20:45:31.975+00:00"},"s":"W", "c":"ASIO", "id":22601, "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2020-11-17T20:45:31.976+00:00"},"s":"I", "c":"NETWORK", "id":4648602, "ctx":"main","msg":"Implicit TCP FastOpen in use."}
{"t":{"$date":"2020-11-17T20:45:32.153+00:00"},"s":"I", "c":"STORAGE", "id":4615611, "ctx":"initandlisten","msg":"MongoDB starting","attr":{"pid":12012,"port":27017,"dbPath":"C:/data/db/","architecture":"64-bit","host":"4b5e866f3aa5"}}
{"t":{"$date":"2020-11-17T20:45:32.153+00:00"},"s":"I", "c":"CONTROL", "id":23398, "ctx":"initandlisten","msg":"Target operating system minimum version","attr":{"targetMinOS":"Windows 7/Windows Server 2008 R2"}}
{"t":{"$date":"2020-11-17T20:45:32.153+00:00"},"s":"I", "c":"CONTROL", "id":23403, "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"4.4.1","gitVersion":"ad91a93a5a31e175f5cbf8c69561e788bbc55ce1","modules":[],"allocator":"tcmalloc","environment":{"distmod":"windows","distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2020-11-17T20:45:32.153+00:00"},"s":"I", "c":"CONTROL", "id":51765, "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"Microsoft Windows Server 2019","version":"10.0 (build 17763)"}}}
{"t":{"$date":"2020-11-17T20:45:32.153+00:00"},"s":"I", "c":"CONTROL", "id":21951, "ctx":"initandlisten","msg":"Options set by command line","attr":{"options":{"net":{"bindIp":"*"}}}}
{"t":{"$date":"2020-11-17T20:45:32.190+00:00"},"s":"I", "c":"STORAGE", "id":22270, "ctx":"initandlisten","msg":"Storage engine to use detected by data files","attr":{"dbpath":"C:/data/db/","storageEngine":"wiredTiger"}}
{"t":{"$date":"2020-11-17T20:45:32.191+00:00"},"s":"I", "c":"STORAGE", "id":22315, "ctx":"initandlisten","msg":"Opening WiredTiger","attr":{"config":"create,cache_size=3583M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress,compact_progress],"}}
{"t":{"$date":"2020-11-17T20:45:32.206+00:00"},"s":"E", "c":"STORAGE", "id":22435, "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":16,"message":"[1605645932:205673][12012:140734189032032], wiredtiger_open: __win_file_lock, 239: C:\\data\\db\\\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device"}}
{"t":{"$date":"2020-11-17T20:45:32.206+00:00"},"s":"E", "c":"STORAGE", "id":22435, "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":16,"message":"[1605645932:205673][12012:140734189032032], wiredtiger_open: __conn_single, 1677: WiredTiger database is already being managed by another process: Resource device"}}
{"t":{"$date":"2020-11-17T20:45:32.206+00:00"},"s":"E", "c":"STORAGE", "id":22435, "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":16,"message":"[1605645932:206677][12012:140734189032032], wiredtiger_open: __win_file_lock, 239: C:\\data\\db\\\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device"}}
{"t":{"$date":"2020-11-17T20:45:32.206+00:00"},"s":"E", "c":"STORAGE", "id":22435, "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":16,"message":"[1605645932:206677][12012:140734189032032], wiredtiger_open: __conn_single, 1677: WiredTiger database is already being managed by another process: Resource device"}}
{"t":{"$date":"2020-11-17T20:45:32.206+00:00"},"s":"W", "c":"STORAGE", "id":22347, "ctx":"initandlisten","msg":"Failed to start up WiredTiger under any compatibility version. This may be due to an unsupported upgrade or downgrade."}
{"t":{"$date":"2020-11-17T20:45:32.206+00:00"},"s":"F", "c":"STORAGE", "id":28595, "ctx":"initandlisten","msg":"Terminating.","attr":{"reason":"16: Resource device"}}
{"t":{"$date":"2020-11-17T20:45:32.207+00:00"},"s":"F", "c":"-", "id":23091, "ctx":"initandlisten","msg":"Fatal assertion","attr":{"msgid":28595,"file":"src\\mongo\\db\\storage\\wiredtiger\\wiredtiger_kv_engine.cpp","line":1101}}
{"t":{"$date":"2020-11-17T20:45:32.207+00:00"},"s":"F", "c":"-", "id":23092, "ctx":"initandlisten","msg":"\n\n***aborting after fassert() failure\n\n"}

If I delete my docker volume and start version 4.4, there is no problem, and I can stop/start as often as I want, as long as I delete the lock file before launching the container. But if I move from 4.2 to 4.4 and keep my docker volume (using the same volume for 4.4 that I used for 4.2), it doesn't start and produces the log above, even if I delete the lock file before launching 4.4. Any hints?
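
For anyone triaging the same failure, the signature in the log above can be checked mechanically. This is only a sketch: the function name `has_wiredtiger_lock_error` is made up for illustration, and all it does is search container logs (read from stdin) for the two messages that appear in the log above.

```shell
# Hypothetical helper (not part of the image or of MongoDB): exits 0 when the
# log on stdin contains the WiredTiger.lock failure shown above.
has_wiredtiger_lock_error() {
    # Buffer stdin so it can be searched twice.
    local log
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'WiredTiger\.lock' &&
        printf '%s\n' "$log" | grep -q 'already being managed by another process'
}

# Usage, assuming a container named "test":
#   docker logs test 2>&1 | has_wiredtiger_lock_error && echo 'lock error found'
```

Matching both messages rather than only the file name avoids flagging a log that merely mentions WiredTiger.lock in some other context.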

@tianon
Member

tianon commented Nov 20, 2020

I think this might be due to unclean shutdown:

$ docker pull mongo:4.2
4.2: Pulling from library/mongo
Digest: sha256:73c0bd81c638e33ad8afd705330ad5addfa1df31b5c646dda6490ec2894f0977
Status: Image is up to date for mongo:4.2
docker.io/library/mongo:4.2

$ docker volume create test
test

$ docker run -d --name test -v 'test:C:\data\db' mongo:4.2
5ccce09bde3f6901137289d345dbb0fc18542bd27a6c168c4c60aa80dfa2ec0c

$ docker logs test | grep -iE 'listening on|waiting for connections'
2020-11-19T18:50:20.757-0800 I  NETWORK  [listener] Listening on 0.0.0.0
2020-11-19T18:50:20.757-0800 I  NETWORK  [listener] waiting for connections on port 27017

$ docker logs -f --tail=0 test

$ # in a new terminal:
$ docker stop test
test
$ # above "docker logs" has no output (and has exited), meaning this was an unclean shutdown that MongoDB did not appropriately capture :(

I then did docker start test and checked the logs again and confirmed:

$ docker logs test
...
2020-11-19T18:53:07.851-0800 W  STORAGE  [initandlisten] Detected unclean shutdown - C:\data\db\mongod.lock is not empty.
2020-11-19T18:53:07.854-0800 I  STORAGE  [initandlisten] Detected data files in C:\data\db\ created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2020-11-19T18:53:07.854-0800 W  STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
...

I tried again (wiping everything and starting completely clean) and instead of using docker stop I used docker kill -sTERM explicitly, and the container/process still dies immediately instead of gracefully shutting down, so I think this is a case of mongod.exe not properly handling the ShutdownComputeSystem event from Windows (related: moby/moby#25982). 😞
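
The check I did by eye on the restart logs could be scripted roughly as follows. It is a sketch only: `detected_unclean_shutdown` is a hypothetical name, and it relies on the 4.2 text-log wording quoted above, which other versions may not share.

```shell
# Hypothetical helper: exits 0 if the startup log on stdin reports that the
# previous run ended uncleanly (the 4.2 message quoted above).
detected_unclean_shutdown() {
    grep -q 'Detected unclean shutdown'
}

# Usage, assuming a container named "test" that has just been restarted:
#   docker logs test 2>&1 | detected_unclean_shutdown && echo 'previous stop was unclean'
```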

@fdubois1
Author

Hmm... the issue you described is exactly the one in #385. That issue was closed because this one exists: microsoft/Windows-Containers#37

So we know there is an issue where the lock file is not deleted. But in my case, even if I delete the lock file before launching version 4.4, it doesn't start when pointed at a volume that was used by version 4.2. So I have no way to bypass this error, since even deleting the lock file beforehand doesn't work.

@tianon
Member

tianon commented Nov 20, 2020

Right, my point is that this is deeper than just the lock file described in microsoft/Windows-Containers#37 -- when MongoDB detects an unclean shutdown, it has to do a repair. On the same version of MongoDB, this usually works pretty successfully and without too much fanfare (these days -- historically it has had some challenges). Additionally, when you run the new MongoDB and it detects old data, it does a migration. What I think is happening is that these two features are colliding in your case -- it needs to repair and needs to migrate/upgrade, and perhaps is not able to repair the 4.2 data successfully.

I ran another test today and did docker exec -it test mongo, followed by use admin and db.shutdownServer() (which I weirdly had to run twice for it to actually work), which did successfully shut down MongoDB gracefully. Once the container was stopped, I removed it and recreated it against mongo:4.4 with the same data directory, which also worked successfully (I was able to connect to it shortly afterwards, and didn't see any messages in the container logs about unclean shutdown).
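
The manual steps above could be wrapped up non-interactively along these lines. This is a sketch, not something shipped in the image: `graceful_mongo_stop` and the overridable `DOCKER` variable are names invented here, and it assumes the `mongo` shell is on the container's PATH (as it was for the docker exec above).

```shell
# DOCKER is overridable so the sequence can be dry-run (e.g. DOCKER=echo).
DOCKER=${DOCKER:-docker}

# Hypothetical helper: ask mongod inside container "$1" to shut down cleanly,
# then wait for the container to exit.
graceful_mongo_stop() {
    local container=$1
    # The shell loses its connection when the server goes away, so a non-zero
    # exit status here is not treated as a failure.
    "$DOCKER" exec "$container" mongo admin --quiet --eval 'db.shutdownServer()' || true
    # "docker wait" blocks until the container stops and prints its exit code.
    "$DOCKER" wait "$container"
}

# Usage, assuming a container named "test":
#   graceful_mongo_stop test && docker rm test
```

Given that the shutdown command needed a second attempt in my test, `docker wait` may block if the first attempt does not take; in that case, rerunning the exec step from another terminal should unblock it.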

So in my testing, what appears to have changed between my comment in #385 (comment) and now is that MongoDB no longer seems to successfully/appropriately respond to docker stop. 😞

@tianon
Member

tianon commented Nov 20, 2020

(Although to be fair, I didn't explicitly say in #385 (comment) whether I'd checked the logs to ensure that it did indeed do a graceful shutdown, so maybe this was the behavior then too and the repair was just working fine. 🤦)

@tianon
Member

tianon commented Dec 4, 2020

Arg, #444 definitely threw a wrench in my testing above -- I've just tested with 4.4-windowsservercore-1809 explicitly, and the shutdown is still slightly delayed, but does actually happen and complete before Windows destroys the container (on docker stop).

Repeating #435 (comment), but this time with -windowsservercore-1809 so I'm not using those ancient (and likely not-as-well-behaved) LTSC 2016 images, plus #438 to fix #426:

$ docker pull mongo:4.2-windowsservercore-1809
4.2-windowsservercore-1809: Pulling from library/mongo
Digest: sha256:c5abf954175aeb60dca5ef4f3c3b38f781d6a13b6d3f81286a886686c5be95f2
Status: Image is up to date for mongo:4.2-windowsservercore-1809
docker.io/library/mongo:4.2-windowsservercore-1809

$ docker volume create test
test

$ docker run -d --name test -v 'test:C:\data\db' mongo:4.2-windowsservercore-1809
6305cf4c581667f86b9d37cd5d3670441bea4c8630a6eb5548da1cbcd97b032c

$ docker logs test | grep -iE 'listening on|waiting for connections'
2020-12-03T18:22:30.397-0800 I  NETWORK  [listener] Listening on 0.0.0.0
2020-12-03T18:22:30.397-0800 I  NETWORK  [listener] waiting for connections on port 27017

$ docker logs -f --tail=0 test

$ # in a new terminal:
$ docker stop test
test

$ # above "docker logs" has the following output:
2020-12-03T18:24:12.167-0800 I  CONTROL  [thread1] CTRL_SHUTDOWN_EVENT signal
2020-12-03T18:24:12.167-0800 I  CONTROL  [consoleTerminate] got CTRL_SHUTDOWN_EVENT, will terminate after current cmd ends
2020-12-03T18:24:12.167-0800 I  REPL     [consoleTerminate] Stepping down the ReplicationCoordinator for shutdown, waitTime: 10000ms
2020-12-03T18:24:12.173-0800 I  SHARDING [consoleTerminate] Shutting down the WaitForMajorityService
2020-12-03T18:24:12.176-0800 I  CONTROL  [consoleTerminate] Shutting down the LogicalSessionCache
2020-12-03T18:24:12.176-0800 I  NETWORK  [consoleTerminate] shutdown: going to close listening sockets...
2020-12-03T18:24:12.177-0800 I  NETWORK  [consoleTerminate] Shutting down the global connection pool
2020-12-03T18:24:12.177-0800 I  STORAGE  [consoleTerminate] Shutting down the FlowControlTicketholder
2020-12-03T18:24:12.177-0800 I  -        [consoleTerminate] Stopping further Flow Control ticket acquisitions.
2020-12-03T18:24:12.177-0800 I  STORAGE  [consoleTerminate] Shutting down the PeriodicThreadToAbortExpiredTransactions
2020-12-03T18:24:12.178-0800 I  STORAGE  [consoleTerminate] Shutting down the PeriodicThreadToDecreaseSnapshotHistoryIfNotNeeded
2020-12-03T18:24:12.178-0800 I  REPL     [consoleTerminate] Shutting down the ReplicationCoordinator
2020-12-03T18:24:12.178-0800 I  SHARDING [consoleTerminate] Shutting down the ShardingInitializationMongoD
2020-12-03T18:24:12.178-0800 I  REPL     [consoleTerminate] Enqueuing the ReplicationStateTransitionLock for shutdown
2020-12-03T18:24:12.178-0800 I  -        [consoleTerminate] Killing all operations for shutdown
2020-12-03T18:24:12.178-0800 I  COMMAND  [consoleTerminate] Shutting down all open transactions
2020-12-03T18:24:12.178-0800 I  REPL     [consoleTerminate] Acquiring the ReplicationStateTransitionLock for shutdown
2020-12-03T18:24:12.178-0800 I  INDEX    [consoleTerminate] Shutting down the IndexBuildsCoordinator
2020-12-03T18:24:12.179-0800 I  NETWORK  [consoleTerminate] Shutting down the ReplicaSetMonitor
2020-12-03T18:24:12.179-0800 I  CONTROL  [consoleTerminate] Shutting down free monitoring
2020-12-03T18:24:12.179-0800 I  CONTROL  [consoleTerminate] Shutting down free monitoring
2020-12-03T18:24:12.180-0800 I  FTDC     [consoleTerminate] Shutting down full-time data capture
2020-12-03T18:24:12.180-0800 I  FTDC     [consoleTerminate] Shutting down full-time diagnostic data capture
2020-12-03T18:24:12.187-0800 I  STORAGE  [consoleTerminate] Shutting down the HealthLog
2020-12-03T18:24:12.187-0800 I  STORAGE  [consoleTerminate] Shutting down the storage engine
2020-12-03T18:24:12.187-0800 I  STORAGE  [consoleTerminate] Deregistering all the collections
2020-12-03T18:24:12.188-0800 I  STORAGE  [consoleTerminate] Timestamp monitor shutting down
2020-12-03T18:24:12.188-0800 I  STORAGE  [consoleTerminate] WiredTigerKVEngine shutting down
2020-12-03T18:24:12.188-0800 I  STORAGE  [consoleTerminate] Shutting down session sweeper thread
2020-12-03T18:24:12.188-0800 I  STORAGE  [consoleTerminate] Finished shutting down session sweeper thread
2020-12-03T18:24:12.188-0800 I  STORAGE  [consoleTerminate] Shutting down journal flusher thread
2020-12-03T18:24:12.250-0800 I  STORAGE  [consoleTerminate] Finished shutting down journal flusher thread
2020-12-03T18:24:12.250-0800 I  STORAGE  [consoleTerminate] Shutting down checkpoint thread
2020-12-03T18:24:12.250-0800 I  STORAGE  [consoleTerminate] Finished shutting down checkpoint thread
2020-12-03T18:24:12.332-0800 I  STORAGE  [consoleTerminate] shutdown: removing fs lock...
2020-12-03T18:24:12.335-0800 I  -        [consoleTerminate] Dropping the scope cache for shutdown
2020-12-03T18:24:12.335-0800 I  CONTROL  [consoleTerminate] now exiting
2020-12-03T18:24:12.335-0800 I  CONTROL  [consoleTerminate] shutting down with code:12
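
To confirm a graceful stop without reading the whole log, the three milestones above (shutdown event received, fs lock removed, process exiting) can be checked in order. This is a sketch with a hypothetical function name, and it matches the 4.2 text-log wording shown above, so it is not expected to work unchanged on the JSON logs that 4.4 emits.

```shell
# Hypothetical helper: exits 0 only if the log on stdin shows the shutdown
# event, then the lock removal, then the final exit, in that order.
saw_graceful_shutdown() {
    awk '
        /CTRL_SHUTDOWN_EVENT/             { signalled = 1 }
        signalled && /removing fs lock/   { unlocked = 1 }
        unlocked && /now exiting/         { exited = 1 }
        END { exit (exited ? 0 : 1) }
    '
}

# Usage, assuming a stopped container named "test":
#   docker logs test 2>&1 | saw_graceful_shutdown && echo 'clean stop'
```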

Then, I did the following:

$ docker pull mongo:4.4-windowsservercore-1809
4.4-windowsservercore-1809: Pulling from library/mongo
Digest: sha256:6acd97216542cec1489bb1f8bc79df24634fbe8e2f8d5d5cab28f346d620b8d4
Status: Image is up to date for mongo:4.4-windowsservercore-1809
docker.io/library/mongo:4.4-windowsservercore-1809

$ docker volume ls | grep test
local               test
$ docker rm test
test

$ docker run --rm --name test -v 'test:C:\data\db' mongo:4.4-windowsservercore-1809
{"t":{"$date":"2020-12-03T18:26:38.847-08:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2020-12-03T18:26:39.778-08:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2020-12-03T18:26:39.781-08:00"},"s":"I",  "c":"NETWORK",  "id":4648602, "ctx":"main","msg":"Implicit TCP FastOpen in use."}
{"t":{"$date":"2020-12-03T18:26:39.796-08:00"},"s":"I",  "c":"STORAGE",  "id":4615611, "ctx":"initandlisten","msg":"MongoDB starting","attr":{"pid":1512,"port":27017,"dbPath":"C:/data/db/","architecture":"64-bit","host":"136c04ab0891"}}
{"t":{"$date":"2020-12-03T18:26:39.797-08:00"},"s":"I",  "c":"CONTROL",  "id":23398,   "ctx":"initandlisten","msg":"Target operating system minimum version","attr":{"targetMinOS":"Windows 7/Windows Server 2008 R2"}}
{"t":{"$date":"2020-12-03T18:26:39.798-08:00"},"s":"I",  "c":"CONTROL",  "id":23403,   "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"4.4.2","gitVersion":"15e73dc5738d2278b688f8929aee605fe4279b0e","modules":[],"allocator":"tcmalloc","environment":{"distmod":"windows","distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2020-12-03T18:26:39.799-08:00"},"s":"I",  "c":"CONTROL",  "id":51765,   "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"Microsoft Windows Server 2019","version":"10.0 (build 17763)"}}}
{"t":{"$date":"2020-12-03T18:26:39.799-08:00"},"s":"I",  "c":"CONTROL",  "id":21951,   "ctx":"initandlisten","msg":"Options set by command line","attr":{"options":{"net":{"bindIp":"*"}}}}
{"t":{"$date":"2020-12-03T18:26:39.824-08:00"},"s":"I",  "c":"STORAGE",  "id":22270,   "ctx":"initandlisten","msg":"Storage engine to use detected by data files","attr":{"dbpath":"C:/data/db/","storageEngine":"wiredTiger"}}
{"t":{"$date":"2020-12-03T18:26:39.830-08:00"},"s":"I",  "c":"STORAGE",  "id":22315,   "ctx":"initandlisten","msg":"Opening WiredTiger","attr":{"config":"create,cache_size=256M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress,compact_progress],"}}
{"t":{"$date":"2020-12-03T18:26:39.860-08:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":16,"message":"[1607048799:859229][1512:140733227422304], wiredtiger_open: __win_file_lock, 239: C:\\data\\db\\\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device"}}
{"t":{"$date":"2020-12-03T18:26:39.860-08:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":16,"message":"[1607048799:860181][1512:140733227422304], wiredtiger_open: __conn_single, 1664: WiredTiger database is already being managed by another process: Resource device"}}
{"t":{"$date":"2020-12-03T18:26:39.860-08:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":16,"message":"[1607048799:860181][1512:140733227422304], wiredtiger_open: __win_file_lock, 239: C:\\data\\db\\\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device"}}
{"t":{"$date":"2020-12-03T18:26:39.861-08:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":16,"message":"[1607048799:860181][1512:140733227422304], wiredtiger_open: __conn_single, 1664: WiredTiger database is already being managed by another process: Resource device"}}
{"t":{"$date":"2020-12-03T18:26:39.861-08:00"},"s":"W",  "c":"STORAGE",  "id":22347,   "ctx":"initandlisten","msg":"Failed to start up WiredTiger under any compatibility version. This may be due to an unsupported upgrade or downgrade."}
{"t":{"$date":"2020-12-03T18:26:39.861-08:00"},"s":"F",  "c":"STORAGE",  "id":28595,   "ctx":"initandlisten","msg":"Terminating.","attr":{"reason":"16: Resource device"}}
{"t":{"$date":"2020-12-03T18:26:39.862-08:00"},"s":"F",  "c":"-",        "id":23091,   "ctx":"initandlisten","msg":"Fatal assertion","attr":{"msgid":28595,"file":"src\\mongo\\db\\storage\\wiredtiger\\wiredtiger_kv_engine.cpp","line":1123}}
{"t":{"$date":"2020-12-03T18:26:39.862-08:00"},"s":"F",  "c":"-",        "id":23092,   "ctx":"initandlisten","msg":"\n\n***aborting after fassert() failure\n\n"}

So unfortunately, we're still hit by microsoft/Windows-Containers#37, and even after deleting WiredTiger.lock it still fails with the same error (as expected). 😞

@awakecoding
Contributor

@tianon thanks for looking into this. While microsoft/Windows-Containers#37 is definitely not helping here, my understanding is we have multiple potentially critical issues affecting MongoDB in Windows containers. We've been doing the ugly workaround of using a few lines of PowerShell to find and delete the WiredTiger.lock file from the Docker volume every time we launch our multi-container application, but in this case, I feel like we're toast with no way out.
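
For reference, the workaround amounts to something like the following, written here in the shell used elsewhere in this thread rather than in PowerShell. It is a sketch under assumptions: `remove_wiredtiger_lock` and the overridable `DOCKER` variable are illustrative names, and it assumes the host shell can use the mountpoint path that Docker reports as-is.

```shell
# DOCKER is overridable so the helper can be exercised without a daemon.
DOCKER=${DOCKER:-docker}

# Hypothetical helper: delete a stale WiredTiger.lock from named volume "$1".
# Only run this while no container is using the volume.
remove_wiredtiger_lock() {
    local volume=$1 mountpoint
    # Ask Docker where the volume lives on the host.
    mountpoint=$("$DOCKER" volume inspect --format '{{ .Mountpoint }}' "$volume") || return 1
    rm -f -- "$mountpoint/WiredTiger.lock"
}

# Usage, assuming a volume named "test":
#   remove_wiredtiger_lock test && docker run -d --name test -v 'test:C:\data\db' mongo:4.4-windowsservercore-1809
```

As reported earlier in this issue, this is enough for restarts on the same version but does not rescue the 4.2 to 4.4 case.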

Even though we documented how to install MongoDB "the hard way" as a Windows service, we have a hard time getting our customers to go through all the additional steps when the container has always worked for them. I was okay with a few lines of PowerShell to "make it work" a year ago, but it is very far from funny at this point, as this is now a critical issue for us with real customers on the line.

One thing that always bothered me was the current lack of images based on the ltsc2019 tag, which is not the same as 1809; this would help weed out potential unknowns that we'd rather not learn about. ltsc2019 would be the safest option for us, and we require our customers to use Windows Server 2019 because older versions of Windows Server have even more issues and limitations with Windows containers.

The other thing I noticed is that the Linux containers have a docker-entrypoint.sh, but the Windows containers only have a comment saying maybe there should be a docker-entrypoint.ps1. We are considering making our own Windows container images to try to work around all sorts of issues, starting with building the image on top of ltsc2019 instead of 1809, but while we're at it we could maybe work on getting a docker-entrypoint.ps1 wrapper script just like Linux.

Last but not least, maybe help us put the pressure from MongoDB with regards to issues like microsoft/Windows-Containers#37 that plague Docker for Windows? In the end, their problems are definitely becoming your problems, because as of today, running MongoDB in containers on Windows Server is... annoying at best. Let's just say that it does give one the feeling that they are alone doing it because of such critical issues, and it's not even MongoDB's fault, a lot of it has to do with bugs in Docker for Windows. I just hate that this all falls on you in the end.

Sorry about the rant, let's just shake things up a bit and try to get things moving!

@tianon
Member

tianon commented Dec 8, 2020

> One thing that always bothered me was the current lack of images based on the ltsc2019 tag, which is not the same as 1809; this would help weed out potential unknowns that we'd rather not learn about. ltsc2019 would be the safest option for us, and we require our customers to use Windows Server 2019 because older versions of Windows Server have even more issues and limitations with Windows containers.

That's not exactly true -- as you can see on https://docs.microsoft.com/en-us/virtualization/windowscontainers/deploy-containers/base-image-lifecycle, the "OS build" for both ltsc2019 and 1809 is 17763 (which is what actually matters WRT container compatibility, hence why we track that value -- ltsc2019 can, has, and likely will again change OS build numbers, whereas 1809 should not).

However, this actually gets a lot more interesting if we directly inspect the published images instead of theoretical differences contained in that documentation:

$ wget -qO ltsc2019-manifest-list.json 'https://mcr.microsoft.com/v2/windows/servercore/manifests/ltsc2019' --header 'Accept: application/vnd.docker.distribution.manifest.v2+json' --header 'Accept: application/vnd.docker.distribution.manifest.list.v2+json'
$ wget -qO 1809-manifest-list.json 'https://mcr.microsoft.com/v2/windows/servercore/manifests/1809' --header 'Accept: application/vnd.docker.distribution.manifest.v2+json' --header 'Accept: application/vnd.docker.distribution.manifest.list.v2+json'
$ diff -u ltsc2019-manifest-list.json 1809-manifest-list.json
--- ltsc2019-manifest.json	2020-12-08 12:31:19.067038684 -0800
+++ 1809-manifest.json	2020-12-08 12:32:26.797220871 -0800
@@ -5,7 +5,7 @@
       {
          "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
          "size": 886,
-         "digest": "sha256:77904ce9c47b7b7f4313c0e4d74d9b75ccf2330e911b5839e5256ef932333954",
+         "digest": "sha256:ea59e234726dc50e67b67ec14eb77fc96094dbce61c71f2bda6e82b8caf787aa",
          "platform": {
             "architecture": "amd64",
             "os": "windows",
$ wget -qO ltsc2019-manifest.json 'https://mcr.microsoft.com/v2/windows/servercore/manifests/sha256:77904ce9c47b7b7f4313c0e4d74d9b75ccf2330e911b5839e5256ef932333954' --header 'Accept: application/vnd.docker.distribution.manifest.v2+json' --header 'Accept: application/vnd.docker.distribution.manifest.list.v2+json'
$ wget -qO 1809-manifest.json 'https://mcr.microsoft.com/v2/windows/servercore/manifests/sha256:ea59e234726dc50e67b67ec14eb77fc96094dbce61c71f2bda6e82b8caf787aa' --header 'Accept: application/vnd.docker.distribution.manifest.v2+json' --header 'Accept: application/vnd.docker.distribution.manifest.list.v2+json'
$ # using "jq" to get pretty-printed JSON so the diff is easier to read/see
$ diff -u <(jq . ltsc2019-manifest.json) <(jq . 1809-manifest.json)
--- /dev/fd/63	2020-12-08 12:37:26.661162766 -0800
+++ /dev/fd/62	2020-12-08 12:37:26.661162766 -0800
@@ -3,8 +3,8 @@
   "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
   "config": {
     "mediaType": "application/vnd.docker.container.image.v1+json",
-    "size": 563,
-    "digest": "sha256:c0773f9ce35398a9525c4508b9ece7c17d62d241324c5f9f3b7624f845167803"
+    "size": 559,
+    "digest": "sha256:5c1f582f60a9895cf4d66595d265ec88225efd6a33e90edfd1a2ff161991ce50"
   },
   "layers": [
     {
$ wget -qO ltsc2019-config.json 'https://mcr.microsoft.com/v2/windows/servercore/blobs/sha256:c0773f9ce35398a9525c4508b9ece7c17d62d241324c5f9f3b7624f845167803' --header 'Accept: application/vnd.docker.distribution.manifest.v2+json' --header 'Accept: application/vnd.docker.distribution.manifest.list.v2+json'
$ wget -qO 1809-config.json 'https://mcr.microsoft.com/v2/windows/servercore/blobs/sha256:5c1f582f60a9895cf4d66595d265ec88225efd6a33e90edfd1a2ff161991ce50' --header 'Accept: application/vnd.docker.distribution.manifest.v2+json' --header 'Accept: application/vnd.docker.distribution.manifest.list.v2+json'
$ diff -u <(jq . ltsc2019-config.json) <(jq . 1809-config.json)
--- /dev/fd/63	2020-12-08 12:38:18.907757157 -0800
+++ /dev/fd/62	2020-12-08 12:38:18.907757157 -0800
@@ -10,7 +10,7 @@
     },
     {
       "created": "2020-12-04T02:13:01.8249382+00:00",
-      "created_by": "Install update ltsc2019-amd64"
+      "created_by": "Install update 1809-amd64"
     }
   ],
   "rootfs": {

In conclusion, the full delta between mcr.microsoft.com/windows/servercore:1809 and mcr.microsoft.com/windows/servercore:ltsc2019 is a tiny bit of textual metadata that's only displayed in docker history (the layer checksums and thus image contents are 100% identical).
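
The "layers are identical" part of that conclusion can be checked directly on the downloaded manifests. The following is a sketch with hypothetical function names; it deliberately avoids a jq dependency and assumes the manifest places its config entry before "layers", as in the diff context above (where jq is available, `jq -r '.layers[].digest'` would be the more direct route).

```shell
# Hypothetical helper: print the layer digests of a v2 image manifest file,
# one per line, skipping the config digest that precedes "layers".
layer_digests() {
    tr -d '\n' < "$1" | sed 's/.*"layers"//' | grep -o 'sha256:[0-9a-f]\{64\}'
}

# Exits 0 when both manifests list the same layers in the same order.
# Fails if either file yields no digests at all.
same_layers() {
    local a b
    a=$(layer_digests "$1") && b=$(layer_digests "$2") && [ "$a" = "$b" ]
}

# Usage with the files fetched above:
#   same_layers ltsc2019-manifest.json 1809-manifest.json && echo 'identical layers'
```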

> The other thing I noticed is that the Linux containers have a docker-entrypoint.sh, but the Windows containers only have a comment saying maybe there should be a docker-entrypoint.ps1. We are considering making our own Windows container images to try to work around all sorts of issues, starting with building the image on top of ltsc2019 instead of 1809, but while we're at it we could maybe work on getting a docker-entrypoint.ps1 wrapper script just like Linux.

https://github.com/docker-library/faq#why-isnt-there-a-windows-equivalent-of-docker-entrypointsh

> Last but not least, maybe help us put the pressure from MongoDB with regards to issues like microsoft/Windows-Containers#37 that plague Docker for Windows? In the end, their problems are definitely becoming your problems, because as of today, running MongoDB in containers on Windows Server is... annoying at best. Let's just say that it does give one the feeling that they are alone doing it because of such critical issues, and it's not even MongoDB's fault, a lot of it has to do with bugs in Docker for Windows. I just hate that this all falls on you in the end.

Yeah, it is unfortunate that the bugs fall here -- all we're doing is taking the pre-built packages of MongoDB from MongoDB, Inc and providing them in container images in line with Microsoft's images/recommendations and Docker best practices. 😞

I have spent a great deal of time trying to work around this or figure out what's going on (as have others), and it is pretty clear that there's something happening here beyond us, and at this point even beyond our ability to provide a workaround for in the image itself -- any fix here needs to come from one of those other two parties (most likely Microsoft, as they've already confirmed this is a known issue).

@tianon
Member

tianon commented May 11, 2021

Copying my sentiment from microsoft/Windows-Containers#37 (comment) -- this is fixed in today's updated base layers! 🎉

The official image rebuilds are still in-progress, but I'm going to close this as the underlying problem is resolved (and the updated images will be up Soon).
