
Unrecoverable crash every few days: context deadline exceeded #359

Open
nathang21 opened this issue Dec 6, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@nathang21

  • [x] I have checked the existing issues to avoid duplicates
  • [x] I have redacted any info hashes and content metadata from any logs or screenshots attached to this issue

Describe the bug

Bitmagnet container crashes after some delay, roughly every 1-3 days from what I can tell. It does not restart/recover automatically, seemingly because the exit code is 1 (Docker is particular about restarting only for certain exit codes). I have been having intermittent network stability issues with my ISP, and the crashes seem loosely correlated with that, although they have happened even when I didn't notice any other problems on my network, so I'm not fully convinced that is the trigger/root cause.
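For reference, the Compose restart policies treat exit codes differently; a sketch of the documented options (the full compose file is further below):

```yaml
# Restart-policy values per the Docker Compose specification (sketch only):
services:
  bitmagnet:
    restart: always            # restart on any exit code, unless manually stopped
    # restart: on-failure      # restart only when the exit code is non-zero
    # restart: unless-stopped  # like always, but stays down after `docker stop`
```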

I've attached the raw debug logs from a recent failure. AFAICT there isn't anything sensitive in them, as it's mostly errors, but apologies if I missed anything; happy to edit/redact if needed.
bitmagnet.log

To Reproduce

Steps to reproduce the behavior:

  1. Boot Bitmagnet
  2. Wait for issue to re-occur
  3. View stopped container and inspect logs
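For step 3, the stopped container's exit status and final log lines can be read with standard Docker CLI commands (container name taken from the compose file below):

```shell
# Inspect the stopped container: status, recorded exit code, and last logs.
docker ps -a --filter name=bitmagnet --format '{{.Names}}\t{{.Status}}'
docker inspect -f 'exit code: {{.State.ExitCode}}' bitmagnet
docker logs --tail 200 bitmagnet
```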

Expected behavior

Bitmagnet should remain stable and not crash; if it does crash, ideally it would self-recover.

Environment Information (Required)

  • Bitmagnet version: v0.9.5
  • OS and version: macOS 15.1.1 (24B2091)
  • Browser and version (if issue is with WebUI): Version 131.0.6778.70 (Official Build) (arm64) (not WebUI related)
  • Please specify any config values for which you have overridden the defaults: See docker compose below

Additional context

Bitmagnet was very heavy on disk I/O, and I have plenty of RAM, so I made some tweaks to the Postgres config to prefer RAM over disk I/O in some cases, which has helped a lot with performance on my Synology DS423+ NAS.

Docker Compose:

  bitmagnet:
    container_name: bitmagnet
    image: ghcr.io/bitmagnet-io/bitmagnet:latest
    volumes:
      - /volume2/docker/starr-trash/bitmagnet:/root/.local/share/bitmagnet
    restart: always
    environment:
      - LOG_FILE_ROTATOR_ENABLED=true
      - POSTGRES_HOST=bitmagnet-postgres
      - POSTGRES_PASSWORD=<REDACTED>
      - TMDB_API_KEY=<REDACTED>
      - CLASSIFIER_DELETE_XXX=true
      - DHT_CRAWLER_SCALING_FACTOR=5
      - LOG_LEVEL=debug
    labels:
      - autoheal=true
    shm_size: 1g
    logging:
      driver: json-file
      options:
        max-file: ${DOCKERLOGGING_MAXFILE}
        max-size: ${DOCKERLOGGING_MAXSIZE}
    # logging:
    #   driver: none
    # Ports mapped via VPN
    # ports:
    #   - 3333:3333 # Bitmagnet - API and WebUI
    #   - 3334:3334/tcp # Bitmagnet - BitTorrent
    #   - 3334:3334/udp # Bitmagnet - BitTorrent
    network_mode: service:gluetun
    depends_on:
      gluetun:
        condition: service_healthy # Used by gluetun-healthcheck.sh script.
        restart: true
      bitmagnet-postgres:
        condition: service_healthy
        restart: true
    healthcheck:
      test: "nc -z localhost 9999 || kill 1"
      interval: 1m
      timeout: 1m
      start_period: 300s
    command:
      - worker
      - run
      # Run all workers:
      - --all
      # Or enable individual workers:
      # - --keys=http_server
      # - --keys=queue_server
      # - --keys=dht_crawler

  bitmagnet-postgres:
    image: postgres:16-alpine
    container_name: bitmagnet-postgres
    volumes:
      - /volume2/docker/starr-trash/bitmagnet/postgres:/var/lib/postgresql/data
    ports:
      - "6432:5432"
    shm_size: 3g
    restart: always
    command:
      -c shared_buffers=3GB
      -c work_mem=256MB
      -c maintenance_work_mem=512MB
      -c checkpoint_timeout=30min
      -c checkpoint_completion_target=0.9
      -c wal_buffers=128MB
      -c effective_cache_size=6GB
      -c synchronous_commit=off
      -c autovacuum_vacuum_cost_limit=2000
      -c autovacuum_vacuum_cost_delay=10ms
      -c autovacuum_max_workers=3
      -c autovacuum_naptime=20s
      -c autovacuum_vacuum_scale_factor=0.05
      -c autovacuum_analyze_scale_factor=0.02
      -c temp_file_limit=5GB
      # Risk data loss
      # -c fsync=off
      # -c full_page_writes=off
    # logging:
    #   driver: none
    environment:
      - POSTGRES_PASSWORD=<REDACTED>
      - POSTGRES_DB=bitmagnet
      - PGUSER=postgres
    healthcheck:
      test: ["CMD-SHELL", "pg_isready"]
      interval: 10s
      start_period: 60s
    networks:
      syno-bridge:
        # https://github.com/qdm12/gluetun-wiki/blob/main/setup/inter-containers-networking.md#between-a-gluetun-connected-container-and-another-container
        # Required until fixed: https://github.com/qdm12/gluetun/issues/281
        ipv4_address: <REDACTED>
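Since the bitmagnet service already carries an autoheal=true label, the healthcheck's `kill 1` fallback could instead be delegated to an autoheal companion container that restarts anything marked unhealthy. A sketch, assuming the commonly used willfarrell/autoheal image (not part of the original compose file):

```yaml
  # Sketch: companion that restarts containers whose healthcheck reports
  # unhealthy and which carry the autoheal=true label.
  autoheal:
    image: willfarrell/autoheal
    container_name: autoheal
    restart: always
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```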
@nathang21 nathang21 added the bug Something isn't working label Dec 6, 2024
@nathang21 nathang21 changed the title Short description of bug Unrecoverable crash every few days: context deadline exceeded Dec 6, 2024
@rraymondgh
Contributor

rraymondgh commented Dec 13, 2024

I found running bitmagnet on macOS Catalina with an old version of Docker problematic. I could not update macOS or Docker because the machine is a 2012 Mac mini. I split the Fusion Drive into separate SSD and HDD partitions, then dual-booted into Ubuntu 24.10 with the latest version of Docker. Bitmagnet is now very stable, with no crashes of the Docker daemon. Hence I suspect the Docker version and host OS version are important for stable networking under bitmagnet's crawling network load.

@DerBunteBall

There are various problem reports about Bitmagnet in connection with Docker on Windows or macOS hosts.

The following should always be kept in mind:
Bitmagnet is an app with high I/O requirements in terms of both storage and network.

On Windows and macOS, the usual Docker stacks are mostly a virtualization solution: probably Hyper-V on Windows today, and HyperKit on macOS.

These are already highly optimized hypervisors for those platforms, and the small Linux system running inside them is also highly optimized. However, the hypervisor, the VM's device model (emulated network card, etc.), or a setting in some part of the stack (such as a kernel option in the guest Linux kernel) can lead to problems.
Bitmagnet lives best on physical hardware, on a system isolated for this purpose, ideally equipped with NVMe storage. Older-generation, mid-range, and even mobile CPUs are perfectly adequate; just add 16-32 GB of RAM and you're good to go.

@nathang21
Author

Thanks for chiming in, folks. To be clear, for this issue I am running Docker natively on Linux, not via any hypervisor. I am using a Synology DS423+ NAS (which unfortunately still ships an older Docker daemon, v24.0.2), paired with a modern M.2 NVMe 1 TB SSD and 18 GB of RAM.

This wasn't a problem previously, but I have been running Bitmagnet for almost 6 months, so maybe the increased size of the DB is adding pressure? It is intermittent; I haven't had a crash in about a week.

Honestly, an occasional crash isn't a big deal to me, but I'd like to debug why it was unable to recover on its own and required my manual intervention.

@DerBunteBall

I was assuming macOS, since that is what the first post above lists.

NAS devices are also rather unsuitable for software like Bitmagnet.
On the one hand, the underlying Linux distributions are heavily customized, possibly in ways that are problematic for Bitmagnet; in addition, the bundled Docker versions are often very old.
The hardware can also cause problems here, even with an SSD cache, for two main reasons: the I/O ceiling that NAS devices often have, and the high CPU base load (especially on smaller devices).

At first glance, I would gather from the debug log that the restart does not succeed because it is not the container itself that is terminated; rather, the process is shut down and Docker then stops the container. It could also be that it is simply getting tangled up.

The cause could be situations in which files can no longer be opened. I would suspect this because of the "bad file descriptor" errors; it looks like it can't write to a socket, which causes it to crash at some point.

That wouldn't surprise me on a NAS either. On the one hand, they already have a lot of files open; on the other, they have many small network connections and therefore many open sockets. Keep in mind that sockets on Unix are also files, following the "everything is a file" principle.
I think the problem lies somewhere there: possibly kernel options that influence socket buffer sizes or other socket settings, or other limits that NAS devices often use to keep processes in check.

@nathang21
Author

Ah, apologies, I thought that was for debugging the web app (where I am using macOS in this case).

Regarding NAS devices in general, that is an interesting perspective and does make some sense, although I'm not certain it's what I'm experiencing, based on a few things:

  • To be clear, I'm not using an "SSD cache" but a full SSD volume, where Docker and all the relevant containers and their state live. This is technically the same system as the rest of my NAS with HDDs, but the SSD is used only for containers and other applications, which allows for very performant operation.
  • Regarding socket sizes / file limits, I have run into that issue before with Plex, and I did have to increase the default limits, which has worked well for me. I previously changed my system config to fs.inotify.max_user_watches=524288 and fs.inotify.max_user_instances=1024; perhaps I could continue scaling up if I'm hitting these limits again, but I will need to determine whether that is actually what is happening.
  • I haven't had any other issues running numerous other containers on this device, with varying levels of load across CPU/RAM/disk/network. Overall my system runs quite smoothly, with occasional bursts of activity due to scheduled tasks or general use, but this has never interrupted Bitmagnet previously.
  • Interestingly, it hasn't crashed in about a week (since posting this issue), which is good news, and my network has been stable, so it does seem correlated with that.
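The file-descriptor theory above can be checked before raising any limits further; a diagnostic sketch, assuming a Linux host (e.g. the Synology's SSH shell):

```shell
# Diagnostic sketch (assumes a Linux host): read the current limits that
# govern open files and inotify watches before raising them further.
ulimit -n                            # per-process open-file soft limit
cat /proc/sys/fs/file-nr             # allocated vs. maximum file handles
sysctl fs.inotify.max_user_watches   # inotify watch limit (mentioned above)
sysctl fs.inotify.max_user_instances # inotify instance limit
```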
