Download of basebackup always stalls #359

Open
ilicmilan opened this issue Jul 10, 2019 · 4 comments

@ilicmilan

Hello,

I'm facing an issue where I'm unable to download a basebackup with the pghoard_restore command: the download always stalls.

Restore command:

sudo -u postgres pghoard_restore get-basebackup --config pghoard.json --restore-to-master --overwrite --target-dir /var/lib/pgsql/9.5/data-new/

The appropriate backup is selected, but nothing happens.
ps auxf shows that pghoard_restore creates 9 additional processes, but the download progress stays at 0%; after three 2-minute stall timeouts the restore fails.

Command output:

Found 1 applicable basebackup 

Basebackup                                Backup size    Orig size  Start time          
----------------------------------------  -----------  -----------  --------------------
server-f-postgres-01/basebackup/2019-07-10_12-27_0.00000000.pghoard     13245 MB     35432 MB  2019-07-10T12:27:32Z
    metadata: {'compression-algorithm': 'snappy', 'format': 'pghoard-bb-v2', 'original-file-size': '81920', 'host': 'server-f-postgres-01', 'end-time': '2019-07-10 14:33:12.657815+02:00', 'end-wal-segment': '000000010000001A0000004A', 'pg-version': '90518', 'start-wal-segment': '000000010000001A00000048', 'total-size-plain': '37153730560', 'total-size-enc': '13888641735'}

Selecting 'server-f-postgres-01/basebackup/2019-07-10_12-27_0.00000000.pghoard' for restore
2019-07-10 15:20:34,941 BasebackupFetcher       MainThread      ERROR   Download stalled for 120.43377648199385 seconds, aborting downloaders
2019-07-10 15:22:35,674 BasebackupFetcher       MainThread      ERROR   Download stalled for 120.44614975301374 seconds, aborting downloaders
2019-07-10 15:24:36,392 BasebackupFetcher       MainThread      ERROR   Download stalled for 120.47685114300111 seconds, aborting downloaders
2019-07-10 15:24:36,612 BasebackupFetcher       MainThread      ERROR   Download stalled despite retries, aborting
FATAL: RestoreError: Backup download/extraction failed with 1 errors

pghoard.json:

{
    "backup_location": "./metadata",
    "backup_sites": {
        "server-f-postgres-01": {
            "active_backup_mode": "pg_receivexlog",
            "basebackup_mode": "local-tar",
            "basebackup_chunks_in_progress": 5,
            "basebackup_chunk_size": 2147483648,
            "basebackup_hour": 5,
            "basebackup_interval_hours": 24,
            "basebackup_minute": 40,
            "pg_data_directory": "/var/lib/pgsql/9.5/data",
            "nodes": [
                {
                    "host": "127.0.0.1",
                    "user": "postgres",
                    "password": "secret",
                    "port": 5432
                }
            ],
            "object_storage": {
                "storage_type": "google",
                "project_id": "postgres-dev",
                "bucket_name": "test-pghoard"
            }
        }
    }
}
@eriveltonvichroski

Hi,
I have the same problem with pghoard 2.1.0. Any tips on how to solve it?

2020-10-16 11:00:09,131 BasebackupFetcher MainThread ERROR Download stalled for 120.13373475382105 seconds, aborting downloader

Thanks.

@rikonen
Collaborator

rikonen commented Oct 19, 2020

There shouldn't be any generic issue with this: we've done a very large number of restorations across all major cloud providers and haven't seen this. If it's reproducible, you should check what's happening at the network level.
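
One way to check that outside pghoard is to reproduce the raw chunked download with the same googleapiclient stack pghoard uses. A minimal sketch, assuming google-api-python-client is installed and application default credentials can read the bucket; the object name is a placeholder:

import io

from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

# Reproduce pghoard's transfer outside pghoard to see whether a plain
# 50 MB chunked download from GCS stalls on this host/network.
service = build("storage", "v1")  # uses application default credentials
request = service.objects().get_media(bucket="test-pghoard", object="path/to/backup/chunk")
buf = io.BytesIO()
downloader = MediaIoBaseDownload(buf, request, chunksize=1024 * 1024 * 50)
done = False
while not done:
    status, done = downloader.next_chunk()
    print(f"progress: {int(status.progress() * 100)}%")

If this stalls the same way, the problem sits between the host and GCS rather than in pghoard itself.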

@eriveltonvichroski

Hi,

At this line: https://github.com/aiven/pghoard/blob/master/pghoard/rohmu/object_storage/google.py#L60

# googleapiclient download performs some 3-4 times better with 50 MB chunk size than 5 MB chunk size;
# but decrypting/decompressing big chunks needs a lot of memory so use smaller chunks on systems with less
# than 2 GB RAM
DOWNLOAD_CHUNK_SIZE = 1024 * 1024 * 5 if get_total_memory() < 2048 else 1024 * 1024 * 50
UPLOAD_CHUNK_SIZE = 1024 * 1024 * 5

Debugging, including on a machine/network inside GCP itself, I realized that the problem occurs when the machine has > 2 GB of RAM, because it takes the else branch of the condition:

DOWNLOAD_CHUNK_SIZE = 1024 * 1024 * 5 if get_total_memory() < 2048 else 1024 * 1024 * 50

That is, the problem occurs when DOWNLOAD_CHUNK_SIZE = 50 MB.

First I tested with DOWNLOAD_CHUNK_SIZE = 1024 * 1024 * 5 and the download succeeded!

The maximum value at which the download still works is DOWNLOAD_CHUNK_SIZE = 1024 * 1024 * 25, that is, 25 MB.
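
As a stopgap, a sketch of the workaround I used: a small wrapper script (not part of pghoard) that overrides the module constant before handing control to the normal pghoard_restore entry point. This only takes effect because DOWNLOAD_CHUNK_SIZE is a module-level constant read at call time in the linked google.py.

# restore_small_chunks.py -- workaround sketch, not part of pghoard:
# cap the Google download chunk size at 25 MB, then run the normal
# pghoard_restore entry point (pghoard.restore:main).
import sys

from pghoard.rohmu.object_storage import google
from pghoard.restore import main

# 25 MB was the largest chunk size that downloaded successfully here.
google.DOWNLOAD_CHUNK_SIZE = 1024 * 1024 * 25

if __name__ == "__main__":
    sys.exit(main())

Run it with the same arguments as pghoard_restore, e.g. python restore_small_chunks.py get-basebackup --config pghoard.json --target-dir /var/lib/pgsql/9.5/data-new/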

Is there an automated test that runs on a machine with > 2 GB of RAM?

Cheers

@rikonen
Collaborator

rikonen commented Oct 20, 2020

Is there an automated test that runs on a machine with > 2 GB of RAM?

Yes.

It would probably make sense to add an optional configuration parameter for setting the chunk size. 50 MiB performs better than 5 MiB, so it's preferable when download performance matters, and as mentioned we haven't seen issues with it. Still, 50 MiB is a fairly large chunk size, and allowing a smaller one via config would be reasonable, especially on machines that are otherwise memory constrained.
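
A minimal sketch of what that could look like; the key name "download_chunk_size" and the helper below are hypothetical, not something pghoard reads today:

# Hypothetical sketch of the optional parameter suggested above; the
# "download_chunk_size" key does not currently exist in pghoard.
DEFAULT_DOWNLOAD_CHUNK_SIZE = 1024 * 1024 * 50  # current default on >= 2 GB RAM hosts

def resolve_download_chunk_size(object_storage_config):
    """Return the download chunk size in bytes, falling back to the default."""
    return int(object_storage_config.get("download_chunk_size", DEFAULT_DOWNLOAD_CHUNK_SIZE))

# Example object_storage section using the hypothetical key (25 MB):
# "object_storage": {
#     "storage_type": "google",
#     "project_id": "postgres-dev",
#     "bucket_name": "test-pghoard",
#     "download_chunk_size": 26214400
# }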
