Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove timeouts that are not proportional to the pages #327

Closed
apyrgio opened this issue Jan 30, 2023 · 2 comments · Fixed by #332
Closed

Remove timeouts that are not proportional to the pages #327

apyrgio opened this issue Jan 30, 2023 · 2 comments · Fixed by #332
Assignees
Labels
enhancement New feature or request timeout Dangerzone Times Out
Milestone

Comments

@apyrgio
Copy link
Contributor

apyrgio commented Jan 30, 2023

The Dangerzone container currently has two types of timeouts:

# timeout in seconds for any single subprocess
DEFAULT_TIMEOUT: float = 120
# timeout in seconds for compressing a single page of the final document
COMPRESSION_TIMEOUT: float = 10

  • DEFAULT_TIMEOUT is a timeout that applies to any command that runs in a Dangerzone container.
  • COMPRESSION_TIMEOUT is a timeout that applies to a specific command (compressing a PDF) that is proportional to the number of pages in the PDF.

The second type of timeout should be our end goal. The reason is that the first type of timeout is page-agnostic, meaning that operations on large PDFs may surpass it. There are several issues that exemplify this problem:

The reason we're still stuck with the first type of timeout is that (until recently) we didn't have a way to know the number of pages of a PDF beforehand. Now that we do, we can do the following:

  1. Grab the number of pages in a document as soon as we can.
    • Generally, this means after we have a PDF document we can work with.
    • This means that operations on files that are not PDFs (e.g., .docx, .xlsx) will have to use a page-agnostic timeout.
    • For large images or non-PDF documents, maybe we can have a per-MB timeout.
  2. Settle on a per-page timeout that is sufficient for all of the PDF operations that we use. Currently this seems to be 10 seconds.
  3. Replace all global timeouts with a pages x per page timeout.
  4. Add a CLI command that will allow users to change this proportional timeout to something bigger. This should act as an escape hatch, so that users won't need to re-build the image, if they need to change it.
    • Since it's an escape hatch, it will not be visible in the UI for now. The end goal after all is to have something sensible that all users can work with.
  5. Add a CI test that converts a very large PDF (e.g, 1000 pages). The rationale is that we've missed these cases, because for the time being we didn't have such tests. Bonus points if it fails in the main branch but works on our new implementation.
@apyrgio apyrgio self-assigned this Jan 30, 2023
@apyrgio apyrgio added enhancement New feature or request timeout Dangerzone Times Out labels Jan 30, 2023
@apyrgio apyrgio added this to the 0.4.1 milestone Jan 30, 2023
@deeplow
Copy link
Contributor

deeplow commented Jan 30, 2023

Timeout for office documents

This may be the only thing that's out of the scope of this issue. I noticed that the there are two situations where timeouts cannot be proportional to the number of pages. This is because they happen before we have a PDF document and thus before we can even calculate its size. Images should be fine but office documents may be normally large

run_command(
args,
error_message="Conversion to PDF with LibreOffice failed",
timeout_message=f"Error converting document to PDF, LibreOffice timed out after {DEFAULT_TIMEOUT} seconds",

I tested converting a 500 slide document and it went past a minute on my system. I don't think libreoffice --headless provides a way to keep track of the progress of the conversion. Maybe some side-effect of the conversion (like tempfile or logs) can help us detect that. But I don't think that would be super wise.

Another way to solve this it to make it proportional document's file size or just simply remove the timeout here.

@apyrgio
Copy link
Contributor Author

apyrgio commented Jan 31, 2023

Another way to solve this it to make it proportional document's file size or just simply remove the timeout here.

I agree, that's why I suggested above that we can have a per-MB timeout. But we'll see...

apyrgio added a commit that referenced this issue Feb 2, 2023
Introduce proportional timeouts in the container code, where the
conversion logic runs.

Previously, we had a single timeout for each command (120 seconds),
which didn't scale well either with the number of pages in a document,
or with the size of the document.

In this commit, we look into each operation, and we're trying to figure
out the following:

1. What's the number of pages we will operate on?
2. How large is the document?

Knowing the above, we can break down a command into multiple operations,
at least conceptually. Having a number of operations and a sane timeout
value per operation (10 seconds), we can multiply those and reach to a
timeout that fits the command better.

Refs #327
apyrgio added a commit that referenced this issue Feb 15, 2023
Introduce proportional timeouts in the container code, where the
conversion logic runs.

Previously, we had a single timeout for each command (120 seconds),
which didn't scale well either with the number of pages in a document,
or with the size of the document.

In this commit, we look into each operation, and we're trying to figure
out the following:

1. What's the number of pages we will operate on?
2. How large is the document?

Knowing the above, we can break down a command into multiple operations,
at least conceptually. Having a number of operations and a sane timeout
value per operation (10 seconds), we can multiply those and reach to a
timeout that fits the command better.

Refs #327
apyrgio added a commit that referenced this issue Feb 15, 2023
Introduce proportional timeouts in the container code, where the
conversion logic runs.

Previously, we had a single timeout for each command (120 seconds),
which didn't scale well either with the number of pages in a document,
or with the size of the document.

In this commit, we look into each operation, and we're trying to figure
out the following:

1. What's the number of pages we will operate on?
2. How large is the document?

Knowing the above, we can break down a command into multiple operations,
at least conceptually. Having a number of operations and a sane timeout
value per operation (10 seconds), we can multiply those and reach to a
timeout that fits the command better.

Refs #327
apyrgio added a commit that referenced this issue Feb 15, 2023
Introduce proportional timeouts in the container code, where the
conversion logic runs.

Previously, we had a single timeout for each command (120 seconds),
which didn't scale well either with the number of pages in a document,
or with the size of the document.

In this commit, we look into each operation, and we're trying to figure
out the following:

1. What's the number of pages we will operate on?
2. How large is the document?

Knowing the above, we can break down a command into multiple operations,
at least conceptually. Having a number of operations and a sane timeout
value per operation (10 seconds), we can multiply those and reach to a
timeout that fits the command better.

Refs #327
apyrgio added a commit that referenced this issue Feb 15, 2023
Introduce proportional timeouts in the container code, where the
conversion logic runs.

Previously, we had a single timeout for each command (120 seconds),
which didn't scale well either with the number of pages in a document,
or with the size of the document.

In this commit, we look into each operation, and we're trying to figure
out the following:

1. What's the number of pages we will operate on?
2. How large is the document?

Knowing the above, we can break down a command into multiple operations,
at least conceptually. Having a number of operations and a sane timeout
value per operation (10 seconds), we can multiply those and reach to a
timeout that fits the command better.

Fixes #306
Fixes #314
Refs #327
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request timeout Dangerzone Times Out
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants