Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to generate temporary downloadable link for task logs(stored on cloud storage) on UI #44753

Open
2 tasks done
rawwar opened this issue Dec 7, 2024 · 5 comments
Open
2 tasks done
Assignees
Labels
area:logging area:UI Related to UI/UX. For Frontend Developers. kind:feature Feature Requests

Comments

@rawwar
Copy link
Collaborator

rawwar commented Dec 7, 2024

Description

When a task generates many logs, the web server usually gets OOM Killed while trying to render the task logs. For this situation, we can give the user a downloadable link to the file on UI. If the task logs are located on cloud storage, we can generate the following:

  1. AWS - Presigned URLs
  2. GCP - Signed URL's
  3. Azure - Shared Access Signature

Use case/motivation

To render large task log files

Related issues

#33625

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@rawwar rawwar added kind:feature Feature Requests needs-triage label for new issues that we didn't triage yet labels Dec 7, 2024
@rawwar rawwar self-assigned this Dec 7, 2024
@dosubot dosubot bot added area:logging area:UI Related to UI/UX. For Frontend Developers. labels Dec 7, 2024
@rawwar rawwar changed the title Add ability to generate downloadable link for task logs on UI Add ability to generate downloadable link for task logs(stored on cloud storage) on UI Dec 7, 2024
@rawwar rawwar changed the title Add ability to generate downloadable link for task logs(stored on cloud storage) on UI Add ability to generate temporary downloadable link for task logs(stored on cloud storage) on UI Dec 7, 2024
@vatsrahul1001 vatsrahul1001 removed the needs-triage label for new issues that we didn't triage yet label Dec 7, 2024
@jscheffl
Copy link
Contributor

jscheffl commented Dec 7, 2024

I have also espcially seen that the webserver gets OOM even if you download because the FileTaskHandler tries to sort+merge different log sources. One important thing is not only the download but also the FileTaskHandler must 1:1 stream the logs from the backend, else it will go OOM as well.

This included when I took a look last time... why actually log sorting and merging is implemented in general I doubt the usefulness because I also had situations where log sorting by timestamp gave "strange" results. In my view logs should not be modified/sorted/merged when served from webserver. This would also remove the OOM problem in general... whereas I have seen situations as well where the browser/client "dies" in rendering. So a limit in logs provided to browser might be a important feature as well (but please not like Github is doing it)

@potiuk
Copy link
Member

potiuk commented Dec 7, 2024

Mostly agree with @jscheffl -> but I still think merging logs might be useful in some cases, though the "naive" version it is done now should be either limited to certain log size that should be able to fit in memory or fixed to support arbitrary log size. Loading whole log to memory is generally bad idea (but OK if we can confirm they will fit in memory).

There are some algorithms that could be used to do it "well" even when the logs are huge, but they require much more sophisticated behaviour and local file storage and likely are not suitable to run in an "API" call. So I am not sure if it at all worth doing it (airflow is NOT a sophisticated logging solution) - but if we have this "download full log" (and even there we could download zipped logs from several sources without merging them) - that could be a useful counterpart for merging for big files.

There are two options how to do it, I think:

  1. if the files are huge (and we could set arbitrary value here), only show the original task log (streaming it) and add a link to "download" the .zip file where you see the missing logs as well
  2. if the files are huge just download the "max" part of the files - perform merge and add a note that the logs are incomplete and that you should download the whole .zip content of the several logs

Generally, yes, I think we should implement it (and prevent the OOM from happening).

@jason810496
Copy link
Contributor

jason810496 commented Dec 8, 2024

Hi, after tracing down related GitHub issues and attempting to reproduce the OOM issue, I have some questions and ideas about related improvements.

Related Issues and Context

Ideas for Related Improvements

I’d be happy to take on these improvements if there are no objections!

  1. Expose log_pos and end_of_log as Query Parameters:

    • Implement in Flask and backport to v2.10.x.
    • Implement in FastAPI. ( new UI will have to implement the pagination logic as well )
  2. Pagination for Legacy UI:

  3. Refactor FileTaskHandler:

    • Address the root cause of OOM directly at the handler level.

Though I couldn't reliably reproduce OOM on the webserver, the browser tab crashing due to large logs remains an issue.

Question

How can I reliably reproduce the OOM issue when fetching large task logs?
I’ve tried observing memory usage with Memray but found it challenging to reproduce.

  • Steps Taken:
    • Used the DAG mentioned in PR #29390.
    • Triggered runs multiple times and managed to reproduce OOM only once.
    • Used breeze start-airflow --python 3.9 --backend sqlite as the environment.

Any tips on how to consistently reproduce the issue? Or is there a specific environment setup recommended for this?

@rawwar
Copy link
Collaborator Author

rawwar commented Dec 8, 2024

@jason810496 , I think you need to put resource limits on each airflow component. On managed services, the webserver is usually limited, On Astro its about 1 vcpu and 2GiB memory.

@jason810496
Copy link
Contributor

@jason810496 , I think you need to put resource limits on each airflow component. On managed services, the webserver is usually limited, On Astro its about 1 vcpu and 2GiB memory.

Thanks, I will try it with docker compose !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:logging area:UI Related to UI/UX. For Frontend Developers. kind:feature Feature Requests
Projects
None yet
Development

No branches or pull requests

5 participants