
Process workflow logs in batches #4045

Merged: 15 commits merged into woodpecker-ci:main on Sep 18, 2024

Conversation

@hg (Contributor) commented Aug 16, 2024

Hopefully this removes the first bottleneck.

The benchmark mentioned in the issue now finishes in 2 seconds because its output fits into one batch. Anything that spills over into more batches hangs waiting for the server to write out the previous ones.
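Roughly, the batching on the agent side works like the following minimal sketch (the names, batch size, and flush interval are illustrative only, not the actual implementation):

package logbatch

import "time"

// LogEntry stands in for the log record sent from agent to server;
// the fields are illustrative only.
type LogEntry struct {
	Line int
	Data []byte
}

// collectBatches drains entries from in and flushes them via sendBatch,
// either when the batch reaches maxBatchSize or when flushInterval elapses,
// whichever comes first. It returns once in is closed and the final
// (partial) batch has been flushed.
func collectBatches(in <-chan *LogEntry, sendBatch func([]*LogEntry) error,
	maxBatchSize int, flushInterval time.Duration) error {
	batch := make([]*LogEntry, 0, maxBatchSize)
	ticker := time.NewTicker(flushInterval)
	defer ticker.Stop()

	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		err := sendBatch(batch)
		batch = make([]*LogEntry, 0, maxBatchSize) // fresh slice; sendBatch may keep the old one
		return err
	}

	for {
		select {
		case entry, ok := <-in:
			if !ok { // input closed: write out whatever is left
				return flush()
			}
			batch = append(batch, entry)
			if len(batch) >= maxBatchSize {
				if err := flush(); err != nil {
					return err
				}
			}
		case <-ticker.C:
			if err := flush(); err != nil {
				return err
			}
		}
	}
}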

edit: original description, not relevant anymore.

Closes #3999
Ref: #2072
(Closes #2064)

@6543 (Member) commented Aug 16, 2024

In itself it looks good; we just need to:

  • make sure the last logs are committed on graceful stop
  • make sure line numbers are still working correctly

@hg (Contributor, Author) commented Aug 16, 2024

Thank you for the review. Line numbers aren't affected as log records are simply passed on unchanged in the same order as they were previously.

The queue should indeed be drained on shutdown, thanks for catching this. It should be fixed now, as far as the gRPC connection lifetime permits (I did not change the connection handling and don't think this PR should touch it).
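To illustrate the drain-on-shutdown behaviour, a rough sketch in the same vein as above (hypothetical names, reusing the LogEntry type from the earlier sketch; not the actual code):

package logbatch

import "context"

// drainOnShutdown flushes any batches still sitting in the queue once a
// graceful stop has been requested, then lets the agent exit.
func drainOnShutdown(ctx context.Context, queue <-chan []*LogEntry,
	flush func([]*LogEntry) error) {
	<-ctx.Done() // graceful stop requested
	for {
		select {
		case batch := <-queue:
			if err := flush(batch); err != nil {
				return // connection already gone; nothing more we can do
			}
		default:
			return // queue is empty, safe to exit
		}
	}
}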

@qwerty287 changed the title from "Send agent → server logs in batches (#3999)" to "Send agent → server logs in batches" on Aug 16, 2024
@hg (Contributor, Author) commented Aug 17, 2024

The second commit depends on the first one and is more of a demo at the moment, to show that the protocol breakage is worth it.

This job with 1 million lines of output:

when:
  event: push

steps:
  logs:
    image: ubuntu:24.04
    commands:
      - seq 1 1000000

finishes in 18-20 seconds when writing to disk, and in 5.5 minutes when writing to PostgreSQL.

IIRC, Drone shows better performance for the latter case by saving the entire massive log blob as one record. Probably not something the project is interested in.

@hg changed the title from "Send agent → server logs in batches" to "Process workflow logs in batches" on Aug 17, 2024
@6543 (Member) commented Aug 17, 2024

Yes, Drone lets the agent stream the log, uses that stream only for live logs, and then at the end sends the whole log again as a blob to be stored. We want a single stream that is both stored and viewed at the same time, as that ensures that what you see is what is actually happening, and that you can later verify the stored logs match what was seen.

In the end, your pull request finishes what I tested and had in mind in #2072 but never got around to implementing.

@6543 (Member) commented Aug 17, 2024

Well, it does not enhance the log-storing part... but as you wrote, that's for the next pull request.

@hg (Contributor, Author) commented Aug 17, 2024

"line numbers are still working correctly"

I misunderstood this: the log lines were indeed getting reordered, but only for live-streaming web clients. It should be fixed now, although with my limited Go knowledge I only came up with a solution that would block in Write() whenever any subscriber's log channel is full. The blocking is avoided by dropping logs to that subscriber on overflow, which might leave gaps in its live view.

While this shouldn't happen much in practice (each channel buffers up to 100 seconds of logs, or up to about 25 megabytes of text, whichever limit is hit first), it's still a limitation that can probably be removed. If you're not happy with it, I'll explore some more.
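A minimal sketch of the drop-on-overflow behaviour described above (hypothetical names, same sketch package as in the earlier comments; not the actual server code):

package logbatch

// publish fans one batch of entries out to the live-streaming subscribers.
// A full subscriber channel means that client is not keeping up; instead of
// blocking Write() for everyone, the entries for that subscriber are dropped,
// which may leave gaps in its live view only (stored logs are unaffected).
func publish(entries []*LogEntry, subscribers []chan *LogEntry) {
	for _, sub := range subscribers {
		for _, e := range entries {
			select {
			case sub <- e: // subscriber keeps up
			default: // buffer full: drop rather than block
			}
		}
	}
}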

"it does not enhance the log-storing part"

It solves the problem for file storage, which is likely already about as fast as it's going to get; less so for the database. It no longer takes hours, but saving only 12 megabytes of logs per minute is still not ideal. To be honest, I'm not sure what else can be done about it, besides coarsening the granularity from "1 log line = 1 db row" to "full logs for 1 step = 1 db row" (like Drone does), or "all log lines printed within the same second = 1 db row"; a sketch of the latter idea follows.
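Roughly, that last idea would amount to something like this hypothetical helper (the timeOf callback stands in for however the real entries carry their timestamps):

package logbatch

import "time"

// groupBySecond buckets entries by their timestamp truncated to one second;
// each bucket could then be stored as a single database row.
func groupBySecond(entries []*LogEntry, timeOf func(*LogEntry) time.Time) map[int64][]*LogEntry {
	buckets := make(map[int64][]*LogEntry)
	for _, e := range entries {
		sec := timeOf(e).Truncate(time.Second).Unix()
		buckets[sec] = append(buckets[sec], e)
	}
	return buckets
}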

@anbraten (Member) commented

Speeding up the saving of log entries to the database could be done by inserting batches directly (https://xorm.io/docs/chapter-04/readme/) and, if still necessary, by dropping the auto-incremented primary id and replacing it with something like a UUID, and so on.

However, before adjusting any further code parts, it would be awesome to do some profiling to determine the actual bottlenecks, so we don't complicate the code in unrelated parts.
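For reference, a batched insert with xorm might look roughly like this; the row type and session handling are illustrative stand-ins, not the actual Woodpecker schema:

package logstore

import "xorm.io/xorm"

// logEntryRow is a stand-in for the real log_entries model.
type logEntryRow struct {
	ID     int64 `xorm:"pk autoincr"`
	StepID int64
	Line   int
	Data   []byte
}

// insertBatch writes a whole batch in a single statement: xorm's Insert
// accepts a slice and generates a multi-row INSERT, avoiding one round
// trip per log line.
func insertBatch(sess *xorm.Session, rows []*logEntryRow) error {
	if len(rows) == 0 {
		return nil
	}
	_, err := sess.Insert(&rows)
	return err
}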

@hg (Contributor, Author) commented Aug 19, 2024

Sure, records are already saved in batches, regardless of which type of storage is used. Everything except the database storage is now relatively efficient (with file storage the server can process ~50k log rows/second, compared to ~100-150 rows/second on 2.7.0/next). Fixing the database path requires changes to the storage format (i.e. the log_entries table), which is beyond the scope of this PR.

(Resolved review threads: agent/rpc/client_grpc.go, server/api/stream.go, server/logging/log.go, server/services/log/file/file.go)

@6543 (Member) commented Sep 5, 2024

If you need help resolving the conflict, just ask :)

@6543 added the build_pr_images label ("If set, the CI will build images for this PR and push to Dockerhub") on Sep 5, 2024
@hg (Contributor, Author) commented Sep 7, 2024

No problem, I just haven't had the time lately.

FWIW, I've been using this in production for the past few weeks, and it at least works well… the part that pushes updates to web clients is less than ideal, though.

(Resolved review threads: server/grpc/rpc.go, server/logging/log_test.go)
@6543 (Member) commented Sep 16, 2024

Besides these two nits it looks good to go 🎉

@6543 merged commit 276b279 into woodpecker-ci:main on Sep 18, 2024
6 of 7 checks passed
@6543 (Member) commented Sep 18, 2024

Thanks for the awesome work!!!

@woodpecker-bot mentioned this pull request on Sep 18, 2024 (1 task)
@6543 removed the build_pr_images label on Sep 18, 2024
@6543 (Member) commented Sep 26, 2024

@hg if you want, you could join the maintainers group and/or get some swag :)

@hg (Contributor, Author) commented Sep 26, 2024

@6543

"you could join the maintainers group"

Not sure if I've earned it yet, but if other maintainers don't mind, why not. I don't have a lot of time right now, but I should be able to start working on the project in a week or so.

As for the swag, I live very far away from both the US and Europe, and shipping will cost more than the swag itself, but thank you for the offer.

@6543 (Member) commented Sep 27, 2024

In that case, you might DM me via Matrix (https://matrix.to/#/@marddl:obermui.de) if you find a timeslot, so I can onboard you.

If you don't have an account, just use matrix.org to create one.

Labels: agent, enhancement (improve existing features), server
Projects: None yet
Development: Successfully merging this pull request may close these issues: "Inefficient handling of verbose log output", "woodpecker slowed down"
5 participants