
[Engine] Evaluation of ZSTD (Z-standard) compression algorithm for log data #19

Open
Superskyyy opened this issue Sep 11, 2022 · 7 comments

@Superskyyy
Member

Superskyyy commented Sep 11, 2022

The AIOps engine will receive a large amount of log data from SkyWalking, and we decided to use a Redis stream as the buffer before stream processing. One noticeable issue is that standard zlib cannot compress logs well when they arrive one by one (each small message gives the compressor too little context to exploit), per how-compression-algorithm-works; this costs extra memory/disk.

Note that we delete logs from the stream immediately after processing, but it is still worth compressing them to save network bandwidth and avoid overloading Redis.

So here comes ZSTD, which can help our flow in two ways: (I) simply replacing zlib with ZSTD to roughly double the average compression speed; (II) using its dictionary compressor, i.e. training on a small sample batch of logs and then using that knowledge to further boost compression, which could save additional memory/disk. (TODO: evaluate how to run the learning phase. Do we train one dictionary per service? One unified dictionary? Retrain periodically? etc.)

Some public discussions that support its feasibility:
https://groups.google.com/g/redis-db/c/slk-c33EZ7U/m/tx81gCMDDQAJ - adoption case
http://facebook.github.io/zstd/ - performance comparison
https://github.com/animalize/pyzstd - target python lib for implementation
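
As a rough illustration of direction (II), here is a minimal pyzstd sketch; the sample log lines and the 100 KiB dictionary size are made-up placeholders, not values from the experiments below:

```python
import pyzstd

# Placeholder samples; in practice these would be real log lines from one service.
samples = [f"service-a INFO request {i} handled in 3ms".encode() for i in range(1000)]

# Direction (II): learn a shared dictionary from a small batch of logs.
zdict = pyzstd.train_dict(samples, 100 * 1024)  # dictionary size: 100 KiB

line = b"service-a INFO request 424242 handled in 3ms"
compressed = pyzstd.compress(line, 3, zdict)      # per-message, dictionary-assisted
restored = pyzstd.decompress(compressed, zdict)   # consumers need the same dictionary
assert restored == line
```

Note that the same trained dictionary has to be available to every consumer that decompresses the stream, otherwise decompression fails.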

=======================================
Initial experimentation results (suggestions are welcome):

The results below show that ZSTD with a dictionary trained on a very small sample of log data from the same service (the first 1k logs; increasing to 5k does not help) saves about 33% more memory/disk when storing the remaining 500k logs.

(Further experiments are needed to see if this is generally applicable.)
An additional idea: if we compose a good dataset that represents what a "normal" log looks like, it could serve as universal training data, and the compression ratio could be pushed further.

Note: my Docker Redis bandwidth is slow.

| Compressor | Size of 500k logs | Time to send 500k messages (batch 2000) | Memory used in Redis keys |
| --- | --- | --- | --- |
| zlib | 86.24 MB | 12.09 s | 92 MB |
| ZSTD with dict training | 54.72 MB | 8.13 s | 58 MB |
| ZSTD basic compressor (default level) | 88.29 MB | 9.86 s | not recorded |
| ZSTD rich-memory compressor (default level, slightly lower compression ratio) | 88.39 MB | 9.41 s | not recorded |

Dictionary training on the first 1000 log samples took ~0.061 s (func:train_zstd).
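
For reference, a minimal sketch of how a benchmark like the one above could be reproduced with redis-py and pyzstd; the stream name, batch size, sample data, and compression level are illustrative assumptions, not the exact script behind the numbers above:

```python
import time

import pyzstd
import redis

# Placeholder corpus standing in for the 500k real log messages.
logs = [f"service-a INFO request {i} handled in 3ms".encode() for i in range(500_000)]

# Train the dictionary on the first 1000 samples, as in the experiment above.
zdict = pyzstd.train_dict(logs[:1000], 100 * 1024)

r = redis.Redis(host="localhost", port=6379)
stream, batch = "aiops:logs", 2000

start = time.time()
total = 0
pipe = r.pipeline(transaction=False)
for i, line in enumerate(logs, 1):
    payload = pyzstd.compress(line, 3, zdict)  # dictionary-assisted, level 3
    total += len(payload)
    pipe.xadd(stream, {"log": payload})
    if i % batch == 0:
        pipe.execute()  # flush one batch of XADD commands
pipe.execute()

print(f"size of logs: {total / 1e6:.2f} MB")
print(f"time to send 500k messages with batch {batch}: {time.time() - start:.2f} s")
```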

@Superskyyy Superskyyy added Engine The work is on the engine side Core Core functionality that impacts the engine design labels Sep 11, 2022
@Superskyyy Superskyyy self-assigned this Sep 11, 2022
@wu-sheng
Member

Notice: Redis is not allowed as a dependency in the ASF due to its license.
It is OK to choose it for now.

@Superskyyy
Member Author

Superskyyy commented Sep 11, 2022

> Notice: Redis is not allowed as a dependency in the ASF due to its license.
> It is OK to choose it for now.

I checked: Redis core itself is BSD-3-Clause, and we do not use any extensions/modules that contain code under their RSAL license. Would that still be a problem? I'm a bit confused about these things and hope to learn more. Also, in skywalking-python we have a docker-compose.yaml that deploys Redis during tests. Does that mean Redis can be used in dev and testing as long as the final release artifact doesn't involve it?

In the future we could switch to shipping with kvrocks, but unfortunately it doesn't yet fully support the stream consumer-group commands that we heavily rely on.

@wu-sheng
Member

Are you only using Redis core? Many modules would be AGPL, or even Commons Clause.

I didn't check the features you are going to use, so this is just a reminder.

Also, you mentioned it works as a buffer, which is usually a queue server's role; why did you choose a Redis queue?

@Superskyyy
Member Author

Superskyyy commented Sep 12, 2022

> Are you only using Redis core? Many modules would be AGPL, or even Commons Clause.
>
> I didn't check the features you are going to use, so this is just a reminder.
>
> Also, you mentioned it works as a buffer, which is usually a queue server's role; why did you choose a Redis queue?

Thanks for the clarification! I just rechecked, and it is strictly Redis core only; the screenshot below shows the stream engine inside it. I don't plan to use anything beyond core.
[screenshot: Redis core source showing the stream engine]

There are two main reasons why I chose Redis over a full-size MQ:

  1. We also use Redis to store machine-learning model snapshots and other metadata, so reusing it avoids introducing another dependency; a full MQ would be too much for a secondary system (the AIOps engine) of a secondary system (SkyWalking).
  2. I find Redis Streams provide the same functionality and speed that Kafka would offer for our use case, but they are easier to work with and maintain than a full MQ (a sketch of the consumer-group pattern follows below).

I plan to add support for queue-based storage (Kafka) in the long run; for now, Redis Streams work best for us.
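
For context, this is the consumer-group pattern we rely on, sketched with redis-py; the stream, group, and consumer names are illustrative:

```python
import redis

r = redis.Redis()
stream, group = "aiops:logs", "aiops-engine"

# Create the consumer group once; MKSTREAM creates the stream if it does not exist.
try:
    r.xgroup_create(stream, group, id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

# Each worker reads its own share of new entries, Kafka-consumer-group style.
for _, messages in r.xreadgroup(group, "worker-1", {stream: ">"}, count=100, block=1000):
    for msg_id, fields in messages:
        payload = fields[b"log"]          # compressed log bytes
        # ... decompress and feed into the stream-processing pipeline ...
        r.xack(stream, group, msg_id)     # acknowledge; the entry can then be trimmed
```

Deleting processed entries, as described in the issue body, would then be an XDEL or XTRIM after the acknowledgement.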

@wu-sheng
Member

OK, like I said, for now even an AGPL module is fine, until you want to move this into the ASF.

@Superskyyy
Member Author

> OK, like I said, for now even an AGPL module is fine, until you want to move this into the ASF.

Understood, thank you!

@Superskyyy
Member Author

TODO: implement a self-optimizer that monitors the compression ratio as a metric; if it degrades significantly, retrain the dictionary and propagate it to each consumer to restore compression performance.
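
A rough sketch of what that self-optimizer might look like; the thresholds, window size, and retraining policy are assumptions for illustration, not a settled design:

```python
import pyzstd


class DictSelfOptimizer:
    """Monitor the achieved compression ratio and retrain the dictionary when it degrades."""

    def __init__(self, zdict, baseline_ratio, tolerance=0.85, window=10_000):
        self.zdict = zdict
        self.baseline = baseline_ratio   # ratio observed right after the initial training
        self.tolerance = tolerance       # retrain once the ratio drops below 85% of baseline
        self.window = window             # how many messages to observe per check
        self.raw = self.compressed = 0
        self.recent = []

    def compress(self, line: bytes) -> bytes:
        out = pyzstd.compress(line, 3, self.zdict)
        self.raw += len(line)
        self.compressed += len(out)
        self.recent.append(line)
        if len(self.recent) >= self.window:
            self._maybe_retrain()
        return out

    def _maybe_retrain(self):
        ratio = self.raw / self.compressed
        if ratio < self.baseline * self.tolerance:
            # Compression degraded: retrain on recent logs; propagating the new
            # dictionary to every consumer is the part still to be designed.
            self.zdict = pyzstd.train_dict(self.recent, 100 * 1024)
            # (Re-establishing the baseline for the new dictionary is omitted here.)
        self.raw = self.compressed = 0
        self.recent.clear()
```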
