
Add ZFS.md #1694

Closed
wants to merge 11 commits into from

Conversation

kinrokinro
Contributor

@kinrokinro kinrokinro commented Sep 3, 2024

Overview

Add documentation on how to run a DA node with ZFS compression.

Summary by CodeRabbit

  • New Features
    • Added a comprehensive guide for setting up a DA node with on-the-fly compression using ZFS.
    • Included detailed instructions on hardware requirements, environment setup, and advanced configuration options for optimizing storage efficiency.

Contributor

coderabbitai bot commented Sep 3, 2024

Walkthrough

A new file, nodes/zfs.md, has been added to provide a comprehensive guide for setting up a Data Availability (DA) node with on-the-fly compression using ZFS. The document includes hardware requirements, step-by-step instructions for installation and configuration, and advanced tuning options. It emphasizes the importance of syncing from scratch and offers practical commands to enable and check compression rates.
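As a quick illustration of the commands the guide covers — enabling on-the-fly compression on a dataset and checking the achieved ratio — here is a minimal sketch. It is a configuration fragment, not part of the guide itself: the pool/dataset names are placeholders, and it assumes ZFS is installed and you have root privileges.

```shell
# Placeholder names; substitute your own pool and dataset.
ZFS_POOL_NAME=celestia
ZFS_DATASET_NAME=bridge

# Enable on-the-fly zstd-3 compression (affects newly written data only)
sudo zfs set compression=zstd-3 $ZFS_POOL_NAME/$ZFS_DATASET_NAME

# Later, once data has been written, check the achieved compression ratio
zfs get compressratio $ZFS_POOL_NAME
```

Because compression only applies to blocks written after the property is set, the guide's recommendation to sync from scratch is what makes the ratio meaningful.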

Changes

| Files | Change Summary |
| --- | --- |
| nodes/zfs.md | New document added with a comprehensive guide on setting up ZFS compression for DA nodes, including requirements and step-by-step instructions. |

Poem

🐇 In the meadow where data flows,
A bridge node now with ZFS glows.
Compression on-the-fly, oh what a sight,
Storage efficiency, taking flight!
With each command, we hop and cheer,
Optimizing nodes, the future is here! 🌟



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Outside diff range, codebase verification and nitpick comments (2)
nodes/zfs.md (2)

25-25: Use a hyphen for the compound adjective "Bare Metal".

-1. Bare Metal server with decent amount of RAM (64GB+)
+1. Bare-Metal server with decent amount of RAM (64GB+)
Tools
LanguageTool

[uncategorized] ~25-~25: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ia-bridge/ ``` ::: ## Requirements: 1. Bare Metal server with decent amount of RAM (64GB+...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


31-33: Specify the language for shell command code blocks.

For better syntax highlighting and readability, specify the language as "sh" for the following code blocks:

  • Lines 31-33
  • Lines 37-38
  • Lines 42-43
  • Lines 47-48
  • Lines 53-54
  • Lines 59-60
  • Lines 64-65
  • Lines 72-74
  • Lines 77-79
  • Lines 82-84
  • Lines 96-97

Example:

-```
+```sh
 lsblk --nodeps -o name



Also applies to: 37-38, 42-43, 47-48, 53-54, 59-60, 64-65, 72-74, 77-79, 82-84, 96-97

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 95dc48c08af2229039346c9dbb7f25720b0aab54 and 9c2fb73e883a8710626e20548081cf55f90f4388.

Files selected for processing (2)
  • nodes/bridge-node.md (1 hunks)
  • nodes/zfs.md (1 hunks)

Additional context used

LanguageTool

nodes/zfs.md

[uncategorized] ~25-~25: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ia-bridge/ ``` :::  ## Requirements: 1. Bare Metal server with decent amount of RAM (64GB+...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

nodes/bridge-node.md

[grammar] ~209-~209: The word “setup” is a noun. The verb is spelled with a white space.
Context: ...ith ZFS  Follow the [tutorial on how to setup your DA node to use on-fly compression ...

(NOUN_VERB_CONFUSION)

Markdownlint

nodes/zfs.md

24-24: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)

28-28: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)

11-11: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

18-18: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

Additional comments not posted (2)

nodes/zfs.md (1)

`1-97`: **Documentation looks great!**

The documentation provides a clear and comprehensive guide for setting up ZFS compression on a DA node. The step-by-step instructions, command examples for different network environments, and additional notes make it easy for users to follow along and implement ZFS compression on their nodes.

Great work on putting together this helpful resource!

nodes/bridge-node.md (1)

`205-209`: **LGTM!**

The addition of the optional section for enabling on-the-fly compression using ZFS is a valuable enhancement to the bridge node setup instructions. It provides users with additional options for optimizing their node performance.

nodes/bridge-node.md (outdated review comment, resolved)
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 9c2fb73 and eb9a83d.

Files selected for processing (1)
  • nodes/bridge-node.md (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • nodes/bridge-node.md

@jcstein
Member

jcstein commented Sep 4, 2024

hmm @Wondertan I think we'd probably advise not to include this hack in docs?

@Wondertan
Member

@jcstein, mentioning that in the docs may be helpful, especially if storage issues are pressing for node runners, as it's ~ a 2x reduction.

@Wondertan
Member

@kinrokinro, for how long have you been running the node with compression on? Do you see it running stable?

@kinrokinro
Contributor Author

For how long have you been running the node with compression on? Do you see it running stable?

For a month, and yes, it's running stable: our bridge is always in sync and everything looks good to me. That's why I recommend adding this option to the docs.

In any case, we can always mark it as "not recommended" if there is any concern on your side.

BTW, zstd-3 is light compression; the maximum is zstd-19.

@kinrokinro
Contributor Author

Currently we're running both our bridges, on testnet and mainnet, on ZFS, so our metrics are available in your OTEL.

@jcstein
Member

jcstein commented Sep 4, 2024

Thank you for the context and feedback! I think it is safe to recommend this if @Wondertan approves. I will then work it into the sidebar menu.

@mogoll92

mogoll92 commented Sep 9, 2024

For how long have you been running the node with compression on? Do you see it running stable?

For a month, and yes, it's running stable: our bridge is always in sync and everything looks good to me. That's why I recommend adding this option to the docs.

In any case, we can always mark it as "not recommended" if there is any concern on your side.

BTW, zstd-3 is light compression; the maximum is zstd-19.

I think zstd-3 is optimal: higher zstd levels need more time to compress, so under heavier load the node could fall behind sync all the time.

@jcstein
Member

jcstein commented Sep 9, 2024

I think zstd-3 is optimal: higher zstd levels need more time to compress, so under heavier load the node could fall behind sync all the time.

confirming with @kinrokinro that this uses zstd-3?

@mogoll92

mogoll92 commented Sep 9, 2024

I think zstd-3 is optimal: higher zstd levels need more time to compress, so under heavier load the node could fall behind sync all the time.

confirming with @kinrokinro that this uses zstd-3?

According to the doc that @kinrokinro wrote: yes, it uses zstd-3, and with this compression he achieved a 2.05x ratio.
I found an interesting discussion about ZFS compression levels that gives a good overview:
https://www.reddit.com/r/zfs/comments/sxx9p7/a_simple_real_world_zfs_compression_speed_an/

According to it, anything above zstd-10 leads to a dramatic slowdown, and the extra ratio isn't large enough to justify sacrificing performance. I'm going to try zstd-5 on my Mocha bridge node to see whether there are performance issues and what compression ratio I get compared to zstd-3; after the full sync I'll probably ask @kinrokinro to share his numbers so we can compare.

@kinrokinro
Contributor Author

I think zstd-3 is optimal: higher zstd levels need more time to compress, so under heavier load the node could fall behind sync all the time.

confirming with @kinrokinro that this uses zstd-3?

Yes, we use zstd-3 compression level

@jcstein
Member

jcstein commented Sep 10, 2024

@kinrokinro do you want to add this to the menu? or think that it is okay as is, linked in the "optional" section?

@kinrokinro
Contributor Author

@kinrokinro do you want to add this to the menu? or think that it is okay as is, linked in the "optional" section?

This should be optional, of course.

@mogoll92

mogoll92 commented Sep 10, 2024

It would be great to add here that the processor matters more than RAM. I tried 2x Intel Xeon Silver 4108, but it doesn't fly; I had to request something more powerful. The bridge node syncs too slowly, and CPU usage is always above 90% while RAM (64GB) sits around 50%.

@kinrokinro
Contributor Author

It would be great to add here that the processor matters more than RAM. I tried 2x Intel Xeon Silver 4108, but it doesn't fly; I had to request something more powerful. The bridge node syncs too slowly, and CPU usage is always above 90% while RAM (64GB) sits around 50%.

Perhaps it's because of the stronger compression level you chose.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (2)
nodes/zfs.md (2)

24-27: Add a hyphen to "bare metal" and remove the colon.

The requirements section provides clear hardware specifications. However, please make the following minor changes:

  1. Add a hyphen to "bare metal" to make it "bare-metal" when used as a compound adjective.
  2. Remove the colon at the end of the heading to adhere to the Markdown style guide.

Apply this diff:

-## Requirements:
+## Requirements
-1. A bare metal server with a substantial amount of RAM (64GB or more) and a modern CPU (latest generation EPYC or Xeon with a clock speed of 2.1GHz or higher and 32 threads or higher is recommended)
+1. A bare-metal server with a substantial amount of RAM (64GB or more) and a modern CPU (latest generation EPYC or Xeon with a clock speed of 2.1GHz or higher and 32 threads or higher is recommended)
Tools
LanguageTool

[uncategorized] ~25-~25: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...-bridge/ ``` ::: ## Requirements: 1. A bare metal server with a substantial amount of RAM...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

Markdownlint

24-24: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


28-97: Address the static analysis hints and remove the colon from the heading.

The guide section is well-structured and provides clear instructions for setting up ZFS compression. The command examples for different network environments are helpful. However, please address the following static analysis hints:

  1. Remove the colon from the heading to adhere to the Markdown style guide.
  2. Specify the language for fenced code blocks to improve syntax highlighting.
  3. Add a comma after "In case of using a snapshot" to improve readability.

Apply this diff:

-## Guide:
+## Guide
 
 Get your disk name:
 ```sh
@@ -89,7 +89,7 @@ celestia bridge start --node.store /celestia/bridge/.celestia-bridge-arabica-11
 :::
 
 :::tip NOTE
-It is recommended to sync from scratch. In case of using a snapshot it is important to have your local route to `--data.store` identical to one in a snapshot.
+It is recommended to sync from scratch. In case of using a snapshot, it is important to have your local route to `--data.store` identical to one in a snapshot.
 :::
 
 After completing the steps above, you can begin syncing your DA node.
Tools
LanguageTool

[uncategorized] ~45-~45: You might be missing the article “a” here.
Context: ... apt install zfsutils-linux Create ZFS pool:sh zpool create $ZFS_POOL_NAME...

(AI_EN_LECTOR_MISSING_DETERMINER_A)


[uncategorized] ~89-~89: A comma might be missing here.
Context: ...o sync from scratch. In case of using a snapshot it is important to have your local rout...

(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)

Markdownlint

28-28: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between eb9a83d and 804a79e.

Files selected for processing (1)
  • nodes/zfs.md (1 hunks)
Additional context used
LanguageTool
nodes/zfs.md

[uncategorized] ~25-~25: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...-bridge/ ``` ::: ## Requirements: 1. A bare metal server with a substantial amount of RAM...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~45-~45: You might be missing the article “a” here.
Context: ... apt install zfsutils-linux Create ZFS pool:sh zpool create $ZFS_POOL_NAME...

(AI_EN_LECTOR_MISSING_DETERMINER_A)


[uncategorized] ~89-~89: A comma might be missing here.
Context: ...o sync from scratch. In case of using a snapshot it is important to have your local rout...

(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)

Markdownlint
nodes/zfs.md

24-24: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


28-28: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


18-18: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

Additional comments not posted (1)
nodes/zfs.md (1)

1-23: Introduction section looks good!

The introduction provides a clear and informative overview of the purpose and benefits of using ZFS compression for a DA node. The note effectively illustrates the storage savings achieved with ZFS compression compared to EXT4 without compression.

Tools
Markdownlint

11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


18-18: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

@mogoll92

mogoll92 commented Sep 10, 2024

It would be great to add here that the processor matters more than RAM. I tried 2x Intel Xeon Silver 4108, but it doesn't fly; I had to request something more powerful. The bridge node syncs too slowly, and CPU usage is always above 90% while RAM (64GB) sits around 50%.

Perhaps it's because of the stronger compression level you chose.

I'll provide much more info on the issues here, with charts; the node is still syncing. But after switching from 2x Intel Xeon Silver 4108 to 2x Intel Xeon Gold 6130, my node is flying; I just want to see the final compression ratio at the level we set (zstd-5).

@jcstein
Member

jcstein commented Sep 12, 2024

so baead03 should resolve the issue you were facing, @mogoll92?

@mogoll92

mogoll92 commented Sep 12, 2024

so baead03 should resolve the issue you were facing, @mogoll92?

Not completely, unfortunately.
As more data syncs, the node's performance has been decreasing. I thought the 2x Intel Xeon Gold 6130 had solved the issue, since it's more powerful and I didn't see a huge load on it (around 50%, +/- 10%). On the screenshot you can see the node reached 1.5M blocks very fast, but then started degrading. I assume it's because the NVMe disks are on PCIe 3, so the node gets stuck on I/O, which loads the processor and the system overall. I should mention that I kept a testnet bridge on the server with the 2x Intel Xeon Silver 4108 and PCIe 3 disks without ZFS, and that node was syncing and working well. Now I've rented an AMD EPYC 7313P with PCIe 4 drives and will try that setup.
[screenshot: bad performance]

Also illustrating the decrease in performance, on the screenshot below you can see the node now progresses by only 20k blocks every 3 hours (sadly). Hopefully more capable I/O drives will solve it.
[screenshot: sync progress, 2024-09-12]

@jcstein jcstein self-requested a review September 12, 2024 21:07
@jcstein
Member

jcstein commented Sep 16, 2024

Not completely, unfortunately.
As more data syncs, the node's performance has been decreasing.

Based on this, we'll have to recommend not using this yet, but I'm happy to leave the PR open until it makes sense to add.

@mogoll92

mogoll92 commented Sep 16, 2024

Not completely, unfortunately.
As more data syncs, the node's performance has been decreasing.

Based on this, we'll have to recommend not using this yet, but I'm happy to leave the PR open until it makes sense to add.

Agreed. Almost 4 days have passed and the node is still syncing after switching to the new hardware. Less than 1M blocks are left, at a syncing speed of ~40k blocks per 12 hours, though the speed depends on how much data the blocks carry, so it's sometimes faster. Let's see how it goes after reaching 2M blocks.

I'll drop all insights on the issues here once the node is synced and I've confirmed metrics are being sent correctly. This testing might be worth converting into a forum post for users who decide to run a DA node with ZFS later.

@kinrokinro
Contributor Author

Not completely, unfortunately.
As more data syncs, the node's performance has been decreasing.

Based on this, we'll have to recommend not using this yet, but I'm happy to leave the PR open until it makes sense to add.

Agreed. Almost 4 days have passed and the node is still syncing after switching to the new hardware. Less than 1M blocks are left, at a syncing speed of ~40k blocks per 12 hours, though the speed depends on how much data the blocks carry, so it's sometimes faster. Let's see how it goes after reaching 2M blocks.

I'll drop all insights on the issues here once the node is synced and I've confirmed metrics are being sent correctly. This testing might be worth converting into a forum post for users who decide to run a DA node with ZFS later.

Could you try zstd-3 instead of zstd-5?

@mogoll92

mogoll92 commented Sep 16, 2024

Not completely, unfortunately.
As more data syncs, the node's performance has been decreasing.

Based on this, we'll have to recommend not using this yet, but I'm happy to leave the PR open until it makes sense to add.

Agreed. Almost 4 days have passed and the node is still syncing after switching to the new hardware. Less than 1M blocks are left, at a syncing speed of ~40k blocks per 12 hours, though the speed depends on how much data the blocks carry, so it's sometimes faster. Let's see how it goes after reaching 2M blocks.
I'll drop all insights on the issues here once the node is synced and I've confirmed metrics are being sent correctly. This testing might be worth converting into a forum post for users who decide to run a DA node with ZFS later.

Could you try zstd-3 instead of zstd-5?

Yeah, I dropped the idea of syncing with zstd-5: it requires more processor resources, and the most I gained compared to zstd-3 was 5% in storage savings. So I wouldn't recommend anything higher than zstd-3 (or do it at your own risk).

The node is now running zstd-3 compression from the beginning.

@mogoll92

mogoll92 commented Oct 7, 2024

(Mocha bridge node) Alright, I did some investigation over the past month, and here's what I found...

Jumping ahead: the heavy CPU and disk usage occurs ONLY while the bridge node is syncing; after that, usage returns to normal with only small occasional spikes. ZFS uses a certain amount of CPU to process the IOPS, most of it on compression and checksumming. I used the following hardware for my node:
CPU: AMD EPYC 7313P
RAM: ECC DDR4 128GB 3200MHz
Disks: 3x Micron 7450 MTFDKBG3T8TFR

The node took about 9 days to fully sync; here is the hardware resource consumption:
[screenshots: CPU, RAM, and disk usage during sync]

Below I would like to list recommendations to start use ZFS.

  1. CPU: From the screenshot above, you can see that the CPU is really crucial, but I would say only during synchronization. Once the node is synced, CPU usage drops to 5-10%, with some spikes up to 100%, which is okay for bridge nodes, even without ZFS. I would recommend selecting the processor carefully for a server and using the one shared above as a starting point. The more powerful the processor, the less time it will take for your node to sync. However, if you have enough time, you can opt for something similar to what I have.
  2. RAM: Huge RAM isn't necessary here, 64GB DDR4 should be fine for sync and node running after it.
  3. Disk(s): The disk is a bit tricky here, as low throughput leads to higher iowait and, as a result, increases load on a CPU already stressed by ZFS. I would recommend using only NVMe drives with PCIe 4 support (which usually provides acceptable I/O speed). Also, check that the disk has a 4096-byte (4KB) physical sector size, as that will be important for tuning the ZFS pool correctly. Additionally, the larger the I/O size the disk supports (minimum/optimal), the better, though it's not strictly necessary. I've used disks with a 4096-byte minimum/optimal size, and it's been fine. I know Samsung disks offer higher I/O sizes, but that's up to your preference. With the following command you can check whether your disks have a 4096-byte physical sector size:
sudo fdisk -l
  4. ZFS: I've found a couple of things that are important for ZFS pool creation and dataset tuning.
  • ashift property. The ashift property determines the block allocation size that ZFS uses per vdev, expressed as a power of two. Ideally this value should match the sector size of the underlying physical device (the smallest physical unit that can be read from or written to that device), which is why it's important for the disk to have a 4KB physical sector size, to avoid I/O bottlenecks. So once you are sure your disks have the recommended physical sector size, set ashift=12 (2^12 = 4096) at pool creation. This property is immutable and can't be changed later. Ex:
zpool create -o ashift=12 $ZFS_POOL_NAME /dev/nvme0n1
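To make the ashift/sector-size relationship above concrete: ashift is simply log2 of the physical sector size, so 4096-byte sectors call for ashift=12 and 512-byte sectors for ashift=9. A small portable sketch (the sysfs path in the comment is a Linux-specific assumption, and nvme0n1 is an example device name):

```shell
# Compute the matching ashift for a given physical sector size.
# On Linux, the size can be read without root, e.g.:
#   cat /sys/block/nvme0n1/queue/physical_block_size
sector_size=4096
ashift=0
size=$sector_size
while [ "$size" -gt 1 ]; do
  size=$((size / 2))     # halve until we reach 1
  ashift=$((ashift + 1)) # count the halvings = log2(sector_size)
done
echo "sector size $sector_size bytes -> ashift=$ashift"
# sector size 4096 bytes -> ashift=12
```

Running the same loop with sector_size=512 yields ashift=9, which is why a pool created for 512-byte sectors behaves so differently on a 4K-native drive.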

Below are the I/O wait and disk load with ashift=9, matching a 512-byte physical sector size. Yes, an NVMe disk with PCIe 4 support can still report 512-byte sectors :)
[screenshot: I/O wait with the low ashift value]

And with a correctly configured zpool ashift property:
[screenshot: I/O wait with ashift=12]

The I/O wait in the first case is terrible: max iowait reaches 30% and the average is almost 3%, which causes additional CPU load and results in the node frequently getting stuck.

  • zstd compression algorithm. I wouldn't recommend going higher than zstd-3. The higher the compression level, the more CPU is consumed during sync. Testing with zstd-5 showed that the maximum compression ratio difference compared to zstd-3 was only 5%, which is not worth the performance loss just to save 5% of storage. Therefore, the recommendation, same as in the guide, is to use zstd-3.

  • recordsize property. ZFS splits files into blocks before writing them to disk. The recordsize defines the maximum size of these blocks. By default it's 128KB, but you can adjust it depending on your workload and performance needs. I've set recordsize=256K for the dataset, considering that Celestia DA nodes store a significant amount of data that can benefit from larger blocks. For mainnet you could even increase the recordsize to 512K, as it handles a much larger volume of data. This property can be set at any time, but it's recommended to set it from the beginning, as changing it later won't affect data that's already stored.

zfs set recordsize=256K $ZFS_POOL_NAME/$ZFS_DATASET_NAME

It gives me a better compression ratio and seems to reduce load, as there's no need to split data into 128KB blocks that could instead be stored as 256KB blocks, resulting in fewer I/O operations.
The numbers below look odd, since for 256KB blocks the histogram shows huge compression, but overall I got ~5% more storage savings compared to a 128KB recordsize (compared with @kinrokinro, who has 1.97x).

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:     21  10.5K  10.5K     21  10.5K  10.5K      0      0      0
     1K:  8.95K  8.95M  8.96M  8.95K  8.95M  8.96M      0      0      0
     2K:  8.94K  31.3M  40.3M  8.94K  31.3M  40.3M      0      0      0
     4K:  3.77M  15.1G  15.1G  52.6K   210M   251M   572K  2.24G  2.24G
     8K:  1.06M  10.3G  25.4G  90.8K  1.37G  1.62G  4.22M  35.4G  37.6G
    16K:  1.12M  25.5G  50.9G   621K  9.93G  11.6G  1.18M  26.7G  64.4G
    32K:  1.85M  84.5G   135G   262K  15.3G  26.8G  1.86M  84.9G   149G
    64K:  15.8M  1.42T  1.55T   452K  30.6G  57.4G  15.8M  1.42T  1.57T
   128K:  14.6M  2.74T  4.29T  3.18M   432G   490G  14.6M  2.74T  4.31T
   256K:  34.6K  8.64G  4.30T  33.7M  8.42T  8.90T  36.0K  9.07G  4.31T

And overall compression.

zfs get compressratio celestia_main && du -sh /celestia_main/bridge/.celestia-bridge-mocha-4/

NAME           PROPERTY       VALUE  SOURCE
celestia_main  compressratio  2.06x  -

6.4T	/celestia_main/bridge/.celestia-bridge-mocha-4/

To summarize: considering all of the above, I suggest the following tuning for the ZFS pool and dataset, taking into account the hardware recommendations.

zpool create -o ashift=12 $ZFS_POOL_NAME /dev/nvme0n1
zfs set recordsize=256K $ZFS_POOL_NAME/$ZFS_DATASET_NAME

After syncing, hardware resource consumption is fine and the node works well. I know the Celestia team released code that significantly reduces the disk space required for storing data, but considering the 1GB blocks feature that Celestia announced recently, this approach could become relevant again. I plan to use it in the future and recommend it to others.
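Pulling the thread's recommendations together, here is a hedged end-to-end sketch of the suggested setup. It is a configuration fragment rather than a tested recipe: it requires root, an OpenZFS version with zstd support (2.0+), and the pool/dataset/device names are examples to replace with your own.

```shell
# Example names and device; adjust to your system. Run as root.
ZFS_POOL_NAME=celestia_main
ZFS_DATASET_NAME=bridge

# ashift=12 for 4K-sector NVMe drives; immutable after pool creation
zpool create -o ashift=12 $ZFS_POOL_NAME /dev/nvme0n1

# Dataset with larger records and light zstd compression,
# set before syncing so all written data is affected
zfs create $ZFS_POOL_NAME/$ZFS_DATASET_NAME
zfs set recordsize=256K $ZFS_POOL_NAME/$ZFS_DATASET_NAME
zfs set compression=zstd-3 $ZFS_POOL_NAME/$ZFS_DATASET_NAME

# Verify the settings and, after syncing, the achieved ratio
zfs get recordsize,compression,compressratio $ZFS_POOL_NAME/$ZFS_DATASET_NAME
```

Setting recordsize and compression before the node starts writing matters, since neither property rewrites existing blocks.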

@GeoddHQ

GeoddHQ commented Oct 12, 2024

@mogoll92 @jcstein @kinrokinro
After many days of testing, we finally found the solution. I’ve shared all the key learnings here: Celestia Testnet - Bridge DA Node ZFS Optimizations.

Initially, we were stuck grinding through 20,000 blocks in four hours. However, after testing and optimization, we managed to complete the sync in just a couple of hours; the last known rate was around 250,000 blocks every 2 hours or less.

One major takeaway was that the high CPU usage we saw was largely due to IOWAIT, which is essentially the system waiting for disk I/O. This can often look like high CPU load on graphs but isn’t actual CPU work. The IOWAIT came from a disk backlog, with tasks sometimes taking 20 to 60 seconds or even longer to process/complete. While upgrading to Gen5 NVMe could have helped, we didn’t want to rely solely on more hardware.
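One way to see that backlog directly is `iostat` from the sysstat package (the device name is an example); a large average queue size with modest CPU user time is the signature of an IOWAIT-bound node:

```sh
# Extended per-device statistics, refreshed every second.
# Watch aqu-sz (queue depth) and w_await (write latency, ms):
# high values with low %user CPU mean the disk, not the CPU, is the bottleneck.
iostat -x 1 /dev/nvme0n1
```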

Through various tweaks, we reduced the backlog to just 100ms to 4 seconds, leading to much faster sync times. At one point, we tried increasing the recordsize, which improved sync speed even more, but it came at the cost of higher RAM usage and increased disk space. To strike a balance, we decided to keep the default 128K recordsize.

It was a long process, but the results speak for themselves! Tested on Gen4 PCIE NVME, DDR4 RAM with Intel XEON

  • Enable compression
  • Disable ZFS Auto trimming
  • Disable ZFS sync
  • Disable ZFS prefetch
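In concrete terms, those four tweaks map to the following commands (the same settings used in this PR's guide, with `$ZFS_POOL_NAME`/`$ZFS_DATASET_NAME` as defined there; remember to re-enable sync once the initial sync completes):

```sh
zfs set compression=zstd-3 $ZFS_POOL_NAME/$ZFS_DATASET_NAME        # enable compression
sudo zpool set autotrim=off $ZFS_POOL_NAME                         # disable auto-trim
zfs set sync=disabled $ZFS_POOL_NAME                               # only during initial sync!
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable  # disable prefetch
```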

@kinrokinro
Contributor Author

@mogoll92 @jcstein @kinrokinro After many days of testing, we finally found the solution. I’ve shared all the key learnings here: Celestia Testnet - Bridge DA Node ZFS Optimizations.

Initially, we were stuck grinding through 20,000 blocks in four hours. However, after testing and optimization, we managed to complete the sync in just a couple of hours; the last known rate was around 250,000 blocks every 2 hours or less.

One major takeaway was that the high CPU usage we saw was largely due to IOWAIT, which is essentially the system waiting for disk I/O. This can often look like high CPU load on graphs but isn’t actual CPU work. The IOWAIT came from a disk backlog, with tasks sometimes taking 20 to 60 seconds or even longer to process/complete. While upgrading to Gen5 NVMe could have helped, we didn’t want to rely solely on more hardware.

Through various tweaks, we reduced the backlog to just 100ms to 4 seconds, leading to much faster sync times. At one point, we tried increasing the recordsize, which improved sync speed even more, but it came at the cost of higher RAM usage and increased disk space. To strike a balance, we decided to keep the default 128K recordsize.

It was a long process, but the results speak for themselves! Tested on Gen4 PCIE NVME, DDR4 RAM with Intel XEON

  • Enable compression
  • Disable ZFS Auto trimming
  • Disable ZFS sync
  • Disable ZFS prefetch

I would strongly advise against disabling sync, for data-integrity reasons: we are working with a database where integrity is the main concern, and if something goes wrong the database will be corrupted. You should not disable sync no matter what, IMO.
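For completeness, if sync was disabled for the initial sync, turning it back on afterwards is a one-liner. Note that per the OpenZFS `sync` dataset property, the valid values are `standard` (the default), `always`, and `disabled`, so re-enabling means setting it back to `standard`:

```sh
# Restore the default POSIX-compliant sync behavior after the initial sync.
sudo zfs set sync=standard $ZFS_POOL_NAME
```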

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
nodes/bridge-node.md (1)

206-209: Consider rewording "on-fly compression" for clarity.

The new section looks good overall. However, consider changing "on-fly compression" to "on-the-fly compression" for better clarity and more common usage.

Apply this diff to improve the wording:

-### Optional: enable on-fly compression with ZFS
+### Optional: enable on-the-fly compression with ZFS

 Follow the
-[tutorial on how to set up your DA node to use on-fly compression with ZFS](./zfs.md).
+[tutorial on how to set up your DA node to use on-the-fly compression with ZFS](./zfs.md).
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 804a79e and fdad795.

📒 Files selected for processing (1)
  • nodes/bridge-node.md (1 hunks)
🧰 Additional context used
🔇 Additional comments (1)
nodes/bridge-node.md (1)

206-209: LGTM! New section added and past grammar issue fixed.

The new section on ZFS compression is a valuable addition to the document. It's clearly marked as optional and provides a link to more detailed instructions. The grammar issue from the past review comment has been successfully addressed.

@kinrokinro
Contributor Author

kinrokinro commented Oct 12, 2024

All of these tuning steps (the other content is mostly the same, as far as I can see) have been added to the documentation, along with remarks and instructions on how to use the sync=disabled setting properly (only during the initial sync).

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
nodes/zfs.md (5)

11-16: Minor formatting improvement for code block

The code block within the tip section could benefit from a minor formatting adjustment for better readability.

Consider applying this change:

-$ zfs get compressratio celestia && du -h /celestia/bridge/.celestia-bridge
-NAME             PROPERTY       VALUE  SOURCE
-celestia         compressratio  2.05x  -
-2.6T    /celestia/bridge/.celestia-bridge
+$ zfs get compressratio celestia && du -h /celestia/bridge/.celestia-bridge
+NAME      PROPERTY       VALUE  SOURCE
+celestia  compressratio  2.05x  -
+2.6T      /celestia/bridge/.celestia-bridge

This aligns the output for better visual clarity.

🧰 Tools
🪛 Markdownlint

11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


24-24: Remove trailing colon from heading

The heading "Requirements:" contains a trailing colon, which is not recommended in Markdown.

Consider changing the heading to:

## Requirements

This adheres to Markdown best practices and improves consistency.

🧰 Tools
🪛 Markdownlint

24-24: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


28-28: Remove trailing colon from heading

The heading "Guide:" contains a trailing colon, which is not recommended in Markdown.

Consider changing the heading to:

## Guide

This adheres to Markdown best practices and improves consistency.

🧰 Tools
🪛 Markdownlint

28-28: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


11-11: Add language specifiers to code blocks

Some code blocks in the document are missing language specifiers. Adding these would improve syntax highlighting and readability.

Consider adding language specifiers to the following code blocks:

  • Line 11: ```sh
  • Line 18: ```sh
  • Line 32: ```sh
  • Line 37: ```sh
  • Line 42: ```sh
  • Line 47: ```sh
  • Line 53: ```sh
  • Line 59: ```sh
  • Line 64: ```sh
  • Line 96: ```sh
  • Line 104: ```sh
  • Line 114: ```sh
  • Line 124: ```sh

This will ensure consistent syntax highlighting throughout the document.

Also applies to: 18-18, 32-32, 37-37, 42-42, 47-47, 53-53, 59-59, 64-64, 96-96, 104-104, 114-114, 124-124

🧰 Tools
🪛 Markdownlint

11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


101-102: Clarify wording in Auto-Trim section

There's a potential typo or unclear wording in the Auto-Trim section.

Consider revising the text to avoid repetition and improve clarity:

-### Disable Auto-Trim
-Auto-trim disabling can improve I/O performance, but may lead to increased SSD wear over time.
+### Disable Auto-Trim
+Disabling auto-trim can improve I/O performance, but may lead to increased SSD wear over time.

This change removes the repetition and makes the sentence more straightforward.

🧰 Tools
🪛 LanguageTool

[duplication] ~101-~101: Possible typo: you repeated a word
Context: ...an try the following steps: ### Disable Auto-Trim Auto-trim disabling can improve I/O performance, ...

(ENGLISH_WORD_REPEAT_RULE)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between fdad795 and d338581.

📒 Files selected for processing (1)
  • nodes/zfs.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
nodes/zfs.md

[uncategorized] ~25-~25: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...-bridge/ ``` ::: ## Requirements: 1. A bare metal server with a substantial amount of RAM...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~89-~89: A comma might be missing here.
Context: ...o sync from scratch. In case of using a snapshot it is important to have your local rout...

(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)


[duplication] ~101-~101: Possible typo: you repeated a word
Context: ...an try the following steps: ### Disable Auto-Trim Auto-trim disabling can improve I/O performance, ...

(ENGLISH_WORD_REPEAT_RULE)

🪛 Markdownlint
nodes/zfs.md

24-24: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


28-28: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


18-18: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (3)
nodes/zfs.md (3)

30-97: Well-structured guide with clear instructions

The guide section provides clear, step-by-step instructions for setting up ZFS compression on a DA node. The use of variables, code blocks for different network configurations, and important notes (such as syncing from scratch) are all well-implemented.

The structure and content of this section effectively guide users through the setup process.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~89-~89: A comma might be missing here.
Context: ...o sync from scratch. In case of using a snapshot it is important to have your local rout...

(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)


99-129: Comprehensive ZFS Fine-Tuning section with clear warnings

The ZFS Fine-Tuning section provides advanced users with valuable optimization options. Each option is well-explained, including potential benefits and risks. The notes on how to revert changes are particularly helpful.

This section effectively balances providing advanced options with necessary cautions, ensuring users can make informed decisions about their ZFS configuration.

🧰 Tools
🪛 LanguageTool

[duplication] ~101-~101: Possible typo: you repeated a word
Context: ...an try the following steps: ### Disable Auto-Trim Auto-trim disabling can improve I/O performance, ...

(ENGLISH_WORD_REPEAT_RULE)


1-129: Comprehensive and well-structured guide for ZFS compression on DA nodes

This document successfully addresses the PR objective of introducing documentation for running a Data Availability node with ZFS compression. It provides a clear, step-by-step guide for setup, along with advanced tuning options and important considerations.

Key strengths:

  1. Clear structure with separate sections for requirements, basic setup, and advanced tuning.
  2. Effective use of code blocks and tips to highlight important information.
  3. Balanced approach to presenting advanced options with appropriate warnings.
  4. Alignment with PR discussion points, such as addressing compression levels (zstd-3) and performance considerations.

While there are minor formatting and wording suggestions mentioned in previous comments, the overall quality and content of the document are excellent. This guide will be a valuable resource for users looking to optimize their DA node performance through ZFS compression.

This document is ready for merging after addressing the minor suggestions in the previous comments.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~25-~25: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...-bridge/ ``` ::: ## Requirements: 1. A bare metal server with a substantial amount of RAM...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~89-~89: A comma might be missing here.
Context: ...o sync from scratch. In case of using a snapshot it is important to have your local rout...

(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)


[duplication] ~101-~101: Possible typo: you repeated a word
Context: ...an try the following steps: ### Disable Auto-Trim Auto-trim disabling can improve I/O performance, ...

(ENGLISH_WORD_REPEAT_RULE)

🪛 Markdownlint

24-24: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


28-28: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


18-18: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

@mogoll92
Copy link

mogoll92 commented Oct 12, 2024

@mogoll92 @jcstein @kinrokinro After many days of testing, we finally found the solution. I’ve shared all the key learnings here: Celestia Testnet - Bridge DA Node ZFS Optimizations.

Initially, we were stuck grinding through 20,000 blocks in four hours. However, after testing and optimization, we managed to complete the sync in just a couple of hours; the last known rate was around 250,000 blocks every 2 hours or less.

One major takeaway was that the high CPU usage we saw was largely due to IOWAIT, which is essentially the system waiting for disk I/O. This can often look like high CPU load on graphs but isn’t actual CPU work. The IOWAIT came from a disk backlog, with tasks sometimes taking 20 to 60 seconds or even longer to process/complete. While upgrading to Gen5 NVMe could have helped, we didn’t want to rely solely on more hardware.

Through various tweaks, we reduced the backlog to just 100ms to 4 seconds, leading to much faster sync times. At one point, we tried increasing the recordsize, which improved sync speed even more, but it came at the cost of higher RAM usage and increased disk space. To strike a balance, we decided to keep the default 128K recordsize.

It was a long process, but the results speak for themselves! Tested on Gen4 PCIE NVME, DDR4 RAM with Intel XEON

  • Enable compression
  • Disable ZFS Auto trimming
  • Disable ZFS sync
  • Disable ZFS prefetch

I agree with @kinrokinro that disabling sync may lead to corruption, so you need to be careful. Also, could you share some stats and charts: how long it took, iowait during sync, space taken, etc.?

@mogoll92

mogoll92 commented Oct 12, 2024

All of these tuning steps (other content is mostly the same as I can see) have been added to the documentation, along with remarks and instructions on how to use the sync=disabled setting properly (only during the sync).

Could you also add info on ashift and hardware (before creating the pool) to your doc, please?
I dropped it here:
#1694 (comment)

It's crucial to create the pool with ashift=12, as lower values can lead to performance degradation and huge iowait.

@kinrokinro
Contributor Author

kinrokinro commented Oct 13, 2024

All of these tuning steps (other content is mostly the same as I can see) have been added to the documentation, along with remarks and instructions on how to use the sync=disabled setting properly (only during the sync).

Could you also add info before creating a pool on ashift and hardware to you doc, please? I dropped it here. #1694 (comment)

It's crucial to have the pool with ashift 12 as less values could lead to performance degradation and huge iowait.

Added; no concerns from my side about these settings (we have ashift=12 and recordsize=128K by default).

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
nodes/zfs.md (4)

9-22: LGTM: Informative compression examples

The note section effectively demonstrates the benefits of ZFS compression with clear examples. The formatting is consistent and easy to read.

Consider adding a brief explanation of the compression ratio (e.g., "2.05x means the data is compressed to about 48.8% of its original size") to help users interpret the results more easily.

🧰 Tools
🪛 Markdownlint

11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


18-18: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


24-27: LGTM: Clear and specific requirements

The requirements section provides clear and specific information about the hardware and disk needs for setting up a DA node with ZFS.

Remove the colon from the "Requirements" heading to adhere to Markdown best practices:

-## Requirements:
+## Requirements
🧰 Tools
🪛 LanguageTool

[uncategorized] ~25-~25: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...-bridge/ ``` ::: ## Requirements: 1. A bare metal server with a substantial amount of RAM...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

🪛 Markdownlint

24-24: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


28-97: LGTM: Comprehensive and well-structured guide

The guide section provides clear, step-by-step instructions for setting up ZFS compression on a DA node. The use of code blocks and network-specific instructions is excellent.

  1. Remove the colon from the "Guide" heading:
-## Guide:
+## Guide
  1. Consider adding a brief explanation of the ashift parameter in the ZFS pool creation command (line 47) to help users understand its purpose and importance.

To improve clarity and maintainability, consider using variables for the data store paths in the network-specific commands. For example:

MAINNET_DATA_STORE="/celestia/bridge/.celestia-bridge"
MOCHA_DATA_STORE="/celestia/bridge/.celestia-bridge-mocha-4"
ARABICA_DATA_STORE="/celestia/bridge/.celestia-bridge-arabica-11"

# Then use these variables in the commands, e.g.:
celestia bridge start --node.store $MAINNET_DATA_STORE ...

This approach would make it easier to update paths in the future if needed.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~89-~89: Possible missing comma found.
Context: ...o sync from scratch. In case of using a snapshot it is important to have your local rout...

(AI_HYDRA_LEO_MISSING_COMMA)

🪛 Markdownlint

28-28: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


99-135: LGTM: Comprehensive advanced tuning options with appropriate warnings

The ZFS Fine-Tuning section provides valuable advanced options with clear explanations and appropriate warnings. The structure and use of notes are excellent for guiding users through these more complex configurations.

Fix the typo in the auto-trim note:

-You always can trim maually: `sudo zpool trim $ZFS_POOL_NAME`
+You can always trim manually: `sudo zpool trim $ZFS_POOL_NAME`

Consider adding a brief explanation of the potential performance impact for each tuning option. For example, for the record size setting:

"Setting a larger record size can improve performance for large, sequential writes but may decrease performance for small, random writes. Choose the appropriate size based on your workload characteristics."

This additional context would help users make informed decisions about which optimizations to apply.

🧰 Tools
🪛 LanguageTool

[duplication] ~101-~101: Possible typo: you repeated a word
Context: ...an try the following steps: ### Disable Auto-Trim Auto-trim disabling can improve I/O performance, ...

(ENGLISH_WORD_REPEAT_RULE)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between d338581 and 36cc78f.

📒 Files selected for processing (1)
  • nodes/zfs.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
nodes/zfs.md

[uncategorized] ~25-~25: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...-bridge/ ``` ::: ## Requirements: 1. A bare metal server with a substantial amount of RAM...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~89-~89: Possible missing comma found.
Context: ...o sync from scratch. In case of using a snapshot it is important to have your local rout...

(AI_HYDRA_LEO_MISSING_COMMA)


[duplication] ~101-~101: Possible typo: you repeated a word
Context: ...an try the following steps: ### Disable Auto-Trim Auto-trim disabling can improve I/O performance, ...

(ENGLISH_WORD_REPEAT_RULE)

🪛 Markdownlint
nodes/zfs.md

24-24: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


28-28: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


18-18: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (2)
nodes/zfs.md (2)

1-7: LGTM: Clear and informative introduction

The file header and introduction effectively communicate the purpose of the document and the benefits of using ZFS compression for DA nodes. The content is well-structured and provides a good overview for users.


1-135: LGTM: Well-structured and formatted document

The overall structure and formatting of the document are excellent. The use of Markdown elements, such as headings, code blocks, and tip sections, is consistent and appropriate. The logical flow of information from basic setup to advanced tuning options makes the guide easy to follow.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~25-~25: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...-bridge/ ``` ::: ## Requirements: 1. A bare metal server with a substantial amount of RAM...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~89-~89: Possible missing comma found.
Context: ...o sync from scratch. In case of using a snapshot it is important to have your local rout...

(AI_HYDRA_LEO_MISSING_COMMA)


[duplication] ~101-~101: Possible typo: you repeated a word
Context: ...an try the following steps: ### Disable Auto-Trim Auto-trim disabling can improve I/O performance, ...

(ENGLISH_WORD_REPEAT_RULE)

🪛 Markdownlint

24-24: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


28-28: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


18-18: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

Comment on lines +1 to +135
```sh
ZFS_POOL_NAME="celestia" && ZFS_DATASET_NAME="bridge"
```

Install ZFS utils:
```sh
sudo apt update && sudo apt install zfsutils-linux
```

Create ZFS pool:
```sh
zpool create -o ashift=12 $ZFS_POOL_NAME /dev/nvme0n1
```

:::tip NOTE
If you have more than one disk available - you can add them also:
```sh
zpool create -o ashift=12 $ZFS_POOL_NAME /dev/nvme0n1 /dev/nvme1n1
```
:::

Create dataset:
```sh
zfs create $ZFS_POOL_NAME/$ZFS_DATASET_NAME
```

Enable compression:
```sh
zfs set compression=zstd-3 $ZFS_POOL_NAME/$ZFS_DATASET_NAME
```

Set the custom path to the bridge data folder:

::: code-group

```sh [Mainnet Beta]
# Add flag --node.store /celestia/bridge/.celestia-bridge to your command, example:
celestia bridge start --metrics.tls=true --metrics --metrics.endpoint otel.celestia.observer --p2p.metrics --node.store /celestia/bridge/.celestia-bridge
```

```sh [Mocha]
# Add flag --node.store /celestia/bridge/.celestia-bridge-mocha-4 to your command, example:
celestia bridge start --metrics.tls=true --metrics --metrics.endpoint otel.celestia-mocha.com --p2p.metrics --node.store /celestia/bridge/.celestia-bridge-mocha-4 --p2p.network mocha
```

```sh [Arabica]
# Add flag --node.store /celestia/bridge/.celestia-bridge-arabica-11 to your command, example:
celestia bridge start --node.store /celestia/bridge/.celestia-bridge-arabica-11 --p2p.network arabica
```

:::

:::tip NOTE
It is recommended to sync from scratch. In case of using a snapshot it is important to have your local route to `--data.store` identical to one in a snapshot.
:::

After completing the steps above, you can begin syncing your DA node.

You can check your compression rate with the following command:
```sh
zfs get compressratio $ZFS_POOL_NAME
```

## ZFS Fine-Tuning (Advanced)
If you want to increase your I/O performance and sync speed, you can try the following steps:
### Disable Auto-Trim
Auto-trim disabling can improve I/O performance, but may lead to increased SSD wear over time.
```sh
sudo zpool set autotrim=off $ZFS_POOL_NAME
```

:::tip NOTE
You always can trim maually: `sudo zpool trim $ZFS_POOL_NAME`
:::

### Disable sync
Disabling boosts write speed, but risks data loss if the system crashes before data is written to disk.
```sh
zfs set sync=disabled $ZFS_POOL_NAME
```

:::tip NOTE
You should not keep the `sync` feature disabled permanently; it is useful during the initial DA node sync but can be re-enabled afterward. You can enable `sync` again with: `sudo zfs set sync=enabled $ZFS_POOL_NAME`.
:::

### Disable prefetch
Disabling reduces memory usage but can slow down performance for sequential read workloads.
```sh
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable
```

:::tip NOTE
You can always re-enable it: `echo 0 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable`
:::

### Set record size
Setting `recordsize=256K` defines the maximum block size that ZFS will use when writing data to a dataset.
```sh
zfs set recordsize=256K $ZFS_POOL_NAME/$ZFS_DATASET_NAME
```
Contributor


🛠️ Refactor suggestion

Add a security considerations section

While the document provides comprehensive instructions for setting up and tuning a DA node with ZFS compression, it would benefit from a dedicated section on security considerations.

Consider adding a "Security Considerations" section near the end of the document. This section should cover:

  1. The potential risks of data loss when disabling sync, emphasizing the importance of having proper backup strategies.
  2. The importance of securing physical access to the server, especially when using advanced tuning options that may affect data integrity.
  3. Recommendations for keeping the ZFS software and the underlying operating system up to date with security patches.
  4. Any specific security best practices related to running a DA node in a ZFS environment.

This addition would help users understand and mitigate potential security risks associated with the setup and tuning options described in the guide.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~25-~25: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...-bridge/ ``` ::: ## Requirements: 1. A bare metal server with a substantial amount of RAM...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~89-~89: Possible missing comma found.
Context: ...o sync from scratch. In case of using a snapshot it is important to have your local rout...

(AI_HYDRA_LEO_MISSING_COMMA)


[duplication] ~101-~101: Possible typo: you repeated a word
Context: ...an try the following steps: ### Disable Auto-Trim Auto-trim disabling can improve I/O performance, ...

(ENGLISH_WORD_REPEAT_RULE)

🪛 Markdownlint

24-24: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


28-28: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)


11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


18-18: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

@GeoddHQ
Copy link

GeoddHQ commented Oct 13, 2024

I agree that sync should be re-enabled after the initial sync is completed.

Here are some stats before optimization

[screenshots: stats before optimization]

After Optimization

[screenshot: stats after optimization]

Used space

[screenshots: used space]

@mogoll92

mogoll92 commented Oct 15, 2024

All of these tuning steps (other content is mostly the same as I can see) have been added to the documentation, along with remarks and instructions on how to use the sync=disabled setting properly (only during the sync).

Could you also add info before creating a pool on ashift and hardware to you doc, please? I dropped it here. #1694 (comment)
It's crucial to have the pool with ashift 12 as less values could lead to performance degradation and huge iowait.

Added, no concerns from my side about this settings (we have ashift=12 and recordsize=128K by default).

You have ashift=12 by default because your disks support a 4K physical sector size; in that case ZFS creates the pool with the most suitable ashift, but otherwise it would be created with a smaller ashift, so it's important to check your NVMe(s) beforehand.
As an example, here is one of the disks you dropped in Discord:

Disk /dev/nvme4n1: 6.99 TiB, 7681501126656 bytes, 15002931888 sectors
Disk model: SAMSUNG MZQL27T6HBLA-00A07              
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 131072 bytes / 131072 bytes
Disklabel type: gpt
Disk identifier: F45A227A-AA47-5A42-8DCF-30657FF2B46E
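A quick way to check the logical vs. physical sector size before creating the pool, without the full fdisk dump (the device path is an example; column names per util-linux `lsblk`):

```sh
# PHY-SEC of 4096 means ashift=12 is the right value for this disk.
lsblk -d -o NAME,LOG-SEC,PHY-SEC /dev/nvme4n1
```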

@jcstein
Member

jcstein commented Oct 30, 2024

would you also recommend we close this, @sysrex?

@kinrokinro
Contributor Author

kinrokinro commented Oct 30, 2024

I would strongly disagree with closing this; I'm pretty sure it would be useful even with shwap, and we use ZFS for both our bridge nodes (mainnet and testnet) with no issues.

@mogoll92

Let's keep it open for now. There is a lot of useful information on running with ZFS. It's great that the team has released improvements in storage optimization, but Celestia DA is growing quickly, so this could become relevant sooner than we think.

@sysrex
Contributor

sysrex commented Nov 1, 2024

@jcstein @kinrokinro @mogoll92 - while this looks awesome and the data show that pre-shwap it was adding quite a bit of value, I feel this is probably a niche setup and maybe not suited for the main docs. If you write a tutorial on storage optimisation with ZFS, maybe we can find a way to reference it, but it's not something we would want in the main docs.

@sysrex sysrex closed this Nov 4, 2024