-
This is for my personal miner, not our PiKNiK system: Total number of hosts - 1 miner/lotus node; 1 boost node; and 1 seal worker.
-
Current onboarding: 1-3 TiB per 24 hours of legacy deals via Slingshot Evergreen; fewer than 6 v2 Boost deals per 24 hours.
Ideal: able to max out my 24-hour sealing capacity, 6 TiB per 24 hours or 90% of external bandwidth.
Bottlenecks: No. I do not run a distributed key/value store for other services.
Miner: f01611097
Power: 2.14 Pi / 17.1 Ei (0.0122%)
Workers: Seal(13) WdPoSt(1) WinPoSt(1)
Storage Deals: 7769, 210.2 TiB
Retrieval Deals (complete): 54948, 17.11 TiB
lotus version 1.15.2+mainnet+git.518dc962e
Hardware: 3-node mining cluster:
— Node 1 details
— Node 2 details
— Node 3 details
4-node Ceph cluster for storage
-
Storage providers’ questionnaire
Storage providers, please reply to this thread with answers to the questionnaire below.
Could you describe your current infrastructure:
Total number of hosts - sealer worker hosts for PC1/2, C1/2, etc.
Hosts’ configuration (CPU / RAM / disks (HDD, SSD) / GPU)?
Number of sealer nodes?
Current raw byte power?
Planned raw byte power?
How do you store your unsealed and sealed data? Do you have NFS?
What is your public/external network bandwidth?
What is your internal network bandwidth?
How many deals and how much data (GiB or TiB) do you onboard per day today?
How many deals and how much data (GiB or TiB) do you have capacity for and want to onboard if possible in an ideal world?
What bottlenecks do you see with Lotus and Boost today? If you are not running Boost, describe lotus-markets.
Does your organization already run a distributed key/value store for other services? Do you have a preferred key/value store?
Background
At the moment we are in a pre-release stage with the Boost project. Boost is an in-place replacement for the Lotus markets sub-system. Storage providers included in the alpha testing cohorts have already reported accepting deals with speed increases of up to 20x.
Now that deal acceptance and data transfer rates have improved, we are seeing a few bottlenecks with single-process Boost deployments backed by tens or hundreds of sealing workers.
This document describes these bottlenecks and a way forward to address them, based on recent conversations within the Boost team, and aims to start a conversation with the storage provider community so that we can better understand our users’ needs.
Problem definition / Requirements
Storage providers would like to be able to onboard data at petabyte scale.
During conversations with the PL Product team, as well as first-hand interviews with SPs, one of the requirements we need to address in the coming months is that SPs are waiting on the sidelines and would like to onboard deals at rates from 200 TiB per day up to multiple PiBs per day.
Ease of upgradeability / reliability / no single points of failure
At the moment it is possible to run boostd only as a single process. It should be possible to run a boostd upgrade with zero downtime and high reliability.
Support for multi-size storage providers
We must avoid pushing complexity to small storage providers. Storage providers should be able to continue to run Lotus / Boost easily, without too many dependencies, on single hosts if vertical scaling allows for that.
Bottlenecks
CommP calculation
CommP calculation is necessary in order to confirm that data for a deal sent by a client matches the data on the provider side, before that deal is published on-chain.
CommP calculation is rather compute-intensive and, depending on hardware, runs at speeds between 300 and 700 MB/s (put differently, a 32 GiB piece takes anywhere from 50 to 90 seconds).
If we assume a CommP calculation speed of 500 MB/s (476 MiB/s), we arrive at an upper bound of about 39 TiB worth of deals that a single process can onboard per day.
Storage providers would like to be able to onboard data at petabyte scale, about 100x that.
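To make that gap concrete, here is a minimal back-of-the-envelope sketch in Go that derives the ~39 TiB/day figure and the number of parallel CommP processes needed at the onboarding rates mentioned above. The 500 MB/s per-process throughput is the assumption from the previous paragraph, and the target rates (200 TiB/day and ~1 PiB/day) are illustrative.

```go
// Back-of-the-envelope check of the CommP bottleneck: assumes the ~500 MB/s
// single-process throughput quoted above; the target rates are illustrative.
package main

import (
	"fmt"
	"math"
)

const (
	commPBytesPerSec = 500e6                       // assumed single-process CommP throughput (~500 MB/s)
	secondsPerDay    = 86400.0
	bytesPerTiB      = 1024.0 * 1024 * 1024 * 1024 // 1 TiB in bytes
)

func main() {
	perProcessTiBPerDay := commPBytesPerSec * secondsPerDay / bytesPerTiB
	fmt.Printf("one process: ~%.0f TiB of deals per day\n", perProcessTiBPerDay) // ~39 TiB/day

	// How many CommP processes would be needed for the onboarding targets above?
	for _, targetTiBPerDay := range []float64{200, 1024} { // 200 TiB/day and ~1 PiB/day
		n := int(math.Ceil(targetTiBPerDay / perProcessTiBPerDay))
		fmt.Printf("%v TiB/day needs ~%d parallel CommP processes\n", targetTiBPerDay, n)
	}
}
```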
DAGStore indices
At the moment the DAGStore is backed by an embedded database on a single machine. The index grows at a rate of roughly 100 GB per PiB of deal data.
If storage providers are onboarding petabytes of deals per day, we need to provide a solution for a sharded DAGStore spread across multiple machines. Furthermore, given the CommP calculation bottleneck above, we need to be able to read from and write to it from multiple processes. At the required scale, we cannot be limited to a single user-land process, both for redundancy and for throughput reasons.
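As a rough, hedged illustration of the direction this points in, the sketch below estimates index growth from the ~100 GB-per-PiB figure above and shows one naive way a piece CID could be mapped to an index shard on another machine. The function names and the modulo-hash scheme are hypothetical and are not part of the DAGStore today.

```go
// Illustrative only, not existing DAGStore behaviour: estimates index growth
// from the ~100 GB of index per PiB of deals figure above, and maps a piece
// CID to one of N index shards with a simple hash.
package main

import (
	"fmt"
	"hash/fnv"
)

const indexBytesPerPiB = 100e9 // ~100 GB of index data per PiB of deal data (from this post)

// indexGrowthGBPerDay estimates index growth (in GB) for an onboarding rate in PiB/day.
func indexGrowthGBPerDay(pibPerDay float64) float64 {
	return pibPerDay * indexBytesPerPiB / 1e9
}

// shardFor picks one of n index hosts for a piece CID. A real deployment would
// more likely use consistent hashing so hosts can be added without a full reshuffle.
func shardFor(pieceCID string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(pieceCID))
	return h.Sum32() % n
}

func main() {
	fmt.Printf("at 1 PiB/day the index grows by ~%.0f GB/day\n", indexGrowthGBPerDay(1))
	fmt.Printf("piece maps to shard %d of 4\n", shardFor("baga6ea4seaq...", 4)) // placeholder piece CID
}
```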
Proposals
Make boostd a stateless service
In order to scale Boost, we are proposing to refactor boostd into a stateless service, with all global data stored separately and accessible via a sidecar data service, named boostd-data. Small SPs would run the data service with an embedded store, such as LevelDB or Badger, in order to avoid the complexity of maintaining a separate database while their data fits on a single host. For larger SPs, we’ll provide an implementation of boostd-data targeting a distributed key/value store.
For large SPs: introduce a distributed key/value store for all global state
If Boost state doesn’t fit on a single host, SPs should be able to easily transition to a distributed, sharded datastore and use it as a backend for all boostd instances.
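To make the stateless-service idea more concrete, here is a minimal sketch of the kind of key/value surface a boostd-data style sidecar could expose, with a trivial in-memory backend standing in for an embedded store. The interface and names are hypothetical illustrations, not the actual boostd-data API; a large SP would swap in an implementation backed by a distributed, sharded key/value cluster behind the same interface.

```go
// Hypothetical sketch of a pluggable backend for a boostd-data style sidecar:
// boostd itself would stay stateless and talk to whichever implementation is
// configured. Interface and names are illustrative, not the real boostd-data API.
package data

import (
	"context"
	"errors"
	"sync"
)

// Store is the minimal key/value surface a stateless boostd could depend on.
type Store interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, value []byte) error
	Close() error
}

// ErrNotFound is returned when a key is not present in the store.
var ErrNotFound = errors.New("key not found")

// memStore stands in for an embedded backend (LevelDB, Badger) that a small SP
// would run on a single host; a large SP would use an implementation backed by
// a distributed, sharded key/value cluster behind the same Store interface.
type memStore struct {
	mu sync.RWMutex
	m  map[string][]byte
}

// NewMemStore returns an in-memory Store for illustration purposes.
func NewMemStore() Store { return &memStore{m: make(map[string][]byte)} }

func (s *memStore) Get(_ context.Context, key string) ([]byte, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[key]
	if !ok {
		return nil, ErrNotFound
	}
	return v, nil
}

func (s *memStore) Put(_ context.Context, key string, value []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = value
	return nil
}

func (s *memStore) Close() error { return nil }
```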