etl experiment report #1360

Merged
merged 9 commits into from
Dec 31, 2022

@wycccccc (Collaborator) commented Dec 28, 2022

#1296
I did not add the architecture to this report; it seems better suited to the etl README. I will open another PR to write the etl documentation as well.

@wycccccc wycccccc requested a review from chia7712 December 28, 2022 17:23
@wycccccc (Collaborator, Author)

All of the issues above have been corrected.

@chia7712 (Contributor)

@wycccccc Was the data updated afterwards?

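@wycccccc (Collaborator, Author) commented Dec 30, 2022

@chia7712 I wrote that remark late at night when I was not thinking straight, so I quietly deleted it. After running the experiment, the experimental data turned out to be fine.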


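### Unbalanced scenario

This scenario uses all six machines described above. The network bandwidth of B1, B2, and B3 is set to 2.5G so that changes in etl performance show up more clearly when the cluster is under heavy load.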
Contributor:

We may need to present the "problem", that is, how throughput behaves when one node is unstable or busy. For example, we could show the data volume and bandwidth each node receives, to illustrate that even when a node is already very busy and unstable, the default partitioner still tries to send that much data to it.

Collaborator Author:

OK, I have added a comparison experiment to illustrate this problem.
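The concern in this thread can be illustrated with a minimal sketch. It compares a policy that spreads records evenly regardless of node load with a hypothetical load-aware policy. The `round_robin` function is only a simplified stand-in for an even-spreading partitioner, not Kafka's actual default partitioner, and `load_aware` is not Astraea's partitioner. The free-bandwidth figures and record count are assumed for illustration.

```python
from collections import Counter
from itertools import cycle


def round_robin(records, brokers):
    """Assign records to brokers in turn, ignoring how busy each broker is."""
    targets = cycle(brokers)
    return Counter(next(targets) for _ in range(records))


def load_aware(records, free_bandwidth):
    """Assign records in proportion to free bandwidth (hypothetical policy)."""
    total = sum(free_bandwidth.values())
    counts = {b: int(records * free / total) for b, free in free_bandwidth.items()}
    # Give any rounding remainder to the broker with the most free bandwidth.
    leftover = records - sum(counts.values())
    counts[max(free_bandwidth, key=free_bandwidth.get)] += leftover
    return Counter(counts)


# Assumed numbers: B1 is almost saturated by other traffic, B2 and B3 are idle.
free_bandwidth = {"B1": 0.1, "B2": 2.5, "B3": 2.5}  # arbitrary units
even = round_robin(9000, list(free_bandwidth))
weighted = load_aware(9000, free_bandwidth)
print("even spread:", dict(even))
print("load aware: ", dict(weighted))
```

Under these assumptions the even spread still sends a third of the records to the busy node B1, which is the behaviour the comparison experiment is meant to expose.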

@chia7712 (Contributor) left a comment

@wycccccc Thank you for continuing to revise the report. There is one remaining suggestion; once it is addressed, this can be merged.


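In the normal scenario, on a spark cluster with two workers, astraea etl launched in standalone mode processes data at an average rate of 58.5MB/s.

In the unbalanced scenario, the performance comparison after replacing the partitioner.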
Contributor:

The performance drop caused by an unbalanced cluster is mentioned above; please mention it in the conclusion as well.

Collaborator Author:

Thanks for the suggestion. The change has been made.


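In the figure, the left side shows the unbalanced scenario and the right side shows the normal scenario, so the difference is easy to see at a glance.

In the experiment on the left, data is first sent to costTopic until it reaches the node's bandwidth limit. After some time etl is started. Because the data etl sends takes away part of the bandwidth that costTopic was occupying, the throughput of costTopic drops. Once etl finishes, the throughput of costTopic returns to its initial level.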
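The bandwidth trade-off described above can be sketched as simple arithmetic. This is a rough model, not a measurement: it assumes "2.5G" means 2.5 Gbit/s, that the link is shared in a zero-sum way, and it uses a hypothetical etl rate towards the busy node. The names `cost_topic_rate`, `LINK_CAPACITY`, and `ETL_RATE_TO_B1` are made up for this illustration.

```python
def cost_topic_rate(link_capacity, etl_rate_to_node):
    """Throughput left for costTopic on a saturated link once etl traffic arrives."""
    return max(link_capacity - etl_rate_to_node, 0.0)


LINK_CAPACITY = 312.5   # MB/s, assuming "2.5G" means 2.5 Gbit/s (2500 / 8)
ETL_RATE_TO_B1 = 20.0   # MB/s, hypothetical share of etl traffic landing on the busy node

# Three phases of the left-hand experiment: before, during, and after the etl run.
timeline = [
    ("before etl", 0.0),
    ("etl running", ETL_RATE_TO_B1),
    ("after etl", 0.0),
]
for phase, etl_rate in timeline:
    rate = cost_topic_rate(LINK_CAPACITY, etl_rate)
    print(f"{phase}: costTopic ~ {rate:.1f} MB/s")
```

The model only reproduces the shape of the curve (dip while etl runs, recovery afterwards); the actual values come from the figure in the report.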
Contributor:

I would like to confirm the scenario once more. The main point of this passage is that when one of the target nodes is busy, the default partitioner does not skip that node and may still push data to it, which affects overall throughput and latency.

So I would like to confirm: how are `costTopic` and `testTopic` each distributed? Also, what do the three differently colored `testTopic` entries below the figure represent?

Collaborator Author:

costTopic receives the data that keeps one node busy. It resides only on B1.
testTopic is the topic that the data produced by etl is sent to. It is distributed across B1, B2, and B3.
There are three testTopic entries because the figure shows that topic's traffic on each of the three nodes.
There is only one costTopic entry because B1 is the only node that receives its data.
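The relationship between topic placement and the number of lines in the figure can be written out as a small sketch. The `placement` mapping restates the distribution given in the reply; the helper `chart_series` is a hypothetical name used only for this illustration.

```python
# Topic placement as described in the reply above.
placement = {
    "costTopic": ["B1"],
    "testTopic": ["B1", "B2", "B3"],
}


def chart_series(placement):
    """Return one series per (topic, broker) pair that receives data."""
    return [(topic, broker) for topic, brokers in placement.items() for broker in brokers]


series = chart_series(placement)
for topic, broker in series:
    print(f"{topic} on {broker}")
```

This yields one costTopic series and three testTopic series, matching the legend described in the thread.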

Contributor:

@wycccccc Thanks for the reply. Could you add this text to the report as well?

Collaborator Author:

OK, it has been added.

@chia7712 chia7712 merged commit 9bb0515 into opensource4you:main Dec 31, 2022