etl experiment report #1360
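All of the issues above have been corrected.
@wycccccc Has the data been updated since then?
@chia7712 That remark was a late-night brain lapse on my part, so I quietly deleted it. After running the experiment, I confirmed there is nothing wrong with the data.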
docs/etl/experiments/etl_1.md
### Unbalanced scenario
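This scenario uses all six machines described above. The network bandwidth of B1, B2, and B3 is set to 2.5G so that changes in etl performance show up more clearly when the cluster is under heavy load.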
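We may need to show the "problem" itself, that is, how throughput behaves when one node is unstable or busy. For example, we could present the amount of data and the bandwidth each node receives, to show that even when a node is already very busy and unstable, the default partitioner still tries to send that much data to it.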
OK, I have added a comparison experiment to illustrate this problem.
@wycccccc Thanks for continuing to revise the report. One suggestion remains; once it is addressed, this can be merged.
docs/etl/experiments/etl_1.md
In the normal scenario, on a Spark cluster with two workers, astraea etl started in standalone mode processes data at an average rate of 58.5 MB/s.
Performance comparison in the unbalanced scenario after replacing the partitioner.
The performance drop caused by the unbalanced cluster is mentioned above; please mention it in the conclusion as well.
Thanks for the suggestion. The changes are done.
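The left side of the figure shows the unbalanced scenario and the right side shows the normal scenario, so the difference is easy to see at a glance.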
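On the left, the experiment begins by sending data to costTopic until the node reaches its bandwidth limit. After a while, etl is started. Because the data etl sends takes part of the bandwidth that costTopic was using, costTopic's performance drops. Once etl finishes, costTopic's performance returns to its initial level.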
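The contention described in this passage can be illustrated with a toy model. This is only a sketch: it assumes that a saturated link divides its capacity among flows in proportion to their demand, which is a simplification of real network behavior. The function `share_link` and every number below are hypothetical and are not measurements from the experiment.

```python
def share_link(capacity, demands):
    """Split a link's capacity among flows in proportion to their demand.

    If the total demand fits within the capacity, every flow gets what it
    asked for. Proportional sharing is an assumption of this sketch.
    """
    total = sum(demands.values())
    if total <= capacity:
        return dict(demands)
    return {name: capacity * d / total for name, d in demands.items()}


CAPACITY = 2.5  # hypothetical link capacity of B1, in arbitrary units

# Phase 1: only costTopic traffic, which saturates the link.
before = share_link(CAPACITY, {"costTopic": 2.5})
# Phase 2: etl starts writing to testTopic on the same node (hypothetical demand).
during = share_link(CAPACITY, {"costTopic": 2.5, "testTopic": 0.5})
# Phase 3: etl has finished, so costTopic has the link to itself again.
after = share_link(CAPACITY, {"costTopic": 2.5})

for phase, shares in (("before", before), ("during", during), ("after", after)):
    print(phase, shares)
```

In this model costTopic's share drops while etl is running and returns to its starting value afterwards, which matches the shape of the curve described for the left side of the figure.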
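I would like to confirm the scenario once more. The main point of this passage is that when one of the target nodes is busier than the others, the default partitioner does not skip that node and may still push data to it, which affects overall throughput and latency.
So I would like to confirm: how are costTopic and testTopic each distributed? Also, what do the three differently colored testTopic entries below the figure represent?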
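costTopic receives the data that keeps one node busy. It is placed only on B1.
testTopic is the topic that the data produced by etl is sent to. It is distributed across B1, B2, and B3.
There are three testTopic entries because the figure shows that topic's traffic on each of the three nodes.
There is only one costTopic entry because B1 is the only node that receives its data.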
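As a rough illustration of the layout described in this reply, the sketch below records which nodes host each topic and derives the per-node series that appear in the figure. The `spread_evenly` helper is a hypothetical, simplified stand-in for a load-unaware partitioner (plain round-robin); it is not Kafka's actual default partitioner logic, and the record count is made up.

```python
from collections import Counter
from itertools import cycle

# Topic placement as described in the reply above.
placement = {
    "costTopic": ["B1"],
    "testTopic": ["B1", "B2", "B3"],
}


def series_in_figure(placement):
    """Return one (topic, node) pair per traffic series shown in the figure."""
    return [(topic, node) for topic, nodes in placement.items() for node in nodes]


def spread_evenly(nodes, n_records):
    """Assign records round-robin, ignoring how busy each node already is."""
    targets = cycle(nodes)
    return Counter(next(targets) for _ in range(n_records))


series = series_in_figure(placement)
print(series)  # one costTopic series and three testTopic series

# B1 already carries all costTopic traffic, yet it still receives an equal
# share of the testTopic records under a load-unaware assignment.
assigned = spread_evenly(placement["testTopic"], 300)
print(assigned)
```

The point of the sketch is that B1 gets the same share of testTopic records as B2 and B3 even though it is the only node also serving costTopic.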
@wycccccc Thanks for the reply. Could you add this explanation to the report as well?
OK, it has been added.
#1296
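I did not include the architecture in this report; I feel it belongs in the etl readme. I will open another PR to write the etl documentation along with it.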