
Billboard: all chaos tests to do (welcome new ideas! 😆) #3449

Closed · 9 of 10 tasks
Yiyiyimu opened this issue Jan 28, 2021 · 15 comments
Labels: chaos, discuss, stale

Yiyiyimu (Member) commented Jan 28, 2021

Background

See #2757

Todo List

Architecture

Chaos Test
Overall related:

  • The SkyWalking server is down: the heartbeat in skywalking-nginx-lua throws an error at a 3 s interval, so nothing to worry about
  • The OAuth 2.0 server is down
  • The Kafka server is down / a broker fails (the first three items share a "dependency down" pattern; see the sketch after this list)
  • CPU/memory is stressed out, to see how APISIX copes
  • The disk of the APISIX node is full
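
A minimal sketch of that "dependency down" pattern: stop the dependency's container, verify APISIX itself keeps proxying, then restore it. The container name `skywalking`, the docker-based setup, and the pre-configured `/hello` route are all assumptions about the test environment.

```python
import subprocess
import requests

# Assumed names/ports -- adjust to the actual deployment.
DEP_CONTAINER = "skywalking"          # hypothetical container name
APISIX_URL = "http://127.0.0.1:9080"  # APISIX's default proxy port

def check_proxy_still_works():
    """The gateway must keep serving traffic while the dependency is down."""
    resp = requests.get(f"{APISIX_URL}/hello", timeout=5)
    assert resp.status_code == 200, f"proxying broken: {resp.status_code}"

# Take the dependency down, verify APISIX still proxies, then restore it.
subprocess.run(["docker", "stop", DEP_CONTAINER], check=True)
try:
    check_proxy_still_works()
finally:
    subprocess.run(["docker", "start", DEP_CONTAINER], check=True)
```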

etcd related:

Could not be tested yet, since the data plane (DP) and control plane (CP) have not been separated:

  • Part or all of the DP and CP networks are disconnected
  • A CP node goes down randomly
  • A DP node is attacked by DDoS

Confirmed to need hard work without much benefit:

  • Restart etcd and then APISIX, and see how it works (resending data from memory back to etcd)

Welcome more ideas! 😆

Yiyiyimu added the chaos label Jan 28, 2021
tokers (Contributor) commented Jan 29, 2021

@Yiyiyimu Have you tried creating a network partition in the etcd cluster, to see how APISIX behaves?
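
One way to try this, assuming the cluster runs on Kubernetes with Chaos Mesh installed, is to apply a NetworkChaos experiment that cuts one etcd pod off from the others. The namespace and the `app: etcd` label selectors below are assumptions about the deployment.

```python
import subprocess

# A Chaos Mesh NetworkChaos experiment that partitions one etcd pod from the
# rest of the cluster. Namespace and labels are assumptions about the setup.
PARTITION_EXPERIMENT = """
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: etcd-partition
  namespace: default
spec:
  action: partition
  mode: one                 # pick one etcd pod at random
  selector:
    labelSelectors:
      app: etcd
  direction: both
  target:
    mode: all               # cut it off from all the other etcd pods
    selector:
      labelSelectors:
        app: etcd
"""

# Apply the experiment; while it runs, exercise APISIX (Admin API writes,
# data-plane reads) and watch how it behaves against the degraded cluster.
subprocess.run(["kubectl", "apply", "-f", "-"],
               input=PARTITION_EXPERIMENT, text=True, check=True)
```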

Yiyiyimu (Member, Author) commented

> @Yiyiyimu Have you tried creating a network partition in the etcd cluster, to see how APISIX behaves?

Thanks for the suggestion! Added to the todo list, and I'll try it later.

idbeta (Contributor) commented Jan 29, 2021

Two ideas:

  1. Kill one or two pods of the etcd cluster, then check CRUD for routes; then start the etcd pods again and check CRUD for routes again.
  2. When the network connection to etcd is degraded, e.g. with serious packet loss, create 1000 normal routes and check whether the routes can be accessed (see the sketch after this list).
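
A rough sketch of the second idea, using `tc netem` for the packet loss and the APISIX Admin API for route creation. The container name `etcd`, the interface `eth0`, the upstream `127.0.0.1:1980`, and the 30% loss rate are assumptions; the API key is the default from APISIX's example config.

```python
import time
import subprocess
import requests

ADMIN = "http://127.0.0.1:9080/apisix/admin"   # Admin API (default port in 2.x)
GATEWAY = "http://127.0.0.1:9080"
HEADERS = {"X-API-KEY": "edd1c9f034335f136f87ad84b625c8f1"}  # example-config default

# Degrade the link from etcd: 30% loss is an arbitrary choice, and the
# container name "etcd" / interface "eth0" are assumptions about the setup.
subprocess.run(["docker", "exec", "etcd", "tc", "qdisc", "add",
                "dev", "eth0", "root", "netem", "loss", "30%"], check=True)
try:
    for i in range(1, 1001):
        r = requests.put(f"{ADMIN}/routes/{i}", headers=HEADERS, json={
            "uri": f"/chaos/{i}",
            "upstream": {"type": "roundrobin",
                         "nodes": {"127.0.0.1:1980": 1}},
        })
        r.raise_for_status()

    time.sleep(5)  # let the config sync to the data plane
    failed = [i for i in range(1, 1001)
              if requests.get(f"{GATEWAY}/chaos/{i}").status_code != 200]
    print(f"routes unreachable under packet loss: {len(failed)}")
finally:
    subprocess.run(["docker", "exec", "etcd", "tc", "qdisc", "del",
                    "dev", "eth0", "root"], check=True)
```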

Yiyiyimu (Member, Author) commented

Hi @idbeta, thanks for the suggestions! The second one is great, but I failed to get what we could gain from the first idea. Could you explain a little more?

idbeta (Contributor) commented Jan 29, 2021

> Hi @idbeta, thanks for the suggestions! The second one is great, but I failed to get what we could gain from the first idea. Could you explain a little more?

The current case kills all of etcd; I think we should also test the situation where only some etcd instances are down.
The idea comes from this report (translated from Chinese):

> My etcd is a cluster, but when I take one etcd node down, APISIX seems to throw errors when restarting; the node that went down was the leader.
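
The leader case can be scripted: `etcdctl endpoint status -w json` reports each member's ID together with the current leader's ID, so a sketch like the one below can locate and stop exactly the leader before restarting APISIX. The endpoint list and the endpoint-to-container-name mapping are assumptions about the test cluster.

```python
import json
import subprocess

# Endpoints of a hypothetical 3-node etcd cluster.
ENDPOINTS = "http://etcd1:2379,http://etcd2:2379,http://etcd3:2379"

# Per endpoint, the status JSON carries its own member ID and the ID of the
# current leader; the endpoint whose member ID equals the leader ID is it.
out = subprocess.run(
    ["etcdctl", "--endpoints", ENDPOINTS, "endpoint", "status", "-w", "json"],
    capture_output=True, text=True, check=True).stdout
status = json.loads(out)

leader_ep = next(s["Endpoint"] for s in status
                 if s["Status"]["header"]["member_id"] == s["Status"]["leader"])
print(f"etcd leader is at {leader_ep}")

# Map the endpoint back to a container name (deployment-specific assumption)
# and stop it; then restart APISIX and watch whether it errors out.
container = leader_ep.split("//")[1].split(":")[0]   # e.g. "etcd2"
subprocess.run(["docker", "stop", container], check=True)
```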

Yiyiyimu (Member, Author) commented

Hi @idbeta, thanks for the explanation! Added to the todo list, and I'll test it later. Is it from a particular issue, so I could have some context to reproduce it?

sysulq (Contributor) commented Jan 29, 2021

@Yiyiyimu delete all etcd data, to see APISIX's behavior.
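
A minimal sketch of this, assuming etcd is reachable on localhost and APISIX uses the default `/apisix` key prefix: wipe the prefix with `etcdctl`, then probe a previously working route (the `/hello` route is an assumption) to see how the gateway reacts.

```python
import subprocess
import requests

# Wipe every key under APISIX's etcd prefix ("/apisix" is the default
# config value), then probe the gateway to see how it reacts.
subprocess.run(["etcdctl", "--endpoints", "http://127.0.0.1:2379",
                "del", "/apisix", "--prefix"], check=True)

# Routes should now be gone; the data plane is expected to start returning
# 404 once the watcher picks up the deletions.
resp = requests.get("http://127.0.0.1:9080/hello")
print(resp.status_code, resp.text[:100])
```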

tzssangglass (Member) commented

  • Update data directly in etcd at high frequency
  • Update the same route at high frequency (see the sketch after this list)
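
A small sketch of the second variant, hammering one route through the Admin API; the API key is the default from the example config, and the port and upstream address are assumptions. The first variant could run a similar loop with `etcdctl put` against the same key instead of the Admin API.

```python
import requests

ADMIN = "http://127.0.0.1:9080/apisix/admin"  # Admin API (default port in 2.x)
HEADERS = {"X-API-KEY": "edd1c9f034335f136f87ad84b625c8f1"}  # example-config default

# Rewrite the same route as fast as the Admin API accepts, flipping its uri
# between two values; a second client polling the data plane should only ever
# observe one of the two consistent states.
for n in range(10000):
    r = requests.put(f"{ADMIN}/routes/1", headers=HEADERS, json={
        "uri": "/ping" if n % 2 == 0 else "/pong",
        "upstream": {"type": "roundrobin", "nodes": {"127.0.0.1:1980": 1}},
    })
    r.raise_for_status()
```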

Yiyiyimu (Member, Author) commented

Many thanks @hnlq715 @tzssangglass!!

moonming (Member) commented

I need to remind everyone that we are testing the stability of APISIX, not etcd.

We need to look at this issue from a higher perspective.

moonming (Member) commented

@Yiyiyimu I think you should look at the stability of the APISIX cluster from the perspective of the overall architecture, and find the weaknesses of the system one by one, instead of shooting randomly.

moonming (Member) commented

  1. Part or all of the DP and CP networks are disconnected
  2. A CP node goes down randomly
  3. A DP node is attacked by DDoS
  4. The disk of a DP node is full
  5. The SkyWalking server is down
  6. The OAuth 2.0 server is down
  7. The Kafka server is down
  etc.

Yiyiyimu (Member, Author) commented

@moonming Thanks for the suggestions, that will be very helpful! I'll focus on architectural weaknesses first and then do some random shots.

github-actions (bot) commented
This issue has been marked as stale due to 350 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions.

github-actions bot added the stale label May 27, 2022
github-actions (bot) commented
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
