HBASE-23085: Network and Data related Actions #675
Add monkey actions:
- manipulate network packets with tc (reorder, lose, ...)
- add CPU load
- fill the disk
- corrupt or delete regionserver data files

Create monkey factories for the new actions.
Extend HBaseClusterManager to allow sudo calls.
Fix a copy/paste issue with monkey constants in some factories.
🎊 +1 overall
This message was automatically generated.
hbase-it/src/test/java/org/apache/hadoop/hbase/HBaseClusterManager.java
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/AddCPULoadAction.java
//This will always happen. We use timeout to kill a continuously running process
//after the duration expires
I think we should add /bin/true at the end of the command line. At this point we don't know whether we got, for example, a network error or the script returned with an error.
@meszibalu As discussed, for /bin/true to make a difference we would have to increase the outer timeout, and that would not be much better, so I will leave it as it is.
"seq 1 %s | xargs -I{} -n 1 -P %s timeout %s dd if=/dev/urandom of=/dev/null bs=1M " +
"iflag=fullblock";
%s is used, but numbers are passed as the String.format arguments.
That is intentional. %s uses Integer.toString, which is predictable, while %d uses locale-specific formatting that might change.
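The difference can be illustrated with a small sketch. The class and method names are illustrative and not part of the PR, and the `%,d` grouping flag is used here only because it makes the locale dependence easy to observe.

```java
import java.util.Locale;

public class FormatIntegerDemo {
  // %s calls toString() on the boxed Integer, so the locale does not affect the result.
  public static String withS(Locale locale, int value) {
    return String.format(locale, "%s", value);
  }

  // %,d requests grouping separators, which are taken from the given locale.
  public static String withGroupedD(Locale locale, int value) {
    return String.format(locale, "%,d", value);
  }

  public static void main(String[] args) {
    int n = 1234567;
    System.out.println(withS(Locale.GERMANY, n));        // 1234567
    System.out.println(withGroupedD(Locale.US, n));      // 1,234,567
    System.out.println(withGroupedD(Locale.GERMANY, n)); // 1.234.567
  }
}
```

For a command line that is handed to a shell on a remote host, the locale-independent form is the safer default.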
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/CommandAction.java
}

private String getCommand(String operation){
  return String.format("tc qdisc %s dev eth0 root netem corrupt %s%%", operation, ratio * 100);
Please format floats with %f.
What would be the advantage? For a ratio of 0.35f, the generated string in its current form is "35.0%"; if I replaced it with "%f%%", the output would change to "35.000000%". Would you prefer a specific format, such as rounding to 2 decimals?
I don't understand. It is possible to set the precision as well. If the floating-point number is very small, Double.toString will return scientific notation, for example 5.3242E-18. Is that also supported by tc? And if you simply want to concatenate strings, String.format is not the fastest approach.
@meszibalu As discussed, this is not a performance-sensitive part of the application, and to conform to existing code I'll leave it as it is.
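The trade-off in this thread can be sketched as follows. The class and helper names are hypothetical; the sketch only compares what `%s` and a fixed-precision `%.2f` produce for a float percentage. Whether tc accepts exponent notation or a comma as decimal separator is not established in the thread.

```java
import java.util.Locale;

public class FormatRatioDemo {
  // Mirrors the "%s%%" pattern from the snippet: the float is rendered via Float.toString.
  public static String percentWithS(float percent) {
    return String.format("%s%%", percent);
  }

  // Fixed two-decimal rendering; the decimal separator comes from the given locale.
  public static String percentWithF(Locale locale, float percent) {
    return String.format(locale, "%.2f%%", percent);
  }

  public static void main(String[] args) {
    System.out.println(percentWithS(35.0f));                 // 35.0%
    System.out.println(percentWithS(1.0e-5f));               // 1.0E-5%
    System.out.println(percentWithF(Locale.ROOT, 35.0f));    // 35.00%
    System.out.println(percentWithF(Locale.GERMANY, 35.0f)); // 35,00%
  }
}
```

If `%f` were adopted, passing an explicit locale such as `Locale.ROOT` would keep the generated command independent of the JVM's default locale.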
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/DelayPackagesCommandAction.java
}

private String getCommand(String operation){
  return String.format("tc qdisc %s dev eth0 root netem delay %sms reorder %s%% 50%%",
Can we parameterize the name of the interface?
Neat stuff, @BukrosSzabolcs !
What kind of cluster testing have you done with these so far?
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/CommandAction.java
@Override
public void perform() throws Exception {
  if(clusterManager == null){
Any reason to not have this in init()?
This way, if it runs on a non-distributed cluster, it just does nothing. That might be better in case someone mixes these with older actions.
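A minimal sketch of the behavior described in this reply, using hypothetical stand-in types rather than the PR's actual classes: the null check lives in `perform()`, so the action quietly skips its work when no cluster manager is available.

```java
public class NoOpWhenNotDistributedDemo {
  /** Hypothetical stand-in for the cluster manager used by the real actions. */
  public interface ClusterManager {
    void exec(String host, String command);
  }

  /** Hypothetical action that does nothing when no cluster manager is set. */
  public static class CommandAction {
    private final ClusterManager clusterManager; // null on a non-distributed cluster
    private final String command;

    public CommandAction(ClusterManager clusterManager, String command) {
      this.clusterManager = clusterManager;
      this.command = command;
    }

    /** Returns true if the command was dispatched, false if it was skipped. */
    public boolean perform(String host) {
      if (clusterManager == null) {
        return false; // skip instead of failing, so mixed monkeys keep running
      }
      clusterManager.exec(host, command);
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(new CommandAction(null, "true").perform("host1"));
    System.out.println(new CommandAction((h, c) -> { }, "true").perform("host1"));
  }
}
```

Checking in `init()` instead would surface the misconfiguration earlier, at the cost of failing the whole monkey rather than a single action.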
/**
 * Action corrupts region server data.
 */
public class CorruptDataFilesAction extends Action {
Neat idea, but how could we tell via automation when a file was expectedly corrupted vs. unexpectedly?
We can't, as far as I'm aware. It's not clear to me what the intended use of these tests is, but they were requested by stack, so I added them. They are so destructive that I couldn't even restart HBase after running them and had to delete all HBase-related data from HDFS and ZooKeeper to be able to run HBase on the cluster again.
My best guess is to use them for active testing and run them in the background while monitoring HBase status and behavior.
I was worried about corrupting something critical like hbase:meta, a table descriptor, or something like that. I think corrupting a single HFile for a user table is a more "reasonable" failure condition which wouldn't have a long-lasting impact on the ability of HBase to keep working.
@saintstack that jive with what you were thinking or you have something else in mind?
Resolving this -- defaulting to $hbase.root/data/default is a nice compromise! Thanks for changing Szabolcs!
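The compromise above amounts to restricting the destructive actions to files of user tables in the default namespace. A rough sketch of such a filter follows, using local `java.nio` paths as a stand-in for HDFS paths; the helper name is hypothetical and this is not the PR's implementation.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class UserTableFileFilterDemo {
  // Returns true only for paths below <rootDir>/data/default, i.e. tables in the
  // default namespace. System tables such as hbase:meta live under data/hbase.
  public static boolean isUnderDefaultNamespace(Path rootDir, Path file) {
    Path userData = rootDir.resolve("data").resolve("default").normalize();
    return file.normalize().startsWith(userData);
  }

  public static void main(String[] args) {
    Path root = Paths.get("/hbase");
    System.out.println(isUnderDefaultNamespace(root,
        Paths.get("/hbase/data/default/t1/region1/cf/hfile1")));
    System.out.println(isUnderDefaultNamespace(root,
        Paths.get("/hbase/data/hbase/meta/region2/info/hfile2")));
  }
}
```

The first call is expected to print `true` and the second `false`, which keeps catalog data out of reach of the corrupt and delete actions.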
}

private String getCommand(String operation){
  return String.format("tc qdisc %s dev eth0 root netem corrupt %s%%", operation, ratio * 100);
Probably need to allow the network device to be specified, both for predictable network interface names and bonds.
https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
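One way to make the device configurable is to read it from a property with `eth0` as the fallback. This is a sketch under assumptions: the property key `network.interface` and the helper name are hypothetical and not taken from the PR.

```java
import java.util.Properties;

public class TcCommandDemo {
  // Hypothetical property key and default, chosen for illustration only.
  public static final String INTERFACE_KEY = "network.interface";
  public static final String DEFAULT_INTERFACE = "eth0";

  // Builds the netem "corrupt" command for the configured network device.
  public static String corruptCommand(Properties props, String operation, float ratio) {
    String iface = props.getProperty(INTERFACE_KEY, DEFAULT_INTERFACE);
    return String.format("tc qdisc %s dev %s root netem corrupt %s%%",
        operation, iface, ratio * 100);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(corruptCommand(props, "add", 0.5f));
    props.setProperty(INTERFACE_KEY, "bond0");
    System.out.println(corruptCommand(props, "del", 0.5f));
  }
}
```

With an empty `Properties` object the command targets `eth0`; setting the key switches every generated command to the named device, for example a bond or a predictably named interface.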
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/DeleteDataFilesAction.java
- rename base class for new actions to better reflect its role
- make network interface configurable for tc commands
- fix typos and logging
💔 -1 overall
...it/src/test/java/org/apache/hadoop/hbase/chaos/factories/DistributedIssuesMonkeyFactory.java
fix checkstyle
🎊 +1 overall
Trying to summarize:
- Javadoc needs to be updated (left comments) for the new network interface configuration (the change looks good otherwise! Thanks)
- Need feedback from @saintstack on how he was expecting HFile corruption to work. I would almost certainly expect he meant corruption on user table files only, but we should check.
- A couple of configs (network, disk fill size and timeout) would need to be tweaked by a user. Do we need docs somewhere to describe how these monkeys are configured? I'm not sure how this gets wired up off the top of my head.
- Two questions/requests by Balazs -- not sure if these reflect work to do or if they just need an ACK from Balazs as to why you're not incorporating that change.
Thanks Szabolcs!
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/CorruptPackagesCommandAction.java
...e-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/DuplicatePackagesCommandAction.java
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/LosePackagesCommandAction.java
- restrict file-based monkeys to HFiles
- extend javadoc
- make sure new monkey properties are loaded from generic properties
Hi @joshelser
Thanks for the review!
🎊 +1 overall
LGTM, looks like we're waiting on Balazs' review, too?