Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HBASE-23085: Network and Data related Actions #675

Merged
merged 4 commits into from
Nov 6, 2019

Conversation

BukrosSzabolcs
Copy link
Contributor

Add monkey actions:

  • manipulate network packages with tc (reorder, loose,...)
  • add CPU load
  • fill the disk
  • corrupt or delete regionserver data files

Create monkey factories for the new actions
Extend HBaseClusterManager to allow sudo calls
Fix a copy/paste issue wih monkey constants in some factories

Add monkey actions:
- manipulate network packages with tc (reorder, loose,...)
- add CPU load
- fill the disk
- corrupt or delete regionserver data files

Create monkey factories for the new actions
Extend HBaseClusterManager to allow sudo calls
Fix a copy/paste issue wih monkey constants in some factories
@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
💙 reexec 1m 21s Docker mode activated.
_ Prechecks _
💚 dupname 0m 0s No case conflicting files found.
💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
💚 @author 0m 0s The patch does not contain any @author tags.
💚 test4tests 0m 0s The patch appears to include 19 new or modified test files.
_ master Compile Tests _
💚 mvninstall 5m 42s master passed
💚 compile 0m 30s master passed
💚 checkstyle 0m 20s master passed
💚 shadedjars 4m 50s branch has no errors when building our shaded downstream artifacts.
💚 javadoc 0m 16s master passed
💙 spotbugs 5m 8s Used deprecated FindBugs config; considering switching to SpotBugs.
💚 findbugs 0m 0s master passed
_ Patch Compile Tests _
💚 mvninstall 5m 14s the patch passed
💚 compile 0m 33s the patch passed
💚 javac 0m 33s the patch passed
💚 checkstyle 0m 18s the patch passed
💚 whitespace 0m 0s The patch has no whitespace issues.
💚 shadedjars 4m 46s patch has no errors when building our shaded downstream artifacts.
💚 hadoopcheck 16m 15s Patch does not cause any errors with Hadoop 2.8.5 2.9.2 or 3.1.2.
💚 javadoc 0m 14s the patch passed
💚 findbugs 0m 0s the patch passed
_ Other Tests _
💚 unit 0m 59s hbase-it in the patch passed.
💚 asflicense 0m 14s The patch does not generate ASF License warnings.
47m 34s
Subsystem Report/Notes
Docker Client=19.03.2 Server=19.03.2 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/1/artifact/out/Dockerfile
GITHUB PR #675
JIRA Issue HBASE-23085
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
uname Linux 43e0d0192ec3 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 GNU/Linux
Build tool maven
Personality /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-675/out/precommit/personality/provided.sh
git revision master / ca0d9f3
Default Java 1.8.0_181
Test Results https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/1/testReport/
Max. process+thread count 400 (vs. ulimit of 10000)
modules C: hbase-it U: hbase-it
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/1/console
versions git=2.11.0 maven=2018-06-17T18:33:14Z)
Powered by Apache Yetus 0.11.0 https://yetus.apache.org

This message was automatically generated.

Comment on lines +60 to +61
//This will always happen. We use timeout to kill a continously running process
//after the duration expires
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add /bin/true at the end of the command line. We don't know at this point whether we got a network error for example or the script returned with error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@meszibalu As discussed for /bin/true to make a difference we should increase the outer timeout and that would not be much better so I will leave it as it is.

Comment on lines +34 to +35
"seq 1 %s | xargs -I{} -n 1 -P %s timeout %s dd if=/dev/urandom of=/dev/null bs=1M " +
"iflag=fullblock";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%s is used but numbers are added to the String.format arguments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is intentional. %s uses Integer.toString, which is predictable while %d uses locale specific formatting that might change.

}

private String getCommand(String operation){
return String.format("tc qdisc %s dev eth0 root netem corrupt %s%%", operation, ratio * 100);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please format floats with %f.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the advantage? For a ration of 0.35f, the generated string in it's current form returns "35.0%", if I would replace it with "%f%%", the output would change to "35.000000%". Would you prefer a specific format, like round to 2 decimals?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand it. It is possible to set the precision as well. If the floating point number is too small for example, Double.toString will return 0.53242e-17 for example. Is it also supported by tc? If you want to simply concatenate strings, String.format is not the fastest approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@meszibalu As discussed it is not a performance sensitive part of the application and to conform to existing code I'll leave it as it is.

}

private String getCommand(String operation){
return String.format("tc qdisc %s dev eth0 root netem delay %sms reorder %s%% 50%",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we parameterize the name of the interface?

Copy link
Member

@joshelser joshelser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat stuff, @BukrosSzabolcs !

What kind of cluster testing have you done with these so far?


@Override
public void perform() throws Exception {
if(clusterManager == null){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to not have this in init()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way if it runs on a non-distributed cluster it just does nothing. It might be better in case someone mixes them with older actions

/**
* Action corrupts region server data.
*/
public class CorruptDataFilesAction extends Action {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat idea, but how could we tell via automation when a file was expectedly corrupted vs. unexpectedly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't as far as I'm aware. It's not clear to me what is the intended use of these tests, but they were requested by stack so I added them. They are so destructive I couldn't eve restart hbase after running them and had to delete every hbase related data from hdfs and zokeeper to be able to run hbase on the cluster again.
My best guess is to use them for active testing and run them in the background while monitoring hbase status/behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was worried with corrupting something critical like hbase:meta, a table descriptor, or something like that. I think corrupting a single hfile for a user-table is a more "reasonable" failure condition which wouldn't have long-lasting impact on the ability for HBase to keep working.

@saintstack that jive with what you were thinking or you have something else in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving this -- defaulting to $hbase.root/data/default is a nice compromise! Thanks for changing Szabolcs!

}

private String getCommand(String operation){
return String.format("tc qdisc %s dev eth0 root netem corrupt %s%%", operation, ratio * 100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably need to allow the network device to be specified, both for predictable network interface names and bonds.

https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/

rename base class for new actions to better reflect it's role
make network iterface configurable for tc commands
fix typos and logging
@Apache-HBase
Copy link

💔 -1 overall

Vote Subsystem Runtime Comment
💙 reexec 0m 35s Docker mode activated.
_ Prechecks _
💚 dupname 0m 1s No case conflicting files found.
💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
💚 @author 0m 0s The patch does not contain any @author tags.
💚 test4tests 0m 0s The patch appears to include 20 new or modified test files.
_ master Compile Tests _
💚 mvninstall 5m 28s master passed
💚 compile 0m 29s master passed
💚 checkstyle 0m 20s master passed
💚 shadedjars 4m 40s branch has no errors when building our shaded downstream artifacts.
💚 javadoc 0m 18s master passed
💙 spotbugs 5m 1s Used deprecated FindBugs config; considering switching to SpotBugs.
💚 findbugs 0m 0s master passed
💛 patch 5m 11s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
💚 mvninstall 5m 3s the patch passed
💚 compile 0m 31s the patch passed
💚 javac 0m 31s the patch passed
💔 checkstyle 0m 18s hbase-it: The patch generated 4 new + 67 unchanged - 0 fixed = 71 total (was 67)
💚 whitespace 0m 0s The patch has no whitespace issues.
💚 shadedjars 4m 38s patch has no errors when building our shaded downstream artifacts.
💚 hadoopcheck 15m 50s Patch does not cause any errors with Hadoop 2.8.5 2.9.2 or 3.1.2.
💚 javadoc 0m 15s the patch passed
💚 findbugs 0m 0s the patch passed
_ Other Tests _
💚 unit 0m 51s hbase-it in the patch passed.
💚 asflicense 0m 13s The patch does not generate ASF License warnings.
45m 53s
Subsystem Report/Notes
Docker Client=19.03.3 Server=19.03.3 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/2/artifact/out/Dockerfile
GITHUB PR #675
JIRA Issue HBASE-23085
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
uname Linux bae63414ab9e 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 GNU/Linux
Build tool maven
Personality /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-675/out/precommit/personality/provided.sh
git revision master / fdac2dd
Default Java 1.8.0_181
checkstyle https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/2/artifact/out/diff-checkstyle-hbase-it.txt
Test Results https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/2/testReport/
Max. process+thread count 401 (vs. ulimit of 10000)
modules C: hbase-it U: hbase-it
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/2/console
versions git=2.11.0 maven=2018-06-17T18:33:14Z)
Powered by Apache Yetus 0.11.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
💙 reexec 1m 14s Docker mode activated.
_ Prechecks _
💚 dupname 0m 0s No case conflicting files found.
💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
💚 @author 0m 0s The patch does not contain any @author tags.
💚 test4tests 0m 0s The patch appears to include 20 new or modified test files.
_ master Compile Tests _
💚 mvninstall 6m 29s master passed
💚 compile 0m 30s master passed
💚 checkstyle 0m 21s master passed
💚 shadedjars 4m 43s branch has no errors when building our shaded downstream artifacts.
💚 javadoc 0m 17s master passed
💙 spotbugs 5m 2s Used deprecated FindBugs config; considering switching to SpotBugs.
💚 findbugs 0m 0s master passed
💛 patch 5m 11s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
💚 mvninstall 5m 5s the patch passed
💚 compile 0m 28s the patch passed
💚 javac 0m 28s the patch passed
💚 checkstyle 0m 19s the patch passed
💚 whitespace 0m 0s The patch has no whitespace issues.
💚 shadedjars 4m 44s patch has no errors when building our shaded downstream artifacts.
💚 hadoopcheck 16m 11s Patch does not cause any errors with Hadoop 2.8.5 2.9.2 or 3.1.2.
💚 javadoc 0m 15s the patch passed
💚 findbugs 0m 0s the patch passed
_ Other Tests _
💚 unit 0m 59s hbase-it in the patch passed.
💚 asflicense 0m 13s The patch does not generate ASF License warnings.
48m 14s
Subsystem Report/Notes
Docker Client=19.03.2 Server=19.03.2 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/3/artifact/out/Dockerfile
GITHUB PR #675
JIRA Issue HBASE-23085
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
uname Linux 561b327c4a88 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 GNU/Linux
Build tool maven
Personality /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-675/out/precommit/personality/provided.sh
git revision master / ba12d5b
Default Java 1.8.0_181
Test Results https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/3/testReport/
Max. process+thread count 406 (vs. ulimit of 10000)
modules C: hbase-it U: hbase-it
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/3/console
versions git=2.11.0 maven=2018-06-17T18:33:14Z)
Powered by Apache Yetus 0.11.0 https://yetus.apache.org

This message was automatically generated.

Copy link
Member

@joshelser joshelser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to summarize:

  • Javadoc needs to be updated (left comments) for the new network interface configuration (the change looks good otherwise! Thanks)
  • Need feedback from @saintstack on how he was expecting HFile corruption to work. I would almost certainly expect he meant corruption on user table files only, but we should check.
  • A couple of configs (network, disk fill size and timeout) would need to be tweaked by a user. Do we need docs somewhere to describe how these monkies are configured? I'm not sure how this gets wired up off the top of my head.
  • Two questions/requests by Balazs -- not sure if these reflect work to do or if they just need an ACK from Balazs as to why you're not incorporating that change.

Thanks Szabolcs!

restrict file based monkeys to HFiles
extend javadoc
make sure new monkey properties are loaded from generic properties
@BukrosSzabolcs
Copy link
Contributor Author

BukrosSzabolcs commented Oct 30, 2019

Hi @joshelser

  • Updated the javadoc as requested
  • It makes sense to only corrupt HFiles. It should also make the monkey much less destructive. I went ahead and restricted the functionality.
  • I'm not sure if additional documentation is necessary. I created separate factories for the new monkey actions so they would not affect existing setups. Also if someone choose to use them customizing the values is very straightforward. We check regular application properties for monkey related ones. If the integration test is run from the command line it is also possible to add a "monkeyProps" option to specify a property file to use for this run on top of the regular properties. There is nothing new here.
  • The only thing I would like to stress is that the TC actions require passwordless ssh between the machines in the cluster. It is mentioned in the comments, but if you think that is not enough please let me know where else I should mention it.
  • Discussed those points with Balázs and he said they are fine like this. I'll ask him to leave a comment

Thanks for the review!

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
💙 reexec 1m 9s Docker mode activated.
_ Prechecks _
💚 dupname 0m 0s No case conflicting files found.
💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
💚 @author 0m 0s The patch does not contain any @author tags.
💚 test4tests 0m 0s The patch appears to include 20 new or modified test files.
_ master Compile Tests _
💚 mvninstall 5m 52s master passed
💚 compile 0m 30s master passed
💚 checkstyle 0m 19s master passed
💚 shadedjars 4m 59s branch has no errors when building our shaded downstream artifacts.
💚 javadoc 0m 14s master passed
💙 spotbugs 5m 16s Used deprecated FindBugs config; considering switching to SpotBugs.
💚 findbugs 0m 0s master passed
💛 patch 5m 24s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
💚 mvninstall 5m 34s the patch passed
💚 compile 0m 29s the patch passed
💚 javac 0m 29s the patch passed
💚 checkstyle 0m 17s the patch passed
💚 whitespace 0m 0s The patch has no whitespace issues.
💚 shadedjars 5m 1s patch has no errors when building our shaded downstream artifacts.
💚 hadoopcheck 17m 17s Patch does not cause any errors with Hadoop 2.8.5 2.9.2 or 3.1.2.
💚 javadoc 0m 12s the patch passed
💚 findbugs 0m 0s the patch passed
_ Other Tests _
💚 unit 0m 54s hbase-it in the patch passed.
💚 asflicense 0m 13s The patch does not generate ASF License warnings.
49m 43s
Subsystem Report/Notes
Docker Client=19.03.4 Server=19.03.4 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/4/artifact/out/Dockerfile
GITHUB PR #675
JIRA Issue HBASE-23085
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
uname Linux f4bf656787d5 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 GNU/Linux
Build tool maven
Personality /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-675/out/precommit/personality/provided.sh
git revision master / 2a969e1
Default Java 1.8.0_181
Test Results https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/4/testReport/
Max. process+thread count 393 (vs. ulimit of 10000)
modules C: hbase-it U: hbase-it
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-675/4/console
versions git=2.11.0 maven=2018-06-17T18:33:14Z)
Powered by Apache Yetus 0.11.0 https://yetus.apache.org

This message was automatically generated.

Copy link
Member

@joshelser joshelser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, looks like we're waiting on Balazs' review, too?

/**
* Action corrupts region server data.
*/
public class CorruptDataFilesAction extends Action {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving this -- defaulting to $hbase.root/data/default is a nice compromise! Thanks for changing Szabolcs!

@meszibalu meszibalu merged commit d2142a8 into apache:master Nov 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants