-
Notifications
You must be signed in to change notification settings - Fork 9.2k
HADOOP-18258. Merging of S3A Audit Logs #4383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
When I run the test in trunk using same command as above i.e, 'mvn clean verify -Dparallel-tests -DtestsThreadCount=4 -Dscale' command, then also I got errors which are almost similar to the errors that I got when I run test in HADOOP-18258 branch. |
mehakmeet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start, need more commenting on the AuditTool, make some of the hot paths log into DEBUG logs.
- Creation of the Local Directory for the audit logs needs to be unique for the different logging directories provided in the argument.
- We should be cleaning up the local files after we merge them up? since it'll become a large issue later on with big audit files.
- There should be some validation that needs to be done for the audit files? For eg, if we provide a directory that doesn't contain any audit files, what should happen?
aecc77f to
d8ba6da
Compare
mehakmeet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/AuditTool.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/AuditTool.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/AuditTool.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/S3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/S3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestS3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestS3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestS3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestS3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestS3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
|
few small nits here
After these, this LGTM. |
|
Yeah changed the javadocs and access modifiers. |
steveloughran
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good.
- make a subcommand of s3guard
- needs a test of the tool itself
- that transformation to parquet should be happening during the merge...this is going to break all the test assertions.
i don't think there is any need to download locally before the merge...you can just read the remote files line by line and convert and then save
this is essentially a very small map phase in a mapreduce job
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/AuditTool.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/AuditTool.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/AuditTool.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/AuditTool.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/AuditTool.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- best to read in say 2MB byte array and save to output stream passed in
- you can use readFully, but this doesn't actually handle that final block of data.
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/S3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/S3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestS3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestS3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
mukund-thakur
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start. Added some comments.
One thing to keep in mind while writing code for this is:
Not write only for S3A, think what if we have to add similar stuff for Azure, so what all methods and classes can be reused in future and add unit tests for those.
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/AuditTool.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/AuditTool.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merge files should return true/false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this is going to cause OOM for bigger files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if we are first copying all the files from S3 to local fs, we can try to parallelize this as well.
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/S3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/S3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/S3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestS3AAuditLogMerger.java
Outdated
Show resolved
Hide resolved
e228a0c to
c622c55
Compare
|
pushed the changes using -f to resolve conflicts |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
...ools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/S3AAuditLogMergerAndParser.java
Outdated
Show resolved
Hide resolved
...ools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/S3AAuditLogMergerAndParser.java
Outdated
Show resolved
Hide resolved
...ools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/S3AAuditLogMergerAndParser.java
Outdated
Show resolved
Hide resolved
...ools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/S3AAuditLogMergerAndParser.java
Outdated
Show resolved
Hide resolved
...ools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/S3AAuditLogMergerAndParser.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/AbstractAuditToolTest.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestAuditTool.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestAuditTool.java
Outdated
Show resolved
Hide resolved
.../hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestS3AAuditLogMergerAndParser.java
Outdated
Show resolved
Hide resolved
|
💔 -1 overall
This message was automatically generated. |
|
this isn't a bug, just inefficient, as the instanceof is always false if the argument is null. because this is generated code, it's not something which can be fixed. so instead we tell it to ignore this possible bug NP_NULL_INSTANCEOF
<Match>
<Class name="org.apache.hadoop.fs.s3a.audit.AvroDataRecord"/>
<Bug pattern="NP_NULL_INSTANCEOF"/>
</Match>you can check the module without having to wait for yetus to do it. mvn spotbugs:spotbugs |
|
Yeah okay then will ignore auto-generated class by adding suggested XML element in hadoop-tools/hadoop-aws/dev-support/findbugs-exclude.xml. |
|
🎊 +1 overall
This message was automatically generated. |
steveloughran
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. some minor changes requested, but otherwise its ready to go in
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/AuditTool.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/avro/AvroDataSchema.avsc
Outdated
Show resolved
Hide resolved
...p-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/mapreduce/S3AAuditLogMergerAndParser.java
Show resolved
Hide resolved
...p-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/mapreduce/S3AAuditLogMergerAndParser.java
Outdated
Show resolved
Hide resolved
...p-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/mapreduce/S3AAuditLogMergerAndParser.java
Show resolved
Hide resolved
| int fileLength = (int) fileStatus.getLen(); | ||
| byte[] byteBuffer = new byte[fileLength]; | ||
| try (FSDataInputStream fsDataInputStream = | ||
| fileSystem.open(fileStatus.getPath())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use the new openFile builder; there are examples such as
org.apache.hadoop.fs.AvroFSInput#AvroFSInput(org.apache.hadoop.fs.FileContext, org.apache.hadoop.fs.Path) to show the basic strategy.
if you an add the filestatus in .withFileStatus() we can save a HEAD request opening files on s3 and abfs, for faster reads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modified as suggested
...p-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/mapreduce/S3AAuditLogMergerAndParser.java
Outdated
Show resolved
Hide resolved
...p-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/mapreduce/S3AAuditLogMergerAndParser.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestAuditTool.java
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestAuditTool.java
Outdated
Show resolved
Hide resolved
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
Description of PR
Merging audit log files containing huge number of audit logs collected from a job like Hive or Spark job containing various S3 requests like list, head, get and put requests.
How was this patch tested?
Region : AP-South-1
Command used : mvn clean verify -Dparallel-tests -DtestsThreadCount=4 -Dscale
Getting these two errors while testing
For code changes:
LICENSE,LICENSE-binary,NOTICE-binaryfiles?