Add Azure Data Lake Gen2 connector for PinotFS #5116

Merged 2 commits on Mar 17, 2020

Contributor

@snleee snleee commented Mar 5, 2020

  1. Testing has been done by attaching ADLS Gen2 to a local deployment.
  2. move() is implemented as a copy followed by a delete because of an Azure SDK issue with the rename() API:
    [BUG] srcUri is decoded and not encoded back for rename API (ADLS Gen2) Azure/azure-sdk-for-java#8761
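
The copy-then-delete approach in item 2 can be illustrated with a minimal sketch. This is not the PR's code: it uses local files via java.nio as a stand-in for ADLS Gen2 paths, and the class name CopyThenDeleteMove is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: local files stand in for ADLS Gen2 paths.
final class CopyThenDeleteMove {
  /** Moves src to dst by copying and then deleting; returns false if src is not a regular file. */
  static boolean move(Path src, Path dst) {
    if (!Files.isRegularFile(src)) {
      return false;
    }
    try {
      Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
      // Delete the source only after the copy has completed without an exception.
      Files.delete(src);
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Unlike a native rename, this is not atomic: if the delete fails after a successful copy, both the source and the destination exist.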

@snleee snleee requested review from jackjlli and mayankshriv March 5, 2020 09:58
Contributor

@mayankshriv mayankshriv left a comment

Does it make sense to add tests using Mockito? If not, we can add tests for utility functions such as convertAzureStylePathToUriStylePath.

@snleee snleee force-pushed the adls-gen2 branch 3 times, most recently from 6112a99 to 4d2a8ff Compare March 10, 2020 03:34

codecov-io commented Mar 10, 2020

Codecov Report

Merging #5116 into master will decrease coverage by 0.76%.
The diff coverage is 66.66%.

@@             Coverage Diff              @@
##             master    #5116      +/-   ##
============================================
- Coverage     58.32%   57.55%   -0.77%     
  Complexity       12       12              
============================================
  Files          1209     1184      -25     
  Lines         64541    62424    -2117     
  Branches       9484     9143     -341     
============================================
- Hits          37643    35929    -1714     
+ Misses        24143    23847     -296     
+ Partials       2755     2648     -107

| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| .../java/org/apache/pinot/spi/filesystem/PinotFS.java | 72.22% <66.66%> (-6.35%) | 0 <0> (ø) |
| ...ation/function/AggregationFunctionVisitorBase.java | 0% <0%> (-96%) | 0% <0%> (ø) |
| ...ommon/lineage/SegmentMergeLineageAccessHelper.java | 0% <0%> (-80%) | 0% <0%> (ø) |
| ...n/java/org/apache/pinot/common/utils/LLCUtils.java | 0% <0%> (-75%) | 0% <0%> (ø) |
| .../org/apache/pinot/common/config/RoutingConfig.java | 0% <0%> (-70%) | 0% <0%> (ø) |
| ...org/apache/pinot/common/metrics/BrokerMetrics.java | 44.44% <0%> (-44.45%) | 0% <0%> (ø) |
| .../org/apache/pinot/client/PinotClientException.java | 33.33% <0%> (-33.34%) | 0% <0%> (ø) |
| ...che/pinot/core/startree/v2/StarTreeV2Metadata.java | 62.5% <0%> (-29.17%) | 0% <0%> (ø) |
| ...a/manager/realtime/RealtimeSegmentDataManager.java | 50% <0%> (-25%) | 0% <0%> (ø) |
| ...rg/apache/pinot/core/transport/ServerInstance.java | 53.57% <0%> (-21.43%) | 0% <0%> (ø) |

... and 239 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d5c8398...67feb10.

Contributor Author

snleee commented Mar 12, 2020

MD5 computation time benchmark (on a MacBook with an SSD):

Format: MD5 compute time / total time, where total time = I/O (reading the file contents) + MD5 hash computation
1 MB: 17 / 20 ms
10 MB: 54 / 62 ms
100 MB: 402 / 462 ms
1000 MB: 2998 / 3497 ms

So, computing the MD5 hash of a 1 GB file adds about 3 seconds. Since this overhead can be non-trivial depending on the use case, I updated the code to make the MD5 check configurable.
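
A configurable MD5 check along these lines could look like the following sketch. It is an illustration only, not the code in this PR: the class name OptionalMd5, the method md5HexIfEnabled, and the boolean flag are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Optional;

// Hypothetical sketch of an MD5 check that can be switched off to avoid the hashing cost.
final class OptionalMd5 {
  /** Returns the hex MD5 of the file, or an empty Optional when the check is disabled. */
  static Optional<String> md5HexIfEnabled(Path file, boolean checksumEnabled) {
    if (!checksumEnabled) {
      // Skip both the extra read and the hash computation.
      return Optional.empty();
    }
    try (InputStream in = Files.newInputStream(file)) {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] buffer = new byte[8192];
      int read;
      // Stream the file through the digest so large files are never fully held in memory.
      while ((read = in.read(buffer)) != -1) {
        md5.update(buffer, 0, read);
      }
      StringBuilder hex = new StringBuilder();
      for (byte b : md5.digest()) {
        hex.append(String.format("%02x", b));
      }
      return Optional.of(hex.toString());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 is not available", e);
    }
  }
}
```

With the flag off, the caller pays neither the read nor the hash cost measured in the benchmark above.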

_blobServiceClient =
new BlobServiceClientBuilder().credential(sharedKeyCredential).endpoint(blobServiceEndpointUrl).buildClient();
_fileSystemClient = serviceClient.getFileSystemClient(fileSystemName);
LOGGER.error("AzureGen2PinotFS is initialized (accountName={}, fileSystemName={}, dfsServiceEndpointUrl={}, "
Contributor

why is this an error log?

Contributor Author

Good catch. I was debugging and forgot to change it back.

if (e.getStatusCode() == ALREADY_EXISTS_STATUS_CODE && e.getErrorCode().equals(PATH_ALREADY_EXISTS_ERROR_CODE)) {
return true;
}
LOGGER.error("Exception thrown while calling mkdir (uri = {})", uri, e);
Member

It'd be good to print the error status code here.

Contributor Author

@snleee snleee Mar 13, 2020

I'm passing e, the exception object, to the logger, so the status code should appear as part of the exception stack trace. Do you think it's better to add the status code explicitly along with the uri?

Member

Not sure whether the status code is included in the exception, but it'd be good to show it in the log. :)
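
One way to surface the status code explicitly, instead of relying on the stack trace, is to put it into the message itself. The sketch below is an assumption-laden illustration: StorageError is a hypothetical stand-in for the SDK exception (only the existence of a status code and an error code is taken from the snippet above), and the values used with it are made up.

```java
// Hypothetical sketch: builds a log message that carries the status and error codes explicitly.
final class MkdirLogSketch {
  /** Hypothetical stand-in for an SDK exception that exposes a status code and an error code. */
  static final class StorageError extends RuntimeException {
    final int statusCode;
    final String errorCode;

    StorageError(int statusCode, String errorCode, String message) {
      super(message);
      this.statusCode = statusCode;
      this.errorCode = errorCode;
    }
  }

  /** Formats the message so the codes are visible even if the stack trace is dropped. */
  static String describe(String uri, StorageError e) {
    return String.format(
        "Exception thrown while calling mkdir (uri = %s, statusCode = %d, errorCode = %s)",
        uri, e.statusCode, e.errorCode);
  }
}
```

The exception object can still be passed to the logger as the last argument so that the stack trace is kept alongside the explicit codes.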

mkdir(newDst);
} else {
// If src is a file, we need to copy.
copySucceeded |= copySrcToDst(currentSrc, newDst);
Member

What if some of the files fail to copy? It will still return true, right?

Contributor Author

Good catch :) I updated it. I also made the same change to GcsPinotFS, which had a similar issue.
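
The issue raised here comes from accumulating results with `|=`, which yields true as soon as any single copy succeeds. The thread does not show the actual fix, so the following is only a sketch of one possible correction, accumulating with `&=`; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: reports success only if every individual copy succeeded.
final class CopyAllSketch {
  static <T> boolean copyAll(List<T> sources, Predicate<? super T> copyOne) {
    boolean allSucceeded = true;
    for (T src : sources) {
      // &= keeps the result false once any copy fails, whereas |= would turn it
      // true after a single success. The right-hand side is always evaluated,
      // so the remaining copies are still attempted after a failure.
      allSucceeded &= copyOne.test(src);
    }
    return allSucceeded;
  }
}
```

Whether to keep copying after the first failure or to stop early is a separate design choice; this sketch keeps going.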

URI parentUri = Paths.get(dstUri).getParent().toUri();
mkdir(parentUri);
try {
Path parentPath = Paths.get(dstUri.getPath()).getParent();
Member

Can you add a test for the case where the scheme doesn't match and an exception is thrown?

Contributor Author

Added a test covering this case to LocalPinotFSTest.

/**
* Azure Data Lake Storage Gen2 implementation for the PinotFS interface.
*/
public class AzureGen2PinotFS extends PinotFS {
Member

Do we have any test for this class? We've noticed the code coverage is only 50% though.

Contributor Author

@snleee snleee Mar 13, 2020

I don't see a good way to test this. I've been testing it by hooking it up to a live ADLS Gen2 account. One option is to mock every single Azure SDK API that I'm calling using Mockito, but that doesn't really check much.

Another potential approach is to create an integration test using Azurite (https://github.com/Azure/Azurite), which is an Azure storage service emulator. However, it doesn't support Azure Data Lake Gen2.

By the way, I did verify all the functions against the live Data Lake Gen2.

Contributor

@mayankshriv mayankshriv left a comment

Please address comments from @jackjlli before merging.

Member

@jackjlli jackjlli left a comment

Thanks for addressing the comments! In terms of the tests, it'd still be good to have them. Even if the code works fine today, somebody else may change it in the future, and it'd be hard for them to test their change without repeating the manual testing you've done. You can add them later or just add a TODO there. You decide. Thanks!

Contributor Author

snleee commented Mar 17, 2020

@jackjlli Thanks for the comment. I added a TODO comment for the tests and will address it in a separate PR.

5 participants