-
Notifications
You must be signed in to change notification settings - Fork 3k
Add DataLakeFileSystemClient constructor in ADLSFileIO #14966
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add DataLakeFileSystemClient constructor in ADLSFileIO #14966
Conversation
|
@rymurr can you take a look at this PR? |
| */ | ||
| public ADLSFileIO(SerializableSupplier<DataLakeFileSystemClient> clientSupplier) { | ||
| this.clientSupplier = clientSupplier; | ||
| this.properties = SerializableMap.copyOf(Maps.newHashMap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we do this when it isn't done in teh no-op constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted this constructor to be able to be used without calling initialize initialize does a bunch of stuff we don't need to do if we are passing in our own client. To use this class we need azureProperties to be initialized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API contract is to call initialize though right? Are we sayign this constructor doesn't follow the contract?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constructor for example does not follow the contract either. The contract in my eyes is
- If you use the no arg constructor, you have to call initialize before using it
- If you use a constructor which has arguments you do not need to call initialize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if that is the case would it be better to call initialize from the constructor. To make the contract more clear and ensure if we change initialize we don't have to change the constructo
| new DataLakeFileSystemClientBuilder().httpClient(HTTP); | ||
| if (clientSupplier != null) { | ||
| return clientSupplier.get(); | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the else is redundant here and just causes unneccessary whitespace changes below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about constructors which call this directly instead of via the public method above . I could just remove it from the constructor above though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to removing else, and from the constructor above.
lets keep one place where the clientSupplier is used since everything else calls this method
| public DataLakeFileSystemClient client(String path) { | ||
| ADLSLocation location = new ADLSLocation(path); | ||
| return client(location); | ||
| if (clientSupplier != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am concerned about the lack of caching here and below for hte client. In AWSs IO we cache the client whereas here we create a new one every time. That could be very expensive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caching can be implemented on the client side with something like () -> someClient
Where someClient is is the same throughout the entire lifecycle of AdlsFileIO.
If you look at how this class used to work, it created a new client every time that we call client
IMO it makes sense to leave it as is and leave caching to the client, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its cached in the other FileIOs though. Its not clear from the API contract that the user is responsible for the caching and its different from the S3FileIO in that regard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like all the classes which have supplier have caching so it makes sense to add it here. I pushed a commit which caches just the supplier client and returns it on subsequent gets.
I did not add caching to the existing code and I did not follow the same map approach used in the other fileIo classes.
Reasoning for this is scope creep.
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding this, left a few comments
| new DataLakeFileSystemClientBuilder().httpClient(HTTP); | ||
| if (clientSupplier != null) { | ||
| return clientSupplier.get(); | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to removing else, and from the constructor above.
lets keep one place where the clientSupplier is used since everything else calls this method
| * <p>Calling {@link ADLSFileIO#initialize(Map)} will overwrite the properties and azureProperties | ||
| * set in this constructor, but the clientSupplier will continue to be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: not sure if we need all this explanation since both properties and azureProperties are just initialized with empty value
| */ | ||
| public ADLSFileIO(SerializableSupplier<DataLakeFileSystemClient> clientSupplier) { | ||
| this.clientSupplier = clientSupplier; | ||
| this.properties = SerializableMap.copyOf(Maps.newHashMap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if that is the case would it be better to call initialize from the constructor. To make the contract more clear and ensure if we change initialize we don't have to change the constructo
|
|
||
| @VisibleForTesting | ||
| DataLakeFileSystemClient client(ADLSLocation location) { | ||
| if (clientSupplier != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this means you can't use the client with multiple locations now. I think you have to handle a single client per location
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main difference here an aws/gcp is that one client can be used for multiple locations/buckets. In azure the client is tied to a certain container/path. I will explore creating a client with A Function<Location, ADLSClient> so that we can specify multiple containers with this class
spotless Add tests/cleanup remove redundant null check remove redundant comment Add caching just for supplier method Cache ADLS clients and call initialize in constructor use host for key caching
0751289 to
ead6304
Compare
Summary
Added a constructor to
ADLSFileIOwhich allows the creator to pass in a customDataLakeFileSystemClientas opposed to using the default constructor/auth supported by the libraryReasoning
If an application is constructing
ADLSFileIOthey may have internal auth mechanisms which cannot be contained in the library. The AWS and GCP implementations ofDelegateFileIOhave this functionality so its seems fitting to add it to the azure implementation as well