
Hadoop FS API usage #44

Merged 6 commits into master on Nov 5, 2020

Conversation

@dk1844 (Collaborator) commented on Oct 27, 2020

This feature introduces S3 file access via the Hadoop FS API.

  • Basically, an `org.apache.hadoop.fs.FileSystem` is created from the string path of the file (e.g. `s3://bucketName/path/on/s3` for the S3 implementation, or HDFS otherwise) and then we work with the FS as we would "normally" (see the sketch after this list).
  • Special attention has been paid to the fact that the filesystems used for reading and writing may differ (mixing HDFS and S3, or FSs backed by different S3 buckets); that is why method signatures now distinguish between `inputFs` and `outputFs` where applicable.
  • SDK-based S3 file access is kept, but those methods are now denoted `SdkS3` instead of just `S3`.
  • The S3 location now includes the protocol (supported: `s3`, `s3n`, and `s3a`), and that protocol is respected when the Hadoop FS is created for S3 locations.
  • `sparkSession.enableControlMeasuresTracking()` no longer needs an implicit `fs: FileSystem`; the FS is now part of the storer and is reused for the Spark listener if needed.
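A minimal sketch of the resolution described above (not the library's actual code): the `FileSystem` is obtained from the path's URI scheme via the standard `FileSystem.get(URI, Configuration)` call, so `s3`/`s3n`/`s3a` paths yield an S3-backed FS while other paths fall back to the configured default (typically HDFS). The helper name `fileSystemFor` and the example paths are illustrative.

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Resolve a FileSystem from the string path itself: the URI scheme decides
// whether we get an S3-backed FS (s3/s3n/s3a) or the configured default (HDFS).
def fileSystemFor(pathString: String, hadoopConf: Configuration): FileSystem =
  FileSystem.get(new URI(pathString), hadoopConf)

// In a Spark job the configuration would normally come from
// spark.sparkContext.hadoopConfiguration; a plain Configuration() is used
// here only to keep the sketch self-contained.
val hadoopConf = new Configuration()

// Input and output may live on different filesystems (HDFS + S3, or two
// different S3 buckets), hence the separate inputFs/outputFs parameters
// mentioned above. HDFS resolution relies on the cluster configuration.
val inputFs  = fileSystemFor("s3a://input-bucket/path/on/s3", hadoopConf)
val outputFs = fileSystemFor("hdfs:///user/someone/output", hadoopConf)
```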

Expected release version 3.1.0

Closes #43

todo: s3a future support?

Migration notes:
 - instead of `myS3String.toS3Location(region1)`, use `myS3String.toS3LocationOrFail.withRegion(region1)` or `myS3String.toS3Location.map(_.withRegion).get`, because `toS3Location` on its own now returns an `Option` (see the sketch below)
 - an implicit `FileSystem` is usually needed to distinguish between HDFS and S3 over the Hadoop FS API
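A hedged sketch of the migrated call sites. `toS3Location`, `toS3LocationOrFail`, and `withRegion` are the library helpers named above; `region1`, the example paths and bucket, and passing the region inside `map` are assumptions made for illustration only.

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Before (3.0.x): toS3Location(region) returned an S3Location directly.
//   val loc = "s3://bucket/path/on/s3".toS3Location(region1)

// After (3.1.0): toS3Location returns an Option, so either fail fast:
//   val loc = "s3://bucket/path/on/s3".toS3LocationOrFail.withRegion(region1)
// or keep the Option and unwrap explicitly:
//   val loc = "s3://bucket/path/on/s3".toS3Location.map(_.withRegion(region1)).get

// Methods that need to tell HDFS and S3 apart now take an implicit FileSystem,
// which the caller can point at the relevant scheme/bucket:
implicit val fs: FileSystem =
  FileSystem.get(new URI("s3a://bucket"), new Configuration())
```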
@dk1844 self-assigned this on Oct 27, 2020
@dk1844 changed the title from "Feature/43 emrfs fs api" to "Hadoop FS API usage" on Nov 2, 2020
@Zejnilovic (Collaborator) left a comment:

Not sure my approve should count but I checked the code at least

@AdrianOlosutean (Contributor) left a comment:

Code reviewed. LGTM

@dk1844 merged commit d331936 into master on Nov 5, 2020
@dk1844 deleted the feature/43-emrfs-fs-api branch on Nov 5, 2020, 11:40
@dk1844 (Collaborator, Author) commented on Nov 5, 2020

Released as 3.1.0.

Linked issue: Hadoop FS API implementation
3 participants