S3 file access PoC using Hadoop FS API #1556

Closed
dk1844 opened this issue Oct 15, 2020 · 2 comments
Labels: feature, priority: undecided, under discussion

Comments


dk1844 commented Oct 15, 2020

Background

Tony has raised an interesting point regarding the S3 PoC file access, which is currently written against the AWS SDK for S3:

Any reason for going through the pain of lower level AWS Java SDK libraries instead of using the Hadoop FileSystem API methods of emrfs?
If we’re directly using the AWS Java SDK libraries then we only ever have eventual consistency, there is no possibility to use any consistency features such as s3guard (for s3a) or emrfs consistent view.

Feature

A PoC should attempt to use the Hadoop FS API instead of the AWS SDK, in order to gain access to the consistency features mentioned above.
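
For illustration only, a minimal sketch of what file access through the Hadoop FS API could look like. The object and method names (`HadoopFsS3Sketch`, `fileSystemFor`, `readText`, `listNames`) are hypothetical and not taken from Enceladus or Atum; only the `org.apache.hadoop.fs` calls are real API. It assumes `hadoop-common` is on the classpath, plus `hadoop-aws` with configured credentials for `s3a://` locations.

```scala
import java.io.ByteArrayOutputStream
import java.net.URI
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Hypothetical helper object; names are illustrative, not from the codebase.
object HadoopFsS3Sketch {

  /** Lets Hadoop pick the FileSystem implementation from the URI scheme
    * (e.g. s3a://, s3://, hdfs://, file://). */
  def fileSystemFor(location: String, conf: Configuration): FileSystem =
    FileSystem.get(new URI(location), conf)

  /** Reads a whole (small) file as UTF-8 text through the Hadoop FS API. */
  def readText(fs: FileSystem, file: String): String = {
    val in = fs.open(new Path(file))
    try {
      val out = new ByteArrayOutputStream()
      // Copy the stream in 4 KiB chunks; the input is closed in `finally`.
      IOUtils.copyBytes(in, out, 4096, false)
      new String(out.toByteArray, StandardCharsets.UTF_8)
    } finally in.close()
  }

  /** Lists the names of the direct children of a directory, sorted. */
  def listNames(fs: FileSystem, dir: String): Seq[String] =
    fs.listStatus(new Path(dir)).map(_.getPath.getName).sorted.toSeq
}
```

Because the implementation is resolved from the URI scheme, the same calls would work against `s3a://some-hypothetical-bucket/path` (S3A), an `s3://` location on EMR (EMRFS), or a local `file://` directory in tests. Any consistency layer would then be a matter of file system configuration rather than application code.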

dk1844 added the feature, under discussion, and priority: undecided labels Oct 15, 2020
dk1844 self-assigned this Oct 16, 2020

dk1844 commented Oct 20, 2020

Intending to base the implementation of this feature on https://github.com/AbsaOSS/spark-s3-writer-poc/pull/3, both in Atum and here in Enceladus.


dk1844 commented Nov 9, 2020

Resolved in #1586

dk1844 closed this as completed Nov 9, 2020
benedeki added a commit that referenced this issue Jan 29, 2021
#1422 and 1423 Remove HDFS and Oozie from Menas

#1422 Fix HDFS location validation

#1424 Add Menas Dockerfile

#1416 hadoop-aws 2.8.5 + s3 aws sdk 2.13.65 compiles.

#1416 - enceladus on S3:
* - all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking)
# Add menasfargate into hosts
# paste
# save & exit (ctrl+O, ctrl+X)

#1416 - enceladus on S3 - (crude) conformance works on s3 (s3 std input, s3 conf output)
* Merge spline 0.5.3 into aws-poc
* Update spline to 0.5.4 for AWS PoC

#1503 Remove HDFS url Validation
* New dockerfile - smaller image
* s3 persistence (atum, sdk fs usage, ...) (#1526)

#1526 
* FsUtils divided into LocalFsUtils & HdfsUtils
* PathConfigSuite update
* S3FsUtils with tail-recursive pagination accumulation - now generic with optional short-circuit breakOut
* TestRunnerJob updated to manually cover the cases - should serve as a basis for tests
* HdfsUtils replaced by trait DistributedFsUtils (except for MenasCredentials loading & nonSplittable splitting)
* using final version of s3-powered Atum (3.0.0)
* mockito-update version update, scalatest version update
* S3FsUtilsSuite: exists, read, sizeDir(hidden, non-hidden, recursive), non-splittable (simple, recursive with breakOut), delete (recursive), version find (simple - empty, recursive)
* explicit stubbing fix for hyperdrive

#1556 file access PoC using Hadoop FS API (#1586)
* s3 using hadoop fs api
* s3 sdk usage removed (pom, classes, tests)
* atum final version 3.1.0 used
* readStandardizationInputData(... path: String)(implicit ... fs: FileSystem) -> readStandardizationInputData(input: PathWithFs)
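
The signature change in the last bullet can be read as bundling a path with the file system that serves it, so callers no longer pass an implicit `FileSystem`. A rough sketch under that assumption follows; the field names, the `fromPath` helper and `inputExists` are hypothetical, and the actual `PathWithFs` in #1586 may differ.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumed shape: a path string paired with the FileSystem it belongs to.
final case class PathWithFs(path: String, fileSystem: FileSystem)

object PathWithFs {
  /** Resolves the FileSystem from the path's own scheme (hypothetical helper). */
  def fromPath(path: String)(implicit conf: Configuration): PathWithFs =
    PathWithFs(path, new Path(path).getFileSystem(conf))
}

object PathWithFsSketch {
  // Before (implicit style): def inputExists(path: String)(implicit fs: FileSystem): Boolean
  // After: the path carries its file system, so input and output locations
  // may live on different file systems without ambiguity.
  def inputExists(input: PathWithFs): Boolean =
    input.fileSystem.exists(new Path(input.path))
}
```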


#1554 Tomcat with TLS container in Docker container

#1554 Added envoy config + enabling running unencrypted container

#1499 Add authentication to /lineage + update spline to 0.5.5

#1618 - fixes failing spline 0.5.5 integration by providing compatible commons library version. Test-ran on EMR. (#1619)

#1434 Add new way of serving properties to Docker

#1622: Merge of aws-poc to develop branch
* put back HDFS browser
* put back Oozie
* downgraded Spline
* Scopt 4.0.0
* AWS SDK Exclusion
* ATUM version 3.2.2

Co-authored-by: Saša Zejnilović <zejnils@gmail.com>
Co-authored-by: Daniel Kavan <dk1844@gmail.com>
Co-authored-by: Adrian Olosutean <adi.olosutean@gmail.com>
Co-authored-by: Adrian Olosutean <adrian.olosutean@absa.africa>
Co-authored-by: Jan Scherbaum <kmoj02@gmail.com>