Conversation

@flyrain
Contributor

@flyrain flyrain commented Aug 23, 2024

Description

This adds support for a catalog backed by S3 in run_spark_sql.sh.

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Tested locally with S3 and the local filesystem. Both passed.

Checklist:

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas

Member

Unclear what "uses local filesystem" means in this context; we should probably describe what this means in the doc.

Contributor Author

Added more context

Member

@RussellSpitzer RussellSpitzer Aug 26, 2024

We should probably do a real usage string rather than this:

usage: ./run_spark_sql.sh [s3-path aws-role]

s3-path: Location on S3 to .....
aws-role: arn?role? used for authenticating catalog?client?both?

When path and role are absent, local filesystem is used for ...
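The suggestion above could be captured in the script as a single `USAGE` variable that both the help path and the error path can print. A minimal sketch; the exact wording and argument names here are assumptions, not the merged script's text:

```shell
# Hypothetical USAGE text along the lines suggested above; wording and
# argument names are illustrative, not the merged script's actual text.
USAGE="usage: ./run_spark_sql.sh [S3-location AWS-IAM-role]

  S3-location   S3 path used as the default base location for the catalog
  AWS-IAM-role  AWS IAM role the catalog assumes when accessing S3

With no arguments, the catalog is backed by the local filesystem."

echo "$USAGE"
```

Keeping the text in one variable lets later error handling reuse it instead of duplicating the string.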

Contributor Author

Added more context to explain it is a role used by the catalog

Member

Comment here seems to suggest you can set these as ENV variables but I believe the above two lines would overwrite anything stored in the ENV
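The point above can be demonstrated in a few lines. The variable name mirrors the script under review, but this is an illustration of shell assignment behavior, not the script's actual code:

```shell
# Illustration: a plain assignment from a positional parameter clobbers
# any value that was already exported in the environment.
set -- "s3://from-args"            # simulate calling the script with an arg
AWS_BASE_LOCATION="s3://from-env"  # simulate a pre-set environment value
AWS_BASE_LOCATION=$1               # plain assignment: env value is lost
echo "$AWS_BASE_LOCATION"          # prints s3://from-args

# If env fallback were actually wanted, a default expansion would keep it:
set --                             # simulate calling with no args
AWS_BASE_LOCATION="s3://from-env"
AWS_BASE_LOCATION=${1:-$AWS_BASE_LOCATION}
echo "$AWS_BASE_LOCATION"          # prints s3://from-env
```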

Contributor Author

Good catch. Removed the mention of environment variables to avoid confusion.

@flyrain flyrain force-pushed the support-s3-run-spark branch from 4621135 to 5e0bad3 Compare August 26, 2024 16:50
@flyrain flyrain requested a review from takidau as a code owner August 26, 2024 16:50
@flyrain
Contributor Author

flyrain commented Aug 26, 2024

Thanks @RussellSpitzer for the review. Resolved the comments. Please take another look.

# -----------------------------------------------------------------------------
#
# Usage:
# Without arguments: Runs against a catalog backed by the local filesystem
Member

I still think this should be a more traditional usage string

./run_spark_sql.sh [S3-location AWS-IAM-role]

Contributor Author

Fixed

# You must run 'use polaris;' as your first query in the spark-sql shell.
# Arguments:
# [S3 location] - The S3 path to use as the default base location for the catalog.
# [AWS IAM role] - The AWS IAM role to assume when the catalog accessing the S3 location.
Member

AWS IAM role for the catalog to assume when accessing the S3 Location

AWS_ROLE_ARN=$2
# Check if AWS variables are set
if [ -z "${AWS_BASE_LOCATION}" ] || [ -z "${AWS_ROLE_ARN}" ]; then
echo "AWS_BASE_LOCATION or/and AWS_ROLE_ARN not set. Please set them to create a catalog backed by S3."
Member

@RussellSpitzer RussellSpitzer Aug 28, 2024

This is still a bit confusing because these things are not "set"; they are passed through. So I think it would be better if we just checked the number of args passed to the command. Then we can just say we expected 2 or 0 args and got 1 or 3 or whatever.

if [ $# -ne 0 ] && [ $# -ne 2 ]; then
    echo "run_spark_sql.sh only accepts 0 or 2 arguments"
    echo "$USAGE"
    exit 1
fi

Feel free to ignore copying the USAGE through if you like but that would be another nice feature.
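Extended with the "got N" message mentioned above, the check could look like this as a self-contained sketch; the USAGE text and the function name are assumptions, not the merged code:

```shell
#!/bin/sh
# Sketch only: validate that exactly 0 or 2 positional args were passed,
# reporting the count actually received, as suggested above.
USAGE="usage: ./run_spark_sql.sh [S3-location AWS-IAM-role]"

check_args() {
    if [ "$1" -ne 0 ] && [ "$1" -ne 2 ]; then
        echo "run_spark_sql.sh expects 0 or 2 arguments, got $1" >&2
        echo "$USAGE" >&2
        return 1
    fi
}

check_args "$#" || exit 1
```

Wrapping the check in a function keeps the early-exit logic at the top of the script in one place.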

Contributor Author

Fixed in the new commit

@flyrain
Contributor Author

flyrain commented Aug 29, 2024

Thanks @RussellSpitzer for the review. Resolved all comments. Ready for another look.

@RussellSpitzer RussellSpitzer merged commit adc3e60 into apache:main Sep 3, 2024
@RussellSpitzer
Member

Thanks @flyrain ! Merged
