
First blog post from external contributor about AWS EMR #64

Merged: 4 commits into main on May 5, 2023

Conversation

stichbury
Contributor

Adding a new folder where I'll work on blog posts collaboratively when external authors want to contribute in Markdown.

Adding a new file because there's an author working on a post about EMR and Kedro 💃

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>
@stichbury stichbury added the Blog post creation label Apr 19, 2023
@stichbury stichbury self-assigned this Apr 19, 2023

## 2. Set up `CONF_ROOT`

By default, Kedro looks at the root `conf` folder for all its configurations (catalog, parameters, globals, credentials, logging) to run the pipelines. However, [this can be customized](https://docs.kedro.org/en/stable/kedro_project_setup/configuration.html#configuration-root) by changing `CONF_ROOT`  in `settings.py`.
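As a sketch of that override (assuming a standard project layout; note that Kedro 0.18.x names this setting `CONF_SOURCE`, while `CONF_ROOT` is the pre-0.18 name, and `configuration` below is a hypothetical folder name):

```python
# settings.py -- sketch of pointing Kedro at a non-default configuration root.
# Kedro 0.18.x uses CONF_SOURCE; releases before 0.18 used CONF_ROOT.
CONF_SOURCE = "configuration"
```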


To make this more annoying, in 0.18.7 we package configuration as a separate tar.gz file, which can be used in conjunction with the --conf-source flag
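A sketch of that workflow (the archive name under `dist/` is a placeholder; the actual name depends on the project name and version):

```shell
# Package the project: produces a distributable package plus a separate
# configuration tar.gz in dist/.
kedro package

# Run against the packaged configuration; the path is a placeholder.
kedro run --conf-source dist/conf-proj_name.tar.gz
```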

from proj_name.__main__ import main

if __name__ == "__main__":
    # params = [


I don't know if this comment style is intuitive

Contributor


I'll turn it into a triple-quoted block with appropriate indentation and add notes for better understanding.


To run the Spark job, upload the relevant files to an S3 bucket that the EMR cluster can access. The following artifacts should be uploaded to S3:

- .egg [file created in step #3]
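As a sketch of the upload step, assuming the AWS CLI is configured and that `my-emr-bucket` and the file names are placeholders:

```shell
# Upload the packaged artifacts to S3 so the EMR cluster can fetch them.
# Bucket, prefix, and file names below are all placeholders.
aws s3 cp dist/proj_name-0.1-py3.8.egg s3://my-emr-bucket/artifacts/
aws s3 cp dist/conf-proj_name.tar.gz s3://my-emr-bucket/artifacts/
```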


Question for engineers: egg is old, wheel is new. Should we switch to wheels?

Member


Obliterate eggs from everywhere please! kedro-org/kedro#2273

Contributor


Okay, I'll remove the .egg occurrences; I'll need to sanity-check whether spark-submit --py-files works the same with .whl files.
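For that sanity check, a sketch of the invocation (all bucket and file names are placeholders; Spark's documentation lists .zip, .egg, and .py for `--py-files`, and since a .whl is a zip archive this is exactly the case to verify against the Spark version on the cluster):

```shell
# Hypothetical spark-submit call to check that --py-files accepts a wheel.
# Bucket, wheel, and entrypoint names are placeholders.
spark-submit \
  --py-files s3://my-emr-bucket/artifacts/proj_name-0.1-py3-none-any.whl \
  s3://my-emr-bucket/artifacts/entrypoint.py
```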

@stichbury stichbury changed the title Add folder for blog post collaboration and first blog post First blog post from external contributor about AWS EMR May 2, 2023
@stichbury stichbury merged commit c50735f into main May 5, 2023
Labels: Blog post creation

4 participants