-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Commoncrawl pipeline] Add load from commoncrawl component #269
[Commoncrawl pipeline] Add load from commoncrawl component #269
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shayorshay ! Looks like a very clean component ;)
Left a few minor comments but should be good to go
@@ -0,0 +1,3 @@ | |||
boto3==1.26.158 | |||
fondant | |||
pyarrow>=7.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think pyarrow is installed with Fondant so no need to include it here
|
||
logger = logging.getLogger(__name__) | ||
|
||
S3_BASE_URL = "s3://commoncrawl/crawl-data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not being used anywhere
S3_COMMONCRAWL_BUCKET = "commoncrawl" | ||
|
||
|
||
def fetch_warc_file_from_s3(s3_bucket: str, s3_key) -> dd.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a docstring for the argumens.
Would also prefer to change s3_key
to bucket_path
. Feels more intuitive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s3_key
is the key of the object in the bucket not the bucket path. Would object_key
be a better name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my bad, just read on it and it seems to be an AWS specific notation (still used to GCP). It's fine to leave it as is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shayorshay, one minor comment since our latest version of the build script expects fondant dependencies to be defined from github.
@@ -0,0 +1,2 @@ | |||
boto3==1.26.158 | |||
fondant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fondant | |
git+https://github.com/ml6team/fondant@main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done :)
This component is the first part of the commoncrawl pipeline. Given an index, this component loads the corresponding index file from the AWS Public Data Sets (S3 bucket) and returns a list of its WARC segment file paths as a dataframe.
This component is the first part of the commoncrawl pipeline. Given an index, this component loads the corresponding index file from the AWS Public Data Sets (S3 bucket) and returns a list of its WARC segment file paths as a dataframe.