-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Starcoder example pipeline + base components #175
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Niels!
Left a few comments, I think there are a few things still GCP specific that need to be removed.
I;m guessing this code and components will only work for our personally created ML6 version of the stack dataset (loading it from json and dumping it to a parquet under a new dataset repo). What if the user wants to work with the big stack datasets? Can we generalize well with those components?
from fondant.component import TransformComponent | ||
from fondant.logger import configure_logging | ||
|
||
from pii_detection import scan_pii |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from pii_detection import scan_pii | |
from pii_detection import scan_pii, redact_pii |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify this one? I'm importing redact_pii
in the line below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a matter of personal preference but I think both are valid :)
) | ||
result.columns = ["code_secrets", "code_has_secrets", "code_number_secrets"] | ||
|
||
dataframe = dataframe.merge(result, left_index=True, right_index=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we merging? can't you extend the original dataframe with some columns in the apply()
method since we're working with the same subset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are merging since Dask does not support multiple columns assignment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, @RobbeSneyders is that something we could resolve when we move to the pandas interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in pandas this works.
replacements=replacements, | ||
), | ||
axis=1, | ||
meta=(None, "str"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the first argument None
? usually it's an index
IP addresses: replace with one of n synthetic private IP addresses (IPv4 or IPv6) | ||
Keys: replace with one of n [sequence of 32 random characters/digits] | ||
|
||
TODO: add IPv6 and IPv4 separation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that for us or part of the source code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was taken from the BigCode project
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe not for this project but it would be interesting to have a way of working on updating components like this where the source code is taken from another repo. Right now it's just a snapshot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be ideal if this was a library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @NielsRogge! Left some comments
In general:
- Can we add a
code_
prefix to the names of these components? Or would they work out of the box on any text? I guess at least thefilter_comments
component wouldnt'. - I would remove anything related to the
run_locally
script and symlinks. This will be replaced by the local runner. - This PR has some conflicts with Custom component spec #191. We'll need to make sure everything is updated after merging both.
) | ||
result.columns = ["code_secrets", "code_has_secrets", "code_number_secrets"] | ||
|
||
dataframe = dataframe.merge(result, left_index=True, right_index=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in pandas this works.
IP addresses: replace with one of n synthetic private IP addresses (IPv4 or IPv6) | ||
Keys: replace with one of n [sequence of 32 random characters/digits] | ||
|
||
TODO: add IPv6 and IPv4 separation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be ideal if this was a library.
And can you also update the readme to include these components in the list? |
#191 is merged, so we should make the necessary updates here. |
c12bae4
to
a5cc882
Compare
@RobbeSneyders I've made the updates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @NielsRogge.
What do you think of my previous proposal:
Can we add a code_ prefix to the names of these components? Or would they work out of the box on any text? I guess at least the filter_comments component wouldnt'.
And left one more comment.
a5cc882
to
022800d
Compare
The pii redaction component would work on text as well, but the filter comments and filter metadata components are meant to be run on code. |
This PR: - adds a set of components aimed to process text/code datasets to the component registry. These components can be used to 1) filter code based on code to comments ratio 2) filter code based on line length 3) detect and redact (replace) PII or personal identifiable information from code. - adds an example pipeline that adds one specific component called `load_from_hub_stack` that loads a code dataset from the hub. Next, it uses the 3 components in sequence to process and filter the code dataset. --------- Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
This PR:
load_from_hub_stack
that loads a code dataset from the hub. Next, it uses the 3 components in sequence to process and filter the code dataset.