Add Starcoder example pipeline + base components #175

NielsRogge · 2023-05-31T08:43:02Z

This PR:

- adds a set of components for processing text/code datasets to the component registry. These components can be used to 1) filter code based on its code-to-comment ratio, 2) filter code based on line length, and 3) detect and redact (replace) PII, or personally identifiable information, in code.
- adds an example pipeline with one pipeline-specific component, load_from_hub_stack, that loads a code dataset from the hub. The pipeline then runs the 3 components above in sequence to process and filter the code dataset.
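
For readers unfamiliar with these filters, here is a minimal sketch of what filtering on comment ratio and line length can look like. It is not the code from this PR: the function names, the `#`-only comment detection, and every threshold value are assumptions chosen for illustration.

```python
from __future__ import annotations


def comment_ratio(code: str, comment_prefix: str = "#") -> float:
    """Fraction of non-empty lines that start with the comment prefix."""
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comments = sum(1 for line in lines if line.startswith(comment_prefix))
    return comments / len(lines)


def line_stats(code: str) -> tuple[float, int]:
    """Return (average line length, maximum line length)."""
    lengths = [len(line) for line in code.splitlines()]
    if not lengths:
        return 0.0, 0
    return sum(lengths) / len(lengths), max(lengths)


def keep_sample(
    code: str,
    min_comment_ratio: float = 0.1,   # hypothetical threshold
    max_comment_ratio: float = 0.9,   # hypothetical threshold
    avg_line_length_max: int = 100,   # hypothetical threshold
    max_line_length: int = 1000,      # hypothetical threshold
) -> bool:
    """Keep a sample only if it passes both the comment and the length filter."""
    ratio = comment_ratio(code)
    avg_len, max_len = line_stats(code)
    return (
        min_comment_ratio <= ratio <= max_comment_ratio
        and avg_len <= avg_line_length_max
        and max_len <= max_line_length
    )


sample = "# add two numbers\ndef add(a, b):\n    return a + b\n"
print(keep_sample(sample))
```

A real comment filter would need language-aware parsing (docstrings, block comments, other comment syntaxes), which this sketch deliberately leaves out.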

components/filter_comments/src/utils/text_extraction.py

components/filter_metadata/README.md

components/filter_metadata/fondant_component.yaml

components/pii_redaction/src/main.py

components/pii_redaction/src/pii_redaction.py

PhilippeMoussalli

Thanks Niels!

Left a few comments, I think there are a few things still GCP specific that need to be removed.

I'm guessing this code and these components will only work for the ML6 version of the stack dataset that we created ourselves (loading it from JSON and dumping it to Parquet under a new dataset repo). What if the user wants to work with the big stack datasets? Will these components generalize well to those?

components/filter_comments/fondant_component.yaml

components/filter_line_length/fondant_component.yaml

components/pii_redaction/build_image.sh

PhilippeMoussalli · 2023-06-01T14:12:47Z

components/pii_redaction/src/main.py

+from fondant.component import TransformComponent
+from fondant.logger import configure_logging
+
+from pii_detection import scan_pii


Suggested change

-from pii_detection import scan_pii
+from pii_detection import scan_pii, redact_pii

Can you clarify this one? I'm importing redact_pii in the line below

just a matter of personal preference but I think both are valid :)

examples/pipelines/starcoder/components/pii_redaction

PhilippeMoussalli · 2023-06-01T14:20:21Z

components/pii_redaction/src/main.py

+        )
+        result.columns = ["code_secrets", "code_has_secrets", "code_number_secrets"]
+
+        dataframe = dataframe.merge(result, left_index=True, right_index=True)


Why are we merging? Can't you extend the original dataframe with the new columns in the apply() method, since we're working with the same subset?

We are merging because Dask does not support assigning multiple columns at once.

Interesting. @RobbeSneyders, is that something we could resolve when we move to the pandas interface?

Yes, in pandas this works.
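
As an illustration of the pandas behaviour mentioned above, the sketch below assigns three new columns in one statement, with no merge. The `scan_row` function and the `code_content` column name are hypothetical stand-ins; only the three output column names are taken from the diff.

```python
import pandas as pd


def scan_row(row: pd.Series) -> tuple:
    # Hypothetical stand-in for a PII scan: treat any token containing "@" as a secret.
    secrets = [token for token in row["code_content"].split() if "@" in token]
    return str(secrets), bool(secrets), len(secrets)


df = pd.DataFrame({"code_content": ["mail me at a@b.com", "x = 1"]})

new_columns = ["code_secrets", "code_has_secrets", "code_number_secrets"]
# result_type="expand" turns each returned tuple into one column per element,
# so the result can be assigned to several new columns at once.
df[new_columns] = df.apply(scan_row, axis=1, result_type="expand")

print(df)
```

Because the result of `apply` shares the index of `df`, the assignment lines up row by row.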

PhilippeMoussalli · 2023-06-01T14:20:59Z

components/pii_redaction/src/main.py

+                replacements=replacements,
+            ),
+            axis=1,
+            meta=(None, "str"),


Why is the first argument None? Usually it's an index.

PhilippeMoussalli · 2023-06-01T14:26:40Z

components/pii_redaction/src/pii_redaction.py

+    IP addresses: replace with one of n synthetic private IP addresses (IPv4 or IPv6)
+    Keys: replace with one of n [sequence of 32 random characters/digits]
+
+    TODO: add IPv6 and IPv4 separation


Is that for us or part of the source code?

This was taken from the BigCode project

Maybe not for this project, but it would be interesting to have a way to keep components like this up to date when their source code is taken from another repo. Right now it's just a snapshot.

Would be ideal if this was a library.
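
To make the docstring in the hunk above concrete, here is a minimal sketch of the "replace with one of n synthetic values" idea. It is not the BigCode implementation: the regex, the `192.168.x.y` range, the dictionary keys, and the function names are all assumptions, and it only handles IPv4.

```python
import ipaddress
import random
import re
import string

# Loose IPv4 candidate pattern; candidates are validated with ipaddress below.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def make_replacements(n: int = 4, seed: int = 0) -> dict:
    """Build n synthetic private IPv4 addresses and n random 32-character keys."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    return {
        "IP_ADDRESS": [
            f"192.168.{rng.randint(0, 255)}.{rng.randint(1, 254)}" for _ in range(n)
        ],
        "KEY": ["".join(rng.choice(alphabet) for _ in range(32)) for _ in range(n)],
    }


def redact_ipv4(text: str, replacements: dict, rng: random.Random) -> str:
    """Replace every valid IPv4 address in text with a synthetic private one."""

    def _swap(match: re.Match) -> str:
        try:
            ipaddress.IPv4Address(match.group(0))
        except ValueError:
            return match.group(0)  # not a valid address, e.g. 999.1.1.1
        return rng.choice(replacements["IP_ADDRESS"])

    return IPV4_PATTERN.sub(_swap, text)


replacements = make_replacements(n=3, seed=1)
print(redact_ipv4("connect to 8.8.8.8 now", replacements, random.Random(1)))
```

Note that a four-part version string such as `1.2.3.4` is also a valid IPv4 address and would be redacted by this sketch; a production detector needs extra context checks.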

examples/pipelines/starcoder/components/filter_comments

examples/pipelines/starcoder/components/filter_line_length

RobbeSneyders

Thanks @NielsRogge! Left some comments

In general:

Can we add a code_ prefix to the names of these components? Or would they work out of the box on any text? I guess at least the filter_comments component wouldn't.
I would remove anything related to the run_locally script and symlinks. This will be replaced by the local runner.
This PR has some conflicts with Custom component spec #191. We'll need to make sure everything is updated after merging both.

examples/pipelines/starcoder/components/filter_comments

examples/pipelines/starcoder/run_locally.sh

components/pii_redaction/src/main.py

RobbeSneyders · 2023-06-09T08:39:32Z

components/pii_redaction/src/main.py

+        )
+        result.columns = ["code_secrets", "code_has_secrets", "code_number_secrets"]
+
+        dataframe = dataframe.merge(result, left_index=True, right_index=True)


Yes, in pandas this works.

RobbeSneyders · 2023-06-09T08:45:30Z

components/pii_redaction/src/pii_redaction.py

+    IP addresses: replace with one of n synthetic private IP addresses (IPv4 or IPv6)
+    Keys: replace with one of n [sequence of 32 random characters/digits]
+
+    TODO: add IPv6 and IPv4 separation


Would be ideal if this was a library.

examples/pipelines/starcoder/README.md

RobbeSneyders · 2023-06-09T09:45:37Z

And can you also update the readme to include these components in the list?

RobbeSneyders · 2023-06-09T11:56:27Z

#191 is merged, so we should make the necessary updates here.

NielsRogge · 2023-06-13T09:45:07Z

@RobbeSneyders I've made the updates

RobbeSneyders

Thanks @NielsRogge.

What do you think of my previous proposal:

Can we add a code_ prefix to the names of these components? Or would they work out of the box on any text? I guess at least the filter_comments component wouldn't.

And left one more comment.

components/pii_redaction/src/gibberish_data/big.txt

NielsRogge · 2023-06-14T11:47:31Z

Can we add a code_ prefix to the names of these components? Or would they work out of the box on any text? I guess at least the filter_comments component wouldn't.

The PII redaction component would work on plain text as well, but the filter comments and filter metadata components are meant to be run on code.

ChristiaensBert reviewed May 31, 2023

PhilippeMoussalli reviewed Jun 1, 2023

NielsRogge requested a review from RobbeSneyders June 9, 2023 07:38

RobbeSneyders reviewed Jun 9, 2023

NielsRogge force-pushed the add_text_components branch from c12bae4 to a5cc882 Compare June 9, 2023 13:06

RobbeSneyders reviewed Jun 13, 2023

components/pii_redaction/src/gibberish_data/big.txt (outdated)

Niels Rogge added 11 commits June 14, 2023 11:52

First draft

08810bd

Address comments

1c2a0dd

Fix precommit

edf0cdf

Remove script

c754ff1

Address more comments

322922a

More improvements

1d2984c

Update requirements

d195ba0

Remove symbolic links

20c6149

Address comments

f78dada

Update components

692a5f4

Remove big.txt

022800d

NielsRogge force-pushed the add_text_components branch from a5cc882 to 022800d Compare June 14, 2023 09:52

Merge branch 'main' into add_text_components

c3ae9ac

PhilippeMoussalli approved these changes Jun 14, 2023

PhilippeMoussalli merged commit bcdea6a into main Jun 14, 2023

PhilippeMoussalli deleted the add_text_components branch June 14, 2023 12:48


