Need help in customizing python/dolma/taggers/c4.py #173

Open
mihara-bot opened this issue Jun 19, 2024 · 0 comments

Dear authors,
I tried to implement the rule on page 57 of your Dolma paper: 'Remove documents with more than half of their line not ending in...'.
To do so, I modified lines 107 to 130 of python/dolma/taggers/c4.py as follows:

        start = count = 0
        # number of lines that do not end in ".", "?", "!" or '"'
        lines_no_ending_punc_count = 0
        for sent in text.split("\n"):
            end = start + len(sent)
            if end != len(text):
                # account for the newline
                end += 1

            # strip any leading and trailing whitespace
            sent = sent.strip()

            if not sent.endswith((".", "?", "!", '"')):
                spans.append(Span(start, end, type="lines_with_no_ending_punctuation"))
                lines_no_ending_punc_count += 1

            if len(sent.split()) < MIN_WORDS_PER_LINE:
                spans.append(Span(start, end, type="lines_with_too_few_words"))

            count += 1
            start = end

        # document-level spans; count is at least 1 because split("\n") never returns an empty list
        spans.append(Span(0, len(doc.text), type="line_count", score=count))
        spans.append(Span(0, len(doc.text), type="lines_with_no_ending_punctuation_ratio", score=lines_no_ending_punc_count / count))
        return DocResult(doc=doc, spans=spans)

However, the new 'lines_with_no_ending_punctuation_ratio' span does not seem to work: the output of the c4_v2 tagger does not contain this field.
Could you please help me with this C4 rule?
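
In case it helps to pin down the expected behavior, here is a minimal standalone sketch of the ratio I am trying to compute. It does not use dolma at all. The names `no_ending_punctuation_ratio` and `should_remove` are hypothetical helpers made up for this illustration, and the 0.5 threshold is my reading of "more than half" in the rule quoted above.

```python
from typing import Tuple

# Same terminal punctuation set as in the modified tagger code above.
END_PUNCTUATION: Tuple[str, ...] = (".", "?", "!", '"')


def no_ending_punctuation_ratio(text: str) -> float:
    """Fraction of newline-separated lines that do not end in terminal punctuation."""
    # str.split("\n") always returns at least one element, so len(lines) >= 1.
    lines = text.split("\n")
    missing = sum(
        1 for line in lines if not line.strip().endswith(END_PUNCTUATION)
    )
    return missing / len(lines)


def should_remove(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: True when more than `threshold` of the lines lack ending punctuation."""
    return no_ending_punctuation_ratio(text) > threshold


if __name__ == "__main__":
    sample = "Hello world.\nno punctuation here\nIs this ok?"
    print(no_ending_punctuation_ratio(sample))
    print(should_remove(sample))
```

This is the value I would expect to see as the score of the 'lines_with_no_ending_punctuation_ratio' span for each document.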
Many thanks! :)

Best regards,
Xinlin Zhuang
