Dear authors,
I tried to implement the rule on page 57 of your Dolma paper 'Remove documents with more than half of their line not ending in...'.
I modified lines 107 to 130 of python/dolma/taggers/c4.py as follows:
        start = count = 0
        line_no_pending_punc_count = 0
        for sent in text.split("\n"):
            end = start + len(sent)
            if end != len(text):
                # account for the newline
                end += 1

            # strip any trailing whitespace
            sent = sent.strip()

            if not sent.endswith((".", "?", "!", '"')):
                spans.append(Span(start, end, type="lines_with_no_ending_punctuation"))
                line_no_pending_punc_count += 1

            if len(sent.split()) < MIN_WORDS_PER_LINE:
                spans.append(Span(start, end, type="lines_with_too_few_words"))

            count += 1
            start = end

        spans.append(Span(0, len(doc.text), type="line_count", score=count))
        spans.append(
            Span(
                0,
                len(doc.text),
                type="lines_with_no_ending_punctuation_ratio",
                score=line_no_pending_punc_count / count,
            )
        )
        return DocResult(doc=doc, spans=spans)
However, 'lines_with_no_ending_punctuation_ratio' does not seem to work: the results of c4_v2 do not contain this field at all.
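To separate the ratio logic from the tagger pipeline, here is a minimal standalone sketch of the computation I am aiming for. It does not use Dolma at all; the names `no_ending_punctuation_ratio` and `should_remove`, and the 0.5 threshold default, are my own assumptions based on the "more than half" wording of the rule, not part of Dolma's API.

```python
from typing import Tuple

# Same terminal characters as the tuple passed to endswith() in the modified tagger.
END_PUNCTUATION: Tuple[str, ...] = (".", "?", "!", '"')


def no_ending_punctuation_ratio(text: str) -> float:
    """Fraction of newline-separated lines that do not end in END_PUNCTUATION.

    Blank lines count as lines without ending punctuation, which mirrors
    the loop in the modified tagger.
    """
    lines = text.split("\n")  # always yields at least one element, so no division by zero
    missing = sum(1 for line in lines if not line.strip().endswith(END_PUNCTUATION))
    return missing / len(lines)


def should_remove(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: True when more than `threshold` of the lines lack ending punctuation."""
    return no_ending_punctuation_ratio(text) > threshold
```

For example, `should_remove("menu\nhome\nDone.")` flags the document because two of its three lines lack ending punctuation. This logic behaves as I expect in isolation, so I suspect the problem lies in how the modified tagger is picked up rather than in the ratio itself.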
Could you please help me with this c4 rule?
Many thanks! :)
Best regards,
Xinlin Zhuang