I am trying to add the Turkish language (tr) to cutoff.csv on the rp_v1 branch.
There is little information on how the language score is calculated.
How do we add a custom language score to this CSV?
As far as I know, the CCNet pipeline does not support Turkish out of the box, but you can probably modify the pipeline to get it to support tr. We never went through that process, but to get there, I think you have to do the following steps:
Train your own reference Wikipedia model (check out the Makefile and the ccnet README for that).
Collect percentile statistics of the model's score distribution over the Common Crawl corpus.
Add those statistics to cutoff.csv.
Run ccnet with Turkish.
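To make the second and third steps more concrete, here is a minimal Python sketch of turning a sample of per-document scores into percentile rows and appending them as a new `tr` column. It assumes that cutoff.csv has one row per percentile (first column) and one column per language; please verify that against the actual file before relying on it. The helper names `percentile` and `add_language_column` and the sample values are hypothetical, and this is not part of ccnet itself.

```python
import csv
import io


def percentile(sorted_values, q):
    """Return the q-th percentile (0-100) of an ascending list, linearly interpolated."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def add_language_column(csv_text, lang, scores):
    """Append a `lang` column to a percentile-indexed CSV given as text.

    Assumed layout (an assumption, not confirmed by the ccnet docs):
    the first column holds the percentile, every other column is a language.
    """
    values = sorted(scores)
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    if lang in header:
        raise ValueError(f"{lang} is already present")
    header.append(lang)
    for row in body:
        # The first cell of each row is the percentile for that row.
        row.append(f"{percentile(values, float(row[0])):.1f}")
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows([header] + body)
    return out.getvalue()


# Hypothetical example: a tiny table with one language and made-up scores.
existing = ",en\n0,10.0\n50,20.0\n100,30.0\n"
updated = add_language_column(existing, "tr", [500, 100, 300, 200, 400])
print(updated)
```

In practice the scores would come from running your Turkish reference model over a sample of Common Crawl documents, and the output would be written back to cutoff.csv.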
I'd also recommend contacting the ccnet maintainers if you run into issues with any of these steps.