Why is CLD2 Fast and Small?

cld2 edited this page Jul 28, 2015 · 2 revisions

CLD2 detects languages in Unicode UTF-8 text by alternating between extracting runs of text from the input document and scoring that text. The first step is done by getonescriptspan() and the second by scoreonescriptspan(). Each is designed for speed and small size.
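The alternation described above can be illustrated with a small toy sketch. This is not CLD2's code (CLD2 is written in C++ and uses compact lookup tables). The helper names `script_of`, `get_script_spans`, `score_span`, and `detect`, the tiny `TOY_TABLE`, and the use of Unicode character names as a stand-in for script detection are all assumptions made for illustration only.

```python
import unicodedata
from collections import Counter

# Hypothetical per-script table: 4-letter sequence -> {language: points}.
# A real detector would use far larger, compactly encoded tables.
TOY_TABLE = {
    "LATIN": {"hell": {"en": 2}, "worl": {"en": 2}, "bonj": {"fr": 2}},
    "CYRILLIC": {"прив": {"ru": 2}},
}

def script_of(ch):
    """Rough script label taken from the Unicode character name (an approximation)."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def get_script_spans(text):
    """Step 1: yield (script, run) pairs of consecutive same-script letters."""
    script, run = None, []
    for ch in text:
        s = script_of(ch)
        if s is None:
            # Collapse any non-letters into a single word separator.
            if run and run[-1] != " ":
                run.append(" ")
            continue
        if script is not None and s != script:
            yield script, "".join(run).strip()
            run = []
        script = s
        run.append(ch.lower())
    if run and script is not None:
        yield script, "".join(run).strip()

def score_span(script, run, totals):
    """Step 2: add table points for every 4-character window in the run."""
    table = TOY_TABLE.get(script, {})
    for i in range(len(run) - 3):
        for lang, points in table.get(run[i:i + 4], {}).items():
            totals[lang] += points

def detect(text):
    """Alternate between extracting a span and scoring it."""
    totals = Counter()
    for script, run in get_script_spans(text):
        score_span(script, run, totals)
    return totals
```

Calling `detect("Hello world привет")` tallies points per language across both the Latin and the Cyrillic span. Splitting by script first means each span only consults the table for its own script, which is one plausible reason the two-step design helps keep lookups small.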

For more details concerning feature extraction and scoring, see this write-up: https://docs.google.com/document/d/1aDAVnCiFxUg5YNZM3vCckWdIdtO0WzSMmVDGzOA1GmQ/edit?pli=1
