Basic analysis of the script content of strings #764

jwiggins · 2021-03-29T15:43:05Z

This is another part of #762

There are two pieces here, and a ton of machine-generated code (don't fear the diff):

A small program which fetches http://www.unicode.org/Public/UNIDATA/Scripts.txt, parses it, and writes out the kiva.fonttools.text._data module.
A new class UnicodeAnalyzer which uses the data from Scripts.txt and returns, for a given input string, the languages (Unicode script names) and the slices they cover.
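
For readers who have not looked at Scripts.txt, the parsing step can be sketched roughly as below. This is a minimal illustration and not the generator program from this PR: `parse_scripts`, the regular expression, and the sample text are assumptions made for the example, and both fetching the file and writing out the `_data` module are left out.

```python
import re

# Data lines look like one of:
#   0000..001F    ; Common # Cc  [32] <control>..<control>
#   0020          ; Common # Zs       SPACE
# Everything after '#' is a comment and is ignored here.
_LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_scripts(text):
    """Return a sorted list of (first, last, script) tuples (hypothetical helper)."""
    ranges = []
    for line in text.splitlines():
        match = _LINE.match(line)
        if match is None:
            continue  # blank line or pure comment line
        first = int(match.group(1), 16)
        # A single code point has no '..XXXX' part, so last == first.
        last = int(match.group(2) or match.group(1), 16)
        ranges.append((first, last, match.group(3)))
    ranges.sort()
    return ranges


sample = """\
# comment line
0000..001F    ; Common # Cc  [32] <control>..<control>
0020          ; Common # Zs       SPACE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
"""
print(parse_scripts(sample))
```

Ranges are kept inclusive on both ends, matching the `first..last` notation used in the file.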

UnicodeAnalyzer is pretty basic right now. I'd like to keep it that way in this PR. For instance, emoji ligatures are not handled well, because each zero-width joiner (U+200D) is reported as its own 'Inherited' run:

In [3]: s = "👩‍👩‍👧‍👧" 

In [4]: s                                                                       
Out[4]: '👩\u200d👩\u200d👧\u200d👧'

In [5]: an.languages(s)                                                         
Out[5]: 
[(0, 1, 'Common'),
 (1, 2, 'Inherited'),
 (2, 3, 'Common'),
 (3, 4, 'Inherited'),
 (4, 5, 'Common'),
 (5, 6, 'Inherited'),
 (6, 7, 'Common')]
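
One possible follow-up, which is deliberately out of scope for this PR, would be to post-process the runs so that 'Inherited' code points take the script of whatever precedes them. The sketch below is only an assumption about how that could look; `fold_inherited` is a hypothetical helper and is not part of UnicodeAnalyzer.

```python
def fold_inherited(runs):
    """Fold 'Inherited' runs into the preceding run and join equal neighbours.

    `runs` is a list of (start, end, script) tuples like the output above.
    Hypothetical helper, not part of this PR.
    """
    merged = []
    for start, end, script in runs:
        # An 'Inherited' run borrows the script of the run before it, if any.
        if script == "Inherited" and merged:
            script = merged[-1][2]
        # Extend the previous run when it is adjacent and has the same script.
        if merged and merged[-1][2] == script and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end, script)
        else:
            merged.append((start, end, script))
    return merged


runs = [
    (0, 1, "Common"), (1, 2, "Inherited"), (2, 3, "Common"),
    (3, 4, "Inherited"), (4, 5, "Common"), (5, 6, "Inherited"),
    (6, 7, "Common"),
]
print(fold_inherited(runs))  # [(0, 7, 'Common')]
```

A leading 'Inherited' run has nothing to inherit from and is left unchanged.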

jwiggins · 2021-03-29T15:44:15Z

kiva/fonttools/_util.py

+    94: "Tai_Le",
+    95: "New_Tai_Lue",


These were changed to match the names in kiva.fonttools.text._data.

rahulporuri

LGTM

rahulporuri · 2021-03-30T08:32:22Z

kiva/fonttools/text/_unicode_lookup.py

+
+    def _lookup_codepoint(self, cp):
+        comps = self.ranges - ord(cp)
+        index = ((comps[:, 0] <= 0) == (comps[:, 1] >= 0)).argmax()


This looks like the most important detail in the PR - how we're selecting the entry given a code point - and it'd be useful if you could elaborate on how we're doing it.

Fair enough
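
To spell out the idea behind the lookup, here is a small standalone sketch. It assumes `ranges` is an N x 2 integer array of inclusive `[first, last]` code point ranges. The three-row table, `names`, and the `lookup` function are made up for the example and are not the generated data or the code in this PR.

```python
import numpy as np

# Hypothetical table: each row is an inclusive [first, last] code point range.
ranges = np.array([[0x0000, 0x0040], [0x0041, 0x005A], [0x0370, 0x0373]])
names = ["Common", "Latin", "Greek"]


def lookup(cp):
    # Subtracting the code point shifts every range so the target sits at 0.
    comps = ranges - ord(cp)
    # A row contains the code point when first <= 0 and last >= 0.
    hits = (comps[:, 0] <= 0) & (comps[:, 1] >= 0)
    if not hits.any():
        # argmax of an all-False array is 0, so a miss must be checked explicitly.
        return None
    # argmax returns the index of the first True entry.
    return names[int(hits.argmax())]


print(lookup("A"), lookup(" "), lookup("\u4e00"))
```

The diff compares the two masks with `==` rather than `&`. As long as every row satisfies first <= last, both masks can never be False at the same time, so the two forms select the same row.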

rahulporuri · 2021-03-30T09:04:51Z

still LGTM

jwiggins · 2021-03-30T09:09:19Z

Thanks for the review

Add a parser for the Unicode Scripts.txt file

c925724

rahulporuri approved these changes Mar 30, 2021

PR feedback

94fcdf1

jwiggins merged commit ffd2535 into master Mar 30, 2021

jwiggins deleted the feature/language-parse branch March 30, 2021 09:09
