add charwidth property #2

stevengj · 2014-07-16T12:55:39Z

As discussed in JuliaLang/julia#6939, the wcwidth function is broken on many operating systems. When we import the Unicode data, it would be good to add another field to our database in order to store the character width, so that we can provide an up-to-date character-width function.

The text was updated successfully, but these errors were encountered:

jiahao · 2014-07-17T23:29:04Z

It looks like Unicode does not actually maintain a definitive list of character widths. The only data I can find on this topic is on East Asian characters; my understanding of how this ought to translate into numerical character widths is summarized in this gist. Confusingly, there are wide(W)-narrow(Na) and full(F)-half(H) pairings which appear to have the same functionality; I'm not clear on the differences between the two pairs.

stevengj · 2014-07-17T23:43:43Z

See also this wcwidth implementation, which I think we use in Julia on Windows. Unfortunately, it dates to Unicode 5.0.

stevengj · 2014-07-18T15:52:59Z

And a slightly patched version of wcwidth in newlib, also quite old.

jakebolewski · 2014-07-18T16:55:13Z

Here is a BSD licesced Go version, don't know if it is completely up to date with the latest unicode standard.
https://github.com/shinichy/go-wcwidth

jakebolewski · 2014-07-18T17:01:36Z

Nevermind, I see it is just a post of Markus Kuhns version.

jiahao · 2014-07-19T15:47:15Z

@jakebolewski and I were just discussing how we could test wcwidths definitively by rendering characters in a virtual framebuffer using fixed width fonts and checking if the rendered size fits within the size allotted by wcwidth=0,1,2. One nice thing is that we won't have worry much about of the Unihan characters since they have character widths defined by UAX 11.

The challenge here is finding a useable font; there aren't that many monospaced fonts with decent Unicode support. I personally use Consolas and find its support for the BMP to be pretty good. But GNU Unifont 7.0 seems like the one for this task, should anyone be foolhardy enough to try.

jiahao · 2014-07-19T15:53:27Z

Or it may be sufficient to simply compute bounding boxes from Unifont's font table bitmap.

jiahao · 2014-07-19T16:07:53Z

Or process the Unicode archived code charts directly. (93 mb pdf)

stevengj · 2014-07-19T19:02:59Z

Note that GNU libunistring also provides character width functions (LGPL). I'm not sure how they are implemented or how up-to-date they are, but at the very least they provide a useful baseline to compare to.

jiahao · 2014-07-28T21:09:49Z

Here is one chunk of code to address a prerequisite part of the problem, which is how the output of isprint (which currently uses libc.iswprint()) maps onto Unicode character properties. A nonprintable character must return a wcwidth of -1, when then gets floored to 0 in charwidth.

This code snippet classifies code points in the BMP (from 0 to 0xffff) as to

whether they are printable according to isprint, and
their Unicode category.

general_category_abbr=[
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
    "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
    "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn"
]

categories=Dict()
for c in 0:65535
    catcode = Base.UTF8proc.category_code(int32(c))
    #catcode=unsafe_load(ccall((:utf8proc_get_property,:libmojibake), Ptr{Uint16}, (Int32,), c))
    isprintable = isprint(char(c))
    categories[(catcode, isprintable)]=push!(get(categories,
        (catcode, isprintable), {}), c)
end

abbr(catcode)= catcode==0 ? "00" : general_category_abbr[catcode]

for catcode in 0:30, val in [false, true]
    println(abbr(catcode), "\t", val, "\t", length(get(categories, (catcode, val), [])))
end

Note that I had to accommodate a possible return value of 0 from the library call in UTF8proc.category_code() which is not a valid category code, but probably refer to an unknown category.

The results for the BMP, when summarized by character category, are:

Cat	iswprint	OSX 10.9.4		Ubuntu 14.04.1
Cat	iswprint	libutf8proc	libmojibake	libutf8proc	libmojibake
0	FALSE	4991	2032	3718	2032
0	TRUE	17	0	1290	0
Lu	FALSE	129	267	0	37
Lu	TRUE	707	707	836	937
Ll	FALSE	278	457	0	75
Ll	TRUE	824	820	1102	1202
Lt	FALSE	0	0	0	0
Lt	TRUE	31	31	31	31
Lm	FALSE	106	166	0	32
Lm	TRUE	61	65	167	199
Lo	FALSE	626	1999	0	776
Lo	TRUE	44055	44074	44681	45297
Mn	FALSE	86	468	0	231
Mn	TRUE	516	516	602	753
Mc	FALSE	52	161	0	49
Mc	TRUE	115	117	167	229
Me	FALSE	0	4	0	1
Me	TRUE	10	9	10	12
Nd	FALSE	41	181	0	60
Nd	TRUE	189	189	230	310
Nl	FALSE	0	14	0	10
Nl	TRUE	51	51	51	55
No	FALSE	12	50	0	25
No	TRUE	240	240	252	265
Pc	FALSE	1	1	0	0
Pc	TRUE	9	9	10	10
Pd	FALSE	1	6	0	4
Pd	TRUE	17	18	18	20
Ps	FALSE	3	10	0	1
Ps	TRUE	63	65	66	74
Pe	FALSE	3	9	0	0
Pe	TRUE	62	64	65	73
Pi	FALSE	5	6	0	0
Pi	TRUE	6	6	11	12
Pf	FALSE	5	6	0	0
Pf	TRUE	4	4	9	10
Po	FALSE	64	191	0	91
Po	TRUE	196	198	260	298
Sm	FALSE	15	50	0	4
Sm	TRUE	889	886	904	932
Sc	FALSE	7	18	0	11
Sc	TRUE	34	34	41	41
Sk	FALSE	42	61	0	17
Sk	TRUE	57	55	99	99
So	FALSE	279	581	0	255
So	TRUE	2071	2068	2350	2394
Zs	FALSE	1	0	0	0
Zs	TRUE	17	17	18	17
Zl	FALSE	0	0	1	1
Zl	TRUE	1	1	0	0
Zp	FALSE	0	0	1	1
Zp	TRUE	1	1	0	0
Cc	FALSE	65	65	65	65
Cc	TRUE	0	0	0	0
Cf	FALSE	29	38	0	7
Cf	TRUE	4	2	33	33
Cs	FALSE	0	0	2048	2048
Cs	TRUE	2048	2048	0	0
Co	FALSE	0	0	0	0
Co	TRUE	6400	6400	6400	6400
Cn	FALSE	0	0	0	0
Cn	TRUE	0	0	0	0

I naively thought that everything was printable except for the category C characters, but this turns out not to be true...?

jiahao · 2014-07-29T03:11:18Z

@JeffBezanson and I briefly discussed what it meant to be a printable character but I don't think we really came to a firm answer.

Instead, I'm going to break it down by type of code point (Unicode 6.3 standard pdf; Ch 2, Table 2-3), which makes the decision much clearer for me:

Basic Type	Brief Description	General Category	printable on OSX 10.9.4?	printable on Ubuntu 14.04.01?	Should be printable?
Graphic	Letter, mark, number, punctuation, symbol, and spaces	L, M, N, P, S, Zs	Mostly Yes	Mostly Yes	Yes (there is something to draw)
Format	Invisible but affects neighboring characters	Cf	Mostly No	Mostly Yes	Yes (affects display of other characters)
"	"	Zl, Zp	Yes	No	Yes (affects display of other characters)
Control	Usage defined by protocols or standards outside the Unicode Standard	Cc	No	No	No (extends ASCII notion of nonprintable character)
Private-use	Usage defined by private agreement outside the Unicode Standard	Co	Yes	Yes	Yes? (undefined by standard but everyone says Yes?)
Surrogate	Permanently reserved for UTF-16; restricted interchange	Cs	Yes	No	No (not a character, so it can't be a printable character)
Noncharacter/Reserved	Permanently reserved for internal usage or future assignment; restricted interchange	Cn	N/A	N/A	No (not a character, so it can't be a printable character)

I think there is no question that all Graphic code points should be printable and all Control/Surrogate/Noncharacter/Reserved code points should not be printable, from their very definitions. Format code points are debatable, and Private-use code points I don't think are decidable.

jiahao · 2014-07-29T03:18:52Z

Based on this, I'll propose that a printable character is any code point that is not in the Cc, Cn or Cs general categories. These are the code points for which wcwidth should return -1 and have charwidth 0.

I don't think it's necessary to refine the definition of 'printable character' beyond the Unicode General Category.

jiahao · 2014-07-29T05:27:17Z

Here is another snippet of code that tries to compute character widths (and wcwidths) from the Unifont font bitmap. For each 16x16 plot in the bitmap, the code computes the left and right bounds of a printable character and returns the number of pixels over 8.

(cc @timholy: this is what I've been abusing Images for in the past few days)

using Images

#Download BMP of BMP
filename="unifont-7.0.03.bmp"
isfile(filename) || download("http://unifoundry.com/pub/unifont-7.0.03/unifont-7.0.03.bmp", filename)
unitable=imread("unifont-7.0.03.bmp", Images.ImageMagick)

#Check for printable character
catcode(c::Union(Char,Integer))=Base.UTF8proc.category_code(int32(c))
isprintable(c::Union(Char,Integer)) = c ≤ 0x10ffff && isprintable_category(catcode(c))
function isprintable_category(category::Integer)
    !( category==Base.UTF8proc.UTF8PROC_CATEGORY_CN #Unassigned 
    || category==Base.UTF8proc.UTF8PROC_CATEGORY_CS #Surrogate
    || category==Base.UTF8proc.UTF8PROC_CATEGORY_CC #Control
    )
end

#Compute left and right bounds
function wcwidthbmp(codepoint::Integer)
    isprintable(codepoint) || return -1
    col=(codepoint & 0xff)
    row=(codepoint >> 8)
    charbmp=unitable.data[32+col*16+(1:16),64+row*16+(1:16)]'
    l=1
    for j=1:16
        any(charbmp[:,j] .== 0) && break
        l += 1
    end
    r=16
    for j=16:-1:l
        any(charbmp[:,j] .== 0) && break
        r -= 1
    end
    (r-l)//8
end

#Box it up
Boxes=Dict()
for c in 0x0000:0xffff
    wcwidth_bmp = iceil(wcwidthbmp(c))
    wcwidth_sys = int(ccall(:wcwidth, Int32, (Uint32,), c))
    coord = (wcwidth_bmp, wcwidth_sys)
    Boxes[coord]=push!(get(Boxes, coord, {}), c)
end

#Draw table in Github-flavored Markdown
k1min, k1max=extrema([k[1] for (k,v) in Boxes])
k2min, k2max=extrema([k[2] for (k,v) in Boxes])
println("wcwidth (system) -->")
print("\t |\t")
for k2=k2min:k2max
    print(k2," |\t")
end
println()
println(" ------- |"^(k2max-k2min+1)*" --------")
for k1=k1min:k1max
    print("__",k1, "__\t |\t")
    for k2=k2min:k2max
        k1==k2 && print("__")
        print(length(get(Boxes, (k1,k2), [])))
        k1==k2 && print("__")
        print(" | \t")
    end
    println()
end

Results (OSX 10.9.4, libutf8proc) - wcwidth on columns, derived wcwidth on row

	-1	0	1	2
-1	64	1	2048	0
0	51	5	67	8
1	2857	149	6119	1026
2	3868	18	8499	40756

JeffBezanson · 2014-07-29T15:22:37Z

That's pretty cool.
There are probably several characters whose glyphs are quite small, but that are technically 2 cells wide. For example fullwidth punctuation.

jiahao · 2014-07-29T15:39:46Z

Oh, right. And spaces will have width 0 by this computation.

Take two - parse the hmtx table from a TTF to read out the advance width of defined glyphs.

jiahao · 2014-07-29T15:43:47Z

Sadly I think I'll have to find a different font to read, since computing the charwidths from the font file directly is probably a GPL-derivative.

jiahao · 2014-07-29T15:45:41Z

Another item to throw into the mix - the Unicode private use region (category Co) appears to have a consensus of usage; there is a ConScript Unicode Registry
that coordinates use of this space.

StefanKarpinski · 2014-07-29T15:47:30Z

Sadly I think I'll have to find a different font to read, since computing the charwidths from the font file directly is probably a GPL-derivative.

I find this highly unlikely. Data about copyrighted things is not copyrighted.

jiahao · 2014-07-29T17:27:52Z

Ok, that makes it a lot easier.

So I opened Unifont in Fontforge and saved it in Fontforge's SFD format, which is plaintext and quite easily parsed.

#Read sfdfile for character widths
function parsesfd(filename::String)
    CharWidths=Dict{Int,Int}()
    state=:seekchar
    for line in readlines(open(filename))
        if state==:seekchar         #StartChar: nonmarkingreturn
            if contains(line, "StartChar: ")
                codepoint = nothing
                width = nothing
                state = :readdata
            end
        elseif state==:readdata #Encoding: 65538 -1 2, Width: 1024
            contains(line, "Encoding:") && (codepoint = int(split(line)[3]))
            contains(line, "Width:") && (width = int(split(line)[2]))
            if codepoint!=nothing && width!=nothing
                CharWidths[codepoint]=width
                state = :seekchar
            end
        end
    end
    CharWidths
end

@time CharWidths=parsesfd("UnifontMedium.sfd")
println("Number of character widths read: ", length(CharWidths))

#Classify characters
Boxes=Dict()
for c in 0x0000:0xffff
    haskey(CharWidths,c) || continue
    idx = (CharWidths[c]÷512, charwidth(char(c)))
    Boxes[idx] = push!(get(Boxes, idx, {}), c)
end

#Output table in GFM format
print("i \\ j")
for j=0:2
    print("\t | ", j )
end
println("\n"*"------- | "^3 * "-------")
for i=0:2
    print("__", i, "__")
    for j=0:2
        print("\t | ")
        i==j && print("__")
        print(length(get(Boxes, (i,j), {})))
        i==j && print("__")
    end
    println()
end

The output of this program on my local MBP is

i \ j	0	1	2
0	805	443	6
1	1535	5012	5
2	4670	2830	41779

where the rows are the computed charwidths from the font advance widths and the columns are the system output from wcwidth.

A few of the discrepancies are small enough to inspect:

general_category_abbr=[
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
    "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
    "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn"
];
abbr(catcode)= catcode==0 ? "00" : general_category_abbr[catcode]
catabbr(c)=abbr(Base.UTF8proc.category_code(int32(c)))
map(_->(_, char(_), catabbr(_)), Boxes[0, 2])

6-element Array{Any,1}:
 (0x302a,'〪',"Mn")
 (0x302b,'〫',"Mn")
 (0x302c,'〬',"Mn")
 (0x302d,'〭',"Mn")
 (0x302e,'〮',"Mn")
 (0x302f,'〯',"Mn")

These are all Chinese punctuation characters that are combining marks.

map(_->(_, char(_), catabbr(_)), Boxes[1, 2])

5-element Array{Any,1}:
 (0x2329,'〈',"Ps")
 (0x232a,'〉',"Pe")
 (0xfe33,'︳',"Pc")
 (0xfe34,'︴',"Pc")
 (0xff3d,'］',"Pe")

stevengj · 2014-07-30T20:33:34Z

@jiahao, as I understand it, Conscript is only about use of Co for constructed languages; there could be other people using these codepoints for other things. Anyway, I don't think we should worry about this for now, Klingon and Elvish notwithstanding.

stevengj · 2014-07-30T20:36:23Z

@jiahao, which system's wcwidth are you comparing to? Might be sensible to try a couple of different ones (e.g. glibc's, the portable wcwidth that we ship, which is based on Unicode 5).

jiahao · 2014-07-30T20:44:26Z

This is on my MBP running OSX 10.9.4. I'll look into producing a script that can be run on other systems.

jiahao · 2014-07-31T05:18:33Z

So... I finally have a candidate.

This IJulia notebook spits out an implementation of wcwidth that gives answers that I'm happy with, along with explanations of why. As you can see, it is essentially one giant switch statement and can probably be improved. Suggestions welcome.

Final discrepancy table:

font \ sys	-1	0	1	2
-1	866893	1	2604	0
0	833	89	448	0
1	2803	111	143522	0
2	8078	2	3685	85043

stevengj · 2014-07-31T12:35:00Z

Probably you want a binary tree of if statements.

ivarne · 2014-07-31T12:49:13Z

A weighted binary tree based on the relative occurrence of the different codepoints?

quinnj · 2014-07-31T12:54:36Z

I have this code example of a macro/function pair that build a binary tree of if statements from an input array to branch on and another input array of return values.

stevengj · 2014-07-31T17:16:40Z

I wonder whether it would be more conservative to assume Cn is printable and width 1, in case someone is using ConScript or similar.

jiahao · 2014-07-31T17:24:48Z

ConScript only concerns itself with Co (private use); I don't think it touches Cn.

stevengj · 2014-10-30T16:24:55Z

Would be great to get this merged in some form (probably after converting to some sort of binary tree).

stevengj · 2014-12-17T20:36:41Z

An even simpler implementation would be to just add a 2-bit width field to utf8proc_property_t and add an accessor function to fetch it. (We have recently broken binary compatibility with the grapheme functionality, so now would be the time to add this.)

The main question is whether Jan is interested in incorporating this upstream.

StefanKarpinski · 2014-12-17T20:43:53Z

Has anything been incorporated upstream so far? Just curious how realistic upstreaming changes really is.

stevengj · 2014-12-17T20:46:20Z

They've incorporated our Unicode-7 patches (though not in a release yet) and posted a public mercurial repository. I haven't heard back yet about the grapheme updates.

StefanKarpinski · 2014-12-17T22:40:39Z

Ah, cool. That's quite positive sounding.

nalimilan · 2015-02-21T10:19:56Z

Any news on the upstream inclusion of mojibake patches? I'd just like to make sure it will be ready when Julia 0.4 is released.

stevengj · 2015-02-21T14:06:56Z

I haven't heard back; just pinged them again.

stevengj · 2015-03-06T19:22:43Z

Now that we are the official utf8proc maintainers, I think we should go ahead and add a 2-bit width field to utf8proc_property_t and populate it via jiahao's algorithm.

StefanKarpinski · 2015-03-06T19:25:17Z

Yes!

stevengj added the enhancement label Jul 17, 2014

jiahao mentioned this issue Aug 29, 2014

Julia doesn't like Pizza JuliaLang/julia#3721

Closed

jiahao mentioned this issue Nov 21, 2014

Test failure on OSX JuliaLang/julia#9071

Closed

This was referenced Mar 6, 2015

cmake script #23

Closed

shared-library versioning #24

Closed

stevengj added a commit that referenced this issue Mar 8, 2015

fix #2: add charwidth function

63ba255

stevengj mentioned this issue Mar 8, 2015

WIP: charwidth function #27

Merged

stevengj added a commit that referenced this issue Mar 8, 2015

fix #2: add charwidth function

9896839

stevengj added a commit that referenced this issue Mar 8, 2015

fix #2: add charwidth function

1d07802

stevengj added a commit that referenced this issue Mar 9, 2015

fix #2: add charwidth function

2a40628

stevengj added a commit that referenced this issue Mar 9, 2015

fix #2: add charwidth function

1167448

stevengj added a commit that referenced this issue Mar 9, 2015

fix #2: add charwidth function

f6d46cf

stevengj added a commit that referenced this issue Mar 11, 2015

fix #2: add charwidth function

b429b69

stevengj mentioned this issue Mar 11, 2015

wrong width for U+00AD jquast/wcwidth#8

Open

stevengj added a commit that referenced this issue Mar 12, 2015

fix #2: add charwidth function

bf58768

stevengj added a commit that referenced this issue Mar 12, 2015

fix #2: add charwidth function

a4c84d2

stevengj closed this as completed in #27 Mar 12, 2015

add charwidth property #2

add charwidth property #2

Comments

stevengj commented Jul 16, 2014

jiahao commented Jul 17, 2014

stevengj commented Jul 17, 2014

stevengj commented Jul 18, 2014

jakebolewski commented Jul 18, 2014

jakebolewski commented Jul 18, 2014

jiahao commented Jul 19, 2014

jiahao commented Jul 19, 2014

jiahao commented Jul 19, 2014

stevengj commented Jul 19, 2014

jiahao commented Jul 28, 2014

jiahao commented Jul 29, 2014

jiahao commented Jul 29, 2014

jiahao commented Jul 29, 2014

JeffBezanson commented Jul 29, 2014

jiahao commented Jul 29, 2014

jiahao commented Jul 29, 2014

jiahao commented Jul 29, 2014

StefanKarpinski commented Jul 29, 2014

jiahao commented Jul 29, 2014

stevengj commented Jul 30, 2014

stevengj commented Jul 30, 2014

jiahao commented Jul 30, 2014

jiahao commented Jul 31, 2014

stevengj commented Jul 31, 2014

ivarne commented Jul 31, 2014

quinnj commented Jul 31, 2014

stevengj commented Jul 31, 2014

jiahao commented Jul 31, 2014

stevengj commented Oct 30, 2014

stevengj commented Dec 17, 2014

StefanKarpinski commented Dec 17, 2014

stevengj commented Dec 17, 2014

StefanKarpinski commented Dec 17, 2014

nalimilan commented Feb 21, 2015

stevengj commented Feb 21, 2015

stevengj commented Mar 6, 2015

StefanKarpinski commented Mar 6, 2015