Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add charwidth property #2

Closed
stevengj opened this issue Jul 16, 2014 · 37 comments · Fixed by #27
Closed

add charwidth property #2

stevengj opened this issue Jul 16, 2014 · 37 comments · Fixed by #27

Comments

@stevengj
Copy link
Member

As discussed in JuliaLang/julia#6939, the wcwidth function is broken on many operating systems. When we import the Unicode data, it would be good to add another field to our database in order to store the character width, so that we can provide an up-to-date character-width function.

@jiahao
Copy link
Collaborator

jiahao commented Jul 17, 2014

It looks like Unicode does not actually maintain a definitive list of character widths. The only data I can find on this topic is on East Asian characters; my understanding of how this ought to translate into numerical character widths is summarized in this gist. Confusingly, there are wide(W)-narrow(Na) and full(F)-half(H) pairings which appear to have the same functionality; I'm not clear on the differences between the two pairs.

@stevengj
Copy link
Member Author

See also this wcwidth implementation, which I think we use in Julia on Windows. Unfortunately, it dates to Unicode 5.0.

@stevengj
Copy link
Member Author

And a slightly patched version of wcwidth in newlib, also quite old.

@jakebolewski
Copy link

Here is a BSD licesced Go version, don't know if it is completely up to date with the latest unicode standard.
https://github.com/shinichy/go-wcwidth

@jakebolewski
Copy link

Nevermind, I see it is just a post of Markus Kuhns version.

@jiahao
Copy link
Collaborator

jiahao commented Jul 19, 2014

@jakebolewski and I were just discussing how we could test wcwidths definitively by rendering characters in a virtual framebuffer using fixed width fonts and checking if the rendered size fits within the size allotted by wcwidth=0,1,2. One nice thing is that we won't have worry much about of the Unihan characters since they have character widths defined by UAX 11.

The challenge here is finding a useable font; there aren't that many monospaced fonts with decent Unicode support. I personally use Consolas and find its support for the BMP to be pretty good. But GNU Unifont 7.0 seems like the one for this task, should anyone be foolhardy enough to try.

@jiahao
Copy link
Collaborator

jiahao commented Jul 19, 2014

Or it may be sufficient to simply compute bounding boxes from Unifont's font table bitmap.

@jiahao
Copy link
Collaborator

jiahao commented Jul 19, 2014

Or process the Unicode archived code charts directly. (93 mb pdf)

@stevengj
Copy link
Member Author

Note that GNU libunistring also provides character width functions (LGPL). I'm not sure how they are implemented or how up-to-date they are, but at the very least they provide a useful baseline to compare to.

@jiahao
Copy link
Collaborator

jiahao commented Jul 28, 2014

Here is one chunk of code to address a prerequisite part of the problem, which is how the output of isprint (which currently uses libc.iswprint()) maps onto Unicode character properties. A nonprintable character must return a wcwidth of -1, when then gets floored to 0 in charwidth.

This code snippet classifies code points in the BMP (from 0 to 0xffff) as to

  • whether they are printable according to isprint, and
  • their Unicode category.
general_category_abbr=[
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
    "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
    "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn"
]

categories=Dict()
for c in 0:65535
    catcode = Base.UTF8proc.category_code(int32(c))
    #catcode=unsafe_load(ccall((:utf8proc_get_property,:libmojibake), Ptr{Uint16}, (Int32,), c))
    isprintable = isprint(char(c))
    categories[(catcode, isprintable)]=push!(get(categories,
        (catcode, isprintable), {}), c)
end

abbr(catcode)= catcode==0 ? "00" : general_category_abbr[catcode]

for catcode in 0:30, val in [false, true]
    println(abbr(catcode), "\t", val, "\t", length(get(categories, (catcode, val), [])))
end

Note that I had to accommodate a possible return value of 0 from the library call in UTF8proc.category_code() which is not a valid category code, but probably refer to an unknown category.

The results for the BMP, when summarized by character category, are:

Cat iswprint OSX 10.9.4 Ubuntu 14.04.1
libutf8proc libmojibake libutf8proc libmojibake
0 FALSE 4991 2032 3718 2032
0 TRUE 17 0 1290 0
Lu FALSE 129 267 0 37
Lu TRUE 707 707 836 937
Ll FALSE 278 457 0 75
Ll TRUE 824 820 1102 1202
Lt FALSE 0 0 0 0
Lt TRUE 31 31 31 31
Lm FALSE 106 166 0 32
Lm TRUE 61 65 167 199
Lo FALSE 626 1999 0 776
Lo TRUE 44055 44074 44681 45297
Mn FALSE 86 468 0 231
Mn TRUE 516 516 602 753
Mc FALSE 52 161 0 49
Mc TRUE 115 117 167 229
Me FALSE 0 4 0 1
Me TRUE 10 9 10 12
Nd FALSE 41 181 0 60
Nd TRUE 189 189 230 310
Nl FALSE 0 14 0 10
Nl TRUE 51 51 51 55
No FALSE 12 50 0 25
No TRUE 240 240 252 265
Pc FALSE 1 1 0 0
Pc TRUE 9 9 10 10
Pd FALSE 1 6 0 4
Pd TRUE 17 18 18 20
Ps FALSE 3 10 0 1
Ps TRUE 63 65 66 74
Pe FALSE 3 9 0 0
Pe TRUE 62 64 65 73
Pi FALSE 5 6 0 0
Pi TRUE 6 6 11 12
Pf FALSE 5 6 0 0
Pf TRUE 4 4 9 10
Po FALSE 64 191 0 91
Po TRUE 196 198 260 298
Sm FALSE 15 50 0 4
Sm TRUE 889 886 904 932
Sc FALSE 7 18 0 11
Sc TRUE 34 34 41 41
Sk FALSE 42 61 0 17
Sk TRUE 57 55 99 99
So FALSE 279 581 0 255
So TRUE 2071 2068 2350 2394
Zs FALSE 1 0 0 0
Zs TRUE 17 17 18 17
Zl FALSE 0 0 1 1
Zl TRUE 1 1 0 0
Zp FALSE 0 0 1 1
Zp TRUE 1 1 0 0
Cc FALSE 65 65 65 65
Cc TRUE 0 0 0 0
Cf FALSE 29 38 0 7
Cf TRUE 4 2 33 33
Cs FALSE 0 0 2048 2048
Cs TRUE 2048 2048 0 0
Co FALSE 0 0 0 0
Co TRUE 6400 6400 6400 6400
Cn FALSE 0 0 0 0
Cn TRUE 0 0 0 0

I naively thought that everything was printable except for the category C characters, but this turns out not to be true...?

@jiahao
Copy link
Collaborator

jiahao commented Jul 29, 2014

@JeffBezanson and I briefly discussed what it meant to be a printable character but I don't think we really came to a firm answer.

Instead, I'm going to break it down by type of code point (Unicode 6.3 standard pdf; Ch 2, Table 2-3), which makes the decision much clearer for me:

Basic Type Brief Description General Category printable on OSX 10.9.4? printable on Ubuntu 14.04.01? Should be printable?
Graphic Letter, mark, number, punctuation, symbol, and spaces L, M, N, P, S, Zs Mostly Yes Mostly Yes Yes (there is something to draw)
Format Invisible but affects neighboring characters Cf Mostly No Mostly Yes Yes (affects display of other characters)
" " Zl, Zp Yes No Yes (affects display of other characters)
Control Usage defined by protocols or standards outside the Unicode Standard Cc No No No (extends ASCII notion of nonprintable character)
Private-use Usage defined by private agreement outside the Unicode Standard Co Yes Yes Yes? (undefined by standard but everyone says Yes?)
Surrogate Permanently reserved for UTF-16; restricted interchange Cs Yes No No (not a character, so it can't be a printable character)
Noncharacter/Reserved Permanently reserved for internal usage or future assignment; restricted interchange Cn N/A N/A No (not a character, so it can't be a printable character)

I think there is no question that all Graphic code points should be printable and all Control/Surrogate/Noncharacter/Reserved code points should not be printable, from their very definitions. Format code points are debatable, and Private-use code points I don't think are decidable.

@jiahao
Copy link
Collaborator

jiahao commented Jul 29, 2014

Based on this, I'll propose that a printable character is any code point that is not in the Cc, Cn or Cs general categories. These are the code points for which wcwidth should return -1 and have charwidth 0.

I don't think it's necessary to refine the definition of 'printable character' beyond the Unicode General Category.

@jiahao
Copy link
Collaborator

jiahao commented Jul 29, 2014

Here is another snippet of code that tries to compute character widths (and wcwidths) from the Unifont font bitmap. For each 16x16 plot in the bitmap, the code computes the left and right bounds of a printable character and returns the number of pixels over 8.

(cc @timholy: this is what I've been abusing Images for in the past few days)

using Images

#Download BMP of BMP
filename="unifont-7.0.03.bmp"
isfile(filename) || download("http://unifoundry.com/pub/unifont-7.0.03/unifont-7.0.03.bmp", filename)
unitable=imread("unifont-7.0.03.bmp", Images.ImageMagick)

#Check for printable character
catcode(c::Union(Char,Integer))=Base.UTF8proc.category_code(int32(c))
isprintable(c::Union(Char,Integer)) = c  0x10ffff && isprintable_category(catcode(c))
function isprintable_category(category::Integer)
    !( category==Base.UTF8proc.UTF8PROC_CATEGORY_CN #Unassigned 
    || category==Base.UTF8proc.UTF8PROC_CATEGORY_CS #Surrogate
    || category==Base.UTF8proc.UTF8PROC_CATEGORY_CC #Control
    )
end

#Compute left and right bounds
function wcwidthbmp(codepoint::Integer)
    isprintable(codepoint) || return -1
    col=(codepoint & 0xff)
    row=(codepoint >> 8)
    charbmp=unitable.data[32+col*16+(1:16),64+row*16+(1:16)]'
    l=1
    for j=1:16
        any(charbmp[:,j] .== 0) && break
        l += 1
    end
    r=16
    for j=16:-1:l
        any(charbmp[:,j] .== 0) && break
        r -= 1
    end
    (r-l)//8
end

#Box it up
Boxes=Dict()
for c in 0x0000:0xffff
    wcwidth_bmp = iceil(wcwidthbmp(c))
    wcwidth_sys = int(ccall(:wcwidth, Int32, (Uint32,), c))
    coord = (wcwidth_bmp, wcwidth_sys)
    Boxes[coord]=push!(get(Boxes, coord, {}), c)
end

#Draw table in Github-flavored Markdown
k1min, k1max=extrema([k[1] for (k,v) in Boxes])
k2min, k2max=extrema([k[2] for (k,v) in Boxes])
println("wcwidth (system) -->")
print("\t |\t")
for k2=k2min:k2max
    print(k2," |\t")
end
println()
println(" ------- |"^(k2max-k2min+1)*" --------")
for k1=k1min:k1max
    print("__",k1, "__\t |\t")
    for k2=k2min:k2max
        k1==k2 && print("__")
        print(length(get(Boxes, (k1,k2), [])))
        k1==k2 && print("__")
        print(" | \t")
    end
    println()
end

Results (OSX 10.9.4, libutf8proc) - wcwidth on columns, derived wcwidth on row

-1 0 1 2
-1 64 1 2048 0
0 51 5 67 8
1 2857 149 6119 1026
2 3868 18 8499 40756

@JeffBezanson
Copy link
Member

That's pretty cool.
There are probably several characters whose glyphs are quite small, but that are technically 2 cells wide. For example fullwidth punctuation.

@jiahao
Copy link
Collaborator

jiahao commented Jul 29, 2014

Oh, right. And spaces will have width 0 by this computation.

Take two - parse the hmtx table from a TTF to read out the advance width of defined glyphs.

@jiahao
Copy link
Collaborator

jiahao commented Jul 29, 2014

Sadly I think I'll have to find a different font to read, since computing the charwidths from the font file directly is probably a GPL-derivative.

@jiahao
Copy link
Collaborator

jiahao commented Jul 29, 2014

Another item to throw into the mix - the Unicode private use region (category Co) appears to have a consensus of usage; there is a ConScript Unicode Registry
that coordinates use of this space.

@StefanKarpinski
Copy link
Member

Sadly I think I'll have to find a different font to read, since computing the charwidths from the font file directly is probably a GPL-derivative.

I find this highly unlikely. Data about copyrighted things is not copyrighted.

@jiahao
Copy link
Collaborator

jiahao commented Jul 29, 2014

Ok, that makes it a lot easier.

So I opened Unifont in Fontforge and saved it in Fontforge's SFD format, which is plaintext and quite easily parsed.

#Read sfdfile for character widths
function parsesfd(filename::String)
    CharWidths=Dict{Int,Int}()
    state=:seekchar
    for line in readlines(open(filename))
        if state==:seekchar         #StartChar: nonmarkingreturn
            if contains(line, "StartChar: ")
                codepoint = nothing
                width = nothing
                state = :readdata
            end
        elseif state==:readdata #Encoding: 65538 -1 2, Width: 1024
            contains(line, "Encoding:") && (codepoint = int(split(line)[3]))
            contains(line, "Width:") && (width = int(split(line)[2]))
            if codepoint!=nothing && width!=nothing
                CharWidths[codepoint]=width
                state = :seekchar
            end
        end
    end
    CharWidths
end

@time CharWidths=parsesfd("UnifontMedium.sfd")
println("Number of character widths read: ", length(CharWidths))

#Classify characters
Boxes=Dict()
for c in 0x0000:0xffff
    haskey(CharWidths,c) || continue
    idx = (CharWidths[c]÷512, charwidth(char(c)))
    Boxes[idx] = push!(get(Boxes, idx, {}), c)
end

#Output table in GFM format
print("i \\ j")
for j=0:2
    print("\t | ", j )
end
println("\n"*"------- | "^3 * "-------")
for i=0:2
    print("__", i, "__")
    for j=0:2
        print("\t | ")
        i==j && print("__")
        print(length(get(Boxes, (i,j), {})))
        i==j && print("__")
    end
    println()
end

The output of this program on my local MBP is

i \ j 0 1 2
0 805 443 6
1 1535 5012 5
2 4670 2830 41779

where the rows are the computed charwidths from the font advance widths and the columns are the system output from wcwidth.

A few of the discrepancies are small enough to inspect:

general_category_abbr=[
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
    "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
    "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn"
];
abbr(catcode)= catcode==0 ? "00" : general_category_abbr[catcode]
catabbr(c)=abbr(Base.UTF8proc.category_code(int32(c)))
map(_->(_, char(_), catabbr(_)), Boxes[0, 2])
6-element Array{Any,1}:
 (0x302a,'〪',"Mn")
 (0x302b,'〫',"Mn")
 (0x302c,'〬',"Mn")
 (0x302d,'〭',"Mn")
 (0x302e,'〮',"Mn")
 (0x302f,'〯',"Mn")

These are all Chinese punctuation characters that are combining marks.

map(_->(_, char(_), catabbr(_)), Boxes[1, 2])
5-element Array{Any,1}:
 (0x2329,'〈',"Ps")
 (0x232a,'〉',"Pe")
 (0xfe33,'︳',"Pc")
 (0xfe34,'︴',"Pc")
 (0xff3d,']',"Pe")

@stevengj
Copy link
Member Author

@jiahao, as I understand it, Conscript is only about use of Co for constructed languages; there could be other people using these codepoints for other things. Anyway, I don't think we should worry about this for now, Klingon and Elvish notwithstanding.

@stevengj
Copy link
Member Author

@jiahao, which system's wcwidth are you comparing to? Might be sensible to try a couple of different ones (e.g. glibc's, the portable wcwidth that we ship, which is based on Unicode 5).

@jiahao
Copy link
Collaborator

jiahao commented Jul 30, 2014

This is on my MBP running OSX 10.9.4. I'll look into producing a script that can be run on other systems.

@jiahao
Copy link
Collaborator

jiahao commented Jul 31, 2014

So... I finally have a candidate.

This IJulia notebook spits out an implementation of wcwidth that gives answers that I'm happy with, along with explanations of why. As you can see, it is essentially one giant switch statement and can probably be improved. Suggestions welcome.

Final discrepancy table:

font \ sys -1 0 1 2
-1 866893 1 2604 0
0 833 89 448 0
1 2803 111 143522 0
2 8078 2 3685 85043

@stevengj
Copy link
Member Author

Probably you want a binary tree of if statements.

@ivarne
Copy link

ivarne commented Jul 31, 2014

A weighted binary tree based on the relative occurrence of the different codepoints?

@quinnj
Copy link
Member

quinnj commented Jul 31, 2014

I have this code example of a macro/function pair that build a binary tree of if statements from an input array to branch on and another input array of return values.

@stevengj
Copy link
Member Author

I wonder whether it would be more conservative to assume Cn is printable and width 1, in case someone is using ConScript or similar.

@jiahao
Copy link
Collaborator

jiahao commented Jul 31, 2014

ConScript only concerns itself with Co (private use); I don't think it touches Cn.

@stevengj
Copy link
Member Author

Would be great to get this merged in some form (probably after converting to some sort of binary tree).

@stevengj
Copy link
Member Author

An even simpler implementation would be to just add a 2-bit width field to utf8proc_property_t and add an accessor function to fetch it. (We have recently broken binary compatibility with the grapheme functionality, so now would be the time to add this.)

The main question is whether Jan is interested in incorporating this upstream.

@StefanKarpinski
Copy link
Member

Has anything been incorporated upstream so far? Just curious how realistic upstreaming changes really is.

@stevengj
Copy link
Member Author

They've incorporated our Unicode-7 patches (though not in a release yet) and posted a public mercurial repository. I haven't heard back yet about the grapheme updates.

@StefanKarpinski
Copy link
Member

Ah, cool. That's quite positive sounding.

@nalimilan
Copy link
Member

Any news on the upstream inclusion of mojibake patches? I'd just like to make sure it will be ready when Julia 0.4 is released.

@stevengj
Copy link
Member Author

I haven't heard back; just pinged them again.

@stevengj
Copy link
Member Author

stevengj commented Mar 6, 2015

Now that we are the official utf8proc maintainers, I think we should go ahead and add a 2-bit width field to utf8proc_property_t and populate it via jiahao's algorithm.

@StefanKarpinski
Copy link
Member

Yes!

This was referenced Mar 6, 2015
stevengj added a commit that referenced this issue Mar 8, 2015
stevengj added a commit that referenced this issue Mar 8, 2015
stevengj added a commit that referenced this issue Mar 8, 2015
stevengj added a commit that referenced this issue Mar 9, 2015
stevengj added a commit that referenced this issue Mar 9, 2015
stevengj added a commit that referenced this issue Mar 9, 2015
stevengj added a commit that referenced this issue Mar 11, 2015
stevengj added a commit that referenced this issue Mar 12, 2015
stevengj added a commit that referenced this issue Mar 12, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants