-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Questions about Zed_char
#15
Comments
Hi, Drup. The original type definition of zed_char.t explains some of your questions: type Zed_char.t= {
core: UChar.t;
combined: UChar.t list;
width: int; (* width of this char *)
size: int; (* 1 + List.length combined *)
} because the structure takes up too much space, it was internally redefined as utf8 string.
val unsafe_of_uChar : UChar.t -> Zed_char.t does the job.
when it encounters a malformed zed_char sequence, i.e a sequence starting with an combining mark dangling there.
Internally, they share the same type definition:
a Zed_char.t is a bundle of UChar.t, and It's more linguistic aware(printable/unprintable/combining marks/sequence legality)
Zed_char can be either an arbitrary code point(like UChar.t) or a legal grapheme. |
Because it's not aware of the linguistic meaning. The unsafe_ functions doesn't check if it's well formed as a legal grapheme. But we still need these functions, for example, to deal with control characters, to combine a normal char with a sequence of combining marks(which is malformed in _safe functions) |
Well, my bug report is really an invitation for you to document the mli (and, it seems, the .ml), rather than answer here. From the look of it, the data structure should definitely be documented in the code. ;)
Does that mean On one hand, you say that In any case, for ease of use of the API, I feel it's essential to have functions that are safe and of type Don't take this the wrong way, I'm only probing you with questions so that we can converge towards a naming/API that people (with reasonable knowledge of unicode) will be able to approach and understand. I slightly regret missing out on the various PRs before the release, but it's not too late to improve things! :) |
To answer here first is good, people can give suggestions, ask further questions immediately. After these questions are satisfied, I'll add them to the mli interface. Yes, Zed_utf8 is the utf8 representation of an UChat.t. "Zed_grapheme", a good name. But names like 'zed_char', 'zed_string' follow the name convention(char, string) and imply that they are essential types while working with the zed library.
Ah, you are right. Someone should have given such a suggention early. |
I understand where the name come from. I think it's ok to keep them, as long as the documentation is very explicit on what is what. My proposal would be to enforce (and document) that Do we really need all those Is it possible to make |
@kandu Please do document the .mli file with a detailed description like the one you've presented here. |
It would be like that we extract and purify the ASCII code, define a new type, called english_grapheme and when we deal with real world writing systems/user input, we then define a higher level type upon them, the result would be There are two more considerations, one is backward-compatibility, in case a project depends on zed, the mirgrating progress from zed 1 to zed 2 should be as simple as possible. So my proposal would be that we release a minor version with two new functions in Zed_char, |
Q&A are useless, they are not integrated in the API documentation, and thus are hard to find. Nobody reads them and they end up outofdate extremely quickly. If you want to document things, document them in the .mli |
Sorry for the late reply. I've been busy with some meetings these days and didn't read your questions very carefully nor answered it with proper attention. Some of the replies above are non-sense. Sorry for that. The current API are sufficient, though the naming may not be so deliberate. And to update the document is unnecessary either. If a guy has reasonable knowledge about combining marks, combined glyph or variation selectors, either of them, he'll understand the API entirely. Even if he don't know any of them, just not doing things the wrong way: I've looked up the PR you mentioned let zchar_of_uchar ch =
Zed_char.of_uChars [ch;]
|> function
| (Some t, _) -> t
| _ -> raise @@ Failure (
"Fail to get a Zed_char.t from "
^ (CamomileLibrary.UChar.int_of ch |> string_of_int)) In case we are dealing with user input, John wants to input In almost all cases, you should not determine if it's a legal grapheme with Zed_char. Because input sequence are always uchar-individual therefore all the combining marks user input are temporarily not a legal grapheme. And if a user input ten uchar, he may end up getting only one grapheme, one core appending with nine combining marks. Zed_string Zed_edit Zed_rope Zed_... are already well designed to deal with such a situation. For example, if you Zed_string.append or Zed_rope.append two str: str1, str2. The Zed_string.length or Zed_rope.length result_str may not be equal to length str1 + length str2. Because the new zed automatically done the duty work --- combining mark joining. The only missing document we should have written in zed_char.mli is: the prefix 'unsafe' means it doesn't check if it's a valid grapheme, and in most cases, that's the right way to work with Zed_char. And in case you really want to do that, Zed_char.( prop / prop_uChar / is_printable / is_printable_core in_combining_mark will help). |
@kandu, you already took the time of answering the very valid questions that @Drup asked. You should take the time to write down your answer in a structured way in the You cannot assume that people will read your mind and understand perfectly the API. It is true that the API is quite straightforward, but there is some wiggling room in the unsafe functions that should be much better documented:
|
The simple fact that we have this long discussion is proof that we really need some more documentation. :) Let me be clear about one fact: common use of an API should never have to use |
@kandu Please do add the documentation to the .mli file. Most of what you need you've already written in this discussion. |
I'm back home, squeezing time for it. |
I'm starting to understand why you are concerning about the new features/apis. OCaml community used to live in a simple and elegant world, there is no IPA, no CJK, no glyph variant. And for a long time, we didn't expect that these features would be supported. So we are now concerning about the precondition, failure, unsafety. A picture will explain better: The literal string in function append_cm is malformed. It's '^' 'b'. The leading char is a dangling dissociated combining mark. What we are seeing is the complicated real world, we even have to deal with malformed strings and take them as legal literal values. In other words, even these malformed literal values are well concerned and supported, so what we do is
And that's one of the reasons why I said
p.s. the tolerance are only allowed in user input phase. But I were wrong. People in the OCaml community are always clever and reasonable guys. They want to infer, analyze, understand things both outwardly and inwardly. I drafted #18, try to clarify some of the doc comment. But not all the questions above are answered in the PR. Some of the questions above are come from the unfamiliarity. I think to clarify the basic principle, to serve it as a clue for further detailed info is more useful. And here are some context info about the new zed/lambda-term, ocaml-community/lambda-term#2 |
Also to mention, the reason why Zed_re module was divided into two Zed_re.Core Zed_re.Raw modules is for regexp matching on glyph variant, which is a common issue while investigate, research ancient CJK scripts. I was going to improve the functionality of Lterm_edit and Zed_edit so that it would be a competent widget to help us study/research into ancient scripts. It seems that we're going to eliminate the regexp facility in #16. It's fine. Anyway, If necessary, we can implement such a regexp engine in reasonable time. |
The documentation about the new
Zed_char
module is not really sufficient:of_utf8
fails ?Zed_utf8.t
?Uchar
?Given how central the notion of character is, I think a much more precise documentation is necessary.
These questions are not hypotheticals btw. I can't even review this patch because the new version of zed makes the code awful for reasons that are completely unclear to me.
The text was updated successfully, but these errors were encountered: