Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Julia interface for custom Unicode normalization #30275

Closed
wants to merge 4 commits into from
Closed

WIP: Julia interface for custom Unicode normalization #30275

wants to merge 4 commits into from

Conversation

chengchingwen
Copy link
Member

I found there was a utf8proc_decompose_custom function under the c call of utf8proc_map inside Base.Unicode.utf8proc_map but not expose to the Julia interface. So I made another Base.Unicode.utf8proc_map method that support the custom func and custom data with native Julia function and data type
*there is another PR that seems to make a custom unicode normalization, but I didn't found a Julia function to it.

@maleadt
Copy link
Member

maleadt commented Dec 5, 2018

You seem to have been developing on the release-1.0 branch; either set that as your base branch for this PR or rebase onto master to get rid of all the extraneous commits.

@chengchingwen
Copy link
Member Author

@maleadt the CI seems to fail on a unrelated test, I guess there might be no other problem in the changes.

	`pointer_from_objref(x)` require the `x` is a mutable.
	wrap the immutable object as a list before passing to the `custom_data`
lump::Bool=false,
stripmark::Bool=false,

function utf8proc_map(str::String, options::Integer, custom_func, custom_data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Julia side you should just pass custom_func — there is no need for custom_data in Julia because that is just a crude way to simulate a closure in C.

Then write

custom_func_wrapper(codepoint::Int32, custom_func::Any) = Int32(custom_func(codepoint))

and in utf8proc_map do

ccustom_func = @cfunction(custom_func_wrapper, Int32, (Int32, Any))

and pass ccustom_func, custom_func to utf8proc_decompose_custom. (That is, the Julia custom_func is actually passed as the data to utf8proc_decompose_custom. The corresponding argument in the ccall should be defined as Any.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, That is surprising. If it is the case, do we really need to make custom_func a closure?
But beside the origin of the custom_data, will it be more convenient to pass extra data for the custom_func? I read the c code at utf8proc. It seems that the custom_func can only do a codepoint mapping? I original thought that the custom_data is something like a pre-define table for mapping the codepoint and can change the behavior of utf8proc_iterate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don’t need an extra argument to pass extra data to a function argument in Julia. You just pass an anonymous function that references whatever local data you want.

In general, these “data” arguments are only for languages like C that don’t have closures. Google “closure computer science” if you don’t know what I’m talking about.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wondering whether using closure is a good design. Suppose we have a struct called MappingRule and a method mapping(m::MappingRule, codepoing). Then every time you call the normalize you will need to pass a x->mapping(m, x)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Or you just make the MappingRule a callable object.

(Ptr{UInt8}, Int, Ptr{UInt8}, Int, Cint, Ptr{Cvoid}, Ptr{Cvoid}),
str, sizeof(str), C_NULL, 0, options, ccustom_func, data_ptr)
nwords < 0 && utf8proc_error(nwords)
buffer = Base.StringVector(nwords*4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*sizeof(Int32) would be clearer.

@fredrikekre fredrikekre added the unicode Related to unicode characters and encodings label Dec 5, 2018
@chengchingwen
Copy link
Member Author

closed by #42561

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants