Non latin search NG ? #133

jesus2099 · 2014-06-11T06:57:08Z

Hello,
It seems that non latin script search is not functioning.
Searching for 歌詞 should find the script "kasi. PLAIN TEXT LYRICS 歌詞コピー純文本歌詞" (same result if I manually encode the URL with %E6%AD%8C%E8%A9%9E).

sizzlemctwizzle · 2014-06-11T07:10:44Z

It probably has to do with the use of a route param instead of a query parameter. This is being fixed with #103.

Martii · 2014-06-12T06:08:00Z

Not working in current production with 歌詞 pasted into the search box. Landing on https://openuserjs.org/?q=%E6%AD%8C%E8%A9%9E ... returns no results

jesus2099 · 2014-06-12T11:22:57Z

Just in case, I tried with 歌詞コピー (but a real CJK search should also find 歌詞 as words are not separated by spaces) and it doesn’t work either.

Martii · 2014-06-12T17:30:12Z

I'll see what I can do. USO doesn't support this either so flipping label to feature/enhancement and assigning myself for the time being... help is always appreciated though. This is something we should strive to support. Probably just some UTF-8 flag somewhere.

Zren · 2014-06-12T17:42:08Z

We're using regex to look things up, so it should be possible. Just means we need to change these regexes a bit.

https://github.com/OpenUserJs/OpenUserJS.org/blob/master/libs/modelQuery.js#L46

We're using \b

\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
\w match any word character [a-zA-Z0-9_]

So we'll need to make our own \b for unicode.

Misc links:

https://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters
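
To make the point about `\b` concrete, here is a minimal sketch of why the ASCII-only `\w` breaks boundary matching for CJK terms, plus one possible whitespace-based stand-in. The helper names `escapeRegExp` and `startsWordPattern` are made up for illustration and are not taken from `modelQuery.js`; treat this as a sketch under those assumptions, not as the project's code.

```javascript
// Without extra flags, \w is ASCII-only, so \b never fires next to CJK characters.
console.log(/\bhel/i.test('Hello world')); // true: boundary before "H"
console.log(/\b歌詞/.test('歌詞コピー'));     // false: no \w character anywhere

// Escape regex metacharacters so a user-supplied term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical emulation: a term "starts a word" when it sits at the start
// of the string or directly after whitespace.
function startsWordPattern(term) {
  return new RegExp('(?:^|\\s)' + escapeRegExp(term), 'i');
}

console.log(startsWordPattern('歌詞').test('PLAIN TEXT LYRICS 歌詞コピー'));  // true
console.log(startsWordPattern('コピー').test('PLAIN TEXT LYRICS 歌詞コピー')); // false
```

The last line shows the limit of this kind of emulation: a term that begins in the middle of an unspaced run is not found.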

Martii · 2014-06-12T18:01:52Z

I think Johan touched on some of what you are referencing on MDN and I used it in one of my scripts for a different purpose... I'll see if I can dig up that link and post momentarily...

Here we go... That way we could possibly search on that kind of modified content... although it may use up quite a bit of drone memory/cycles if unicode is found in a search string... What do you think @Zren?

A native node.js package would be preferred if these fail to meet expectations.

See also:

https://github.com/slevithan/xregexp (claims xregexp-all.js but doesn't install using $ npm install xregexp-all.js) ... a npm homepage... does install with $ npm install xregexp

If we do end up using any route like these we should try to only exec this if unicode is detected and default to native V8 for everything else.
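
The "only when unicode is detected" idea could look roughly like the sketch below. It assumes "unicode" here means any character outside the 7-bit ASCII range; `needsUnicodePath` and `partitionTerms` are hypothetical names, and what happens on each path is deliberately left out.

```javascript
// Anything outside 0x00-0x7F is treated as "non-ASCII" (an assumption of this sketch).
const NON_ASCII = /[^\x00-\x7F]/;

function needsUnicodePath(term) {
  return NON_ASCII.test(term);
}

// Split a query on whitespace and sort the terms into two buckets, so the
// more expensive handling only runs for terms that actually need it.
function partitionTerms(query) {
  const terms = query.trim().split(/\s+/).filter(Boolean);
  return {
    native: terms.filter(function (term) { return !needsUnicodePath(term); }),
    unicode: terms.filter(needsUnicodePath)
  };
}

console.log(partitionTerms('歌詞 text'));
```

For the query `歌詞 text`, only `歌詞` would go down the slower path; `text` stays on the native V8 regex.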

jesus2099 · 2014-06-12T19:23:12Z

Thanks for looking into this ! :)

@ZRENs

…led `term`s since node *(V8)* doesn't currently support them. Possible fix for OpenUserJS#133 * This is a mix of @ZRENs idea mixed in with a little installWith snippet I did a while back * Needs some intense testing before production use

Martii · 2014-06-13T00:06:26Z

So tried out the XRegExp package and couldn't get it to work well under the current architectural project design... but did figure out how to emulate (this means not exact btw) the word boundary for Unicode enabled strings... mixed this in with a bit of installWith code (GPL v3+ btw) and it appears it is good to go for intense dev testing.

Works with:

/?q=歌詞 returning currently one result on dev... expected (one by me pseudo forked from production)
/?q=歌詞+text returning currently one result on dev... expected (one by me pseudo forked from production)
/?q=hel+pa returning currently two results on dev... expected (one by sizzle and one by me)

I am not going to submit a pr until this is cross-confirmed as functionally equivalent plus for the RFE in this issue by multiple devs... but it can be checked out from this named issue branch number since my master/branch is currently sync'd with upstream/master if my GH repo is added as a remote.

sizzlemctwizzle · 2014-06-13T00:45:28Z

@Martii
I'd argue we could test this better in production (not everyone knows how to build the site) since it doesn't affect ASCII searches and at worst doesn't work on non-ASCII (current behavior). Can you submit a PR (after you take a look at my comment of course)?

Martii · 2014-06-13T01:03:08Z

Can you submit a PR

I need a break... been reading and testing on this all day today... but I'll see if your line note change meets with my basic testing results after I get some food. :)

* Applies to OpenUserJS#133

Martii · 2014-06-13T04:03:01Z

Alright ready for final check, merge and deploy @sizzlemctwizzle

@jesus2099
I will need you to test this out fully with your system when sizzle deploys it with your @name and @description and your user-content (the script description via Edit Script Info) and see if this meets your needs... if not please report back. I'll close this issue in about 3 days if I don't see/hear any problems.

Martii · 2014-06-13T04:26:12Z

Leaving open during "needs testing" phase.

Currently deployed.

jesus2099 · 2014-06-13T08:34:55Z

Thanks Marti, here is what I tested.
音楽 and 音楽の森 show that it works for name and description (summary).
But in fact it does not always seem to work: 直接のリンク and サービス should have found JASRACへの直リンク, and even 直リンク (a word that is part of the name) does not find it.

Martii · 2014-06-13T10:49:24Z

MAJOR REEDIT:

hmmm may have found a solution...

…st char being non-word. * Match STYLEGUIDE standard for identifier naming * Remove stray re begin at... grr * Trim the terms as not allowing searching on spaces Applies to OpenUserJS#133

Applies to OpenUserJS#133

Martii · 2014-06-13T21:02:12Z

I'm out of ideas e.g. hung... it was working on dev before I went to sleep but now it's not... the only 100% consistent time this currently works is if prefixStr and fullStr are identical e.g. no word boundary when nonASCII terms are encountered. We might have to drop "starting with" searching feature and just use full searches when nonASCII detected and reimplement when re has full Unicode support (whatever year that is). So dropping that specific feature in that use case or another yet to be presented solution... this needs discussion, vote, implement with possible override or table. e.g. what to do next?

I also noticed prop values for fullSearchFields covers different data than what is used in prefixSearchFields. This may be a bug elsewhere... but I don't see where these are initially set.

Some helpful reference material evaluated for a lot of this:

http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular-expressions
http://www.regular-expressions.info/unicode.html (Using \P{L} broke things plus it doesn't cover numbers with the negation which implies V8 doesn't work properly in this arena)
http://www.unicode.org/reports/tr18/ (RFC/WhitePaper here and notice it's relatively recent and most specifically the draft here ... then trying to get V8 to support it is a lengthy time issue)

This and the following commits should be PR ready, and should not break existing routes. This means refactored code that affect more than one route will be duplicated and renamed. We can easily cleanup extra code after implementing the entire refactor.

…ve libs to do. * Changed "WARNING" on login page to "CAUTION"... a little too assertive... some users may want their email addy in there. * Notated some STYLEGUIDE conformance needs * Notated very short function name(s) * At least one undefined identifier

Martii · 2015-07-29T09:10:00Z

Had some additional transient thoughts on this for the next assignee:

Since I had to do the BOM detection with a UTF-16 value... this implies that S3/node is using UTF-16 strings... not sure where to go with this yet as other priorities are in effect at the moment for me personally... escape() converts to ASCII percent encoded.
TODO I read on the mongoose DB docs that $regex may use a different regular expression engine (PCRE)... e.g. we might be able to handle Unicode better with Perl's implementation, assuming that is picked and tested against.

Martii · 2015-12-23T07:35:10Z

@jesus2099

But in fact it seems it does not work anytime : 直接のリンク and サービス should have found JASRACへの直リンク — even 直リンク (a word part of the name) does not find it.

Exactly which scripts of yours are each of the quoted queries supposed to find again?

EDITED:
Would you simplify the search down to the smallest query? e.g. preferably one and two characters only please.

Found one... but some of those characters don't seem to exist as whole words.
Thanks.

Martii · 2015-12-23T07:39:46Z

Just a FYI https://openuserjs.org/?q=ello+world doesn't pull up my related unit test scripts... I don't think it does mid-word to end-word searches... just beginning to mid. I'll reverify that in a few days in the code itself. (after this holiday)

Martii · 2015-12-23T07:55:42Z

Temporarily forked your script here with adding a space and performed this query... so most likely it is from beginning of word to mid searching.

Refs:

jesus2099 · 2015-12-23T08:52:57Z

The specificity of CJK texts is that you should not expect any space characters anywhere.
Just characters, punctuations and line breaks.
But maybe the search engine we have can simply not cope with it.
Tell me if you still have some questions. :)

Martii · 2015-12-23T09:09:24Z

Well I think this answer covers what you are looking for that I linked above in the refs. Everything I've skimmed over with my latest research says it's whole words and every example is beginning on the word not in the middle of it... like I said I'll double check our code to see if we are doing something different... but I doubt we are going to sub-index words (assuming I'm using their terminology correctly)... I don't believe we have the CPU/MEMORY power for that and that could get seriously expensive if we tried.

From what I've read on MongoDB it's majorly used at v2.x and v3.x is on our development but searching does the same thing on local pro vs development... so I don't think you are going to get mid-word to end search capability. If I can't find an answer I'll have to add the "tracking upstream" label... it may appear somewhere along code migrations but it could be a loooooooooooong time (as we've already experienced). I'm not ready to dive into another DB system either as I'm just entering figuring out MongoDB... plus I don't think sizzle would like that large of a change at this time.

jesus2099 · 2015-12-23T10:02:59Z

OK but just for the record, you speak about words but in the sense of unspaced characters not words here.
サービス is a word, 直接のリンク are 3 words (sort of) : 直接, の and リンク.
音楽 is a word, 音楽の森 are 3 : 音楽, の and 森.
We can’t expect Japanese to look for separate words.
It is completely impossible to imagine stuff like spaced out texts : 音楽の森の直接のリンク (in script descriptions in particular but in any texts in general).
So MongoDB is not CJK friendly. :)
Thanks for your time, as always.

jesus2099 · 2015-12-23T10:04:42Z

I agree it is difficult.
I think Yahoo, Google, etc. (for both the text that the user types and for indexing pages) try each character as if there were spaces and also try sets of consecutive characters against their dictionaries.
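
The "each character as if there were spaces, plus sets of consecutive characters" idea is essentially n-gram indexing. Below is a minimal sketch of it, assuming a plain in-memory `Set` and a hypothetical `ngrams` helper; the dictionary lookup mentioned above is not modelled, and nothing here reflects how any real search engine or OUJS is implemented.

```javascript
// Collect every run of 1..maxLen consecutive characters, ignoring whitespace.
// Array.from splits by code point, so characters outside the BMP stay intact.
function ngrams(text, maxLen) {
  const chars = Array.from(text.replace(/\s+/g, ''));
  const grams = new Set();
  for (let start = 0; start < chars.length; start++) {
    for (let len = 1; len <= maxLen && start + len <= chars.length; len++) {
      grams.add(chars.slice(start, start + len).join(''));
    }
  }
  return grams;
}

const index = ngrams('JASRACへの直リンク', 4);
console.log(index.has('直リンク')); // true: found although no space precedes it
console.log(index.has('リンク'));   // true
```

The cost is visible in the loop: each indexed string yields up to `length * maxLen` entries, which is the memory and CPU concern raised elsewhere in this thread.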

Martii · 2015-12-23T16:23:23Z

google, etc. — for both the text that user types and for indexing pages — tries each characters as if there were spaces and also tries sets of consecutive characters against its dictionary.

Well funny that you should mention that... I just dealt with trespasserW on this in a script issue... there's a limit on the number of characters they will process... which this exact issue may explain why. OUJS doesn't have as much processing power as a search engine does. Attempting to do that would be a vain and very expensive effort and probably would affect how OUJS is presented. Currently we have no Ads but I guarantee that there would be if we even tried to compete with the processing power of a search engine. Privacy would be a thing of the past trying to purchase server clouds, internet backbones, etc... unless everyone involved has a huge pocketbook to contribute I think it may have to be as is.

is not CJK friendly.

I don't think computer languages in general are to that type of language. The space is an important delimiter. I don't mean to sound rude or insensitive but CJK should adopt some sort of space (as a breather at the very least) because no person is able to speak or think without pauses and you'll eventually run out of parchment paper... so there has to be "breaks" with words somehow. ;) Human brains have always been more adaptable than most computers too... which is why they haven't taken over. ;) ... alas I'm drifting off topic here.

サービス is a word

In all computer languages that would be delimited by a terminating string null usually in C/PP native app which includes JavaScript... and creating a clause requires spaces as part of grammar.

When I was trying translating some of your text I noticed that one character meant one thing and then the next character changed its meaning too... at least according to google...

straight
1 + 2) direct

... and so on.

Anyhow... if you find something in those refs that solves the situation, even if it's at a later date, by all means please let everyone know. Part of the experience on OUJS (and node) is contributing with whatever capabilities one has. :)

Martii · 2015-12-23T16:34:58Z

Here's a thought.. since you know CJK way better than I do... Would you be willing to run some tests with all the HTML entities (Unicode versions though and I'm still not sure if we transform from UTF-16 yet) with the different types of spaces? There is the non-breaking space that comes to my thoughts first out.

jesus2099 · 2015-12-23T17:00:10Z

I guess there is a zero width space in the unicode table, it seems to be what you are looking for. :)
But you can’t ask millions of people to use spaces when they have never done. ;)
Your remark is good, some words are compound, it’s like in English “step father”.
In English you are free to assemble or to let compound words separate by space.
The fact that there is space before and after a compound word, the same space as inside it, is no problem for the reader to understand that these are a compound word, thanks to context.
It’s the same in CJK (I know Japanese at least) You don’t need something supplemental to say warning these are a compound word, spaces are not necessary to distinguish single “character” words from “compound words”, the context is enough.
You can see how small the space bar is on Japanese keyboards; it is mostly used to select the next word in predictive typing, and is pressed much less often than by us Latin alphabet users.
When a monosyllabic language like Vietnamese, which beforehand used Chinese characters with no spaces started using Latin alphabet, it had to use spaces. :)
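
For anyone running the space tests suggested above, here is a small sketch of splitting a query on the different kinds of spaces. It assumes the zero width space (U+200B) has to be listed explicitly, since JavaScript's `\s` covers the non-breaking space and the ideographic space but, in current engines, not U+200B. `splitTerms` is a hypothetical helper, not existing OUJS code.

```javascript
// \s already covers NBSP (U+00A0) and the ideographic space (U+3000);
// the zero width space (U+200B) is added by hand.
const TERM_SEPARATORS = /[\s\u200B]+/;

function splitTerms(query) {
  return query.split(TERM_SEPARATORS).filter(Boolean);
}

console.log(splitTerms('音楽\u200Bの\u200B森')); // [ '音楽', 'の', '森' ]
console.log(splitTerms('hello\u00A0world'));     // [ 'hello', 'world' ]
```

This only helps when the author of the text actually inserted such separators, which, as noted above, cannot be expected of CJK writers.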

Martii · 2015-12-23T17:14:49Z

But you can’t ask millions of people to use spaces when they have never done

Sure I can! 😉 😄

iamgoingtotrysomethinghereandseeifyoucanreaditinenglish.thiswillbeignoringallcapitalizationbutkeepingpunctuation.itseemstomethatifihadanovelwrittenlikethisitwouldnotbesearchablebyanymeans. ;)

zero width space in the unicode table

Ahh thanks.. that's the other one I couldn't think of.

jesus2099 · 2015-12-23T17:17:30Z

I have just sent a private message in your sourceforge account (it seems impossible here). 😉

sizzlemctwizzle · 2015-12-23T17:18:42Z

Just to jump in and explain how the search currently works:

The search term is broken up using spaces to extract "words" (more like ordered character groups). Multiple fields are searched for the presence of all words from the search term. So if the search term contains two "words", both must be present in the script title for that field to match. Some fields are searched for exact matches on "words", and others only care if the beginning of a word matches a search word.

Not exactly the best search algorithm, but I'd have to quit my job and use NLP to build a really good search engine.
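
The description above can be sketched as an in-memory filter. This is only an illustration under stated assumptions: the real site queries MongoDB, the field names and the helpers `fieldMatches` and `search` are hypothetical, and word edges are approximated with whitespace instead of `\b`.

```javascript
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// True when every term matches the field value.
// mode 'prefix': the term must start a word; mode 'full': it must be a whole word.
function fieldMatches(value, terms, mode) {
  return terms.every(function (term) {
    const escaped = escapeRegExp(term);
    const source = mode === 'full'
      ? '(?:^|\\s)' + escaped + '(?:\\s|$)'
      : '(?:^|\\s)' + escaped;
    return new RegExp(source, 'i').test(value);
  });
}

// A document is a hit when any single field matches all terms.
function search(query, docs, prefixFields, fullFields) {
  const terms = query.trim().split(/\s+/).filter(Boolean);
  if (terms.length === 0) { return []; }
  return docs.filter(function (doc) {
    return prefixFields.some(function (f) { return fieldMatches(doc[f] || '', terms, 'prefix'); }) ||
      fullFields.some(function (f) { return fieldMatches(doc[f] || '', terms, 'full'); });
  });
}

const docs = [
  { name: 'Hello World', author: 'sizzle' },
  { name: 'Help Pages', author: 'marti' }
];
console.log(search('hel', docs, ['name'], ['author']).length);        // 2
console.log(search('ello world', docs, ['name'], ['author']).length); // 0
```

The second query returns nothing because `ello` does not start a word, which matches the "beginning to mid" behaviour observed with `?q=ello+world`.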

Martii · 2015-12-23T17:20:43Z

@jesus2099
Yah it's impossible now... they (GH) removed that a few years ago.

Btw https://github.com/OpenUserJs/OpenUserJS.org/search?utf8=%E2%9C%93&q=novel doesn't find my post above in this issue either... so GH (seems to) require spaces as well.

@sizzlemctwizzle
Awesome! Always good to have the genius creator hop in to explain things. :)

jesus2099 · 2015-12-23T17:21:16Z

iamgoingtotrysomethinghereandseeifyoucanreaditinenglish.thiswillbeignoringallcapitalizationbutkeepingpunctuation.itseemstomethatifihadanovelwrittenlikethisitwouldnotbesearchablebyanymeans. ;)

I know it’s just fun sarcasm because I already explained, but — just in case — no-space in Japanese is nothing but natural to read, it’s not a challenge as it is with Latin alphabet, where it is nearly impossible.
Adding spaces would not help at all and would just end up looking 👽 awkward.

Martii · 2015-12-23T17:21:56Z

Well it was also a test for GH as I just replied.

Martii · 2015-12-23T17:37:20Z

@jesus2099

... use spaces ...

In all seriousness... at the very least between CJK and en-US is usually a good thing. With @name being an "undefined" language and most of the multilingual contributors on OUJS use it only I think an en-SPACE would be a good thing between the different Unicode terms.

jesus2099 · 2015-12-23T17:39:24Z

It works in GH when you specify issue type (default is code) : thatwordyoutested and even 純文本 (which is not separated with spaces). 😊

But I am absolutely not underestimating the complexity of making up such a search. I don’t even know anything about it. 🔰

Martii · 2015-12-23T17:43:33Z

@jesus2099
Might have been a delay in parsing this issues content... which would possibly indicate they have more background processing power as well and probably on a low clock cycle with the instances/threads. It's showing up now with my query. They probably have some sort of dictionary caching that MongoDB doesn't... they probably "love me" at GH with all my additions... good test data.

I don’t even know anything about it.

Well, we can all learn together if everyone is willing and, of course, has the time. It takes me longer than sizzle, comparatively, to digest what has been said, but I usually start at a slower pace and speed up once I understand things better.

Martii · 2015-12-23T17:56:31Z

private message in your sourceforge account

Don't see it there but I'll keep looking. SF has its own issues too. To allay any possible misunderstandings I appreciate your contributions and queries here.

sizzlemctwizzle added needs testing labels Jun 11, 2014

Martii added Feature and removed bug labels Jun 12, 2014

Martii self-assigned this Jun 12, 2014

Martii added the enhancement label Jun 12, 2014

Martii pushed a commit to Martii/OpenUserJS.org that referenced this issue Jun 13, 2014

Try sizzles re method for production testing

9f8f7d6

* Applies to OpenUserJS#133

Martii mentioned this issue Jun 13, 2014

Issue 133 #151

Merged

sizzlemctwizzle closed this as completed in #151 Jun 13, 2014

Martii reopened this Jun 13, 2014

sizzlemctwizzle removed the Feature label Jun 13, 2014

Martii mentioned this issue Jun 13, 2014

Correction in emulation of unicode word boundary and detection of first ... #154

Merged

Martii pushed a commit to Martii/OpenUserJS.org that referenced this issue Jun 13, 2014

remove the logical or

e954cab

Applies to OpenUserJS#133

Martii referenced this issue Jun 24, 2014

Add a admon panel to a groupScriptListPage linking the json.

f9cb56a

Martii removed their assignment Jun 24, 2014

Martii added needs discussion and removed needs testing labels Jul 3, 2014

Martii removed the question ("A question has been encountered by anyone and has remained unanswered until cleared.") label Dec 23, 2015

Martii added tracking upstream ("Waiting, watching, wanting.") and removed hung ("Not what you are thinking. Unable to resolve after assignment.") labels Mar 24, 2016
