Non latin search NG ? #133

Open
jesus2099 opened this issue Jun 11, 2014 · 36 comments
Labels
enhancement Something we do have implemented already but needs improvement upon to the best of knowledge. needs discussion Blah, blah, blah, wahh, wahh, wahh, etc. tracking upstream Waiting, watching, wanting.

Comments

@jesus2099

Hello,
It seems that searching in non-Latin scripts does not work.
Searching for 歌詞 should find “kasi. PLAIN TEXT LYRICS 歌詞コピー 純文本歌詞”, but it does not (same if I manually encode the URL with %E6%AD%8C%E8%A9%9E).

@sizzlemctwizzle
Member

It probably has to do with the use of a route param instead of a query parameter. This is being fixed with #103.

@Martii
Member

Martii commented Jun 12, 2014

Not working in current production with 歌詞 pasted into the search box. Landing on https://openuserjs.org/?q=%E6%AD%8C%E8%A9%9E ... returns no results

@jesus2099
Author

Just in case, I tried with 歌詞コピー (but a real CJK search should also find 歌詞 as words are not separated by spaces) and it doesn’t work either.

@Martii Martii added Feature and removed bug labels Jun 12, 2014
@Martii
Member

Martii commented Jun 12, 2014

I'll see what I can do. USO doesn't support this either so flipping label to feature/enhancement and assigning myself for the time being... help is always appreciated though. This is something we should strive to support. Probably just some UTF-8 flag somewhere.

@Martii Martii self-assigned this Jun 12, 2014
@Zren
Contributor

Zren commented Jun 12, 2014

We're using regex to look things up, so it should be possible. Just means we need to change these regexes a bit.

https://github.com/OpenUserJs/OpenUserJS.org/blob/master/libs/modelQuery.js#L46

We're using \b

  • \b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
  • \w match any word character [a-zA-Z0-9_]

So we'll need to make our own \b for unicode.
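
A minimal sketch of the problem described above, and of one way a Unicode-aware stand-in for a leading `\b` might look. It assumes a modern engine with Unicode property escapes (`\p{L}`, ES2018), which node did not have when this issue was opened; `escapeRegExp` and `unicodePrefixRegExp` are hypothetical helper names, not code from `libs/modelQuery.js`.

```javascript
// Escape regex metacharacters so a search word is matched literally.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// \b is defined via \w, and \w only covers [a-zA-Z0-9_], so there is
// never a "word boundary" next to a CJK character.
var asciiPrefix = /\bhel/i.test('Hello World'); // true
var cjkPrefix = /\b歌詞/.test('歌詞コピー');       // false

// Hypothetical replacement for a leading \b: start of string, or any
// character that is not a letter, a number or an underscore.
function unicodePrefixRegExp(word) {
  return new RegExp('(?:^|[^\\p{L}\\p{N}_])' + escapeRegExp(word), 'iu');
}

console.log(asciiPrefix, cjkPrefix);
console.log(unicodePrefixRegExp('歌詞').test('歌詞コピー')); // true
console.log(unicodePrefixRegExp('歌詞').test('純文本歌詞')); // false
```

The last line shows the limit of this approach: a boundary-based regex, even a Unicode-aware one, will not find 歌詞 in the middle of an unspaced run of CJK characters.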

Misc links:

@Martii
Member

Martii commented Jun 12, 2014

I think Johan touched on some of what you are referencing on MDN and I used it in one of my scripts for a different purpose... I'll see if I can dig up that link and post momentarily...


Here we go... That way we could possibly search on that kind of modified content... although it may use up quite a bit of drone memory/cycles if unicode is found in a search string... What do you think @Zren?


A native node.js package would be preferred if these fail to meet expectations.

See also:

If we do end up using any route like these we should try to only exec this if unicode is detected and default to native V8 for everything else.

@jesus2099
Author

Thanks for looking into this ! :)

Martii pushed a commit to Martii/OpenUserJS.org that referenced this issue Jun 13, 2014
…led `term`s since node *(V8)* doesn't currently support them. Possible fix for OpenUserJS#133

* This is a mix of @ZRENs idea mixed in with a little installWith snippet I did a while back
* Needs some intense testing before production use
@Martii
Member

Martii commented Jun 13, 2014

So tried out the XRegExp package and couldn't get it to work well under the current architectural project design... but did figure out how to emulate (this means not exact btw) the word boundary for Unicode enabled strings... mixed this in with a bit of installWith code (GPL v3+ btw) and it appears it is good to go for intense dev testing.

Works with:

  • /?q=歌詞 returning currently one result on dev... expected (one by me pseudo forked from production)
  • /?q=歌詞+text returning currently one result on dev... expected (one by me pseudo forked from production)
  • /?q=hel+pa returning currently two results on dev... expected (one by sizzle and one by me)

I am not going to submit a pr until this is cross-confirmed as functionally equivalent plus for the RFE in this issue by multiple devs... but it can be checked out from this named issue branch number since my master/branch is currently sync'd with upstream/master if my GH repo is added as a remote.

@sizzlemctwizzle
Member

@Martii
I'd argue we could test this better in production (not everyone knows how to build the site) since it doesn't affect ASCII searches and at worst doesn't work on non-ASCII (current behavior). Can you submit a PR (after you take a look at my comment of course)?

@Martii
Member

Martii commented Jun 13, 2014

Can you submit a PR

I need a break... been reading and testing on this all day today... but I'll see if your line note change meets with my basic testing results after I get some food. :)

Martii pushed a commit to Martii/OpenUserJS.org that referenced this issue Jun 13, 2014
@Martii Martii mentioned this issue Jun 13, 2014
@Martii
Member

Martii commented Jun 13, 2014

Alright ready for final check, merge and deploy @sizzlemctwizzle

@jesus2099
I will need you to test this out fully with your system when sizzle deploys it with your @name and @description and your user-content (the script description via Edit Script Info) and see if this meets your needs... if not please report back. I'll close this issue in about 3 days if I don't see/hear any problems.

@Martii
Member

Martii commented Jun 13, 2014

Leaving open during "needs testing" phase.


Currently deployed.

@jesus2099
Author

Thanks Marti, here is what I tested.
音楽 and 音楽の森 show that it works for name and description (summary).
But in fact it does not always work: 直接のリンク and サービス should have found JASRACへの直リンク, and even 直リンク (a word that is part of the name) does not find it.

@Martii
Member

Martii commented Jun 13, 2014

MAJOR REEDIT:

hmmm may have found a solution...

Martii pushed a commit to Martii/OpenUserJS.org that referenced this issue Jun 13, 2014
…st char being non-word.

* Match STYLEGUIDE standard for identifier naming
* Remove stray re begin at... grr
* Trim the terms as not allowing searching on spaces

Applies to OpenUserJS#133
Martii pushed a commit to Martii/OpenUserJS.org that referenced this issue Jun 13, 2014
@Martii
Member

Martii commented Jun 13, 2014

I'm out of ideas, i.e. hung. It was working on dev before I went to sleep but now it's not. The only time this currently works 100% consistently is when prefixStr and fullStr are identical, i.e. no word boundary when non-ASCII terms are encountered. We might have to drop the "starting with" search feature and just use full searches when non-ASCII is detected, then reimplement it when re has full Unicode support (whatever year that is). So it's either dropping that specific feature in that use case or another yet-to-be-presented solution. This needs discussion, a vote, and then either an implementation with a possible override or tabling, i.e. what do we do next?
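
The fallback proposed above could be sketched roughly as follows: keep the `\b`-anchored "starting with" regex for pure ASCII terms and switch to an unanchored match once a non-ASCII character is detected. `termToRegExp` and `NON_ASCII` are made-up names for illustration, not the identifiers used in the project.

```javascript
// Escape regex metacharacters so a search word is matched literally.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Anything outside the 7-bit ASCII range.
var NON_ASCII = /[^\x00-\x7F]/;

// Hypothetical helper: build the regex for one search term.
function termToRegExp(term) {
  var word = escapeRegExp(term.trim());
  if (NON_ASCII.test(word)) {
    // No usable word boundary here, so match the term anywhere.
    return new RegExp(word, 'i');
  }
  // ASCII only: keep the existing "starting with" behaviour.
  return new RegExp('\\b' + word, 'i');
}

console.log(termToRegExp('直リンク').test('JASRACへの直リンク')); // true
console.log(termToRegExp('hel').test('Hello World'));          // true
console.log(termToRegExp('ello').test('Hello World'));         // false
```

ASCII searches behave exactly as before, and the extra cost of an unanchored scan is only paid for non-ASCII terms.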

I also noticed that the prop values for fullSearchFields cover different data than what is used in prefixSearchFields. This may be a bug elsewhere... but I don't see where these are initially set.

Some helpful reference material evaluated for a lot of this:

Martii referenced this issue Jun 24, 2014
This and the following commits should be PR ready, and should not
break existing routes. This means refactored code that affect more
than one route will be duplicated and renamed. We can easily cleanup
extra code after implementing the entire refactor.
@Martii Martii removed their assignment Jun 24, 2014
Martii referenced this issue in Martii/OpenUserJS.org Jul 16, 2014
…ve libs to do.

* Changed "WARNING" on login page to "CAUTION"... a little too assertive... some users may want their email addy in there.
* Notated some STYLEGUIDE conformance needs
* Notated very short function name(s)
* At least one undefined identifier
@Martii
Member

Martii commented Jul 29, 2015

Had some additional transient thoughts on this for the next assignee:

  • Since I had to do the BOM detection with a UTF-16 value... this implies that S3/node is using UTF-16 strings... not sure where to go with this yet as other priorities are in effect at the moment for me personally... escape() converts to ASCII percent encoded.
  • TODO I read on the mongoose DB docs that $regex may use a different regular expression engine (PCRE)... e.g. we might be able to handle Unicode better with Perl's implementation, assuming that is picked and tested against.
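
For reference, a small sketch of what the first point looks like from JavaScript: strings are sequences of UTF-16 code units, the legacy `escape()` encodes those code units, and `encodeURIComponent()` produces the UTF-8 percent-encoding seen in the search URL earlier in this thread. It assumes a runtime such as node where the deprecated global `escape()` is still available.

```javascript
var term = '歌詞';

// Collect the UTF-16 code units of the string as hex.
var units = [];
for (var i = 0; i < term.length; i++) {
  units.push(term.charCodeAt(i).toString(16));
}

console.log(term.length);              // 2
console.log(units.join(' '));          // 6b4c 8a5e
console.log(escape(term));             // %u6B4C%u8A5E
console.log(encodeURIComponent(term)); // %E6%AD%8C%E8%A9%9E

// A byte order mark shows up in a string as the single code unit U+FEFF.
var hasBom = '\uFEFFabc'.charCodeAt(0) === 0xFEFF;
console.log(hasBom);                   // true
```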

@Martii
Member

Martii commented Dec 23, 2015

@jesus2099

But in fact it does not always work: 直接のリンク and サービス should have found JASRACへの直リンク, and even 直リンク (a word that is part of the name) does not find it.

Exactly which scripts of yours are each of the quoted queries supposed to find again?

EDITED:
Would you simplify the search down to the smallest query? e.g. preferably one and two characters only please.

Found one... but some of those characters don't seem to exist as whole words.
Thanks.

@Martii
Member

Martii commented Dec 23, 2015

Just a FYI https://openuserjs.org/?q=ello+world doesn't pull up my related unit test scripts... I don't think it does mid-word to end-word searches... just beginning to mid. I'll reverify that in a few days in the code itself. (after this holiday)

@jesus2099
Author

The specificity of CJK texts is that you should not expect any space characters anywhere.
Just characters, punctuations and line breaks.
But maybe the search engine we have can simply not cope with it.
Tell me if you still have some questions. :)

@Martii
Member

Martii commented Dec 23, 2015

Well, I think the answer that I linked above in the refs covers what you are looking for. Everything I've skimmed over with my latest research says it's whole words and every example is beginning on the word not in the middle of it... like I said I'll double check our code to see if we are doing something different... but I doubt we are going to sub-index words (assuming I'm using their terminology correctly)... I don't believe we have the CPU/MEMORY power for that and that could get seriously expensive if we tried.

From what I've read, MongoDB is mostly used at v2.x, and v3.x is on our development, but searching does the same thing on local pro vs development... so I don't think you are going to get mid-word to end-of-word search capability. If I can't find an answer I'll have to add the "tracking upstream" label... it may appear somewhere along code migrations but it could be a loooooooooooong time (as we've already experienced). I'm not ready to dive into another DB system either as I'm just starting to figure out MongoDB... plus I don't think sizzle would like that large of a change at this time.

@Martii Martii removed the question A question has been encountered by anyone and has remained unanswered until cleared. label Dec 23, 2015
@jesus2099
Author

OK but just for the record, you speak about words but in the sense of unspaced characters not words here.
サービス is a word, 直接のリンク are 3 words (sort of) : 直接, の and リンク.
音楽 is a word, 音楽の森 are 3 : 音楽, の and 森.
We can’t expect Japanese to look for separate words.
It is completely impossible to imagine stuff like spaced out texts : 音楽 の 森 の 直接 の リンク (in script descriptions in particular but in any texts in general).
So MongoDB is not CJK friendly. :)
Thanks for your time, as always.

@jesus2099
Author

I agree it is difficult.
I think Yahoo, Google, etc., for both the text that the user types and for indexing pages, try each character as if there were spaces and also try sets of consecutive characters against their dictionary.
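
The "sets of consecutive characters" idea is usually called character n-gram indexing. A rough sketch, assuming bigrams and an in-memory set (nothing like this exists in the project, and `ngrams` and `matchesByNgrams` are hypothetical names):

```javascript
// Split text into overlapping runs of n characters, ignoring whitespace.
function ngrams(text, n) {
  var chars = Array.from(text.replace(/\s+/g, ''));
  var grams = [];
  for (var i = 0; i + n <= chars.length; i++) {
    grams.push(chars.slice(i, i + n).join(''));
  }
  return grams;
}

// A query matches when every one of its n-grams occurs in the indexed text.
function matchesByNgrams(indexedText, query, n) {
  var index = new Set(ngrams(indexedText, n));
  var wanted = ngrams(query, n);
  return wanted.length > 0 && wanted.every(function (gram) {
    return index.has(gram);
  });
}

console.log(ngrams('音楽の森', 2).join(' '));                      // 音楽 楽の の森
console.log(matchesByNgrams('JASRACへの直リンク', '直リンク', 2)); // true
console.log(matchesByNgrams('JASRACへの直リンク', 'サービス', 2)); // false
```

This needs no spaces in the text, but it stores one entry per character position and can report a match when the n-grams occur in different places, so a real index would still have to confirm the hit against the original text.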

@Martii
Member

Martii commented Dec 23, 2015

Google, etc., for both the text that the user types and for indexing pages, try each character as if there were spaces and also try sets of consecutive characters against their dictionary.

Well funny that you should mention that... I just dealt with trespasserW on this in a script issue... there's a limit on the number of characters they will process... which this exact issue may explain why. OUJS doesn't have as much processing power as a search engine does. Attempting to do that would be a vain and very expensive effort and probably would affect how OUJS is presented. Currently we have no Ads but I guarantee that there would be if we even tried to compete with the processing power of a search engine. Privacy would be a thing of the past trying to purchase server clouds, internet backbones, etc... unless everyone involved has a huge pocketbook to contribute I think it may have to be as is.

is not CJK friendly.

I don't think computer languages in general are friendly to that type of language. The space is an important delimiter. I don't mean to sound rude or insensitive but CJK should adopt some sort of space (as a breather at the very least) because no person is able to speak or think without pauses and you'll eventually run out of parchment paper... so there has to be "breaks" with words somehow. ;) Human brains have always been more adaptable than most computers too... which is why they haven't taken over. ;) ... alas I'm drifting off topic here.

サービス is a word

In all computer languages that would be delimited by a terminating string null, usually in a C/C++ native app, which includes JavaScript... and creating a clause requires spaces as part of grammar.

When I was trying translating some of your text I noticed that one character meant one thing and then the next character changed its meaning too... at least according to google...

  1. straight
    1 + 2) direct

... and so on.

Anyhow... if you find something in those refs that solves the situation, even if it's at a later date, by all means please let everyone know. Part of the experience on OUJS (and node) is contributing with whatever capabilities one has. :)

@Martii
Member

Martii commented Dec 23, 2015

Here's a thought.. since you know CJK way better than I do... Would you be willing to run some tests with all the HTML entities (Unicode versions though and I'm still not sure if we transform from UTF-16 yet) with the different types of spaces? The non-breaking space is the first one that comes to mind.

@jesus2099
Author

I guess there is a zero width space in the unicode table, it seems to be what you are looking for. :)
But you can’t ask millions of people to use spaces when they have never done. ;)
Your remark is good, some words are compound, it’s like in English “step father”.
In English you are free to assemble or to let compound words separate by space.
The fact that there is a space before and after a compound word, the same space as inside it, is no problem for the reader, who understands from context that it is a compound word.
It’s the same in CJK (I know Japanese at least). You don’t need anything extra to signal that something is a compound word; spaces are not necessary to distinguish single “character” words from “compound words”, the context is enough.
You can see how small the space bar is on Japanese keyboards; it’s merely used to select the next word in predictive typing, and much less often than by us Latin alphabet users.
When a monosyllabic language like Vietnamese, which beforehand used Chinese characters with no spaces started using Latin alphabet, it had to use spaces. :)
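
A quick way to start the space test asked for above is to check which space characters the JavaScript `\s` class (and therefore a split on whitespace) recognizes. This is only a sketch of the check, not of how OUJS tokenizes, and the list of candidates is an assumption about which spaces are worth trying.

```javascript
// Candidate separators, keyed by their Unicode names.
var spaces = {
  'SPACE U+0020': '\u0020',
  'NO-BREAK SPACE U+00A0': '\u00A0',
  'EN SPACE U+2002': '\u2002',
  'ZERO WIDTH SPACE U+200B': '\u200B',
  'IDEOGRAPHIC SPACE U+3000': '\u3000'
};

// Record whether splitting on \s+ separates two words joined by each one.
var splitsOn = {};
Object.keys(spaces).forEach(function (name) {
  var parts = ('音楽' + spaces[name] + '森').split(/\s+/);
  splitsOn[name] = parts.length === 2;
  console.log(name + ': ' + (splitsOn[name] ? 'splits' : 'does not split'));
});
```

The zero width space is the one candidate here that `\s` does not match, so a splitter relying on `\s` would have to handle U+200B explicitly.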

@Martii
Member

Martii commented Dec 23, 2015

But you can’t ask millions of people to use spaces when they have never done

Sure I can! 😉 😄

iamgoingtotrysomethinghereandseeifyoucanreaditinenglish.thiswillbeignoringallcapitalizationbutkeepingpunctuation.itseemstomethatifihadanovelwrittenlikethisitwouldnotbesearchablebyanymeans. ;)

zero width space in the unicode table

Ahh thanks.. that's the other one I couldn't think of.

@jesus2099
Author

I have just sent a private message in your sourceforge account (it seems impossible here). 😉

@sizzlemctwizzle
Member

Just to jump in and explain how the search currently works:

The search term is broken up using spaces to extract "words" (more like
ordered character groups). Multiple fields are searched for the
presence of all words from the search term. So if the search term contains
two "words", both must be present in the script title for that field to
match. Some fields are searched for exact matches on "words", and others
only care if the beginning of a word matches a search word.
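
The description above can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the actual query code: `matchesField`, `matchesSearch` and the field names in the usage lines are hypothetical, and the real site builds database queries rather than testing objects.

```javascript
// Escape regex metacharacters so a search word is matched literally.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// True when every word is present in the text, either as a whole word
// or, for prefix fields, at the beginning of a word.
function matchesField(text, words, prefixOnly) {
  return words.every(function (word) {
    var pattern = '\\b' + escapeRegExp(word) + (prefixOnly ? '' : '\\b');
    return new RegExp(pattern, 'i').test(text);
  });
}

// Split the term on spaces, then accept a document if any one field
// contains all of the resulting "words".
function matchesSearch(doc, term, fullFields, prefixFields) {
  var words = term.trim().split(/\s+/).filter(Boolean);
  if (words.length === 0) {
    return false;
  }
  return fullFields.some(function (field) {
    return matchesField(doc[field] || '', words, false);
  }) || prefixFields.some(function (field) {
    return matchesField(doc[field] || '', words, true);
  });
}

var script = { name: 'Hello World Paste', author: 'sizzle' };
console.log(matchesSearch(script, 'hel pa', ['author'], ['name']));     // true
console.log(matchesSearch(script, 'ello world', ['author'], ['name'])); // false
console.log(matchesSearch({ name: '歌詞コピー' }, '歌詞', [], ['name'])); // false
```

The second and third results mirror what is reported in this thread: a term that starts mid-word is not found, and a CJK term never sits on a `\b` boundary.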

Not exactly the best search algorithm, but I'd have to quit my job and use
NLP to build a really good search engine.
On Dec 23, 2015 1:39 AM, "Marti Martz" notifications@github.com wrote:

Just a FYI https://openuserjs.org/?q=ello+world doesn't pull up my unit
test scripts... I don't think it does mid-word to end searches... just
beginning to mid. I'll reverify that in a few days.



@Martii
Member

Martii commented Dec 23, 2015

@jesus2099
Yah it's impossible now... they (GH) removed that a few years ago.

Btw https://github.com/OpenUserJs/OpenUserJS.org/search?utf8=%E2%9C%93&q=novel doesn't find my post above in this issue either... so GH (seems to) require spaces as well.

@sizzlemctwizzle
Awesome! Always good to have the genius creator hop in to explain things. :)

@jesus2099
Author

iamgoingtotrysomethinghereandseeifyoucanreaditinenglish.thiswillbeignoringallcapitalizationbutkeepingpunctuation.itseemstomethatifihadanovelwrittenlikethisitwouldnotbesearchablebyanymeans. ;)

I know it’s just fun sarcasm because I already explained, but — just in case — no-space in Japanese is nothing but natural to read, it’s not a challenge as it is with Latin alphabet, where it is nearly impossible.
Adding spaces would not help at all and would just end up looking 👽 awkward.

@Martii
Member

Martii commented Dec 23, 2015

Well it was also a test for GH as I just replied.

@Martii
Member

Martii commented Dec 23, 2015

@jesus2099

... use spaces ...

In all seriousness... at the very least a space between CJK and en-US is usually a good thing. With @name being an "undefined" language, and most of the multilingual contributors on OUJS using only it, I think an en-SPACE would be a good thing between the different Unicode terms.

@jesus2099
Author

It works in GH when you specify issue type (default is code) : thatwordyoutested and even 純文本 (which is not separated with spaces). 😊

But I am absolutely not underestimating the complexity of making up such a search. I don’t even know anything about it. 🔰

@Martii
Member

Martii commented Dec 23, 2015

@jesus2099
Might have been a delay in parsing this issue's content... which would possibly indicate they have more background processing power as well and probably on a low clock cycle with the instances/threads. It's showing up now with my query. They probably have some sort of dictionary caching that MongoDB doesn't... they probably "love me" at GH with all my additions... good test data.

I don’t even know anything about it.

Well we can all learn together if everyone is willing and, of course, has the time available. It takes me longer than sizzle to digest what has been said, comparatively, but I usually start at a slower pace and then move to a faster one once I understand things better.

@Martii
Member

Martii commented Dec 23, 2015

private message in your sourceforge account

Don't see it there but I'll keep looking. SF has its own issues too. To allay any possible misunderstandings I appreciate your contributions and queries here.

@Martii Martii added tracking upstream Waiting, watching, wanting. and removed hung Not what you are thinking. Unable to resolve after assignment. labels Mar 24, 2016

4 participants