Non latin search NG ? #133
Comments
It probably has to do with the use of a route param instead of a query parameter. This is being fixed with #103.
Not working in current production with
Just in case, I tried with 歌詞コピー (but a real CJK search should also find 歌詞 as words are not separated by spaces) and it doesn’t work either.
I'll see what I can do. USO doesn't support this either so flipping label to feature/enhancement and assigning myself for the time being... help is always appreciated though. This is something we should strive to support. Probably just some UTF-8 flag somewhere.
We're using regex to look things up, so it should be possible. Just means we need to change these regexes a bit. https://github.com/OpenUserJs/OpenUserJS.org/blob/master/libs/modelQuery.js#L46 We're using
So we'll need to make our own Misc links: |
I think Johan touched on some of what you are referencing on MDN and I used it in one of my scripts for a different purpose... I'll see if I can dig up that link and post momentarily... Here we go... That way we could possibly search on that kind of modified content... although it may use up quite a bit of drone memory/cycles if unicode is found in a search string... What do you think @Zren? A native node.js package would be preferred if these fail to meet expectations. See also:
If we do end up using any route like these we should try to only exec this if Unicode is detected and default to native V8 for everything else.
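One way to read the suggestion above, as a minimal sketch: check the term for non-ASCII characters and only then leave the native `\b` path. The helper names (`hasNonAscii`, `escapeRegExp`, `buildTermRegExp`) are hypothetical and not taken from the OpenUserJS code, and the plain substring fallback is only a stand-in for whichever Unicode-aware approach ends up being used.

```javascript
// Hypothetical helper: true when the term contains any character outside ASCII.
function hasNonAscii(term) {
  return /[^\x00-\x7F]/.test(term);
}

// Hypothetical helper: escape regex metacharacters so the term is matched literally.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// ASCII terms keep the native \b word-start check; anything else falls back to
// a plain case-insensitive substring match (an assumption for illustration).
function buildTermRegExp(term) {
  const escaped = escapeRegExp(term);
  return hasNonAscii(term)
    ? new RegExp(escaped, 'i')
    : new RegExp('\\b' + escaped, 'i');
}

console.log(buildTermRegExp('歌詞').test('純文本歌詞'));   // true
console.log(buildTermRegExp('ello').test('hello world')); // false
```

The second result shows the ASCII path only matching from the start of a word, which is the behaviour discussed elsewhere in this thread.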
Thanks for looking into this! :)
…led `term`s since node *(V8)* doesn't currently support them. Possible fix for OpenUserJS#133 * This is a mix of @ZRENs idea mixed in with a little installWith snippet I did a while back * Needs some intense testing before production use
So tried out the XRegExp package and couldn't get it to work well under the current architectural project design... but did figure out how to emulate (this means not exact btw) the word boundary for Unicode enabled strings... mixed this in with a bit of installWith code (GPL v3+ btw) and it appears it is good to go for intense dev testing. Works with:
I am not going to submit a PR until this is cross-confirmed as functionally equivalent plus for the RFE in this issue by multiple devs... but it can be checked out from this named issue branch number since my master/branch is currently sync'd with upstream/master if my GH repo is added as a remote.
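To illustrate what "emulating the word boundary" can look like, here is a rough sketch. It is not the code from the branch mentioned above: it assumes a runtime with Unicode property escapes (`\p{...}` with the `u` flag), and `escapeRegExp` and `unicodeWordStart` are hypothetical names.

```javascript
// Hypothetical helper: escape regex metacharacters so the term is matched literally.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Emulated (not exact) word start: the term must sit at the start of the text
// or directly after a character that is not a letter, digit, or underscore.
function unicodeWordStart(term) {
  return new RegExp('(?:^|[^\\p{L}\\p{N}_])' + escapeRegExp(term), 'iu');
}

const text = 'PLAIN TEXT LYRICS 歌詞コピー';
console.log(/\b歌詞/.test(text));                 // false: native \b only knows ASCII word chars
console.log(unicodeWordStart('歌詞').test(text)); // true: preceded by a space
console.log(unicodeWordStart('歌詞').test('純文本歌詞')); // false: preceded by a letter
```

The last line shows the limit of any boundary-based approach for unspaced CJK text: 歌詞 in the middle of 純文本歌詞 is not at a "word start".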
@Martii
I need a break... been reading and testing on this all day today... but I'll see if your line note change meets with my basic testing results after I get some food. :)
Alright ready for final check, merge and deploy @sizzlemctwizzle @jesus2099
Leaving open during "needs testing" phase. Currently deployed.
MAJOR REEDIT: hmmm may have found a solution...
…st char being non-word. * Match STYLEGUIDE standard for identifier naming * Remove stray re begin at... grr * Trim the terms as not allowing searching on spaces Applies to OpenUserJS#133
Applies to OpenUserJS#133
I'm out of ideas e.g. hung... it was working on dev before I went to sleep but now it's not... the only 100% consistent time this currently works is if I also noticed Some helpful reference material evaluated for a lot of this:
This and the following commits should be PR ready, and should not break existing routes. This means refactored code that affect more than one route will be duplicated and renamed. We can easily cleanup extra code after implementing the entire refactor.
…ve libs to do. * Changed "WARNING" on login page to "CAUTION"... a little too assertive... some users may want their email addy in there. * Notated some STYLEGUIDE conformance needs * Notated very short function name(s) * At least one undefined identifier
Had some additional transient thoughts on this for the next assignee:
Exactly which scripts of yours are each of the quoted queries supposed to find again? EDITED: Found one... but some of those characters don't seem to exist as whole words.
Just an FYI https://openuserjs.org/?q=ello+world doesn't pull up my related unit test scripts... I don't think it does mid-word to end-word searches... just beginning to mid. I'll reverify that in a few days in the code itself. (after this holiday)
Temporarily forked your script here with adding a space and performed this query... so most likely it is from beginning of word to mid searching. Refs:
The specificity of CJK texts is that you should not expect any space characters anywhere.
Well I think this answer covers what you are looking for that I linked above in the refs. Everything I've skimmed over with my latest research says it's whole words and every example is beginning on the word not in the middle of it... like I said I'll double check our code to see if we are doing something different... but I doubt we are going to sub-index words (assuming I'm using their terminology correctly)... I don't believe we have the CPU/MEMORY power for that and that could get seriously expensive if we tried. From what I've read on MongoDB it's majorly used at v2.x and v3.x is on our development but searching does the same thing on local pro vs development... so I don't think you are going to get mid-word to end search capability. If I can't find an answer I'll have to add the "tracking upstream" label... it may appear somewhere along code migrations but it could be a loooooooooooong time (as we've already experienced). I'm not ready to dive into another DB system either as I'm just entering figuring out MongoDB... plus I don't think sizzle would like that large of a change at this time.
OK but just for the record, you speak about words but in the sense of unspaced characters not words here.
I agree it is difficult.
Well funny that you should mention that... I just dealt with trespasserW on this in a script issue... there's a limit on the number of characters they will process... which this exact issue may explain why. OUJS doesn't have as much processing power as a search engine does. Attempting to do that would be a vain and very expensive effort and probably would affect how OUJS is presented. Currently we have no Ads but I guarantee that there would be if we even tried to compete with the processing power of a search engine. Privacy would be a thing of the past trying to purchase server clouds, internet backbones, etc... unless everyone involved has a huge pocketbook to contribute I think it may have to be as is.
I don't think computer languages in general are to that type of language. The space is an important delimiter. I don't mean to sound rude or insensitive but CJK should adopt some sort of space (as a breather at the very least) because no person is able to speak or think without pauses and you'll eventually run out of parchment paper... so there has to be "breaks" with words somehow. ;) Human brains have always been more adaptable than most computers too... which is why they haven't taken over. ;) ... alas I'm drifting off topic here.
In all computer languages that would be delimited by a terminating string null usually in C/PP native app which includes JavaScript... and creating a clause requires spaces as part of grammar. When I was trying translating some of your text I noticed that one character meant one thing and then the next character changed its meaning too... at least according to google...
... and so on. Anyhow... if you find something in those refs that solves the situation, even if it's at a later date, by all means please let everyone know. Part of the experience on OUJS (and node) is contributing with whatever capabilities one has. :)
Here's a thought.. since you know CJK way better than I do... Would you be willing to run some tests with all the HTML entities (Unicode versions though and I'm still not sure if we transform from UTF-16 yet) with the different types of spaces? The non-breaking space is the first one that comes to mind.
I guess there is a zero width space in the Unicode table, it seems to be what you are looking for. :)
Sure I can! 😉 😄 iamgoingtotrysomethinghereandseeifyoucanreaditinenglish.thiswillbeignoringallcapitalizationbutkeepingpunctuation.itseemstomethatifihadanovelwrittenlikethisitwouldnotbesearchablebyanymeans. ;)
Ahh thanks.. that's the other one I couldn't think of.
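As a starting point for the space tests proposed above, here is a small sketch that checks which candidate separators a split on `\s+` would actually honour. `splitsOn` is a hypothetical helper, the split pattern is an assumption about how terms are separated, and the ideographic space is included here only as an extra candidate that was not mentioned in the thread.

```javascript
// Candidate separators: the two mentioned in this thread (no-break space,
// zero width space), an ordinary space, and the ideographic space.
const candidates = {
  'U+0020 SPACE': '\u0020',
  'U+00A0 NO-BREAK SPACE': '\u00A0',
  'U+3000 IDEOGRAPHIC SPACE': '\u3000',
  'U+200B ZERO WIDTH SPACE': '\u200B',
};

// Hypothetical helper: does a split on /\s+/ treat `sep` as a separator?
function splitsOn(sep) {
  return ('歌詞' + sep + 'コピー').split(/\s+/).length === 2;
}

for (const [name, sep] of Object.entries(candidates)) {
  console.log(name, splitsOn(sep));
}
```

In current V8 the first three count as whitespace for `\s`, while the zero width space does not, so it would have to be handled explicitly if it were used as a separator.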
I have just sent a private message in your sourceforge account (it seems impossible here). 😉
Just to jump in and explain how the search currently works: The search term is broken up using spaces to extract "words" (more like Not exactly the best search algorithm, but I'd have to quit my job and use
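The splitting step described above can be sketched roughly as follows. `splitTerms` is a hypothetical name, not the function in `libs/modelQuery.js`, and what happens to each term afterwards is elided in the comment, so it is left out here.

```javascript
// Hypothetical helper: split a raw query into trimmed, non-empty terms.
function splitTerms(query) {
  return query.trim().split(/\s+/).filter(Boolean);
}

console.log(splitTerms('  hello   world ')); // two terms: hello, world
console.log(splitTerms('歌詞コピー'));         // one term: the whole phrase
console.log(splitTerms('   '));              // no terms
```

A CJK phrase without spaces comes out as a single term, so any per-term matching sees the whole phrase rather than its component words.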
@jesus2099 Btw https://github.com/OpenUserJs/OpenUserJS.org/search?utf8=%E2%9C%93&q=novel doesn't find my post above in this issue either... so GH (seems to) require spaces as well. @sizzlemctwizzle
I know it’s just fun sarcasm because I already explained, but, just in case, no-space in Japanese is nothing but natural to read; it’s not a challenge as it is with the Latin alphabet, where it is nearly impossible.
Well it was also a test for GH as I just replied.
In all seriousness... at the very least between CJK and en-US is usually a good thing. With
It works in GH when you specify issue type (default is code): thatwordyoutested and even 純文本 (which is not separated with spaces). 😊 But I am absolutely not underestimating the complexity of making up such a search. I don’t even know anything about it. 🔰
@jesus2099
Well we can all learn together if everyone is willing and of course has the time available. It takes me longer to digest what has been said than sizzle, comparatively, but usually I start at a slower pace and move to a faster one once I understand things better.
Don't see it there but I'll keep looking. SF has its own issues too. To allay any possible misunderstandings I appreciate your contributions and queries here.
Hello,
It seems that non-Latin script search is not functioning.
歌詞 should find kasi. PLAIN TEXT LYRICS 歌詞コピー 純文本歌詞 (same if I try to manually encode the URL with %E6%AD%8C%E8%A9%9E).
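For reference, the manual encoding mentioned above is the UTF-8 percent-encoding of the search term, which can be checked in Node or a browser console. The `?q=` URL form is borrowed from a link in another comment of this thread; whether the site used that form when this issue was opened is not confirmed here.

```javascript
const term = '歌詞';
const encoded = encodeURIComponent(term);

console.log(encoded);                              // %E6%AD%8C%E8%A9%9E
console.log(decodeURIComponent(encoded) === term); // true

// Query-string form, as in https://openuserjs.org/?q=ello+world from this thread.
console.log('https://openuserjs.org/?q=' + encoded);
```

Since the encoded and raw forms decode to the same string, encoding the URL by hand would not be expected to change the search result.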