-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] new plugin with normalizer & analyzer for phone numbers #11326
Comments
I can see this being extremely challenging, given the wide variety of conventions. The most "normal" form would probably involve stripping any international calling prefix and removing any punctuation/whitespace. I suppose it would be debatable as to whether a country-specific long-distance prefix should be removed. E.g. 1-800-MICROSOFT is functionally equivalent to 800MICROSOFT, because the 1 historically indicated a call outside your local area code. |
I found this plugin that could be leveraged as a starting point. https://github.com/purecloudlabs/elasticsearch-phone. It looks like its been abandoned. cc: @drewinin, @ININDevEvangelists - any thoughts, guidance, or warnings on this? |
@macohen I administer the purecloudlabs org and am unfamiliar with the project itself, but I can tell you that it has no maintainers and so can be considered abandoned; it's on my list of repos that are to be archived. It's licensed under a MIT license, so you are free to do with it as you please. |
thanks for finding this library, @macohen!
thanks for this information! you have a i've now seen that this library is backed by google's libphonenumber (which i didn't know before), that looks like it would help in abstracting away most of the discussions above (by leaving it to this library), though i haven't looked into it in any detail (yet). its Falsehoods Programmers Believe About Phone Numbers list is very informative and shows that we probably shouldn't attempt at implementing this manually (e.g. with textual token filtering). |
Oh gross; I didn't notice that. I think the correct answer is this is meant to be licensed under Apache per the misspelled I've deleted the MIT license and archived the repo to rectify this situation. I'll have to retract my previous statement about licensure. You're still free to do whatever you want with it, just per the remaining Apache license. |
@macohen I think we could fork that abandoned library into opensearch-phone-number-analyzer or bring it into core here |
Sounds good to me. Is it "core" enough to include here? I'm open to what seems right to maintainers. |
@macohen core is pretty bloated, so I would include it only if you think it should be enabled by default, WDYT? |
I do think it should be enabled by default - as long as it doesn't use any resources unless invoked. However, I don't think we're at a stage yet where we should just move it in. @rursprung are you up for forking this to your own account, making it work with OpenSearch as a plugin and then we can take a public vote/decision on if we move this into the opensearch-project itself? I think I'd want to make sure this library can at least meet your use cases. @dblock do you think the change from a plugin/extension to a core module would be significant enough that we would want to make a decision before anyone starts working on it? cc: @dagneyb for an FYI |
I don't feel strongly about it at all. |
i have now created an implementation for this (currently on 2.x because i'm testing there, but easily ported to 3.x): opensearch-phone-number-analyzer this is not 100% ready for prime-time yet (namely i'm waiting for 1-2 PRs to end up in the plugin template and plan to then re-create the repo based on that and with everything in this is largely based on the functionality from @opensearch-project (CC @macohen):
i'd like to nominate this plugin to be moved to the |
i've now changed the title of the issue and added a comment at the beginning to clearly state that this is now an RFC to get the plugin moved over to this org. see my previous comment for more details. |
@AmiStrn was trying to contribute a process for deciding how/when/why we bring plugins into the org @rursprung Do you have any users of the fork? Get CI to be green, make a release? We can consider including a plugin in the main distribution if it’s widely used. |
Note that we generally describe the process of moving repos into the org in https://github.com/opensearch-project/.github/blob/main/ADMINS.md#new-repos and have a proposal in opensearch-project/opensearch-plugins#216 to document moving plugins. |
@dblock: thanks for your answers!
not yet. i published it right after building it to (a) get feedback on it and (b) see if it can be upstreamed directly.
that was a minor oversight (the lucene test expects the tokens to be in the specified order and seems that i mixed up something there; already fixed locally and on the branch using i'll update the repo once #14179 has been answered (probably re-create the repo from the template as a lot has happened there since i did so and then re-apply the commits with the necessary cleanups/code changes; as there are no releases, no forks, etc. there shouldn't be an issue with throwing away the existing git history in this case), this will then also porting it from 2.x to main (i haven't tried it yet and hope that the lucene major release upgrade won't cause any issues here).
if ever possible i'd prefer to avoid going through that if we can move it to the opensearch project as then we'd have to do the whole setup twice. also, some things would then probably have to be renamed (i presume it'd be bad style if i'd publish it with
i think these are two steps:
while the latter would be nice (full disclosure: we're not using it, we use the min distribution and add only the plugins we need for our own setup) the former would be my immediate goal with this issue. |
Both of these are significant amounts of work for people who work on this project. The second one is also a promise to continue working on the plugin for security fixes and other version upgrades forever. I think it's a reasonable ask that any plugin that is included in the org and is released with the distribution gets either a team publicly committing to do it (your company?) or at least some traction with users, first, no? |
This is a lightweight tokenizer. Adding all the bloat of a new plugin to support one more analyzer seems silly. Adding it to analysis-common would be much easier. |
The standard way of denoting the international call prefix is "+", not "00" (https://en.wikipedia.org/wiki/List_of_international_call_prefixes) If the field is known to be a phone number (not mixed with other text), then all punctuation and whitespace can be removed to canonicalize it, e.g., (234) 321-9321, 234.321.9321, 234 321 9321 should all canonicalize to 2343219321. It would be nice if we could add + to indicate the international call prefix, but unfortunately, numbers are often given without the country code. If the application is hosted in North America, and the number has 10 digits, it's pretty safe to guess that the prefix "+1" should be applied. But alas you can't count on that outside of North America. An industrial-strength implementation would look at other fields (notably address) to make a better guess. Probably ML to do this robustly. |
this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. this has been implemented in a new plugin which is however part of the central opensearch repository as it was deemed too big an overhead to have it in a separate repository but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details). closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. this has been implemented in a new plugin which is however part of the central opensearch repository as it was deemed too big an overhead to have it in a separate repository but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details). note that the new plugin has been added to the exclude list of the javadoc check as this check is overzealous and also complains in many cases where it shouldn't (e.g. on overridden methods - which it should theoretically not do - or constructors which don't even exist). the check first needs to be improved before this exclusion could be removed. closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. this has been implemented in a new plugin which is however part of the central opensearch repository as it was deemed too big an overhead to have it in a separate repository but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details). note that the new plugin has been added to the exclude list of the javadoc check as this check is overzealous and also complains in many cases where it shouldn't (e.g. on overridden methods - which it should theoretically not do - or constructors which don't even exist). the check first needs to be improved before this exclusion could be removed. closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. this has been implemented in a new plugin which is however part of the central opensearch repository as it was deemed too big an overhead to have it in a separate repository but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details). note that the new plugin has been added to the exclude list of the javadoc check as this check is overzealous and also complains in many cases where it shouldn't (e.g. on overridden methods - which it should theoretically not do - or constructors which don't even exist). the check first needs to be improved before this exclusion could be removed. closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. this has been implemented in a new plugin which is however part of the central opensearch repository as it was deemed too big an overhead to have it in a separate repository but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details). note that the new plugin has been added to the exclude list of the javadoc check as this check is overzealous and also complains in many cases where it shouldn't (e.g. on overridden methods - which it should theoretically not do - or constructors which don't even exist). the check first needs to be improved before this exclusion could be removed. closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
* add `Strings#isDigits` API inspiration taken from [this SO answer][SO]. note that the stream is not parallelised to avoid the overhead of this as the method is intended to be called primarily with shorter strings where the time to set up would take longer than the actual check. [SO]: https://stackoverflow.com/a/35150400 Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> * add `phone` & `phone-search` analyzer + tokenizer this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. this has been implemented in a new plugin which is however part of the central opensearch repository as it was deemed too big an overhead to have it in a separate repository but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details). note that the new plugin has been added to the exclude list of the javadoc check as this check is overzealous and also complains in many cases where it shouldn't (e.g. on overridden methods - which it should theoretically not do - or constructors which don't even exist). the check first needs to be improved before this exclusion could be removed. closes #11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> --------- Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
Thanks a lot for contribution @rursprung ! Please provide the pull request for opensearch-project/documentation-website#8389 with documentation update upon your convenience (before 2.18.0 release), thank you. |
this is part of opensearch-project/OpenSearch#11326. the actual implementation was done opensearch-project/OpenSearch#15915. see the commit message on the PR for further details. resolves opensearch-project#8389
this is part of opensearch-project/OpenSearch#11326. the actual implementation was done opensearch-project/OpenSearch#15915. see the commit message on the PR for further details. resolves opensearch-project#8389 Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
Also https://github.com/opensearch-project/opensearch-api-specification needs an update, with a working example. Thanks! |
this is part of opensearch-project/OpenSearch#11326. the actual implementation was done opensearch-project/OpenSearch#15915. see the commit message on the PR for further details. resolves opensearch-project#8389 Co-authored-by: Fanit Kolchina <kolchfa@amazon.com> Signed-off-by: Fanit Kolchina <kolchfa@amazon.com> Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
* add `Strings#isDigits` API inspiration taken from [this SO answer][SO]. note that the stream is not parallelised to avoid the overhead of this as the method is intended to be called primarily with shorter strings where the time to set up would take longer than the actual check. [SO]: https://stackoverflow.com/a/35150400 Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> * add `phone` & `phone-search` analyzer + tokenizer this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. this has been implemented in a new plugin which is however part of the central opensearch repository as it was deemed too big an overhead to have it in a separate repository but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details). note that the new plugin has been added to the exclude list of the javadoc check as this check is overzealous and also complains in many cases where it shouldn't (e.g. on overridden methods - which it should theoretically not do - or constructors which don't even exist). the check first needs to be improved before this exclusion could be removed. closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> --------- Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
* add `Strings#isDigits` API inspiration taken from [this SO answer][SO]. note that the stream is not parallelised to avoid the overhead of this as the method is intended to be called primarily with shorter strings where the time to set up would take longer than the actual check. [SO]: https://stackoverflow.com/a/35150400 Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> * add `phone` & `phone-search` analyzer + tokenizer this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. this has been implemented in a new plugin which is however part of the central opensearch repository as it was deemed too big an overhead to have it in a separate repository but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details). note that the new plugin has been added to the exclude list of the javadoc check as this check is overzealous and also complains in many cases where it shouldn't (e.g. on overridden methods - which it should theoretically not do - or constructors which don't even exist). the check first needs to be improved before this exclusion could be removed. closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> --------- Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
* add `Strings#isDigits` API inspiration taken from [this SO answer][SO]. note that the stream is not parallelised to avoid the overhead of this as the method is intended to be called primarily with shorter strings where the time to set up would take longer than the actual check. [SO]: https://stackoverflow.com/a/35150400 Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> * add `phone` & `phone-search` analyzer + tokenizer this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it. using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods]. this allows defining the region to be used when analysing a phone number. so far only the generic "unkown" region (`ZZ`) had been used which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`). example configuration for an index: ```json { "index": { "analysis": { "analyzer": { "phone": { "type": "phone" }, "phone-search": { "type": "phone-search" }, "phone-ch": { "type": "phone", "phone-region": "CH" }, "phone-search-ch": { "type": "phone-search", "phone-region": "CH" } } } } } ``` this creates four analyzers: `phone` and `phone-search` which do not explicitly specify a region and thus fall back to `ZZ` (unknown region, regional version of international dialing prefix (e.g. `00` instead of `+` in most of europe) will not be recognised) and `phone-ch` and `phone-search-ch` which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix). note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values. this has been implemented in a new plugin which is however part of the central opensearch repository as it was deemed too big an overhead to have it in a separate repository but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details). note that the new plugin has been added to the exclude list of the javadoc check as this check is overzealous and also complains in many cases where it shouldn't (e.g. on overridden methods - which it should theoretically not do - or constructors which don't even exist). the check first needs to be improved before this exclusion could be removed. closes opensearch-project#11326 [elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone [libphonenumber]: https://github.com/google/libphonenumber [falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> --------- Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
* document the new `analysis-phonenumber` plugin this is part of opensearch-project/OpenSearch#11326. the actual implementation was done opensearch-project/OpenSearch#15915. see the commit message on the PR for further details. resolves #8389 Co-authored-by: Fanit Kolchina <kolchfa@amazon.com> Signed-off-by: Fanit Kolchina <kolchfa@amazon.com> Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> * Minor rewrites Signed-off-by: Fanit Kolchina <kolchfa@amazon.com> * Apply suggestions from code review Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Update _analyzers/supported-analyzers/phone-analyzers.md Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Update _analyzers/supported-analyzers/phone-analyzers.md Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Fanit Kolchina <kolchfa@amazon.com> Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Fanit Kolchina <kolchfa@amazon.com> Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower <nbower@amazon.com>
this is part of opensearch-project/OpenSearch#11326. the actual implementation was done opensearch-project/OpenSearch#15915. see the commit message on the PR for further details. the new test group `analysis` has been added so that it can later be extended with all other optional language analyzers (which are currently also not covered). Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
UPDATE: RFC for new plugin
please use this issue as an RFC to have a new plugin under the
opensearch-project
org for the phone number normalizer & analyzer). i have implemented & open-sourced the plugin (needs minor polishing for the git history & port to 3.x - this can be done in a few minutes once we know where it'll live) and would very much like to see it hosted & owned by the project as i believe in the general usefulness of this.see this comment for more details: #11326 (comment)
original post:
Is your feature request related to a problem? Please describe.
we have a use-case where we store (amongst other things) a phone number in a dedicated field of the document. this is ingested from another system where in turn it has been entered by users (while there's some validation there might still be some variation in how the number is written). a user can then trigger a search which (amongst other things) will try to match the phone number. since the text to be searched is entered by the user, the phone number might come in any format (with or without international calling prefix, calling prefix with
+
or00
(or the national equivalent thereof), with or without separators (whitespaces, dashes, dots, you pick a character and chances are that a country is using it), with or without brackets for grouping numbers together, etc.).as a corner case (doesn't really affect us, but relevant for a general solution): even e.g. just filtering for numbers doesn't work in case a number would be entered with alphabetical representation. the only one i actually know is
1-800-MICROSOFT
in the USA, but i think you have lots of these over there?Describe the solution you'd like
it'd be great if OpenSearch could ship with a normalizer (or even a dedicated field type which automatically uses this normalizer?) for phone numbers which would cover most (if not all) cases. it could start with the most common ones and then be improved over time by the community when need arises.
Describe alternatives you've considered
everyone can build their own normalizer for phone numbers. the problem is that none of them will cover all (or even most) phone numbers and this just creates additional effort if everyone needs to re-invent the wheel.
the following is a very basic implementation which however doesn't cover most of the cases listed above (hence why it's hard to build a good one on your own):
Additional context
the wikipedia article on national conventions for writing telephone numbers seems to cover most (if not all?) ways of writing phone numbers
The text was updated successfully, but these errors were encountered: