-
Notifications
You must be signed in to change notification settings - Fork 428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/full text search #1136
Feature/full text search #1136
Conversation
c15389d
to
251d09c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very good job, thanks! I really appreciate you also added the search feature to Riak's backend. Here are my general comments also:
- The PR description is not up-to-date (doesn't say about Riak backend)
- It would be very valuable if you could add an example stanza in the PR desc so it's easy for client devs to see and understand how to use this feature.
- It would be perfect if the search feature were optional. There maybe some installations where the packet format is in fact sth encrypted (that's why the packet's format is configurable). In such situation we don't want to keep the body in plain text. Also for setups where the search feature is not needed, disabling it will safe some storage space.
_Start, _End, _Now, _WithJID, <<_SearchText/binary>>, | ||
_PageSize, _LimitPassed, _MaxResultLimit, | ||
_IsSimple) -> | ||
omg; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really interesting return value :) however I think {error, omg}
is better and can still be improved ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, sorry, that was a placeholder :)
_Start, _End, _Now, _WithJID, <<_SearchText/binary>>, | ||
_PageSize, _LimitPassed, _MaxResultLimit, | ||
_IsSimple) -> | ||
{error, 'not-supported'}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one looks far more better then omg
from prev comment :)
@@ -133,7 +143,7 @@ stop_pm(Host) -> | |||
%% ---------------------------------------------------------------------- | |||
%% Add hooks for mod_mam_muc | |||
|
|||
-spec start_muc(ejabberd:server(),_) -> 'ok'. | |||
-spec start_muc(ejabberd:server(), _) -> 'ok'. | |||
start_muc(Host, _Opts) -> | |||
case gen_mod:get_module_opt(Host, ?MODULE, no_writer, false) of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably not needed here, am I right?
apps/ejabberd/src/mod_mam_utils.erl
Outdated
@@ -141,13 +149,13 @@ microseconds_to_now(MicroSeconds) when is_integer(MicroSeconds) -> | |||
|
|||
%% @doc Returns time in `now()' format. | |||
-spec iso8601_datetime_binary_to_timestamp(iso8601_datetime_binary()) | |||
-> erlang:timestamp() | undefined. | |||
-> erlang:timestamp() | undefined. | |||
iso8601_datetime_binary_to_timestamp(DateTime) when is_binary(DateTime) -> | |||
jlib:datetime_binary_to_timestamp(DateTime). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A word (or more) of an explanation will makes things easier to understand. A simple example may also help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what this refers to, considering that i didn't even touch this function ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your IDE seems to have touched it.
replace_x_user_element(FromJID, Role, Affiliation, Packet) -> | ||
append_x_user_element(FromJID, Role, Affiliation, | ||
delete_x_user_element(Packet)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's clever :) I like it!
elvis.config
Outdated
@@ -27,7 +27,7 @@ | |||
%% Allow usage of ?assertMatch and similar macros, since after expanding they | |||
%% result in use of variables like __X and __V that wouldn't be permitted by | |||
%% default regex. | |||
{elvis_style, variable_naming_convention, #{regex => "^(_?[A-Z][0-9a-zA-Z_]*)$"}} | |||
{elvis_style, variable_naming_convention, #{regex => "^(_?[A-Z][0-9a-zA-Z]*)$"}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do need _
in this regexp. Otherwise Elvis may complain about macros like ?A_MACRO_WITH_WORDS
, am I right? We have such macros in tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm gonna revert it. That was an merge issue.
8643221
to
7548e5f
Compare
@@ -228,29 +239,32 @@ do_archive_message(_Result, Host, MessID, UserID, | |||
SDir = encode_direction(Dir), | |||
SRemLResource = ejabberd_odbc:escape(RemLResource), | |||
Data = packet_to_stored_binary(Packet), | |||
TextBody = packet_to_search_body(Host, Packet), | |||
STextBody = ejabberd_odbc:escape(TextBody), | |||
string:to_lower(STextBody), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably not needed here, am I right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's redundant. I'm gonna remove it.
-spec normalize_search_text(binary() | string() | undefined, string()) -> string() | undefined. | ||
normalize_search_text(undefined, _WordSeparator) -> | ||
undefined; | ||
normalize_search_text(Text, WordSeparator) -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A word (or more) of an explanation will makes things easier to understand. A simple example may also help.
true -> | ||
false %% full page but not the last one in the result set | ||
end; | ||
TotalCount =:= PagedCount; %% false means full page but not the last one in the result set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's clever :) I like it!
@@ -917,6 +978,153 @@ simple_archive_request(Config) -> | |||
end, | |||
escalus_fresh:story(Config, [{alice, 1}, {bob, 1}], F). | |||
|
|||
text_search_is_not_available(Config) -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to Elvis:
The code in the following (LINE, COL) locations has the same structure: (981, 1), (1008, 1).
509a6b4
to
89c8cea
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still documentation missing, with focus on what happens when full-text search is turned on (and a note that turning it on won't make old messages searchable).
ArchiveID :: mod_mam:archive_id(), ArchiveJID :: ejabberd:jid()) | ||
-> integer(). | ||
ArchiveID :: mod_mam:archive_id(), ArchiveJID :: ejabberd:jid()) | ||
-> integer(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the formatter doesn't like lines beginning with an arrow.
-> purge_single_message_result(). | ||
MessID :: mod_mam:message_id(), ArchiveID :: mod_mam:archive_id(), | ||
ArchiveJID :: ejabberd:jid(), Now :: mod_mam:unix_timestamp()) | ||
-> purge_single_message_result(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more mis-formatted arrow.
apps/ejabberd/src/mod_commands.erl
Outdated
@@ -186,7 +186,7 @@ send_message(From, To, Body) -> | |||
ejabberd_hooks:run(user_send_packet, | |||
F#jid.lserver, | |||
[F, T, Packet]), | |||
% privacy check is missing, but is it needed? | |||
% privacy check is missing, but is it needed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Misformatted comment.
apps/ejabberd/src/mod_mam.erl
Outdated
@@ -65,7 +67,8 @@ | |||
|
|||
-import(mod_mam_utils, | |||
[maybe_microseconds/1, | |||
microseconds_to_now/1]). | |||
microseconds_to_now/1, | |||
normalize_search_text/2]). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it would make sense to avoid importing new things into this module and call them fully qualified with mod_mam_utils:
? The end goal would be to eliminate those imports alltogether.
apps/ejabberd/src/mod_mam.erl
Outdated
@@ -112,7 +116,8 @@ | |||
-include_lib("ejabberd/include/amp.hrl"). | |||
-include_lib("exml/include/exml.hrl"). | |||
|
|||
%% ---------------------------------------------------------------------- | |||
%% --------------------------------------------------- | |||
%% ------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened :<
apps/ejabberd/src/mod_mam_utils.erl
Outdated
children = [ResultSetEl]}. | ||
name = <<"fin">>, | ||
attrs = [{<<"xmlns">>, MamNs}] | ||
++ [{<<"complete">>, <<"true">>} || IsComplete] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like formatting in this case makes the code less readable.
apps/ejabberd/src/mod_mam_utils.erl
Outdated
children = [ResultSetEl]}. | ||
name = <<"fin">>, | ||
attrs = [{<<"xmlns">>, MamNs}] | ||
++ [{<<"complete">>, <<"true">>} || IsComplete] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like formatting in this case makes the code less readable.
apps/ejabberd/src/mod_mam_utils.erl
Outdated
[form_type_field(MamNs), | ||
form_field(<<"jid-single">>, <<"with">>), | ||
form_field(<<"text-single">>, <<"start">>), | ||
form_field(<<"text-single">>, <<"end">>)]. | ||
form_field(<<"text-single">>, <<"end">>)] ++ TextSearch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| TextSearch]
? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not the same :) I want lists concatenation, not insertion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's not the same in general, but isn't it more idiomatic (and marginally faster) in this case?
apps/ejabberd/src/mod_mam_utils.erl
Outdated
LowerBody = string:to_lower(BodyString), | ||
ReOpts = [{return, list}, global, unicode, ucp], | ||
Re0 = re:replace(LowerBody, "[, .:;-?!]+", " ", ReOpts), | ||
Re1 = re:replace(Re0, "([^\\w\\d ]+)|(^\\s+)|(\\s+$)", "", ReOpts), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The \d
is redundant here as \w
group already includes digits.
-spec normalize_search_text(binary() | string() | undefined, string()) -> string() | undefined. | ||
normalize_search_text(undefined, _WordSeparator) -> | ||
undefined; | ||
normalize_search_text(Text, WordSeparator) -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function has to do some heavy lifting on texts, have you measured the performance impact of ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It adds like 100 μs to processing time per row on my slow CPU. Simple string:to_lower
is only 10 times faster then this whole function. Also, I've checked that precompiling regexes only speed up this function by ~5%. Removing only one of those 3 regexes could give also no more then 5-10%, but I'm not sure if its easy to do.
fbc1746
to
456e7e3
Compare
9a698c7
to
30dbec3
Compare
30dbec3
to
9569e6d
Compare
This PR introduces MVP for MAM full text search. Currently it works only with ODBC and Riak backends and its very limited when comes to text matching. This is NOT an extension to the standard. Current MAM XEP-0313 allows for queries using "forms" that may contain any server-defined fields, therefore simple field named 'full-text-search' was introduced in order to provide this feature.
The following stanza shows a example MAM stanza that uses full text search:
Full Text Search is an optional server-side feature. To find out whether its supported, client may query server about active form fields for MAM stanza:
The response shows which form fields may be used to filter MAM messages (that includes required fields i.e. with, start and end):
While reading PR you may want to skip "Make Elvis happy" commit to get rid of code quality changes.