Suboptimal Searches (Dev, Trial and any unindexed releases not to be included here) #1231

Open
oalders opened this issue Jun 16, 2014 · 30 comments

@oalders
Member

oalders commented Jun 16, 2014

I'm opening this issue as a place to collect searches which could be improved. Individual searches can be broken into issues as they are tackled, but this is essentially a place to get the conversation started.

@oalders
Member Author

oalders commented Jun 16, 2014

I'm looking for File::Temp.

https://metacpan.org/search?q=tmpfile

(result 8)

vs

http://search.cpan.org/search?query=tmpfile&mode=all

(result 2)

@oalders changed the title from "Suboptimal Searches" to "Suboptimal Searches (Dev, Trial and any unindexed releases not to be included here)" on Jun 17, 2014
@ribasushi

Thanks for kicking this off

http://search.cpan.org/search?query=mop&mode=all vs https://metacpan.org/search?q=mop (known issue, but the most painful manifestation of it)

http://search.cpan.org/search?query=dbix+helper&mode=all vs https://metacpan.org/search?q=dbix+helper (note how the only thing coming up is the deprecated one)

@oalders
Member Author

oalders commented Jun 30, 2014

@ribasushi I think the main issue with the dbix+helper search is that the MetaCPAN search results are collapsed. If you follow through on the link for more results you get https://metacpan.org/search?q=distribution:DBIx-Class-Helpers+dbix%20helper which is much more helpful. I'm not invalidating your comment. I'm just trying to work through what we're seeing. Obviously showing a deprecated module as the first result is not helpful. We should look at tweaking the collapsed search in this kind of case.

One other problem may be that the search is for "helper" and not "helpers". The collapsed results for "helpers" look better: https://metacpan.org/search?q=dbix+helpers

@ribasushi

@oalders Does ES provide a way to calculate a "churn coefficient"? In other words, can it rank the entries by "most changes since" and thus give you a sane collapse criterion?

@dagolden

You need a way to specify a search by module name -- you effectively have this for the search box autocomplete, but something like module:MooseX ought to give all dists with MooseX in the name rather than a full-text search. SCO has had this feature forever and it's a major gap in MetaCPAN.

@ranguard
Member

More smarts on start matching...

I want to find all Plack::Middleware::** modules that have 'time'

https://metacpan.org/search?q=plack%3A%3Amiddleware+time

This might be a new feature rather than a suboptimal search but thought I'd mention it here

@oalders
Member Author

oalders commented Jul 10, 2014

What @dagolden is proposing is something we can do relatively easily, so I think we should make that a priority. We'd just need to sort out the syntax, since the single colon is already part of Lucene's search syntax. We also need to advertise that you can use Lucene's syntax to constrain searches. A good example is https://metacpan.org/search?q=plack+author%3ADAGOLDEN
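
To make the constrained-search syntax concrete, here is a minimal Python sketch that builds such a URL. The helper name `search_url` is made up for illustration; only the `q` parameter and the `author:` constraint come from the example link above.

```python
from urllib.parse import urlencode

BASE = "https://metacpan.org/search"

def search_url(terms, **fields):
    """Build a search URL with optional field:value constraints (illustrative helper)."""
    parts = list(terms)
    # Each keyword argument becomes a Lucene-style field:value constraint.
    parts += [f"{field}:{value}" for field, value in fields.items()]
    return f"{BASE}?{urlencode({'q': ' '.join(parts)})}"

print(search_url(["plack"], author="DAGOLDEN"))
# prints https://metacpan.org/search?q=plack+author%3ADAGOLDEN
```

The colon is percent-encoded as `%3A` and the space becomes `+`, which is why the link looks less readable than the query typed into the search box.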

@tsibley
Contributor

tsibley commented Jul 11, 2014

module.name:MooseX is accepted, but I don't get why it only returns "MooseX" and not any of the subclasses. I thought term queries/filters were contains not equals?

@rwstauner
Contributor

Putting field:val in the search box ends up doing a query_string search (that's what recognizes the operators), not a term filter.

To clarify term filters, they are for exact values (like not_analyzed strings).

The reference docs do use the word "contain" (which isn't very clear) but they also say "not_analyzed":

Matches documents that have fields that contain a term (not analyzed).

which means it won't be tokenized (hence the exact match requirement).

The book ("definitive guide") is slightly more specific:

The term filter is used to filter by exact values, be they numbers, dates, booleans, or not_analyzed exact value string fields.

Also note that the "term" operator doesn't analyze the input, so for example
{"filter": {"term": {"file.module.name.analyzed": "MooseX"}}} returns no results, but
{"filter": {"term": {"file.module.name.analyzed": "moosex"}}} returns several relevant matches.

However you can't see that difference using the search box because of the query_string query (which does analyze the input). So, since we have several "fields" for module name, using an analyzed field can get you what you want: module.name.analyzed:MooseX
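
The difference between the two lookups can be modelled in a few lines of Python. This is a toy sketch, not Elasticsearch: the `analyze` function (split on punctuation, lowercase) is an assumed stand-in for whatever analyzer the real field uses, and the module list is just sample input.

```python
import re

def analyze(text):
    """Toy analyzer: split on non-alphanumerics and lowercase (an assumption,
    not the real MetaCPAN analyzer configuration)."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

MODULES = ["MooseX", "MooseX::Types", "MooseX::Getopt", "Moose"]

# Inverted index for an "analyzed" field: token -> set of module names.
analyzed_index = {}
for name in MODULES:
    for token in analyze(name):
        analyzed_index.setdefault(token, set()).add(name)

# A not_analyzed field stores the whole value as a single term.
exact_index = {name: {name} for name in MODULES}

def term_lookup(index, term):
    """Term filter: the input is used as-is, with no analysis."""
    return sorted(index.get(term, set()))

def query_string_lookup(index, text):
    """query_string-style lookup: the input is analyzed before matching."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return sorted(hits)

print(term_lookup(exact_index, "MooseX"))             # ['MooseX']
print(term_lookup(analyzed_index, "MooseX"))          # []
print(term_lookup(analyzed_index, "moosex"))          # ['MooseX', 'MooseX::Getopt', 'MooseX::Types']
print(query_string_lookup(analyzed_index, "MooseX"))  # same three matches
```

The term lookup against the analyzed index fails for `MooseX` only because the index holds the lowercased token `moosex`; the query_string-style lookup hides that by lowercasing the input first.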

@dagolden

This is not user friendly.

Instead of making us jump through hoops to know, understand and remember your data model and search engine behaviors, why not just intercept the search box contents before they go to Lucene and create the right search for us?

module:Foo → match module names containing "Foo"
module:^Foo → match module names starting with "Foo"

Or, if you don't like colon separators, do something like DDG: !module Foo
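
A rough Python sketch of what that interception could look like follows. Everything in it is hypothetical: the function names, the case-insensitive matching, and the decision to accept `^` after `!module` are assumptions for illustration, and none of it reflects the actual MetaCPAN code.

```python
import re

# Hypothetical patterns for the two proposed syntaxes.
PREFIX_FORM = re.compile(r"^module:(\^?)(\S+)$")
BANG_FORM = re.compile(r"^!module\s+(\^?)(\S+)$")

def parse_module_search(query):
    """Return (mode, name) for module searches, or None to fall through
    to the normal full-text search."""
    query = query.strip()
    match = PREFIX_FORM.match(query) or BANG_FORM.match(query)
    if not match:
        return None
    anchored, name = match.groups()
    return ("starts_with" if anchored else "contains", name)

def matches(mode, name, module):
    """Case-insensitive test of one module name against a parsed search."""
    module, name = module.lower(), name.lower()
    return module.startswith(name) if mode == "starts_with" else name in module

modules = ["Foo", "Foo::Bar", "MooseX::Foo", "Baz"]
for query in ["module:Foo", "module:^Foo", "!module Foo", "foo bar"]:
    parsed = parse_module_search(query)
    if parsed is None:
        print(query, "-> full-text search")
    else:
        print(query, "->", [m for m in modules if matches(*parsed, m)])
```

Queries that match neither pattern return `None`, so the existing search path stays untouched for everything else.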

@oalders
Member Author

oalders commented Jul 11, 2014

My preference here would be to go with the colon separators because that's what people are used to. We could use some other character for stuff that people want to pass directly to ES/Lucene. Aside from the distribution search, I don't think we use this syntax at all. Nobody really seems to be aware of it and it would follow that really nobody is taking advantage of this. Also, you really need to know a fair bit about the internals to take advantage of this.

So, I'd say, let's make this as friendly as possible. If someone wants the old behaviour, they can preface the query with some syntax that doesn't get in the way.

@rwstauner
Contributor

I wasn't suggesting that people should know how to work that (or that it was good enough), I was just trying to clarify what Thomas was experiencing.

We actually do have some special casing for author: and dist: (and distribution:) and I agree we should add some more (like module:). Intercepting these is fairly easy and continuing to let other fields that we don't capture pass through to lucene will continue to work.
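
As a sketch of the capture-or-pass-through idea (not the site's real code), the Python below pulls out the three prefixes named above and leaves every other token alone. Mapping both `dist:` and `distribution:` to a single key is an assumption, as are the names `CAPTURED` and `split_query`.

```python
# The three special-cased prefixes come from the comment above; the mapping
# of dist: and distribution: to one key is an assumption for illustration.
CAPTURED = {"author": "author", "dist": "distribution", "distribution": "distribution"}

def split_query(query):
    """Separate captured field:value tokens from text passed through unchanged."""
    filters, passthrough = {}, []
    for token in query.split():
        field, sep, value = token.partition(":")
        if sep and value and field in CAPTURED:
            filters[CAPTURED[field]] = value
        else:
            # Plain words and unrecognised field:value pairs go on untouched.
            passthrough.append(token)
    return filters, " ".join(passthrough)

print(split_query("plack author:DAGOLDEN"))
# ({'author': 'DAGOLDEN'}, 'plack')
print(split_query("dist:Moose module.name.analyzed:MooseX roles"))
# ({'distribution': 'Moose'}, 'module.name.analyzed:MooseX roles')
```

Adding `module:` would then be one more entry in the table plus whatever rewriting that prefix needs.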

@rwstauner
Contributor

There is also some DDG-like operator in there, but I'm not sure how that works.

We obviously could use a page to explain what's available and how it works.

@oalders FWIW, In the search results there's a link that says "search in distribution" which just redoes the current search with an added dist:blah on it. I'm not implying that anybody knows how to use it directly, but the site itself actually does make use of it :-)

@oalders
Member Author

oalders commented Jul 11, 2014

@rwstauner Yeah, that's what I meant with "Aside from the distribution search, I don't think we use this syntax at all". :)

@rwstauner
Contributor

Yeah, I guess so. I was looking at the next sentence and thinking you were considering not needing to keep it if it wasn't used much.

@tsibley
Contributor

tsibley commented Jul 11, 2014

@rwstauner Thanks for the great explanation. I was looking at the Lucene docs, which I swear mentioned something about being contains not equals, but I don't see it now. And then to make it more confusing I conflated foo:bar in a query string as being the same as "term": { "foo": "bar" }. Thanks for straightening me out!

I wrote the user-friendly version which munges "module:..." as PR #1246.

@mattp-
Contributor

mattp- commented Jul 23, 2014

Vanity searches for PAUSE IDs seem to return weird results for modules:
https://metacpan.org/search?q=mattp
why is DDP::s returned? https://metacpan.org/search?q=data%3A%3Aprinter%3A%3Ascoped shows the proper main pod for Data::Printer::Scoped.

You can see a similar result searching for https://metacpan.org/search?q=FREW

@frioux

frioux commented Jul 25, 2014

Searching for GetOpt yields a weird, apparently unsorted set of output.

@oalders
Member Author

oalders commented Dec 9, 2014

@andreeap Despite the fact that this ticket is on metacpan-web, most of the fixes here would involve a deep dive into Elasticsearch rather than front end work, so this is perfect for the scope of your OPfW time. You can pick searches from this list which interest you, create new issues for them and then link those issues back to this one so that we can track their progress.

@oalders
Member Author

oalders commented Dec 9, 2014

I should note that a bunch of search-related issues can also be found here https://github.com/CPAN-API/metacpan-web/labels/group:Search

@frioux

frioux commented Dec 9, 2014

dbix::class datemethods finds nothing at all

@frioux

frioux commented Jan 14, 2015

@its-johnt

I'm trying to find something to parse XML, so I searched for "xml". Most of the first results are from modules with last uploads circa 2000. Giving more weight to modules with more recent upload dates may be helpful.
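
One possible shape for such a weighting, sketched in Python under stated assumptions: the three-year half-life, the multiplicative combination, and the placeholder module names and scores are all invented for illustration and say nothing about how MetaCPAN actually scores results.

```python
from datetime import date

def recency_weight(last_upload, today, half_life_days=3 * 365):
    """Weight that halves every half_life_days since the last upload."""
    age_days = max((today - last_upload).days, 0)
    return 0.5 ** (age_days / half_life_days)

def rerank(results, today):
    """Sort (name, text_score, last_upload) tuples by text_score * recency."""
    return sorted(
        results,
        key=lambda r: r[1] * recency_weight(r[2], today),
        reverse=True,
    )

# Placeholder data: the names and scores are made up.
results = [
    ("Example::OldXML", 1.2, date(2000, 6, 1)),
    ("Example::NewXML", 1.0, date(2014, 11, 1)),
]
for name, score, uploaded in rerank(results, today=date(2015, 1, 14)):
    print(name, score, uploaded)
```

A decay like this demotes stale uploads without excluding them, so a stable module that has simply not needed a release would still need some other signal to stay visible.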

@shlomif
Contributor

shlomif commented Jun 11, 2015

From IRC:

This search - https://metacpan.org/search?q=uri - places a module from 1998 with no upvotes or reviews above URI.pm which has 71 upvotes and three 5-star reviews. Furthermore, https://metacpan.org/search?q=XSLT does not find XML::LibXSLT anywhere in the top results.
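
If upvotes were folded into the ranking, a dampened boost is one option. The Python sketch below is an assumption-laden illustration: the formula and the text scores are invented, and only the vote counts (71 versus none) come from the report above.

```python
import math

def boosted_score(text_score, upvotes):
    """Scale the text score by 1 + log(1 + upvotes), so early votes count most."""
    return text_score * (1.0 + math.log1p(upvotes))

# Hypothetical text scores; the vote counts are taken from the report above.
unvoted_1998_module = boosted_score(1.5, 0)
uri_pm = boosted_score(1.0, 71)
print(unvoted_1998_module, uri_pm)
print(uri_pm > unvoted_1998_module)
```

The logarithm keeps a heavily upvoted distribution from drowning out every textual match, while still lifting it above an unvoted result with a slightly better text score.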

@ranguard
Member

See also:

#1373
#1372
#1253
#905
#1265

@oalders
Member Author

oalders commented Jul 22, 2015

[11:27:44]  <ether> https://metacpan.org/search?q=Extutils%3A%3ADepends returns its first match as the wrong distribution
[11:27:51]  <ether> I think this may have come up before?
[11:28:00]  <ether> the indexed module should be ranked first in search results
[11:34:46]  <haarg> caps
[11:42:49]  <leont> He has comaint on it, so it doesn't trigger unauthorized
[11:44:39]  <haarg> for search we really should be ignoring case

@oalders
Member Author

oalders commented Jul 24, 2015

[18:16:18]  <ether> more on search results - searching for "YAML-Tiny" results in that distribution in second place, with Tiny::YAML in #1.
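
One way to read both the Extutils::Depends and the YAML-Tiny reports is that an exact name match is being outranked. A minimal Python sketch of "exact match first, ignoring case" follows; treating `-` and `::` as the same separator is an assumption, the function names are hypothetical, and `Some::Other::Module` is a placeholder result.

```python
def normalize(name):
    """Fold case and treat '-' and '::' as the same separator (an assumption)."""
    return name.replace("::", "-").lower()

def rank(query, results):
    """Stable sort that moves exact (case-insensitive) name matches to the front."""
    wanted = normalize(query)
    # False sorts before True, so exact matches come first; ties keep their order.
    return sorted(results, key=lambda name: normalize(name) != wanted)

print(rank("Extutils::Depends", ["Some::Other::Module", "ExtUtils::Depends"]))
# ['ExtUtils::Depends', 'Some::Other::Module']
print(rank("YAML-Tiny", ["Tiny::YAML", "YAML::Tiny"]))
# ['YAML::Tiny', 'Tiny::YAML']
```

Because the sort is stable, results that are not exact matches keep whatever order the search engine gave them.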

@oalders
Member Author

oalders commented Apr 3, 2016

[09:10:03] <kentnl> [07:16:19] https://metacpan.org/search?q=JSON&search_type=modules # I'm not sure what to say here, but for some reason, JSON::MaybeXS doesn't rank, despite having a 5-star review rating and 26 ++'s

@pink-mist

If you search for either perlvar or perlrun you get a result from PodSimplify from 1996 instead of the latest perl release as first result; perl's perlvar and perlrun pages are the second result for their respective searches.

@Grinnz
Contributor

Grinnz commented Dec 8, 2016

https://metacpan.org/search?q=overload

In a search for overload, the first result is the correct overload module in core, but its link https://metacpan.org/pod/overload goes to a very unrelated module.
