-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update fetchLinkSuggestions to sort results by relevancy #62397
Conversation
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Unlinked AccountsThe following contributors have not linked their GitHub and WordPress.org accounts: @scrobbleme. Contributors, please read how to link your accounts to ensure your work is properly credited in WordPress releases. If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.
To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
Is this really a bug? Should this be backported for WP 6.6? If so, please add the |
I think so. The existing experience is broken. I don't think it needs to be backported to 6.6. |
This comment was marked as outdated.
This comment was marked as outdated.
Thanks for this; As I understand, this would also close #39964 |
- Rename __experimentalFetchLinkSuggestions to fetchLinkSuggestions. - Rewrite fetchLinkSuggestions in TypeScript. - Sort results by relevancy using cosine similiarty between term frequency vectors.
2b7a770
to
16628ee
Compare
Size Change: +922 B (+0.05%) Total Size: 1.76 MB
ℹ️ View Unchanged
|
Rebased this and fixed the tests. It's ready for review.
The simpler approach achieves good enough results so I am happy to switch to it if cosine similarity is too difficult to understand or more than we need. Let me know, I won't be offended. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been testing this with lots and lots of taxonomy terms, posts and pages in English, German and Japanese.
It's hard to see the benefit of these changes at first — trunk behaves mostly the same until you have 100s of pages/posts with similar keywords.
Here's me looking for "Paris"
Kapture.2024-06-13.at.14.37.58.mp4
Great stuff - overall I think it's a big improvement to have tags/cats surfaced this way.
Chat-GPT and "ELI5" set me straight 😄 But it is very clever, so I paused at whether this will be accessible to folks who want to iterate on the feature. Is the "simpler" approach less code? Does it perform as well? Easier to read? If the answer is "yes" to these questions I'd probably consider using the simpler approach, or at least stating a convincing reason to go with cosine similarity, e.g., it produces much better results more of the time. What do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work here, and thanks for the detailed explanations for how this works, and for the test wp cli commands 👍
It didn't work as well for me in testing, though, because cosine similarity will give smaller titles that are similar to the query an edge over long titles that contain lots of irrelevant words in addition to the query.
IMO I think the advantage of smaller titles that more directly match the query could be worth it, especially if you wind up having a simple category like Travel
where you'd expect it to be at the top. With this PR applied, it's working very nicely for me:
Whereas on trunk
I get pages first of all, and categories and tags are way down at the bottom of the list.
Trunk top of list | Trunk bottom of list |
---|---|
Just left a few questions, but overall I think this is a big improvement, and I also like the idea of stabilising the API for it. The function has been around for a long time and it seems general purpose enough to be useful in situations outside of Gutenberg to me (i.e. I could imagine a plugin in an admin area wanting a quick way of fetching link suggestions).
Would it be worth getting a second opinion / see if anyone objects to the more complex cosine similarity approach?
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
I tried a scoring approach where we simply divide the number of tokens in the title that are also in the search query, divided by the total number of tokens in the title. This means shorter titles receive higher scores, fixing the problem I noted in #62397 (comment). It seems to work well enough in my testing and is much simpler to understand. I'd appreciate if you can test it with various search terms, etc. again though 🙂 |
The more I look at the API of |
This is still testing nicely for me after the latest change 👍 Is this since we're still iterating on the logic? Sounds reasonable to defer stabilising it for now. |
Updated PR description. This is altogether a much simpler PR now 😅
No but that's not a bad reason either. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still testing nicely for me, and I like that the logic is simpler, easier to read and to maintain. Not for now, but another potential future enhancement could be to look at partial matches with tokens / fuzzy search, too, as I need to type the whole word "travel" before the travel category gets to the top of the list:
"trave"
"travel"
This is already a big improvement, though, so just jotting this down as a thought, nothing to worry about for now.
LGTM! 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran through similar tests as before. 500+ database entities with different languages.
Tags and categories are surfaced as expected. Snappy results.
Kapture.2024-06-14.at.16.22.11.mp4
Very nice. 🚢
Thanks for bearing 🐻 with me while I ventured unnecessarily deep into rabbit 🐰 holes! |
…2397) * Update fetchLinkSuggestions to sort results by relevancy - Rename __experimentalFetchLinkSuggestions to fetchLinkSuggestions. - Rewrite fetchLinkSuggestions in TypeScript. - Sort results by relevancy using cosine similiarty between term frequency vectors. * Fix tsc errors * Update @wordpress/core-data imports * Make tokenize unicode aware * Remove unnecessary mutation * Add tests for all the helper functions * Simpler scoring function * Keep experimental for now Unlinked contributors: scrobbleme. Co-authored-by: noisysocks <noisysocks@git.wordpress.org> Co-authored-by: ellatrix <ellatrix@git.wordpress.org> Co-authored-by: ntsekouras <ntsekouras@git.wordpress.org> Co-authored-by: ramonjd <ramonopoly@git.wordpress.org> Co-authored-by: andrewserong <andrewserong@git.wordpress.org> Co-authored-by: skorasaurus <skorasaurus@git.wordpress.org>
What?
Fixes #56478.
Updates
fetchLinkSuggestions
, which is used byLinkControl
, which is used by the popover that appears when you insert a link, to sort its results by relevancy to the search query that the user typed in.Why?
When inserting a link, the user can search for posts, pages, tags, categories, post formats, and media. The search is done via the
/wp/v2/search
REST API endpoint.Unfortunately WordPress can only search one database table at a time. For example it can't search
wp_posts
andwp_terms
using one SQL query. The REST API does not hide this limitation and forces you to call/wp/v2/search
with just onetype
param.To get around this limitation,
fetchLinkSuggestions
currently makes 4 requests (posts, taxonomies, post formats, media) at once using aPromise.all
, and then concatenates the results together. There is a max of 20 results per request for a combined max of 80 results. The first 20 results of the combined results are shown to the user.There's a problem with this approach. Say the user is searching for a category named "Travel Tips" and that there are 20 pages or posts containing the words "travel" or "tips". The "Travel Tips" category will never be shown because it is crowded out by the 20 posts that appear in the combined results before any single category appears.
How?
My fix for this issue is to sort the combined results before we take the first 20 and show them to the user.
I'm sorting the results by how similar the title is to the search query that the user provided. This is done using cosine similarity. We treat the search query as one document and the result title as another document. Then, we build term frequency map vectors for each document. Finally, we calculate the relevancy score by taking the cosine similarity of the two vectors.This PR now sorts the results by scoring each result where the score is the number of tokens in the title that are also in the search query, divided by the total number of tokens in the title. This gives us a score between 0 and 1, where 1 is a perfect match. This achieves good enough results but is much simpler to understand than the previous approach described above.
Alternative approaches
Ideally this would all be handled at the database level using full text search. I don't think we can assume that every WordPress installation's MySQL database has full text search enabled, however.
We could look at doing logic similar to what's in this PR at the server level. I'm not sure if we should, though. The REST API is arguably correct to not encapsulate this limitation. Not all clients will want to order and combine results the same way.
Doing this in the client doesn't prevent us from moving the logic to the server in the future, so it's a good place to start.
I tried simple keyword matching as an alternative algorithm for ranking results. This is where you award 1 point per keyword that the title and search term have in common. It didn't work as well for me in testing, though, because cosine similarity will give smaller titles that are similar to the query an edge over long titles that contain lots of irrelevant words in addition to the query.House keeping
This PR also contains some house keeping while I'm in this part of the codebase:
Rename__experimentalFetchLinkSuggestions
tofetchLinkSuggestions
.fetchLinkSuggestions
in TypeScript. (It was already partially JSDoc typed.)Testing Instructions
You really don't notice this bug unless you're testing with real data and have more than 20 posts, tags, etc.
If you don't have any real data, you use WP CLI to create some realistic-enough posts, pages, tags, and categories. Here's some test data that I had ChatGPT spit out:
Test data
Now:
Screenshots or screencast
Before:
Kapture.2024-06-07.at.16.16.16.mp4
After:
Kapture.2024-06-07.at.16.17.18.mp4