Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1876208 - API for dismissing suggestions #6147

Merged
merged 1 commit into from
Mar 20, 2024

Conversation

bendk
Copy link
Contributor

@bendk bendk commented Feb 27, 2024

(This is on top of #5546, let's merge that one first)

Dismissed suggestion Urls are stored in the database and not returned again in subsequent queries.

This currently works with most providers, but not Yelp or Weather.

Pull Request checklist

  • Breaking changes: This PR follows our breaking change policy
    • This PR follows the breaking change policy:
      • This PR has no breaking API changes, or
      • There are corresponding PRs for our consumer applications that resolve the breaking changes and have been approved
  • Quality: This PR builds and tests run cleanly
    • Note:
      • For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
      • If this pull request includes a breaking change, consider cutting a new release after merging.
  • Tests: This PR includes thorough tests or an explanation of why it does not
  • Changelog: This PR includes a changelog entry in CHANGELOG.md or an explanation of why it does not need one
    • Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
  • Dependencies: This PR follows our dependency management guidelines
    • Any new dependencies are accompanied by a summary of the due dilligence applied in selecting them.

Branch builds: add [firefox-android: branch-name] to the PR title.

@codecov-commenter
Copy link

codecov-commenter commented Feb 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.09%. Comparing base (48798d2) to head (b934153).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6147   +/-   ##
=======================================
  Coverage   84.09%   84.09%           
=======================================
  Files         117      117           
  Lines       15627    15627           
=======================================
  Hits        13141    13141           
  Misses       2486     2486           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bendk
Copy link
Contributor Author

bendk commented Feb 27, 2024

We've decided to stop storing unparsable records and simply always drop and recreate the suggestions DB whenever the schema is updated.

@bendk bendk closed this Feb 27, 2024
@linabutler
Copy link
Contributor

@bendk Hmmm, was that comment and close meant for #6107? 😅

@bendk
Copy link
Contributor Author

bendk commented Feb 28, 2024

Yes it was, I was wondering where it went and why I had to close that issue twice 🤣

@bendk bendk reopened this Feb 28, 2024
@bendk bendk force-pushed the suggest-dismissal branch 5 times, most recently from fc521b1 to 3f6684a Compare March 12, 2024 13:51
@bendk bendk force-pushed the suggest-dismissal branch 2 times, most recently from 6ae8733 to 856a73f Compare March 12, 2024 17:38
Copy link
Member

@ncloudioj ncloudioj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, r=wc.

components/suggest/src/db.rs Outdated Show resolved Hide resolved
let mut hasher = Md5::new();
hasher.update(s.as_bytes());
let hash_value = i128::from_le_bytes(hasher.finalize().into());
Ok(Some(hash_value as i64))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: let's grab the leading 64 bits explicitly. Using as to cast from a larger integer to a smaller integer will truncate which could yield weird negative results.

    println!("{}", (i128::MAX >> 64) as i64);  // => 9223372036854775807

    println!("{}", i128::MAX as i64);               // => -1
    println!("{}", (i128::MAX - 1) as i64);       // => -2
    println!("{}", (i128::MAX - 2) as i64);      // => -3
    println!("{}", (i128::MAX - 3) as i64);      // => -4
Suggested change
Ok(Some(hash_value as i64))
Ok(Some((hash_value >> 64) as i64))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the code to do this, but I think there still could be negatives (for example u128::MAX gets converted to -1). We could shift by 65 bits to avoid this, but I'd rather keep the extra hash bit and live with potentially negative numbers. WDYT?

///
/// Dismissed suggestions will not be returned again
#[handle_error(Error)]
pub fn dismiss_suggestion(&self, suggestion: Suggestion) -> SuggestApiResult<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @0c0w3, FYI, this is the dismissal API exposed by the component. It will take a Suggestion as the only argument. Any concerns about that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it a bit more (sorry I didn't think of this earlier!), the one downside I can think of is that the caller needs to hold on to the entire Suggestion for as long as it's visible.

We can do that on Android and iOS—there's a metadata map we can use to smuggle it on Android, and we can add a field on iOS...

But since the suggestion carries the raw icon data, we're duplicating the icon data in memory: once for the buffer in the Suggestion, and once for the decoded android.graphics.Bitmap / UIImage instance. I think we'll also incur a third copy of that icon data when we pass it back over the FFI to dismiss_suggestion.

This would be a backward-incompatible change, but I wonder if it's worth making Url a dictionary or an opaque interface type. Alternatively, maybe we could add a dismissal_id field—either a string, or an opaque interface—that carries the dismissal URL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I was initially thinking about the special handling we did for desktop and wondered if it's possible to restore the Suggestion object and pass it back to the component with all needed information, especially, the enum variant type. Lina's comment makes this even more questionable.

How about we just take a String instead as the only input for this API? Or, if we expect to include other fields for a dismissal in the future, we can add an interface Dismiss with a string url field for now and use this to capture the context for this API.

Copy link
Contributor Author

@bendk bendk Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of having the consumer send us a URL rather than the full Suggestion.

I think now that https://bugzilla.mozilla.org/show_bug.cgi?id=1875848 has been implemented, we can do a type check to figure out the variant when we try to extract the URL. This still means some duplicated code, but it's not that bad.

Going forward, I think we could probably improve the API shape. Now that mozilla/uniffi-rs#1869 is merged, it should be possible to define enums that store objects which gives us many more options. Maybe the icon field could be an object. I think that would allow us to pass Suggestion instances around efficiently (assuming that's the only field that's storing a large amount of data).

@bendk bendk force-pushed the suggest-dismissal branch from 856a73f to f716d71 Compare March 13, 2024 17:58
}

mod sql_fns {
use md5::{Digest, Md5};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my Places peer hat on 🧙🏼‍♀️ I should call out that the Desktop security engineering team has been working on removing deprecated hashes from Firefox, including MD5. Places recently switched to using SHA-256 for cases where it previously used MD5.

Even though we have a known and trusted set of URLs here, I think introducing a new use of a deprecated hash function is still going to raise a few eyebrows when we vendor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the average length of suggestions.url is ~64 bytes. SHA-256 digests are 32 bytes, so it looks like it's still a bit more space-efficient on average to use a hash instead of the full URL.

That said, though, I think storing the full URL should also be OK, especially if it turns out we need it because of the %%YYYYMMDDHH% replacement parameters.

In general, I think a hash is going to be more efficient than a long text primary key or index when there are a lot of rows, as in Places (tens of thousands). I don't think we expect dismissed suggestions to be anywhere close to that size, though? For dozens or hundreds of rows, storing the URL directly is A-OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Between moving to SHA-256 and using the full URL, I think we should just use the full one. As you point out the space difference is small and we shouldn't expect many URLs to be stored in the table. @ncloudioj how do you feel about it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up, Lina!

I am still leaning towards hashing those dismissed URLs to be consistent with what we are doing for other similar cases in the browser (e.g. we MD5 hash the dismissed URLs for New Tab; we SHA1 hash dismissed URLs for Firefox Suggest Desktop). We can pick SHA-256 to stay away from those insecure hashers. If we don't store the whole digest, we could grab a smaller portion of it (e.g. 16 bytes) which should offer enough entropy given the relatively smaller input size.

Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could grab a smaller portion of it (e.g. 16 bytes) which should offer enough entropy given the relatively smaller input size.

That was my other question 😅—it looks like you're only keeping the first 8 bytes for the primary key, so any collisions will exclude all the colliding URLs. Is that right? That still sounds OK for our case, but definitely subtle.

Even if we used the full 32-byte hash, though, it's still going to be shorter than any URLs—the shortest URL we have is https://en.wikipedia.org/wiki/LVM3, which is 34 bytes.

For comparison, Places uses a non-cryptographic 48-bit hash, with a non-unique index—so if there are any collisions, it'll still compare URLs, but the index reduces the search space from tens of thousands to a handful. But that scheme pays off for larger tables—for smaller tables like ours, it can end up increasing the on-disk size.

Do you happen to know why we store hashes on Desktop, @ncloudioj and @0c0w3? Is it mostly for size benefits, or are there other reasons like privacy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we start with raw URLs and consider adding the hashes as a performance/privacy improvement later (although I have some questions on how much of a privacy gain this is).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, let's revisit this later.

although I have some questions on how much of a privacy gain this is.

Curious about your questions, can you share?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggestion list is public and also fairly small. So shouldn't an attacker be able to create a map of hashed URLs -> full URLs. This won't be perfect if there's a hash collision, but it seems like this would get around most of the privacy protections. Maybe adding a salt component would fix this though? Not sure on that one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Your point makes total sense in this context.

I think for all those dismissal scenarios, we were not attempting to apply standard mechanisms (salting & encrypting) to secure the data or preserve users' privacy in the best effort. Rather, we meant to convert the data in a way such that the processed data is not obvious to read. Imagine one dismissed a URL from the New Tab page because they wanted to "hide" it, storing it in plain in another user visible place could cause some surprises & awkwardness later. Other benefits (such as smaller footprint & easier to index/query) are a plus so I think that's why decided to hash the raw URLs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we start with raw URLs

That sounds great to me!

@bendk bendk force-pushed the suggest-dismissal branch from f716d71 to 8cc30fe Compare March 13, 2024 19:19
Copy link
Contributor

@linabutler linabutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general shape of this looks great—thank you so much @bendk!

components/suggest/src/db.rs Show resolved Hide resolved
components/suggest/src/db.rs Show resolved Hide resolved
}

mod sql_fns {
use md5::{Digest, Md5};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could grab a smaller portion of it (e.g. 16 bytes) which should offer enough entropy given the relatively smaller input size.

That was my other question 😅—it looks like you're only keeping the first 8 bytes for the primary key, so any collisions will exclude all the colliding URLs. Is that right? That still sounds OK for our case, but definitely subtle.

Even if we used the full 32-byte hash, though, it's still going to be shorter than any URLs—the shortest URL we have is https://en.wikipedia.org/wiki/LVM3, which is 34 bytes.

For comparison, Places uses a non-cryptographic 48-bit hash, with a non-unique index—so if there are any collisions, it'll still compare URLs, but the index reduces the search space from tens of thousands to a handful. But that scheme pays off for larger tables—for smaller tables like ours, it can end up increasing the on-disk size.

Do you happen to know why we store hashes on Desktop, @ncloudioj and @0c0w3? Is it mostly for size benefits, or are there other reasons like privacy?

components/suggest/src/suggestion.rs Show resolved Hide resolved
@bendk bendk force-pushed the suggest-dismissal branch from b39f9b8 to f600a67 Compare March 14, 2024 16:58
@bendk
Copy link
Contributor Author

bendk commented Mar 14, 2024

Updated this PR to not hash the URLs for now at least. AFAICT, these are the remaining open questions:

  • Is the external API okay (passing a Suggestion object into dismiss_suggestion)?
  • pub vs pub(crate)
  • hashing only 63-bits to avoid signed integer issues (not currently relevant for this feature, but I figure we might as well check in the hash function in case we want to use it in the future).

@0c0w3 @linabutler @ncloudioj what do you think about the remaining questions?

Copy link
Member

@ncloudioj ncloudioj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Is the external API okay (passing a Suggestion object into dismiss_suggestion)?

The reason I was asking is b/c those comments, and wanted to confirm with Drew to see if it's possible for the consumer to feed the Suggestion object back for dismissal.

hashing only 63-bits to avoid signed integer issues (not currently relevant for this feature, but I figure we might as well check in the hash function in case we want to use it in the future).

Per Lina's comment, we'd like to avoid those legacy insecure hashers (e.g. MD5) in Places, shall we follow that here as well?

@@ -127,7 +127,7 @@ pub const SQL: &str = "
-- Just store the MD5 hash of the dismissed suggestion. The collision rate is low and the
-- impact of a collision is not showing a suggestion, which is not that bad.
CREATE TABLE dismissed_suggestions (
url_hash INTEGER PRIMARY KEY
url TEXT PRIMARY KEY
Copy link
Member

@ncloudioj ncloudioj Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we're storing raw URLs, let's drop the above comment about hashing.

@bendk
Copy link
Contributor Author

bendk commented Mar 14, 2024

Per Lina's comment, we'd like to avoid those legacy insecure hashers (e.g. MD5) in Places, shall we follow that here as well?

Good point. I just went ahead and removed that code, since it seems like we don't know exactly how we want to handle hashing in the future.

@bendk bendk force-pushed the suggest-dismissal branch from f600a67 to b238c57 Compare March 14, 2024 18:09
@linabutler
Copy link
Contributor

Thank you so much for working through all this, @bendk!

I did think of one downside for passing the full Suggestion to dismiss_suggestion that I should've thought of earlier, but this review thread is getting pretty long, and the rest of the plumbing looks good. We can always land this PR as-is, and iterate on the API in a follow-up.

@bendk bendk force-pushed the suggest-dismissal branch 2 times, most recently from beea5c0 to b1576a6 Compare March 15, 2024 20:32
Copy link
Contributor

@0c0w3 0c0w3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ben!

Would it be a bad idea to have a single check-for-dismissals step at the end of fetch_suggestions()? It seems a little onerous to have to add a subquery (or whatever logic) per suggestion type, especially as we start to scale up new types. Doing one more query at the end doesn't seem too bad especially since dismissed_suggestions is keyed on the URL.

This doesn't handle Yelp suggestions, which is fine for a follow-up IMO. We'll need to define what constitutes a raw URL for Yelp since the way Yelp builds URLs is relatively complex (location params, query modifiers).

@@ -123,6 +123,12 @@ interface SuggestStore {
[Throws=SuggestApiError]
sequence<Suggestion> query(SuggestionQuery query);

[Throws=SuggestApiError]
void dismiss_suggestion(string suggestion_url);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document here that suggestion_url should be the raw URL and call it raw_url or suggestion_raw_url? It would be very easy for consumers to pass in cooked URLs and break their dismissal feature.

I don't know if people consult the udl when they're implementing features -- I sometimes do -- so maybe it won't make much of a difference in practice. I'm almost tempted to suggest introducing a new type for raw URLs or a dismissal token or something. :-D (or this function could take the suggestion itself)

| Self::Wikipedia { url, .. }
| Self::Amo { url, .. }
| Self::Yelp { url, .. }
| Self::Mdn { url, .. } => Some(url),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Desktop adds UTM params to Pocket, Amo, Yelp, and Mdn URLs. I don't know whether mobile does too but if not it's conceivable, so we should look into moving those into Rust so we have proper raw URLs for those types too and consumers don't need to strip off UTM params before dismissing URLs.

Beyond that, we'll need to work out what "raw URL" means for Yelp, as I mentioned.

These are concerns for follow-ups but I think it's something for desktop (at least) to watch out for as we switch over to this new API.

@0c0w3 0c0w3 requested review from 0c0w3 and removed request for 0c0w3 March 16, 2024 02:12
Dismissed suggestion Url hashes are stored in the database and the
suggestions are not returned again in subsequent queries.

This currently works with most providers, but not Yelp or Weather.
@bendk bendk force-pushed the suggest-dismissal branch from b1576a6 to b934153 Compare March 18, 2024 21:35
@bendk
Copy link
Contributor Author

bendk commented Mar 18, 2024

Would it be a bad idea to have a single check-for-dismissals step at the end of fetch_suggestions()? It seems a little onerous to have to add a subquery (or whatever logic) per suggestion type, especially as we start to scale up new types. Doing one more query at the end doesn't seem too bad especially since dismissed_suggestions is keyed on the URL.

This seems like a good idea to me, but do we need to worry about the limits? I'm slightly paranoid that the dismissed suggestions will push non-dismissed suggestions down the list and cause them to be filtered out.

@bendk
Copy link
Contributor Author

bendk commented Mar 19, 2024

AFAICT, there's no major blockers left. How do others feel about merging this one, then following up on:

  • Simplifying the code by removing the sub-queries on each suggestion
  • Better system to identify suggestions

Are there any other remaining questions that I'm missing?

@ncloudioj
Copy link
Member

Sounds good to me. Go ahead! :shipit:

@bendk bendk added this pull request to the merge queue Mar 20, 2024
Merged via the queue into mozilla:main with commit 5f00479 Mar 20, 2024
16 checks passed
@bendk bendk deleted the suggest-dismissal branch March 20, 2024 15:18
@bendk
Copy link
Contributor Author

bendk commented Mar 20, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants