-
Notifications
You must be signed in to change notification settings - Fork 230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1876208 - API for dismissing suggestions #6147
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #6147 +/- ##
=======================================
Coverage 84.09% 84.09%
=======================================
Files 117 117
Lines 15627 15627
=======================================
Hits 13141 13141
Misses 2486 2486 ☔ View full report in Codecov by Sentry. |
We've decided to stop storing unparsable records and simply always drop and recreate the suggestions DB whenever the schema is updated. |
Yes it was, I was wondering where it went and why I had to close that issue twice 🤣 |
fc521b1
to
3f6684a
Compare
6ae8733
to
856a73f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, r=wc.
components/suggest/src/schema.rs
Outdated
let mut hasher = Md5::new(); | ||
hasher.update(s.as_bytes()); | ||
let hash_value = i128::from_le_bytes(hasher.finalize().into()); | ||
Ok(Some(hash_value as i64)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: let's grab the leading 64 bits explicitly. Using as
to cast from a larger integer to a smaller integer will truncate which could yield weird negative results.
println!("{}", (i128::MAX >> 64) as i64); // => 9223372036854775807
println!("{}", i128::MAX as i64); // => -1
println!("{}", (i128::MAX - 1) as i64); // => -2
println!("{}", (i128::MAX - 2) as i64); // => -3
println!("{}", (i128::MAX - 3) as i64); // => -4
Ok(Some(hash_value as i64)) | |
Ok(Some((hash_value >> 64) as i64)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the code to do this, but I think there still could be negatives (for example u128::MAX
gets converted to -1
). We could shift by 65 bits to avoid this, but I'd rather keep the extra hash bit and live with potentially negative numbers. WDYT?
components/suggest/src/store.rs
Outdated
/// | ||
/// Dismissed suggestions will not be returned again | ||
#[handle_error(Error)] | ||
pub fn dismiss_suggestion(&self, suggestion: Suggestion) -> SuggestApiResult<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @0c0w3, FYI, this is the dismissal API exposed by the component. It will take a Suggestion
as the only argument. Any concerns about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it a bit more (sorry I didn't think of this earlier!), the one downside I can think of is that the caller needs to hold on to the entire Suggestion
for as long as it's visible.
We can do that on Android and iOS—there's a metadata
map we can use to smuggle it on Android, and we can add a field on iOS...
But since the suggestion carries the raw icon
data, we're duplicating the icon data in memory: once for the buffer in the Suggestion
, and once for the decoded android.graphics.Bitmap
/ UIImage
instance. I think we'll also incur a third copy of that icon data when we pass it back over the FFI to dismiss_suggestion
.
This would be a backward-incompatible change, but I wonder if it's worth making Url
a dictionary or an opaque interface
type. Alternatively, maybe we could add a dismissal_id
field—either a string, or an opaque interface
—that carries the dismissal URL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I was initially thinking about the special handling we did for desktop and wondered if it's possible to restore the Suggestion
object and pass it back to the component with all needed information, especially, the enum variant type. Lina's comment makes this even more questionable.
How about we just take a String
instead as the only input for this API? Or, if we expect to include other fields for a dismissal in the future, we can add an interface Dismiss
with a string url
field for now and use this to capture the context for this API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of having the consumer send us a URL rather than the full Suggestion.
I think now that https://bugzilla.mozilla.org/show_bug.cgi?id=1875848 has been implemented, we can do a type check to figure out the variant when we try to extract the URL. This still means some duplicated code, but it's not that bad.
Going forward, I think we could probably improve the API shape. Now that mozilla/uniffi-rs#1869 is merged, it should be possible to define enums that store objects which gives us many more options. Maybe the icon
field could be an object. I think that would allow us to pass Suggestion instances around efficiently (assuming that's the only field that's storing a large amount of data).
856a73f
to
f716d71
Compare
components/suggest/src/schema.rs
Outdated
} | ||
|
||
mod sql_fns { | ||
use md5::{Digest, Md5}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With my Places peer hat on 🧙🏼♀️ I should call out that the Desktop security engineering team has been working on removing deprecated hashes from Firefox, including MD5. Places recently switched to using SHA-256 for cases where it previously used MD5.
Even though we have a known and trusted set of URLs here, I think introducing a new use of a deprecated hash function is still going to raise a few eyebrows when we vendor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the average length of suggestions.url
is ~64 bytes. SHA-256 digests are 32 bytes, so it looks like it's still a bit more space-efficient on average to use a hash instead of the full URL.
That said, though, I think storing the full URL should also be OK, especially if it turns out we need it because of the %%YYYYMMDDHH%
replacement parameters.
In general, I think a hash is going to be more efficient than a long text primary key or index when there are a lot of rows, as in Places (tens of thousands). I don't think we expect dismissed suggestions to be anywhere close to that size, though? For dozens or hundreds of rows, storing the URL directly is A-OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Between moving to SHA-256 and using the full URL, I think we should just use the full one. As you point out the space difference is small and we shouldn't expect many URLs to be stored in the table. @ncloudioj how do you feel about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the heads up, Lina!
I am still leaning towards hashing those dismissed URLs to be consistent with what we are doing for other similar cases in the browser (e.g. we MD5 hash the dismissed URLs for New Tab; we SHA1 hash dismissed URLs for Firefox Suggest Desktop). We can pick SHA-256 to stay away from those insecure hashers. If we don't store the whole digest, we could grab a smaller portion of it (e.g. 16 bytes) which should offer enough entropy given the relatively smaller input size.
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could grab a smaller portion of it (e.g. 16 bytes) which should offer enough entropy given the relatively smaller input size.
That was my other question 😅—it looks like you're only keeping the first 8 bytes for the primary key, so any collisions will exclude all the colliding URLs. Is that right? That still sounds OK for our case, but definitely subtle.
Even if we used the full 32-byte hash, though, it's still going to be shorter than any URLs—the shortest URL we have is https://en.wikipedia.org/wiki/LVM3
, which is 34 bytes.
For comparison, Places uses a non-cryptographic 48-bit hash, with a non-unique index—so if there are any collisions, it'll still compare URLs, but the index reduces the search space from tens of thousands to a handful. But that scheme pays off for larger tables—for smaller tables like ours, it can end up increasing the on-disk size.
Do you happen to know why we store hashes on Desktop, @ncloudioj and @0c0w3? Is it mostly for size benefits, or are there other reasons like privacy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we start with raw URLs and consider adding the hashes as a performance/privacy improvement later (although I have some questions on how much of a privacy gain this is).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, let's revisit this later.
although I have some questions on how much of a privacy gain this is.
Curious about your questions, can you share?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggestion list is public and also fairly small. So shouldn't an attacker be able to create a map of hashed URLs -> full URLs. This won't be perfect if there's a hash collision, but it seems like this would get around most of the privacy protections. Maybe adding a salt component would fix this though? Not sure on that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Your point makes total sense in this context.
I think for all those dismissal scenarios, we were not attempting to apply standard mechanisms (salting & encrypting) to secure the data or preserve users' privacy in the best effort. Rather, we meant to convert the data in a way such that the processed data is not obvious to read. Imagine one dismissed a URL from the New Tab page because they wanted to "hide" it, storing it in plain in another user visible place could cause some surprises & awkwardness later. Other benefits (such as smaller footprint & easier to index/query) are a plus so I think that's why decided to hash the raw URLs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we start with raw URLs
That sounds great to me!
f716d71
to
8cc30fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general shape of this looks great—thank you so much @bendk!
components/suggest/src/schema.rs
Outdated
} | ||
|
||
mod sql_fns { | ||
use md5::{Digest, Md5}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could grab a smaller portion of it (e.g. 16 bytes) which should offer enough entropy given the relatively smaller input size.
That was my other question 😅—it looks like you're only keeping the first 8 bytes for the primary key, so any collisions will exclude all the colliding URLs. Is that right? That still sounds OK for our case, but definitely subtle.
Even if we used the full 32-byte hash, though, it's still going to be shorter than any URLs—the shortest URL we have is https://en.wikipedia.org/wiki/LVM3
, which is 34 bytes.
For comparison, Places uses a non-cryptographic 48-bit hash, with a non-unique index—so if there are any collisions, it'll still compare URLs, but the index reduces the search space from tens of thousands to a handful. But that scheme pays off for larger tables—for smaller tables like ours, it can end up increasing the on-disk size.
Do you happen to know why we store hashes on Desktop, @ncloudioj and @0c0w3? Is it mostly for size benefits, or are there other reasons like privacy?
b39f9b8
to
f600a67
Compare
Updated this PR to not hash the URLs for now at least. AFAICT, these are the remaining open questions:
@0c0w3 @linabutler @ncloudioj what do you think about the remaining questions? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Is the external API okay (passing a Suggestion object into dismiss_suggestion)?
The reason I was asking is b/c those comments, and wanted to confirm with Drew to see if it's possible for the consumer to feed the Suggestion object back for dismissal.
hashing only 63-bits to avoid signed integer issues (not currently relevant for this feature, but I figure we might as well check in the hash function in case we want to use it in the future).
Per Lina's comment, we'd like to avoid those legacy insecure hashers (e.g. MD5) in Places, shall we follow that here as well?
@@ -127,7 +127,7 @@ pub const SQL: &str = " | |||
-- Just store the MD5 hash of the dismissed suggestion. The collision rate is low and the | |||
-- impact of a collision is not showing a suggestion, which is not that bad. | |||
CREATE TABLE dismissed_suggestions ( | |||
url_hash INTEGER PRIMARY KEY | |||
url TEXT PRIMARY KEY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we're storing raw URLs, let's drop the above comment about hashing.
Good point. I just went ahead and removed that code, since it seems like we don't know exactly how we want to handle hashing in the future. |
f600a67
to
b238c57
Compare
Thank you so much for working through all this, @bendk! I did think of one downside for passing the full |
beea5c0
to
b1576a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Ben!
Would it be a bad idea to have a single check-for-dismissals step at the end of fetch_suggestions()
? It seems a little onerous to have to add a subquery (or whatever logic) per suggestion type, especially as we start to scale up new types. Doing one more query at the end doesn't seem too bad especially since dismissed_suggestions
is keyed on the URL.
This doesn't handle Yelp suggestions, which is fine for a follow-up IMO. We'll need to define what constitutes a raw URL for Yelp since the way Yelp builds URLs is relatively complex (location params, query modifiers).
components/suggest/src/suggest.udl
Outdated
@@ -123,6 +123,12 @@ interface SuggestStore { | |||
[Throws=SuggestApiError] | |||
sequence<Suggestion> query(SuggestionQuery query); | |||
|
|||
[Throws=SuggestApiError] | |||
void dismiss_suggestion(string suggestion_url); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document here that suggestion_url
should be the raw URL and call it raw_url
or suggestion_raw_url
? It would be very easy for consumers to pass in cooked URLs and break their dismissal feature.
I don't know if people consult the udl when they're implementing features -- I sometimes do -- so maybe it won't make much of a difference in practice. I'm almost tempted to suggest introducing a new type for raw URLs or a dismissal token or something. :-D (or this function could take the suggestion itself)
| Self::Wikipedia { url, .. } | ||
| Self::Amo { url, .. } | ||
| Self::Yelp { url, .. } | ||
| Self::Mdn { url, .. } => Some(url), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Desktop adds UTM params to Pocket, Amo, Yelp, and Mdn URLs. I don't know whether mobile does too but if not it's conceivable, so we should look into moving those into Rust so we have proper raw URLs for those types too and consumers don't need to strip off UTM params before dismissing URLs.
Beyond that, we'll need to work out what "raw URL" means for Yelp, as I mentioned.
These are concerns for follow-ups but I think it's something for desktop (at least) to watch out for as we switch over to this new API.
Dismissed suggestion Url hashes are stored in the database and the suggestions are not returned again in subsequent queries. This currently works with most providers, but not Yelp or Weather.
b1576a6
to
b934153
Compare
This seems like a good idea to me, but do we need to worry about the limits? I'm slightly paranoid that the dismissed suggestions will push non-dismissed suggestions down the list and cause them to be filtered out. |
AFAICT, there's no major blockers left. How do others feel about merging this one, then following up on:
Are there any other remaining questions that I'm missing? |
Sounds good to me. Go ahead! |
(This is on top of #5546, let's merge that one first)
Dismissed suggestion Urls are stored in the database and not returned again in subsequent queries.
This currently works with most providers, but not Yelp or Weather.
Pull Request checklist
[ci full]
to the PR title.Branch builds: add
[firefox-android: branch-name]
to the PR title.