-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1672951 - Implement a Metrics Ping Scheduler in glean-core #1599
Conversation
c5f6156
to
5fa02a3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the general approach is sound and indeed a 1:1 mapping of the "old" implementation.
I think if we treat this implementation as "experimental" (maybe even put it behind a feature with that name?) we can consider some (breaking) changes on it as followups.
I like the testing approach, though might bikeshed on the Policy
name (but don't have a better one yet).
(and of course we need to add some more tests)
I have a bunch of comments I left inline.
glean-core/src/lib.rs
Outdated
@@ -129,6 +134,11 @@ pub struct Configuration { | |||
pub max_events: Option<usize>, | |||
/// Whether Glean should delay persistence of data from metrics with ping lifetime. | |||
pub delay_ping_lifetime_io: bool, | |||
/// The application's build identifier. If this changes and use_core_mps is `true`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes when?
glean-core/src/scheduler.rs
Outdated
/// Must be called before draining the preinit queue. | ||
/// (We're at the Language Bindings' mercy for that) | ||
pub fn schedule(glean: &Glean) { | ||
let now = Local::now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call is what's currently causing some crashes in FOG and it's why we have this awkward workaround now: https://github.com/mozilla/glean/blob/main/glean-core/src/util.rs#L53-L89
We should probably rely on the same workaround (probably possible to split local_now_with_offset
into local_now
and local_now_with_offset
which calls local_now
and turns DateTime<Local>
into DateTime<FixedOffset>
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...but won't we have problems with any constructed value of DateTime<Local>
no matter what? I mean, if Local
's timezone is problematic, there's not really much we can do with it.
I guess I'll rewrite scheduler to use DateTime<FixedOffset>
and use local_now_with_offset
to get it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use the wrong timezone consistently then it's still fine.
But we shouldn't crash when encountering something wrong.
Using DateTime<FixedOffset>
throughout should be still correct usage.
glean-core/src/scheduler.rs
Outdated
if let Some(last_sent_build) = last_sent_build_metric.get_value(&glean, INTERNAL_STORAGE) { | ||
// XXX: If `app_build` is longer than StringMetric's max length, we will always | ||
// treat it as a changed build when really it isn't. | ||
if &last_sent_build != glean.app_build.as_ref().unwrap() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure we always have Some
app_build here?
Also we use it below again, so maybe put that into a variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have app_build
iff cfg.use_core_mps
. But I'm not happy with how it's only enforced by logic not language. cfg.app_build
is always present (often "unknown"), I was using Option
to signal more than the app_build... maybe I should add use_core_mps
to struct Glean
?
glean-core/src/scheduler.rs
Outdated
// treat it as a changed build when really it isn't. | ||
if &last_sent_build != glean.app_build.as_ref().unwrap() { | ||
last_sent_build_metric.set(&glean, glean.app_build.as_ref().unwrap()); | ||
log::info!("Builds don't match. Sending 'metrics' ping"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something more like "App build changed. Sending 'metrics' ping"?
glean-core/src/scheduler.rs
Outdated
Local::today().and_hms(SCHEDULED_HOUR, 0, 0) | ||
}; | ||
|
||
// Other MPSes cancel outstanding tasks here. I'm not sure we need to, since schedule() is only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO or bug for a followup
glean-core/src/scheduler.rs
Outdated
|
||
const SCHEDULED_HOUR: u32 = 4; | ||
|
||
static THREAD_GENERATION: OnceCell<Arc<AtomicU32>> = OnceCell::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a Lazy
and then we don't even need the explict init call later.
It also should have some docs on what it's for
glean-core/src/scheduler.rs
Outdated
let mut policy = TestPolicy { | ||
..Default::default() | ||
}; | ||
policy.app_build = "a build".to_string(); | ||
policy | ||
.last_sent_build | ||
.replace(Some("a different build".to_string())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can probably be simplified to:
let mut policy = TestPolicy {
app_buid: "a build".to_string(),
last_sent_build: Some("a different build".to_string)),
..Default::default()
};
glean-core/src/lib.rs
Outdated
} else { | ||
// Can only kick off the "metrics" ping scheduler after we have a global Glean. | ||
let glean = &GLEAN.get().unwrap().lock().unwrap(); | ||
scheduler::schedule(&glean); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably swap around the if
condition here:
if GLEAN.set(Mutex::new(glean)).is_ok() {
// Can only kick off the "metrics" ping scheduler after we have a global Glean.
let glean = &GLEAN.get().unwrap().lock().unwrap();
scheduler::schedule(&glean);
} else {
// ...
}
That is: handle the error case last
glean-core/src/scheduler.rs
Outdated
reason: Option<&'static str>, | ||
) { | ||
let thread_gen = Arc::clone(&THREAD_GENERATION.get_or_init(Default::default)); | ||
std::thread::Builder::new() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'm a bit worried about this new thread. It's long running and sleeping all the time.
I think we had the problem before that Firefox tooling wants us to ensure threads are stopped before the main thread. As it stands we don't have a way to inform this thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Additionally I'd wish we would just have a way to hook into whatever-existing background-threaded timers the surrounding thing gives us :( )
5fa02a3
to
5dc05d1
Compare
Notable changes:
I should've caught all the previous comments. Lemme know if I missed any. ....and clippy immediately fails for something |
8377191
to
c189ff9
Compare
Had to add a lock for scheduler tests, which makes my nose twitch. Being able to cancel the scheduler from any thread without caring if the scheduler's got work to do is a feature, not a bug, but any time I have to do this sort of workaround it makes me wonder if I'm decreasing safety. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bunch of questions inline.
Overall this code looks good though and we're close to landing this.
I wonder if you already managed to test this out in use in FOG?
glean-core/src/scheduler.rs
Outdated
|
||
// `When` is responsible for date math. Let's make sure it's correct. | ||
#[test] | ||
fn test_when() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test_
prefix is noise in the output IMO. Can we make the function names a bit more self-describing?
glean-core/src/lib.rs
Outdated
@@ -233,6 +248,9 @@ impl Glean { | |||
max_events: cfg.max_events.unwrap_or(DEFAULT_MAX_EVENTS), | |||
is_first_run: false, | |||
debug: DebugOptions::new(), | |||
// Subprocess doesn't use "metrics" pings so has no need for a scheduler. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment doesn't apply anymore I guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, it might, though. There shouldn't be a subprocess scheduler.
glean-core/src/scheduler.rs
Outdated
// Ensure that if we have a different build, we immediately submit an "upgrade" ping | ||
// and schedule a "reschedule" ping for tomorrow. | ||
#[test] | ||
fn test_different_app_builds() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Rust one usually doesn't use the test_
prefix for this functions.
Otherwise all functions in the output just start the same, making it harder to parse at a glance.
glean-core/src/scheduler.rs
Outdated
*cancelled_lock.lock().unwrap() = true; // Cancel the scheduler thread. | ||
condvar.notify_all(); // Notify any/all listening schedulers to check whether they were cancelled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this just call cancel()
? It's what the outside will call and does the same inside, so we don't need to repeat it over and over again in tests.
*cancelled_lock.lock().unwrap() = true; // Cancel the scheduler thread. | |
condvar.notify_all(); // Notify any/all listening schedulers to check whether they were cancelled. | |
super::cancel(); |
glean-core/src/scheduler.rs
Outdated
*cancelled_lock.lock().unwrap() = true; // Cancel the scheduler thread. | ||
condvar.notify_all(); // Notify any/all listening schedulers to check whether they were cancelled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above. Gonna stop calling it out.
c189ff9
to
4531fac
Compare
glean-core/src/lib.rs
Outdated
@@ -233,6 +248,9 @@ impl Glean { | |||
max_events: cfg.max_events.unwrap_or(DEFAULT_MAX_EVENTS), | |||
is_first_run: false, | |||
debug: DebugOptions::new(), | |||
// Subprocess doesn't use "metrics" pings so has no need for a scheduler. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, it might, though. There shouldn't be a subprocess scheduler.
pub fn cancel() { | ||
let (cancelled_lock, condvar) = &**TASK_CONDVAR; // One `*` for Lazy, the second for Arc | ||
*cancelled_lock.lock().unwrap() = true; // Cancel the scheduler thread. | ||
condvar.notify_all(); // Notify any/all listening schedulers to check whether they were cancelled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other news, I fully expect to have to store the task scheduler's joinhandle somewhere and join on it here when we start getting intermittent at-shutdown problems : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good now! From the chat I take you took this to a test drive, did this require any last changes or was that purely FOG work now?
Mostly a translation from other LBs' MPSes, but with a few changes. Most notably a split of scheduling from ping submission that makes for easier-to-proxy operations that are easier to test. Also notable is the use of a condvar for cancellable tasks.
4531fac
to
08ffed7
Compare
Co-authored-by: Jan-Erik Rediger <badboy@archlinux.us>
if !old_enabled && enabled { | ||
glean.start_metrics_ping_scheduler(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh, interestingly we don't do this on Kotlin or Swift.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's correct though, when re-enabled we should also let the MPS do its work again
The MPS has actually been written for a little while. I strove for as direct a translation of
MetricsPingScheduler.{kt|swift}
as was practicable (we can always refactor it later). The hard part has been tests.I've tried making this testable in a few ways. The first way you can see in the second commit and is straightforward, but it can't actually observe most of the behaviour we hope to test.
The second way apes the
Policy
approach of gecko JSMs. (I had some struggles with the borrow checker on this approach, so it's very possible this is more ornate than it needs to be) I think thisPolicy
approach is the better one because it allows us to inspect and test every operation of the MPS which, as history has taught us, is absolutely crucial for peace of mind.(( I also attempted to automock Glean, but it would require some serious modifications to make the Glean struct's impl suit the mocking library (mockall), so I abandoned it. ))
I'm looking for some honest feedback about this, both for whether the approach itself is distasteful as well as what patterns I've missed in pursuing things in this manner. I've taken heart from the Book suggesting RefCell-based interior mutability for test mocks itself to mean I'm not completely out to lunch here, but I might be.