Skip to content
This repository has been archived by the owner on Jul 25, 2022. It is now read-only.

Analyze lctl output for real snapshot listing failures #2282

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 7 additions & 4 deletions iml-agent/src/action_plugins/lustre/snapshot.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,10 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

use crate::{agent_error::ImlAgentError, lustre::lctl};
use crate::{
agent_error::ImlAgentError,
lustre::{lctl, lctl_retry},
};
use combine::{stream::easy, EasyParser};
use iml_wire_types::snapshot::{Create, Destroy, List, Mount, Snapshot, Unmount};

Expand All @@ -14,7 +17,7 @@ pub async fn list(l: List) -> Result<Vec<Snapshot>, ImlAgentError> {
args.push("--name");
args.push(name);
}
let stdout = lctl(args).await?;
let stdout = lctl_retry(args).await?;
let stdout = stdout.trim();

if stdout.is_empty() {
Expand Down Expand Up @@ -53,15 +56,15 @@ pub async fn create(c: Create) -> Result<(), ImlAgentError> {
args.push(cmnt);
}

lctl(args).await.map(drop)
lctl_retry(args).await.map(drop)
}

pub async fn destroy(d: Destroy) -> Result<(), ImlAgentError> {
let mut args = vec!["snapshot_destroy", "--fsname", &d.fsname, "--name", &d.name];
if d.force {
args.push("--force");
}
lctl(args).await.map(drop)
lctl_retry(args).await.map(drop)
}

pub async fn mount(m: Mount) -> Result<(), ImlAgentError> {
Expand Down
31 changes: 25 additions & 6 deletions iml-agent/src/daemon_plugins/snapshot.rs
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,8 @@ use futures::{
lock::Mutex,
Future, FutureExt,
};
use iml_cmd::CmdError;
use iml_tracing::tracing;
use iml_wire_types::snapshot::{List, Snapshot};
use std::collections::BTreeSet;
use std::{pin::Pin, sync::Arc, time::Duration};
Expand Down Expand Up @@ -50,7 +52,7 @@ async fn list() -> Result<Vec<Snapshot>, ()> {
let fss: Vec<String> = lctl(vec!["get_param", "-N", "mgs.MGS.live.*"])
.await
.map_err(|e| {
// XXX debug because of false positives
// XXX debug because of false positives. But this is still a failure.
tracing::debug!("listing filesystems failed: {}", e);
})
.map(|o| {
Expand Down Expand Up @@ -87,14 +89,31 @@ async fn list() -> Result<Vec<Snapshot>, ()> {
let really_failed_fss = errs
.into_iter()
.map(|x| x.unwrap_err())
.filter(|x| !snapshot_fsnames.contains(&x.0))
.filter(|x| {
tracing::debug!("listing for {} failed: {:?}", x.0, x.1);

if snapshot_fsnames.contains(&x.0) {
tracing::debug!("{} is a snapshot FS", x.0);
return false;
}

match &x.1 {
ImlAgentError::CmdError(CmdError::Output(o)) => {
// XXX lctl returns 1 no matter what, so have to read its output:
let stderr = String::from_utf8_lossy(&o.stderr);
stderr.find("Miss MDT0 in the config file").is_none()
}
_ => true,
}
})
.collect::<Vec<_>>();

if !really_failed_fss.is_empty() {
// XXX debug because of false positives
tracing::debug!("listing failed: {:?}", really_failed_fss);
if really_failed_fss.is_empty() {
Ok(snaps)
} else {
tracing::error!("listing failed: {:?}", really_failed_fss);
Err(())
}
Ok(snaps)
}

#[async_trait]
Expand Down
45 changes: 39 additions & 6 deletions iml-agent/src/lustre.rs
Original file line number Diff line number Diff line change
Expand Up @@ -4,23 +4,56 @@

use crate::agent_error::ImlAgentError;
use futures::TryFutureExt;
use iml_cmd::{CheckedCommandExt, Command};
use iml_cmd::{CheckedCommandExt, CmdError, Command};
use liblustreapi::LlapiFid;
use std::ffi::OsStr;
use tokio::task::spawn_blocking;
use std::{ffi::OsStr, process::Output, time::Duration};
use tokio::{task::spawn_blocking, time::delay_for};

/// Runs lctl with given arguments
/// Execute insistently lctl with given arguments (retry if resource is temporarily unavailable)
pub async fn lctl_retry<I, S>(args: I) -> Result<String, ImlAgentError>
where
I: IntoIterator<Item = S> + Clone,
S: AsRef<OsStr>,
{
loop {
let r = invoke_lctl(args.clone()).await;

if let Err(CmdError::Output(o)) = &r {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this should wrap the lctl call as a new fn instead of being within it.

Copy link
Member Author

@ip1981 ip1981 Oct 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it first, but the function was private anyway consisting only of Command::new("/usr/sbin/lctl") (after I removed premature decoding of output)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call this fn lctl_retry, and keep the original lctl fn in place. the lctl fn is being used in a few other modules already, and it's behavior should stay the same I think.

Copy link
Member Author

@ip1981 ip1981 Oct 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, other modules will just become more sound out of the box. I don't think there is sense in failing for a temporarily unavailable resource.

Initially there were two checks when listing snapshots, but this one (resource temporarily unavailable) seems the best be handled earlier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we seen this on any other lctl call (and do we know the behavior is the same)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's valid question. As a precaution we indeed can have two version of lctl function. But the lctl executable is using perror() or strerror() consistently. And retrying temporary failures is still needed.

It is really simple: we would loop iff lctl prints "Resource temporarily unavailable" (to stderr + exit code, it could not be a snapshot comment :) )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the call into IML's liblustreapi library? Might be cleaner than doing a shell command.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make it lctl_retry

let stderr = String::from_utf8_lossy(&o.stderr);
if stderr.find("Resource temporarily unavailable").is_some() {
const DUR: Duration = Duration::from_secs(3);
tracing::debug!("{}, waiting {:?} ...", stderr.trim(), DUR);
delay_for(DUR).await;
continue;
}
}
break r
.map_err(|e| e.into())
.map(|o| String::from_utf8_lossy(&o.stdout).to_string());
}
}

/// Execute lctl with given arguments
pub async fn lctl<I, S>(args: I) -> Result<String, ImlAgentError>
where
I: IntoIterator<Item = S>,
S: AsRef<OsStr>,
{
invoke_lctl(args)
.await
.map_err(|e| e.into())
.map(|o| String::from_utf8_lossy(&o.stdout).to_string())
}

async fn invoke_lctl<I, S>(args: I) -> Result<Output, CmdError>
where
I: IntoIterator<Item = S>,
S: AsRef<OsStr>,
{
Command::new("/usr/sbin/lctl")
.args(args)
.checked_output()
.err_into()
.await
.map(|o| String::from_utf8_lossy(&o.stdout).to_string())
}

/// Returns LlapiFid for a given device or mount path
Expand Down