
polkadot process panicked at 'can not convert float seconds to Duration: value is negative' #817

Closed
stakeworld opened this issue May 29, 2022 · 14 comments

@stakeworld
Contributor

stakeworld commented May 29, 2022

  • Server: Linux 00.stakeworld.nl 5.4.0-109-generic #123-Ubuntu SMP Fri Apr 8 09:10:54 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux, running on a 6-core VPS with 16 GB memory
  • polkadot --version: polkadot 0.9.22-17c7b9594ae
  • /usr/bin/polkadot --chain kusama --name STAKEWORLD/01 --unsafe-pruning --pruning 1000 --validator --port 30301 --rpc-port 9901 --ws-port 9801 --prometheus-port 9601 --prometheus-external --base-path /home/polkadot/stakeworld-01 --database rocksdb --telemetry-url 'wss://telemetry.polkadot.io/submit/ 1' --telemetry-url 'wss://telemetry-backend.w3f.community/submit 1'

While in the active validator set the process crashed; it restarted and has been running and validating normally since.

Polkadot LOG
May 29 13:48:14 00.stakeworld.nl polkadot[6173]: Thread 'tokio-runtime-worker' panicked at 'can not convert float seconds to Duration: value is negative', /rustc/7737e0b5c4103216d6fd8cf94>
May 29 13:48:14 00.stakeworld.nl polkadot[6173]: This is a bug. Please report it at:
May 29 13:48:14 00.stakeworld.nl polkadot[6173]: https://github.com/paritytech/polkadot/issues/new

Dmesg output
[May29 13:48] tokio-runtime-w[6345]: segfault at 0 ip 0000000000000000 sp 00007f13ac76eba8 error 14 in polkadot[5621a0f89000+247000]
[ +0.000081] Code: Bad RIP value.

@bkchr
Member

bkchr commented May 29, 2022

Can you please show the entire stack trace? Or more logs around the error?

@stakeworld
Contributor Author

I am sorry, I don't have more info for now. I had a lot of problems on that particular node, so I've moved to other hardware; since then I haven't seen the problem, so it could be hardware related. Will report back if it happens again.

@stakeworld closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 3, 2022
@polkalegos

polkalegos commented Jul 25, 2022

I've got the same error running the Docker image v0.9.26:

====================

Version: 0.9.26-d8785970175

   0: <unknown>
   1: <unknown>
   2: <unknown>
   3: <unknown>
   4: <unknown>
   5: <unknown>
   6: <unknown>
   7: <unknown>
   8: <unknown>
   9: <unknown>
  10: <unknown>
  11: <unknown>
  12: <unknown>
  13: <unknown>
  14: <unknown>
  15: start_thread
  16: clone


Thread 'tokio-runtime-worker' panicked at 'can not convert float seconds to Duration: value is negative', /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/time.rs:782

This is a bug. Please report it at:

	https://github.com/paritytech/polkadot/issues/new

@bkchr
Member

bkchr commented Jul 25, 2022

@polkalegos does this happen multiple times?

@polkalegos

So far just once, @bkchr

@ggwpez
Member

ggwpez commented Jul 25, 2022

Very interesting... the std docs describe this error as only happening in try_from_secs_f* or from_secs_f*.
For what it's worth, I ran grep -Rni vendor/ -e '.*from_secs_.*' on the Polkadot vendor/ directory after a cargo vendor.
But all these call sites are in non-live code 🤔
So no idea where this is being called from, not even in a dependency.

vendor/sc-sysinfo/src/lib.rs:70:	pub fn from_secs_f32(secs: f32) -> Self {
vendor/sc-sysinfo/src/lib.rs:71:		Self::MaxDuration(Duration::from_secs_f32(secs))
vendor/frame-benchmarking-cli/src/machine/mod.rs:144:		let verify_limit = ExecutionLimit::from_secs_f32(self.verify_duration);
vendor/frame-benchmarking-cli/src/machine/mod.rs:145:		let disk_limit = ExecutionLimit::from_secs_f32(self.disk_duration);
vendor/frame-benchmarking-cli/src/machine/mod.rs:146:		let hash_limit = ExecutionLimit::from_secs_f32(self.hash_duration);
vendor/frame-benchmarking-cli/src/machine/mod.rs:147:		let memory_limit = ExecutionLimit::from_secs_f32(self.memory_duration);

@koute
Contributor

koute commented Jul 27, 2022

@polkalegos Just for reference, can you tell me which exact Docker image you were running?

@polkalegos

@koute latest tag version 0.9.26

@polkalegos

@koute happening again on v0.9.29:

Version: 0.9.29-94078b44fb6

   0: sp_panic_handler::set::{{closure}}
   1: std::panicking::rust_panic_with_hook
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:702:17
   2: std::panicking::begin_panic_handler::{{closure}}
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:588:13
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/sys_common/backtrace.rs:138:18
   4: rust_begin_unwind
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
   5: core::panicking::panic_fmt
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
   6: core::panicking::panic_display
   7: <sc_finality_grandpa::communication::gossip::GossipValidator<Block> as sc_network_gossip::validator::Validator<Block>>::message_allowed::{{closure}}
   8: <sc_network_gossip::bridge::GossipEngine<B> as core::future::future::Future>::poll
   9: <sc_finality_grandpa::communication::NetworkBridge<B,N> as core::future::future::Future>::poll
  10: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
  11: tokio::runtime::task::raw::poll
  12: std::sys_common::backtrace::__rust_begin_short_backtrace
  13: core::ops::function::FnOnce::call_once{{vtable.shim}}
  14: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/alloc/src/boxed.rs:1872:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/alloc/src/boxed.rs:1872:9
      std::sys::unix::thread::Thread::new::thread_start
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/sys/unix/thread.rs:108:17
  15: start_thread
  16: clone


Thread 'tokio-runtime-worker' panicked at 'can not convert float seconds to Duration: value is negative', /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/time.rs:782

This is a bug. Please report it at:

	https://github.com/paritytech/polkadot/issues/new

@koute
Contributor

koute commented Oct 2, 2022

@polkalegos Thank you for the backtrace!

I'll take a look at this.

@koute koute reopened this Oct 2, 2022
@koute koute self-assigned this Oct 2, 2022
@arkpar
Member

arkpar commented Oct 3, 2022

I've run into a similar issue once. The problem was that Instant counters are not consistent across threads, i.e. creating an Instant with Instant::now and then calling elapsed on it in another thread may panic.

@koute
Contributor

koute commented Oct 3, 2022

I've run into a similar issue once. The problem was that Instant counters are not consistent across threads, i.e. creating an Instant with Instant::now and then calling elapsed on it in another thread may panic.

This is a good point; however, AFAIK since Rust 1.60 it's supposed to saturate to zero and not panic anymore. elapsed subtracts two Instants, and that calls duration_since, which does an unwrap_or_default.
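
For reference, a minimal sketch of that saturating behaviour (assuming a Rust 1.60+ toolchain; the prints are illustrative only):

use std::time::Instant;

fn main() {
    let a = Instant::now();
    let b = Instant::now(); // b is taken at or after a

    // Before Rust 1.60 this could panic when `a` reads as earlier than `b`;
    // since 1.60 it saturates to Duration::ZERO instead.
    println!("{:?}", a.duration_since(b));

    // checked_duration_since returns None when the ordering is violated,
    // letting callers detect non-monotonic readings explicitly.
    println!("{:?}", a.checked_duration_since(b));
}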

Anyway, looking at the code, here's where the error's defined:

pub struct FromFloatSecsError {
    kind: FromFloatSecsErrorKind,
}

impl FromFloatSecsError {
    const fn description(&self) -> &'static str {
        match self.kind {
            FromFloatSecsErrorKind::Negative => {
                "can not convert float seconds to Duration: value is negative"
            }
            FromFloatSecsErrorKind::OverflowOrNan => {
                "can not convert float seconds to Duration: value is either too big or NaN"
            }
        }
    }
}

So we're getting hit by FromFloatSecsErrorKind::Negative.
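
(As an aside, on toolchains where the fallible API is stable, Rust 1.66+, the same error can be observed without panicking; a small sketch:)

use std::time::Duration;

fn main() {
    // try_from_secs_f32 surfaces the error as a Result instead of panicking.
    let err = Duration::try_from_secs_f32(-1.0).unwrap_err();
    // The Display output is the description string from the std source above:
    // "can not convert float seconds to Duration: value is negative"
    println!("{err}");
}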

The only place it's used is in this macro:

macro_rules! try_from_secs {
    (
        secs = $secs: expr,
        mantissa_bits = $mant_bits: literal,
        exponent_bits = $exp_bits: literal,
        offset = $offset: literal,
        bits_ty = $bits_ty:ty,
        double_ty = $double_ty:ty,
    ) => {{
        // ...
        if $secs.is_sign_negative() {
            return Err(FromFloatSecsError { kind: FromFloatSecsErrorKind::Negative });
        }

So the $secs passed here must be negative. This macro is used in these places:

impl Duration {
    // ...
    pub const fn try_from_secs_f32(secs: f32) -> Result<Duration, FromFloatSecsError> {
        try_from_secs!(
            secs = secs,
            mantissa_bits = 23,
            exponent_bits = 8,
            offset = 41,
            bits_ty = u32,
            double_ty = u64,
        )
    }
    // ...
    pub const fn try_from_secs_f64(secs: f64) -> Result<Duration, FromFloatSecsError> {
        try_from_secs!(
            secs = secs,
            mantissa_bits = 52,
            exponent_bits = 11,
            offset = 44,
            bits_ty = u64,
            double_ty = u128,
        )
    }
}
impl Duration {
    // ...
    pub const fn from_secs_f64(secs: f64) -> Duration {
        match Duration::try_from_secs_f64(secs) {
            Ok(v) => v,
            Err(e) => panic!("{}", e.description()),
        }
    }
    // ...
    pub const fn from_secs_f32(secs: f32) -> Duration {
        match Duration::try_from_secs_f32(secs) {
            Ok(v) => v,
            Err(e) => panic!("{}", e.description()),
        }
    }
    // ...
}
impl Duration {
    // ...
    pub const fn mul_f64(self, rhs: f64) -> Duration {
        Duration::from_secs_f64(rhs * self.as_secs_f64())
    }
    // ...
    pub const fn mul_f32(self, rhs: f32) -> Duration {
        Duration::from_secs_f32(rhs * self.as_secs_f32())
    }
    // ...
    pub const fn div_f64(self, rhs: f64) -> Duration {
        Duration::from_secs_f64(self.as_secs_f64() / rhs)
    }
    // ...
    pub const fn div_f32(self, rhs: f32) -> Duration {
        Duration::from_secs_f32(self.as_secs_f32() / rhs)
    }
}

So basically, anywhere we create a Duration from an f32/f64, or multiply or divide one by an f32/f64, the result must be non-negative, or we'll panic.
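
Here's a minimal sketch that reproduces the exact panic message from the report (purely illustrative, with made-up values, not code from the node):

use std::time::Duration;

fn main() {
    let gossip_duration = Duration::from_millis(333);
    let round_duration = gossip_duration * 5; // ROUND_DURATION

    // If the multiplier ever ends up negative (e.g. via a clobbered float
    // register or a corrupted constant), mul_f32 panics with:
    // "can not convert float seconds to Duration: value is negative"
    let _ = round_duration.mul_f32(-1.5);
}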

Now let's look at GossipValidator::message_allowed in substrate/client/finality-grandpa/src/communication/gossip.rs. The backtrace points into a closure, so the issue is probably somewhere in here:

Box::new(move |who, intent, topic, mut data| {
    if let MessageIntent::PeriodicRebroadcast = intent {
        return do_rebroadcast
    }

    let peer = match inner.peers.peer(who) {
        None => return false,
        Some(x) => x,
    };

    // if the topic is not something we're keeping at the moment,
    // do not send.
    let (maybe_round, set_id) = match inner.live_topics.topic_info(topic) {
        None => return false,
        Some(x) => x,
    };

    if let MessageIntent::Broadcast = intent {
        if maybe_round.is_some() {
            if !inner.round_message_allowed(who) {
                // early return if the vote message isn't allowed at this stage.
                return false
            }
        } else if !inner.global_message_allowed(who) {
            // early return if the global message isn't allowed at this stage.
            return false
        }
    }

    // if the topic is not something the peer accepts, discard.
    if let Some(round) = maybe_round {
        return peer.view.consider_vote(round, set_id) == Consider::Accept
    }

    // global message.
    let local_view = match inner.local_view {
        Some(ref v) => v,
        None => return false, // cannot evaluate until we have a local view.
    };

    match GossipMessage::<Block>::decode(&mut data) {
        Err(_) => false,
        Ok(GossipMessage::Commit(full)) => {
            // we only broadcast commit messages if they're for the same
            // set the peer is in and if the commit is better than the
            // last received by peer, additionally we make sure to only
            // broadcast our best commit.
            peer.view.consider_global(set_id, full.message.target_number) ==
                Consider::Accept && Some(&full.message.target_number) ==
                local_view.last_commit_height()
        },
        Ok(GossipMessage::Neighbor(_)) => false,
        Ok(GossipMessage::CatchUpRequest(_)) => false,
        Ok(GossipMessage::CatchUp(_)) => false,
        Ok(GossipMessage::Vote(_)) => false, // should not be the case.
    }
})

Let's look at round_message_allowed and global_message_allowed, which are the only places where the affected functions are used, as far as I can see:

fn round_message_allowed(&self, who: &PeerId) -> bool {
    let round_duration = self.config.gossip_duration * ROUND_DURATION;
    let round_elapsed = match self.local_view {
        Some(ref local_view) => local_view.round_start.elapsed(),
        None => return false,
    };

    if round_elapsed < round_duration.mul_f32(PROPAGATION_SOME) {
        self.peers.first_stage_peers.contains(who)
    } else if round_elapsed < round_duration.mul_f32(PROPAGATION_ALL) {
        self.peers.first_stage_peers.contains(who) ||
            self.peers.second_stage_peers.contains(who)
    } else {
        self.peers.peer(who).map(|info| !info.roles.is_light()).unwrap_or(false)
    }
}
// ...
fn global_message_allowed(&self, who: &PeerId) -> bool {
    let round_duration = self.config.gossip_duration * ROUND_DURATION;
    let round_elapsed = match self.local_view {
        Some(ref local_view) => local_view.round_start.elapsed(),
        None => return false,
    };

    if round_elapsed < round_duration.mul_f32(PROPAGATION_ALL) {
        self.peers.first_stage_peers.contains(who) ||
            self.peers.second_stage_peers.contains(who) ||
            self.peers.lucky_light_peers.contains(who)
    } else {
        true
    }
}

Both use mul_f32, which can panic with this error message!

But now here's the weird part; let me list out all of the relevant constants and variables:

const ROUND_DURATION: u32 = 5;
const PROPAGATION_SOME: f32 = 1.5;
const PROPAGATION_ALL: f32 = 3.0;

#[derive(Clone)]
pub struct Config {
    // ...
    pub gossip_duration: Duration,
    // ...
}

// In polkadot:
let gossip_duration = if ... { Duration::from_millis(2000) } else { Duration::from_millis(1000) }

// In substrate:
gossip_duration: std::time::Duration::from_millis(333)

Everything here is positive. So this is really weird; unless I'm missing something, this panic should be impossible to trigger, unless the Config structure gets corrupted, there's a hardware issue, or there's a bug in the kernel that clobbers the floating point registers by mistake (since normally the floating point registers are not saved and reloaded on a context switch).

@polkalegos Can you give us more details as to the system on which you've seen this problem?

  1. Where exactly are you running this? (AWS? GCP? Hetzner? etc.)
  2. What are the hardware specs of the system?
  3. Which distro are you using?
  4. What's the kernel version on which you are running?
  5. Does the system have ECC memory?
  6. Can you do cat /proc/cpuinfo and paste us the output?
  7. Do you rent the whole bare-metal system or do you just rent a VM? If you rent the whole bare-metal machine can you run edac-util -v (you might have to install this package) to check if there were any ECC errors?

@stakeworld
Contributor Author

The original report was on a Hetzner VPS, which I think is very underpowered, so it could be related to hardware/memory problems. Since I moved to dedicated hardware I haven't seen the problem repeat.

@ggwpez
Member

ggwpez commented Oct 4, 2022

(from the Rust docs)

To work around these bugs and platforms not offering monotonic clocks, duration_since, elapsed and sub saturate to zero. In older Rust versions this led to a panic instead. checked_duration_since can be used to detect and handle situations where monotonicity is violated, or Instants are subtracted in the wrong order.

Not sure what role Docker and the VPS play here. If it runs in some QEMU or whatever, maybe that breaks the monotonic clock somehow.

From looking at the code I came to the same conclusion: only mul_f32 would cause this, but it is called with sane constants, so it should be fine.

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@bkchr bkchr closed this as completed Aug 25, 2023