
polkadot process panicked at 'can not convert float seconds to Duration: value is negative' #817

Closed
stakeworld opened this issue May 29, 2022 · 14 comments

@stakeworld
Contributor

stakeworld commented May 29, 2022

  • Server: Linux 00.stakeworld.nl 5.4.0-109-generic #123-Ubuntu SMP Fri Apr 8 09:10:54 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux, running on a 6-core VPS with 16 GB memory
  • polkadot --version: polkadot 0.9.22-17c7b9594ae
  • /usr/bin/polkadot --chain kusama --name STAKEWORLD/01 --unsafe-pruning --pruning 1000 --validator --port 30301 --rpc-port 9901 --ws-port 9801 --prometheus-port 9601 --prometheus-external --base-path /home/polkadot/stakeworld-01 --database rocksdb --telemetry-url 'wss://telemetry.polkadot.io/submit/ 1' --telemetry-url 'wss://telemetry-backend.w3f.community/submit 1'

While in the active validator set the process crashed; it restarted and has been running and validating normally since.

Polkadot LOG
May 29 13:48:14 00.stakeworld.nl polkadot[6173]: Thread 'tokio-runtime-worker' panicked at 'can not convert float seconds to Duration: value is negative', /rustc/7737e0b5c4103216d6fd8cf94>
May 29 13:48:14 00.stakeworld.nl polkadot[6173]: This is a bug. Please report it at:
May 29 13:48:14 00.stakeworld.nl polkadot[6173]: https://github.com/paritytech/polkadot/issues/new

Dmesg output
[May29 13:48] tokio-runtime-w[6345]: segfault at 0 ip 0000000000000000 sp 00007f13ac76eba8 error 14 in polkadot[5621a0f89000+247000]
[ +0.000081] Code: Bad RIP value.

@bkchr
Member

bkchr commented May 29, 2022

Can you please show the entire stack trace? Or more logs around the error?

@stakeworld
Contributor Author

I am sorry, I don't have more info for now. I had a lot of problems on that particular node, so I've moved to other hardware; since then I haven't seen the problem, so it could be hardware related. Will report back if it happens again.

@stakeworld closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 3, 2022
@polkalegos

polkalegos commented Jul 25, 2022

I've got the same error running the Docker image v0.9.26:

====================

Version: 0.9.26-d8785970175

   0: <unknown>
   1: <unknown>
   2: <unknown>
   3: <unknown>
   4: <unknown>
   5: <unknown>
   6: <unknown>
   7: <unknown>
   8: <unknown>
   9: <unknown>
  10: <unknown>
  11: <unknown>
  12: <unknown>
  13: <unknown>
  14: <unknown>
  15: start_thread
  16: clone


Thread 'tokio-runtime-worker' panicked at 'can not convert float seconds to Duration: value is negative', /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/time.rs:782

This is a bug. Please report it at:

	https://github.com/paritytech/polkadot/issues/new

@bkchr
Member

bkchr commented Jul 25, 2022

@polkalegos does this happen multiple times?

@polkalegos

So far just once, @bkchr

@ggwpez
Member

ggwpez commented Jul 25, 2022

Very interesting... the std docs describe this error as only happening in try_from_secs_f* or from_secs_f*.
For what it's worth, I ran grep -Rni vendor/ -e '.*from_secs_.*' on the Polkadot vendor/ directory after a cargo vendor.
But all these call sites are in non-live code 🤔
So no idea where this is being called from, not even in a dependency.

vendor/sc-sysinfo/src/lib.rs:70:	pub fn from_secs_f32(secs: f32) -> Self {
vendor/sc-sysinfo/src/lib.rs:71:		Self::MaxDuration(Duration::from_secs_f32(secs))
vendor/frame-benchmarking-cli/src/machine/mod.rs:144:		let verify_limit = ExecutionLimit::from_secs_f32(self.verify_duration);
vendor/frame-benchmarking-cli/src/machine/mod.rs:145:		let disk_limit = ExecutionLimit::from_secs_f32(self.disk_duration);
vendor/frame-benchmarking-cli/src/machine/mod.rs:146:		let hash_limit = ExecutionLimit::from_secs_f32(self.hash_duration);
vendor/frame-benchmarking-cli/src/machine/mod.rs:147:		let memory_limit = ExecutionLimit::from_secs_f32(self.memory_duration);

@koute
Contributor

koute commented Jul 27, 2022

@polkalegos Just for reference, can you tell me which exact Docker image you were running?

@polkalegos

@koute latest tag version 0.9.26

@polkalegos

@koute happening again on v0.9.29:

Version: 0.9.29-94078b44fb6

   0: sp_panic_handler::set::{{closure}}
   1: std::panicking::rust_panic_with_hook
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:702:17
   2: std::panicking::begin_panic_handler::{{closure}}
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:588:13
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/sys_common/backtrace.rs:138:18
   4: rust_begin_unwind
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
   5: core::panicking::panic_fmt
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
   6: core::panicking::panic_display
   7: <sc_finality_grandpa::communication::gossip::GossipValidator<Block> as sc_network_gossip::validator::Validator<Block>>::message_allowed::{{closure}}
   8: <sc_network_gossip::bridge::GossipEngine<B> as core::future::future::Future>::poll
   9: <sc_finality_grandpa::communication::NetworkBridge<B,N> as core::future::future::Future>::poll
  10: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
  11: tokio::runtime::task::raw::poll
  12: std::sys_common::backtrace::__rust_begin_short_backtrace
  13: core::ops::function::FnOnce::call_once{{vtable.shim}}
  14: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/alloc/src/boxed.rs:1872:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/alloc/src/boxed.rs:1872:9
      std::sys::unix::thread::Thread::new::thread_start
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/sys/unix/thread.rs:108:17
  15: start_thread
  16: clone


Thread 'tokio-runtime-worker' panicked at 'can not convert float seconds to Duration: value is negative', /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/time.rs:782

This is a bug. Please report it at:

	https://github.com/paritytech/polkadot/issues/new

@koute
Contributor

koute commented Oct 2, 2022

@polkalegos Thank you for the backtrace!

I'll take a look at this.

@koute koute reopened this Oct 2, 2022
@koute koute self-assigned this Oct 2, 2022
@arkpar
Member

arkpar commented Oct 3, 2022

I've run into a similar issue once. The problem was that Instant counters are not consistent across threads, i.e. creating an Instant with Instant::now and then calling elapsed on it in another thread may panic.

@koute
Contributor

koute commented Oct 3, 2022

I've run into a similar issue once. The problem was that Instant counters are not consistent across threads, i.e. creating an Instant with Instant::now and then calling elapsed on it in another thread may panic.

This is a good point; however, AFAIK since Rust 1.60 it's supposed to saturate to zero and not panic anymore. elapsed subtracts two Instants, and that calls duration_since, which does an unwrap_or_default.
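
For reference, a minimal sketch of that saturating behaviour (assuming a Rust 1.60+ toolchain; the prints are illustrative only):

use std::time::Instant;

fn main() {
    let a = Instant::now();
    let b = Instant::now(); // b is taken at or after a

    // Before Rust 1.60 this could panic when `a` reads as earlier than `b`;
    // since 1.60 it saturates to Duration::ZERO instead.
    println!("{:?}", a.duration_since(b));

    // checked_duration_since returns None when the ordering is violated,
    // letting callers detect non-monotonic readings explicitly.
    println!("{:?}", a.checked_duration_since(b));
}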

Anyway, looking at the code, here's where the error's defined:

pub struct FromFloatSecsError {
    kind: FromFloatSecsErrorKind,
}

impl FromFloatSecsError {
    const fn description(&self) -> &'static str {
        match self.kind {
            FromFloatSecsErrorKind::Negative => {
                "can not convert float seconds to Duration: value is negative"
            }
            FromFloatSecsErrorKind::OverflowOrNan => {
                "can not convert float seconds to Duration: value is either too big or NaN"
            }
        }
    }
}

So we're getting hit by FromFloatSecsErrorKind::Negative.
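
(As an aside, on toolchains where the fallible API is stable, Rust 1.66+, the same error can be observed without panicking; a small sketch:)

use std::time::Duration;

fn main() {
    // try_from_secs_f32 surfaces the error as a Result instead of panicking.
    let err = Duration::try_from_secs_f32(-1.0).unwrap_err();
    // The Display output is the description string from the std source above:
    // "can not convert float seconds to Duration: value is negative"
    println!("{err}");
}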

The only place it's used is in this macro:

macro_rules! try_from_secs {
    (
        secs = $secs: expr,
        mantissa_bits = $mant_bits: literal,
        exponent_bits = $exp_bits: literal,
        offset = $offset: literal,
        bits_ty = $bits_ty:ty,
        double_ty = $double_ty:ty,
    ) => {{
        // ...
        if $secs.is_sign_negative() {
            return Err(FromFloatSecsError { kind: FromFloatSecsErrorKind::Negative });
        }

So the $secs passed here must be negative. This macro is used in these places:

impl Duration {
    // ...
    pub const fn try_from_secs_f32(secs: f32) -> Result<Duration, FromFloatSecsError> {
        try_from_secs!(
            secs = secs,
            mantissa_bits = 23,
            exponent_bits = 8,
            offset = 41,
            bits_ty = u32,
            double_ty = u64,
        )
    }
    // ...
    pub const fn try_from_secs_f64(secs: f64) -> Result<Duration, FromFloatSecsError> {
        try_from_secs!(
            secs = secs,
            mantissa_bits = 52,
            exponent_bits = 11,
            offset = 44,
            bits_ty = u64,
            double_ty = u128,
        )
    }
}
impl Duration {
    // ...
    pub const fn from_secs_f64(secs: f64) -> Duration {
        match Duration::try_from_secs_f64(secs) {
            Ok(v) => v,
            Err(e) => panic!("{}", e.description()),
        }
    }
    // ...
    pub const fn from_secs_f32(secs: f32) -> Duration {
        match Duration::try_from_secs_f32(secs) {
            Ok(v) => v,
            Err(e) => panic!("{}", e.description()),
        }
    }
    // ...
}
impl Duration {
    // ...
    pub const fn mul_f64(self, rhs: f64) -> Duration {
        Duration::from_secs_f64(rhs * self.as_secs_f64())
    }
    // ...
    pub const fn mul_f32(self, rhs: f32) -> Duration {
        Duration::from_secs_f32(rhs * self.as_secs_f32())
    }
    // ...
    pub const fn div_f64(self, rhs: f64) -> Duration {
        Duration::from_secs_f64(self.as_secs_f64() / rhs)
    }
    // ...
    pub const fn div_f32(self, rhs: f32) -> Duration {
        Duration::from_secs_f32(self.as_secs_f32() / rhs)
    }
}

So basically, anywhere we create a Duration from an f32/f64, or multiply or divide one by an f32/f64, the result must be non-negative, or we'll panic.
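
Here's a minimal sketch that reproduces the exact panic message from the report (purely illustrative, with made-up values, not code from the node):

use std::time::Duration;

fn main() {
    let gossip_duration = Duration::from_millis(333);
    let round_duration = gossip_duration * 5; // ROUND_DURATION

    // If the multiplier ever ends up negative (e.g. via a clobbered float
    // register or a corrupted constant), mul_f32 panics with:
    // "can not convert float seconds to Duration: value is negative"
    let _ = round_duration.mul_f32(-1.5);
}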

Now let's look at GossipValidator::message_allowed in substrate/client/finality-grandpa/src/communication/gossip.rs. The backtrace points into a closure, so the issue is probably somewhere in here:

Box::new(move |who, intent, topic, mut data| {
    if let MessageIntent::PeriodicRebroadcast = intent {
        return do_rebroadcast
    }

    let peer = match inner.peers.peer(who) {
        None => return false,
        Some(x) => x,
    };

    // if the topic is not something we're keeping at the moment,
    // do not send.
    let (maybe_round, set_id) = match inner.live_topics.topic_info(topic) {
        None => return false,
        Some(x) => x,
    };

    if let MessageIntent::Broadcast = intent {
        if maybe_round.is_some() {
            if !inner.round_message_allowed(who) {
                // early return if the vote message isn't allowed at this stage.
                return false
            }
        } else if !inner.global_message_allowed(who) {
            // early return if the global message isn't allowed at this stage.
            return false
        }
    }

    // if the topic is not something the peer accepts, discard.
    if let Some(round) = maybe_round {
        return peer.view.consider_vote(round, set_id) == Consider::Accept
    }

    // global message.
    let local_view = match inner.local_view {
        Some(ref v) => v,
        None => return false, // cannot evaluate until we have a local view.
    };

    match GossipMessage::<Block>::decode(&mut data) {
        Err(_) => false,
        Ok(GossipMessage::Commit(full)) => {
            // we only broadcast commit messages if they're for the same
            // set the peer is in and if the commit is better than the
            // last received by peer, additionally we make sure to only
            // broadcast our best commit.
            peer.view.consider_global(set_id, full.message.target_number) ==
                Consider::Accept && Some(&full.message.target_number) ==
                local_view.last_commit_height()
        },
        Ok(GossipMessage::Neighbor(_)) => false,
        Ok(GossipMessage::CatchUpRequest(_)) => false,
        Ok(GossipMessage::CatchUp(_)) => false,
        Ok(GossipMessage::Vote(_)) => false, // should not be the case.
    }
})

Let's look at round_message_allowed and global_message_allowed, which are the only places where the affected functions are used, as far as I can see:

fn round_message_allowed(&self, who: &PeerId) -> bool {
    let round_duration = self.config.gossip_duration * ROUND_DURATION;
    let round_elapsed = match self.local_view {
        Some(ref local_view) => local_view.round_start.elapsed(),
        None => return false,
    };

    if round_elapsed < round_duration.mul_f32(PROPAGATION_SOME) {
        self.peers.first_stage_peers.contains(who)
    } else if round_elapsed < round_duration.mul_f32(PROPAGATION_ALL) {
        self.peers.first_stage_peers.contains(who) ||
            self.peers.second_stage_peers.contains(who)
    } else {
        self.peers.peer(who).map(|info| !info.roles.is_light()).unwrap_or(false)
    }
}
// ...
fn global_message_allowed(&self, who: &PeerId) -> bool {
    let round_duration = self.config.gossip_duration * ROUND_DURATION;
    let round_elapsed = match self.local_view {
        Some(ref local_view) => local_view.round_start.elapsed(),
        None => return false,
    };

    if round_elapsed < round_duration.mul_f32(PROPAGATION_ALL) {
        self.peers.first_stage_peers.contains(who) ||
            self.peers.second_stage_peers.contains(who) ||
            self.peers.lucky_light_peers.contains(who)
    } else {
        true
    }
}

Both use mul_f32, which can panic with this error message!

But now here's the weird part; let me list out all of the relevant constants and variables:

const ROUND_DURATION: u32 = 5;
const PROPAGATION_SOME: f32 = 1.5;
const PROPAGATION_ALL: f32 = 3.0;

#[derive(Clone)]
pub struct Config {
    // ...
    pub gossip_duration: Duration,
    // ...
}

// In polkadot:
let gossip_duration = if ... { Duration::from_millis(2000) } else { Duration::from_millis(1000) }

// In substrate:
gossip_duration: std::time::Duration::from_millis(333)

Everything here is positive. So this is really weird; unless I'm missing something, this panic should be impossible to trigger, unless the Config structure gets corrupted, there's a hardware issue, or there's a bug in the kernel that clobbers the floating point registers by mistake (since normally the floating point registers are not saved and reloaded on a context switch).

@polkalegos Can you give us more details as to the system on which you've seen this problem?

  1. Where exactly are you running this? (AWS? GCP? Hetzner? etc.)
  2. What are the hardware specs of the system?
  3. Which distro are you using?
  4. What's the kernel version on which you are running?
  5. Does the system have ECC memory?
  6. Can you do cat /proc/cpuinfo and paste us the output?
  7. Do you rent the whole bare-metal system or do you just rent a VM? If you rent the whole bare-metal machine can you run edac-util -v (you might have to install this package) to check if there were any ECC errors?

@stakeworld
Contributor Author

The original report was on a Hetzner VPS, which I think is very underpowered, so it could be related to hardware/memory problems. Since I moved to dedicated hardware I haven't seen the problem repeat.

@ggwpez
Member

ggwpez commented Oct 4, 2022

(from the Rust docs)

To work around these bugs and platforms not offering monotonic clocks, duration_since, elapsed and sub saturate to zero. In older Rust versions this led to a panic instead. checked_duration_since can be used to detect and handle situations where monotonicity is violated, or Instants are subtracted in the wrong order.

Not sure what role Docker and the VPS play here. If it runs in some QEMU or whatever, maybe that breaks the monotonic clock somehow.

From looking at the code I came to the same conclusion: only mul_f32 would cause this, but it is called with sane constants, so it should be fine.

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@bkchr bkchr closed this as completed Aug 25, 2023