
Crash when using linux+asan+static-swift-stdlib+swift 5.8 #1118

@mannuch

Hello!

I ran into an issue when running my service in a release configuration on Linux via Docker.

After some digging, I believe I've isolated the issue to cluster initialization. I can reproduce it with a simple main.swift:

import DistributedCluster

let clusterSystem = await ClusterSystem("TestRunCluster")
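// Initializing the ClusterSystem above is enough to trigger the crash; sleep briefly to keep the process alive while it happens.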
try await Task.sleep(for: .seconds(5))

When running with Backtrace installed, I get the following:

Received signal 11. Backtrace:
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] ClusterSystem [TestRunCluster] initialized, listening on: sact://TestRunCluster@127.0.0.1:7337: _ActorRef<ClusterShell.Message>(/system/cluster)
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .autoLeaderElection: LeadershipSelectionSettings(underlying: DistributedCluster.ClusterSystemSettings.LeadershipSelectionSettings.(unknown context at $aaaad5a3b1dc)._LeadershipSelectionSettings.lowestReachable(minNumberOfMembers: 2))
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .downingStrategy: DowningStrategySettings(underlying: DistributedCluster.DowningStrategySettings.(unknown context at $aaaad5a3979c)._DowningStrategySettings.timeout(DistributedCluster.TimeoutBasedDowningStrategySettings(downUnreachableMembersAfter: 1.0 seconds)))
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .onDownAction: OnDownActionStrategySettings(underlying: DistributedCluster.OnDownActionStrategySettings.(unknown context at $aaaad5a3971c)._OnDownActionStrategySettings.gracefulShutdown(delay: 3.0 seconds))
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Binding to: [sact://TestRunCluster@127.0.0.1:7337]
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster/leadership cluster/node=sact://TestRunCluster@127.0.0.1:7337 leadership/election=DistributedCluster.Leadership.LowestReachableMember [DistributedCluster] Not enough members [1/2] to run election, members: [Member(sact://TestRunCluster:2481186327279040895@127.0.0.1:7337, status: joining, reachability: reachable)]
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Bound to [IPv4]127.0.0.1/127.0.0.1:7337
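For context, "Backtrace installed" above means the crash handler from swift-server/swift-backtrace is set up at the top of main.swift before the cluster is created; a minimal sketch, assuming the Backtrace package is already a dependency:

import Backtrace

// Install the signal handler so a fatal signal (like this signal 11) prints a backtrace.
Backtrace.install()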

Since the backtrace only reported a signal 11, I tried AddressSanitizer to see if I could get more information, which gave me:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==1==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0xffff819de570 sp 0xffff819de560 T3)
==1==Hint: pc points to the zero page.
==1==The signal is caused by a READ memory access.
==1==Hint: address points to the zero page.
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] ClusterSystem [TestRunCluster] initialized, listening on: sact://TestRunCluster@127.0.0.1:7337: _ActorRef<ClusterShell.Message>(/system/cluster)
    #0 0x0  (<unknown module>)
    #1 0xaaaac2c62014  (/CrashingCluster+0x1e82014)
    #2 0xaaaac2c62754  (/CrashingCluster+0x1e82754)
    #3 0xaaaac2c2008c  (/CrashingCluster+0x1e4008c)
    #4 0xaaaac2c1fdf4  (/CrashingCluster+0x1e3fdf4)
    #5 0xaaaac2c2c098  (/CrashingCluster+0x1e4c098)
    #6 0xffff85f7d5c4  (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #7 0xffff85fe5d18  (/lib/aarch64-linux-gnu/libc.so.6+0xe5d18) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)

AddressSanitizer can not provide additional info.
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .autoLeaderElection: LeadershipSelectionSettings(underlying: DistributedCluster.ClusterSystemSettings.LeadershipSelectionSettings.(unknown context at $aaaac374b1dc)._LeadershipSelectionSettings.lowestReachable(minNumberOfMembers: 2))
SUMMARY: AddressSanitizer: SEGV (<unknown module>)
Thread T3 created by T1 here:
    #0 0xaaaac149fb68  (/CrashingCluster+0x6bfb68)
    #1 0xaaaac2c28478  (/CrashingCluster+0x1e48478)
    #2 0xaaaac2c2b694  (/CrashingCluster+0x1e4b694)
    #3 0xaaaac2c24c04  (/CrashingCluster+0x1e44c04)
    #4 0xaaaac2c2c098  (/CrashingCluster+0x1e4c098)
    #5 0xffff85f7d5c4  (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #6 0xffff85fe5d18  (/lib/aarch64-linux-gnu/libc.so.6+0xe5d18) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)

Thread T1 created by T0 here:
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .downingStrategy: DowningStrategySettings(underlying: DistributedCluster.DowningStrategySettings.(unknown context at $aaaac374979c)._DowningStrategySettings.timeout(DistributedCluster.TimeoutBasedDowningStrategySettings(downUnreachableMembersAfter: 1.0 seconds)))
    #0 0xaaaac149fb68  (/CrashingCluster+0x6bfb68)
    #1 0xaaaac2c28478  (/CrashingCluster+0x1e48478)
    #2 0xaaaac2c634cc  (/CrashingCluster+0x1e834cc)
    #3 0xaaaac2c6293c  (/CrashingCluster+0x1e8293c)
    #4 0xaaaac2c62014  (/CrashingCluster+0x1e82014)
    #5 0xaaaac2c62754  (/CrashingCluster+0x1e82754)
    #6 0xaaaac18ce5b4  (/CrashingCluster+0xaee5b4)
    #7 0xffff85f273f8  (/lib/aarch64-linux-gnu/libc.so.6+0x273f8) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #8 0xffff85f274c8  (/lib/aarch64-linux-gnu/libc.so.6+0x274c8) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #9 0xaaaac143efac  (/CrashingCluster+0x65efac)

2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .onDownAction: OnDownActionStrategySettings(underlying: DistributedCluster.OnDownActionStrategySettings.(unknown context at $aaaac374971c)._OnDownActionStrategySettings.gracefulShutdown(delay: 3.0 seconds))
==1==ABORTING

As far as I can tell, the problem only arises when running on Linux, built and run with this Dockerfile:

# ================================
# Build image
# ================================
FROM swift:5.8-jammy as builder

RUN mkdir /workspace
WORKDIR /workspace

COPY . /workspace

RUN swift build --sanitize=address -c release -Xswiftc -g --static-swift-stdlib
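# The build above uses AddressSanitizer, a release configuration, debug info (-Xswiftc -g), and a statically linked Swift stdlib.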

# ================================
# Run image
# ================================
FROM ubuntu:jammy

COPY --from=builder /workspace/.build/release/CrashingCluster /

EXPOSE 7337

ENTRYPOINT ["./CrashingCluster"]

This reproduction, along with the Dockerfile, can be found in this repo, if it helps.

Thanks for all the work on this!
