First ping/UDP packet to destination fails with EHOSTDOWN #5

iangudger · 2018-05-02T06:12:17Z

The first ping or UDP packet sent to a destination will fail with EHOSTDOWN (host is down).

gvisor/pkg/tcpip/transport/ping/endpoint.go

Lines 256 to 264 in 797cda3

    
           if err == tcpip.ErrWouldBlock { 
        
           	// Link address needs to be resolved. Resolution was triggered the 
        
           	// background. Better luck next time. 
        
           	// 
        
           	// TODO: queue up the request and send after link address 
        
           	// is resolved. 
        
           	route.RemoveWaker(waker) 
        
           	return 0, tcpip.ErrNoLinkAddress 
        
           }

gvisor/pkg/tcpip/transport/udp/endpoint.go

Lines 289 to 297 in 797cda3

    
           if err == tcpip.ErrWouldBlock { 
        
           	// Link address needs to be resolved. Resolution was triggered the background. 
        
           	// Better luck next time. 
        
           	// 
        
           	// TODO: queue up the request and send after link address 
        
           	// is resolved. 
        
           	route.RemoveWaker(waker) 
        
           	return 0, tcpip.ErrNoLinkAddress 
        
           }

glibc's malloc also uses SYS_TIME. Permit it. #0 0x0000000000de6267 in time () #1 0x0000000000db19d8 in get_nprocs () #2 0x0000000000d8a31a in arena_get2.part () #3 0x0000000000d8ab4a in malloc () #4 0x0000000000d3c6b5 in __sanitizer::InternalAlloc(unsigned long, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 140737488355328ull, 0ul, __sanitizer::SizeClassMap<3ul, 4ul, 8ul, 17ul, 64ul, 14ul>, 20ul, __sanitizer::TwoLevelByteMap<32768ull, 4096ull, __sanitizer::NoOpMapUnmapCallback>, __sanitizer::NoOpMapUnmapCallback> >*, unsigned long) () #5 0x0000000000d4cd70 in __tsan_go_start () #6 0x00000000004617a3 in racecall () #7 0x00000000010f4ea0 in runtime.findfunctab () #8 0x000000000043f193 in runtime.racegostart () Signed-off-by: Dmitry Vyukov <dvyukov@google.com> [mpratt@google.com: updated comments and commit message] Signed-off-by: Michael Pratt <mpratt@google.com> Change-Id: Ibe2d0dc3035bf5052d5fb802cfaa37c5e0e7a09a PiperOrigin-RevId: 203042627

glibc's malloc also uses SYS_TIME. Permit it. #0 0x0000000000de6267 in time () google#1 0x0000000000db19d8 in get_nprocs () google#2 0x0000000000d8a31a in arena_get2.part () google#3 0x0000000000d8ab4a in malloc () google#4 0x0000000000d3c6b5 in __sanitizer::InternalAlloc(unsigned long, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 140737488355328ull, 0ul, __sanitizer::SizeClassMap<3ul, 4ul, 8ul, 17ul, 64ul, 14ul>, 20ul, __sanitizer::TwoLevelByteMap<32768ull, 4096ull, __sanitizer::NoOpMapUnmapCallback>, __sanitizer::NoOpMapUnmapCallback> >*, unsigned long) () google#5 0x0000000000d4cd70 in __tsan_go_start () google#6 0x00000000004617a3 in racecall () google#7 0x00000000010f4ea0 in runtime.findfunctab () google#8 0x000000000043f193 in runtime.racegostart () Signed-off-by: Dmitry Vyukov <dvyukov@google.com> [mpratt@google.com: updated comments and commit message] Signed-off-by: Michael Pratt <mpratt@google.com> Change-Id: Ibe2d0dc3035bf5052d5fb802cfaa37c5e0e7a09a PiperOrigin-RevId: 203042627

Previously, if address resolution for UDP or Ping sockets required sending packets using Write in Transport layer, Resolve would return ErrWouldBlock and Write would return ErrNoLinkAddress. Meanwhile startAddressResolution would run in background. Further calls to Write using same address would also return ErrNoLinkAddress until resolution has been completed successfully. Since Write is not allowed to block and System Calls need to be interruptible in System Call layer, the caller to Write is responsible for blocking upon return of ErrWouldBlock. Now, when startAddressResolution is called a notification channel for the completion of the address resolution is returned. The channel will traverse up to the calling function of Write as well as ErrNoLinkAddress. Once address resolution is complete (success or not) the channel is closed. The caller would call Write again to send packets and check if address resolution was compeleted successfully or not. Fixes #5 Change-Id: Idafaf31982bee1915ca084da39ae7bd468cebd93 PiperOrigin-RevId: 214962200

glibc's malloc also uses SYS_TIME. Permit it. #0 0x0000000000de6267 in time () #1 0x0000000000db19d8 in get_nprocs () #2 0x0000000000d8a31a in arena_get2.part () #3 0x0000000000d8ab4a in malloc () google#4 0x0000000000d3c6b5 in __sanitizer::InternalAlloc(unsigned long, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 140737488355328ull, 0ul, __sanitizer::SizeClassMap<3ul, 4ul, 8ul, 17ul, 64ul, 14ul>, 20ul, __sanitizer::TwoLevelByteMap<32768ull, 4096ull, __sanitizer::NoOpMapUnmapCallback>, __sanitizer::NoOpMapUnmapCallback> >*, unsigned long) () google#5 0x0000000000d4cd70 in __tsan_go_start () google#6 0x00000000004617a3 in racecall () google#7 0x00000000010f4ea0 in runtime.findfunctab () google#8 0x000000000043f193 in runtime.racegostart () Signed-off-by: Dmitry Vyukov <dvyukov@google.com> [mpratt@google.com: updated comments and commit message] Signed-off-by: Michael Pratt <mpratt@google.com> Change-Id: Ibe2d0dc3035bf5052d5fb802cfaa37c5e0e7a09a PiperOrigin-RevId: 203042627 Upstream-commit: 6144751

Previously, if address resolution for UDP or Ping sockets required sending packets using Write in Transport layer, Resolve would return ErrWouldBlock and Write would return ErrNoLinkAddress. Meanwhile startAddressResolution would run in background. Further calls to Write using same address would also return ErrNoLinkAddress until resolution has been completed successfully. Since Write is not allowed to block and System Calls need to be interruptible in System Call layer, the caller to Write is responsible for blocking upon return of ErrWouldBlock. Now, when startAddressResolution is called a notification channel for the completion of the address resolution is returned. The channel will traverse up to the calling function of Write as well as ErrNoLinkAddress. Once address resolution is complete (success or not) the channel is closed. The caller would call Write again to send packets and check if address resolution was compeleted successfully or not. Fixes google#5 Change-Id: Idafaf31982bee1915ca084da39ae7bd468cebd93 PiperOrigin-RevId: 214962200 Upstream-commit: c17ea8c

Below command under hostinet network will lead to panic: $ cat /proc/net/tcp It's caused by the wrong SizeOfTCPInfo. #0 runtime.panicindex() google#1 encoding/binary.littleEndian.Uint64 google#2 encoding/binary.(*littleEndian).Uint64 google#3 gvisor.dev/gvisor/pkg/binary.unmarshal google#4 gvisor.dev/gvisor/pkg/binary.unmarshal google#5 gvisor.dev/gvisor/pkg/binary.Unmarshal google#6 gvisor.dev/gvisor/pkg/sentry/socket/hostinet.(*socketOperations).State google#7 gvisor.dev/gvisor/pkg/sentry/fs/proc.(*netTCP).ReadSeqFileData Correct SizeOfTCPInfo from 104 to 192 to fix it. Fixes google#640 Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>

* Fix "Read the docs" to use consistent case with the rest of the callouts at the bottom of the page. * Add a comma to the Zero Configuration callout to make it read better. * Remove a few extra spaces from the HTML.

fuse: Fix FUSE_READDIR offset issue

Distributed training isn't working with PyTorch on certain A100 nodes. Adds the missing ioctl `UVM_UNMAP_EXTERNAL` allowing for certain NCCL operations to succeed when using [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html), fixing distributed training. ## Reproduction This affects numerous A100 40GB and 80GB instances in our fleet. This reproduction requires 4 A100 GPUs, either 40GB or 80GB. - **NVIDIA Driver Version**: 550.54.15 - **CUDA Version**: 12.4 - **NVIDIA device**: NVIDIA A100 80GB PCIe ### Steps 1. **Install gvisor** ```bash URL="https://storage.googleapis.com/gvisor/releases/master/latest/${ARCH}" wget -nc "${URL}/runsc" "${URL}/runsc.sha512" chmod +x runsc sudo cp runsc /usr/local/bin/runsc sudo /usr/local/bin/runsc install sudo systemctl reload docker ``` 2. **Add GPU enabling gvisor options** ```json { "runtimes": { "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] }, "runsc": { "path": "/usr/local/bin/runsc", "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"] } } } ``` Reload configs with `sudo systemctl reload docker`. 3. **Run reproduction NCCL test** This test creates one main process and N peer processes. Each peer process sends a torch `Tensor` to the main process using NCCL. ```Dockerfile # Dockerfile FROM python:3.9.15-slim-bullseye RUN pip install torch numpy COPY <<EOF repro.py import argparse import datetime import os import torch import torch.distributed as dist import torch.multiprocessing as mp def setup(rank, world_size): os.environ["MASTER_ADDR"] = "localhost" os.environ["MASTER_PORT"] = "12355" dist.init_process_group("nccl", rank=rank, world_size=world_size, timeout=datetime.timedelta(seconds=600)) torch.cuda.set_device(rank) def cleanup(): dist.destroy_process_group() def send_tensor(rank, world_size): try: setup(rank, world_size) # rank receiving all tensors target_rank = world_size - 1 dist.barrier() tensor = torch.ones(5).cuda(rank) if rank < target_rank: print(f"[RANK {rank}] sending tensor: {tensor}") dist.send(tensor=tensor, dst=target_rank) elif rank == target_rank: for other_rank in range(target_rank): tensor = torch.zeros(5).cuda(target_rank) dist.recv(tensor=tensor, src=other_rank) print(f"[RANK {target_rank}] received tensor from rank={other_rank}: {tensor}") print("PASS: NCCL working.") except Exception as e: print(f"[RANK {rank}] error in send_tensor: {e}") raise finally: cleanup() def main(world_size: int = 2): mp.spawn(send_tensor, args=(world_size,), nprocs=world_size, join=True) if __name__ == "__main__": parser = argparse.ArgumentParser(description="Run torch-based NCCL tests") parser.add_argument("world_size", type=int, help="number of GPUs to run test on") args = parser.parse_args() if args.world_size < 2: raise RuntimeError(f"world_size needs to be larger than 1 {args.world_size}") main(args.world_size) EOF ENTRYPOINT ["python", "repro.py", "4"] ``` Build image with: ``` docker build -f Dockerfile . ``` Then run it with: ``` sudo docker run -it --shm-size=2.00gb --runtime=runsc --gpus='"device=GPU-742ea7fc-dd4f-612c-e860-499bf200a815,GPU-94a801d8-7713-acf6-337d-338b7cfdf19e,GPU-0d19cef2-10ce-e445-a0be-3d330e36c1fd,GPU-ac5046fb-020c-93e8-2784-f44aedbc5bbd"' 040a44863fb1 ``` #### Failure (truncated) ``` ... Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7edda14cf897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x5b3a23e (0x7edd8d73a23e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7edd8d734c87 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7edd8d734f82 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7edd8d735fd1 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7edd54da9189 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7edd54db0610 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #10: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7edd54dcf978 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #11: <unknown function> + 0x5adc309 (0x7edd8d6dc309 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #12: <unknown function> + 0x5ae6f10 (0x7edd8d6e6f10 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #13: <unknown function> + 0x5ae6fa5 (0x7edd8d6e6fa5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #14: <unknown function> + 0x5124446 (0x7edd8cd24446 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #15: <unknown function> + 0x1acf4b8 (0x7edd896cf4b8 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #16: <unknown function> + 0x5aee004 (0x7edd8d6ee004 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #17: <unknown function> + 0x5af36b5 (0x7edd8d6f36b5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so) frame #18: <unknown function> + 0xd2fe8e (0x7edda032fe8e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so) frame #19: <unknown function> + 0x47f074 (0x7edd9fa7f074 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so) <omitting python frames> frame #35: <unknown function> + 0x29d90 (0x7edda2029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #36: __libc_start_main + 0x80 (0x7edda2029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #37: <unknown function> + 0x108e (0x55f950b0c08e in /usr/local/bin/python) . This may indicate a possible application crash on rank 0 or a network set up issue. ... ``` ### Fix gvisor debug logs show: ``` W0702 20:36:17.577055 445833 uvm.go:148] [ 22: 84] nvproxy: unknown uvm ioctl 66 = 0x42 ``` I've implemented that ioctl in this PR. This is the output after the fix. ``` [RANK 2] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:2') [RANK 0] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:0') [RANK 1] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:1') [RANK 3] received tensor from rank=0: tensor([1., 1., 1., 1., 1.], device='cuda:3') [RANK 3] received tensor from rank=1: tensor([1., 1., 1., 1., 1.], device='cuda:3') [RANK 3] received tensor from rank=2: tensor([1., 1., 1., 1., 1.], device='cuda:3') PASS: NCCL working. ``` FUTURE_COPYBARA_INTEGRATE_REVIEW=#10610 from luiscape:master ee88734 PiperOrigin-RevId: 649146570

fvoznika added the good first issue Good for newcomers label Jun 23, 2018

shentubot closed this as completed in google/netstack@f33542f Sep 28, 2018

tanjianfeng mentioned this issue Aug 2, 2019

fix wrong SizeOfTCPInfo #643

Closed

craig08 pushed a commit to craig08/gvisor that referenced this issue Aug 20, 2020

Merge pull request google#5 from craig08/fuse-readdir-fix-offset

b939b6e

fuse: Fix FUSE_READDIR offset issue

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

First ping/UDP packet to destination fails with EHOSTDOWN #5

First ping/UDP packet to destination fails with EHOSTDOWN #5

iangudger commented May 2, 2018 •

edited

Loading

First ping/UDP packet to destination fails with EHOSTDOWN #5

First ping/UDP packet to destination fails with EHOSTDOWN #5

Comments

iangudger commented May 2, 2018 • edited Loading

iangudger commented May 2, 2018 •

edited

Loading