[WIN] Parallel tests execution fails because of locked files #2777

Closed
gshimansky opened this issue Nov 20, 2024 · 15 comments · Fixed by #3041 · May be fixed by #2787
Labels: bug, windows

Comments

@gshimansky
Contributor

Describe the bug

Executing multiple unit tests in parallel on Windows with xdist -n X leads to failures. The most likely cause is that one worker compiles a kernel and starts executing it through a launcher DLL (which has a .pyd extension on Windows) from the ~/.triton/cache folder while another worker compiles the same kernel and tries to write the launcher DLL into the same folder. On Windows, a DLL that is loaded into a process is locked and cannot be modified, so the second worker gets an I/O error when it tries to write the .pyd file and fails. This doesn't happen on Linux because Linux doesn't lock files that are open in running processes.

Any ideas on how to reliably solve this are welcome.

Environment details

Triton on any GPU running on Windows.

@gshimansky added the bug label Nov 20, 2024
@anmyachev
Contributor

Maybe try implementing a pytest session-level fixture with lock files, in which the launcher is built only once?
For ref: https://pytest-xdist.readthedocs.io/en/stable/how-to.html#making-session-scoped-fixtures-execute-only-once
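A minimal sketch of the build-once idea, assuming a directory that all xdist workers can see. The linked pytest-xdist how-to uses the third-party filelock package; this version sticks to the standard library, and `build_once`, `build.lock`, and `build.done` are hypothetical names, not part of Triton or pytest.

```python
import os
import time
from pathlib import Path


def build_once(shared_dir, build, timeout=60.0):
    """Run build(shared_dir) in only one process; the others wait for it."""
    shared_dir = Path(shared_dir)
    shared_dir.mkdir(parents=True, exist_ok=True)
    lock = shared_dir / "build.lock"   # hypothetical file name
    done = shared_dir / "build.done"   # hypothetical file name
    try:
        # O_EXCL makes creation fail if the lock file already exists,
        # so exactly one process gets past this call.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another process is (or was) building: wait for its marker file.
        deadline = time.monotonic() + timeout
        while not done.exists():
            if time.monotonic() > deadline:
                raise TimeoutError(f"build in {shared_dir} did not finish")
            time.sleep(0.05)
        return shared_dir
    os.close(fd)
    build(shared_dir)
    done.touch()  # signal the waiting processes that the build is complete
    return shared_dir
```

In a session-scoped fixture, the shared directory could be `tmp_path_factory.getbasetemp().parent`, as in the linked how-to. A builder that crashes leaves a stale lock file behind, which this sketch does not handle.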

@gshimansky
Contributor Author

Maybe try implementing a pytest session-level fixture with lock files, in which the launcher is built only once? For ref: https://pytest-xdist.readthedocs.io/en/stable/how-to.html#making-session-scoped-fixtures-execute-only-once

It is better to fix Triton itself than to fix the unit tests, because this error may happen in a real scenario if a user attempts to run the same kernel from multiple threads or processes.

@gshimansky
Contributor Author

OK, it looks like the problem is in how os.replace is implemented on Windows. It is not atomic: it consists of two system calls, SetRenameInformationFile and CloseFile, and if another process tries to open the file between them (whether for reading or for writing) it gets a SHARING VIOLATION error. I wrote two small tests that demonstrate this behavior:

# Program 1 (writer): repeatedly creates a new file and renames it over test.txt.
import os

count = 0
while True:
    name = f"{count}.txt"
    with open(name, "w") as f:
        f.write("test\n")

    try:
        os.replace(name, "test.txt")
    except PermissionError:
        print(f"Failed with {name}")

    count += 1

# Program 2 (reader): repeatedly reads test.txt while the writer is running.
from pathlib import Path

count = 0
while True:
    try:
        line = Path("test.txt").read_text()
    except PermissionError:
        print(f"Failed {count}")
    count += 1

When they run together, the second program produces the same errors that we see with the Triton cache.

@gshimansky
Contributor Author

Also keep in mind that the VS Code IDE often opens and locks files because it monitors the filesystem for changes. So if you run the tests above, make sure VS Code (process name Code.exe) is not running.

@gshimansky
Contributor Author

Here is a trace from Process Monitor that demonstrates the contention. Process 95332 renames the temp file in two system calls; process 199284 tries to read it between these calls and gets an error.
image

@alexbaden
Contributor

If you remove the DLL from the equation (maybe by moving the directory?) do you have this issue with the cached IR and generated device code?

@gshimansky
Contributor Author

If you remove the DLL from the equation (maybe by moving the directory?) do you have this issue with the cached IR and generated device code?

The problem happens with all cached files: *.json, *.llir, *.spv, *.ttir, *.ttgir, etc. Any of them can potentially trigger this exception. Here is an example of test failures when running on 16 workers:

========================================================================================= FAILURES =========================================================================================
____________________________________________________________________________ test_bitwise_op[1-int64-uint16-^0] ____________________________________________________________________________
[gw9] win32 -- Python 3.11.10 C:\Users\sdp\miniforge3\envs\triton\python.exe

dtype_x = 'int64', dtype_y = 'uint16', op = '^', num_ctas = 1, device = 'xpu'

    @pytest.mark.interpreter
    @pytest.mark.parametrize("dtype_x, dtype_y, op", [  #
        (dtype_x, dtype_y, op)
        for op in ['&', '|', '^']
        for dtype_x in dtypes + dtypes_with_bfloat16
        for dtype_y in dtypes + dtypes_with_bfloat16
    ])
    @pytest.mark.parametrize("num_ctas", num_ctas_list)
    def test_bitwise_op(dtype_x, dtype_y, op, num_ctas, device):
        expr = f'x {op} y'
        if (dtype_x in uint_dtypes and dtype_y in int_dtypes and _bitwidth(dtype_x) >= _bitwidth(dtype_y)):
            numpy_expr = f'x.astype(np.{dtype_x}) {op} y.astype(np.{dtype_x})'
        elif (dtype_y in uint_dtypes and dtype_x in int_dtypes and _bitwidth(dtype_y) >= _bitwidth(dtype_x)):
            numpy_expr = f'x.astype(np.{dtype_y}) {op} y.astype(np.{dtype_y})'
        else:
            numpy_expr = None
        if 'float' in dtype_x + dtype_y:
            # The CompilationError must have been caused by a C++ exception with this text.
            with pytest.raises(triton.TritonError, match='invalid operands of type'):
                _test_binary(dtype_x, dtype_y, expr, numpy_expr='np.array([])', device=device, num_ctas=num_ctas)
        else:
>           _test_binary(dtype_x, dtype_y, expr, numpy_expr, device=device, num_ctas=num_ctas)

language\test_core.py:642:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
language\test_core.py:427: in _test_binary
    do_test(x, y, kernel)
language\test_core.py:404: in do_test
    kernel_fn[(1, )](z_tri, x_tri, y_tri, SIZE=SIZE, num_warps=4, num_ctas=num_ctas)
..\..\triton\runtime\jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
..\..\triton\runtime\jit.py:623: in run
    kernel = self.compile(
..\..\triton\compiler\compiler.py:310: in compile
    return CompiledKernel(src, metadata_group, hash)
..\..\triton\compiler\compiler.py:375: in __init__
    self.asm = AsmDict({
..\..\triton\compiler\compiler.py:376: in <dictcomp>
    file.suffix[1:]: file.read_bytes() if file.suffix[1:] == binary_ext else file.read_text()
C:\Users\sdp\miniforge3\envs\triton\Lib\pathlib.py:1050: in read_bytes
    with self.open(mode='rb') as f:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = WindowsPath('C:/Users/sdp/.triton/cache/AAL7R62R3IQLI3KYQOZ6AGT5C7FPBZGYWRTH5RC46BA6YPJ4ZUHA/kernel.spv'), mode = 'rb', buffering = -1, encoding = None, errors = None, newline = None

    def open(self, mode='r', buffering=-1, encoding=None,
             errors=None, newline=None):
        """
        Open the file pointed by this path and return a file object, as
        the built-in open() function does.
        """
        if "b" not in mode:
            encoding = io.text_encoding(encoding)
>       return io.open(self, mode, buffering, encoding, errors, newline)
E       PermissionError: [Errno 13] Permission denied: 'C:\\Users\\sdp\\.triton\\cache\\AAL7R62R3IQLI3KYQOZ6AGT5C7FPBZGYWRTH5RC46BA6YPJ4ZUHA\\kernel.spv'

C:\Users\sdp\miniforge3\envs\triton\Lib\pathlib.py:1044: PermissionError
___________________________________________________________________________ test_bitwise_op[1-uint32-uint64-^2] ____________________________________________________________________________
[gw10] win32 -- Python 3.11.10 C:\Users\sdp\miniforge3\envs\triton\python.exe

dtype_x = 'uint32', dtype_y = 'uint64', op = '^', num_ctas = 1, device = 'xpu'

    @pytest.mark.interpreter
    @pytest.mark.parametrize("dtype_x, dtype_y, op", [  #
        (dtype_x, dtype_y, op)
        for op in ['&', '|', '^']
        for dtype_x in dtypes + dtypes_with_bfloat16
        for dtype_y in dtypes + dtypes_with_bfloat16
    ])
    @pytest.mark.parametrize("num_ctas", num_ctas_list)
    def test_bitwise_op(dtype_x, dtype_y, op, num_ctas, device):
        expr = f'x {op} y'
        if (dtype_x in uint_dtypes and dtype_y in int_dtypes and _bitwidth(dtype_x) >= _bitwidth(dtype_y)):
            numpy_expr = f'x.astype(np.{dtype_x}) {op} y.astype(np.{dtype_x})'
        elif (dtype_y in uint_dtypes and dtype_x in int_dtypes and _bitwidth(dtype_y) >= _bitwidth(dtype_x)):
            numpy_expr = f'x.astype(np.{dtype_y}) {op} y.astype(np.{dtype_y})'
        else:
            numpy_expr = None
        if 'float' in dtype_x + dtype_y:
            # The CompilationError must have been caused by a C++ exception with this text.
            with pytest.raises(triton.TritonError, match='invalid operands of type'):
                _test_binary(dtype_x, dtype_y, expr, numpy_expr='np.array([])', device=device, num_ctas=num_ctas)
        else:
>           _test_binary(dtype_x, dtype_y, expr, numpy_expr, device=device, num_ctas=num_ctas)

language\test_core.py:642:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
language\test_core.py:436: in _test_binary
    do_test(x[:1].reshape(()), y, kernel_broadcast_lhs)
language\test_core.py:404: in do_test
    kernel_fn[(1, )](z_tri, x_tri, y_tri, SIZE=SIZE, num_warps=4, num_ctas=num_ctas)
..\..\triton\runtime\jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
..\..\triton\runtime\jit.py:623: in run
    kernel = self.compile(
..\..\triton\compiler\compiler.py:310: in compile
    return CompiledKernel(src, metadata_group, hash)
..\..\triton\compiler\compiler.py:375: in __init__
    self.asm = AsmDict({
..\..\triton\compiler\compiler.py:376: in <dictcomp>
    file.suffix[1:]: file.read_bytes() if file.suffix[1:] == binary_ext else file.read_text()
C:\Users\sdp\miniforge3\envs\triton\Lib\pathlib.py:1050: in read_bytes
    with self.open(mode='rb') as f:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = WindowsPath('C:/Users/sdp/.triton/cache/B7LLTTZFRRBM5QE4A2D7Z5A4CZK6J2YDIPQ3BI3DUVXHKMKPMCHA/kernel_broadcast_lhs.spv'), mode = 'rb', buffering = -1, encoding = None, errors = None, newline = None

    def open(self, mode='r', buffering=-1, encoding=None,
             errors=None, newline=None):
        """
        Open the file pointed by this path and return a file object, as
        the built-in open() function does.
        """
        if "b" not in mode:
            encoding = io.text_encoding(encoding)
>       return io.open(self, mode, buffering, encoding, errors, newline)
E       PermissionError: [Errno 13] Permission denied: 'C:\\Users\\sdp\\.triton\\cache\\B7LLTTZFRRBM5QE4A2D7Z5A4CZK6J2YDIPQ3BI3DUVXHKMKPMCHA\\kernel_broadcast_lhs.spv'

C:\Users\sdp\miniforge3\envs\triton\Lib\pathlib.py:1044: PermissionError
===================================================================================== warnings summary =====================================================================================
test/unit/language/test_core.py: 95 warnings
  C:\b\tr\python\test\unit\language\test_core.py:394: RuntimeWarning: overflow encountered in cast
    z_ref = z_ref.astype(dtype_z)

test/unit/language/test_core.py::test_math_op[float32-sqrt-x]
test/unit/language/test_core.py::test_math_op[float64-sqrt-x]
  <string>:1: RuntimeWarning: invalid value encountered in sqrt

test/unit/language/test_core.py::test_scan2d[cummax-float32-shape298-0-False-4]
test/unit/language/test_core.py::test_scan2d[cummax-float32-shape304-0-False-4]
test/unit/language/test_core.py::test_scan2d[cummax-float32-shape718-1-False-16]
  C:\b\tr\python\test\unit\language\test_core.py:2454: RuntimeWarning: invalid value encountered in cast
    z = z.astype(np.int64)

test/unit/language/test_core.py: 30 warnings
  C:\b\tr\python\test\unit\language\test_core.py:1491: RuntimeWarning: overflow encountered in scalar negative
    x[idx] = -np.max(np.abs(x)) - 1

test/unit/language/test_core.py::test_shapes_as_params
  C:\b\tr\python\triton\language\core.py:1515: UserWarning: view is deprecated, please use reshape with can_reorder being true.
    warn("view is deprecated, please use reshape with can_reorder being true.")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================================= short test summary info ==================================================================================
FAILED language\test_core.py::test_bitwise_op[1-int64-uint16-^0] - PermissionError: [Errno 13] Permission denied: 'C:\\Users\\sdp\\.triton\\cache\\AAL7R62R3IQLI3KYQOZ6AGT5C7FPBZGYWRTH5RC46BA6YPJ4ZUHA\\kernel.spv'
FAILED language\test_core.py::test_bitwise_op[1-uint32-uint64-^2] - PermissionError: [Errno 13] Permission denied: 'C:\\Users\\sdp\\.triton\\cache\\B7LLTTZFRRBM5QE4A2D7Z5A4CZK6J2YDIPQ3BI3DUVXHKMKPMCHA\\kernel_broadcast_lhs.spv'
================================================== 2 failed, 11426 passed, 1304 skipped, 571 xfailed, 131 warnings in 1464.87s (0:24:24) ===================================================

@whitneywhtsang
Contributor

Can we add a lock for os.replace at https://github.com/intel/intel-xpu-backend-for-triton/blob/main/python/triton/runtime/cache.py#L134, to ensure atomic behavior on Windows?

@gshimansky
Contributor Author

Can we add a lock for os.replace at https://github.com/intel/intel-xpu-backend-for-triton/blob/main/python/triton/runtime/cache.py#L134, to ensure atomic behavior on Windows?

Yes, file locking is one possible approach. The problem is that adding a lock in just this one place is not enough. We also need to guard read access to cached files with locks, because currently the exception happens when another process tries to open the file for reading. There is more than one read location, and we need to find all of them.
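To illustrate why both sides need the lock, here is a rough standard-library sketch in which the writer's os.replace and every reader's open go through the same lock file. The names `path_lock`, `publish`, and `read_cached` are hypothetical and do not exist in Triton's cache.py; the sketch also serializes readers and does not recover from a stale lock left by a crashed process.

```python
import contextlib
import os
import time


@contextlib.contextmanager
def path_lock(target, timeout=10.0, poll=0.005):
    """Hold an exclusive lock file next to `target` while the block runs."""
    lock_path = f"{target}.lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Only one process can create the lock file at a time.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except (FileExistsError, PermissionError):
            # Lock is held (or is being deleted on Windows): wait and retry.
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not lock {target}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def publish(tmp_path, final_path):
    """Writer side: move the finished temp file into place under the lock."""
    with path_lock(final_path):
        os.replace(tmp_path, final_path)


def read_cached(final_path):
    """Reader side: take the same lock so the read cannot overlap a rename."""
    with path_lock(final_path):
        with open(final_path, "rb") as f:
            return f.read()
```

If any read location skips `path_lock`, it can still land in the window between the two rename system calls, which is why every read site would have to be found and converted.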

@pbchekin
Contributor

Interestingly, there are no such errors in this run: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12355194841/job/34478248043. The only PermissionError I noticed is #3019, which is simple to fix.

@gshimansky
Contributor Author

Interestingly, there are no such errors in this run: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12355194841/job/34478248043. The only PermissionError I noticed is #3019, which is simple to fix.

These exceptions are pretty rare: they happen about 2-4 times per 10k tests in the core test suite when I run it on 16 workers. You used only 2 workers and were probably just lucky this time.

@pbchekin
Contributor

Interestingly, there are no such errors in this run: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12355194841/job/34478248043. The only PermissionError I noticed is #3019, which is simple to fix.

These exceptions are pretty rare: they happen about 2-4 times per 10k tests in the core test suite when I run it on 16 workers. You used only 2 workers and were probably just lucky this time.

Right, I am re-running with 16 workers, trying to reproduce.

@pbchekin
Contributor

Reproduced in https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12361948103.

PermissionError: [WinError 5] Access is denied: 'C:\\Users\\vagrant\\.triton\\cache\\64ML3ZM6GIJOPBOHKJ7BC3NE5FSANALHDMOW36QYF4ILQMPA2BUA\\tmp.pid_26440_47b27b34-b36a-4e1c-911c-7666b42124c5\\kernel_broadcast_rhs.ttir' -> 'C:\\Users\\vagrant\\.triton\\cache\\64ML3ZM6GIJOPBOHKJ7BC3NE5FSANALHDMOW36QYF4ILQMPA2BUA\\kernel_broadcast_rhs.ttir'

@gshimansky
Contributor Author

Reproduced in https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12361948103.

PermissionError: [WinError 5] Access is denied: 'C:\\Users\\vagrant\\.triton\\cache\\64ML3ZM6GIJOPBOHKJ7BC3NE5FSANALHDMOW36QYF4ILQMPA2BUA\\tmp.pid_26440_47b27b34-b36a-4e1c-911c-7666b42124c5\\kernel_broadcast_rhs.ttir' -> 'C:\\Users\\vagrant\\.triton\\cache\\64ML3ZM6GIJOPBOHKJ7BC3NE5FSANALHDMOW36QYF4ILQMPA2BUA\\kernel_broadcast_rhs.ttir'

Here you are getting an error when two processes contend on the os.replace call. This is a much more common error; for me it happens in hundreds of cases when running on multiple workers. I've fixed it in this patch: https://github.com/intel/intel-xpu-backend-for-triton/pull/2787/files. However, the patch doesn't fix the case where one worker executes os.replace and another tries to open the same file for reading.
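One generic way to tolerate contention on the rename itself is to retry os.replace for a short time. This is only an illustrative sketch and is not necessarily what the linked patch does; `replace_with_retry` and its parameters are hypothetical names, and the retry counts are arbitrary.

```python
import os
import time


def replace_with_retry(src, dst, attempts=20, delay=0.01):
    """Call os.replace, retrying while the target is reported as busy."""
    for attempt in range(attempts):
        try:
            os.replace(src, dst)
            return
        except PermissionError:
            # On Windows this is raised when another process holds the file
            # open or is renaming onto the same target at the same moment.
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```

A retry like this only helps the process calling os.replace; a reader that opens the file during the rename window can still fail, which matches the limitation described above.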

@pbchekin
Contributor

pbchekin commented Dec 18, 2024
