Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add numerical stabilization for difference of exponentials #1399

Merged
merged 4 commits into from
Jan 31, 2023

Conversation

Smit-create
Copy link
Member

@Smit-create Smit-create commented Jan 20, 2023

Closes #1303.

@Smit-create
Copy link
Member Author

I'm unable to test the changes locally because of some lock issues in macOS. If I run the following command:

% pytest -v tests/tensor/test_math.py::test_log1mexp_grad_lim
Timeout Error

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <filelock._unix.UnixFileLock object at 0x1368913c0>, timeout = 120
poll_interval = 0.05

    def acquire(
        self,
        timeout: float | None = None,
        poll_interval: float = 0.05,
        *,
        poll_intervall: float | None = None,
        blocking: bool = True,
    ) -> AcquireReturnProxy:
        """
        Try to acquire the file lock.
    
        :param timeout: maximum wait time for acquiring the lock, ``None`` means use the default :attr:`~timeout` is and
         if ``timeout < 0``, there is no timeout and this method will block until the lock could be acquired
        :param poll_interval: interval of trying to acquire the lock file
        :param poll_intervall: deprecated, kept for backwards compatibility, use ``poll_interval`` instead
        :param blocking: defaults to True. If False, function will return immediately if it cannot obtain a lock on the
         first attempt. Otherwise this method will block until the timeout expires or the lock is acquired.
        :raises Timeout: if fails to acquire lock within the timeout period
        :return: a context object that will unlock the file when the context is exited
    
        .. code-block:: python
    
            # You can use this method in the context manager (recommended)
            with lock.acquire():
                pass
    
            # Or use an equivalent try-finally construct:
            lock.acquire()
            try:
                pass
            finally:
                lock.release()
    
        .. versionchanged:: 2.0.0
    
            This method returns now a *proxy* object instead of *self*,
            so that it can be used in a with statement without side effects.
    
        """
        # Use the default timeout, if no timeout is provided.
        if timeout is None:
            timeout = self.timeout
    
        if poll_intervall is not None:
            msg = "use poll_interval instead of poll_intervall"
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            poll_interval = poll_intervall
    
        # Increment the number right at the beginning. We can still undo it, if something fails.
        with self._thread_lock:
            self._lock_counter += 1
    
        lock_id = id(self)
        lock_filename = self._lock_file
        start_time = time.monotonic()
        try:
            while True:
                with self._thread_lock:
                    if not self.is_locked:
                        _LOGGER.debug("Attempting to acquire lock %s on %s", lock_id, lock_filename)
                        self._acquire()
    
                if self.is_locked:
                    _LOGGER.debug("Lock %s acquired on %s", lock_id, lock_filename)
                    break
                elif blocking is False:
                    _LOGGER.debug("Failed to immediately acquire lock %s on %s", lock_id, lock_filename)
                    raise Timeout(self._lock_file)
                elif 0 <= timeout < time.monotonic() - start_time:
                    _LOGGER.debug("Timeout on acquiring lock %s on %s", lock_id, lock_filename)
>                   raise Timeout(self._lock_file)
E                   filelock._error.Timeout: The file lock 'The file lock '/Users/thebigbool/.aesara/compiledir_macOS-13.0-x86_64-i386-64bit-i386-3.10.8-64/.lock' could not be acquired.
E                   Apply node that caused the error: Elemwise{Composite{Switch(IsInf(Composite{(i0 / expm1((-i1)))}(i0, i1)), i2, Composite{(i0 / expm1((-i1)))}(i0, i1))}}(TensorConstant{-1.0}, x, TensorConstant{-inf})
E                   Toposort index: 0
E                   Inputs types: [TensorType(float64, ()), TensorType(float64, ()), TensorType(float32, ())]
E                   
E                   HINT: Use a linker other than the C linker to print the inputs' shapes and strides.
E                   HINT: Re-running with most Aesara optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the Aesara flag 'optimizer=fast_compile'. If that does not work, Aesara optimizations can be disabled with 'optimizer=None'.
E                   HINT: Use the Aesara flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.' could not be acquired.

../../opt/anaconda3/envs/aesara-dev/lib/python3.10/site-packages/filelock/_api.py:183: Timeout
============================= slowest 50 durations =============================
120.03s call     tests/tensor/test_math.py::test_log1mexp_grad_lim

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
=========================== short test summary info ============================
FAILED tests/tensor/test_math.py::test_log1mexp_grad_lim - filelock._error.Timeout: The file lock 'The file lock '/Users/thebigbool/.a...
======================== 1 failed in 140.14s (0:02:20) =========================

@Smit-create
Copy link
Member Author

Seems like some OS issue as the tests already run on CI. Any pointers on this @brandonwillard?

@brandonwillard
Copy link
Member

Seems like some OS issue as the tests already run on CI. Any pointers on this @brandonwillard?

I would first check that ~/.aesara—or wherever your Aesara cache directory (i.e. base_compiledir) is located—isn't pointing to a remote directory or something similar that could trip up filelock.

Does this happen literally every time something is compiled in Aesara, or only when that particular test is run?

tests/tensor/test_math.py Outdated Show resolved Hide resolved
aesara/tensor/rewriting/math.py Outdated Show resolved Hide resolved
aesara/tensor/rewriting/math.py Outdated Show resolved Hide resolved
aesara/tensor/rewriting/math.py Outdated Show resolved Hide resolved
@Smit-create
Copy link
Member Author

Does this happen literally every time something is compiled in Aesara, or only when that particular test is run?

This happens to me whenever I try to run the tests.

@brandonwillard
Copy link
Member

Does this happen literally every time something is compiled in Aesara, or only when that particular test is run?

This happens to me whenever I try to run the tests.

Very interesting! Don't fix anything just yet; I would like to get on a call first and check it out. If it's an Aesara bug, this would be a great opportunity to squash it.

In the meantime, you can always test things using the Python backend by setting the Aesara cxx setting to an empty string.

@codecov
Copy link

codecov bot commented Jan 26, 2023

Codecov Report

Merging #1399 (8cbed4d) into main (b64cb85) will increase coverage by 0.00%.
The diff coverage is 80.95%.

❗ Current head 8cbed4d differs from pull request most recent head 5fbbc8a. Consider uploading reports for the commit 5fbbc8a to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1399   +/-   ##
=======================================
  Coverage   74.69%   74.69%           
=======================================
  Files         194      194           
  Lines       49730    49751   +21     
  Branches    10527    10532    +5     
=======================================
+ Hits        37145    37162   +17     
- Misses      10262    10264    +2     
- Partials     2323     2325    +2     
Impacted Files Coverage Δ
aesara/tensor/rewriting/math.py 85.95% <80.95%> (-0.07%) ⬇️

@brandonwillard
Copy link
Member

brandonwillard commented Jan 30, 2023

Does this happen literally every time something is compiled in Aesara, or only when that particular test is run?

This happens to me whenever I try to run the tests.

Very interesting! Don't fix anything just yet; I would like to get on a call first and check it out. If it's an Aesara bug, this would be a great opportunity to squash it.

In the meantime, you can always test things using the Python backend by setting the Aesara cxx setting to an empty string.

This was solved by re-constructing the environment using the environment-arm.yml discussed in #909.

@brandonwillard brandonwillard marked this pull request as ready for review January 30, 2023 20:08
Copy link
Member

@brandonwillard brandonwillard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; we just need to squash before/while merging.

@brandonwillard brandonwillard enabled auto-merge (squash) January 30, 2023 20:13
@rlouf rlouf disabled auto-merge January 31, 2023 15:44
@rlouf rlouf merged commit e518463 into aesara-devs:main Jan 31, 2023
@rlouf
Copy link
Member

rlouf commented Jan 31, 2023

Looks good, merging!

@Smit-create Smit-create deleted the i-1303 branch February 3, 2023 00:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request graph rewriting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add numerical stabilization for difference of exponentials
3 participants