This repository has been archived by the owner on Jul 16, 2022. It is now read-only.

add path_generator #59

Open · wants to merge 5 commits into master

Conversation

@Spectre5 commented Dec 26, 2020

I have a Python script that outputs file(s) to a specific directory and may be running on multiple computers within a cluster. The output files should all be named like name.#.csv, where # is an increasing number: 1, 2, 3, etc. I don't want any files to be overwritten, but I also don't want the write to fail; instead, it should just increment the file number until the write succeeds. I originally used a dumb loop to find the next available filename when writing the file, but with many processes running simultaneously I'd run into race conditions, and files would somewhat frequently get overwritten.
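For context, the racy pattern looked roughly like the sketch below (a reconstruction under assumed naming, not the actual script). The gap between the existence check and the later open() is where two processes can pick the same number:

```python
import os

def next_available_path(directory, template='name.{}.csv'):
    """Scan the directory and pick the first unused number.

    Racy: between this check and the later open(), another process
    can create the same file, and the second open() clobbers it.
    """
    i = 1
    while os.path.exists(os.path.join(directory, template.format(i))):
        i += 1
    return os.path.join(directory, template.format(i))

# The race window: two processes can both get 'name.3.csv' here,
# and whichever writes second silently overwrites the first.
# path = next_available_path('.')
# with open(path, 'w') as f:
#     f.write('example text')
```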

My solution is the edit I've added here as a PR. Basically it adds a "path generator": a generator that sequentially returns the next filename to try. This is flexible, since the user could make the generator return paths based on something entirely different from sequential numbering. With this, my script works great and I haven't seen any more files getting overwritten, so I thought I'd open a PR in case others find it useful. If not, feel free to just close this!

Here is an example of how I use it:

import itertools
from atomicwrites import AtomicWriter

def file_incrementor(base_filename):
    for i in itertools.count(1):
        # yielding base_filename.format(1) every time instead would trigger
        # the library's duplicate-path error, which guards against an infinite loop
        yield base_filename.format(i)

writer = AtomicWriter(path='name.{}.csv', path_generator=file_incrementor)
with writer.open() as output:
    # calculations with intermittent writes to output
    output.write('example text')
# it also saves the actual path that was saved (regardless of whether path_generator was used)
print(writer.final_path)

@WhyNotHugo (Contributor)
> I originally used a dumb loop when writing the file to find the next available file to use, but I'd run into race conditions with many processes running simultaneously and files would somewhat frequently get overwritten.

That same loop with overwrite=False should work as you expect.

@untitaker (Owner) commented Dec 26, 2020 via email

@Spectre5 (Author) commented Dec 26, 2020

When I mentioned that "dumb loop", I was not using atomicwrites at the time. Instead I'd just look in the directory to find the next available # and use it with the normal open command. I don't see how I would cleanly achieve this with atomicwrites as-is. I can still attempt to write the file and catch the error, but I cannot execute the entire "with block" of code again, as it is expensive and sometimes very time-consuming (see code at bottom). So it isn't just an issue of writing and deleting multiple temporary files; for my application it simply isn't feasible to re-run the block.

I could just write to a temporary file myself (outside of atomicwrites) and then use the move_atomic function from atomicwrites in a loop to rename the temporary file to its final name, but then I'd lose some of the other nice aspects of the AtomicWriter class such as automatic clean up. Of course I can implement that stuff myself, but then I'm just re-implementing most of the class. Sub-classing it is another option, but I'd need to re-write half the methods on it to make sure I can get the final_path that was written. Having this functionality in the library seems reasonable to me, but I understand that you're trying to keep the library as simple as possible too.
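That manual route might look something like the sketch below. To keep it self-contained, os.link + os.unlink stand in for atomicwrites' move_atomic (a hard link fails with EEXIST if the target already exists, so nothing is overwritten); the real move_atomic handles the cross-platform details. The helper name is hypothetical:

```python
import errno
import os
import tempfile

def write_then_move(path_generator, data, directory='.'):
    """Write data to a temp file, then atomically move it to the
    first free name yielded by path_generator.

    os.link raises EEXIST when the target exists, so an existing
    file is never overwritten; on success the temp file is removed.
    """
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w') as f:
            f.write(data)
        for path in path_generator:
            try:
                os.link(tmp, path)  # fails if path already exists
            except OSError as exc:
                if exc.errno != errno.EEXIST:
                    raise
            else:
                return path  # success; tmp is cleaned up in finally
        raise RuntimeError('path_generator exhausted without a free path')
    finally:
        if os.path.exists(tmp):
            os.unlink(tmp)
```

As the comment above notes, this reimplements the cleanup that AtomicWriter already provides, which is exactly why pushing the feature into the library is attractive.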

Or perhaps I'm totally missing what you guys had in mind? I'm assuming you're suggesting something like the code below. The file_incrementor generator could be smarter and scan the directory to start at the first "available" file, but if a collision happens then the "with block" would get executed again on the next file attempt.

import errno
from atomicwrites import atomic_write

for path in file_incrementor('name.{}.csv'):
    try:
        with atomic_write(path, overwrite=False) as output:
            # long and expensive calculations (this block needs to only run 1 time)
            output.write('example text')
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            raise
    else:
        break  # write succeeded; stop trying further paths

@untitaker (Owner)
I don't think you're missing anything. Could you demonstrate how you'd subclass? Perhaps this will give me a better understanding of how that part of the current API needs to evolve.

@Spectre5 (Author) commented Dec 26, 2020

Basically I'd make the subclass behave the same as the AtomicWriter class in my PR code. To implement my PR code as a subclass, I'd override the _open method to store the new final_path attribute and update commit to get successive paths to try from the user-provided path generator function (only used if overwrite=False; there may be a better way to handle that). A quick (untested) subclass might be:

import contextlib
import errno

from atomicwrites import AtomicWriter, move_atomic, replace_atomic

class AtomicWriterPathGenerator(AtomicWriter):
    def __init__(self, *args, path_generator=None, **kwargs):
        # would need to use old super method for python2 support
        super().__init__(*args, **kwargs)  
        self._path_generator = path_generator
        self.final_path = None

    @contextlib.contextmanager
    def _open(self, get_fileobject):
        f = None  # make sure f exists even if get_fileobject() fails
        try:
            success = False
            with get_fileobject(**self._open_kwargs) as f:
                yield f
                self.sync(f)
            self.final_path = self.commit(f)
            success = True
        finally:
            if not success:
                self.final_path = None
                try:
                    self.rollback(f)
                except Exception:
                    pass

    def commit(self, f):
        '''Move the temporary file to the target location.'''
        if self._overwrite:
            replace_atomic(f.name, self._path)
            return self._path
        else:
            if self._path_generator is not None:
                seen = set()
                for path in self._path_generator(self._path):
                    if path in seen:
                        # avoid infinite loop if the path generator returns a
                        # path that was already attempted
                        raise ValueError(
                            'path_generator must return unique values, but '
                            '{} was returned multiple times.'.format(path)
                        )
                    seen.add(path)
                    try:
                        move_atomic(f.name, path)
                    except OSError as exc:
                        if exc.errno == errno.EEXIST:
                            pass
                        else:
                            raise
                    else:
                        return path
            else:
                move_atomic(f.name, self._path)
                return self._path

@Spectre5 (Author)
In its current form, my PR will keep trying new files until it either succeeds or the generator function returns a previously attempted file; it fails in the latter case to avoid an infinite loop. Another potential change could be to add a maximum number of attempts, after which it would raise an exception. One way to implement that might be to have the generator function return None: if None is ever returned as the path, an exception is raised.
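A caller-side sketch of that cap (bounded_incrementor is a hypothetical helper, not part of the PR): rather than returning None, the generator itself could raise once the limit is reached, which keeps the None-handling out of commit entirely:

```python
import itertools

def bounded_incrementor(base_filename, max_attempts=100):
    """Yield name.1.csv, name.2.csv, ... up to max_attempts paths,
    then raise instead of looping forever."""
    for i in itertools.count(1):
        if i > max_attempts:
            raise RuntimeError(
                'no free path found after {} attempts'.format(max_attempts))
        yield base_filename.format(i)
```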

@Spectre5 (Author)
Any further thoughts on this idea, or another way to add similar functionality? I want to use a context manager for a long-running/expensive operation (with intermittent writes to the file during the long process) and then be sure that the file is definitely saved, and saved atomically.

For what it's worth, my edited version (the code in the PR) has been working great for me for over a month now. I've run it with up to roughly 100 identical Python processes saving files to the same directory. As the diff shows, I added a final_path attribute so that each process knows what filename it actually ended up writing, which is also important for my use case since the process may need to read that file later in the code.
