This repository has been archived by the owner on Jul 16, 2022. It is now read-only.

add path_generator #59

Open · wants to merge 5 commits into master

Conversation

@Spectre5 commented Dec 26, 2020

I have a Python script that outputs file(s) to a specific directory and may be running on multiple computers within a cluster. The output files should all be named like name.#.csv, where # is an increasing number: 1, 2, 3, etc. I don't want any files to be overwritten, but I also don't want the write to fail; instead, it should just increment the file number until the write succeeds. I originally used a dumb loop to find the next available filename when writing the file, but with many processes running simultaneously I'd run into race conditions, and files would somewhat frequently get overwritten.
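For context, the racy pattern looked roughly like the sketch below (a reconstruction under assumed naming, not the actual script). The gap between the existence check and the later open() is where two processes can pick the same number:

```python
import os

def next_available_path(directory, template='name.{}.csv'):
    """Scan the directory and pick the first unused number.

    Racy: between this check and the later open(), another process
    can create the same file, and the second open() clobbers it.
    """
    i = 1
    while os.path.exists(os.path.join(directory, template.format(i))):
        i += 1
    return os.path.join(directory, template.format(i))

# The race window: two processes can both get 'name.3.csv' here,
# and whichever writes second silently overwrites the first.
# path = next_available_path('.')
# with open(path, 'w') as f:
#     f.write('example text')
```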

My solution is the edit I've added here as a PR. Basically it adds a "path generator": a generator that sequentially returns the next filename to try. This is flexible, since the user could make the generator return paths based on something entirely different from sequential numbering. With this, my script works great and I haven't seen any more files getting overwritten, so I thought I'd open a PR in case others find it useful. If not, feel free to just close this!

Here is an example of how I use it:

import itertools
from atomicwrites import AtomicWriter

def file_incrementor(base_filename):
    for i in itertools.count(1):
        # yielding base_filename.format(1) every time instead would trigger
        # the library's duplicate-path error, which guards against an infinite loop
        yield base_filename.format(i)

writer = AtomicWriter(path='name.{}.csv', path_generator=file_incrementor)
with writer.open() as output:
    # calculations with intermittent writes to output
    output.write('example text')
# it also saves the actual path that was saved (regardless of whether path_generator was used)
print(writer.final_path)

@WhyNotHugo (Contributor)
> I originally used a dumb loop when writing the file to find the next available file to use, but I'd run into race conditions with many processes running simultaneously and files would somewhat frequently get overwritten.

That same loop with overwrite=False should work as you expect.

@untitaker (Owner) commented Dec 26, 2020 via email

@Spectre5 (Author) commented Dec 26, 2020

When I mentioned that "dumb loop", I was not using atomicwrites at the time. Instead I'd just look in the directory to find the next available # and use it with the normal open command. I don't see how I would cleanly achieve this with atomicwrites as-is. I can still attempt to write the file and catch the error, but I cannot execute the entire "with block" of code again, as it is expensive and sometimes very time-consuming (see code at bottom). So it isn't just an issue of writing and deleting multiple temporary files; for my application it simply isn't feasible to re-run the block.

I could just write to a temporary file myself (outside of atomicwrites) and then use the move_atomic function from atomicwrites in a loop to rename the temporary file to its final name, but then I'd lose some of the other nice aspects of the AtomicWriter class such as automatic clean up. Of course I can implement that stuff myself, but then I'm just re-implementing most of the class. Sub-classing it is another option, but I'd need to re-write half the methods on it to make sure I can get the final_path that was written. Having this functionality in the library seems reasonable to me, but I understand that you're trying to keep the library as simple as possible too.
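That manual route might look something like the sketch below. To keep it self-contained, os.link + os.unlink stand in for atomicwrites' move_atomic (a hard link fails with EEXIST if the target already exists, so nothing is overwritten); the real move_atomic handles the cross-platform details. The helper name is hypothetical:

```python
import errno
import os
import tempfile

def write_then_move(path_generator, data, directory='.'):
    """Write data to a temp file, then atomically move it to the
    first free name yielded by path_generator.

    os.link raises EEXIST when the target exists, so an existing
    file is never overwritten; on success the temp file is removed.
    """
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w') as f:
            f.write(data)
        for path in path_generator:
            try:
                os.link(tmp, path)  # fails if path already exists
            except OSError as exc:
                if exc.errno != errno.EEXIST:
                    raise
            else:
                return path  # success; tmp is cleaned up in finally
        raise RuntimeError('path_generator exhausted without a free path')
    finally:
        if os.path.exists(tmp):
            os.unlink(tmp)
```

As the comment above notes, this reimplements the cleanup that AtomicWriter already provides, which is exactly why pushing the feature into the library is attractive.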

Or perhaps I'm totally missing what you guys had in mind? I'm assuming you're suggesting something like the code below. The file_incrementor generator could be smarter and scan the directory to start at the first "available" file, but if a collision happens then the "with block" would get executed again on the next file attempt.

import errno
from atomicwrites import atomic_write

for path in file_incrementor('name.{}.csv'):
    try:
        with atomic_write(path, overwrite=False) as output:
            # long and expensive calculations (this block needs to only run 1 time)
            output.write('example text')
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            raise
    else:
        break  # write succeeded; stop trying further paths

@untitaker (Owner)
I don't think you're missing anything. Could you demonstrate how you'd subclass? Perhaps this will give me a better understanding of how that part of the current API needs to evolve.

@Spectre5 (Author) commented Dec 26, 2020

Basically I'd make the subclass behave the same as the AtomicWriter class in my PR code. To implement my PR code as a subclass, I'd override the _open method to store the new final_path attribute and update commit to get successive paths to try from the user-provided path generator function (only used if overwrite=False; there may be a better way to handle that). A quick (untested) subclass might be:

import contextlib
import errno

from atomicwrites import AtomicWriter, move_atomic, replace_atomic

class AtomicWriterPathGenerator(AtomicWriter):
    def __init__(self, *args, path_generator=None, **kwargs):
        # would need to use old super method for python2 support
        super().__init__(*args, **kwargs)  
        self._path_generator = path_generator
        self.final_path = None

    @contextlib.contextmanager
    def _open(self, get_fileobject):
        f = None  # make sure f exists even if get_fileobject() fails
        try:
            success = False
            with get_fileobject(**self._open_kwargs) as f:
                yield f
                self.sync(f)
            self.final_path = self.commit(f)
            success = True
        finally:
            if not success:
                self.final_path = None
                try:
                    self.rollback(f)
                except Exception:
                    pass

    def commit(self, f):
        '''Move the temporary file to the target location.'''
        if self._overwrite:
            replace_atomic(f.name, self._path)
            return self._path
        else:
            if self._path_generator is not None:
                seen = set()
                for path in self._path_generator(self._path):
                    if path in seen:
                        # avoid infinite loop if the path generator returns a
                        # path that was already attempted
                        raise ValueError(
                            'path_generator must return unique values, but '
                            '{} was returned multiple times.'.format(path)
                        )
                    seen.add(path)
                    try:
                        move_atomic(f.name, path)
                    except OSError as exc:
                        if exc.errno == errno.EEXIST:
                            pass
                        else:
                            raise
                    else:
                        return path
            else:
                move_atomic(f.name, self._path)
                return self._path

@Spectre5 (Author)
In its current form, my PR will keep trying new files until it either succeeds or the generator function returns a previously attempted file; it fails in the latter case to avoid an infinite loop. Another potential change could be to add a maximum number of attempts, after which it would raise an exception. One way to implement that might be to have the generator function return None: if None is ever returned as the path, an exception is raised.
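A caller-side sketch of that cap (bounded_incrementor is a hypothetical helper, not part of the PR): rather than returning None, the generator itself could raise once the limit is reached, which keeps the None-handling out of commit entirely:

```python
import itertools

def bounded_incrementor(base_filename, max_attempts=100):
    """Yield name.1.csv, name.2.csv, ... up to max_attempts paths,
    then raise instead of looping forever."""
    for i in itertools.count(1):
        if i > max_attempts:
            raise RuntimeError(
                'no free path found after {} attempts'.format(max_attempts))
        yield base_filename.format(i)
```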

@Spectre5 (Author)
Any further thoughts on this idea, or another way to add similar functionality? I want to use a context manager for a long-running/expensive operation (with intermittent writes to the file during the long process) and then be sure that the file is definitely saved, and saved atomically.

For what it's worth, my edited version (the code in the PR) has been working great for me for over a month now. I've run it with up to roughly 100 identical Python processes saving files to the same directory. As the diff shows, I added a final_path attribute so that each process knows what filename it actually ended up writing, which is also important for my use case since the process may need to read that file later in the code.
