add path_generator #59
Conversation
The approach Hugo describes should work 100%, however I think there's merit in an API that does not force you to write the temp file multiple times.

> On Sat, Dec 26, 2020, at 13:28, Hugo Barrera wrote:
>
> > I originally used a dumb loop when writing the file to find the next available file to use, but I'd run into race conditions with many processes running simultaneously and files would somewhat frequently get overwritten.
>
> That same loop with `overwrite=False` should work as you expect it.
When I mentioned that "dumb loop", I was not using atomicwrites at the time. Instead I'd just look in the directory to find the next available number and use it with the normal `open` call. I don't see how I would cleanly achieve this with atomicwrites as is. I can still do that, attempt to write the file, and catch the error, but I cannot execute the entire `with` block of code again, as it is expensive and sometimes very time consuming (see the code at the bottom). So it isn't even just an issue of writing and deleting multiple temporary files; for my application it simply isn't feasible to do that. I could just write to a temporary file myself (outside of atomicwrites) and then move it into place on my own, but perhaps I'm totally missing what you guys had in mind? I'm assuming you're suggesting something like the code below, where `file_incrementor` yields successive candidate paths:

```python
import errno

from atomicwrites import atomic_write

for path in file_incrementor('name.{}.csv'):
    try:
        with atomic_write(path, overwrite=False) as output:
            # long and expensive calculations (this block needs to only run 1 time)
            output.write('example text')
    except OSError as exc:
        if exc.errno == errno.EEXIST:
            pass  # path already taken, move on to the next candidate
        else:
            raise
    else:
        break  # the write landed, stop trying further paths
```
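One possible reading of the "write a temporary file myself" alternative, assuming atomicwrites' public `move_atomic` helper and a temp file created in the target directory (the filenames and pattern here are illustrative, not part of the discussion above): do the expensive work once, then retry only the cheap atomic move.

```python
import errno
import os
import tempfile

from atomicwrites import move_atomic

# Write the expensive output exactly once, to a private temp file in the
# same directory (the move requires source and target on one filesystem).
fd, tmp_path = tempfile.mkstemp(dir='.', suffix='.tmp')
with os.fdopen(fd, 'w') as tmp:
    tmp.write('example text')  # the long-running part happens only here

# Retry only the move until an unused name is found.
i = 1
while True:
    try:
        move_atomic(tmp_path, 'name.{}.csv'.format(i))
        break
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            raise
        i += 1
```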
I don't think you're missing anything. Could you demonstrate how you'd subclass? Perhaps this will give me a better understanding of how that part of the current API needs to evolve.
Basically I'd make the subclass behave the same as the base `AtomicWriter` whenever no `path_generator` is given:

```python
import contextlib
import errno

from atomicwrites import AtomicWriter, move_atomic, replace_atomic


class AtomicWriterPathGenerator(AtomicWriter):
    def __init__(self, *args, path_generator=None, **kwargs):
        # would need to use old super method for python2 support
        super().__init__(*args, **kwargs)
        self._path_generator = path_generator
        self.final_path = None

    @contextlib.contextmanager
    def _open(self, get_fileobject):
        f = None  # make sure f exists even if get_fileobject() fails
        try:
            success = False
            with get_fileobject(**self._open_kwargs) as f:
                yield f
                self.sync(f)
            self.final_path = self.commit(f)
            success = True
        finally:
            if not success:
                self.final_path = None
                try:
                    self.rollback(f)
                except Exception:
                    pass

    def commit(self, f):
        '''Move the temporary file to the target location.'''
        if self._overwrite:
            replace_atomic(f.name, self._path)
            return self._path
        else:
            if self._path_generator is not None:
                seen = set()
                for path in self._path_generator(self._path):
                    if path in seen:
                        # avoid infinite loop if the path generator returns a
                        # path that was already attempted
                        raise ValueError(
                            'path_generator must return unique values, but '
                            '{} was returned multiple times.'.format(path)
                        )
                    seen.add(path)
                    try:
                        move_atomic(f.name, path)
                    except OSError as exc:
                        if exc.errno == errno.EEXIST:
                            pass  # target already exists, try the next path
                        else:
                            raise
                    else:
                        return path
            else:
                move_atomic(f.name, self._path)
                return self._path
```
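For illustration, here is a hypothetical path generator and how the subclass sketched above might be driven (the `numbered_paths` helper, filenames, and mode are assumptions, not part of the diff):

```python
import os


def numbered_paths(path):
    # called with the writer's original target path; yields candidate
    # paths until one of the moves in commit() succeeds
    root, ext = os.path.splitext(path)
    i = 1
    while True:
        yield '{}.{}{}'.format(root, i, ext)
        i += 1


writer = AtomicWriterPathGenerator('name.csv', mode='w', overwrite=False,
                                   path_generator=numbered_paths)
with writer.open() as f:
    f.write('example text')

print(writer.final_path)  # e.g. 'name.3.csv', wherever the move landed
```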
In its current form, my PR will keep trying new files until it either succeeds or the generator function returns a previously attempted file. It fails if a previously attempted path is provided, to avoid an infinite loop. Another potential change could be to add a maximum number of attempts to make, after which it would raise an exception. One way to implement that might be to have the generator function itself return after a set number of paths.
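A rough sketch of that idea, assuming the cap lives in the generator itself and that it raises once the candidates run out (`max_attempts`, the error type, and the naming pattern are illustrative):

```python
import os


def limited_numbered_paths(path, max_attempts=100):
    # yield at most max_attempts candidate paths for the retry loop in commit()
    root, ext = os.path.splitext(path)
    for i in range(1, max_attempts + 1):
        yield '{}.{}{}'.format(root, i, ext)
    # every candidate was taken: fail loudly instead of giving up silently
    raise RuntimeError('no free path found after {} attempts'.format(max_attempts))
```

An exception raised inside the generator propagates out of the `for` loop that iterates it, so the `commit()` sketched above would surface it to the caller.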
Any further thoughts on this idea, or another way to add similar functionality? I want to use a context manager for a long-running/expensive operation (with intermittent writes to the file during the long process) and then be sure that the file definitely is saved, and saved atomically. For what it's worth, my edited version (the code in the PR) has been working great for me so far, for over a month. I've run it with up to roughly 100 identical Python processes saving files to the same directory. As is obvious from the diff, I added a `path_generator` argument.
I have a Python script that outputs file(s) to a specific directory and may be running on multiple computers within a cluster. The output files should all be named like `name.#.csv`, where `#` is an increasing number: 1, 2, 3, etc. I don't want any files to be overwritten, but I also don't want the write to fail. Instead, it should just increment the file number until the write succeeds. I originally used a dumb loop when writing the file to find the next available file to use, but I'd run into race conditions with many processes running simultaneously and files would somewhat frequently get overwritten.

My solution was the edit I've added here as a PR. Basically it adds a "path generator", which should be a generator that sequentially returns the next filename to try. This is flexible, as the user could make the generator return paths based on something entirely different than sequential numbering. With this, my script works great and I haven't seen any more files getting overwritten, so I thought I'd add a PR in case others find it useful. If not, feel free to just close this!
Here is an example of how I use it:
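A minimal sketch of that usage, assuming the PR exposes `path_generator` as a keyword argument to `atomic_write` (the generator and filenames here are illustrative):

```python
import itertools

from atomicwrites import atomic_write


def csv_paths(path):
    # yield name.1.csv, name.2.csv, ... until one of the moves succeeds
    for i in itertools.count(1):
        yield 'name.{}.csv'.format(i)


with atomic_write('name.csv', overwrite=False, path_generator=csv_paths) as f:
    # long-running computation, writing intermittently
    f.write('example text')
```

Since `atomic_write` forwards extra keyword arguments to the writer class, this is the same mechanism as constructing the writer directly.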