Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement changes in #57: improve behaviour of dumped files #59

Merged
merged 13 commits into from
Aug 21, 2014

Conversation

matsjoyce
Copy link
Contributor

Here's a start on #57.
Questions:

  1. When a file in read or append mode does not exists, the file currently points to os.devnull. Is this acceptable?
  2. Do you think the test file is enough?
  3. Should the IOErrors be replaced by UnpicklingErrors?
  4. When the position is off the end of the file, in safe mode, should an exception be raised, or the position moved to the end?

@mmckerns
Copy link
Member

I'm not convinced that mode w+ should become r+, and w should become r+. It makes sense, in many cases, but I need you to convince me it's the right thing to do (or I'll convince myself). I was able to spend a little time with this today, but not much. Will get back to it tomorrow. Also, is there a different keyword we can use rather than safe_file? Maybe something that is general enough that it could be used on other objects (like buffer objects?) if the case arises for whatever reason… and also something that is one word (or oneword)? I'm being a bit picky about the interface… but I'm trying to be forward looking so that it doesn't need to be changed later, in favor of something more general if the need arises (too specific is often bad for backward compatibility). Note there's already a safe in pickles, and it's with regard to throwing an error or not (which should be better documented -- oops).

@mmckerns
Copy link
Member

RE: your questions:

  1. Probably
  2. It's a pretty good start if it's not. In trying to break the model, we will build any remaining improvements to the test file.
  3. weakref throws a ReferenceErrordill also can commonly throw TypeError or AssertionError. I tried to keep error same as pickle throws in the relevant cases… so IOError may be the right thing.
  4. Don't know yet.

@matsjoyce
Copy link
Contributor Author

The idea is that if there is an app that does logging (for example), and some remotely executed code unpickles the log file (e.g. the main app opens it at the start, and gives the code a handle, which is the pickled), currently, after it finishes, the log will only contain its log, as the log from the rest of the program will be deleted. There is probably a better example, but that's what came to my head first.

@mmckerns
Copy link
Member

What I do in tricky cases that need some working through is to build a more extensive test suite, testing any case that I can think of that it might be used for -- that's what I mean by "play with it a bit". I have yet to do this, as promised above. I won't be convinced of the "right solution", until this code has really been kicked around a good bit, using a bunch of different test cases.

You could split the test cases into a separate PR, then we can both work on the tests until we have the results that "feel" right. I'm ok with merging your test case now, and having tests fail until we work out the logic is "correct".

@matsjoyce
Copy link
Contributor Author

So you want me to move matsjoyce/dill@90fdf15 to a separate PR?

@mmckerns
Copy link
Member

mmckerns commented Aug 1, 2014

Yeah, I think so. That would mean adding any accompanying kwargs in dill.dill… but currently set to be ignored, unless your branch is active. I don't care if some of the tests fail to start off.

@matsjoyce matsjoyce mentioned this pull request Aug 1, 2014
@mmckerns
Copy link
Member

mmckerns commented Aug 1, 2014

Ok, I'm about halfway through test_file… I've added several scenarios and associated tests for potential outcomes, and tried to make sure they are consistent (i.e. all (2)'s would be the same conceptual change, same for all (3)'s, and so on). I'll probably get back to it, and go through the rest of test_file this evening. I committed so you could have a look, and do some pondering if you like. Thus far, I'm partial to (2) (i.e. act like a new file handle was created, ignoring position)… and do not yet have a preference for what to do if position is not ignored. I think (2) should at least be an option, unless something surprising comes up in the rest of the test cases.

If you can think of any other test cases, please add them as PR or a comment here.

@matsjoyce
Copy link
Contributor Author

Starting at line 61, for (3), I chose r+ as with a mode, if you try to seek in the file, the data will still be appended onto the end. (5) would increase pickle sizes considerably, but would be an interesting option. Looks good!

@mmckerns
Copy link
Member

mmckerns commented Aug 1, 2014

With regard to (3), I'm still not sure that r+ is the best choice… and the note XXX there is just a reminder that it requires more thought. With regard to (5), I have seen industry code where this was the selected option -- it works fairly cleanly to pickle the contents as a string… and it's definitely conceptually different that the rest.

I usually like take the simplest solutions as being the best choices. (1) is simple… but apparently wrong. (2) and (5) are also pretty simple. (3) and (4) are less simple. I guess we'll see.

@mmckerns
Copy link
Member

mmckerns commented Aug 1, 2014

After getting this far, I'm liking (2), (5), and (4b)… mainly for the reason that they don't appear to end up in the state where the file has buffer filler \x00\x00\x00\x00\x00. I think my (1) and (4a), and your (3), all bother me in one way or the other, unfortunately. I do quite like the idea of your "file safe mode"… although, I'd still prefer a we come up with a better name for it.

I think reasoning out the test cases for 'a', as well as seeing what happens when the file is added to or replaced with a longer file, are both worthwhile to investigate. I'd feel comfortable making a decision at that point, and only possibly earlier. It also looks like (2) and (4b) are fairly similar, thus far. I am not at all opposed to having two or more options, as long as they are very clear (i.e. intuitive) options.

@mmckerns
Copy link
Member

mmckerns commented Aug 1, 2014

Also, this has not been mentioned previously… there's a new file mode in python 3.xhttps://docs.python.org/3/library/functions.html#open, with 'x' being similar (sort of) to "safe" mode. Also, the docs on open note that 'a' seems to behave differently on certain platforms. Yuck. Again, that tells me simpler is better.

@mmckerns
Copy link
Member

mmckerns commented Aug 2, 2014

only the test cases for append are missing… (i.e. the original file having mode='a'). I might not get to it for a few days. We'll see.

@matsjoyce
Copy link
Contributor Author

x? I didn't spot that. It seems that it's just a variant of w, but that ensures a new file is created, not an old one that is truncated, so we could handle it similar to w. I'll see if I can make any a cases.

@matsjoyce
Copy link
Contributor Author

As for other names:

  • cautious
  • safeio
  • nocreation

@mmckerns
Copy link
Member

mmckerns commented Aug 4, 2014

I added the test cases for variants (1) - (5) for test_file, and checked the test cases against (1) -- all tests pass. I also checked against (3) -- the above PR -- and all but two tests pass (see the FIXME). One of the tests is clearly not matching to your table with spec 'w+' becomes 'r+' (although it looks as if you didn't code to that table completely). The other (3) failure seems to be an overzealous IOError.

Going through the "append" cases didn't change my mind, I'm very much in favor of (2), (5), and possibly (4b). I'll wait on your feedback on which cases to go forward with. For me, it seems reasonably clear right now we want at least (2) and (5) -- and not (1) and (3).

In short, I'm probably not going to accept the PR as is (using logic 3), but after we close debate, there should be an agreed-upon solution (i.e. option to use (2) or (5) and possibly (4b)… as well as safe mode).

@matsjoyce
Copy link
Contributor Author

One of the tests is clearly not matching to your table with spec 'w+' becomes 'r+' (although it looks as if you didn't code to that table completely).

I set it that if the file does not exist, it keeps the w/w+ mode, letting it create the file.

I've fixed the IOError.

I like (3) and (5) best. However, maybe (3) should turn to (2) when the file DNE, as this would avoid the heaps of \x00 chars. That way the behaviour would be:

File exists (with same or larger contents):

Mode Result
r Open as normal; seek to position
a Open as normal; seek to position
w Open as r+; seek to position

File exists, but smaller

Mode Result
r Open as normal; seek to EOF if pos > EOF else seek to pos
a Open as normal; seek to EOF if pos > EOF else seek to pos
w Open as r+; seek to EOF if pos > EOF else seek to pos

File DNE

Mode Result
r Open os.devnull
a Create new file; seek to 0
w Create new file; seek to 0

@mmckerns
Copy link
Member

mmckerns commented Aug 4, 2014

I agree with your choice to get rid of the \x00 characters in (3). I don't think an acceptable solution should have buffer characters in them. The next thing to do would be to give the pros/cons for keeping each one of the options I've outlined as test cases… then make a decision. Again, I'm not opposed to multiple options, as long as they all make sense and are clear.

@matsjoyce
Copy link
Contributor Author

I've implemented the above changes, and changes the test, so you can see the kind of output it produces (basically the same, but no \x00 chars, due to EOF testing).

@mmckerns
Copy link
Member

Ok, that's better…

I think I might accept this as is... but I want add some of the other options immediately afterward (hence the delay). I don't think (3) should be the default. It's better than (1), but it's not simple to explain… hence I would probably not choose this as an option at all were I coding it myself. To me, it feels like it might be a good choice for certain use cases… but the level of complexity is high enough that it seems like there might be trouble with corner cases down the road… and that leads to changes that break backward compatibility.

I'm looking for an explanation of the entire behavior in a single sentence, like...
"Option 5 ('pickle_contents') pickles the file handle, preserving mode and position, as well as the file contents."
or…
"Option 2 ('new_handle') pickles the file handle, preserving mode, with position of the unpickled object as for a new file handle."

These are short, and fully describe the behavior of the option… can you come up with something similarly short and simple for (3)?

I think (2) might be the best default, with (5) as an option. (2) has some flaws, but at least the behavior is straightforward, and very consistent with python… and one should very quickly understand what to expect in all cases.

If I'm not mistaken, (4b) and (3) look like they are now the same, except for the choice of modes and the os.devnull bit. I'd say that either (4b) and (3) should be the 3rd option, and I'd probably prefer (4b) unless there's a significant use case hole that not picking (3) creates.

@matsjoyce
Copy link
Contributor Author

  • Option 3 ('preserve data') preserves existing data or create file if is does not exist, with position = min(pickled position, EOF), and mode which preserves behaviour
  • Option 3 ('preserve data') preserves behaviour of file object, and creates file if it does not exist

I'll try and add (2) and (5).

@mmckerns
Copy link
Member

@matsjoyce: that would be great. They should both be fairly straightforward, and with (5) I think you could probably just use a second file handle to read all the data in at load, or write it before the dump.

What you have above is a good summary of (3), but "mode which preserves behavior" is not really precise -- for a file handle of mode = 'w', why is moving to mode = 'r+' preferred over something like 'a', 'a+', or 'w+'? Should 'r+' be the clear choice? Sorry for the foot-dragging on this... but dill is a tight little package, and I believe it is that way because I try not to settle for finding a "good solution".

Would you be OK with (2) as the default, (5) as the other extreme, and (3) as the in-between?
You could probably use a single switch for all three… default = 'None', then True/False if the kwd is given as an override (to choose between 3 and 5). Something like refdata = None for (2), True for (3), and False for (5)?

Or are there two flags (and thus 4 cases!) to deal with? (I don't know what I would call those flags.) Using string names of the options is a bad idea; however T/F is good, or if there's a clear gradation 0,1,2 might also be good.

@matsjoyce
Copy link
Contributor Author

What you have above is a good summary of (3), but "mode which preserves behavior" is not really precise -- for a file handle of mode = 'w', why is moving to mode = 'r+' preferred over something like 'a', 'a+', or 'w+'? Should 'r+' be the clear choice?

  • a => on some OSs, seek does not change where data is written (does not preserve behaviour)
  • a+ => same as above
  • w => truncates file (does not preserve data)
  • w+ => same as above
  • r => cannot write (does not preserve behaviour)
  • r+ => only option left

Sorry for the foot-dragging on this... but dill is a tight little package, and I believe it is that way because I try not to settle for finding a "good solution".

I like tight little packages! 😃

What are we going to name the safeIO flag?

@mmckerns
Copy link
Member

I know you have obviously made a good choice with r+, I was actually asking more rhetorically… I guess I should have said: "is r+ going to be the clear choice (i.e. not surprising) to users?" I guess it will just need to be well documented. I know you have added a table, in the docs but maybe that's not the best way to convey the choice. It was more of a poke at you to think about explaining your choice. Sorry for not being clear.

RE: naming for the safeIO flag -- have not thought about it. I think nocreate is in the right vein, except you should never use negative names for flags, as it gets into bad english… nocreate=False Yuck.

@matsjoyce
Copy link
Contributor Author

allowcreate=True?

Well, I would of thought r+ is the obvious choice to replace w+, as they are the same, but r+ does not have the os.O_TRUNC flag. For w, r+ is not so obvious, as it allows reading, while w does not, but it is the closest without using os.open.

@matsjoyce
Copy link
Contributor Author

I'll also note that I believe your solution will take dill from being 2.5-2.7 and 3.1-3.4 compatible down to 2.7,3.3,3.4 I believe (I've not checked). I'd prefer to not break compatibility… with a possible short-term out being that one of the FMODE do not exist for older versions with no opener flag.

Oh. I tried using plain os.open, which would be backwards compatible, but then f.name is set to the file descriptor number, not the filename, which would be a severe problem, as further pickling would then fail. I'll see if there are any other options.

@mmckerns
Copy link
Member

Backward compatibility has been a pain in the behind, but necessary due to the target use in HPC. There are institutional-scale HPC platforms still using python 2.5 as the default. Many of those are being phased out, or have been phased out, but python 2.6 is very very common, as are earlier versions of 3.x.

@matsjoyce
Copy link
Contributor Author

Yup. And I just noticed that it doesn't work in python 2.7 either, so Its back to the drawing board for that bit. What I was trying before was (http://stackoverflow.com/a/10352231):

import os
f = os.fdopen(os.open(name, os.O_CREAT | os.O_WRONLY), mode)

The problem is that:

f.name == 3  # fd, not name

which will break everything more than r+ did...
Trying to set f.name doesn't help either:

>>> f.name = "a.txt"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: attribute 'name' of '_io.TextIOWrapper' objects is not writable

See http://bugs.python.org/issue1625576

@mmckerns
Copy link
Member

Yes, I'm aware of that unfortunate name='<fdopen>' issue in python.
It gets worse.

>>> _f = f.__new__(FileType, 'foo.txt', mode='w')
>>> _f
<closed file '<uninitialized file>', mode '<uninitialized file>' at 0x1018638a0>
>>> _f.name
'<uninitialized file>'
>>> _f.mode
'<uninitialized file>'

Which just seems half-assed. Or at least, it's a really dark dark corner of the language.

Like I said, easy out for now could be to make that file mode unavailable for certain versions of python.

@matsjoyce
Copy link
Contributor Author

In python 3, we can do:

f.buffer.raw.name = "a.txt"

Still working on python 2.

@matsjoyce
Copy link
Contributor Author

The only way I think f.name can be set in python2 is using a good dose of ctypes, as file init contains:

static PyObject *
fill_file_fields(PyFileObject *f, FILE *fp, PyObject *name, char *mode,
         int (*close)(FILE *))
{
    assert(name != NULL);
    assert(f != NULL);
    assert(PyFile_Check(f));
    assert(f->f_fp == NULL);

    Py_DECREF(f->f_name);
    Py_DECREF(f->f_mode);
    Py_DECREF(f->f_encoding);

        Py_INCREF(name);
        f->f_name = name;

    f->f_mode = PyString_FromString(mode);

    f->f_close = close;
    f->f_softspace = 0;
    f->f_binary = strchr(mode,'b') != NULL;
    f->f_buf = NULL;
    f->f_univ_newline = (strchr(mode, 'U') != NULL);
    f->f_newlinetypes = NEWLINE_UNKNOWN;
    f->f_skipnextlf = 0;
    Py_INCREF(Py_None);
    f->f_encoding = Py_None;

    if (f->f_mode == NULL)
        return NULL;
    f->f_fp = fp;
        f = dircheck(f);
    return (PyObject *) f;
}

As this shows, the only way to set name is thought C, making it very tricky.
We could use r+ for python 2, and f.buffer.raw.name = "a.txt" for python 3?

@mmckerns
Copy link
Member

https://github.com/albertz/pydbattach/blob/master/pythonhdr.py#L181
https://groups.google.com/forum/#!topic/cython-users/1jawl6UCZ3g

This has had the same interface since python2.3, but not available in python 3.

@matsjoyce
Copy link
Contributor Author

This works:

import os
>>> f = os.fdopen(os.open("a.txt", os.O_CREAT | os.O_WRONLY), "w")
>>> f
<open file '<fdopen>', mode 'w' at 0x7f0694f76540>
>>> import ctypes
>>> fname_offset = ctypes.sizeof(ctypes.c_size_t) + ctypes.sizeof(ctypes.py_object) + ctypes.sizeof(ctypes.c_voidp)
>>> fname_offset
24
>>> ctypes.py_object.from_address(id(f) + fname_offset)
py_object('<fdopen>')
>>> n=ctypes.py_object.from_address(id(f) + fname_offset)
>>> n.value="a"
>>> f
<open file 'a', mode 'w' at 0x7f0694f76540>
>>> 

Not very portable for other python implementations though.

if not HAS_CTYPES:
raise RuntimeError("Need ctypes to set file name")
ctypes.cast(id(f), ctypes.POINTER(FILE)).contents.name = name
ctypes.cast(id(name), ctypes.POINTER(PyObject)).contents.ob_refcnt += 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this works, but looks a bit fragile to me.

@matsjoyce
Copy link
Contributor Author

It now passes the test on python 2.7, and as it now doesn't use any new features, it ought to be backwards compatible.

@mmckerns
Copy link
Member

I'm ok with it as is for python2.7 and python2.6… (you could opt to use the 'opener' if it exists, if that makes things cleaner for 3.x).

It needs an easy fix for python2.5:

dude@hilbert>$ python25
dude@hilbert>$ python test_file.py 
test_file.py:18: Warning: 'with' will become a reserved keyword in Python 2.6
  File "test_file.py", line 18
    with open(fname, "w") as f:
            ^
SyntaxError: invalid syntax

It's also broken now for 3.1 and 3.2 (same small error)

dude@hilbert>$ python31
dude@hilbert>$ python test_file.py 
Traceback (most recent call last):
…
  File "/Users/mmckerns/dev/svn/pathos/dill/dill/branch/dill.py", line 886, in save_module
    prefix = sys.base_prefix if PY3 else sys.prefix
AttributeError: 'module' object has no attribute 'base_prefix'

Looks like 3.3 and 3.4 pass, but print to stdout, as you mentioned earlier. This bug will get fixed, and probably shouldn't be worried about it if it only occurs when a Traceback is thrown -- meaning, it never appears when the error is caught… so you only see a print resulting from the bug.

Following up on the remaining points:

Once the file is created with mode x, it acts just like w, so currently the x is changed into a w. The only disadvantage of this approach is that mode before != mode after (eg xb => wb).

See my opinion from last week… should throw an error, not change x to w.

What should safeio be called?

I'm good with current solution, I guess.

Currently, due to a bug (?) in pickle.py (https://github.com/python/cpython/blob/master/Lib/pickle.py#L1384-1385), raising exceptions from a loader causes messages to be printed. Should test_file.py try to "absorb" the messages, or leave stdout a bit "cluttered"?

Yes, it's better to catch stdout. Fine if you leave it as a follow-up item. I have a code snippet that can be applied here fairly easily.

Should os.devnull be used for reading a non-existent file for options (2) and (3)?

Honestly, I don't know. I typically look at what python would do, and just do that.
I'll have to think about this a bit more.

Should we test this on Windows? It ought to be completely portable, but you never know with Windows?

File behavior is slightly different on windows…. however, the solution does look portable. I'm ok with assuming it works. (Yikes)

@matsjoyce
Copy link
Contributor Author

It needs an easy fix for python2.5:

dude@hilbert>$ python25
dude@hilbert>$ python test_file.py 
test_file.py:18: Warning: 'with' will become a reserved keyword in Python 2.6
  File "test_file.py", line 18
    with open(fname, "w") as f:
            ^
SyntaxError: invalid syntax

Fixed.

See my opinion from last week… should throw an error, not change x to w.

Also fixed.

It's also broken now for 3.1 and 3.2 (same small error)

dude@hilbert>$ python31
dude@hilbert>$ python test_file.py 
Traceback (most recent call last):
  File "/Users/mmckerns/dev/svn/pathos/dill/dill/branch/dill.py", line 886, in save_module
    prefix = sys.base_prefix if PY3 else sys.prefix
AttributeError: 'module' object has no attribute 'base_prefix'

Oops, that looks like a problem with #41. Shall I make a new PR, or shall we leave that to be fixed in #62 ?

What should safeio be called?

I'm good with current solution, I guess.

Is that keep safe_file or switch to safeio?

Yes, it's better to catch stdout. Fine if you leave it as a follow-up item. I have a code snippet that can be applied here fairly easily.

Or fix #3 and use unittest.TextTestRunner(buffer=True).

@matsjoyce
Copy link
Contributor Author

Whoops, looks like we've got a merge conflict now... 😯 Shall I try to remove 4edc3c0?

@mmckerns
Copy link
Member

I already made the change in test_file.py to be 2.5 compatible, as it was already accepted code. Merge the change into your PR.

Make a new PR for the 3.1/3.2 patch, or I'll patch it. It should go straight in (instead of waiting in #62).

I'd like to get this PR resolved, as it's really close. I don't want to have too many PR's outstanding for you to have to manage... If you like safeio better, use it -- I think it's a little better.

I am in the process of switching all my tests across the board to produce no output, and then further to use unittest. I had someone who volunteered, but bailed… so I'm on it.

@matsjoyce
Copy link
Contributor Author

I already made the change in test_file.py to be 2.5 compatible, as it was already accepted code. Merge the change into your PR.

OK, I think I've made it work now...

If you like safeio better, use it -- I think it's a little better.

OK, switched.

I am in the process of switching all my tests across the board to produce no output, and then further to use unittest. I had someone who volunteered, but bailed… so I'm on it.

That would be great, as then you could enable Travis, and so run all the tests on each version of python, for each PR, which would help with bugs and the like. The best I do at the moment is:

for i in tests/*.py; do python $i; done
for i in tests/*.py; do python2 $i; done

mmckerns added a commit that referenced this pull request Aug 21, 2014
Implement changes in #57: improve behaviour of dumped files.
@mmckerns mmckerns merged commit 80a80e3 into uqfoundation:master Aug 21, 2014
@mmckerns
Copy link
Member

Some remaining cleanup to be noted in #57.

@matsjoyce matsjoyce deleted the new_file_handling branch August 22, 2014 08:27
@mmckerns mmckerns modified the milestone: dill-0.2.2 Nov 25, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants