-
-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enabled reading raw coordinates from xyzSile #483
Conversation
PS: I'm implementing the calculation of vibrational spectra from trajectories, which I think would be cool to add to sisl. For that it would be good to have some "standard" for dealing with trajectories, so that I can adhere to it :) |
Codecov Report
@@ Coverage Diff @@
## main #483 +/- ##
==========================================
- Coverage 86.91% 86.91% -0.01%
==========================================
Files 346 346
Lines 44311 44314 +3
==========================================
+ Hits 38513 38515 +2
- Misses 5798 5799 +1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
There is quite some code already for this? Have you seen? That might be reused in some way... |
Ah, well code for displacement vectors etc... Not the other thing, sounds cool! |
No! Where is it? Hahah |
Sorry, my 2nd comment corrected what I wrote ;) |
A thought related to this. Since this might be handy in other siles, would it be optimal to have a |
Yes, probably it would, but I wasn't sure if you wanted to introduce a new method. Also, would it mean that all methods that have
I just started doing this so maybe I'm missing some parts, but I don't see how is it possible or if it makes sense to compute a |
Hmm, probably yes, I can't really figure out what to do about this. How slow is it compared to the direct read? Perhaps I can see if the initialization could be improved... Z, cell, xyz = read_geometry(ret_ndarray=True) Or? I could easily imagine one also wants to quickly read the lattice vectors. |
Reading thousands of frames basically goes from reasonable to unfeasible. Here is some code to check the differences for reading 100 frames: import sisl
def read_trajectory_xyz(traj_file):
sile = sisl.get_sile(traj_file, cls=sisl.io.xyzSile)
coords = []
with sile as f:
for i in range(100):
try:
xyz = f.read_geometry(ret_xyz=True)
coords.append(xyz)
except Exception as e:
break
def read_trajectory_geom(traj_file):
sile = sisl.get_sile(traj_file, cls=sisl.io.xyzSile)
coords = []
with sile as f:
for i in range(100):
try:
xyz = f.read_geometry().xyz
coords.append(xyz)
except Exception as e:
break
traj_file = "siesta.ANI"
%timeit -n 1 -r 1 read_trajectory_xyz(traj_file)
%timeit -n 1 -r 1 read_trajectory_geom(traj_file) 41.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) It is really the Lines 142 to 143 in 2ce01bc
that takes almost all the time. If I comment that line, 61.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) |
I guess this could be overcome then by passing an import sisl
def read_trajectory_geom(traj_file):
sile = sisl.get_sile(traj_file, cls=sisl.io.xyzSile)
geom = sile.read_geometry()
atoms = geom.atoms
coords = []
with sile as f:
for i in range(100):
try:
xyz = f.read_geometry(atoms=atoms).xyz
coords.append(xyz)
except Exception as e:
break gives the timing: 109 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) Perhaps this is enough of a workaround for now and there's no need to add new functionality. However I think a way of dealing with trajectories in sisl is necessary in the long run, since it would be much more straightforward to users (and more efficient in doing things). |
The atoms argument would be a good solution I think? It could also come in handy when one has ghost atoms (since most formats does not intrinsically handle non-valid species). So the atoms argument could be necessary for all io files. |
@pfebrer could you have a look at the branch |
I don't know if the wrapper is a good idea, it does remove some code and ensures that everywhere the user can provide a Also, that doesn't solve anything here, right? It's the same functionality written differently, or not? |
Yes, but with the amount of io routines that reads geometries it might be much better due to maintenance. And changing the return values is a small change.
Yes, it is the same as your 2nd test by passing atoms argument. This was more to make it globally available |
It is merely because I have long wanted to unify the general io routines so that one shouldn't think about which code one is using. And a wrapper around the routines seems like the easiest way, especially if the generic changes can help performance as in this case. In any case, if one is developing a new io routine, I think this is a small trade-off to allow generic usage, I mean later we could allow the I am very eager to solve the performance issue, but my worry is that this will be present in any of the file formats due to the |
That is a good point! I think it is worth it then. In this case we are wrapping to modify the output, but we could also wrap it to modify the input. What I mean is that |
So you are thinking of a function that will itself determine the code? I am not sure I completely follow? |
No no, just a function like def read_geometry(file_handle, *args, **kwargs):
...read from the open file handle
return cell, atoms, xyz Then, a sisl wrapper would provide the file handle to use it in siles, just as the wrapper you proposed provides a way of parsing the raw data into a sisl object ( |
So you would have the function determine the filename, and then redirect the read to the sile? Or? I agree this would be nice! |
Noo, what I imagine is that in general you don't need the instantiated sile to parse a file. So a generic function that given an open file reads the data should be fine: def read_geometry(file_handle):
# Read number of atoms
na = int(file_handle.readline())
cell = ...
# Read atoms and coordinates
sp = [None] * na
xyz = np.empty([na, 3], np.float64)
line = file_handle.readline
for ia in range(na):
l = line().split(maxsplit=5)
sp[ia] = l[0]
xyz[ia, :] = l[1:4]
return xyz, sp, cell And then when you add the wrapper for sisl, as you are doing on your new branch, you would do: def wrap_read_geometry(func):
""" Wraps the `read_geometry` method by adding a set of default arguments and flags
This unifies the arguments for the various codes IO handling such that
one can expect the same from all of them
The following arguments are added:
- atoms
A replacement argument for the `atoms` in the returned geometry.
- sc
A replacement argument for the `sc` in the returned geometry.
The `read_geometry` should then instead of returning a `Geometry`
return a tuple consisting of `sc, atoms, xyz` then the wrapped
method will coherently replace the required details and create
the `Geometry`
"""
# We should also add keyword arguments to the documentation of func
@wraps(func)
def _func(self, *args, **kwargs):
user_sc = kwargs.pop("sc", None)
user_atoms = kwargs.pop("atoms", None)
# Open the file handle
file_handle = self.fh
# And pass it to the parser
sc, atoms, xyz = func(file_handle)
if user_sc is not None:
sc = user_sc
if user_atoms is not None:
atoms = user_atoms
return Geometry(xyz, atoms, sc=sc)
return _func Now this gives more flexibility to use the parser in whatever situation. One example could be another code that manages IO differently or some file that contains many things, and at some point includes a geometry in xyz format. In that case you would do something like: from sisl.io.xyz import read_geometry
with open("myfile.out", "r") as f:
# Move to the place where the geometry starts
line = "s"
while line != "":
line = f.readline()
# Use sisl's parser to read the structure information
read_geometry(f)
|
I agree that your approach would be nice. So that would require end-users to rely on explicit imports from sisl.io.xyz import read_geometry as read_xyz_geom
from sisl.io.siesta.xv import read_geometry as read_xv_geom
... and that each and every file-type has a Wouldn't it be easier to allow |
However I much better prefer |
Nooo I actually like the automatic discovery of parsers , i.e. I just think that it would be better that internally it would work using these functions that can be reused or explicitly imported for a more flexible use if needed. The siles framework will make life much easier for users 99% of the time, but for the other 1% I think it can be very nice to expose the raw parser. And of course as always the more you split functionalities (file parsing vs extension management) the nicer the code design is and the easier it is to build on top of it/refactor :) |
Wouldn't the flexibility be enabled if the siles could use file handles? I am not too keen on having the same routine name in all submodules, it calls for problems.. Or perhaps a function that accepts a file handle and will return the sile as though it was already opened? |
Hmm yes I agree that having equally named functions all around is not nice, some solution must be found for that. So probably in a sense a reasonable interface could be I believe to help the discussion it is nice to understand what is the difference between a "parser" and a |
Actually I was thinking that |
Yes, some siles have different formats depending on some parts... I agree that methods would have been nice, but it just doesn't cut it now. I think the most understandable thing for end users would be a single As it is, wouldn't my branch solve your performance problem? While the decorator makes the code a bit weird, I think it would aid in the general picture because of argument fixing. |
A single
Yes, well my performance problem specifically with Anyway this PR can be discarded, and hopefully another PR will come at some point making available parsers that do not depend on siles :) |
Some time ago you made me notice that I could read successive geometries in a xyz file by keeping the file handle open:
However, this is unfeasible for reasonably sized trajectories due to the overhead of initializing
Geometry
objects. As stated in #167, it would be great to have some generalized way of manipulating time series/trajectory data, specially for geometries. While there is no such thing, I thinksisl
users could benefit from being able to at least read trajectories from.xyz
files, even if there is no way of treating them in SIESTA. I introduced theret_xyz
keyword argument to skip geometry instantiation, which makes it feasible (and very quick) to read trajectories.If you agree this is OK (I think it's the least intrusive way to allow users to do this without new functionality), I would add a test for it.