Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabled reading raw coordinates from xyzSile #483

Closed
wants to merge 1 commit into from

Conversation

pfebrer
Copy link
Contributor

@pfebrer pfebrer commented Aug 17, 2022

Some time ago you made me notice that I could read successive geometries in a xyz file by keeping the file handle open:

import sisl

xyz_sile = sisl.get_sile("trajectory.xyz")
traj = []
with xyz_sile as s:
    while True:
        try:
            geom = s.read_geometry()
            traj.append(geom.xyz)
        except:
            # No more geometries to read
            break

However, this is unfeasible for reasonably sized trajectories due to the overhead of initializing Geometry objects. As stated in #167, it would be great to have some generalized way of manipulating time series/trajectory data, specially for geometries. While there is no such thing, I think sisl users could benefit from being able to at least read trajectories from .xyz files, even if there is no way of treating them in SIESTA. I introduced the ret_xyz keyword argument to skip geometry instantiation, which makes it feasible (and very quick) to read trajectories.

import sisl

xyz_sile = sisl.get_sile("trajectory.xyz")
traj = []
with xyz_sile as s:
    while True:
        try:
            xyz = s.read_geometry(ret_xyz=True)
            traj.append(xyz)
        except:
            # No more geometries to read
            break

If you agree this is OK (I think it's the least intrusive way to allow users to do this without new functionality), I would add a test for it.

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 17, 2022

PS: I'm implementing the calculation of vibrational spectra from trajectories, which I think would be cool to add to sisl. For that it would be good to have some "standard" for dealing with trajectories, so that I can adhere to it :)

@codecov
Copy link

codecov bot commented Aug 17, 2022

Codecov Report

Merging #483 (9e0d7b4) into main (0545e6e) will decrease coverage by 0.00%.
The diff coverage is 80.00%.

@@            Coverage Diff             @@
##             main     #483      +/-   ##
==========================================
- Coverage   86.91%   86.91%   -0.01%     
==========================================
  Files         346      346              
  Lines       44311    44314       +3     
==========================================
+ Hits        38513    38515       +2     
- Misses       5798     5799       +1     
Impacted Files Coverage Δ
sisl/io/xyz.py 95.83% <80.00%> (-1.27%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@zerothi
Copy link
Owner

zerothi commented Aug 17, 2022

PS: I'm implementing the calculation of vibrational spectra from trajectories, which I think would be cool to add to sisl. For that it would be good to have some "standard" for dealing with trajectories, so that I can adhere to it :)

There is quite some code already for this? Have you seen? That might be reused in some way...

@zerothi
Copy link
Owner

zerothi commented Aug 17, 2022

Ah, well code for displacement vectors etc... Not the other thing, sounds cool!

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 17, 2022

There is quite some code already for this? Have you seen?

No! Where is it? Hahah

@zerothi
Copy link
Owner

zerothi commented Aug 17, 2022

There is quite some code already for this? Have you seen?

No! Where is it? Hahah

Sorry, my 2nd comment corrected what I wrote ;)
I was thinking about displacement vectors and the vibrational frequencies from that... So nothing from a trajectory... ;) But please use the DynamicalMatrix as a return value.

@zerothi
Copy link
Owner

zerothi commented Aug 17, 2022

A thought related to this.

Since this might be handy in other siles, would it be optimal to have a read_xyz/read_cartesian? Instead of adding flags all-around?

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 18, 2022

Since this might be handy in other siles, would it be optimal to have a read_xyz/read_cartesian? Instead of adding flags all-around?

Yes, probably it would, but I wasn't sure if you wanted to introduce a new method. Also, would it mean that all methods that have read_geometry should have read_xyz?

I was thinking about displacement vectors and the vibrational frequencies from that... So nothing from a trajectory... ;) But please use the DynamicalMatrix as a return value.

I just started doing this so maybe I'm missing some parts, but I don't see how is it possible or if it makes sense to compute a DynamicalMatrix. I'm doing this with liquids in mind, where molecules vibrate but also translate. The way you do this is by computing the velocities during the MD and then computing their time autocorrelation. The FFT of that gives you the vibration frequencies. How do you compute a DynamicalMatrix during that process?

@zerothi
Copy link
Owner

zerothi commented Aug 18, 2022

Since this might be handy in other siles, would it be optimal to have a read_xyz/read_cartesian? Instead of adding flags all-around?

Yes, probably it would, but I wasn't sure if you wanted to introduce a new method. Also, would it mean that all methods that have read_geometry should have read_xyz?

Hmm, probably yes, I can't really figure out what to do about this. How slow is it compared to the direct read? Perhaps I can see if the initialization could be improved...
Perhaps the read geometry could return numpy arrays instead? Then it is easy to grab what you need?

Z, cell, xyz = read_geometry(ret_ndarray=True)

Or? I could easily imagine one also wants to quickly read the lattice vectors.

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 19, 2022

How slow is it compared to the direct read?

Reading thousands of frames basically goes from reasonable to unfeasible. Here is some code to check the differences for reading 100 frames:

import sisl

def read_trajectory_xyz(traj_file):
    sile = sisl.get_sile(traj_file, cls=sisl.io.xyzSile)
    coords = []
    with sile as f:
        for i in range(100):
            try:
                xyz = f.read_geometry(ret_xyz=True)
                coords.append(xyz)
            except Exception as e:
                break
                
def read_trajectory_geom(traj_file):
    sile = sisl.get_sile(traj_file, cls=sisl.io.xyzSile)
    coords = []
    with sile as f:
        for i in range(100):
            try:
                xyz = f.read_geometry().xyz
                coords.append(xyz)
            except Exception as e:
                break
                
traj_file = "siesta.ANI"

%timeit -n 1 -r 1 read_trajectory_xyz(traj_file)
%timeit -n 1 -r 1 read_trajectory_geom(traj_file)

41.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2.29 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

It is really the Atoms initialization here:

sisl/sisl/geometry.py

Lines 142 to 143 in 2ce01bc

# Create the local Atoms object
self._atoms = Atoms(atoms, na=self.na)

that takes almost all the time. If I comment that line, read_trajectory_geom gives a very similar timing to the raw reading (with some overhead of course):

61.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 19, 2022

I guess this could be overcome then by passing an Atoms instance to the atoms argument of read_geometry().

import sisl

def read_trajectory_geom(traj_file):
    sile = sisl.get_sile(traj_file, cls=sisl.io.xyzSile)
    geom = sile.read_geometry()
    atoms = geom.atoms
    coords = []
    with sile as f:
        for i in range(100):
            try:
                xyz = f.read_geometry(atoms=atoms).xyz
                coords.append(xyz)
            except Exception as e:
                break

gives the timing:

109 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Perhaps this is enough of a workaround for now and there's no need to add new functionality. However I think a way of dealing with trajectories in sisl is necessary in the long run, since it would be much more straightforward to users (and more efficient in doing things).

@zerothi
Copy link
Owner

zerothi commented Aug 19, 2022

The atoms argument would be a good solution I think? It could also come in handy when one has ghost atoms (since most formats does not intrinsically handle non-valid species). So the atoms argument could be necessary for all io files.

@zerothi
Copy link
Owner

zerothi commented Aug 19, 2022

@pfebrer could you have a look at the branch def-read-geometry-options? It tries to unify the methods by a wrapper, could this be a solution + improvement to all io stuff?

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 20, 2022

I don't know if the wrapper is a good idea, it does remove some code and ensures that everywhere the user can provide a sc and/or atoms, but in return it makes the code a bit more confusing, doesn't it?

Also, that doesn't solve anything here, right? It's the same functionality written differently, or not?

@zerothi
Copy link
Owner

zerothi commented Aug 20, 2022

I don't know if the wrapper is a good idea, it does remove some code and ensures that everywhere the user can provide a sc and/or atoms, but in return it makes the code a bit more confusing, doesn't it?

Yes, but with the amount of io routines that reads geometries it might be much better due to maintenance. And changing the return values is a small change.

Also, that doesn't solve anything here, right? It's the same functionality written differently, or not?

Yes, it is the same as your 2nd test by passing atoms argument. This was more to make it globally available

@zerothi
Copy link
Owner

zerothi commented Aug 20, 2022

It is merely because I have long wanted to unify the general io routines so that one shouldn't think about which code one is using.

And a wrapper around the routines seems like the easiest way, especially if the generic changes can help performance as in this case. In any case, if one is developing a new io routine, I think this is a small trade-off to allow generic usage, I mean later we could allow the names etc. to be present?

I am very eager to solve the performance issue, but my worry is that this will be present in any of the file formats due to the Atoms construct. So something generic seems like a better idea than doing the same thing over and over again?

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 20, 2022

It is merely because I have long wanted to unify the general io routines so that one shouldn't think about which code one is using.

That is a good point! I think it is worth it then.

In this case we are wrapping to modify the output, but we could also wrap it to modify the input. What I mean is that read_geometry could be a function that accepts an open file handle as an argument, instead of being a method. Then the parsing functions could be completely reusable without sisl. And the other way around, we could reuse parsers from other codes. Perhaps there are some corner cases where you need a Sile state, but we could still keep those as methods (?).

@zerothi
Copy link
Owner

zerothi commented Aug 21, 2022

So you are thinking of a function that will itself determine the code? I am not sure I completely follow?

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 21, 2022

No no, just a function like

def read_geometry(file_handle, *args, **kwargs):
    ...read from the open file handle
    return cell, atoms, xyz

Then, a sisl wrapper would provide the file handle to use it in siles, just as the wrapper you proposed provides a way of parsing the raw data into a sisl object (Geometry).

@zerothi
Copy link
Owner

zerothi commented Aug 21, 2022

So you would have the function determine the filename, and then redirect the read to the sile? Or? I agree this would be nice!

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 22, 2022

Noo, what I imagine is that in general you don't need the instantiated sile to parse a file. So a generic function that given an open file reads the data should be fine:

def read_geometry(file_handle):
    # Read number of atoms
    na = int(file_handle.readline())

    cell = ...

    # Read atoms and coordinates
    sp = [None] * na
    xyz = np.empty([na, 3], np.float64)
    line = file_handle.readline
    for ia in range(na):
        l = line().split(maxsplit=5)
        sp[ia] = l[0]
        xyz[ia, :] = l[1:4]

    return xyz, sp, cell

And then when you add the wrapper for sisl, as you are doing on your new branch, you would do:

def wrap_read_geometry(func):
    """ Wraps the `read_geometry` method by adding a set of default arguments and flags
    This unifies the arguments for the various codes IO handling such that
    one can expect the same from all of them
    The following arguments are added:
    - atoms
      A replacement argument for the `atoms` in the returned geometry.
    - sc
      A replacement argument for the `sc` in the returned geometry.
    The `read_geometry` should then instead of returning a `Geometry`
    return a tuple consisting of `sc, atoms, xyz` then the wrapped
    method will coherently replace the required details and create
    the `Geometry`
    """

    # We should also add keyword arguments to the documentation of func
    @wraps(func)
    def _func(self, *args, **kwargs):
        user_sc = kwargs.pop("sc", None)
        user_atoms = kwargs.pop("atoms", None)
        
        # Open the file handle
        file_handle = self.fh
        # And pass it to the parser
        sc, atoms, xyz = func(file_handle)

        if user_sc is not None:
            sc = user_sc
        if user_atoms is not None:
            atoms = user_atoms

        return Geometry(xyz, atoms, sc=sc)

    return _func

Now this gives more flexibility to use the parser in whatever situation. One example could be another code that manages IO differently or some file that contains many things, and at some point includes a geometry in xyz format. In that case you would do something like:

from sisl.io.xyz import read_geometry

with open("myfile.out", "r") as f:
    # Move to the place where the geometry starts
    line = "s"
    while line != "":
        line = f.readline()

    # Use sisl's parser to read the structure information
    read_geometry(f)

@zerothi
Copy link
Owner

zerothi commented Aug 22, 2022

I agree that your approach would be nice. So that would require end-users to rely on explicit imports

from sisl.io.xyz import read_geometry as read_xyz_geom
from sisl.io.siesta.xv import read_geometry as read_xv_geom

...

and that each and every file-type has a read_geometry function.

Wouldn't it be easier to allow Sile's to be passed a file-handle and act on that? Agreed the class-structure is a bit overkill, but on the other hand the explicitness may be too much?

@zerothi
Copy link
Owner

zerothi commented Aug 22, 2022

However I much better prefer Geometry.read as a function usage, so your idea would be fully enabled if siles would enable reading streams as well.

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 22, 2022

Nooo I actually like the automatic discovery of parsers , i.e. get_sile and Geometry.read.

I just think that it would be better that internally it would work using these functions that can be reused or explicitly imported for a more flexible use if needed. The siles framework will make life much easier for users 99% of the time, but for the other 1% I think it can be very nice to expose the raw parser. And of course as always the more you split functionalities (file parsing vs extension management) the nicer the code design is and the easier it is to build on top of it/refactor :)

@zerothi
Copy link
Owner

zerothi commented Aug 23, 2022

Wouldn't the flexibility be enabled if the siles could use file handles? I am not too keen on having the same routine name in all submodules, it calls for problems.. Or perhaps a function that accepts a file handle and will return the sile as though it was already opened?

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 23, 2022

Hmm yes I agree that having equally named functions all around is not nice, some solution must be found for that.

So probably in a sense a reasonable interface could be Geometry.read(file_handle, format="xyz").
But I don't think it makes sense that accepting a file handle is implemented at the level of the Sile.read_geometry method.

I believe to help the discussion it is nice to understand what is the difference between a "parser" and a Sile. In my view, the difference is that the Sile's state keeps information about the path of the file (and it could actually contain other file metadata like the last read for example) while a parser could be a static thing that needs no state. So to me it would be confusing that you would need to initialize an empty Sile (i.e. no file information) just to be able to parse sections of files from arbitrary file handles.

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 23, 2022

Actually I was thinking that Sile could be a single class and then on initialization it would pick up the format specific parser. But I think there are sile classes that have format specific state, right?

@zerothi
Copy link
Owner

zerothi commented Aug 25, 2022

Yes, some siles have different formats depending on some parts...

I agree that methods would have been nice, but it just doesn't cut it now. I think the most understandable thing for end users would be a single read_geometry with appropriate arguments which will defer to the specific code for reading, i.e. it could open siles internally, or whatever method.
In this take one could also have private methods for each file type which does what you want, and which the siles use. I am just not too keen on having too many methods lying around which are private...

As it is, wouldn't my branch solve your performance problem? While the decorator makes the code a bit weird, I think it would aid in the general picture because of argument fixing.

@pfebrer
Copy link
Contributor Author

pfebrer commented Aug 28, 2022

I agree that methods would have been nice, but it just doesn't cut it now. I think the most understandable thing for end users would be a single read_geometry with appropriate arguments which will defer to the specific code for reading, i.e. it could open siles internally, or whatever method.
In this take one could also have private methods for each file type which does what you want, and which the siles use. I am just not too keen on having too many methods lying around which are private...

A single read_geometry sounds nice as well. I don't know what the optimal organization should be, but I don't think going through siles is a good idea. Since siles are one level of abstraction above just reading files, I think it would make the code flow in a very strange way. The functionalities would become very entangled and it would be harder to separate them in the future if there is the need to.

As it is, wouldn't my branch solve your performance problem? While the decorator makes the code a bit weird, I think it would aid in the general picture because of argument fixing.

Yes, well my performance problem specifically with xyzSile was already solved, but I guess this would solve it for reading sequences of structures with other siles if the situation arises. I think that when wrapping we should modify the signature of the method (and perhaps the documentation?) so that the user can easily understand that they can pass custom atoms and sc.

Anyway this PR can be discarded, and hopefully another PR will come at some point making available parsers that do not depend on siles :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants