Reading in frames from an IOBuffer #46

Closed
joegilkes opened this issue Nov 5, 2023 · 8 comments

@joegilkes
Contributor

I have some code that generates XYZ geometries, and I need these to be converted into ExtXYZ frames. I can write these geometries to temporary files and read them back in as frames from those files, but this becomes incredibly taxing on disk IO when passing around thousands of XYZs, which I unfortunately have to do quite regularly.

I can write each geometry to a string, but I can't find a way to then read this string back in as a frame, since read_frames only works with a string for the path to the XYZ file, or with an IOStream to a file.

Is there some way of extending the current read_frames implementation to work with IOBuffers? That way I could write the string-form XYZ directly to an IOBuffer and read back in as a frame entirely within memory, which would be much faster.
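For concreteness, the usage I'm after is something like this (hypothetical, since read_frames doesn't currently accept an IOBuffer):

using ExtXYZ

# Hypothetical usage: read_frames has no IOBuffer method at present
buf = IOBuffer(xyz_str)      # xyz_str holds the string-form XYZ
frames = read_frames(buf)    # desired: parse frames entirely in memory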

@jameskermode
Member

The I/O is actually all done in C code in libextxyz using POSIX file descriptors with functions like fprintf(), fgets(), etc. If it were possible to get a valid file descriptor from a Julia IOBuffer then we could pass this along, similar to the way I allow Julia IOStream objects to be passed in via this cfopen() method:

https://github.com/libAtoms/ExtXYZ.jl/blob/master/src/fileio.jl#L16

From a quick look, however, there is no fd(::IOBuffer) method defined. If you can find/suggest a way to do this I'd be happy to add it.
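For reference, the pattern is roughly the following (an illustrative sketch, not the actual cfopen() in src/fileio.jl): an IOStream exposes a POSIX descriptor via fd(), which C can wrap in a FILE* with fdopen(), whereas there is no equivalent for an IOBuffer.

# Illustrative sketch only, not the library's actual cfopen() implementation
function cfile_from(io::IOStream, mode::AbstractString = "r")
    posix_fd = fd(io)   # works for IOStream; there is no fd(::IOBuffer) method
    fp = ccall(:fdopen, Ptr{Cvoid}, (Cint, Cstring), posix_fd, mode)
    fp == C_NULL && error("fdopen failed for descriptor $posix_fd")
    return fp           # FILE* that C code such as libextxyz can consume
end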

@jameskermode
Member

As a workaround you could create a ramdisk and do the I/O there: that might not be an option on machines where you are not root, however.
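On a typical Linux system /dev/shm is already a tmpfs mount, so something along these lines would keep the temporary file entirely in RAM (a sketch, assuming /dev/shm exists and xyz_str holds the in-memory XYZ text):

using ExtXYZ

frames = mktempdir("/dev/shm") do dir
    path = joinpath(dir, "geom.xyz")
    write(path, xyz_str)     # write the in-memory XYZ to the RAM-backed file
    read_frames(path)        # parse via the usual C-backed reader
end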

@jameskermode
Member

jameskermode commented Nov 6, 2023

Alternatively there is a pure Python implementation of read_frame() here (pass use_cextxyz=False to avoid using the C extension). This version could be fairly easily modified to read from in-memory strings rather than doing I/O from disk. I would guess that this would still be slower than the fast C code, but I could be wrong about that.

A pure Julia parser based on the ExtXYZ Lark grammar would be possible using Lerche.jl, but this is a much bigger undertaking.

@joegilkes
Contributor Author

As far as I can tell, there's no way of associating a file descriptor with an IOBuffer - they're just not set up in a way where that makes sense. I could be wrong though, so I'll ask around a bit.

A ramdisk would be usable, but yeah this is mostly problematic on networked nodes where I don't have root access. The performance hit is manageable on my laptop where all IO is done on an NVMe drive, but anything slower is a bit painful.

The Python implementation could be workable, I can have a look at that.

@joegilkes
Contributor Author

On second thoughts, the Python version would require a bit more modification, since it doesn't currently appear to have an option to return the frames as Dicts, instead converting them straight to ASE Atoms objects.

@jameskermode
Member

From memory I think the dict version is there too, just one level deeper: read_frame_dicts() perhaps.

@joegilkes
Contributor Author

Ah yep, got it. I'd seen that function but didn't realise it was extracting the data in the same way. It would require a bit of reorganisation on my end to get the arrays key behaving the same way, but otherwise everything maps over nicely.

@joegilkes
Contributor Author

Something like this does the trick:

using PyCall

pyextxyz = pyimport("extxyz")

function xyz_to_frame(xyz_str::String)
    # Split the string-form XYZ into lines and hand them to the pure Python parser;
    # use_regex=false is needed for the OpenBabel-generated XYZs (see below).
    iob = IOBuffer(xyz_str)
    na, info, arrays, _ = pyextxyz.extxyz.read_frame_dicts(split(String(take!(iob)), '\n'), use_regex=false)
    close(iob)

    # Convert the Python info dict into a Julia Dict.
    info = Dict{String, Any}(key => val for (key, val) in info)
    # Monoatomic frames come back as 0D NumPy arrays and need special handling.
    if na == 1
        arrays = Dict{String, Any}("species" => [arrays.item()[1]], "pos" => reduce(hcat, [arrays.item()[2]]))
    else
        arrays = Dict{String, Any}("species" => [a[1] for a in arrays], "pos" => reduce(hcat, [a[2] for a in arrays]))
    end
    frame = Dict{String, Any}("N_atoms" => na, "info" => info, "arrays" => arrays)
    return frame
end

There were a few tricky hurdles: for my input string-form XYZs (generated by OpenBabel) I had to use use_regex=false, since I got errors with it enabled and I'm not sure why, and a clause for monoatomic XYZs had to be added since they end up yielding 0D NumPy arrays which have to be handled differently. This works well enough that the following holds:

# ExtXYZ.jl implementation
frame = read_frame(xyz_file)

# Python extxyz extension
my_frame = xyz_to_frame(xyz_str)

@assert frame == my_frame

I haven't done any formal benchmarking to see the performance hit/gain yet, but anecdotally loading in a large chemical reaction network's worth of XYZs was much speedier than before.

I have a lot of points in my workflow that require geometries to be passed back and forth between ExtXYZ form and OpenBabel (through PyCall), so I also implemented the reverse case, converting molecules loaded as ExtXYZ frames with read_frames into OpenBabel Molecules without writing to disk:

function frame_to_xyz(frame::Dict{String, Any})
    na = frame["N_atoms"]
    s = "$na\n"
    # Rebuild the extended XYZ comment line from the info dict.
    comment = join(["$key=$value" for (key, value) in frame["info"]], " ")
    s *= "$comment\n"
    # One line per atom: species followed by the x, y, z coordinates.
    for i in 1:na
        s *= "$(frame["arrays"]["species"][i]) $(frame["arrays"]["pos"][1, i]) $(frame["arrays"]["pos"][2, i]) $(frame["arrays"]["pos"][3, i])\n"
    end
    return s
end

pybel = pyimport("openbabel.pybel")   # or pyimport("pybel") for OpenBabel 2.x
pbmol = pybel.readstring("xyz", frame_to_xyz(my_frame))

This may also be contributing a lot to the overall speedup. I'll mark this as closed here, but if you're interested I can have a go at benchmarking the difference more thoroughly later in the week.
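If it helps, a rough comparison could be done along these lines (a sketch, assuming BenchmarkTools.jl is installed and xyz_file/xyz_str contain the same geometry on disk and in memory):

using BenchmarkTools

@btime read_frame($xyz_file)      # ExtXYZ.jl C-backed path, file I/O on each call
@btime xyz_to_frame($xyz_str)     # in-memory path via extxyz's read_frame_dicts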

Thanks!
