GSoC 2016 Project Ideas
The project ideas can be roughly categorized as
- New analysis functionality
- Increasing performance
- New input formats
- Increase platform availability
- Increase ease-of-use
- Improve the library core
Or work on your own idea! Get in contact with us to propose an idea and we will work with you to flesh it out into a full project. Raise an issue in the Issue Tracker or contact us via the developer Google group.
Difficulty: Hard
Mentors: Max, Richard, Manuel, Jonathan
MDAnalysis already comes with a range of standard analysis tools but currently lacks an implementation of a general dimension reduction algorithm that can select an arbitrary number of dimensions of interest. Three common general techniques are:
- Time-lagged Independent Component Analysis
- [Diffusion Maps](http://arxiv.org/abs/1506.06259)
- Principal Component Analysis
There are Python implementations of all of these algorithms, but none of them currently works with MDAnalysis out of the box. The existing implementations operate on plain NumPy arrays that hold a complete trajectory in memory, whereas MDAnalysis never loads the whole trajectory, only one frame at a time. This approach allows MDAnalysis to handle very large systems on a normal laptop or workstation. A new dimension reduction algorithm should be implemented as a class and inherit from analysis.base.
Of course, you can also suggest another dimension reduction algorithm that you would like to implement.
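To illustrate the frame-at-a-time constraint, here is a minimal sketch of a Principal Component Analysis that only ever sees one frame. It is not MDAnalysis code: `streaming_pca` is a hypothetical name, and `frames` is assumed to be any iterable that yields one flattened coordinate array of length 3N per frame, standing in for iteration over a trajectory.

```python
import numpy as np


def streaming_pca(frames, n_components):
    """PCA over an iterable of frames without holding the trajectory in memory.

    `frames` is a hypothetical stand-in for a trajectory: it yields one
    flattened coordinate array of length 3N per frame.
    """
    n_frames = 0
    coord_sum = None
    outer_sum = None
    for frame in frames:
        x = np.asarray(frame, dtype=np.float64).ravel()
        if coord_sum is None:
            coord_sum = np.zeros_like(x)
            outer_sum = np.zeros((x.size, x.size))
        n_frames += 1
        coord_sum += x                # running sum of coordinates
        outer_sum += np.outer(x, x)   # running sum of outer products

    if n_frames < 2:
        raise ValueError("need at least two frames")

    mean = coord_sum / n_frames
    # Sample covariance from the sums: (sum(x x^T) - n * mean mean^T) / (n - 1)
    cov = (outer_sum - n_frames * np.outer(mean, mean)) / (n_frames - 1)

    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvals[order], eigvecs[:, order]
```

Only the 3N x 3N accumulator is kept in memory, so the cost depends on the number of selected atoms rather than on the trajectory length. Accumulating raw sums can lose precision for long trajectories; a production version would probably use a numerically stabler update.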
Difficulty: Hard
Mentors: Max, Richard, Manuel, Jonathan
Molecular simulation trajectories are very often analyzed frame-by-frame. This is frequently an embarrassingly parallel procedure, in which work can be efficiently divided simply by splitting the trajectory and letting each worker process one of the chunks. The goal of this project is to implement a parallelization framework that automates all the trajectory splitting, work distribution, and eventual result collection.
A parallelization framework should put the least burden possible on the end-user, so that minimal changes are required to turn serial code into parallel. Likewise, the parallelization framework must blend naturally with the analysis API of MDAnalysis. In this way, analyses written using analysis.base will automatically become parallelizable.
Implementing parallelization in Python code can be done in many ways. Aspects to consider when choosing one or several approaches are:
- Most users will primarily have access to SMP parallelization;
- Notwithstanding the above point, many users also typically have access to multi-node HPC clusters, and we should be able to leverage their use;
- In an analysis context, being able to write results to shared memory will improve the memory usage footprint and simplify result collection;
- GPU parallelization is attractive for its wide availability (though possibly more complex to implement in a meaningful way).
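The split, distribute, and collect steps described above can be sketched with the standard library alone. Everything here is an illustrative assumption rather than an existing MDAnalysis API: `split_frames`, `run_chunked`, and the toy `frame_squares` analysis are hypothetical names, and a real analysis would open the trajectory inside each worker.

```python
from concurrent.futures import ThreadPoolExecutor


def split_frames(n_frames, n_chunks):
    """Return (start, stop) frame ranges covering 0..n_frames without overlap."""
    n_chunks = max(1, min(n_chunks, n_frames))
    base, extra = divmod(n_frames, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional frame.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def run_chunked(analyze_chunk, n_frames, n_workers, executor_cls=ThreadPoolExecutor):
    """Run analyze_chunk(start, stop) on each chunk and join results in frame order."""
    chunks = split_frames(n_frames, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        futures = [pool.submit(analyze_chunk, start, stop) for start, stop in chunks]
        # Collect in submission order so results stay sorted by frame.
        partial_results = [future.result() for future in futures]
    return [value for part in partial_results for value in part]


def frame_squares(start, stop):
    """Toy per-frame analysis (hypothetical): one value per frame index."""
    return [frame * frame for frame in range(start, stop)]
```

For CPU-bound analyses on an SMP machine, passing `executor_cls=concurrent.futures.ProcessPoolExecutor` is the likely choice, provided the analysis function is defined at module level so it can be pickled. Threads are the default here only to keep the sketch simple.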
Difficulty: Hard
Mentors: Max, Richard, Manuel, Jonathan
To analyze molecular simulations it is often helpful to know which atoms are close to each other. For this we calculate distance matrices, in which the distance between every pair of atoms is computed. This is a very expensive operation whose cost grows quadratically with the number of atoms involved.
Since we are only interested in atoms that are close to each other, we can use algorithms that run faster after some initial analysis of the coordinates. One class of such algorithms is domain-decomposition algorithms. The basic idea is to decompose the volume occupied by the atoms into cells and then calculate distances only for atoms in the same or neighboring cells. If two atoms are not in neighboring cells, we already know that their distance is too large to be of interest. A theoretical description of these algorithms can be found in this book, Appendix F.
One domain decomposition algorithm is cell grids.
In this project you would integrate the cell grid algorithm into MDAnalysis.
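The cell decomposition idea can be sketched as follows. This is a simplified, non-periodic illustration and not the cell grid implementation the project refers to: `cell_grid_pairs` is a hypothetical name, and the cell edge length is assumed to equal the cutoff.

```python
from collections import defaultdict
from itertools import product

import numpy as np


def cell_grid_pairs(coords, cutoff):
    """Return the set of index pairs (i, j), i < j, closer than `cutoff`.

    Non-periodic sketch: atoms are binned into cubic cells with edge length
    `cutoff`, so any pair within the cutoff lies in the same or adjacent cells.
    `coords` is expected to have shape (n_atoms, 3).
    """
    coords = np.asarray(coords, dtype=np.float64)
    cell_index = np.floor((coords - coords.min(axis=0)) / cutoff).astype(int)

    # Map each occupied cell to the atoms it contains.
    cells = defaultdict(list)
    for atom, cell in enumerate(map(tuple, cell_index)):
        cells[cell].append(atom)

    pairs = set()
    offsets = list(product((-1, 0, 1), repeat=3))  # the cell itself plus 26 neighbors
    for cell, atoms in cells.items():
        for offset in offsets:
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            if neighbor not in cells:
                continue
            for i in atoms:
                for j in cells[neighbor]:
                    if i < j and np.linalg.norm(coords[i] - coords[j]) < cutoff:
                        pairs.add((i, j))
    return pairs
```

Distances are only evaluated within a cell and its neighbors, so for a roughly uniform density the work grows about linearly with the number of atoms. Periodic boundary conditions, which matter for most simulations, are deliberately left out.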
Difficulty: Medium
Mentors: Max, Richard, Manuel, Jonathan
One of the strengths of MDAnalysis is its ability to support a wide range of MD formats. But we are still missing some, such as the new TNG file format from Gromacs, H5MD, or the HALMD format. Alternatively, you can also add a format that you personally want to use in MDAnalysis. This project will familiarize you with working with and connecting different APIs, as well as giving insight into how modern portable data storage file formats work.
Difficulty: Hard
Mentors: Max, Richard, Manuel, Jonathan
To check whether a new analysis method works as intended, it is often a good idea to apply it to a random walk in different simple energy landscapes (a flat energy, a harmonic well, a double well). In this project you would develop a 'Reader' that produces random trajectories.
For the analysis of molecular data, a comparison against random data can be very useful for several reasons. The first is that we want to test whether our analysis can distinguish between a simulation and random noise. It can also be interesting to see what general analysis methods like Principal Component Analysis produce with random data.
The first random trajectory generator would just be a random walk in 3N dimensions (N is the number of particles in the simulation to compare to). The second would implement Langevin dynamics in predefined energy landscapes, arbitrary ones, or both. Langevin dynamics in an energy landscape is close to the conformational dynamics of proteins, see [1]. As a first step you could implement an integrator for Langevin dynamics and later have the trajectory 'reader' use the integrator to dynamically generate the trajectory.
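Both generators can be sketched in a few lines of NumPy. Note the simplification: the second function integrates overdamped Langevin (Brownian) dynamics with the Euler-Maruyama scheme, not full Langevin dynamics with inertia. All function names, parameters, and default values are assumptions made for illustration, in reduced units.

```python
import numpy as np


def random_walk(n_particles, n_frames, step=0.1, seed=None):
    """Free random walk in 3N dimensions; returns shape (n_frames, n_particles, 3)."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(scale=step, size=(n_frames, n_particles, 3))
    steps[0] = 0.0  # first frame sits at the origin
    return np.cumsum(steps, axis=0)


def overdamped_langevin(force, x0, n_frames, dt=1e-3, kT=1.0, gamma=1.0, seed=None):
    """Euler-Maruyama steps: x += F(x) * dt / gamma + sqrt(2 kT dt / gamma) * xi."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=np.float64)
    traj = np.empty((n_frames,) + x.shape)
    noise_scale = np.sqrt(2.0 * kT * dt / gamma)
    for frame in range(n_frames):
        traj[frame] = x
        # Deterministic drift from the force plus Gaussian thermal noise.
        x = x + force(x) * dt / gamma + noise_scale * rng.normal(size=x.shape)
    return traj


def harmonic_force(x, k=1.0):
    """Force of a harmonic well U(x) = k |x|^2 / 2."""
    return -k * x
```

A trajectory 'reader' could then serve the rows of `overdamped_langevin(harmonic_force, x0, n_frames)` as frames, or generate them lazily one step at a time.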
Please note that this project does require a background in statistical physics or mathematics.
[1] Robert Zwanzig. Nonequilibrium Statistical Mechanics. Oxford University Press, 2001.
Difficulty: Easy
Mentors: Max, Richard, Manuel, Jonathan
Python 3 is being adopted by a wider range of users, and Unix distributions are starting to switch. MDAnalysis currently can't run under Python 3, mostly due to its C/Cython extensions. We are moving our C extensions to Cython, which supports Python 2 and 3 from a single source. See also #260.
Still missing are the DCD trajectory readers. There is incomplete work to make the DCD reader compatible with Python 2 and 3. In this project you would either finish that work or rewrite the DCD interface in Cython.
The second part of this project is to remove all other incompatibilities with Python 3 that we currently have. The goal is for our test suite to pass on Python 3.
Difficulty: Medium
Mentors: Max, Richard, Manuel, Jonathan
Currently MDAnalysis exists only as a framework; however, making common tasks available via the command line would make certain workflows easier. As an example, the conversion of trajectories between formats could take the form:
mda convert --topology adk.psf -i adk_dims.dcd -o adk_dims.xtc
This project would involve creating a template for these command line utilities to follow and implementing a foolproof user interface for navigating them using a popular command line parsing library.
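As a rough sketch of what such a template might look like, the example command above can be parsed with `argparse` from the standard library. The `mda` program does not exist yet, so the subcommand layout and the long option names `--input` and `--output` are assumptions; only `--topology`, `-i`, and `-o` come from the example.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical `mda` command with a `convert` subcommand."""
    parser = argparse.ArgumentParser(prog="mda")
    # required=True makes argparse reject a call without a subcommand (Python 3.7+).
    subcommands = parser.add_subparsers(dest="command", required=True)

    convert = subcommands.add_parser(
        "convert", help="convert a trajectory between formats"
    )
    convert.add_argument("--topology", required=True, help="topology file, e.g. a PSF")
    convert.add_argument("-i", "--input", required=True, help="trajectory to read")
    convert.add_argument("-o", "--output", required=True, help="trajectory to write")
    return parser
```

Calling `build_parser().parse_args(["convert", "--topology", "adk.psf", "-i", "adk_dims.dcd", "-o", "adk_dims.xtc"])` yields a namespace with `command`, `topology`, `input`, and `output` attributes. Further utilities would register their own subparsers in the same way.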
Difficulty: Hard
Mentors: Max, Richard, Manuel, Jonathan
MDAnalysis uses ångström and picoseconds as default units. Our Reader/Writer objects are aware of units only to the extent that they convert other MD formats to our default units. But we can also read the coordinates in their native units. This can make it hard to remember what units the coordinates of an AtomGroup have. To fix this, you should switch from pure NumPy arrays to a unit-aware NumPy ndarray wrapper. See Issue #596.
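One possible shape for such a wrapper is an `ndarray` subclass that carries a unit label. The sketch below is an illustration only: `UnitArray`, its `to` method, and the `TO_ANGSTROM` table are hypothetical names, and only two length units are covered.

```python
import numpy as np

# Conversion factors to ångström, the MDAnalysis default length unit.
TO_ANGSTROM = {"angstrom": 1.0, "nm": 10.0}


class UnitArray(np.ndarray):
    """ndarray that remembers its length unit (hypothetical sketch)."""

    def __new__(cls, values, unit="angstrom"):
        if unit not in TO_ANGSTROM:
            raise ValueError("unknown unit: {}".format(unit))
        obj = np.asarray(values, dtype=np.float64).view(cls)
        obj.unit = unit
        return obj

    def __array_finalize__(self, obj):
        # Called for views and slices; carry the unit over from the parent array.
        self.unit = getattr(obj, "unit", "angstrom")

    def to(self, unit):
        """Return a converted copy expressed in `unit`."""
        factor = TO_ANGSTROM[self.unit] / TO_ANGSTROM[unit]
        return UnitArray(np.asarray(self) * factor, unit=unit)
```

Slices such as `positions[0]` keep the unit of the array they came from. Arithmetic between arrays with different units is not checked here; handling that correctly is the hard part of the project, and a real solution might build on an existing units package instead.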