Node-based pyiron-base #175
Fundamental concepts

Thanks @liamhuber for all the enhancements. At the IPAM workshop at UCLA, I had the chance to discuss with Jan concepts for bringing together various ideas from pyiron, ironflow and other workflow modules. These ideas aim to develop a more modular and easier-to-maintain pyiron. The basic building block will be nodes. The framework must therefore provide all tools to create, manage, and upscale them easily. In the following, a brief summary of the concept is given:

```python
@pyiron.node
def multiply(x=1, y=2):
    return x * y
```

The main idea is that a node can be represented by a function with well-defined input, output and execution body. Rather than having to define the input and output explicitly in a class, this structure appears more pythonic and is used e.g. by dask. A main advantage would be the low entry barrier: all a new user would have to do is write a common python function and decorate it with pyiron.node. This approach can also be easily extended to provide type/ontology information via typing:

```python
from pyiron import onto

@pyiron.node(register(onto.atomistic), log_input=True)
def multiply(x: onto.types.atomistic.BulkModule = 1, y: int = 2) -> float:
    return x * y
```

The idea would be to provide ontological types like you introduced in the latest versions of ironflow. The nodes can be used individually and connected similarly to dask, without having to explicitly define a workflow. A decorated node would be delayed and executed only once the highest-level node is executed (similar to the delayed mode in dask):

```python
c = multiply(2, 3)  # no execution, only delayed object created
d = multiply(c, 4)  # no execution, only delayed object created
d.run()             # only now c and d will be evaluated
```

Pyiron-related concepts

While the node-based pyiron could be used without explicitly defining a workflow, a common simulation should probably have one:

```python
from pyiron import Workflow

wf = Workflow(name='test', onto=onto.atomistic)
Al = wf.create.structure.bulk('Al')
job = wf.create.job.Lammps(name='Lammps_job', structure=Al)
```

The new syntax would be very similar to the existing one, so users should adapt to it very easily. A couple of things are to note:
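The delayed-execution behaviour sketched above can be illustrated with a toy implementation. All names here (the `Node` class, the `node` decorator) are hypothetical stand-ins to show the mechanics, not the real pyiron API:

```python
# Minimal sketch of the proposed delayed-node behaviour; the Node class
# and node decorator are hypothetical stand-ins, not the real pyiron API.
import functools


class Node:
    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def run(self):
        # Recursively evaluate upstream nodes first, then this one.
        args = [a.run() if isinstance(a, Node) else a for a in self.args]
        kwargs = {k: v.run() if isinstance(v, Node) else v
                  for k, v in self.kwargs.items()}
        return self.func(*args, **kwargs)


def node(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return Node(func, *args, **kwargs)  # delayed: nothing executes yet
    return wrapper


@node
def multiply(x=1, y=2):
    return x * y


c = multiply(2, 3)   # no execution, only a delayed object
d = multiply(c, 4)   # still no execution
result = d.run()     # evaluates c, then d -> 24
```

Calling `run()` on the topmost node walks the dependency chain, exactly as in dask's delayed mode.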
Serial execution

The following example is an extreme case, and may be a good test case for checking the concept and code:

```python
structure = wf.create.structure.bulk('Al')
job_old = None
for i_step in range(10):  # cannot be captured by the workflow (write as macro code)
    job = wf.create.job.Lammps(name='Lammps_job',
                               structure=structure if job_old is None
                                         else job_old.get_structure(-1))
    if job_old is not None:  # write a decorated function to capture it in workflows
        if np.abs(job.output.energy_tot - job_old.output.energy_tot) < eps:
            break
    job_old = job
```

Such a construct would fail in dask delayed.

VASP example

Register nodes

In contrast to having a complex module like VASP in a single node, we should have smaller and more flexible ones. This is sketched below together with ideas regarding node registration:

```python
@pyiron.node(register(onto.atomistic.code.vasp.exe))
def VASP_exe(incar: FilePath, poscar: FilePath, potcar: FilePath, kpoints: FilePath):
    work = WorkingDirectory(path='.',
                            files=[incar, poscar, potcar, kpoints]
                            )
    work.run('vasp.exe -f my_mode')
    return work

@pyiron.node(register(onto.atomistic.code.vasp.parser_outcar))
def VASP_parser_outcar(outcar: FilePath, select=[], exclude=[]):
    out_dict = my_parser(outcar, select)
    return out_dict  # or iodata object

@pyiron.node(register(onto.atomistic.code.vasp.parser_incar))
def VASP_parser_incar(incar: DataIO):
    incar_str = my_parser(incar)
    return incar_str

@pyiron.node(register(onto.atomistic.code.vasp.parser_poscar))
def VASP_parser_poscar(structure: onto.atomistic.structure):
    poscar_str = my_parser(structure)
    return poscar_str
```

Create VASP (macro node)

```python
@pyiron.node(register(onto.atomistic.code.vasp))
def VASP(incar: DataIO,
         structure: onto.atomistic.structure,
         calculator: onto.node.calculator
         ):
    onto_vasp = onto.atomistic.code.vasp
    vasp = onto_vasp.VASP_exe(incar=onto_vasp.VASP_parser_incar(incar=incar),
                              poscar=onto_vasp.VASP_parser_poscar(structure=structure)
                              )
    out_dict = onto_vasp.VASP_parser_outcar(vasp.outcar, select=['energy_tot'], exclude=[])
    return out_dict
```

This is only a sketch of first ideas. Comments and suggestions are very welcome. |
A few more ideas and pseudocode regarding node-based pyiron. The examples demonstrate typical workflows we use.

Example workflows

Single
Note that we could add many options in the workflow creation. Examples are whether to store the nodes in a database, whether and which input and output data to put into hdf5, etc.
Note that the new nodes allow providing all input via function parameters.

Parallel

Run over a large number of structures stored in a structure container:
or run over a list of temperatures:
Serial
and apply nodes:
|
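Running a node over a list of temperatures, as described above, is an embarrassingly parallel map. A minimal sketch under stated assumptions (`md_at_temperature` is a hypothetical stand-in for a Lammps MD node, and the energies are toy values):

```python
# Hypothetical sketch of mapping a node over a list of temperatures;
# a real node would launch a Lammps calculation instead of this toy function.
from concurrent.futures import ThreadPoolExecutor


def md_at_temperature(temperature: float) -> dict:
    # Toy stand-in result; values are illustrative only.
    return {"temperature": temperature, "energy_tot": -3.36 + 1e-4 * temperature}


temperatures = [300, 600, 900]
# Each evaluation is independent, so a plain executor map is enough;
# a workflow engine could distribute these tasks the same way.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(md_at_temperature, temperatures))
```

The workflow framework would only need to recognize the list-valued input and fan it out to independent tasks.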
Hi Jörg, Super, yep, I'm on board with this. I'll be on vacation through Wednesday so I won't be able to look at this in depth, but a bunch of the earlier stuff you suggest is already implemented over in contrib, minus the syntactic sugar of doing it with a decorator (which should be easy). One thing that is notably missing in your examples is labels for outputs. IMO these are absolutely critical for constructing complex graphs where inputs and outputs can be connected with any complexity -- the only way to get around this is to restrict these functions to only return a single output value, which I think is too harsh. Concretely, look at your example setting a kwarg "structure=wf.create.structure.bulk('Al')" -- in this case "bulk" is a node, and in principle may return multiple values (although in this case it only returns one). So I really feel extremely strongly that we need something like "structure = wf.create.structure.bulk('Al').structure", and to require specifying labels for the output, e.g. in the decorator like "@pyiron.node('structure')" or "@pyiron.node(('energy_pot', 'forces'))". If you and Jan haven't yet, please run the contrib workflow notebook(s?) to see what parts of your plans the existing infrastructure already covers. All the starting stuff I think is just a matter of adding decorator syntactic sugar to existing functionality. The later stuff with specifying parallelization, database connection, etc. all still needs to be done. On a first read-through, the absence of output naming is really the only thing that worries me here; the rest of it looks brilliant so far. Very practically, I want to first prioritize some performance enhancements for ironflow (decoupling port status model logic from the draw call should be sufficient), but once that's done I am excited to go over to contrib and start integrating the ideas here into the graph infrastructure 👍👍👍 |
Actually I have a second concern: the use of a global variable ('wf') in the final example. I think this is a closely connected concern to my worries about Marvin's contrib work, where he allows entire nodes to be passed as input to other nodes. I completely agree that we need this type of macro functionality, I just think we'll need to be a little extra careful about the implementation. |
Hi Liam, Thanks for your super-quick and very positive reply. I am glad that we both see the advantages and potential of these formulations and that so much is already realized and implemented in ironflow. Once you are back from your vacation it would be good to have a Zoom meeting to discuss the next steps in more detail. With respect to your questions/topics, some first thoughts below:
The types O1, O2 etc. can be ontologically enriched using concepts of the typing module (link). In particular, the Annotated type looks promising for providing metadata.
One final question. In your comment you mentioned 'contrib workflow notebook'. Are these the notebooks in the ironflow repository (or the one in the pyiron_ontology)? |
Hi Joerg, A live chat sounds great! Re annotations: I think that works nicely for adding onto typing in addition to data typing, although annotations currently do not support kwargs (like "o type=onto.foo"), so we would need to force a fixed ordering in annotations. For this reason I lean towards using a dict, e.g. as a kwarg in the decorator with keys matching the variable names, to provide this data. The big problem I see with using annotations to label output data is that I want output labels to be absolutely mandatory, and forcing people to learn about annotations and use them seems harder to understand than adding a positional arg to the node decorator. We could probably enforce it as a requirement via annotations, but we'd need to add extra guard rails, whereas adding it as an arg to the decorator makes it clear right away. Re the demo notebook: this is actually not at all integrated with ironflow yet; it's over in contrib in notebooks/workflow_example.ipynb |
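The "mandatory output labels as positional decorator args" idea discussed here can be sketched as follows. The `Node` class and `node` decorator are hypothetical stand-ins, not the pyiron_contrib implementation:

```python
# Sketch of mandatory output labels as positional args to the node
# decorator; all implementation details here are hypothetical.
class Node:
    def __init__(self, func, output_labels, *args, **kwargs):
        self.func, self.output_labels = func, output_labels
        self.args, self.kwargs = args, kwargs
        self.outputs = {}

    def run(self):
        result = self.func(*self.args, **self.kwargs)
        if len(self.output_labels) == 1:
            result = (result,)  # normalize single returns to a tuple
        self.outputs = dict(zip(self.output_labels, result))
        return self.outputs


def node(*output_labels):
    if not output_labels:
        raise TypeError("Output labels are mandatory")  # the guard rail

    def decorator(func):
        def wrapper(*args, **kwargs):
            return Node(func, output_labels, *args, **kwargs)
        return wrapper
    return decorator


@node("energy_pot", "forces")
def static_calc(structure):
    # Toy stand-in: zero energy and one zero-force vector per atom.
    return 0.0, [[0, 0, 0]] * len(structure)


n = static_calc(["Al"] * 4)
out = n.run()  # {"energy_pot": 0.0, "forces": [...]}
```

With labels attached at decoration time, every output channel has an unambiguous name for graph connections, even when the function returns multiple values.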
Missed the global var discussion because of indentation. It may indeed be possible to provide it by the scope of the class, but then you'd still need something like 'self.wf', and self will (at a minimum) look very strange in the context of a function definition (even if the decorator means we return a class instance). I bet we can find a solution, it will just take some thinking to get it both functional and intuitive. |
Let's try to schedule a meeting on Thursday or Friday. I am presently at the DPG meeting in Dresden and will be back in Düsseldorf on Friday. It would definitely be good to explore possible options to provide the labels for the output variables. Decorators may be a good option. We should only make sure that our solution stays as close to standard python as possible, so that the barrier for users is as low as possible. Thanks also for the latest developments on ironflow. The ontologic features are really great and it is exciting to play with them. The only issue is the sluggish behavior, which often makes it hard to know whether a click did not work or something is still happening. I am therefore looking forward very much to the next developments on speeding things up. Then one can fully enjoy the really cool features and the great concepts that you have already implemented. Really great work! |
Hi Liam, I had now a look at your workflow notebook in pyiron_contrib. Really very nice! I see also the strong links to my thoughts. An important task of the decorator would be to make the following statement more intuitive and python-like:
With the new formulation this could read like
The last line is only an example to show that all constructions in your notebook should work, i.e., the decorator converted the function into a node object. |
One more thought regarding your notebook. For code applications, it may be helpful to offer a lazy mode, i.e., the following statement should just build the workflow but not run it:
To actually run it one would have to call the following line:
Here one could also specify where to run it, i.e., the queue, the number of cores etc. It would be also nice to have an option to convert the code into a graph and vice versa:
|
While I am currently at the IPAM workshop, I would like to join the meeting to synchronize the discussions, so just keep me in the loop. |
Hi Jan, great that you will join. We have not yet set up a meeting but a good choice may be Friday afternoon when I will be back at home. |
I have a couple of recent developments from the IPAM workshop, which might also be helpful for this discussion:
While these developments focus more on the scalability of pyiron, I could see them being beneficial in simplifying the development of complex workflows and hopefully provide the ideal test bed for the developments discussed above. |
I'm just on mobile so my responses are pretty limited in depth, sorry. Re meeting: Friday sounds good. At present I can be free any time. We should keep @pmrv in the loop here too in case he wants to attend; there is both synergy and some conflict between the graph stuff and the tinybase stuff. Re standard python/decorators/link: indeed, I think getting the existing graph stuff working with decorators should be super fast, then adding the fancier bits on top can be more iterative. I'm super excited about this direction. Re lazy evaluation: there is some support for this! My existing node stuff has init flags for turning on/off the initial run and running automatically on update. Definitely not as smooth as your example, but the groundwork exists at least. Re structuretoolkit, pyiron_lammps, etc.: I am super excited to get this, as the coupling between current pyiron jobs and nodes is a huge pain to manage! Personally I would be happy to only support Lammps forever, but we will need to think carefully about data storage and making sure we can still accommodate more expensive codes like vasp. But for now getting it all on-the-fly as facilitated by pyiron_lammps is super exciting |
On Friday I'll still be in the train by the time @liamhuber and @jan-janssen would be able to join, so I'd prefer Thursday. |
From my side any time Thursday is also currently fine. |
Although if it's going to be first thing Thursday morning (pst) then I'll need to know inside the next five hours, which seems unlikely... |
Today (Thursday) does not work for me since I have to attend several talks and committee meetings at DPG. Tomorrow afternoon would work for me. |
Friday >=1500 CET is good for me. |
For me as well. |
I played around and implemented the decorator so that this now works:

```python
from pyiron_contrib.workflow.node import node

@node("sum")
def adder(x: int | float, y: int | float) -> int | float:
    return x + y
```

I was thinking a bit about how to handle macros and had two concrete thoughts:
And one fuzzy thought:
|
Sorry, I didn't write again yesterday. The earliest I can do tomorrow is 7pm CET, @JNmpi can do earlier he told me. I guess we can use the normal pyiron link. |
So what is the actual time then? I will set an alarm for 1445 CET (0545 PST) and check for a concrete reply here, in case the time is 1500 CET... but at that point I would certainly be happy to roll over and go back to sleep until my kids wake me up. 1900 CET is fine for me. |
Let's say 1915 CET then, in case there's a train delay or so. |
1915 CET works for me. |
I will be late. |
I tried to summarize and sketch some of the ideas we had over the last few days, particularly with Jan at the IPAM workshop, in schematic graphs. They should serve to sharpen and focus the discussion rather than representing a fixed construction schema. The first figure below shows the main components of the future node-based implementation of pyiron. An important aspect is the difference between the concepts/terms node and task. The node is the object that has all the information to translate input into a series of tasks. A simple example would be our Murnaghan object, which creates a separate Lammps or VASP job (task) for each fixed volume. Another example could be the Lammps library, which creates a series of jobs for a structure container. Below is a specification of the node repository (or node store), which locally or globally stores and provides all information to run a node on any computer. For the other parts, we should construct similar sketches and augment them with pseudocode. |
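The node-vs-task distinction described above can be made concrete with a small sketch: a Murnaghan-like node translates its input (a structure and a set of strains) into one independent task per volume. All class names here are hypothetical illustrations, not pyiron code:

```python
# Illustrative sketch of the node-vs-task distinction: the node holds
# the recipe, the tasks are the independent units of computation.
# All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class Task:
    structure: str
    volume_scale: float


class MurnaghanNode:
    def __init__(self, structure: str, strains: list):
        self.structure = structure
        self.strains = strains

    def generate_tasks(self) -> list:
        # One independent task per strained volume; a task manager
        # (e.g. Flux) could then run these in parallel.
        return [Task(self.structure, 1.0 + s) for s in self.strains]


murn = MurnaghanNode("Al", strains=[-0.02, 0.0, 0.02])
tasks = murn.generate_tasks()  # three tasks, one per volume
```

The node repository would then only need to store the node's recipe; the tasks are regenerated on demand on any computer.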
Some notes from our discussion today:

@JNmpi and I chatted today about graph-based pyiron computations, including taking a look at @pmrv's work. The current thrust is to make this sort of graph computation super easy to use and sufficiently powerful and performant for code users -- forget an actual graphical representation right now.

Rapid access for simple nodes

After talking a bit about pyiron objects knowing their own history, we came around to the idea of storing recipes for objects in the form of simple graphs. We came up with the following ideas for the case of very simple nodes that (a) initialize with valid input for all ports, (b) evaluate quickly, and (c) have a single output:
Then we might get an example like this:

```python
from pyiron_contrib import Workflow

wf = Workflow("my_chain_example")
structure = wf.add.node.atomistics.bulk(element="Al", repeat=5)
structure.plot3d()     # == structure.outputs.structure.value.plot3d()
structure.visualize()  # Shows the graph vis for a single node

structure[:5]          # Creates a _new node slice under the hood_
structure.visualize()
# Now we see a two-node graph, with an internal connection
structure.plot3d()  # == structure.outputs.slice_sliced.value.plot3d()
# Shows a structure with just the first five atoms
# Note that we still have only one output, so getattr works fine
# The full path to that output, however, is changed to the dynamically-created
# macro path of {node_label}_{output_channel_label}

structure.undo()
# Pops off the last node in our macro chain
structure.plot3d()  # Shows the original, full-size structure

structure[:5] = "Cu"  # ***Hard***
# By some magic, this adds a different new node,
# that changes species and returns a structure, and its input is set
# to match the slice info
structure.plot3d()  # == structure.outputs.change_species_structure
# Shows the Al structure with 5 Cu atoms

structure.inputs.bulk_element = "Mg"
structure.inputs.change_species_element = "Ca"
structure.inputs.change_species_i_end = 6
structure.plot3d()
# Now we have an Mg structure with 5 Ca atoms!
```

Honestly, I'm not sure how we will get the magic line labeled ***Hard*** working.

What to do about control loops

We can currently make sophisticated graphs on-the-fly in the notebook, but since the python process and notebook are the ones aware of for/while loops, there is no way to serialize them as part of the graph a priori. Today Joerg shared some snippets from Lego Mindstorms, where these sorts of flow-control objects are offered graphically with a drag-and-drop interface.
Or something like that. At any rate, for now let's just jam loops inside the node functionality itself and keep going.

Key missing pieces

We want a stable and useful solution ASAP; that means prioritizing a few things while letting others fall by the wayside. Priorities:
Non-priorities:
|
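"Jamming the loop inside the node functionality" as suggested above can be sketched like this: the while/for loop lives entirely in the function body, so the graph sees a single node and nothing about the loop needs to be serialized. The function name, convergence logic, and energy values are all hypothetical illustrations:

```python
# Sketch of a loop hidden inside a node body: the graph only ever sees
# one node, so no flow control needs to be serialized.
# All names and numbers here are illustrative stand-ins.
def relax_until_converged(energy_trace, eps: float = 1e-3) -> float:
    """Stand-in for a node that iterates internally until convergence."""
    e_old = None
    for e in energy_trace:  # stand-in for successive minimization steps
        if e_old is not None and abs(e - e_old) < eps:
            return e  # converged: successive energies differ by < eps
        e_old = e
    return e_old  # loop exhausted without convergence


final = relax_until_converged([-3.1, -3.30, -3.35, -3.3502, -3.3503])
```

From the workflow's perspective this is just an ordinary node with one input and one output; the price is that intermediate iterations are invisible to the graph.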
@liamhuber, thanks for the excellent summary of our discussion. I fully agree with it. Only a few minor points/thoughts:
|
Notes from 2023.05.17 meeting with Joerg

Raw notes augmented and polished on the 18th. @JNmpi, you had a nicely updated version of the sketch in this comment, could you upload it in this thread?

Core pyiron 1.0 features:
Executors and restarting workflows

Recently @jan-janssen has been getting into flux; I experimented with the most-primitive "executor". Joerg and I talked a bit about how to handle restarting workflows when (a) the python process controlling the workflow gets restarted and/or (b) the process handling the task execution gets restarted. Joerg was also excited about the hierarchical approach of Flux, giving us the option to have something like per-workflow or even per-node task management. Sitting down to write out these notes the day after the meeting, this is my thinking on the topic -- and it may all be "duh" stuff to Marvin and Jan, who have been thinking about pyiron's interaction with task scheduling for longer. I would define a "task manager" as some python-independent and permanent/recoverable process for executing computations, and an "executor" as a python object that executes tasks generated by nodes in our workflow.
When the "executor" very simply runs tasks modally on the main workflow python process, this is all trivial. There is obviously some strong overlap with dask's resiliancy policies, although in my dream-behaviour above, we would find a way to handle (some things like) scheduler failure more robustly. In all cases, the end user should see an extremely similar interface for their GUI stuffIn terms of GUIs, ironflow's dependence on ipycanvas means that more complex features -- like connection lines that automatically bend to flow around objects, or meta-nodes with snappable slots for actual nodes, etc. -- are going to be a huge pain to implement from scratch. Below is a sketch for what a slottable macro-node might look like.
Maintainability and classes

We want a few interfaces like ... In particular, a ...

Language power and usability

Joerg has recently looked into Julia a bit and was particularly keen on how it handles multiple dispatch. E.g., we have talked about having hierarchically defined IO classes, like ... E.g. 2, the ...

Node packages

We also talked briefly about version control and node packages. In principle, each serialized node will need to know which version of its node package it is from, but there is no problem mixing-and-matching nodes in a given workflow from different versions of the same node package -- as long as the IO connections are valid, the workflow shouldn't care if it uses nodes from multiple packages, and different versions of the same package are just equivalent to different packages. The one catch is that node package versions will need to be consistent with the version of pyiron being used, e.g. in case we change something like the ...

Practicality

We need to get @niklassiemer more deeply involved in these developments as he is the ideal person for long-term support! |
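Python has no built-in Julia-style multiple dispatch, but dispatch on the first argument's type gives a flavour of the hierarchically-defined IO-class idea mentioned above. This is a hedged sketch: the `Output`/`AtomisticOutput` class names and `summarize` function are hypothetical, chosen only to illustrate how more-specific IO types could get more-specific behaviour:

```python
# Sketch of type-hierarchical dispatch as an analogue of Julia's
# multiple dispatch; all class and function names are hypothetical.
from functools import singledispatch


class Output:  # generic base IO class
    def __init__(self, data):
        self.data = data


class AtomisticOutput(Output):  # more specific subclass
    pass


@singledispatch
def summarize(out: Output) -> str:
    # Fallback for any Output (or subclass without its own registration)
    return f"generic output: {out.data}"


@summarize.register
def _(out: AtomisticOutput) -> str:
    # More specific behaviour for the atomistic subclass
    return f"atomistic output with energy {out.data['energy_tot']}"


generic = summarize(Output(42))                              # generic branch
specific = summarize(AtomisticOutput({"energy_tot": -3.36}))  # specific branch
```

Real multiple dispatch would consider all arguments, not just the first; libraries like `multipledispatch` exist for that, but single dispatch already covers the "hierarchical IO classes pick the right implementation" use case.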
Today I spent some time playing with the idea of macros. Nothing is running yet, but I have some very rough spec ideas and pseudocode. As we've discussed before, I am thinking of a macro as a sort of crystallized workflow. As such,
Here's the syntax I'm playing around with:

```python
from pyiron_contrib.workflow import Workflow

@Workflow.wrap_as.single_value_node("sum")
def add(x: int = 0, y: int = 0) -> int:
    return x + y

macro = Workflow("plus_minus_one")
macro.p1 = add(y=1)
macro.m1 = add(y=-1)

# Choice 1) Use the default interface
plus_minus_one_default = macro.to_macro()

wf_default = Workflow("double_it_default")
wf_default.start = add()
wf_default.macro = plus_minus_one_default(
    p1_x=wf_default.start,
    m1_x=wf_default.start
)
wf_default.end = add(
    x=wf_default.macro.outputs.p1_sum,
    y=wf_default.macro.outputs.m1_sum
)

# Choice 2) Define a new interface
plus_minus_one_custom = macro.to_macro(
    # inputs={
    #     "x": (macro.p1.inputs.x, macro.m1.inputs.x)
    # },  # This way?
    inputs={
        macro.p1.inputs.x: "x",
        macro.m1.inputs.x: "x"
    },  # Or this way? For linking two inputs to a single channel
    outputs={
        "p1": macro.p1,
        "m1": macro.m1
    }
)

wf_custom = Workflow("double_it_custom")
wf_custom.start = add()
wf_custom.macro = plus_minus_one_custom(x=wf_custom.start)
wf_custom.end = add(
    x=wf_custom.macro.outputs.p1,
    y=wf_custom.macro.outputs.m1
)

# Choice 3) With a decorator
@Workflow.wrap_as.macro()
def plus_minus_one_deco(macro):
    """
    Macro-wrapped functions take the macro as the first and only argument
    (which is a lot like a workflow), create nodes and make connections,
    and return inputs and outputs maps for giving special access
    """
    macro.p1 = add(y=1)
    macro.m1 = add(y=-1)
    return {macro.p1.inputs.x: "x", macro.m1.inputs.x: "x"}, {}

wf_deco = Workflow("double_it_deco")
wf_deco.start = add()
wf_deco.macro = plus_minus_one_deco(x=wf_deco.start)
wf_deco.end = add(
    x=wf_deco.macro.outputs.p1_sum,  # We didn't map these
    y=wf_deco.macro.outputs.m1_sum   # So use the default
)

for wf in [wf_default, wf_custom, wf_deco]:
    for i in range(5):
        # All cases should do the same boring thing, and should do it
        # automatically since the children are SVNodes
        assert 2 * wf.inputs.start_x.value == wf.outputs.end_sum.value
```

The fact that the macro's child node is defined in the notebook really pushes at the question of how to best store macros. Of course I'd love it if under the hood they just stored class names and connection lists and re-instantiated nodes from known libraries... but perhaps sometimes we will really need to cloudpickle node instances and reinstantiate by unpickling the whole thing. |
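The "class names and connection lists" storage idea mentioned above could be sketched as a plain-data schema. The schema below (field names, package-version string) is entirely hypothetical, but shows how a macro could round-trip through JSON without pickling node instances:

```python
# Hypothetical sketch of storing a macro as node class names plus a
# connection list; the schema is illustrative, not a pyiron format.
import json

macro_spec = {
    "nodes": {
        "p1": {"class": "add", "package": "my_nodes==0.1.0", "kwargs": {"y": 1}},
        "m1": {"class": "add", "package": "my_nodes==0.1.0", "kwargs": {"y": -1}},
    },
    "connections": [
        # [from_node, channel] -> [to_node, channel]
        [["inputs", "x"], ["p1", "x"]],
        [["inputs", "x"], ["m1", "x"]],
    ],
}

serialized = json.dumps(macro_spec)
restored = json.loads(serialized)
# Re-instantiation would look up "add" in the named package version
# and rebuild the graph edges from the connection list.
```

The cloudpickle fallback would then only be needed for nodes defined ad hoc in the notebook, which have no importable class to look up.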