As many have probably guessed, RosettaScripts is indeed Turing complete. It has probably been Turing complete from very near its conception, but the addition of StoredResidueSubsets gave access to unlimited named variables. Combine that with the IfMover, the IfThenFilter, ParsedProtocols, and the LoopOverMover and you have control flow. Finally, the CalculatorFilter and SavePoseMover give incredible shortcuts for actually performing computations.
Printing is accomplished by mutating a pose and printing its sequence. Ironically, the MultiplePoseMover is probably the best for printing a pose's sequence. Key input is technically possible by loading constraint files, but an external program must generate that file on the fly.
Many of the operations performed here don't seem possible, unless you consider the idea of writing thousands of lines of xml. For instance, mathematical operations coming from the CalculatorFilter go through an entire array of IfMovers with RangeFilters each looking for exactly 1 int value. (It uses a binary search, so it's not that slow.)
This github repo contains "rosetta_complete.py" which compiles the Ros language to RosettaScripts xml. Ros is an ultra slim language that looks like python but lacks most features.
To execute a compiled xml, the following command will work where pdb.pdb is any pdb.
rosetta_scripts -s pdb.pdb -parser:protocol compiled.xml | grep Applying
Often one may want to pass the output through sed to make it easier to read. Because the pipeline may buffer at this point, sed commands often end up looking like this (this one converts every I to " "):
stdbuf -o0 rosetta_scripts -s pdb.pdb -parser:protocol compiled.xml | stdbuf -o0 -i0 grep Applying | stdbuf -i0 -o0 sed 's/I/ /g'
prime_numbers.ros, tunnel.ros, and tetris.ros provide pretty good overviews of the Ros language. tetris.ros was the holy grail to really prove the Turing completeness of the language.
Only tetris accepts key inputs. Running tetris.ros in one terminal and ./keyinput.py '<left><up><right><down>/' in another will give you control. Use the arrow keys and / to control the pieces.
tetris.xml uses this command to run:
stdbuf -o0 rosetta_scripts -overwrite -s test.pdb -mute all -unmute protocols.rosetta_scripts.MultiplePoseMover -parser:protocol tetris.xml | stdbuf -o1880 -i0 grep --color=never Applying | stdbuf -o0 -i0 sed -e 's/I/ /g' -e 's/[VYTW]/▓/g'
The others work well with this command:
stdbuf -o0 rosetta_scripts -overwrite -s test.pdb -mute all -unmute protocols.rosetta_scripts.MultiplePoseMover -parser:protocol tunnel.xml | stdbuf -o0 grep Applying | stdbuf -o0 -i0 sed -e 's/I/ /g' -e 's/W/▓/g'
In Ros, all indentations (think python) must be 4 spaces.
In Ros, all numbers are non-negative integers. Floating point numbers may exist briefly inside calculations, but the results are floored to an int. Overflows often clip to 0 or $MAXINT.
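As a rough mental model of that rule, here is a small Python sketch. The helper name `to_ros_int` and the exact clipping behaviour are assumptions for illustration; the real compiled XML only "often" clips, so treat this as the typical case rather than a guarantee.

```python
import math

MAXINT = 1023  # assumed value; Ros reads the real one from the $MAXINT directive


def to_ros_int(value, maxint=MAXINT):
    """Floor a (possibly fractional) result, then clip it into [0, maxint]."""
    floored = math.floor(value)
    return max(0, min(maxint, floored))


print(to_ros_int(7 / 2))     # 3
print(to_ros_int(3 - 10))    # 0
print(to_ros_int(1000 * 2))  # 1023
```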
Variables work like you would expect.
a = 5
b = a
There are many reserved variable names. In general, avoid double underscore (__), lit_*, retval, and many more.
All ascii chars can be accessed via 'X' like C++. These are evaluated as integers.
All protein sequence chars may be accessed via 'aaW'. Use these when writing to buf[]
In Ros, there are a few buffers you can use. All but buf[] accept both reads and writes. buf[] may only be written to.
buf -- literally a protein of size $BUFSIZE
ram -- buffer of size $RAMSIZE
big0, big1, big2, big3, big4 -- The screen buffer used for the big_*() functions. Each big* is length $BUFSIZE
keybuf -- Where keyinput() stores its values
Buffers are accessed like python:
buf[5] = 'aaR'
ram[16] = ram[ a + b ]
Arbitrarily named buffers could easily be added to this project (8 lines of code), though they are not currently implemented.
if in Ros works very similarly to python except the if statement must be an expression containing: <, >, <=, >=, or ==.
if ( (x + 3) / 2 >= 7 ):
    do_something()
else:
    do_something_else()
elif is not supported. Neither is !=. If you need !=, just go straight to else
Only the while loop is supported and only the no-argument version is supported.
while:
    do_something()
To break out of a loop, use break
i = 0
while:
    if ( i == 5 ):
        break
    i = i + 1
continue is also supported.
Functions look very much like python. But default values are not supported.
def my_cool_function( arg1, arg2 ):
    return arg1 + arg2
All functions return exactly 1 integer value. return may be used anywhere inside a function. Functions without return return 0.
A major difference between Ros functions and python functions, however, is that like C++, Ros functions can only be used once they've been declared. This means if you define A() then B(), you cannot call B() from A(). Additionally, you can't call B() from B() because B() isn't fully defined until after it has been parsed. So recursive functions aren't possible. There's likely no way around this limitation as it is a limitation of RosettaScripts itself.
All expressions in Ros are evaluated by the CalculatorFilter. Only these operator characters are allowed: *-+/(). In order to aid parsing, every nested ( ) is evaluated separately. The effect of this is that every ( ) is effectively the python function int( ).
Example:
a = ( 4 / 5 ) * 5
if ( a == 0 ):
    # this is what actually happens
else:
    # you would expect 4 though
Because the ( 4 / 5 ) is evaluated separately before the X * 5, 4 / 5 gets floored to 0 and the entire expression evaluates to 0.
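To make the "every ( ) is int( )" rule concrete, here is a tiny Python evaluator that mimics it. This is only a sketch of the semantics described above, not the compiler's actual parser: the function names are made up, it handles integer literals only (no variables), and it does no clipping to $MAXINT.

```python
import math
import re


def evaluate(text):
    """Evaluate + - * / with floats, flooring each ( ) group and the final result."""
    tokens = re.findall(r"\d+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        tok = take()
        if tok == "(":
            inner = expr()
            take()  # consume the matching ")"
            return float(math.floor(inner))  # each group acts like int( )
        return float(tok)

    def term():
        value = atom()
        while peek() in ("*", "/"):
            op = take()
            rhs = atom()
            value = value * rhs if op == "*" else value / rhs
        return value

    def expr():
        value = term()
        while peek() in ("+", "-"):
            op = take()
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    return math.floor(expr())


print(evaluate("( 4 / 5 ) * 5"))  # 0
print(evaluate("4 / 5 * 5"))      # 4
```

Dropping the parentheses keeps the intermediate 0.8 alive long enough to be multiplied, which is why the second form gives 4.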
Boolean operations are not allowed in expressions.
Expression evaluation is really slow, avoid it where possible.
pass -- works exactly like in python (and is actually never needed in Ros)
print() -- prints the contents of buf[]
@__script__VAR -- Using the @ sign, you can access the raw variable names. For the most part don't use this, but @__script__VAR will access "global" variables stored at 0 indent
@buf_NAME_X -- You can also use @ to directly access buffers. This is much faster but requires you to know at compile time what offset you want. buf[] cannot be accessed this way.
Ros scripts may signal to the compiler certain values. Here are all options:
$BUFSIZE -- The size of buf
$RAMSIZE -- Length of ram buffer
$MAXINT -- The maximum integer value. Must be 2^X - 1
$RANDMAX -- Max output for rand() function. Must be 2^X - 1
$WHILE_ITERS -- The maximum number of while loop iterations
$AVAIL_CHARS -- Characters available for big_print()
$CLEAR_CHAR -- big_print() space char
$PRINT_CHAR -- big_print() filled char
$BUF_CACHING -- Either 0 or 1. See "buf_caching" section
$KEY_INPUT -- Number of keys for keyinput. See "key input" section
$KEY_INPUT_CALLS -- Number of times keyinput() can be called
Storing to directives looks like this:
$MAXINT 1023
$AVAIL_CHARS ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
Ros offers many built-in functions to do things too hard to code in the actual Ros language
big_clear() -- clear the big* buffer
big_print(char, offset) -- write a character to the big* buffer at offset. Returns new offset
big_display() -- copy the big* buffer to buf and print each line
rand() -- return a random number between 0 and $RANDMAX inclusive
keyinput() -- Load keyinput data
save("name") -- Save buf[] to name "name". Don't use the names "integer" or "buffer"
load("name") -- Load stored buf from "name" into buf[]. See buf_caching.
big_print() offers the easiest way to actually print words and numbers from Ros.
In order to use big_print(), you must set $AVAIL_CHARS to all of the characters you plan to use. $PRINT_CHAR and $CLEAR_CHAR decide what will be used for filled and empty regions of the text.
big_print() is most often used like this:
big_clear()
off = 10
off = big_print('H', off)
off = big_print('E', off)
off = big_print('L', off)
off = big_print('L', off)
off = big_print('O', off)
big_display()
See tunnel.ros for a way to print integers.
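For a feel of what integer printing involves, here is one possible approach modelled in Python. This is an assumption about how it could be done with only the operators Ros has (there is no modulo), not a transcription of tunnel.ros. In Ros the quotient line would be written as ( n / 10 ), relying on the parentheses to floor the result.

```python
def decimal_digits(n):
    """Split a non-negative int into digit characters, most significant first."""
    if n == 0:
        return ['0']
    chars = []
    while n > 0:
        quotient = n // 10            # Ros: ( n / 10 ), floored by the parentheses
        digit = n - quotient * 10     # remainder without a modulo operator
        chars.append(chr(ord('0') + digit))
        n = quotient
    chars.reverse()
    return chars


print(decimal_digits(1023))  # ['1', '0', '2', '3']
```

Each resulting character would then be handed to big_print() in turn, so the digits 0-9 need to be listed in $AVAIL_CHARS.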
This is the only sad part of the Ros language because it relies on an external program. Perhaps someday someone can get Ros to actually read from stdin.
Calling keyinput() will cause a sequence constraint file called keyinput.MSAcst to be read. It is expected that keyinput.MSAcst refers to another file containing a PSSM constraint (here keyinput.profile). The value stored for Alanine is what is read. Each seqpos of the file is read into the keybuf[] buffer. $KEY_INPUT determines how many keys are read.
keyinput.py can be used to generate the keyinput.profile. It takes a string argument for each key you want to use. Then, each time a key is pressed, it increments the corresponding seqpos of the PSSM by 1.
Due to a hilarious twist of fate, the ConstraintSetMover caches each constraint it reads. This means that internally, a different ConstraintSetMover is used for each keyinput() call. $KEY_INPUT_CALLS specifies how many times you are allowed to call keyinput()
buf_caching:
In order to make printing faster, it's possible to cache the buf[] buffer such that the underlying pose does not need to be updated if the current write won't change it. This makes storing to buf[] slightly slower, unless your stored value is already cached, in which case it is very fast.
An important caveat here is that the buffer used for caching is not directly tied to the underlying pose. Therefore, if one calls the builtin load(), the buffer will go stale.
If one wishes to use load() with $BUF_CACHING 1, the best option is to ensure that writes to buf[] always go to the original buffer. (i.e. if you call load() on something that is out of sync with buf[], be sure to load() the thing that's in-sync with buf[] before you actually store to buf[])
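The stale-cache problem is easier to see in a toy model. The Python class below is purely illustrative: the class and attribute names are invented, and the pose is reduced to a list of one-letter codes. It only shows why a load() can leave the cache disagreeing with the pose.

```python
class CachedBuf:
    """Toy model of buf[] with $BUF_CACHING 1 (all names are hypothetical)."""

    def __init__(self, size, fill="A"):
        self.pose = [fill] * size    # stands in for the "buffer" pose
        self.cache = [fill] * size   # shadow copy consulted before every write
        self.saved = {}
        self.mutations = 0           # counts the expensive pose updates

    def write(self, index, aa):
        if self.cache[index] == aa:
            return                   # cache says nothing changes: skip the pose
        self.cache[index] = aa
        self.pose[index] = aa
        self.mutations += 1

    def save(self, name):
        self.saved[name] = list(self.pose)

    def load(self, name):
        self.pose = list(self.saved[name])  # note: the cache is NOT refreshed


buf = CachedBuf(4)
buf.save("blank")
buf.write(0, "W")
buf.load("blank")        # pose[0] is "A" again, but the cache still says "W"
buf.write(0, "W")        # skipped because of the stale cache
print(buf.pose[0], buf.cache[0], buf.mutations)  # A W 1
```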
This section describes how rosetta_complete.py is actually able to use the RosettaScript language as an assembly language.
There are two poses that are used and that are created on startup by clever use of the AddChain mover. One is $BUFSIZE long and the other is $MAXINT long and they are both stored with SavePoseMovers. For 99% of execution, the "integer" pose is loaded. Rarely, the "buffer" pose is loaded when mutations need to be made or it needs to be printed.
All of the variables are StoredResidueSubsets. Specifically, they are selections of 0 or more residues starting from residue 1. These can be easily copied around with StoreResidueSubset and can be evaluated as filters with the ResidueCountFilter.
Literal values (like 4) come from a giant table of IndexResidueSelectors from "lit_0" all the way to "lit_$MAXINT"
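In other words, an integer n is stored in unary as "the first n residues". A minimal Python model of that encoding follows; the helper names are invented for illustration, and a small MAXINT is assumed so the literal table stays tiny.

```python
MAXINT = 15  # small assumed value; the "integer" pose would be this long


def selection_for(value):
    """The value n is the set of residues 1..n (empty for 0)."""
    return frozenset(range(1, value + 1))


def residue_count(selection):
    """Reading a variable just counts its residues, like ResidueCountFilter."""
    return len(selection)


# The literal table: one prebuilt selection per representable integer.
literals = {f"lit_{n}": selection_for(n) for n in range(MAXINT + 1)}

variables = {}
variables["a"] = literals["lit_5"]    # a = 5
variables["b"] = variables["a"]       # b = a (copying a stored subset)
print(residue_count(variables["b"]))  # 5
print(sorted(variables["b"]))         # [1, 2, 3, 4, 5]
```

This also suggests why the "integer" pose is $MAXINT long: the largest value needs that many residues to select.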
Expressions are first parsed into their most basic form A*B+C, and then converted into CalculatorFilters. The calculator filter is then evaluated by a huge array of IfMovers holding RangeFilters to decide what value to store to a specific result variable.
To limit the number of lines of xml, a sort of math subprocessor is used. First, all variables are stored into the math registers and another register is used to select which math expression to use. The actual math function is an IfThenFilter which selects from a list of calculator filters. That IfThenFilter is then evaluated over and over again as the RangeFilters try to decide what the result is.
This process uses a binary search to make it faster, but math is still really slow.
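The cost of reading a result this way can be sketched in Python. Here `passes_range` stands in for one RangeFilter check on the calculator output; the function names are assumptions, and the real compiled XML is a static tree of IfMovers rather than a loop.

```python
def read_result(passes_range, maxint):
    """Find a hidden value in [0, maxint] using only 'is it in [lo, hi]?' checks."""
    lo, hi = 0, maxint
    checks = 0
    while lo < hi:
        mid = (lo + hi) // 2
        checks += 1
        if passes_range(lo, mid):   # one RangeFilter evaluation
            hi = mid
        else:
            lo = mid + 1
    return lo, checks


hidden = 700  # pretend this is what the CalculatorFilter computed
value, checks = read_result(lambda lo, hi: lo <= hidden <= hi, 1023)
print(value, checks)  # 700 10
```

With $MAXINT 1023, every result costs 10 range checks, and each check re-evaluates the math filter, which fits the observation that math is slow.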
Buffers are made as terribly as you're probably expecting. Every single position in the buffer is actually a StoredResidueSubset. Then, a binary search is used to index into a wall of StoreResidueSubsetMovers.
The actual print() statement is very easy: it turns out that the MultiplePoseMover prints the pose sequence. So one just needs to load the "buffer" pose, call MultiplePoseMover, and reload "integer".
Actually mutating the "buffer" pose is much harder. The reason is that all of the StoredResidueSubsets are actually stored inside "integer". So as soon as you load "buffer", all of your variables disappear. It's possible to circumvent this with walls of ParsedProtocols though. First, a set of $BUFSIZE ParsedProtocols mark a residue on "buffer" with a PDBInfoLabel. Then, a 20-length ParsedProtocols wall marks the marked aa with another PDBInfoLabel specifying which aa to use. Finally, the PackRotamersMover is called with 21 task operations: 1 that prevents repacking on non-marked residues, and 20 RestrictAbsentCanonicalAASRLT.
Functions are surprisingly easy. They are simply parsed protocols which store to a specific named variable before they exit. Then, the code that called the function simply stores that variable where it needs to go.
Function arguments are handled in the same way with the calling code storing to a few specific named variables which are then used by the function.
See the "Breaking" section for information about mid-function return
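The calling convention described above boils down to passing everything through shared named variables. A minimal Python sketch of it follows. `retval` is one of the reserved names mentioned earlier; the argument slot names and the helper `call_function` are made up for illustration.

```python
variables = {}  # stands in for the global pool of named variables


def call_function(body, arg_names, arg_values, dest):
    """Caller side: store the arguments, run the body, copy out the result."""
    for name, value in zip(arg_names, arg_values):
        variables[name] = value
    variables["retval"] = 0              # functions without return give 0
    body()                               # the function's ParsedProtocol
    variables[dest] = variables["retval"]


def my_cool_function_body():
    """Function side: read the argument slots, store to the return slot."""
    variables["retval"] = variables["arg1"] + variables["arg2"]


call_function(my_cool_function_body, ["arg1", "arg2"], [2, 3], "x")
print(variables["x"])  # 5
```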
At the heart of if is the IfMover. The then clause is one ParsedProtocol and the else clause is another. Expressions are evaluated by the CalculatorFilter and the result passed through different RangeFilters to decide on less than or greater than.
while loops are just LoopOverMovers. Each while loop has a variable that tells it to keep going. If something causes the loop to end, 0 is stored to that variable and on the next iteration, the loop exits.
This is where things start to deviate from simple. First, the LoopOverMover isn't great at saving state. For this reason, each LoopOverMover starts and ends with a SavePoseMover. Additionally, the filter needs to be a MoveBeforeFilter that loads the pose because the LoopOverMover discards poses that fail a filter halfway through.
Leaving a loop is accomplished by "Breaking"
This turned out to be the most complicated part of the whole control flow compilation. It may seem obvious what should happen when returning from an if statement inside a while loop inside a function, but when you consider that this is represented as 3 nested ParsedProtocols, actually signalling a return from the current line is tough.
From the viewpoint of control-flow compilation, break and return behave exactly the same.
The first break is accomplished by simply failing a filter. This causes the current ParsedProtocol to quit.
Subsequent breaks are harder. They necessitate including a filter after every single if and while call that can fail if a break was triggered inside. Additionally, the "integer" pose must be saved and loaded before triggering the filters because failed filters usually reset the pose (and all stored variables).
In practice, compiling a break looks like this:
- Look back in the scope-stack for the correct structure to break (break breaks while and return breaks functions)
- Identify all while loops along the way and set their loop variable to 0
- If we're returning, store the return value
- Set all the filter variables to fail for each structure whose parent we are breaking. In the initial example, we have to tell the if and while structures to fail as soon as they return.
- Save the current pose
- Fail a filter
And then what happens is that the current ParsedProtocol quits, the ParsedProtocol containing it likely quits, and the stack unrolls until we've hit the correct final level.
One more note: if a ParsedProtocol fails a filter, the entire mover gets flagged as a fail. If uncaught, this breaks control flow farther than one wants. The LoopOverMover saves the day here, because it can actually be used like a try: catch: block to wrap all parsed protocols that might fail. Simply set the iterations to 1.
The rand function is actually a tower of IfMovers that all trigger on a FalseFilter with confidence="0.5"
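One way to read that, sketched in Python: each IfMover acts as a fair coin flip that decides one bit of the result. The bit-per-IfMover interpretation is an assumption (it would explain why $RANDMAX must be 2^X - 1), and the `coin` parameter is just a hook added here so the sketch can be checked deterministically.

```python
import random


def rand(randmax, coin=lambda: random.random() < 0.5):
    """Build a number in [0, randmax] one coin flip per bit; randmax must be 2^X - 1."""
    bits = (randmax + 1).bit_length() - 1   # X, the number of coin flips
    value = 0
    for bit in range(bits):
        if coin():                          # one 50/50 IfMover
            value += 1 << bit
    return value


print(rand(7, coin=lambda: True))    # 7
print(rand(7, coin=lambda: False))   # 0
```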
This is the only bad part of the project because I don't know how to actually read keys from Rosetta. In principle, this is pretty simple: load a PSSM into the pose (the pose is already poly ALA) and read the PSSM value for the first N residues as a score.
The hilarious part though is that since the ConstraintSetMover caches the value, you actually have to use a different mover for every single read.
Maybe someday someone can get rid of this dependency.