This repository has been archived by the owner on Dec 8, 2022. It is now read-only.

Merge pull request #95 from seq-lang/develop
Develop
arshajii authored Jan 26, 2020
2 parents a86be08 + 870fdbf commit 131a5a8
Showing 21 changed files with 274 additions and 84 deletions.
65 changes: 65 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,65 @@
# Contributing to Seq

Thank you for considering contributing to Seq! This document contains some helpful information for getting started. The best place to ask questions or get feedback is [our Gitter chatroom](https://gitter.im/seq-lang/Seq). For a high-level outline of the features we aim to add in the future, check the [roadmap](https://github.com/seq-lang/seq/wiki/Roadmap).

## An overview

The [compiler internals documentation](https://seq-lang.org/internals.html) is probably a good starting point if you plan on modifying the compiler. If you have any specific questions, please don't hesitate to ask on Gitter.

## Development workflow

All development is done on the [`develop`](https://github.com/seq-lang/seq/tree/develop) branch. Just before release, we bump the version number, merge into [`master`](https://github.com/seq-lang/seq/tree/master) and tag the build with a tag of the form `vX.Y.Z` where `X`, `Y` and `Z` are the [SemVer](https://semver.org) major, minor and patch numbers, respectively. Our Travis CI build script automatically builds and deploys tagged commits as a new GitHub release via our trusty [@SeqBot](https://github.com/seqbot). It also builds and deploys the documentation to our website.
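
As a rough illustration of the tag format described above, the following Python sketch checks whether a tag has the `vX.Y.Z` shape and extracts the three numbers. The helper name `parse_release_tag` is hypothetical; it is not part of the Travis CI build script, and the rule against leading zeros is taken from the SemVer specification rather than from this repository.

```python
import re

# Hypothetical helper for illustration only; not part of the Seq build scripts.
# SemVer numeric identifiers are non-negative integers without leading zeros.
TAG_PATTERN = re.compile(r"v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")

def parse_release_tag(tag):
    """Return (major, minor, patch) for a tag like 'v0.9.4', or None."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())

print(parse_release_tag("v0.9.4"))
```

A tag that lacks the leading `v`, or that has fewer than three components, yields `None` in this sketch.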

## Coding standards

- All C++ code should be formatted with [ClangFormat](https://clang.llvm.org/docs/ClangFormat.html) using the LLVM style guide.
- All OCaml code should be formatted with [OCamlFormat](https://github.com/ocaml-ppx/ocamlformat) using the Jane Street style guide.

## Writing tests

Tests are written as Seq programs. The [`test/core/`](https://github.com/seq-lang/seq/tree/master/test/core) directory contains some examples. If you add a new test file, be sure to add it to [`test/main.cpp`](https://github.com/seq-lang/seq/blob/master/test/main.cpp) so that it will be executed as part of the test suite. There are two ways to write tests for Seq:

#### New style

Example:

```python
@test
def my_test():
    assert 2 + 2 == 4
my_test()
```

**Semantics:** `assert` statements in functions marked `@test` are not compiled to standard assertions: they don't terminate the program when the condition fails, but instead print source information, fail the test, and move on.
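
To make the "fail and move on" behavior concrete, here is a plain-Python analogy. It is only a sketch: Python cannot redefine `assert`, so a hypothetical `expect` helper stands in for it, and the `failures` list is an assumption about how results might be recorded, not a description of what the Seq compiler generates.

```python
import inspect

failures = []  # one (function name, line number) entry per failed check

def expect(condition):
    """Record a failed check with the caller's source info instead of raising."""
    if not condition:
        frame = inspect.currentframe().f_back  # frame of the calling test
        failures.append((frame.f_code.co_name, frame.f_lineno))
        print(f"check failed in {frame.f_code.co_name} at line {frame.f_lineno}")

def my_test():
    expect(2 + 2 == 4)  # passes, nothing recorded
    expect(2 + 2 == 5)  # fails, is recorded, and execution continues
    expect(1 + 1 == 3)  # still runs and is recorded as well

my_test()
```

Both failing checks are reported, whereas a standard assertion would have stopped at the first one.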

#### Old style

Example:

```python
print 2 + 2 # EXPECT: 4
```

**Semantics:** The source file is scanned for `EXPECT`s, executed, then the output is compared to the "expected" output. Note that an `EXPECT` inside a loop must be duplicated once for every iteration the loop executes. Using `EXPECT` is helpful mainly in cases where you need to test control flow; **otherwise, prefer the new style**.
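
The scan-and-compare idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual test runner in `test/main.cpp`: the function names are hypothetical, and the exact whitespace and matching rules of the real runner may differ.

```python
import re

# Matches a trailing "# EXPECT: <text>" comment on a source line.
EXPECT_RE = re.compile(r"#\s*EXPECT:\s?(.*)$")

def expected_output(source):
    """Collect the text after each EXPECT comment, in source order."""
    expected = []
    for line in source.splitlines():
        match = EXPECT_RE.search(line)
        if match:
            expected.append(match.group(1).rstrip())
    return expected

def check_output(source, actual_output):
    """Compare captured program output, line by line, with the EXPECT list."""
    return actual_output.splitlines() == expected_output(source)

source = "print 2 + 2 # EXPECT: 4\n"
print(check_output(source, "4\n"))
```

Because the comparison is positional, a loop that prints three lines needs three `EXPECT` comments in the source, which is why the duplication mentioned above is necessary.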

## Pull requests

Pull requests should generally be based on the `develop` branch. Before submitting a pull request, please make sure...

- ... to provide a clear description of the purpose of the pull request.
- ... to include tests for any new or changed code.
- ... that all code is formatted as per the guidelines above.

Please be patient with pull request reviews, as our throughput is limited!

## Issues

We use [GitHub's issue tracker](https://github.com/seq-lang/seq/issues), so that's where you'll find the most recent list of open bugs, feature requests and general issues. If applicable, we try to tag each issue with at least one of the following tags:

- `Build`: Issues related to building Seq
- `Codegen`: Issues related to code generation (i.e. after parsing and type checking)
- `Parser`: Issues related to lexing/parsing
- `Library`: Issues related to the Seq standard library
- `Interop`: Issues related to interoperability with other languages or systems
- `Docs`: Issues related to documentation
- `Feature`: New language feature proposals
2 changes: 1 addition & 1 deletion compiler/include/seq/seq.h
@@ -35,7 +35,7 @@

#define SEQ_VERSION_MAJOR 0
#define SEQ_VERSION_MINOR 9
-#define SEQ_VERSION_PATCH 3
+#define SEQ_VERSION_PATCH 4

namespace seq {
namespace types {
3 changes: 3 additions & 0 deletions compiler/lang/func.cpp
@@ -229,6 +229,9 @@ void Func::sawPrefetch(Prefetch *prefetch) {
throw exc::SeqException(
"function cannot perform both prefetch and inter-sequence alignment",
getSrcInfo());
if (this->prefetch)
return;

this->prefetch = true;
gen = true;
outType = types::GenType::get(outType, types::GenType::GenTypeKind::PREFETCH);
5 changes: 5 additions & 0 deletions compiler/lang/lang.cpp
@@ -530,6 +530,11 @@ void TryCatch::codegen0(BasicBlock *&block) {
BasicBlock *preambleBlock = base->getPreamble();
types::Type *retType = base->getFuncType()->getBaseType(0);

if (types::GenType *gen = retType->asGen()) {
if (gen->fromPrefetch() || gen->fromInterAlign())
retType = gen->getBaseType(0);
}

// entry block:
BasicBlock *entryBlock = BasicBlock::Create(context, "entry", func);
BasicBlock *entryBlock0 = entryBlock;
9 changes: 5 additions & 4 deletions compiler/lang/pipeline.cpp
@@ -364,6 +364,7 @@ static Value *codegenPipe(BaseFunc *base,
assert(!inParallel);

BasicBlock *notFull = BasicBlock::Create(context, "not_full", func);
BasicBlock *notFull0 = notFull;
BasicBlock *full = BasicBlock::Create(context, "full", func);
BasicBlock *exit = BasicBlock::Create(context, "exit", func);

@@ -396,8 +397,8 @@ static Value *codegenPipe(BaseFunc *base,
BasicBlock *preamble = base->getPreamble();
// construct parameters
types::GenType::InterAlignParams paramExprs = genType->getAlignParams();
-Value *params = PipeExpr::validateAndCodegenInterAlignParams(
-paramExprs, base, preamble);
+Value *params =
+PipeExpr::validateAndCodegenInterAlignParams(paramExprs, base, entry);

IRBuilder<> builder(preamble);
const unsigned W = PipeExpr::SCHED_WIDTH_INTERALIGN;
@@ -431,14 +432,14 @@ static Value *codegenPipe(BaseFunc *base,
Value *N = builder.CreateLoad(filled);
Value *M = ConstantInt::get(seqIntLLVM(context), W);
Value *cond = builder.CreateICmpSLT(N, M);
-builder.CreateCondBr(cond, notFull, full);
+builder.CreateCondBr(cond, notFull0, full);

builder.SetInsertPoint(full);
N = builder.CreateCall(flush, {pairs, bufRef, bufQer, states, N, params,
hist, pairsTemp, statesTemp});
builder.CreateStore(N, filled);
cond = builder.CreateICmpSLT(N, M);
-builder.CreateCondBr(cond, notFull, full); // keep flushing while full
+builder.CreateCondBr(cond, notFull0, full); // keep flushing while full

// store the current state for the drain step:
drain->states = states;
8 changes: 8 additions & 0 deletions compiler/types/ptr.cpp
@@ -52,6 +52,14 @@ void types::PtrType::initOps() {
},
false},

{"__int__",
{},
Int,
[](Value *self, std::vector<Value *> args, IRBuilder<> &b) {
return b.CreatePtrToInt(self, seqIntLLVM(b.getContext()));
},
false},

{"__copy__",
{},
this,
2 changes: 1 addition & 1 deletion docs/sphinx/conf.py
@@ -30,7 +30,7 @@ def setup(sphinx):
# The short X.Y version
version = u'0.9'
# The full version, including alpha/beta/rc tags
-release = u'0.9.3'
+release = u'0.9.4'

# Logo path
html_logo = '../images/logo.png'
2 changes: 1 addition & 1 deletion docs/sphinx/cookbook.rst
@@ -196,7 +196,7 @@ Parallel FASTQ processing
# Sometimes batching reads into blocks can improve performance,
# especially if each is quick to process.
-FASTQ('reads.fq') |> iter |> block(1000) ||> process
+FASTQ('reads.fq') |> blocks(size=1000) ||> iter |> process
Reading SAM/BAM/CRAM
--------------------
23 changes: 23 additions & 0 deletions docs/sphinx/index.rst
@@ -16,3 +16,26 @@ Source code is available `on GitHub <https://github.com/seq-lang/seq>`_. You can
build
internals
api/doxygen

Frequently Asked Questions
--------------------------

*What is the goal of Seq?*

One of the main focuses of Seq is to bridge the gap between usability and performance in the fields of bioinformatics and computational genomics, which have an unfortunate reputation for hard-to-use, buggy or generally poorly-written software. Seq aims to make writing high-performance genomics or bioinformatics software substantially easier, and to provide a common, unified framework for the development of such software.

*Why do we need a whole new language? Why not a library?*

There are many great bioinformatics libraries on the market today, including `Biopython <https://biopython.org>`_ for Python, `SeqAn <https://www.seqan.de>`_ for C++ and `BioJulia <https://biojulia.net>`_ for Julia. In fact, Seq offers a lot of the same functionality found in these libraries. The advantages of having a domain-specific language and compiler, however, are the higher-level constructs and optimizations like :ref:`pipeline`, :ref:`match`, :ref:`interalign` and :ref:`prefetch`, which are difficult to replicate in a library, as they often involve large-scale program transformations/optimizations. A domain-specific language also allows us to explore different backends like GPU, TPU or FPGA in a systematic way, in conjunction with these various constructs/optimizations, which is ongoing work.

*What about interoperability with other languages and frameworks?*

Interoperability is and will continue to be a priority for the Seq project. We don't want using Seq to render you unable to use all the other great frameworks and libraries that exist. Seq already supports interoperability with C/C++ and Python (see :ref:`interop`), which we are in the process of expanding (e.g. by allowing Python libraries to be written in Seq).

*I want to contribute! How do I get started?*

Great! Check out our `contribution guidelines <https://github.com/seq-lang/seq/blob/master/CONTRIBUTING.md>`_ and `open issues <https://github.com/seq-lang/seq/issues>`_ to get started. Also don't hesitate to drop by our `Gitter chatroom <https://gitter.im/seq-lang/Seq?utm_source=share-link&utm_medium=link&utm_campaign=share-link>`_ if you have any questions.

*What is planned for the future?*

See the `roadmap <https://github.com/seq-lang/seq/wiki/Roadmap>`_ for information about this.
6 changes: 4 additions & 2 deletions docs/sphinx/intro.rst
@@ -15,6 +15,8 @@ Pre-built binaries for Linux and macOS on x86_64 are available alongside `each r
This will install Seq in a new ``.seq`` directory within your home directory. Be sure to update ``~/.bash_profile`` as the script indicates afterwards!

Seq binaries require `libomp <https://openmp.llvm.org>`_ to be present on your machine. ``brew install libomp`` or ``apt install libomp5`` should do the trick.

Building from source
^^^^^^^^^^^^^^^^^^^^

@@ -35,8 +37,8 @@ or produce an LLVM bitcode file if a ``-o <out.bc>`` argument is provided. In th
seqc -o myprogram.bc myprogram.seq
llc myprogram.bc -filetype=obj -o myprogram.o
-g++ -L/path/to/libseqrt/ -lseqrt -o myprogram myprogram.o
+gcc -L/path/to/libseqrt/ -lseqrt -lomp -o myprogram myprogram.o
-This produces a ``myprogram`` executable. (If multithreading is needed, the ``g++`` invocation should also include ``-fopenmp``.)
+This produces a ``myprogram`` executable.

**Interfacing with C:** If a Seq program uses C functions from a particular library, that library can be specified via a ``-L/path/to/lib`` argument to ``seqc``. Otherwise it can be linked during the linking stage if producing an executable.
10 changes: 10 additions & 0 deletions docs/sphinx/tutorial.rst
@@ -123,6 +123,8 @@ Common formats like FASTQ, FASTA, SAM, BAM and CRAM are supported.

Sequences can be reverse complemented in-place using the ``revcomp()`` method; both sequence and :math:`k`-mer types also support the ``~`` operator for reverse complementation, as shown above.

.. _match:

Sequence matching
^^^^^^^^^^^^^^^^^

@@ -186,6 +188,8 @@ A novel aspect of Seq's ``match`` statement is that it also works on sequences,

Sequence patterns consist of literal ``ACGT`` characters, single-base wildcards (``_``) or "zero or more" wildcards (``...``) that match zero or more of any base.
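
The pattern grammar above can be illustrated by translating a pattern into a regular expression. The Python sketch below is an analogy only: the function names are hypothetical, the whole sequence is assumed to have to match, and "any base" is assumed to mean one of ``A``, ``C``, ``G`` or ``T``. It says nothing about how the Seq compiler actually lowers ``match`` statements.

```python
import re

def pattern_to_regex(pattern):
    """Translate a Seq-style sequence pattern into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("...", i):
            parts.append("[ACGT]*")   # zero or more of any base
            i += 3
        elif pattern[i] == "_":
            parts.append("[ACGT]")    # exactly one base
            i += 1
        elif pattern[i] in "ACGT":
            parts.append(pattern[i])  # literal base
            i += 1
        else:
            raise ValueError(f"unexpected character {pattern[i]!r} in pattern")
    return re.compile("".join(parts))

def matches(pattern, sequence):
    """Return True if the whole sequence matches the pattern."""
    return pattern_to_regex(pattern).fullmatch(sequence) is not None

print(matches("AC_T", "ACGT"), matches("AC...", "AC"), matches("A_", "A"))
```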

.. _pipeline:

Pipelines
^^^^^^^^^

@@ -255,6 +259,8 @@ Here is the list of options supported by the ``align()`` method; all are optiona

Note that all costs/scores are positive by convention.

.. _interalign:

Inter-sequence alignment
""""""""""""""""""""""""

@@ -272,6 +278,8 @@ Seq uses `ksw2 <https://github.com/lh3/ksw2>`_ as its default alignment kernel.
Internally, the Seq compiler performs pipeline transformations when the ``inter_align`` function is used within a function tagged ``@inter_align``, so as to suspend execution of the calling function, batch sequences that need to be aligned, perform inter-sequence alignment and return the results to the suspended functions. Note that the inter-sequence alignment kernel used by Seq is adapted from `BWA-MEM2 <https://github.com/bwa-mem2/bwa-mem2>`_.

.. _prefetch:

Genomic index prefetching
^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -399,6 +407,8 @@ The ``ptr[T]`` type in Seq also corresponds to a raw C pointer (e.g. ``ptr[byte]

Seq also provides ``__ptr__`` for obtaining a pointer to a variable (as in ``__ptr__(myvar)``) and ``__array__`` for declaring stack-allocated arrays (as in ``__array__[int](10)``).

.. _interop:

C/C++ and Python interoperability
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3 changes: 2 additions & 1 deletion stdlib/bio/__init__.seq
@@ -2,8 +2,9 @@ import bio.seq

from bio.builtin import *

from bio.block import Block, blocks
from bio.locus import Locus
-from bio.iter import Seqs, Block, blocks, block_seqs
+from bio.iter import Seqs

from bio.align import SubMat, CIGAR, Alignment, inter_align
from bio.pseq import pseq, translate, as_protein
20 changes: 20 additions & 0 deletions stdlib/bio/align.seq
@@ -66,6 +66,26 @@ extend CIGAR:
    def __len__(self: CIGAR):
        return self.len

    def __init__(self: CIGAR, cigar: str) -> CIGAR:
        ops = list[int]()
        d = 0
        for c in cigar:
            if c.isdigit():
                d = 10 * d + ord(c) - ord('0')
            elif c in "MIDNSHP=XB":
                if d == 0:
                    raise ValueError(f"cigar op '{c}' in string '{cigar}' has no count or count zero")
                ops.append((d << 4) | "MIDNSHP=XB".find(c))
                d = 0
            else:
                raise ValueError(f"invalid CIGAR string '{cigar}': unexpected '{c}'")
        if d != 0:
            raise ValueError(f"unclosed cigar op in string '{cigar}'")
        p = ptr[u32](len(ops))
        for i, o in enumerate(ops):
            p[i] = u32(o)
        return (p, len(ops))

    @property
    def qlen(self: CIGAR):
        return _C.bam_cigar2qlen(self.len, self.value)
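
For readers who want to try the parsing logic outside of Seq, the new `CIGAR.__init__` can be mirrored in plain Python. This is a sketch, not the library code: `parse_cigar` and `unpack` are hypothetical names, and a Python list stands in for the `ptr[u32]` buffer. Each operation is packed as `(count << 4) | op_index`, the same layout the Seq code builds.

```python
CIGAR_OPS = "MIDNSHP=XB"  # index in this string is the 4-bit op code

def parse_cigar(cigar):
    """Pack each '<count><op>' pair of a CIGAR string as (count << 4) | op_index."""
    packed = []
    count = 0
    for ch in cigar:
        if ch in "0123456789":
            count = 10 * count + int(ch)
        elif ch in CIGAR_OPS:
            if count == 0:
                raise ValueError(f"cigar op '{ch}' in string '{cigar}' has no count or count zero")
            packed.append((count << 4) | CIGAR_OPS.index(ch))
            count = 0
        else:
            raise ValueError(f"invalid CIGAR string '{cigar}': unexpected '{ch}'")
    if count != 0:
        raise ValueError(f"unclosed cigar op in string '{cigar}'")
    return packed

def unpack(value):
    """Split a packed value back into (count, op character)."""
    return value >> 4, CIGAR_OPS[value & 0xF]

print([unpack(v) for v in parse_cigar("3M1I")])
```

The low four bits hold the operation and the remaining bits hold the length, so decoding is a shift and a mask.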
9 changes: 8 additions & 1 deletion stdlib/bio/bam.seq
@@ -261,9 +261,12 @@ class BAM:
        self._ensure_open()
        while _C.seq_hts_sam_itr_next(self.file, self.itr, self.aln) >= 0:
            yield self.aln

        self.close()

    def __blocks__(self: BAM, size: int):
        from bio.block import _blocks
        return _blocks(self.__iter__(), size)

    def __seqs__(self: BAM):
        for aln in self._iter():
            yield _C.seq_hts_get_seq(aln)
@@ -353,6 +356,10 @@ class SAM:
        for aln in self._iter():
            yield SAMRecord(aln)

    def __blocks__(self: SAM, size: int):
        from bio.block import _blocks
        return _blocks(self.__iter__(), size)

    def close(self: SAM):
        if self.aln:
            _C.bam_destroy1(self.aln)
43 changes: 43 additions & 0 deletions stdlib/bio/block.seq
@@ -0,0 +1,43 @@
class Block[T]:
    _data: ptr[T]
    _size: int

    def __init__(self: Block[T], size: int):
        self._data = ptr[T](size)
        self._size = 0

    def __iter__(self: Block[T]):
        data = self._data
        size = self._size
        i = 0
        while i < size:
            yield data[i]
            i += 1

    def __len__(self: Block[T]):
        return self._size

    def __bool__(self: Block[T]):
        return len(self) != 0

    def __str__(self: Block[T]):
        return f'<block of size {self._size}>'

    def _add(self: Block[T], elem: T):
        self._data[self._size] = elem
        self._size += 1

def _blocks[T](g: generator[T], size: int):
    b = Block[T](size)
    for a in g:
        if len(b) == size:
            yield b
            b = Block[T](size)
        b._add(a)
    if b:
        yield b

def blocks(x, size: int):
    if size <= 0:
        raise ValueError(f"invalid block size: {size}")
    return x.__blocks__(size)
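
The batching behavior of `_blocks` and `blocks` can be reproduced in plain Python for experimentation. The sketch below is an approximation: Python lists stand in for `Block[T]`, and any iterable is accepted directly instead of dispatching through a `__blocks__` method.

```python
def _blocks(items, size):
    """Yield lists of up to `size` consecutive items; the last may be shorter."""
    block = []
    for item in items:
        if len(block) == size:
            yield block
            block = []
        block.append(item)
    if block:  # emit the final, possibly partial, block
        yield block

def blocks(items, size):
    """Validate the size eagerly, then return the batching generator."""
    if size <= 0:
        raise ValueError(f"invalid block size: {size}")
    return _blocks(items, size)

print(list(blocks(range(5), 2)))
```

Keeping the size check in a separate non-generator function, as the Seq code does, means an invalid size is reported at the call site rather than when the first block is requested.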
4 changes: 4 additions & 0 deletions stdlib/bio/fasta.seq
@@ -54,6 +54,10 @@ type FASTA(file: gzFile, fai: list[int], names: list[str]):
yield (self.names[-1], seq(p, n))
self.close()

    def __blocks__(self: FASTA, size: int):
        from bio.block import _blocks
        return _blocks(self.__iter__(), size)

    def close(self: FASTA):
        self.file.close()

4 changes: 4 additions & 0 deletions stdlib/bio/fastq.seq
@@ -43,6 +43,10 @@ type FASTQ(file: gzFile):
line += 1
self.close()

    def __blocks__(self: FASTQ, size: int):
        from bio.block import _blocks
        return _blocks(self.__iter__(), size)

    def close(self: FASTQ):
        self.file.close()
