Skip to content

Latest commit

 

History

History
1168 lines (848 loc) · 24.3 KB

README.md

File metadata and controls

1168 lines (848 loc) · 24.3 KB

ari

A type-centred purely functional programming language designed to type binary files.

Funding for Ari is provided by NLnet and the European Commission through the NGI Assure initiative.

Why does this exist?

To make binary files more accessible

Most binary files require specially designed tools to read & write, but ari types are designed so that you can decompose any binary file down into its component parts. The language is intended to be a foundation for other tools to build on top of.

Some of the tools we plan on building with ari include:

arie

An editor specially designed to edit files using ari types.

We'll also build plugins for existing text editors.

ariq

A command line tool to query components of a file using ari types.

aric

A command line tool to compile ari types into other languages using ari map types.

arid

A command line tool to structurally diff files of the same ari type.

ariz

A command line tool to compress files & directories using ari types.

Ari's type system is designed to act as a guide for arithmetic coding.

... but also, not just binary formats

While the primary focus is to help interpret binary data, ari is also designed to model grammars for text-based languages.

Here is a quick comparison of how some formal language concepts map into ari:

NOTE: This comparison is meant for people who are already familiar with these concepts.

If you want to get a better understanding of what they mean, or what the ari equivalents mean, see the language reference.

RegexAri
a|b|c
(| "a" "b" "c")
abc
"abc"

or

(* "a" "b" "c")
a?
(^ "a" ?)
a*
(^ "a" _)
a+
(^ "a" !)
a{3}
(^ "a" 3)
a{2, 5}
(^ "a" (.. 2 5))

or

(^ "a" (| 2 3 4))
.
codepoint
Backus–Naur formAri
"terminal"
"terminal"
<non-terminal>
non-terminal
<symbol> ::= "ab" | "cd" <x>
:symbol (| "ab" (* "cd" x))

What makes ari unique?

Types are the main focus of the language

In ari, everything is a type. Types and type expressions are the primary focus of the language. You can add, multiply, and even exponentiate types, much like you can a number in any other programming language.

Ari type expressionsEquivalent types
:sum (+ a b c)
:product (* a b c)
:map (^ a b c)

See the language reference

Generalizes the idea of primitive types

In ari, there are infinitely many primitives types generalized by a single concept, natural numbers. We call these primitive types natural types.

...well what exactly does this mean? Say you want to create a size type that has 3 possible states, small, medium, large. In ari, you can precisely type the number of states using the 3 type:

:size 3

The possible values of this type are 0, 1, 2, which you can use to represent small, medium, large. This is a powerful concept, but this small example doesn't fit with the main goal of ari, to make things more accessible.

...this is exactly where algebraic expressions and labels come in! You can break down and label the individual states of a 3 type with a sum expression. For example, this sum expression produces a sum type that's equivalent to the 3 type:

:size (+ :small 1 :medium 1 :large 1)

and the labelled states in this example have a 1:1 correspondence with the 3 type:

  • small = 0
  • medium = 1
  • large = 2

The whole idea of ari is that binary data can be interpreted as one giant number, and we provide ways to break big numbers down into smaller numbers... and label them 🙂.

NOTE: Many other languages can partially model the idea of sum types with "enums". The main issue though is enums are often "rounded" to the best matching primitive type(s) (usually some kind of int type). In ari every possible sum type has an equivalent natural type, which allows for a level of precision not possible in other type systems.

Value bindings and dependent types

In ari all types have an implicit binding with a corresponding runtime value. Dereference expressions @ can be used on a type to bring this value from "value-space" into "type-space".

Here's an example where they're used to describe a 24-bit RGB colour image with a dynamically sized width & height:

:byte (^ :bit 2 :bit-length 8)
:my-image-format
(*
  :width byte
  :height byte
  :image
  (^
    :pixel (* :r byte :g byte :b byte)
    @width
    @height
  )
)

In this example the @width and @height types are evaluated from the runtime value of width & height.

Expressions are normally evaluated at compile time, but this pushes the evaluation to runtime. Ari can use the same expressions between compile time & runtime because it's purely functional, meaning that expressions don't depend on any hidden state.

Labels are optional, but you can also label anything

Most languages force you to name things, even when the name is redundant or unnecessary. A lot of these languages have to come up with parallel constructs to match their named counterparts. eg. functions ↔ lambdas, structs ↔ tuples... Often there isn't an unnamed alternative. You'll be forced to decouple & name things even when an inlined alternative would be easier to understand.

Ari takes a fundamentally different approach. Any expression can be inlined without labelling, but also, all expressions can be labelled.

This gives developers flexibility to do whatever makes the most sense. Maybe something is only used once and it would be simpler to just inline it, maybe you don't even know how to name something yet, but you want to define its structure. Maybe you just want to refer to a type nested within another type without decoupling it from its parent.

When you can label everything, anything can be a module.

What ideas does ari steal?

Ari steals a bunch of good ideas from various places:

Academic Inspirations:

Language Inspirations:

  • Lisp (s-expressions - ari is not a lisp, homoiconicity, functional-focused)
  • Rust (sum types & other algebraic types)
  • Nix (purely functional, lazily evaluated, currying)
  • Haskell (indirectly through Rust & Nix + monads)

Language reference

Atomic expressions

These are the predefined base expressions of ari, they produce the "atoms" for symbolic expressions.

Natural expressions

0 1 2 3 ...

Produce the primitive types of the language, modelled after natural numbers. Its "value" encodes the number of possible states that can exist in the type.

Bottom expression
0

Produces the bottom type , which has no possible states. You can think of this as the "error type".

This is the identity type for sum expressions:

(=
  (+ x 0)
  x
)

There are certain contexts that propagate the bottom type:

This "error propagation" can be "caught" by other expressions. For example:

(+ 256 (* 256 0))

(* 256 0) evaluates to 0, the bottom type, but (+ 256 0) evaluates to 256... which is not the bottom type. This sum expression catches the error!

This doesn't seem useful when only working with natural expressions, why not just remove expressions that propagate the bottom type? Well.... types can also be evaluated at runtime with dereference expressions @, and we use this same idea to catch runtime errors.

Unit expression
1

Produces the unit type, which has only one possible state. You can think of this as a type that doesn't actually store any information.

This is the identity type for product expressions:

(=
  (* x 1)
  x
)

Label expressions

:label 123

Define a name for the result of an expression in the current scope.

Symbol expressions

symbol

Produce "symbol types", which are equivalent to whatever type is associated with symbol in the current scope.

Value expressions

In ari, all types are implicitly bound to a corresponding runtime value. Value expressions are used to refer to these runtime values.

They can only be used in:

NOTE: These are the same contexts that propagate the bottom type 0.

Reference expressions
$symbol

Produce "reference types", which are bound to the same runtime value of symbol, but are equivalent to 1.

Dereference expressions
@symbol

When applied to symbol types, produce a type from the runtime value of the symbol type.

@123
@(+ :small 1 :medium 1 :large)

When applied to non-symbol types, produce an "effective type" from the runtime value of the non-symbol type.

"Effective types" don't have an associated runtime value.

Path expressions

symbol:x:y:z
(* :x (* :y (* :z 256))):x:y:z

Produce a nested type contained within another type.

Extended symbol expressions

'(symbol)'

Produce symbols that can include non-whitespace special characters.

To encode quotes in this expression, double them up:

'''quoted'''

NOTE: To get a better understating of how this works, these expressions only terminate after an odd sequence of quotes. Then all terminating quotes (except for the last) are interpreted as part of the symbol.

Any characters following whitespace will be treated as documentation:

:'pixel A 24bit RGB pixel'
(* :r 256 :g 256 :b 256)

'pixel Here we're referencing pixel'

Unicode text expressions

"()"

A convenient notation for defining text-based grammars, which is actually just syntax sugar for runtime assertion expressions that assert for Unicode text.

To encode quotes in this expression, double them up:

"{
  ""abc"": 123
}"

NOTE This follows the same behaviour as extended symbol expressions

Text not bound to any particular encoding context will be interpreted as utf-8, but this can be changed with text encoding macros:

(=
  (ascii-8 "text")
  (ascii "text")
)
(ascii-7 "text")
(utf-16 "text")
(utf-32 "text")
Codepoint symbol
codepoint

Represents a single Unicode code point for text in the current text encoding context. This can be dynamically sized.

Grapheme symbol
grapheme

Represents a single Unicode grapheme for text in the current text encoding context. This can be dynamically sized.

Symbolic expressions

(sexpr arg1 arg2 arg3)

Symbolic expressions are a list of atomic expressions, separated by whitespace & wrapped by parenthesis.

The first element is treated as a function, and the remaining elements are passed as inputs to that function.

These are considered "symbolic expressions" because their behaviour depends on the symbols defined in the current scope.

These modelled after lisp s-expressions, but are different in a few ways:

  • Ari doesn't have cons cells, so symbolic expressions aren't implemented with cons cells. The closest concept to cons cells are product expressions.

  • Symbolic expressions form a lexical scope from the label expressions it contains. This means that "let" expressions are embedded in all symbolic expressions.

Assertion expressions

:equal (= a b c)

Asserts that all arguments have the same number of possible states. If they do, this evaluates to the last argument, otherwise it evaluates to 0.

Algebraic expressions

Additive expressions
Add and sum expressions
:add (+ a b)

Given natural inputs, produces a sum type , which is either a or b. This adds together the number of possible states between the two types.

We derive a nary form of add + by repeated application of associativity:

:additive-associativity
(=
  (+ (+ a b) c)
  (+ a (+ b c)
  (+ a b c)
)
Additive inverse expressions
:subtract (- a b)

Produces an integer type that subtracts the first few b states from a.

Combined with the additive identity 0, we derive a unary form of subtract -:

:negative
(=
  (- 0 x)
  (- x)
)
Additive group theory

Additive expressions form an abelian group in type-space, and a non-abelian group in value-space.

Multiplicative expressions
Multiply and product expressions
:multiply (* a b)

Given natural inputs, produces a product type Π, which has both a and b. This multiplies the number of possible states between the two types.

We derive a nary form of multiply * by repeated application of associativity:

:multiplicative-associativity
(=
  (* (* a b) c)
  (* a (* b c))
  (* a b c)
)
Multiplicative inverse expressions
:divide (/ a b)

Produces a rational type type that divides b states out of a.

Combined with the multiplicative identity 1, we derive a unary form of divide /:

:inverse
(=
  (/ 1 x)
  (/ x)
)
Multiplicative group theory

Multiplicative expressions form an abelian group in type-space, and a non-abelian group in value-space.

Exponentiation expressions
Exponentiate and map expressions
:exponentiate (^ b e)

Given natural inputs, produces a map type , which maps the exponent e to the base b. This raises the number of possible states in b to the power of the number of possible states in e.

Exponentiation is non-associative, but we derive a nary form of exponentiate ^ through currying. Which is just repeated application of the power of a power identity:

:power-of-a-power-identity
(=
  (^ (^ a b) c)
  (^ a (* b c))
  (^ a b c)
)
Exponentiation left inverse expression
:logarithm (log b x)

Produces a complex type ℂ, which treats x like a map type (even if it's not), and computes the exponent e given the base b.

Exponentiation right inverse expression
:root (root x e)

Produces a complex type ℂ, which treats x like a map type (even if it's not), and computes the base b given the exponent e.

root is actually just syntax sugar for exponentiation with a multiplicative inverse:

:root
(=
  (^ x (/ e))
  (root x e)
)
Exponentiation group theory

Exponentiation expressions form a quasigroup in both type-space & value-space.

Set expressions

Sets let you interpret a single value as multiple different types. This is particularly useful when you don't actually know what the type is, but the type can be deduced by its contents.

Singletons

All types are implicitly treated as singletons. These are sets with exactly one type.

Empty set expression
:empty-set ()

Produces the empty set. A set with no types.

Top expression
:top-type _

Produces what is called a top type , but it's not exactly a "type". It's a set containing every possible type.

_ is actually just syntax sugar for any type that directly references itself:

:_ _
Union expressions
:union (| a b)

Produces a set which takes the union between a and b.

We derive an nary form of union | by repeated application of associativity:

:union-associativity
(=
  (| (| a b) c)
  (| a (| b c)
  (| a b c)
)
Zero or one expression
?

Syntax sugar for the union | between 0 and 1:

:? (| 0 1)
Symmetric difference expression
:symmetric-difference (~ a b)

Produces a set which takes the symmetric difference between a and b.

symmetric-difference ~ is actually just syntax sugar for the union | of both relative complements !:

(=
  (| (! a b) (! b a))
  (~ a b)
)
Intersection expressions
:intersection (& a b)

Produces a set which takes the intersection between a and b.

We derive an nary form of intersection & by repeated application of associativity:

:intersection-associativity
(=
  (& (& a b) c)
  (& a (& b c)
  (& a b c)
)
Complement expressions
:relative-complement (! a b)

Produces a set which takes the relative complement of a in b.

Combined with the top type _, we derive a unary form of relative-complement !:

:complement
(=
  (! x _)
  (! x)
)
One or more expression
:one-or-more
(=
  (! 0)
  !)

Syntax sugar for the complement ! of 0:

Interval expression
:interval (.. a b)

Produces a half-open interval between a & b, where b is excluded.

Combined with the empty set (), we derive a unary form of interval ..:

(=
  (.. () b)
  (.. b)
)

License

Copyright (C) 2023 Kira Bruneau

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Why is this licensed under the GPL instead of the LGPL?

This was a difficult decision. By licensing this under the GPL, not only are we preventing proprietary software from being combined with ari, but also open source software using a license that's incompatible with the GPL.

This restriction may seem to go against our goal of making binary data more open & accessible, but it very much works in favour of it.

Proprietary software is designed to restrict the freedoms of its users for profit. By allowing ari to be used in proprietary software, we'd be perpetuating the idea that software should be hard to access & modify. This is exactly what we don't want.

This doesn't mean that users are restricted from using ari along side proprietary software. The GPL is designed in a way to take away freedoms from developers, and give as many freedoms as possible back to users.

This license just requires that proprietary (or open source software with a license allowing for integration with proprietary software) can't be built on top of, or packaged together with ari.

See why you shouldn't use the Lesser GPL for your next library.