CIP-0058? | New CIP for bitwise primitives #268
Conversation
@goverthrow in a future editing round could you remove the hard line breaks from within paragraphs? Also, re: portability of the document... @KtorZ @SebastienGllmt @crptmppt the inline maths formatting here is brilliant & probably essential... how could we make sure it's preserved in other contexts like https://cips.cardano.org ? So far, equations in CIPs have been submitted as code, images, or manually formatted text.

I think pasting the entire content into the PR description was probably not necessary :)
> # Abstract
>
> Add primitives for bitwise operations, based on `BuiltinByteString`, without requiring new data types.

Suggested change:

```suggestion
Add primitives for bitwise operations, based on `bytestring`, without requiring new data types.
```

Reply: Why these (and other) changes? `BuiltinByteString` is the name of the corresponding Plutus Core type as far as I'm aware; has this changed? Even if this is the case, `bytestring` isn't a valid type name.
> but are difficult or impossible to implement without some, or all, of the
> primitives we propose.
>
> ## Succinct data structures

Suggested change:

```suggestion
### Succinct data structures
```
> Furthermore, succinct data structures are not limited to sets of integers, but
> **all** require bitwise operations to be implementable.
>
> ## Binary representations and encodings

Suggested change:

```suggestion
### Binary representations and encodings
```
> only constant overheads. There are several examples of these, and all rely on
> bitwise operations for their implementations.
>
> For example, consider wanting to store a set of `BuiltinInteger`s

Suggested change:

```suggestion
For example, consider wanting to store a set of `integer`s
```
> For example, consider wanting to store a set of `BuiltinInteger`s
> on-chain. Given current on-chain primitives, the most viable option involves
> some variant on a `BuiltinList` of `BuiltinInteger`s; however,

Suggested change:

```suggestion
some variant on a `list` of `integer`s; however,
```
> We also specify some specific non-goals of this proposal.
>
> ### No metaphor-mixing between numbers and bits

Comment: 👍
> Performs a bitwise shift of the first argument by the
> absolute value of the second argument, with padding, the direction being
> indicated by the sign of the second argument.

Comment: What does "with padding" mean, and how does the second argument indicate the direction?
> We intend that `BuiltinByteString`s represent byte sequences, with the
> sequence of bits being exactly as the description above. For example, given the
> byte sequence `0110111100001100`, the `BuiltinByteString`
> corresponding to it would be `o\f`.

Comment: It looks like you mean "the bytestring corresponding to this string under encoding X"; what is X?

Reply: It would be under the encoding we describe in this section. How could I phrase this more clearly?

Reply: You don't describe a string encoding here. I guessed that it might be ASCII, and it looks like it is, so you could say "The bytestring corresponding to it would be the one corresponding to the ASCII encoding of `o\f`".
> return $-1$ instead.
>
> ## Semantics

Comment: I believe my comments in IntersectMBO/plutus#4252 (comment) still need to be addressed here.
> $n \neq m$, the result is a mismatched length error. Otherwise, the result is
> the byte sequence $u = u_n u_{n - 1} \ldots u_0$ such that for all $i \in \{0, 1, \ldots, n\}$ we have
>
> $$u_i = \begin{cases}

Comment: This doesn't render well on GitHub.
Comment: There may be a justification for this choice, but it's not obvious to me, and it's not specified in the proposal either. The operations' result length could instead match the length of the longer argument, reproducing its tail. This would simplify certain tasks, like clearing and masking select bits.

Reply: I think the justification is that we want to be especially fussy in on-chain code about not allowing unexpected behaviour to sneak through. In this proposal, a bitwise AND just does a bitwise AND, rather than possibly also extending one of the arguments. The extending behaviour may be convenient, but it's easy to implement yourself, and it opens up opportunities for accidental misuse. Better to be explicit, do one thing, and let the user extend if they need it.

Comment: @kozross @goverthrow marking this deprecated as per #283 (comment).

Comment: Shall we close this PR? @goverthrow
# Abstract

Add primitives for bitwise operations, based on `BuiltinByteString`, without requiring new data types.

# Motivation
Bitwise operations are one of the most fundamental building blocks of algorithms
and data structures. They can be used for a wide variety of applications,
ranging from representing and manipulating sets of integers efficiently, to
implementations of cryptographic primitives, to fast searches. Their wide
availability, law-abiding behaviour and efficiency are the key reasons why they
are widely used, and widely depended on.
At present, Plutus lacks meaningful support for bitwise operations, which
significantly limits what can be usefully done on-chain. While it is possible to
mimic some of these capabilities with what currently exists, and it is always
possible to introduce new primitives for any task, this is extremely
unsustainable, and often leads to significant inefficiencies and duplication of
effort.
We describe a list of bitwise operations, as well as their intended semantics,
designed to address this problem.
## Example applications
We provide a range of applications that could be useful or beneficial on-chain,
but are difficult or impossible to implement without some, or all, of the
primitives we propose.
### Succinct data structures
Due to the on-chain size limit, many data structures become impractical or
impossible, as they require too much space either for their elements, or their
overheads, to allow them to fit alongside the operations we want to perform on
them. Succinct data structures could serve as a solution to this, as they
represent data in an amount of space much closer to the entropy limit and ensure
only constant overheads. There are several examples of these, and all rely on
bitwise operations for their implementations.
For example, consider wanting to store a set of `BuiltinInteger`s
on-chain. Given current on-chain primitives, the most viable option involves
some variant on a `BuiltinList` of `BuiltinInteger`s; however,
this is unviable in practice unless the set is small. To see why, suppose that
we have an upper limit of $k$ on the `BuiltinInteger`s we want to store;
this is realistic in practically all cases. To store $n$ `BuiltinInteger`s
under the above scheme requires approximately $n \cdot (\log_2 k + c)$
bits, where $c$ denotes the constant overhead for each cons cell of
the `BuiltinList` holding the data. If the set being represented is dense
(meaning that the number of entries is a sizeable fraction of $k$), this cost
becomes intolerable quickly, especially when taking into account the need to
also store the operations manipulating such a structure on-chain with the script
where the set is being used.

If we instead represented the same set as a bitmap based on `BuiltinByteString`,
the amount of space required would instead be $k$ bits. This is significantly
better unless $n$ is small. Furthermore,
this representation would likely be more efficient in terms of time in practice,
as instead of having to crawl through a cons-like structure, we can implement
set operations on a memory-contiguous byte string. Some of these operations
(population counting, for example) can have terrifyingly efficient
implementations: the Muła-Kurz-Lemire algorithm (the current state of the art)
can process four kilobytes per loop iteration, which amounts to over four
thousand potential stored integers.
A potential implementation could use a range of techniques to make these
operations extremely efficient, by relying on SWAR (SIMD-within-a-register)
techniques if portability is desired, and SIMD instructions for maximum speed.
This would allow both potentially large integer sets to be represented on-chain
without breaking the size limit, and nodes to efficiently compute with such,
reducing the usage of resources by the chain. Lastly, in practice, if
compression techniques are used (which also rely on bitwise operations!), the
number of required bits can be reduced considerably in most cases without
compromising performance: the current state-of-the-art (Roaring Bitmaps) can be
used as an example of the possible gains.
In order to make such techniques viable, bitwise primitives are mandatory.
Furthermore, succinct data structures are not limited to sets of integers, but
**all** require bitwise operations to be implementable.
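To make the bitmap idea concrete, here is a small sketch in Python (standing in for on-chain code); the function names mirror the proposed primitives but are illustrative only, and the bit-indexing convention is an assumption made for this example.

```python
# Integer-set-as-bitmap sketch. Each helper is analogous to one of the
# proposed primitives; the names and the bit layout are illustrative only.

def empty_set(k: int) -> bytes:
    """A bitmap with room for integers 0 .. k - 1, all absent."""
    return bytes((k + 7) // 8)

def insert(bitmap: bytes, i: int) -> bytes:
    """Set bit i (analogous to the proposed writeBitByteString)."""
    b = bytearray(bitmap)
    b[i // 8] |= 1 << (i % 8)
    return bytes(b)

def member(bitmap: bytes, i: int) -> bool:
    """Test bit i (analogous to the proposed testBitByteString)."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))

def union(a: bytes, b: bytes) -> bytes:
    """Bitwise IOR of two equal-length bitmaps (analogous to iorByteString)."""
    if len(a) != len(b):
        raise ValueError("mismatched lengths")
    return bytes(x | y for x, y in zip(a, b))

def size(bitmap: bytes) -> int:
    """Count of stored elements (analogous to popCountByteString)."""
    return sum(bin(byte).count("1") for byte in bitmap)
```

Note that `union` is a single pass over contiguous bytes, which is exactly the shape that SWAR and SIMD techniques accelerate.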
### Binary representations and encodings
On-chain, space is at a premium. One way that space can be saved is with binary
representations, which can potentially represent something much closer to the
entropy limit, especially if the structure or value being represented has
significant redundant structure. While some possibilities for a more efficient
packing already exist in the form of `BuiltinData`, it is rather
idiosyncratic to the needs of Plutus, and its decoding is potentially quite
costly.

Bitwise primitives would allow more compact binary encodings to be defined,
where complex structures or values are represented using fixed-size
`BuiltinByteString`s. The encoders and decoders for these could also be
implemented more efficiently than currently possible, as there exist numerous
bitwise techniques for this.
## Goals

To ensure a focused and meaningful proposal, we specify our goals below.

### Useful primitives

The primitives provided should enable implementations of algorithms and data
structures that are currently impossible or impractical. Furthermore, the
primitives provided should have a high power-to-weight ratio: having them should
enable as much as possible to be implemented.
### Maintaining as many algebraic laws as possible

Bitwise operations, via Boolean algebras, have a long and storied history of
algebraic laws, dating back to important results by the likes of De Morgan, Post
and many others. These algebraic laws are useful for a range of reasons: they
guide implementations, enable easier testing (especially property testing) and,
in some cases, allow much more efficient implementations. To some extent, they
also formalize our intuition about how these operations should work. Thus,
maintaining as many of these laws as possible in our implementation, and being
clear about them, is important.
### Allowing efficient, portable implementations

Providing primitives alone is not enough: they should also be efficient. This is
not least because many associate the notion of a *primitive operation* with
being close to the machine, and therefore fast. Thus, it is on us to ensure that
the primitives we provide can be implemented efficiently across a range of
hardware.
### Clear indication of failure

While totality is desirable, in some cases there isn't a sensible answer for us
to give. A good example is division by zero: if we are asked to do such a
thing, the only choice we have is to reject it. However, we need to make it as
easy as possible for someone to realize why their program is failing, by
emitting a sensible message which can later be inspected.
## Non-goals

We also specify some specific non-goals of this proposal.

### No metaphor-mixing between numbers and bits
A widespread legacy of C is the mixed treatment of numbers and blobs of
bits: specifically, allowing logical operations on representations of numbers.
This applies to Haskell as much as any other language: according to the
Haskell Report, it is in fact required that any type implementing `Bits`
implement `Num` first. While GHC Haskell only mandates `Eq`, it still defines
`Bits` instances for types clearly meant to represent numbers. This is a bad
choice, as it creates complex situations and partiality in several cases, for
arguably no real gain other than C-like bit twiddling code.
Even if two types share a representation, their type distinctness is meant to be
a semantic or abstraction boundary: just because a number is represented as a
blob of bits does not necessarily mean that arbitrary bit manipulations are
sensible. However, by defining such a capability, we create several semantic
problems:

- The semantics of several operations must take signedness into
  account. A good example are shifts: instead of simply having left or right
  shifts, we now have to distinguish *arithmetic* versus *logical*
  shifts, simply to take into account that a shift can be used on something
  which is meant to be a number, which could be signed. This creates
  unnecessary complexity and duplication of operations.
- As `BuiltinInteger`s are of arbitrary precision, certain
  bitwise operations are not well-defined on them. A good example is bitwise
  complement, which is not well-defined without a fixed width; GHC's `Integer`,
  in fact, is partial in its `Bits` instance.
- Some operations on `BuiltinInteger` would have quite
  undesirable semantic changes in order to be implementable. A good example
  are bitwise rotations: we should be able to decompose a rotation left or
  right by some amount into a composition of smaller rotations
  without changing the outcome. However, because trailing zeroes are not
  tracked by the implementation, this can fail depending on the choice of
  decomposition, which seems needlessly annoying for no good reason.
- Other operations on `BuiltinInteger` would require
  additional arguments and padding to define them sensibly. Consider bitwise
  logical AND: in order to perform this sensibly on `BuiltinInteger`s,
  we would need to specify what length we assume they have, and some policy
  of padding when the length requested is longer than one, or both,
  arguments. This feels unnecessary, and it isn't even clear exactly how we
  should do this: for example, how would negative numbers be padded?
These complexities, and many more besides, are poor choices, owing more to the
legacy of C than any real useful functionality. Furthermore, they feel like a
casual and senseless undermining of type safety and its guarantees for very
small and questionable gains. Therefore, defining bitwise operations on
`BuiltinInteger` is not something we wish to support.

There are legitimate cases where a conversion from `BuiltinInteger` to
`BuiltinByteString` is desirable; this conversion should be provided, and
be both explicit and specified in a way that is independent of the machine or
the implementation of `BuiltinInteger`, as well as total and
round-tripping. Arguably, it is also desirable to provide built-in support for
`BuiltinByteString` literals specified in a way convenient to their
treatment as blobs of bytes (for example, hexadecimal or binary notation), but
this is outside the scope of this proposal.
# Specification

## Proposed operations

We propose several classes of operations. Firstly, we propose two operations for
inter-conversion between `BuiltinByteString` and `BuiltinInteger`:

- `integerToByteString`: Convert a number to a bitwise representation.
- `byteStringToInteger`: Reinterpret a bitwise representation as a number.

We also propose several logical operations on `BuiltinByteString`s:

- `andByteString`: Perform a bitwise logical AND on arguments of the same
  length, producing a result of the same length, erroring otherwise.
- `iorByteString`: Perform a bitwise logical IOR on arguments of the same
  length, producing a result of the same length, erroring otherwise.
- `xorByteString`: Perform a bitwise logical XOR on arguments of the same
  length, producing a result of the same length, erroring otherwise.
- `complementByteString`: Complement all the bits in the argument, producing a
  result of the same length.

Lastly, we define the following additional operations:

- `shiftByteString`: Performs a bitwise shift of the first argument by the
  absolute value of the second argument, with padding, the direction being
  indicated by the sign of the second argument.
- `rotateByteString`: Performs a bitwise rotation of the first argument by
  the absolute value of the second argument, the direction being indicated by
  the sign of the second argument.
- `popCountByteString`: Returns the number of $1$ bits in the argument.
- `testBitByteString`: If the position given by the second argument is not in
  bounds for the first argument, error; otherwise, if the bit given by that
  position is $1$, return `True`, and `False` otherwise.
- `writeBitByteString`: If the position given by the second argument is not in
  bounds for the first argument, error; otherwise, set the bit given by that
  position to $1$ if the third argument is `True`, and $0$ otherwise.
- `findFirstSetByteString`: Return the lowest index such that
  `testBitByteString` with the first argument and that index would be `True`.
  If no such index exists, return $-1$ instead.
## Semantics

### Preliminaries

We define $\mathbb{N}^{+} = \{ x \in \mathbb{N} \mid x \neq 0 \}$. We assume
that `BuiltinInteger` is a faithful representation of $\mathbb{Z}$. A *bit
sequence* $s = s_n s_{n-1} \ldots s_0$ is a sequence such that for all
$i \in \{0, 1, \ldots, n\}$, $s_i \in \{0, 1\}$. A bit sequence
$s = s_n s_{n-1} \ldots s_0$ is a *byte sequence* if $n = 8k - 1$ for some
$k \in \mathbb{N}$. We denote the empty bit sequence (and, indeed, the empty
byte sequence as well) by $\emptyset$.

We intend that `BuiltinByteString`s represent byte sequences, with the
sequence of bits being exactly as the description above. For example, given the
byte sequence `0110111100001100`, the `BuiltinByteString` corresponding to it
would be the one corresponding to the ASCII encoding of `o\f`.

Let $i \in \mathbb{N}^{+}$. We define the sequence
$\mathtt{binary}(i) = (d_0, m_0), (d_1, m_1), \ldots$ as

$$(d_0, m_0) = (i \operatorname{div} 2, i \bmod 2)$$

and

$$(d_{j + 1}, m_{j + 1}) = (d_j \operatorname{div} 2, d_j \bmod 2)$$
### Representation of `BuiltinInteger` as `BuiltinByteString` and conversions

We describe the translation of `BuiltinInteger` into `BuiltinByteString`,
which is implemented as the `integerToByteString` primitive. Informally, we
represent `BuiltinInteger`s with the least significant bit at bit position $0$,
using a twos-complement representation. More precisely, let
$i \in \mathbb{N}^{+}$. We represent $i$ as the bit sequence
$s = s_n s_{n-1} \ldots s_0$ such that $s_j = m_j$ for each
$(d_j, m_j)$ in $\mathtt{binary}(i)$, padded with $0$ bits up to the nearest
multiple of eight. For $0$, we represent it as the sequence `00000000` (one
zero byte). We represent any $i \in \{ x \in \mathbb{Z} \mid x < 0 \}$ as the
twos-complement of the representation of its additive inverse. We observe that
any such sequence is by definition a byte sequence.

To interpret a byte sequence $s = s_n s_{n - 1} \ldots s_0$ as a
`BuiltinInteger`, we reverse the above process: `00000000` is interpreted as
$0$, and any other byte sequence is interpreted as the twos-complement value it
encodes.

The above interpretation is implemented as the `byteStringToInteger`
primitive. We observe that `byteStringToInteger` and `integerToByteString`
form an isomorphism: converting a `BuiltinInteger` to a `BuiltinByteString`
and back (or vice versa) yields the original value.
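The conversions described above can be sketched as follows in Python; the exact byte order and padding width are assumptions made for illustration (the roundtrip property holds regardless of these choices), and the names merely mirror the proposed primitives.

```python
# Sketch of integerToByteString / byteStringToInteger: twos-complement,
# least significant bit at position 0, padded to whole bytes. Byte order
# within the result is an assumption made for this illustration.

def integer_to_byte_string(i: int) -> bytes:
    """Encode a (possibly negative) integer as a twos-complement byte string."""
    if i == 0:
        return b"\x00"  # the one-zero-byte representation of 0
    # Enough bits for the magnitude plus a sign bit, rounded up to bytes.
    width = (abs(i).bit_length() + 8) // 8
    return i.to_bytes(width, byteorder="little", signed=True)

def byte_string_to_integer(bs: bytes) -> int:
    """Inverse of integer_to_byte_string."""
    return int.from_bytes(bs, byteorder="little", signed=True)
```

The two functions roundtrip on any integer, which is the isomorphism property the proposal requires.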
### Bitwise logical operations on `BuiltinByteString`

Throughout, let $s = s_n s_{n-1} \ldots s_0$ and $t = t_m t_{m - 1} \ldots t_0$
be two byte sequences. Whenever we specify a *mismatched length error* result,
its error message must contain at least the following information:

- the length of the first argument; and
- the length of the second argument.

We describe the semantics of `andByteString`. For inputs $s$ and $t$, if
$n \neq m$, the result is a mismatched length error. Otherwise, the result is
the byte sequence $u = u_n u_{n - 1} \ldots u_0$ such that for all
$i \in \{0, 1, \ldots, n\}$ we have

$$u_i = \begin{cases} 1 & s_i = 1 \text{ and } t_i = 1 \\ 0 & \text{otherwise} \end{cases}$$

For `iorByteString`, for inputs $s$ and $t$, if $n \neq m$, the result is
a mismatched length error. Otherwise, the result is the byte sequence
$u = u_n u_{n - 1} \ldots u_0$ such that for all $i \in \{0, 1, \ldots, n\}$ we have

$$u_i = \begin{cases} 0 & s_i = 0 \text{ and } t_i = 0 \\ 1 & \text{otherwise} \end{cases}$$

For `xorByteString`, for inputs $s$ and $t$, if $n \neq m$, the result is
a mismatched length error. Otherwise, the result is the byte sequence
$u = u_n u_{n-1} \ldots u_0$ such that for all $i \in \{0, 1, \ldots, n\}$ we have

$$u_i = \begin{cases} 1 & s_i \neq t_i \\ 0 & \text{otherwise} \end{cases}$$
We observe that, for length-matched arguments, each of `andByteString`,
`iorByteString` and `xorByteString` describes a commutative and associative
operation. Furthermore, for any given length $k$, each of these operations has
an identity element: for `iorByteString` and `xorByteString`, this is the bit
sequence of length $k$ where each element is $0$, and for `andByteString`, the
bit sequence of length $k$ where each element is $1$. Lastly, for any length
$k$, the bit sequence of length $k$ where each element is $0$ is an absorbing
element for `andByteString`, and the bit sequence of length $k$ where each
element is $1$ is an absorbing element for `iorByteString`.
We now describe the semantics of `complementByteString`. For input $s$, the
result is the byte sequence $u = u_n u_{n - 1} \ldots u_0$ such that for all
$i \in \{0, 1, \ldots, n\}$ we have $u_i = 1 - s_i$.

We observe that `complementByteString` is self-inverting. We also note that the
following equivalences hold assuming `b` and `b'` have the same length; these
are the De Morgan laws:

- `complementByteString (andByteString b b') = iorByteString (complementByteString b) (complementByteString b')`
- `complementByteString (iorByteString b b') = andByteString (complementByteString b) (complementByteString b')`
### Mixed operations

Throughout this section, let $s = s_n s_{n-1} \ldots s_0$ and
$t = t_m t_{m - 1} \ldots t_0$ be byte sequences, and let $i \in \mathbb{Z}$.

We describe the semantics of `shiftByteString`. Informally, these are logical
shifts, with negative shifts moving away from bit index $0$, and positive
shifts moving towards bit index $0$. More precisely, given the arguments $s$
and $i$, the result of `shiftByteString` is the byte sequence
$u_n u_{n - 1} \ldots u_0$, such that for all $j \in \{0, 1, \ldots, n\}$, we have

$$u_j = \begin{cases} s_{j + i} & 0 \leq j + i \leq n \\ 0 & \text{otherwise} \end{cases}$$

We observe that for $k, \ell$ with the same sign and any `bs`, we have
`shiftByteString (shiftByteString bs k) l = shiftByteString bs (k + l)`.

We now describe `rotateByteString`, assuming the same inputs as the description
of `shiftByteString` above. Informally, the direction of the rotations matches
that of `shiftByteString` above. More precisely, the result of
`rotateByteString` on the given inputs is the byte sequence
$u_n u_{n - 1} \ldots u_0$ such that for all $j \in \{0, 1, \ldots, n\}$, we
have $u_j = s_{(j + i) \bmod (n + 1)}$. We observe that for any $k, \ell$, and
any `bs`, we have
`rotateByteString (rotateByteString bs k) l = rotateByteString bs (k + l)`.
We also note that rotating by any multiple of the bit length of `bs` leaves
`bs` unchanged.
We now describe the semantics of `popCountByteString`. For argument $s$, the
result is

$$\sum_{i = 0}^{n} s_i$$

Informally, this is just the total count of $1$ bits. We observe that for any
`bs` and `bs'`, we have
`popCountByteString (appendByteString bs bs') = popCountByteString bs + popCountByteString bs'`.
testBitByteString
andwriteBitByteString
. Throughout, whenever we specify an out-of-bounds error result, its error message must contain at least thefollowing information:
For$s$ and $i$ , if $0 \leq i \leq n$ ,$s_i = 1$ , and $s_i = 0$ ;$s$ , $i$ and $0\leq i \leq n$ , then the result is the byte sequence $u_n u_{n - 1} \ldots u_0$ $j \in {0, 1, \ldots, n}$ , we have
testBitByteString
with argumentsthen the result is
True
ifFalse
ifotherwise, the result is an out-of-bounds error. Let
b :: BuiltinBool
;for
writeBitByteString
with argumentsb
, ifsuch that for all
$$u_j = \begin{cases}
1 & i = j \text{ and
b
} = \text{True
}0 & i = j \text{ and
b
} = \text{False
}s_j & \text{otherwise}
\end{cases}
$$
If$i < 0$ or $i > n$ , the result is an out-of-bounds error.
Lastly, we describe the semantics of `findFirstSetByteString`. Given the
argument $s$, if for every $j \in \{0, 1, \ldots, n\}$ we have $s_j = 0$, the
result is $-1$; otherwise, the result is the $k$ such that all of the
following hold:

- $s_k = 1$; and
- for all $j < k$, $s_j = 0$.
## Costing

All of the primitives we describe are linear in one of their arguments. For a
more precise description, see the table below.

| Primitive | Linear in |
|---|---|
| `integerToByteString` | its argument |
| `byteStringToInteger` | its argument |
| `andByteString` | either argument |
| `iorByteString` | either argument |
| `xorByteString` | either argument |
| `complementByteString` | its argument |
| `shiftByteString` | `BuiltinByteString` argument |
| `rotateByteString` | `BuiltinByteString` argument |
| `popCountByteString` | its argument |
| `testBitByteString` | `BuiltinByteString` argument |
| `writeBitByteString` | `BuiltinByteString` argument |
| `findFirstSetByteString` | its argument |

*Primitives and which argument they are linear in.*
# Rationale

## Why these operations?

There needs to be a well-defined interface between the world of
`BuiltinInteger` and `BuiltinByteString`. To provide this, we require
`integerToByteString` and `byteStringToInteger`, which are designed to
roundtrip (that is, to describe an isomorphism). Furthermore, by spelling out a
precise description of the conversions, we make this predictable and portable.
Our choice of logical AND, IOR, XOR and complement as the primary logical
operations is driven by a mixture of prior art, utility and convenience. These
are the typical bitwise logical operations provided in hardware, and in most
programming languages; for example, in the x86 instruction set, the following
bitwise operations have existed since the 8086:

- `AND`: Bitwise AND.
- `OR`: Bitwise IOR.
- `NOT`: Bitwise complement.
- `XOR`: Bitwise XOR.

Likewise, on the ARM instruction set, the following bitwise operations have
existed since ARM2:

- `AND`: Bitwise AND.
- `ORR`: Bitwise IOR.
- `EOR`: Bitwise XOR.
- `ORN`: Bitwise IOR with complement of the second argument.
- `BIC`: Bitwise AND with complement of the second argument.
: Bitwise AND with complement of the second argument.Going up a level, the C and Forth programming languages (according to C89 and
ANS Forth respectively) define bitwise AND (denoted
\&
andAND
respectively), bitwise IOR (denoted|
andOR
respectively), bitwise XOR (denoted
\^
andXOR
respectively)and bitwise complement (denoted
\~
andNOT
respectively) asthe primitive bitwise operations. This is followed by basically all languages
higher-up than C and Forth: Haskell's
Bits
type class defines thesesame four as
.&.
,.|.
,xor
andcomplement
.This ubiquity in choices leads to most algorithm descriptions that rely on
$x \text{ XOR } y$ into
bitwise operations to assume that these four are primitive, and thus,
constant-time and cost. While we could reduce this number
(and, in fact, due to Post, we know that there exist two sole sufficient
operators), this would be both inconvenient and inefficient. As an example,
consider implementing XOR using AND, IOR and complement: this would translate
This is both needlessly complex and also inefficient, as it requires copying the
arguments twice, only to throw away both copies.
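The decomposition can be checked mechanically; a quick Python sketch over fixed-width unsigned values (the width is chosen arbitrarily for illustration):

```python
# Verifying that XOR is expressible via AND, IOR and complement, as in the
# decomposition above, over fixed-width unsigned values.

def xor_via_and_ior_not(x: int, y: int, width: int = 8) -> int:
    """x XOR y expressed as (x AND (NOT y)) IOR ((NOT x) AND y)."""
    mask = (1 << width) - 1  # NOT must be taken at a fixed width
    return (x & (~y & mask)) | ((~x & mask) & y)
```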
Like our baseline bitwise operations above, shifts and rotations are widely
used, and considered as primitive. For example, x86 platforms have had the
following available since the 8086:

- `RCL`: Rotate left.
- `RCR`: Rotate right.
- `SHL`: Shift left.
- `SHR`: Shift right.

Likewise, ARM platforms have had the following available since ARM2:

- `ROR`: Rotate right.
- `LSL`: Shift left.
- `LSR`: Shift right.

While C and Forth both have shifts (denoted with `<<` and `>>` in C, and
`LSHIFT` and `RSHIFT` in Forth), they don't have rotations; however, many
higher-level languages do: Haskell's `Bits` type class has `rotate`, which
enables both left and right rotations.
, which enables both left and right rotations.While
popCountByteString
could in theory be simulated usingtestBitByteString
and a fold, this is quite inefficient: the best wayto simulate this operation would involve using something similar to the
Harley-Seal algorithm, which requires a large lookup table, making it
impractical on-chain. Furthermore, population counting is important for several
classes of succinct data structure (particularly rank-select dictionaries and
bitmaps), and is in fact provided as part of the
SSE4.2
x86 instructionset as a primitive
POPCNT
.In order to usefully manipulate individual bits, both
testBitByteString
and
writeBitByteString
are needed. They can also be used as part ofspecifying, and verifying, that other bitwise operations, both primitive and
non-primitive, are behaving correctly. They are also particularly essential for
binary encodings.
`findFirstSetByteString` is an essential primitive for several succinct data
structures: both Roaring Bitmaps and rank-select dictionaries rely on it being
efficient for much of their usefulness. Furthermore, this operation is provided
in hardware by several instruction sets: on x86, there exist (at least) `BSF`,
`BSR`, `LZCNT` and `TZCNT`, which allow finding both the first and last set
bits, while on ARM, there exists `CLZ`, which can be used to simulate finding
the first set bit. The operation also exists in higher-level languages: for
example, GHC's `FiniteBits` type class has `countTrailingZeros` and
`countLeadingZeros`. The main reason we propose taking finding the first set
bit as primitive, rather than counting leading or trailing zeroes, is that
finding the first set bit is required specifically for several succinct data
structures.
## On-chain vectors

For linear structures on-chain, we are currently limited to `BuiltinList` and
`BuiltinMap`, which don't allow constant-time indexing. This is a significant
restriction, especially when many data structures and algorithms rely on the
broad availability of a constant-time-indexable linear structure, such as a C
array or a Haskell `Vector`. While we could introduce a primitive of this
sort, this is a significant undertaking, and would require both implementing
and costing a possibly large API.

While for variable-length data, we don't have any alternatives if constant-time
indexing is a goal, for fixed-length (or at least limited-length) data, there
is a possibility, based on a similar approach taken by the `finitary` library.
Essentially, given finitary data, we can transform any item into a numerical
index, which is then stored by embedding into a byte array. As the indexes are
of a fixed maximum size, this can be done efficiently, but only if there is a
way of converting indices into bitstrings, and vice versa. Such a construction
would allow using a (wrapper around) `BuiltinByteString` as a
constant-time-indexable structure of any finitary type. This is not much of a
restriction in practice, as on-chain, fixed-width or size-bounded types are
preferable due to the on-chain size limit.

Currently, all the pieces to make this work already exist: the only missing
piece is the ability to convert indices (which would have to be
`BuiltinInteger`s) into bit strings (which would have to be
`BuiltinByteString`s) and back again. With this capability, it would be
possible to use these techniques to implement something like an array or
vector without new primitive data types.
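As an illustration of this construction, here is a Python sketch of a constant-time-indexable vector packed into a byte string; `element_width` stands in for the fixed index size given by a finitary type, and all names are hypothetical.

```python
# Sketch of the array-via-byte-string idea: elements of a finitary type are
# converted to fixed-width numerical indices and packed contiguously, giving
# constant-time indexing without a new primitive data type.

class PackedVector:
    """A fixed-width-element vector packed into a byte string."""

    def __init__(self, element_width: int, items: list[int]):
        self.width = element_width  # bytes per element, fixed by the type
        self.data = b"".join(
            item.to_bytes(element_width, "big") for item in items
        )

    def __getitem__(self, ix: int) -> int:
        # Constant-time: slice out the element's bytes and decode them.
        start = ix * self.width
        chunk = self.data[start : start + self.width]
        if len(chunk) < self.width:
            raise IndexError(ix)
        return int.from_bytes(chunk, "big")
```

The decoding step in `__getitem__` is exactly the index-to-value conversion that the proposed `BuiltinInteger`/`BuiltinByteString` conversions would make possible on-chain.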