Provides both low-level and high-level interfaces for handling UTF-8, serving
as a complement to the functionality provided by com.google.common.base.Utf8
(Guava) and Java's built-in StandardCharsets.UTF_8.
Maven projects can use this library with a simple POM dependency:
    <project>
      ...
      <dependencies>
        ...
        <dependency>
          <groupId>com.everlaw</groupId>
          <artifactId>utf8</artifactId>
          <version>1.0.1</version>
        </dependency>
        ...
      </dependencies>
      ...
    </project>
A low-level utility class that provides static methods for testing,
encoding, and decoding UTF-8. The principal methods are:
- isValid(codepoint): Returns true iff the given codepoint is valid UTF-8.
- toPackedInt(cseq, i): Encodes the 1- or 2-char Unicode codepoint starting at cseq[i] to 1-4 bytes of UTF-8, packed into a single int. This enables incremental encoding of any CharSequence without heap allocations.
- toPackedInt(codepoint): Encodes the given codepoint as UTF-8 packed into an int as described above.
- isContinuationByte(byte): Returns true iff the given byte is a UTF-8 continuation byte.
- numContinuationBytes(byte): Returns the number of continuation bytes that follow the given first byte of a possibly-multibyte UTF-8-encoded codepoint.
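To make the packed-int idea concrete, here is a minimal sketch of how such methods could be written. It is not the library's source: the class name Utf8Sketch is hypothetical, the byte order inside the int (first UTF-8 byte in the lowest-order position) is an assumption, as is returning -1 for a byte that cannot start a sequence. It also assumes the input is a valid Unicode scalar value and does nothing special for unpaired surrogates.

```java
/** Hypothetical sketch; method names mirror the list above, bodies are assumptions. */
final class Utf8Sketch {
    private Utf8Sketch() {}

    /** True iff b has the bit pattern 10xxxxxx. */
    static boolean isContinuationByte(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /**
     * Number of continuation bytes implied by the bit pattern of a lead byte.
     * Returns -1 (assumed convention) if b cannot start a sequence.
     */
    static int numContinuationBytes(byte b) {
        if ((b & 0x80) == 0x00) return 0; // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 1; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 2; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 3; // 11110xxx
        return -1;                        // continuation byte or invalid lead
    }

    /**
     * Encodes cp as 1-4 UTF-8 bytes packed into an int, with the first
     * byte in bits 0-7 (assumed layout). Assumes cp is a valid scalar value.
     */
    static int toPackedInt(int cp) {
        if (cp < 0x80) {
            return cp;
        } else if (cp < 0x800) {
            return (0xC0 | (cp >> 6))
                 | ((0x80 | (cp & 0x3F)) << 8);
        } else if (cp < 0x10000) {
            return (0xE0 | (cp >> 12))
                 | ((0x80 | ((cp >> 6) & 0x3F)) << 8)
                 | ((0x80 | (cp & 0x3F)) << 16);
        } else {
            return (0xF0 | (cp >> 18))
                 | ((0x80 | ((cp >> 12) & 0x3F)) << 8)
                 | ((0x80 | ((cp >> 6) & 0x3F)) << 16)
                 | ((0x80 | (cp & 0x3F)) << 24);
        }
    }

    /** Encodes the codepoint that starts at cseq[i] (one or two chars). */
    static int toPackedInt(CharSequence cseq, int i) {
        return toPackedInt(Character.codePointAt(cseq, i));
    }
}
```

With this layout a caller can emit the low byte, shift the value right by 8 with >>>, and repeat while the value is nonzero, so no array or buffer is allocated per codepoint.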
A high-level class for iterating over the UTF-8 bytes of a CharSequence,
implementing Java 8's PrimitiveIterator.OfInt. It allows for simple,
space-efficient iteration:
    Utf8Iterator utf8 = new Utf8Iterator(string);
    while (utf8.hasNext()) {
        byte b = utf8.nextByte(); // convenience method for (byte) utf8.nextInt()
        // do something with b
    }
This is functionally equivalent to:
    ByteBuffer utf8 = StandardCharsets.UTF_8.encode(string);
    while (utf8.hasRemaining()) {
        byte b = utf8.get();
        // do something with b
    }
The main benefits of using Utf8Iterator are:
- It operates on CharSequences of all types, not just String.
- It uses constant space, even for large strings, whereas the buffer returned from UTF_8.encode is proportional to the size of the string.
- It encodes incrementally, so no work is wasted if the loop is exited early.
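The constant-space, incremental behaviour described above can be illustrated with a small stand-alone iterator. This is a sketch under stated assumptions, not the library's Utf8Iterator: the class name Utf8BytesSketch is hypothetical, nextInt() is assumed to return each byte as a value in 0-255, the encoding step is inlined so the block is self-contained, and unpaired surrogates are not handled.

```java
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

/** Hypothetical sketch of constant-space UTF-8 iteration over a CharSequence. */
final class Utf8BytesSketch implements PrimitiveIterator.OfInt {
    private final CharSequence cseq;
    private int index;     // index of the next char to encode
    private int pending;   // unreturned bytes of the current codepoint, low byte first
    private int remaining; // number of bytes still held in 'pending'

    Utf8BytesSketch(CharSequence cseq) {
        this.cseq = cseq;
    }

    @Override
    public boolean hasNext() {
        return remaining > 0 || index < cseq.length();
    }

    @Override
    public int nextInt() {
        if (remaining == 0) {
            if (index >= cseq.length()) {
                throw new NoSuchElementException();
            }
            // Encode exactly one codepoint; nothing beyond it is touched yet.
            int cp = Character.codePointAt(cseq, index);
            index += Character.charCount(cp);
            encode(cp);
        }
        int b = pending & 0xFF;
        pending >>>= 8;
        remaining--;
        return b;
    }

    /** Packs the UTF-8 bytes of cp into 'pending' and records their count. */
    private void encode(int cp) {
        if (cp < 0x80) {
            pending = cp;
            remaining = 1;
        } else if (cp < 0x800) {
            pending = (0xC0 | (cp >> 6))
                    | ((0x80 | (cp & 0x3F)) << 8);
            remaining = 2;
        } else if (cp < 0x10000) {
            pending = (0xE0 | (cp >> 12))
                    | ((0x80 | ((cp >> 6) & 0x3F)) << 8)
                    | ((0x80 | (cp & 0x3F)) << 16);
            remaining = 3;
        } else {
            pending = (0xF0 | (cp >> 18))
                    | ((0x80 | ((cp >> 12) & 0x3F)) << 8)
                    | ((0x80 | ((cp >> 6) & 0x3F)) << 16)
                    | ((0x80 | (cp & 0x3F)) << 24);
            remaining = 4;
        }
    }
}
```

The only state is three ints and a reference to the input, regardless of input length, and a loop that breaks after the first few bytes never encodes the rest of the sequence.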
This project uses Semantic Versioning.
We are happy to receive Pull Requests. If you are planning a big change, it's probably best to discuss it as an Issue first.
In the root directory, run mvn install. That will build everything.