Substring Explained
What is Substring?
String manipulation in Java isn't easy enough.
We often need to resort to indexOf() and then juggle index arithmetic. Sometimes we forget to check whether the index is negative; sometimes we need index + 1 or length() - index - 1 and end up with "off by one" errors. Even if you get it right, your readers may have to pause at the code and run it in their heads to understand what exactly you intended to do.
Substring is a fluent Java API aimed at making string manipulation intuitive and safe.
To use it, you start with a Substring.Pattern, which represents the target you want to find in the string. Examples are:
- first / - first('/')
- last . - last('.')
- a prefix/suffix - prefix("http://"), suffix('/')
- before or after a pattern - before(first('@')), after(last('.'))
- between two patterns - between("{", "}")
- spanning two patterns - spanningInOrder("{", "}")
- starting from a pattern - last('.').toEnd()
- up to a pattern - upToIncluding(first("//"))
- one of two alternative patterns - last("//").or(END)
- a list of candidate patterns - keywords.stream().map(Substring::first).collect(firstOccurrence())
- ...
The basic patterns aren't fancy. They are roughly equivalent to your indexOf('/'), lastIndexOf('.') and substring(indexOf("{"), indexOf("}")) etc.
But the real power comes from what you can do once a substring pattern is found. You can:
- extract - between("{", "}").from(snippet)
- remove - before(first("projects/")).removeFrom(resourceName)
- replace - after(last('.')).replaceFrom(fileName, "yaml")
- split around - first('=').splitThenTrim(line)
- extract all (occurrences) - between("{", "}").repeatedly().from(snippet)
- remove all - between("/*", "*/").repeatedly().removeAllFrom(code)
- replace all - spanningInOrder("{", "}").repeatedly().replaceAllFrom(template, placeholder -> ...)
- split around all - List<String> items = first(',').repeatedly().split(text).collect(toList())
- and more...
To remove the http:// prefix from the url string, if it exists.
Use:
Substring.prefix("http://").removeFrom(url);
Instead of:
url.startsWith("http://")
? url.substring(7)
: url;
Or:
url.startsWith("http://")
? url.substring("http://".length())
: url;
To remove any url scheme (http://, https://, chrome:// etc.).
Use:
import static com.google.mu.util.Substring.first;
Substring.upToIncluding(first("://")).removeFrom(url);
Instead of:
url.substring(url.indexOf("://") + 3);
// Or if you didn't forget bounds checking
int index = url.indexOf("://");
index == -1
? url
: url.substring(index + 3);
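To see why the bounds check matters, here is a minimal plain-JDK sketch of the two "instead of" variants above. The class and method names are hypothetical, chosen only for illustration. When the input has no "://", indexOf() returns -1, so the unchecked version computes substring(2) and silently drops the first two characters instead of failing.

```java
final class SchemeStripping {
  // Unchecked variant: assumes "://" is always present.
  // If it is absent, indexOf() returns -1 and this becomes url.substring(2).
  static String stripSchemeUnchecked(String url) {
    return url.substring(url.indexOf("://") + 3);
  }

  // Checked variant: leaves the input untouched when there is no scheme.
  static String stripSchemeChecked(String url) {
    int index = url.indexOf("://");
    return index == -1 ? url : url.substring(index + 3);
  }

  public static void main(String[] args) {
    System.out.println(stripSchemeUnchecked("example.com"));       // ample.com
    System.out.println(stripSchemeChecked("example.com"));         // example.com
    System.out.println(stripSchemeChecked("https://example.com")); // example.com
  }
}
```

The unchecked version does not throw here, which makes the bug easy to miss in testing.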
To add home/
to a file path if it's not already there.
Use:
Substring.prefix("home/").addToIfAbsent(filePath);
Instead of:
filePath.startsWith("home/")
? filePath
: "home/" + filePath;
To add a comma to the end of a line if it's missing.
Use:
Substring.suffix(',').addToIfAbsent(line);
Instead of:
line.endsWith(",")
? line
: line + ",";
To extract the directory path home/foo from home/foo/Bar.java.
Use:
import static com.google.mu.util.Substring.last;
String path = ...;
Optional<String> directory = Substring.before(last('/')).from(path);
Instead of:
int index = path.lastIndexOf('/');
Optional<String> directory =
index == -1
? Optional.empty()
: Optional.of(path.substring(0, index));
To extract the shelf name id-1 from resource name bookstores/barnes.noble/shelves/id-1/books/foo.
Optional<String> shelfName = Substring.between("/shelves/", "/").from(resourceName);
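For comparison, a hand-written plain-JDK version of the same extraction might look like the sketch below. The class name and the extractBetween() helper are hypothetical. It only approximates what the text describes for between() (find the opening delimiter, then the first closing delimiter after it) and is not the library's implementation.

```java
import java.util.Optional;

final class ShelfNames {
  // Returns the text between the first `open` and the first `close` that follows it.
  static Optional<String> extractBetween(String s, String open, String close) {
    int openIndex = s.indexOf(open);
    if (openIndex == -1) {
      return Optional.empty();
    }
    int from = openIndex + open.length();   // easy to forget the "+ open.length()"
    int to = s.indexOf(close, from);        // must search from `from`, not from 0
    return to == -1 ? Optional.empty() : Optional.of(s.substring(from, to));
  }

  public static void main(String[] args) {
    String resourceName = "bookstores/barnes.noble/shelves/id-1/books/foo";
    System.out.println(extractBetween(resourceName, "/shelves/", "/")); // Optional[id-1]
  }
}
```

Two separate index computations have to be right here, which is the kind of arithmetic the one-liner above avoids.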
To extract the gaia id from the string id:123.
Use:
import static com.google.mu.util.Substring.prefix;
String str = "id:123";
Optional<Long> gaiaId = Substring.after(prefix("id:"))
.from(str)
.map(Long::parseLong);
Instead of:
Optional<Long> gaiaId =
str.startsWith("id:")
? Optional.of(Long.parseLong(
str.substring("id:".length())))
: Optional.empty();
To extract the user id from email user-id@corp.com.
Use:
Optional<String> userId = Substring.before(first('@')).from(email);
Instead of:
int index = email.indexOf('@');
Optional<String> userId =
index == -1
? Optional.empty()
: Optional.of(email.substring(0, index));
To extract both the user id and the domain from an email address.
Use:
Optional<UserWithDomain> userWithDomain =
first('@')
.split(email)
.map(UserWithDomain::new);
Instead of error-prone index arithmetic:
int index = email.indexOf('@');
Optional<UserWithDomain> userWithDomain =
index == -1
? Optional.empty()
: Optional.of(
new UserWithDomain(
email.substring(0, index),
email.substring(index + 1)));
Substring or Guava Splitter?
Both Substring.Pattern and Guava Splitter support splitting strings. Substring.Pattern doesn't have methods like limit() or omitEmptyStrings() because, since Java 8, Stream already provides the equivalent operations. For example:
first(',').repeatedly()
.split(input)
.filter(m -> m.length() > 0)
.limit(10); // First 10 non-empty
Substring.Pattern also supports two-way split, so if you are parsing a flag that looks like --loggingLevels=file1=1,file2=3, you only need to split around the first = character:
first('=')
.split(arg)
.map((name, value) -> ...);
On the other hand, Splitter only splits to a List or Stream. You'll need to check the list size and extract the name and value parts using list.get(0) and list.get(1).
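To make the two-way split concrete, here is a plain-JDK sketch of splitting a flag around its first = only. The class name and the splitAroundFirst() helper are hypothetical, and Map.Entry stands in for the name/value pair. It illustrates the idea and is not how the library implements split().

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Optional;

final class TwoWaySplit {
  // Splits around the FIRST occurrence of the delimiter only.
  // Later occurrences stay in the value part.
  static Optional<Map.Entry<String, String>> splitAroundFirst(String arg, char delimiter) {
    int index = arg.indexOf(delimiter);
    if (index == -1) {
      return Optional.empty();
    }
    Map.Entry<String, String> pair =
        new AbstractMap.SimpleImmutableEntry<>(arg.substring(0, index), arg.substring(index + 1));
    return Optional.of(pair);
  }

  public static void main(String[] args) {
    splitAroundFirst("--loggingLevels=file1=1,file2=3", '=')
        .ifPresent(p -> System.out.println(p.getKey() + " -> " + p.getValue()));
    // --loggingLevels -> file1=1,file2=3
  }
}
```

Because the result is a pair wrapped in an Optional, there is no list size to check and no positional list.get() call to get wrong.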
Lastly, repeatedly().split() and repeatedly().splitThenTrim() return Stream<Match>, where Match is a CharSequence view over the original string. No characters are copied until you explicitly ask for them. That is, if you decide that only certain matches are worth keeping, you can save the allocation and copying cost for items that aren't of interest:
List<String> names =
first(',').repeatedly()
.splitThenTrim("name=foo,age=10,name=bar,location=SFO")
.filter(prefix("name=")::isIn)
.map(Match::toString)
.map(prefix("name=")::removeFrom)
.collect(toList());
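The "view, then copy on demand" idea can be sketched with the JDK alone. The example below uses CharBuffer.wrap(), which wraps a region of a CharSequence without copying its characters, as a stand-in for Match. The class and helper names are hypothetical, trimming is left out for brevity, and this is an analogy rather than the library's implementation.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class LazyViews {
  // Splits `text` around `delimiter` into views; no substring is created here.
  static List<CharSequence> splitViews(String text, char delimiter) {
    List<CharSequence> views = new ArrayList<>();
    int start = 0;
    for (int i = text.indexOf(delimiter); i != -1; i = text.indexOf(delimiter, start)) {
      views.add(CharBuffer.wrap(text, start, i));
      start = i + 1;
    }
    views.add(CharBuffer.wrap(text, start, text.length()));
    return views;
  }

  // Prefix check that works on any CharSequence without converting it to a String.
  static boolean startsWith(CharSequence s, String prefix) {
    if (s.length() < prefix.length()) {
      return false;
    }
    for (int i = 0; i < prefix.length(); i++) {
      if (s.charAt(i) != prefix.charAt(i)) {
        return false;
      }
    }
    return true;
  }

  // Only the views that pass the filter are ever turned into Strings.
  static List<String> names(String text) {
    String prefix = "name=";
    return splitViews(text, ',').stream()
        .filter(view -> startsWith(view, prefix))
        .map(view -> view.subSequence(prefix.length(), view.length()).toString())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(names("name=foo,age=10,name=bar,location=SFO")); // [foo, bar]
  }
}
```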
To parse key-value pairs
The Substring.Pattern.split() method returns a BiOptional<String, String> object that optionally contains the pair of substrings before and after the delimiter pattern. When split() is instead called on the RepeatingPattern object returned by repeatedly(), the input string is split into a stream of substrings.
Combined, the two can parse key-value pairs and then collect them into a Map, a Multimap or whatever you need.
For example:
import static com.google.mu.util.stream.GuavaCollectors.toImmutableSetMultimap;
Substring.RepeatingPattern delimiter = first(',').repeatedly();
ImmutableSetMultimap<String, String> multimap =
delimiter
.split("k1=v1,k2=v2,k2=v3") // => ["k1=v1", "k2=v2", "k2=v3"]
.collect(
toImmutableSetMultimap(
// "k1=v1" => (k1, v1), "k2=v2" => (k2, v2) etc.
s -> first('=').split(s).orElseThrow(...)));
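For readers who want to see the two steps spelled out, here is a rough plain-JDK counterpart. The class name and the parse() helper are hypothetical, and a Map<String, Set<String>> stands in for the multimap. It splits around every comma, then splits each entry around its first =, and rejects entries that have no = at all.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class KeyValues {
  static Map<String, Set<String>> parse(String input) {
    Map<String, Set<String>> result = new LinkedHashMap<>();
    // Step 1: split around every ','
    for (String entry : input.split(",")) {
      // Step 2: split each entry around the first '='
      int index = entry.indexOf('=');
      if (index == -1) {
        throw new IllegalArgumentException("Missing '=' in: " + entry);
      }
      String key = entry.substring(0, index);
      String value = entry.substring(index + 1);
      result.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(value);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(parse("k1=v1,k2=v2,k2=v3")); // {k1=[v1], k2=[v2, v3]}
  }
}
```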
Substring patterns are best used on strings that are known to be in the expected format. That is, they are either internal (flag values, internal file paths etc.), or are already guaranteed to be in the expected format by a stricter parser or validator (URLs, emails, Cloud resource names, ...).
For example, the following code returns a nonsensical result:
String unexpected = "Surprise! This is not a url with http:// or https://!";
upToIncluding(first("://")).removeFrom(unexpected);
// => " or https://!".
If you need to parse a string with complex syntax rules or context-sensitive grammar, use a proper parser or regex instead.
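One way to get that guarantee is to run a stricter parser first and manipulate the string only after it has been accepted. The sketch below uses java.net.URI as the validator. That choice, the class name and the withoutScheme() helper are assumptions made for illustration; your own format may call for a different validator.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

final class ValidatedUrls {
  // Strips the scheme only from inputs that java.net.URI accepts as
  // hierarchical URIs with both a scheme and a host.
  static Optional<String> withoutScheme(String input) {
    try {
      URI uri = new URI(input);
      if (uri.getScheme() == null || uri.getHost() == null) {
        return Optional.empty();
      }
      // A URI with a host always starts with "<scheme>://".
      return Optional.of(input.substring(uri.getScheme().length() + "://".length()));
    } catch (URISyntaxException e) {
      return Optional.empty();
    }
  }

  public static void main(String[] args) {
    System.out.println(withoutScheme("https://example.com/path"));
    // Optional[example.com/path]
    System.out.println(
        withoutScheme("Surprise! This is not a url with http:// or https://!"));
    // Optional.empty (the spaces make it an invalid URI)
  }
}
```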
Substring.Pattern can also be created from a CharMatcher or a regular expression:
import static com.google.common.base.CharMatcher.digit;
import static com.google.common.base.CharMatcher.whitespace;
import static com.google.mu.util.Substring.before;
import static com.google.mu.util.Substring.first;
import static com.google.mu.util.Substring.last;
import static com.google.mu.util.Substring.upToIncluding;
before(first(whitespace())) // first(CharMatcher)
.removeFrom("foo bar"); // => " bar"
upToIncluding(last(digit())) // last(CharMatcher)
.from("314s"); // => Optional.of("314")
first(Pattern.compile("\\(\\d{3}\\)\\s")) // regex
.removeFrom("(312) 987-6543"); // => "987-6543"
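As a cross-check on what these three calls select, here are approximate plain-JDK counterparts. The class and helper names are hypothetical. Character.isWhitespace() and Character.isDigit() are used as stand-ins for the Guava matchers, and their character sets may not match Guava's exactly.

```java
import java.util.Optional;
import java.util.regex.Pattern;

final class JdkCounterparts {
  private static final Pattern AREA_CODE = Pattern.compile("\\(\\d{3}\\)\\s");

  // Removes everything before the first whitespace character, if any.
  static String removeBeforeFirstWhitespace(String s) {
    for (int i = 0; i < s.length(); i++) {
      if (Character.isWhitespace(s.charAt(i))) {
        return s.substring(i);
      }
    }
    return s;
  }

  // Extracts everything up to and including the last digit, if any.
  static Optional<String> upToIncludingLastDigit(String s) {
    for (int i = s.length() - 1; i >= 0; i--) {
      if (Character.isDigit(s.charAt(i))) {
        return Optional.of(s.substring(0, i + 1));
      }
    }
    return Optional.empty();
  }

  // Removes the first substring matching the regex, if any.
  static String removeFirstAreaCode(String phone) {
    return AREA_CODE.matcher(phone).replaceFirst("");
  }

  public static void main(String[] args) {
    System.out.println("[" + removeBeforeFirstWhitespace("foo bar") + "]"); // [ bar]
    System.out.println(upToIncludingLastDigit("314s"));                     // Optional[314]
    System.out.println(removeFirstAreaCode("(312) 987-6543"));              // 987-6543
  }
}
```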
Substring.Pattern is immutable and a pattern object can be reused to save object allocation cost, especially when used in a Stream chain.
For example, prefer:
charSource.readLines().stream()
.map(first('=')::split)
over:
charSource.readLines().stream()
.map(line -> first('=').split(line))
The latter allocates an equivalent Substring.Pattern object over and over, once for every line.
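The difference comes from a Java language rule: the receiver expression of a bound method reference is evaluated once, when the method reference itself is created, while a lambda body is re-evaluated on every call. The sketch below makes this visible with a counter. keyExtractor() is a hypothetical stand-in for a pattern factory such as first('='), and the other names are also made up for illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

final class ReceiverEvaluation {
  static final AtomicInteger created = new AtomicInteger();

  // Hypothetical factory: counts how many extractor objects get created.
  static Function<String, String> keyExtractor() {
    created.incrementAndGet();
    return line -> {
      int index = line.indexOf('=');
      return index == -1 ? line : line.substring(0, index);
    };
  }

  // keyExtractor() is evaluated once, before the stream runs.
  static int creationsWithMethodReference(List<String> lines) {
    created.set(0);
    lines.stream().map(keyExtractor()::apply).collect(Collectors.toList());
    return created.get();
  }

  // keyExtractor() is evaluated again for every element.
  static int creationsWithLambda(List<String> lines) {
    created.set(0);
    lines.stream().map(line -> keyExtractor().apply(line)).collect(Collectors.toList());
    return created.get();
  }

  public static void main(String[] args) {
    List<String> lines = java.util.Arrays.asList("a=1", "b=2", "c=3");
    System.out.println(creationsWithMethodReference(lines)); // 1
    System.out.println(creationsWithLambda(lines));          // 3
  }
}
```

Storing the pattern in a local variable or a constant before the stream gives the same single allocation.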