Substring Explained

Ben Yu edited this page Jan 10, 2023 · 45 revisions

What is Substring?

String manipulation in Java isn't easy enough.

We often need to resort to indexOf() and then do index arithmetic. Sometimes we forget to check whether the index is negative; sometimes we need index + 1 or length() - index - 1 and introduce off-by-one errors. Even if you get it right, your readers may pause at the code and run it in their heads to understand what exactly you intended.

Substring is a fluent Java API aimed at making string manipulation intuitive and safe.

To use it, you start with a Substring.Pattern, which represents the target you want to find in the string. Examples are:

  • first / - first('/')
  • last . - last('.')
  • a prefix/suffix - prefix("http://"), suffix('/')
  • before or after a pattern - before(first('@')), after(last('.'))
  • between two patterns - between("{", "}")
  • spanning two patterns - spanningInOrder("{", "}")
  • starting from a pattern - last('.').toEnd()
  • up to a pattern - upToIncluding(first("//"))
  • one of two alternative patterns - last("//").or(END)
  • a list of candidate patterns - keywords.stream().map(Substring::first).collect(firstOccurrence())
  • ...

The basic patterns aren't fancy. They are roughly equivalent to your indexOf('/'), lastIndexOf('.') and substring(indexOf("{"), indexOf("}")) etc.

But the power comes from what you can do once a pattern is matched. You can:

  • extract - between("{", "}").from(snippet)
  • remove - before(first("projects/")).removeFrom(resourceName)
  • replace - after(last('.')).replaceFrom(fileName, "yaml")
  • split around - first('=').splitThenTrim(line)
  • extract all (occurrences) - between("{", "}").repeatedly().from(snippet)
  • remove all - between("/*", "*/").repeatedly().removeAllFrom(code)
  • replace all - spanningInOrder("{", "}").repeatedly().replaceAllFrom(template, placeholder -> ...)
  • split around all - List<String> items = first(',').repeatedly().split(text).collect(toList())
  • and more...
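As a point of reference for the "roughly equivalent" comparison above, here is a plain-JDK sketch of the bookkeeping that a pattern like between("{", "}") spares you. The class and method names (BetweenByHand, between) are made up for this page, and the code only illustrates the manual approach; it is not how the library is implemented.

```java
import java.util.Optional;

class BetweenByHand {
  /** Returns the text between the first {@code open} and the next {@code close} after it, if both exist. */
  static Optional<String> between(String s, String open, String close) {
    int openIndex = s.indexOf(open);
    if (openIndex < 0) {
      return Optional.empty();  // opening delimiter not found
    }
    int start = openIndex + open.length();  // skip past the opening delimiter
    int end = s.indexOf(close, start);      // only look for the close after the open
    if (end < 0) {
      return Optional.empty();  // closing delimiter not found
    }
    return Optional.of(s.substring(start, end));
  }

  public static void main(String[] args) {
    System.out.println(between("Hello {name}!", "{", "}"));   // prints Optional[name]
    System.out.println(between("no braces here", "{", "}"));  // prints Optional.empty
  }
}
```

Two bounds checks and one offset adjustment are needed even for this simple case, which is the kind of detail the fluent patterns are meant to hide.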

Show and Tell

To remove the http:// from the url string, if it exists.

Use:

Substring.prefix("http://").removeFrom(url);

Instead of:

url.startsWith("http://")
    ? url.substring(7)
    : url;

Or, to avoid the magic number:

url.startsWith("http://")
    ? url.substring("http://".length())
    : url;

To remove any url scheme http://, https://, chrome:// etc.

Use:
import static com.google.mu.util.Substring.first;

Substring.upToIncluding(first("://")).removeFrom(url);

Instead of:

url.substring(url.indexOf("://") + 3);

// Or if you didn't forget bounds checking
int index = url.indexOf("://");
index == -1
    ? url
    : url.substring(index + 3);

To add home/ to a file path if it's not already there.

Use:

Substring.prefix("home/").addToIfAbsent(filePath);

Instead of:

filePath.startsWith("home/")
    ? filePath
    : "home/" + filePath;

To add a comma to the end of a line if it's missing.

Use:

Substring.suffix(',').addToIfAbsent(line);

Instead of:

line.endsWith(",")
    ? line
    : line + ",";

To extract the directory path home/foo from home/foo/Bar.java.

Use:

import static com.google.mu.util.Substring.last;

String path = ...;
Optional<String> directory = Substring.before(last('/')).from(path);

Instead of:

int index = path.lastIndexOf('/');
Optional<String> directory =
    index == -1
        ? Optional.empty()
        : Optional.of(path.substring(0, index));

To extract the shelf name id-1 from resource name bookstores/barnes.noble/shelves/id-1/books/foo.

Optional<String> shelfName = Substring.between("/shelves/", "/").from(resourceName);
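For comparison, a hand-rolled version might look like the sketch below. It is plain JDK code written for this page (ShelfNameByHand and shelfName are hypothetical names), and it assumes the shelf name is always followed by another "/" segment, as in the example resource name.

```java
import java.util.Optional;

class ShelfNameByHand {
  private static final String MARKER = "/shelves/";

  /** Extracts the path segment that follows "/shelves/" and is terminated by the next '/'. */
  static Optional<String> shelfName(String resourceName) {
    int markerIndex = resourceName.indexOf(MARKER);
    if (markerIndex < 0) {
      return Optional.empty();  // no "/shelves/" segment at all
    }
    int start = markerIndex + MARKER.length();
    int end = resourceName.indexOf('/', start);
    if (end < 0) {
      return Optional.empty();  // nothing terminates the shelf name
    }
    return Optional.of(resourceName.substring(start, end));
  }

  public static void main(String[] args) {
    // prints Optional[id-1]
    System.out.println(shelfName("bookstores/barnes.noble/shelves/id-1/books/foo"));
  }
}
```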

To extract the gaia id from the string id:123.

Use:

import static com.google.mu.util.Substring.prefix;

String str = "id:123";
Optional<Long> gaiaId = Substring.after(prefix("id:"))
    .from(str)
    .map(Long::parseLong);

Instead of:

Optional<Long> gaiaId =
    str.startsWith("id:")
        ? Optional.of(Long.parseLong(
            str.substring("id:".length())))
        : Optional.empty();

To extract the user id from email user-id@corp.com.

Use:

Optional<String> userId = Substring.before(first('@')).from(email);

Instead of:

int index = email.indexOf('@');
Optional<String> userId =
    index == -1
        ? Optional.empty()
        : Optional.of(email.substring(0, index));

To extract both the user id and the domain from an email address.

Use:

Optional<UserWithDomain> userWithDomain =
    first('@')
        .split(email)
        .map(UserWithDomain::new);

Instead of error-prone index arithmetic:

int index = email.indexOf('@');
Optional<UserWithDomain> userWithDomain =
    index == -1
        ? Optional.empty()
        : Optional.of(
            new UserWithDomain(
                email.substring(0, index),
                email.substring(index + 1)));

Substring or Guava Splitter?

Both Substring.Pattern and Guava Splitter support splitting strings. Substring.Pattern doesn't have methods like limit() or omitEmptyStrings() because, since Java 8, the equivalent operations are already provided by Stream. For example:

first(',').repeatedly()
    .split(input)
    .filter(m -> m.length() > 0)
    .limit(10);  // First 10 non-empty

Substring.Pattern also supports two-way split so if you are parsing a flag that looks like --loggingLevels=file1=1,file2=3, you only need to split around the first = character:

first('=')
    .split(arg)
    .map((name, value) -> ...);

On the other hand, Splitter only splits to a List or Stream. You'll need to check the list size yourself and extract the name and value parts using list.get(0) and list.get(1).
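To see why splitting around only the first = matters for such a flag, here is a small plain-JDK sketch. TwoWaySplitByHand and splitAroundFirst are hypothetical names used for illustration; the method mimics a two-way split with indexOf() and is not the library's implementation.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.Optional;

class TwoWaySplitByHand {
  /** Splits {@code s} into the parts before and after the first {@code delimiter}, if present. */
  static Optional<Map.Entry<String, String>> splitAroundFirst(String s, char delimiter) {
    int index = s.indexOf(delimiter);
    if (index < 0) {
      return Optional.empty();  // nothing to split around
    }
    return Optional.of(
        new SimpleImmutableEntry<>(s.substring(0, index), s.substring(index + 1)));
  }

  public static void main(String[] args) {
    String arg = "--loggingLevels=file1=1,file2=3";
    splitAroundFirst(arg, '=').ifPresent(flag -> {
      System.out.println(flag.getKey());    // prints --loggingLevels
      System.out.println(flag.getValue());  // prints file1=1,file2=3
    });
    // Splitting on every '=' instead breaks the value apart into 4 pieces.
    System.out.println(arg.split("=").length);  // prints 4
  }
}
```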

Lastly, repeatedly().split() and repeatedly().splitThenTrim() return Stream<Match>, where Match is a CharSequence view over the original string. No characters are copied until you explicitly ask for it. That is, if you decide that only certain matches are worth keeping, you can save the allocation and copying cost for items that aren't of interest:

List<String> names =
    first(',').repeatedly()
        .splitThenTrim("name=foo,age=10,name=bar,location=SFO")
        .filter(prefix("name=")::isIn)
        .map(Match::toString)
        .map(prefix("name=")::removeFrom)
        .collect(toList());
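The "view" idea can be illustrated with plain JDK types. The sketch below uses java.nio.CharBuffer.wrap(), which wraps a region of an existing CharSequence without copying its characters, as a stand-in for Match. This is an assumption made for illustration only: Match is not implemented this way, and LazyViews and its methods are hypothetical names.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;

class LazyViews {
  /** Splits around commas, returning views over {@code s} rather than copied substrings. */
  static List<CharSequence> splitViews(String s) {
    List<CharSequence> views = new ArrayList<>();
    int start = 0;
    for (int comma = s.indexOf(','); comma >= 0; comma = s.indexOf(',', start)) {
      views.add(CharBuffer.wrap(s, start, comma));  // wraps the region, no copy
      start = comma + 1;
    }
    views.add(CharBuffer.wrap(s, start, s.length()));  // the part after the last comma
    return views;
  }

  /** Checks the prefix character by character so that no temporary string is created. */
  static boolean startsWith(CharSequence view, String prefix) {
    if (view.length() < prefix.length()) {
      return false;
    }
    for (int i = 0; i < prefix.length(); i++) {
      if (view.charAt(i) != prefix.charAt(i)) {
        return false;
      }
    }
    return true;
  }

  /** Copies characters only for the entries that start with "name=". */
  static List<String> names(String input) {
    String prefix = "name=";
    List<String> names = new ArrayList<>();
    for (CharSequence view : splitViews(input)) {
      if (startsWith(view, prefix)) {
        names.add(view.subSequence(prefix.length(), view.length()).toString());
      }
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(names("name=foo,age=10,name=bar,location=SFO"));  // prints [foo, bar]
  }
}
```

Only the two kept values are materialized as String objects; the age and location entries are inspected through their views and then dropped.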

To parse key-value pairs

The Substring.Pattern.split() method returns a BiOptional<String, String> object that optionally contains a pair of substrings before and after the delimiter pattern.

But when called on the RepeatingPattern object returned by the repeatedly() method, split() breaks the input string into a stream of substrings.

Combined, the two can parse key-value pairs and collect them into a Map, a Multimap, or any other container.

For example:

import static com.google.mu.util.stream.GuavaCollectors.toImmutableSetMultimap;

Substring.RepeatingPattern delimiter = first(',').repeatedly();
ImmutableSetMultimap<String, String> multimap =
    delimiter
        .split("k1=v1,k2=v2,k2=v3")  // => ["k1=v1", "k2=v2", "k2=v3"]
        .collect(
            toImmutableSetMultimap(
                // "k1=v1" => (k1, v1), "k2=v2" => (k2, v2) etc.
                s -> first('=').split(s).orElseThrow(...)));
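For readers who want to try the same two-level split without any dependency, a plain-JDK approximation could look like the following. It is only a sketch: KeyValuesByHand and parse are hypothetical names, a Map of Sets stands in for the multimap, and throwing IllegalArgumentException for a pair without '=' is an assumption, since the original leaves the orElseThrow(...) argument unspecified.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class KeyValuesByHand {
  /** Parses "k1=v1,k2=v2,..." into a key -> values map, keeping encounter order. */
  static Map<String, Set<String>> parse(String input) {
    Map<String, Set<String>> multimap = new LinkedHashMap<>();
    for (String pair : input.split(",")) {  // outer split: one entry per comma
      int eq = pair.indexOf('=');           // inner split: around the first '=' only
      if (eq < 0) {
        throw new IllegalArgumentException("Missing '=' in: " + pair);
      }
      multimap
          .computeIfAbsent(pair.substring(0, eq), k -> new LinkedHashSet<>())
          .add(pair.substring(eq + 1));
    }
    return multimap;
  }

  public static void main(String[] args) {
    System.out.println(parse("k1=v1,k2=v2,k2=v3"));  // prints {k1=[v1], k2=[v2, v3]}
  }
}
```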

When not to use Substring?

Substring patterns are best used on strings that are known to be in the expected format. That is, they are either internal (flag values, internal file paths etc.), or are already guaranteed to be in the expected format by a stricter parser or validator (URLs, emails, Cloud resource names, ...).

For example, the following code returns a nonsensical result:

String unexpected = "Surprise! This is not a url with http:// or https://!";
upToIncluding(first("://")).removeFrom(unexpected);
// => " or https://!".

If you need to parse a string with complex syntax rules or context-sensitive grammar, use a proper parser or regex instead.
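As one possible illustration of the regex route, the sketch below only strips a scheme when the string actually starts with one. The scheme syntax (a letter followed by letters, digits, '+', '-' or '.') follows RFC 3986; SchemeStripper and removeScheme are hypothetical names, and real-world URL handling may call for a full parser such as java.net.URI.

```java
import java.util.regex.Pattern;

class SchemeStripper {
  // Anchored at the start of the input: a scheme name immediately followed by "://".
  private static final Pattern SCHEME = Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*://");

  /** Removes a leading "scheme://" if present; otherwise returns the input unchanged. */
  static String removeScheme(String s) {
    return SCHEME.matcher(s).replaceFirst("");
  }

  public static void main(String[] args) {
    System.out.println(removeScheme("https://example.com"));  // prints example.com
    // Not a URL, so nothing is removed and the whole sentence is printed as-is.
    System.out.println(removeScheme("Surprise! This is not a url with http:// or https://!"));
  }
}
```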

More ways to create Substring.Pattern

Substring.Pattern can also be created from a CharMatcher or a regex:

import static com.google.common.base.CharMatcher.digit;
import static com.google.common.base.CharMatcher.whitespace;
import static com.google.mu.util.Substring.last;

before(first(whitespace()))       // first(CharMatcher)
    .removeFrom("foo bar")        // => " bar"

upToIncluding(last(digit()))      // last(CharMatcher)
    .from("314s")                 // => Optional.of("314")

first(Pattern.compile("\\(\\d{3}\\)\\s"))  // regex
    .removeFrom("(312) 987-6543")          // => "987-6543"

Reuse Substring.Pattern objects

Substring.Pattern is immutable and a pattern object can be reused to save object allocation cost, especially when used in a Stream chain.

For example, prefer:

charSource.readLines().stream()
    .map(first('=')::split)

over:

charSource.readLines().stream()
    .map(line -> first('=').split(line))

The latter allocates a new, identical Substring.Pattern object for every line.