Perusing Pair (and BiStream)

Ben Yu edited this page Aug 15, 2020 · 24 revisions

Let's talk about Pair

It's 2020 and I'm still talking about Pairs. ;-)

There are plenty of Stack Overflow questions asking about a generic Pair class in Java.

And at least 4 or 5 libraries provide anything from a whole slew of tuple implementations to at least a Pair class.

At Google, we are deeply biased against Pair (and all of those tuple types). Why?

I've personally been disgusted by my own Flume code (both Java and C++) that looked similar to the examples given in this post.

Code using these nested Pair classes also tends to be horrible:

emitFn.emit(in.getFirst(), Iterables.getOnlyElement(in.getSecond().getFirst()));
...

return String.format("%s (SourceId=%d)\t Status:%s\t AllocationCount:%d",
    getNetworkName(input.getFirst().getFirst()),
    input.getFirst().getFirst(),
    input.getFirst().getSecond(),
    input.getSecond());

if ((next_mid_iterator->second.first.first > mid_iterator->second.first.first)
   || (next_mid_iterator->second.first.second <= mid_iterator->second.first.second)) {
  ...
}

So what exactly is wrong with Pair? I see two aspects.

Meaningless names

People have different tastes. One may opt for the first/second terminology, or _1/_2, or left/right, car/cdr, foo/bar, a/b, yin/yang, head/tail, night/day, gandalf/saruman.

Whatever names you choose, they have one thing in common: they don't mean anything.

And that is why the Pair code above is horrible. "second.first.first" is likely the second best thing since goto was invented.

If you can come up with a logical meaning for these "second.first.first" thingies, and you want your future self to understand the code weeks later, call them what they are, for example: value.name.first_name.

Granted, Java didn't make it easy to create proper classes with proper field names. But I think programmers are also partially responsible, because oftentimes you don't really need hashCode()/equals()/getters/setters if you are just trying to have a place to define fields and document their semantics, invariants, etc. There is nothing wrong with the following simple class:

class Name {
  public final String firstName;
  public final String lastName;

  Name(String firstName, String lastName) {
    this.firstName = firstName;
    this.lastName = lastName;
  }
}

"But that exposes the fields as public. Breaks encapsulation!" you say. Yes, you are right that it doesn't provide abstractions through getters. But YAGNI anyone? Plus, consider this:

  1. Pair<String, String> doesn't provide any encapsulation either. It's actually worse: it not only exposes public access, it also sticks the <String, String> thing on your nose so you are guaranteed to carry it around wherever you go.
  2. Many IDEs have an "Encapsulate Field" automatic refactoring. If it turns out you need to wrap the fields in getters, it means two things:
    1. You made a good choice not using Pair in the first place, because by now it would have been more difficult to add any abstraction.
    2. You just need to run the "Encapsulate Field" refactoring. It will take care of updating your callers.

That YAGNI optimism only goes so far: it works for locally-used, private/inner classes where you know you won't need to use the object as a hash map key or store it in a Set. It won't work if you justifiably need equals()/hashCode() (or in C++, where many might need operator==, operator<, etc.).
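To make that limit concrete, here is a minimal sketch of what goes wrong once such a class ends up in a Set. The class names PlainName, ValueName and NameLookupDemo are made up for this illustration; only HashSet and Objects come from the JDK.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical class: fields only, so it inherits identity-based equals()/hashCode() from Object.
final class PlainName {
  final String firstName;
  final String lastName;

  PlainName(String firstName, String lastName) {
    this.firstName = firstName;
    this.lastName = lastName;
  }
}

// Hypothetical class: same fields, plus value-based equals()/hashCode().
final class ValueName {
  final String firstName;
  final String lastName;

  ValueName(String firstName, String lastName) {
    this.firstName = firstName;
    this.lastName = lastName;
  }

  @Override public boolean equals(Object obj) {
    if (!(obj instanceof ValueName)) return false;
    ValueName that = (ValueName) obj;
    return firstName.equals(that.firstName) && lastName.equals(that.lastName);
  }

  @Override public int hashCode() {
    return Objects.hash(firstName, lastName);
  }
}

class NameLookupDemo {
  static boolean plainFound() {
    Set<PlainName> names = new HashSet<>();
    names.add(new PlainName("Ada", "Lovelace"));
    // A second instance with the same fields is a different object, so the lookup fails.
    return names.contains(new PlainName("Ada", "Lovelace"));
  }

  static boolean valueFound() {
    Set<ValueName> names = new HashSet<>();
    names.add(new ValueName("Ada", "Lovelace"));
    // Value-based equality makes the lookup succeed.
    return names.contains(new ValueName("Ada", "Lovelace"));
  }

  public static void main(String[] args) {
    System.out.println(plainFound());  // false
    System.out.println(valueFound());  // true
  }
}
```

The hand-written equals()/hashCode() in ValueName is exactly the boilerplate that makes people reach for Pair in the first place.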

For those less cooperative use cases, code generators like AutoValue give a way out, so we can create proper value classes almost as easily as we had wished:

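import com.google.auto.value.AutoValue;

@AutoValue
abstract class Name {
  public abstract String firstName();
  public abstract String lastName();

  static Name of(String firstName, String lastName) {
    // AutoValue_Name is the subclass generated by the AutoValue annotation processor.
    return new AutoValue_Name(firstName, lastName);
  }
}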

(In the not-too-distant future, we may even be able to use tuples)

To be fair, even with the Pair class, this problem could be alleviated in the age of lambda. For example, why not add a method like:

class Pair<A, B> {
  public <R> R as(BiFunction<? super A, ? super B, R> output) { ... }
}

Code like the following would be easy to read:

parseUserNameAndDomain("foo@gmail.com")
    .as((userId, domain) -> ...);
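To make this concrete, here is a minimal runnable sketch. The Pair class below is a hypothetical one written for this page, not any particular library's class, and parseUserNameAndDomain() and describe() are invented helpers with the simplest behavior I could think of.

```java
import java.util.function.BiFunction;

// A minimal, hypothetical Pair; not any particular library's class.
final class Pair<A, B> {
  private final A first;
  private final B second;

  Pair(A first, B second) {
    this.first = first;
    this.second = second;
  }

  // Hands both values to the caller's function, so the caller gets to name them.
  <R> R as(BiFunction<? super A, ? super B, ? extends R> output) {
    return output.apply(first, second);
  }
}

class PairAsDemo {
  // Hypothetical helper: splits "user@domain" at the first '@'.
  static Pair<String, String> parseUserNameAndDomain(String email) {
    int at = email.indexOf('@');
    if (at < 0) throw new IllegalArgumentException("Not an email address: " + email);
    return new Pair<>(email.substring(0, at), email.substring(at + 1));
  }

  static String describe(String email) {
    return parseUserNameAndDomain(email)
        .as((userId, domain) -> userId + " at " + domain);
  }

  public static void main(String[] args) {
    System.out.println(describe("foo@gmail.com"));  // foo at gmail.com
  }
}
```

Note that first and second never leak out: the only names the reader sees at the call site are userId and domain.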

Useless Type

The type Pair<String, String> is both under-specified and over-specified:

  • It underspecifies the relationship between the two strings. The readers have no idea if they are id/name, title/content, ssn/creditcard.
  • It overspecifies the implementation details. If for example it represents title/content of a book, I need a Book type, not a type that hides its identity but taunts me with a riddle: "Hey, I have two strings in me, guess what I am?".

There is some confusion around this topic, though.

Does it mean a method shouldn't ever return int, or String, and should always wrap them?

No. When there is just one thing, the "relationship" argument is moot. Relationship is at least between two things.

That said, it can still be bad if you are over-using primitives to represent logical entities, especially if the logical entity will be used in multiple places. For example, if your code tends to use the "user id" concept over and over again, it's probably a better idea to create a UserId type. Don't use String just because the user id happens to be represented/encoded as a String.
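Such a type doesn't need to be much. A minimal sketch follows; the of() factory and the non-empty check are my assumptions about what a user id might require, not a prescription.

```java
// Hypothetical wrapper type: the String encoding stays an implementation detail.
final class UserId {
  private final String value;

  private UserId(String value) {
    this.value = value;
  }

  // Assumed invariant for this sketch: a user id is a non-empty string.
  static UserId of(String value) {
    if (value == null || value.isEmpty()) {
      throw new IllegalArgumentException("User id must not be empty");
    }
    return new UserId(value);
  }

  @Override public boolean equals(Object obj) {
    return obj instanceof UserId && value.equals(((UserId) obj).value);
  }

  @Override public int hashCode() {
    return value.hashCode();
  }

  @Override public String toString() {
    return value;
  }
}
```

With this in place, a signature like Map<UserId, UserId> or a parameter of type UserId can no longer be confused with any other String floating around.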

But what about Map<String, String>?

In a Map, the relationship between the two types is defined: one is the key and the other is the value associated with that key.

And yes, Map<String, String> may be okay in the internal implementation detail when it's used once or twice, with the context clearly in scope. But if it ever gets used across packages, or referenced multiple times, not knowing which String means what can be a readability problem. You'd be better off with Map<UserId, UserId> if they are some kind of user id mapping, or wrap the Map inside a higher-level abstraction class.

Is BiStream<Integer, Integer> bad?

Unlike Map, BiStream does not define a relationship between the two types. So unless seeing the two types gives the readers an immediate clue of the relationship (like in BiStream<UserId, User>), BiStream<Integer, Integer> would be bad.

But BiStream typically forms a chain of operations, where the BiStream's type changes at each step. The BiStream<Integer, Integer> type may only be an invisible intermediate type, like this:

BiStream.zip(indexesFrom(0), visits)  // BiStream<Integer, Integer>
    .map((index, visit) -> ...)
    ...
    .collect(...);

What if a proper class makes no sense?

There are situations where a semantic-free pair type is precisely what I need. For example, in a layered application, we have a bunch of domain types (Order, LineItem etc.) and then a bunch of corresponding DTO types (OrderDto, LineItemDto). At the boundary of the DTO -> Domain, the implementation of translation code may sometimes need to accept or return a list of Pair<OrderDto, Order> objects.

Nothing about the relationship is left untold by Pair<OrderDto, Order>: that this thing holds an OrderDto paired with an Order is exactly the semantics we need to convey.

In such cases, I'd use either BiStream<FooDto, Foo> or BiCollection<FooDto, Foo>, depending on whether I need it to be streamed once or accessed repeatedly.

In Conclusion

Going back to the root of the problem, people need Pair because they have methods that need to return two values.

As argued above, some of these cases are not really two-valued binary use cases, because what happens to be two things today may evolve to 3 or 4 things tomorrow. What the programmer really needs is a higher-level abstraction. For example, you'll want to return a Marriage object, not a Pair<Person, Person> object, because in the future Marriage may evolve to also need other information such as, say, Asset? Jurisdiction? Diamond? Anniversary? ExpirationDate? :)

There exist cases that are truly two-valued, binary. Some real world examples I can think of:

  • Split a flag string in the form of "--mode=dry_run" into the flag name/value pair.
  • Calculate the quotient and remainder of a division.
  • Find a list element and its current index in the list.

So what do I recommend when you need to do something similar?

If you are dealing with a collection or a stream of these pairs, use BiStream or BiCollection.

Otherwise, consider using the lambda approach (like in JDK 12's [Collectors.teeing()](https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/util/stream/Collectors.html#teeing(java.util.stream.Collector,java.util.stream.Collector,java.util.function.BiFunction)) API):

/** Splits string by delimiter */
<R> R split(..., BiFunction<String, String, R> output) {
  ...
  return output.apply(before, after);
}

/** Finds the element and its index */
<R> R locate(Id id, BiFunction<Integer, ? super T, R> output) {
  ...
  return output.apply(index, element);
}

The benefit is that the callers can call the method to create the appropriate type as it fits:

Flag flag = split(..., Flag::new);
locate(id, (index, element) -> ...);

Even when the caller has no appropriate type to use, they can use Pair or Map.Entry if they so choose:

Map.Entry<Integer, V> found = locate(id, Map::entry);
Pair<String, String> nameValue = split(..., Pair::new);
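For a version you can actually run, here is a minimal sketch of the same idea applied to two of the examples listed earlier: splitting a flag and computing quotient/remainder. The names TwoValueDemo, Flag, splitFlag() and divide() are hypothetical and exist only for this illustration; Map.entry() requires Java 9 or later.

```java
import java.util.Map;
import java.util.function.BiFunction;

class TwoValueDemo {
  // Hypothetical caller-side type for a parsed flag.
  static final class Flag {
    final String name;
    final String value;

    Flag(String name, String value) {
      this.name = name;
      this.value = value;
    }
  }

  // Splits a flag like "--mode=dry_run" at the first '=' and hands both halves to the caller.
  static <R> R splitFlag(
      String flag, BiFunction<? super String, ? super String, ? extends R> output) {
    int eq = flag.indexOf('=');
    if (eq < 0) throw new IllegalArgumentException("No '=' in " + flag);
    return output.apply(flag.substring(0, eq), flag.substring(eq + 1));
  }

  // Computes quotient and remainder in one call and lets the caller decide how to combine them.
  static <R> R divide(
      int dividend, int divisor, BiFunction<? super Integer, ? super Integer, ? extends R> output) {
    return output.apply(dividend / divisor, dividend % divisor);
  }

  public static void main(String[] args) {
    Flag flag = splitFlag("--mode=dry_run", Flag::new);
    System.out.println(flag.name + " -> " + flag.value);  // --mode -> dry_run

    Map.Entry<Integer, Integer> result = divide(17, 5, Map::entry);
    System.out.println(result.getKey() + " r " + result.getValue());  // 3 r 2

    String text = divide(17, 5, (quotient, remainder) -> quotient + " remainder " + remainder);
    System.out.println(text);  // 3 remainder 2
  }
}
```

Neither splitFlag() nor divide() had to commit to a return type; each call site picked the one that made sense there.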

As a bonus, all two-valued methods with such a signature can be passed as method references and used together with BiStream. For example, one can split a stream of strings using:

  ImmutableListMultimap<String, String> keyValues = readLines().stream()
      .collect(toBiStream(Substring.first('=')::splitThenTrim))
      .collect(ImmutableListMultimap::toImmutableListMultimap);

So, I guess the question is: why not?