Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iterate once to create two iterators in partition #2577

Merged
merged 15 commits into from
May 23, 2020

Conversation

mincong-h
Copy link
Member

Fix #2559

@mincong-h
Copy link
Member Author

Stream<T> second = Stream.empty();
for (T t : this) {
if (predicate.test(t)) {
first = first.append(t);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I remember, Stream.append is an O(n) operation, so the whole op will be O(n^2).
I believe you should use two java's ArrayList objects to partition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your pointer @ruslansennov . This should be fixed now.

final Iterator<T> first = that.iterator().filter(predicate);
final Iterator<T> second = that.iterator().filter(predicate.negate());
return Tuple.of(first, second);
final java.util.List<T> first = new ArrayList<>();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know if it is an issue here, but...
This solution is filtering eagerly and previous solution was filtering lazily.

If we would like to keep the same behaviour there is an option of creating a memoized predicate and then use it during filtering, e.g.

final Predicate<? super T> memoizedPredicate = Function1.of(predicate::test).memoized()::apply;
final Iterator<T> first = that.iterator().filter(memoizedPredicate);
final Iterator<T> second = that.iterator().filter(memoizedPredicate.negate());

Copy link

@kefasb kefasb May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that my solution with memoized predicate will not work in case when a mutable object mutates to the value that corresponds to another object to be filtered. Then another object is not evaluated but taken from cache with wrong result for it.
Example test:

final HashSet<MutableInteger> integers = HashSet.of(MutableInteger.of(1), MutableInteger.of(2));
final Tuple2<HashSet<MutableInteger>, HashSet<MutableInteger>> partition = integers.partition(mutableInteger -> mutableInteger.incrementAndCheckGreaterThan(2));
assertThat(partition._1).isEqualTo(HashSet.of(MutableInteger.of(3)));
assertThat(partition._2).isEqualTo(HashSet.of(MutableInteger.of(2)));

Where

    private static class MutableInteger {
        private int integer;
...
        boolean incrementAndCheckGreaterThan(int i) {
            return ++integer > i;
        }
...
    }

In this case the result of evaluation MutableInteger.of(1).incrementAndCheckGreaterThan(2) is put into cache with entry MutableInteger.of(2) -> false.
Then when the next object from set MutableInteger.of(2) is filtered and it is found in the cache and evaluated to false. While normally it would evaluate to MutableInteger.of(3) -> true.

But the question stays whether we want to keep the partitioning lazy or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit too technical for me... Maybe @danieldietrich can have an answer about it.

@@ -1722,10 +1722,16 @@ public U next() {
if (!hasNext()) {
return Tuple.of(empty(), empty());
} else {
final Stream<T> that = Stream.ofAll(this);
Copy link

@kefasb kefasb May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also partition method implemented by Stream which works wrongly too.

I do not know another places. Maybe @danieldietrich could point them out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not aware of the problem of Stream#partition. Thanks for pointing it out! 🙇 I checked all data-structures in VAVR, apart from Stream, all other implementations work correctly. About the Stream itself, I'm afraid that I don't know how to fix it and I'm not sure if it's really a bug. In Scala 2.13.2, the Stream implementation uses the predicates twice (link) as below:

override def partition(p: A => Boolean): (Stream[A], Stream[A]) = (filter(p(_)), filterNot(p(_)))

If I modify your example in issue #2559 by changing Set into Stream, you will see result in Scala as follow:

  "Stream" should "partition correctly" in {
    val fruitsToEat = Stream("apple", "banana")
    val partition = fruitsToEat.partition(name => biteAndCheck(name))

    partition._1 shouldEqual Stream()
    partition._2 shouldEqual Stream()  // not Stream("apple", "banana")

    fruitsBeingEaten.get("apple").get.name shouldEqual "apple"
    fruitsBeingEaten.get("apple").get.bites shouldEqual 2  // not 1
    fruitsBeingEaten.get("banana").get.name shouldEqual "banana"
    fruitsBeingEaten.get("banana").get.bites shouldEqual 2  // not 1
  }

So I would say the VAVR is aligned with Scala on Stream's behaviors. Also, my current approach (ArrayList) does not fit the Stream requirement, because Stream is lazy sequence of elements which may be infinitely long. So I don't know how to use only one predicate to achieve that... So my suggestion is let's keep stream as it is and avoid modifying it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stream is lazy sequence of elements which may be infinitely long

Oh sorry guys I completely forgot about it :(

Copy link
Member Author

@mincong-h mincong-h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review @kefasb, please see my comments inline.

@@ -1722,10 +1722,16 @@ public U next() {
if (!hasNext()) {
return Tuple.of(empty(), empty());
} else {
final Stream<T> that = Stream.ofAll(this);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not aware of the problem of Stream#partition. Thanks for pointing it out! 🙇 I checked all data-structures in VAVR, apart from Stream, all other implementations work correctly. About the Stream itself, I'm afraid that I don't know how to fix it and I'm not sure if it's really a bug. In Scala 2.13.2, the Stream implementation uses the predicates twice (link) as below:

override def partition(p: A => Boolean): (Stream[A], Stream[A]) = (filter(p(_)), filterNot(p(_)))

If I modify your example in issue #2559 by changing Set into Stream, you will see result in Scala as follow:

  "Stream" should "partition correctly" in {
    val fruitsToEat = Stream("apple", "banana")
    val partition = fruitsToEat.partition(name => biteAndCheck(name))

    partition._1 shouldEqual Stream()
    partition._2 shouldEqual Stream()  // not Stream("apple", "banana")

    fruitsBeingEaten.get("apple").get.name shouldEqual "apple"
    fruitsBeingEaten.get("apple").get.bites shouldEqual 2  // not 1
    fruitsBeingEaten.get("banana").get.name shouldEqual "banana"
    fruitsBeingEaten.get("banana").get.bites shouldEqual 2  // not 1
  }

So I would say the VAVR is aligned with Scala on Stream's behaviors. Also, my current approach (ArrayList) does not fit the Stream requirement, because Stream is lazy sequence of elements which may be infinitely long. So I don't know how to use only one predicate to achieve that... So my suggestion is let's keep stream as it is and avoid modifying it.

final Iterator<T> first = that.iterator().filter(predicate);
final Iterator<T> second = that.iterator().filter(predicate.negate());
return Tuple.of(first, second);
final java.util.List<T> first = new ArrayList<>();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit too technical for me... Maybe @danieldietrich can have an answer about it.

Copy link
Contributor

@danieldietrich danieldietrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mincong-h, thank you for your PR!

I dived into Scala's source code and checked their strategy. I summed up my findings on the original issue.

I think we need to align our Iterator.partition implementation to Scala. That way we will have the best of both worlds: laziness and iteration-once.

Additionally we might also check, if there are Scala collections that override the lazy evaluation strategy and provide a strict iteration-once implementation. I will do that tomorrow.

If you are not sure what to do, we can further discuss on the original issue.

Thanks!

Daniel

@mincong-h
Copy link
Member Author

Thank you Daniel for your comments (both here and in the issue). I will take a look this weekend.

@codecov-io
Copy link

Codecov Report

Merging #2577 into master will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2577      +/-   ##
============================================
+ Coverage     92.81%   92.85%   +0.03%     
- Complexity     5338     5340       +2     
============================================
  Files            89       89              
  Lines         12743    12743              
  Branches       1609     1611       +2     
============================================
+ Hits          11828    11832       +4     
+ Misses          727      723       -4     
  Partials        188      188              
Impacted Files Coverage Δ Complexity Δ
src/main/java/io/vavr/collection/Iterator.java 89.61% <100.00%> (+0.05%) 184.00 <0.00> (+2.00)
src/main/java/io/vavr/collection/Map.java 98.07% <0.00%> (+3.63%) 33.00% <0.00%> (ø%)
src/main/java/io/vavr/collection/Seq.java 93.75% <0.00%> (+3.75%) 40.00% <0.00%> (ø%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b7a72c...42117bf. Read the comment docs.

@codecov-commenter
Copy link

Codecov Report

Merging #2577 into master will increase coverage by 0.02%.
The diff coverage is 95.74%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2577      +/-   ##
============================================
+ Coverage     92.81%   92.84%   +0.02%     
- Complexity     5338     5355      +17     
============================================
  Files            89       89              
  Lines         12743    12773      +30     
  Branches       1609     1622      +13     
============================================
+ Hits          11828    11859      +31     
+ Misses          727      723       -4     
- Partials        188      191       +3     
Impacted Files Coverage Δ Complexity Δ
src/main/java/io/vavr/collection/Traversable.java 96.65% <ø> (+0.08%) 108.00 <0.00> (+2.00)
src/main/java/io/vavr/collection/Iterator.java 89.57% <91.66%> (+0.01%) 183.00 <1.00> (+1.00)
...main/java/io/vavr/collection/AbstractMultimap.java 98.44% <100.00%> (+0.03%) 141.00 <0.00> (+2.00)
src/main/java/io/vavr/collection/BitSet.java 98.02% <100.00%> (-0.01%) 82.00 <0.00> (ø)
src/main/java/io/vavr/collection/Collections.java 94.73% <100.00%> (+0.14%) 139.00 <3.00> (+3.00)
src/main/java/io/vavr/collection/HashSet.java 95.72% <100.00%> (-0.04%) 140.00 <1.00> (ø)
...rc/main/java/io/vavr/collection/LinkedHashSet.java 96.58% <100.00%> (-0.03%) 145.00 <1.00> (ø)
src/main/java/io/vavr/collection/Maps.java 99.24% <100.00%> (+0.02%) 80.00 <0.00> (+2.00)
src/main/java/io/vavr/collection/TreeSet.java 97.85% <100.00%> (-0.02%) 166.00 <1.00> (-2.00)
src/main/java/io/vavr/collection/Queue.java 95.76% <0.00%> (-0.69%) 202.00% <0.00%> (ø%)
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b7a72c...b75a1b6. Read the comment docs.

Copy link
Member Author

@mincong-h mincong-h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments to facilitate the review, @danieldietrich . I mainly have 3 concerns: the correctness in concurrent situation; whether or not the duplicate method should be part of the public API; whether you see improvement about testing, the Iteractor.duplicate in Scala 2.13 was added without tests... see scala/scala#6578

*
* @return a pair of iterators
*/
default Tuple2<Iterator<T>, Iterator<T>> duplicate() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method uses almost exactly the same logic as Iterator.duplicate in Scala 2.13

Comment on lines 1977 to 1979
if (gap.isEmpty()) {
ahead.set(this);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two Partners (duplicate iterators) may move forwards in different speed. The one moving faster is the one called "ahead", the other one is "behind". The distance between them is the "gap". When the gap becomes zero, it means that both iterators reached to the same position. We need to reset the "ahead".

}

@Override
public synchronized T next() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the synchronized keyword, this is an intrinsic lock. From what I learnt from book "Java Concurrency in Practice", this can cause performance problem. But I don't know what is the good way to encapsulate all mutable states, gap and ahead here, and protect it from concurrent access. I would like to know if you have any suggestion about it. In Scala, the duplicate method is written as follows:

def next(): A = self.synchronized { ... }


@Override
public synchronized boolean hasNext() {
return (this != ahead.get() && !gap.isEmpty()) || Iterator.this.hasNext();
Copy link
Member Author

@mincong-h mincong-h May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Daniel, you suggested me to create this method in another interface called IteratorModule. But we need to use the "parent" iterator (not sure what is the right word here) via Iterator.this. I'm not sure how to achieve that if the logic is moved to IteratorModule. The trade-off of declaring the method here is that it becomes a public API... while it is meant to be a utility method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we need to use the "parent" iterator

The static duplicate method receives the parent Iterator as parameter:

IteratorModule.duplicate(Iterator iter)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes 😂 👍

@danieldietrich
Copy link
Contributor

danieldietrich commented May 21, 2020

@mincong-h thank you! please find my answers to your 3 concerns below.

the correctness in concurrent situation

An Iterator is stateful. It does not offer thread-safety on its own. The user has two options in a concurrent setting:

  • explicitly synchronize iterator access from the outside
  • alternatively use a persistent collection instead

Please remove the occurrences of synchronized. The AtomicReference also isn't needed.

whether or not the duplicate method should be part of the public API

For now, I would hide it in the IteratorModule you mentioned. We offer a static method:

public interface Iterator<T> {
    // ...
}

interface IteratorModule {
  static <T> Tuple2<Iterator<T>, Iterator<T>> duplicate(Iterator<T> iterator) {
    // ...
  }
}

whether you see improvement about testing

For Iterators, equals and hashCode aren't overwritten, we use Object equality because an Iterator may be infinite and there is no way to compare the elements. (Note: A Stream may also be infinite but we are able to defined equals and hashCode in a meaningful way because elements are persisted for later use).

Therefore, please remove hasSame, equals and hashCode from Partner.

The unit tests cannot use the isEqualTo method on Iterators. We need to first consume the elements, e.g. by storing them in a List or by joining them in a String. These then can be compared. Maybe the cheapest solution is to compare joined strings.

It is great to see that we will get the duplicate method :)

@mincong-h
Copy link
Member Author

@danieldietrich , thanks for your suggestion and comments, really appreciated! I addressed most of the comments, here are something left.

Please remove the occurrences of synchronized. The AtomicReference also isn't needed.

I cannot remove AtomicReference, because I need a mutable object holder for holding the reference. It's not for being thread safe here. Assigning the reference to Iterator<T> ahead is not an option, because it fails the compilation with warning: "Variable 'ahead' is accessed from within innerclass, needs to be final or effectively final."

The unit tests cannot use the isEqualTo method on Iterators. We need to first consume the elements, e.g. by storing them in a List or by joining them in a String. These then can be compared. Maybe the cheapest solution is to compare joined strings.

Vavr has it own implementation about assertThat(Iterator) in IteratorTest as IterableAssert, which transforms the actual iterator and expected iterator into two lists and then compare them. So I think it's fine to use isEqualsTo. But just in case of something I didn't think about, I use what you suggested, i.e. joining them into a String and compare.

It is great to see that we will get the duplicate method :)

+1 :)

@mincong-h mincong-h force-pushed the issue-2559-partition branch from 0022fa7 to 4481b97 Compare May 22, 2020 20:07
Copy link
Contributor

@danieldietrich danieldietrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Mincong,
thanks for applying the changes!
Of course you are absolutely right about the AtomicReference object holder and about IterableAssert. As long as Iterator does not implement equals and hashCode we are free to use any of our assertion strategies. 👍
A special thank that you added the issue test! 😊

@danieldietrich danieldietrich merged commit 1c90106 into vavr-io:master May 23, 2020
@mincong-h
Copy link
Member Author

Hi Daniel. Thanks for your review, I learnt a lot from them. I'm looking forwards to contributing more in the future 😀

@mincong-h mincong-h deleted the issue-2559-partition branch May 24, 2020 06:02
@danieldietrich
Copy link
Contributor

Hi Mincong, I‘m also looking forward to your future contributions, it is fun to work with you!

Currently I have to finish some heavy duty tasks, such as restructuring the core Gradle project a bit and pulling some Git projects back into the core multi-module project. We need a clean version baseline in order to prevent versioning conflicts.

Additionally, I will change Try, Option and Either accordingly to the current 2.0.0 branch (which will be deleted afterwards, it is just an arbitrary name).

Finally, after some polishing, we will be able to release Vavr 1.0. After that, I will focus on a rework of the web page and the documentation. I expect also that there will be many questions and migration issues.

Additionally, I expect several new feature requests. Here come new PRs into play but we will see.

Thanks!

@mincong-h
Copy link
Member Author

The future looks exciting and your plan makes sens. Thanks for doing this Daniel! Do not hesitate to ping me if you need any help. I will try to grab some issues whenever I can as well.

Best,
Mincong

danieldietrich pushed a commit that referenced this pull request Jul 14, 2021
* Reproduce the problem

* Iterate once to create two iterators in partition

* Avoid using io.vavr.collection.Stream

* Test behavior of `partition` on different classes

* Test that Stream.partition() is lazy

* Create Iterator.duplicate() and add tests

* Change the implementation of Iterator.partition()

* Fix Set

* Fix Map

* Fix Multimap

* Move duplicate to IteratorModule

* Remove synchronized keyword

* Remove hashCode and equals

* Avoid using isEqualTo

* Remove redundant tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

partition method seems to not work correctly, comparing to scala
6 participants