-
Notifications
You must be signed in to change notification settings - Fork 106
Introduction to batEnum
##Introduction
This page was dedicated to the module batEnum
(enum
), which is widely used for two purposes:
- As a bridge to convert from one data structure to another
- Efficient implementation of most one-pass algorithms(more algorithms can be linearized than you might expect).
Enum
is no more powerful than LazyList
, but runs faster. Since it's
destructive, it does not cache the result in most cases. In order to read from an enum
multiple times, you can clone, create an independent copy which will produce the same values in the same order.
To understand this page, a basic understanding of LazyList is assumed.
PartI ##The fundamental stuff
type 'a t= {
mutable count : unit -> int ;
mutable next : unit -> 'a;
mutable clone : unit -> 'a t;
mutable fast : bool;
}
The field fast
indicates whether the count can be done in O(1) time.
That all the fields are mutable is important, since they may all be changed, we will see later.
When you invoke the count
function, it may enumerate the whole list, but the
current position in the list is kept.
The field clone
is to make a clone object, the object and the clone will not
interfere with each other if the computation is pure. (This's the most cases
in impure languages, when you fetch the value of the lazy block, it will force
the evaluation, and you get it the second time, the value is cached, but the
side effect will not be reproduced.)
During the following notations, T stands for time complexity, P : Purpose, S : Suggestion, A: Attention, Tip: Tips. For simplicity, 'a t is the same as 'a Enum.t. D: dependent on some functions
##Basic Functions (from basic to high level)
of_enum, enum : 'a Enum.t -> 'a Enum.t
(*T: O(1)
P: provided for a uniform interface *)
make :
next:(unit -> 'a) -> count:(unit-> int)-> clone:(unit -> 'a t)-> 'a t
(*T: O(1)
Tip: just fill the field of 'a t, the `fast` field is filled with true,
auxiliary function.
S: a bug:
let b= LazyList.(Enum.range 1 |> of_enum|> enum );;
b.fast is true (it uses make, and fast field is true)
*)
force : 'a t -> unit
(*T: O(n)
Tip: Cache the enumeration to a strict list, if the list is very long, then
probably the memory will be used up. However, after you force a list, the
clone, count, next, operations is very cheap, since they already has the whole
list in the memory. If no clone is operated on this enum later, and you
enumerate the list again, the memory will be reclaimed. Otherwise when clone
is invoked after force, you may be careful to avoid memory leak, this is also
quite common in Haskell
A: force will change its all fields
*)
type 'a _mut_list = {hd :'a ; mutable tl : 'a _mut_list}
(*Tip: This type is mainly for generating a list from head to tail
without recursion, and one pass, it uses compiler trick *)
let force t =
let rec clone enum count =
(*Here the enum is of type 'a list ref, count is of type int. Clone is a
function clone the state of 'a list ref content, and its length is known,
since it already has the whole list, this function is O(1), and did not
allocate space any more, and fast is true. its clone field is also easy to
define.
*)
let enum = ref !enum and count = ref !count in
{
count = (fun () -> !count );
next = (fun () -> match !enum with
|[] -> raise No_more_elements
|h::t -> decr count ; enum := t ; h );
clone = (fun () -> let enum = ref !enum and count = ref !count in
clone enum count );
fast = true ;}
in
let count = ref 0 in
let _empty = Obj.magic [] in
let rec loop dst =
(* we will invoke the code later
enum = {hd=t.next (); tl = _empty}
loop !enum ;
so this function will enumerate the t, and fill all the data in enum, then we
use (Obj.magic enum) to force enum to a list. If we did not use Obj.magic, we
get the elements to from the front to end, and will reverse it, more time
consuming. This a trick to build a immutable list from front to end using
Obj.magic*)
let x = {hd = t.next (); tl = _empty } in
incr count;
dst.tl <- x;
loop x
in
let enum = ref _empty in
(try
enum := {hd = t.next (); tl = _empty };
incr count;
loop !enum
with No_more_elements -> ());
let tc = clone (Obj.magic enum) count in
(*here we replace t with tc. Since all the fileds of 'a t are mutable
it changes all the fields of t *)
t.clone <- tc.clone;
t.next <- tc.next;
t.count <- tc.count;
t.fast <- true
empty : unit -> 'a t
(*T: o(1).
Tip: It is mutable, so its signature should be unit -> 'a t, not like
functional empty, Set.empty : 'a Set.t
S: this makes me uncomfortable, the interface is inconsistent, how about
change its name to make_empty or other ways to make a uniform interface?? *)
from : (unit -> 'a) -> 'a t
(*T: o(1)
Tip: very important function in batEnum.
Field fast is false, next is f.
Field count is (fun _ -> force e; e.count ())
The clone is tricky, since we need to clone the state, one way is when clone
is invoked, we force the list, and then clone. If we adopt this way, the clone
will be O(n), and will cache the whole list. Here we chose another way.
The other way is to make use of the lazy list. when we invoke the clone,
We use LazyList.from to build a lazy list, and LazyList.enum to hold the
state. so the state consists of 'a LazyList.t ref.
(* Here we use MicroLazyList for lazyList, since LazyList depends on batEnum,
otherwise it cause circular dependence.
For MicroLazyList, when it is converted to enum,
its next field is eval one step, and mutate the state,
its count is (fun () -> force e; e.count ()),
its fast field is false, its clone is just sharing the reference.
*)
(*S: batEnum use MicroLazyList.enum,
its count field is (fun () -> force e; e.count ());
then it will build a new list unnecessary.
For example, if we have an enum_a built from
let enum_a = from f
let enum_b = clone a
later invoke enum_a.count, since we cloned a, enum_a.count will force enum_a
to evaluate all the elements and pushed it to 'a list. Whereas enum_b still
hold the evaluated 'a LazyList.t ref. So enum_b and enum_a will hold two
evaluated list, this is quite unnecessary. I recommend to use
batLazyList.enum, change its count field *)
To conclude, the new generated enum's clone field is o(1), but when you clone
the num, it will change the record totally, fill the state with a LazyList.t
ref.
The order of clone, count matters, when you count first you cache the whole
list at once, and clone is cheap.
When you clone first, its state is kept via a lazy list's reference.
I recommend to modify MicroLazyList.enum to make it pure when count invoked.
*)
let from f =
let e = {
next = f;
count = _dummy;
clone = _dummy;
fast = false} in
e.count <- (fun () -> force e; e.count ());
e.clone <- (fun () ->
let e' = MicroLazyList.enum (MicroLazyList.from f) in
e.next <- e'.next ;
e.clone <- e'.clone;
e.count () <- (fun () -> force e ; e.count);
(*
Here we use MicroLazyList.enum (MicroLazyList.from f) to keep the state of the
closure f, and use the result to replace the previous enum totally.
However, e.count<- e'.count is wrong, since when e.count is invoked,
it will force e', and will modify e' which will not be reflected on e, or else
we could add one indirection e.fast <- (fun () -> e'.fast ()).
S: all make this bug is due to the fact that we use force in
MicroLazyList.enum's count field, if we use batLazyList.enum's count then, the
count is not destructive, the code will be more consistent. Here we can write
e.count <- e'.count.
The only disadvantage is e.count will always is for LazyList.enum the state
is 'a LazyList.t ref, and the fast will always be false whether or not the
list is totally calculated, but I think it does not matter.
*)
e.fast <- e'.fast ;
e.clone ());
e
from2 : (unit -> 'a) -> (unit -> 'a t) -> 'a t
(*T: O(1).
Tip: fields next, clone are provided by the user
fast is false, count <- (fun () -> force e ; e.count ())
This is the same as from, except that it provide a chance to customize the
clone function, it is useful, we will see it later
*)
Part II
##High level function (based on previous functions)
init : int -> (int -> 'a) -> 'a t
(*T: o(1)
D:from
Tip: fast is true, count is kept as a state, people from Haskell
should notice that init n f, init 4 (fun _ -> init 3 identity) and init 4 (const (init 3 identity)) are quite different, since the block init 3 identity is a mutable, should not be shared. This is also true for Array.init, etc.
*)
get : 'a t -> 'a option
(*T: o(1)
D: field next *)
push : 'a t -> 'a -> unit
(*T: o(1).
Tip: it add one element to the head
use a state next_called = ref false to record.
its t.next is tricky and destructive, when you invoke t.next
it will modify t.next,count,clone, this is quite common and buggy
*)
let push t e =
let rec make t =
let (fnext,fcount,fclone,next_called) = (t.next,t.count,t.clone,ref false) in
t.next <- (fun () ->
t.next <- (fun () -> next_called := true;
t.next <- fnext ;
t.count <- fcount;
t.clone <- fclone; e);
t.count <- (fun () ->
let n = fcount () in
if !next_called then n else n+1);
(* S: next_called is always false, otherwise this function
is changed, so remove it?
here people may worry if fcount is destructive, which means it can modify the fields of t, then t.count will be changed, but we replace t.next with the old t.next. However, here we wouldn't get a bug, since when fcount is destructive, it modifies all the fields, the next field is also changed, so
don't worry. But we may introduce some bugs via make???
The batList.enum, its count is not destructive, since it does not modify the count function, just modify the state of the closure *)
t.clone <- (fun () ->
let tc = fclone () in
if not !next_called then make tc;
tc);
(*here the fclone may be destructive, however, when it's
destructive, it updates all the fields, the definition of from, etc *)
in make t
peek : 'a t -> 'a option
(*T: o(1)
D: get, push *)
take : int -> 'a t -> 'a t
(*T: o(n), I recommend to change this function,
first, it's not efficient, it takes the first n elements (<=n when it has no
more than n elements, still successful), and cache it in a list. but the count
for the new list is re-calculated, we can improve to reduce the unnecessary calculation.
Second, in most cases we hope the take function is lazy, I suggest renaming
take to take_strict, and use the next function take_lazy
*)
take_lazy : int -> 'a t -> 'a t
(*time: o(1). my customized function *)
let take_lazy n t =
let m = ref n in
let next () = if !m > 0 then begin
decr m;
match get t with Some v -> v |None -> raise No_more_elements
end else
raise No_more_elements in
from next ;;
(*there is another commented version take in the source code, however its clone is o(n), I recommended to use from, first to reduce the complexity, second, the fast count is indeed unnecessary and wrong, since we may have less
than n elements, right? *)
junk : 'a t -> unit
(*time:o(1) trivial*)
is_empty : 'a t -> bool
(*based on the assumption, if t.fast then t.count should be o(1) *)
count,fast_count,clone
(*trivial*)
iter : 'a t -> ('a -> unit) -> unit
(*o(n) loop t.next , destructive *)
iteri (*trivial*)
iter2 : ('a ->'b -> 'c ) -> 'a t -> 'b t -> unit
(*
time: o(min(m,n)) ; be careful when get the element is not pure.
if the first enum is used up first, then the short is chosen,
otherwise, the second is shorter, we should push one element of
the first element back.
based on push
*)
iter2i (*trivial*)
fold,reduce,sum (*time: o(n) trivial*)
exists,for_all
(*time : o(n/2), be careful, it will destruct the enum until unnecessary,
so the side effect is hard to predict. you may make use of it, or bitten by it
*)
scanl: ('a -> 'b -> 'a ) -> 'a -> 'b t -> 'a t
(* time: o(1), based on from, push .
let y = scanl f init x
be careful, x and y is related, when you enumerate x, it will have an effect on y. This is all caused by lazy evaluation in the impure world.
*)
scan (*trivial, see scanl*)
foldi (*trivial, see fold *)
fold2 (*trivial, see fold + iter2 based on push*)
fold2i (*trivial, see fold2,+iteri *)
map : 'a t -> ('a -> 'b) -> 'b t
(*time o(1)
let ys = map xs f
(/@) = map (left associate)
(@/) = flip map (right associate)
lazy, so xs and ys are related see scanl.
*)
let rec map f t =
{count = t.count; next = (fun () -> f(t.next ()));
clone = (fun () -> map f (t.clone ()));
fast = t.fast ;}
(* let ys = map xs f
ys.count is not destructive, since xs.count dose not update ys, which means
ys.count will not call ys.next.
let zs = xs /@ f /@ g
you will see zs.count is the same as g.count. This is quite like Haskell, if
you call length, it will not evaluate the inner block. It's also the same that it's quite subtle to argue the time complexity.
we can not use from next here, since the count is customized, and the clone
, fast is also customized. So we clone ys will cause xs also to be cloned.
could any one explain the benefit??
another counter example:
let xs = range 1 ~until:20 /@ (fun x -> printf "%d\n" x; 2 * x)
let ys = clone xs
junk xs (*print 1 *)
junk ys
(*print 1 the side effect is caused twice, most times this is not true *)
*)
find : ('a -> bool) -> 'a t (*see exists *)
mapi : (see map)
filter : ('a -> bool) -> 'a t -> 'a t
(* time : o(1) (//) = flip filter (left associate)
based on from2
let ys = xs // f
the same as before, could anyone explain why provided a customized clone
function?
let xs = range 1 ~until:20 // (fun x -> printf "%d\n" x; x mod 2= 0)
let ys = clone xs
junk xs
junk ys (*side effect appeared again *)
what's the benefit over from next ?
*)
filter_map (*see filter*)
append : 'a t -> 'a t -> 'a t
(*time:o(1) *)
let rec append ta tb =
let t = { count = (fun () -> ta.count () + tb.count ());
next = _dummy;
clone = (fun () -> append (ta.clone ()) (tb.clone ());
(* here clone t will cause ta, tb to be cloned, however, since
append does not introduce side effects, it's ok.
another benefit here, if tb is a clone of tb they will
share the memory (*very good*)
*)
fast = ta.fast && tb.fast } in
t.next <- (fun () ->
try ta.next ()
with No_more_elements ->
t.next <- (fun () -> tb.next ());
t.count <- (fun () -> tb.count ());
(*add one indirection, since
let tc = append ta tb
... (*mutate tb , this is quite common
when invoke t.count, it may mutate tb
*)
the update of tb should be reflected on tc
*)
t.fast <- tb.fast ;
(* a bug here ??? why not add an indirection *)
t.next ());
t
To be added
Just for notes (to be added)
- the printer for enum is unnecessary on the toplevel (battop.ml)
- Seq.fold and Enum.fold signature does not correspond
- Some useful functions missed : concat_map
- slazy (a bug? fast should be false?)
- batCharParser, batParserCo unusable, error message lacked
- Stream syntax extension for Enum?
- polymorphic heap (this is necessary even though it is unsafe)
- double-linked list (it is strange that it does not even support zero element list, should support)
- a generic tree data structure is missing