ChangeLog

2018-08-29  dyuret  <dyuret@login03.kuacc.ku.edu.tr>

	* differentiate->@diff: differentiate(nll, f(x), y) does not work because f(x) gets
	evaluated before differentiate can create tapes.  @diff nll(f(x),y) works and looks
	better. Can use the shorter name (function diff was taken by Base). We still can't
	get rid of the global tape stack.

	* loss: f(x,y)->nll(f(x),y) -- instead of the two-arg call to the model defining the
	loss, we will resolve all model calls to be predict calls (which will allow multiple
	arguments). This has multiple advantages: (1) inner layers never need a loss, (2)
	multi-args predicts are possible, (3) the same model can be tested with multiple
	loss functions (e.g. nll vs accuracy). Disadvantage: either have to define separate
	loss function or make df(loss,f(x),y) work.

	* iteration: f()->f -- instead of the user having to define a zero-arg function call
	or a params(f) function for each layer/model, we make `for param in model` work
	automatically using a recursive iterator based on deepcopy.

	* Param-Result-Value: mutable structs are a lot faster as keys for an IdDict!

	* Tape: valN=>nodeN, ..., val1=>node1, NIL=>node0
	# Special node0 marks both ends of the tape:
	# node[n].cdr = node[n-1]
	# node1.cdr = node0
	# node0.cdr = nodeN
	# Special Value NIL acts as a key to node0.
	# Tape is iterated in reverse, last in first out
	# This automatically makes first(tape) the final result node
	# last(tape) is the initial parameter node, but default slow.
	# The last(::Tape) hack is to make old style grad faster.


2018-08-05  Deniz Yuret  <dyuret@ku.edu.tr>

	* TODO:
	- Overriding broadcasted vs broadcast?  Measure memory and speed. Figure out Knet functions vs AutoGrad functions. What methods are defined?
	- primitive tests
	- linalg
	- unit tests for each src file
	+ separate gradcheck and debug.jl
	+ separate math.jl?
	+ require specialfunctions?
	+ test knet
	- scan code, finish todos and optimize
	- go from f(Grad{n}) to grad(f,n,...)?
	- try memoization
	- speed test
	- test highorder: Innes has a PR?
	- pull requests and issues
	- fix docs and comments and examples
	- tracked array interface
	- fix issues
	+ separate j6 and j7 releases
	+ @primitive and @zerograd need broadcast and broadcasted to be	imported, use own names? import in macro?
	- optimize sum_outgrads, reduce memory use.
	- ::Rec .^ ::Int does not work!

	* Proposal: reduce compilation, do not construct functions at runtime.
	- record(f,x...) instead of recorder(f)(x...)
	- back(f,x...) instead of f(Grad,x...), need argnum
	- grad(f,x...) instead of grad(f)(x...), need argnum

2017-03-12  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	## Fix buggy gradient of getindex for repeated indices.
	## fix sum_outgrads as well to do the right thing.
	## fix full to use addindex!

	* TODO:
	# ungetindex problems:
	## we could have indices instead of index and use UngetIndex as a more efficient accumulator.
	## use full much more rarely and keep things as UngetIndex.
	## implement KnetArray version of addindex!
	## figure out julia4 problem with Array{CartesianIndex}

	# gradcheck with getindex tuples is broken and not easy to fix: tuple->tuple->tuple type struct means everything needs to be replaced when one leaf changes.
	# gradcheck with getindex dict is broken.
	# gradcheck with ungetindex does not work, fix and move ungetindex tests from src to test.
	# repmat does not work.
	# oftype(Rec(a), b) should ignore Rec.
	# similar(Rec(a), 2, 3) does not work.
	# Check the todo items in interfaces.jl.
	# Get the tests out of src.
	# Look into the pull request and speed benchmark issue.
	# Generate reference document using Documenter.jl converting core.jl comments into docstrings.
	# Pick test ranges to avoid floating failures (e.g. acoth args close to 1)
	# Test and fix Julia v0.6 compatibility: gradients with broadcast.
	# test higher order gradients of cat, getindex and other utilities.
	# log and unbroadcast fail test with the new toscalar.
	# Implement backward operations in profile.jl.
	# Also rnn profile may be different from mnist, may be better to compare on rnn: charlm or copyseq. need to get primitives right.
	# rnn example? other examples from autograd, knet? can we do a large mt or lm model?
	# implement convenience_wrappers (jacobian etc)? utilities (quick_grad_check for mnist)?
	# subarrays: can be used to make gradients for concat faster: as long as we don't have overwriting.
	# finish abstractarray(hvcat), array, tuple and dict support.
	# avoid double normalization in softmax.
	# find the problem with runtests when length=2 and higher order	dict support, open issues.
	# finish all function gradients.

2017-03-08  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# Rename OneHot -> UngetIndex.

2017-02-23  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# Add convert as a primitive, need to handle Node(::Rec) constructor. Too dangerous, could not find a safe way.

2016-09-03  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# improve testing: generate multiple tests compatible with given types.
	# fix problem with check_grads(log,0.12099921186616151,[1.7601369948066048,1.0407040634792377])
	# fix problem with (:check_grads,lfact,:args,(1.667842864901635,),:exact,(0.3202972807833342,),:numeric,(0.7822992913247839,))
	# Email announce new release with profiling results.
	# check slowdown in simple functions
	# f4ccc0d is faster than ae84389 ?

	* f4ccc0d-vs-ae84389: unboxing was the culprit, fixed.

	abb2cfe 2016-09-03
	1.961046 seconds (1.45 M allocations: 604.340 MB, 1.71% gc time)
	4.399855 seconds (2.76 M allocations: 2.182 GB, 2.34% gc time)

	a71400c 2016-09-03
	2.108507 seconds (1.45 M allocations: 600.952 MB, 1.66% gc time)
	4.650728 seconds (2.76 M allocations: 2.177 GB, 2.21% gc time)

	ae84389 2016-08-31
	2.060078 seconds (1.45 M allocations: 600.952 MB, 1.58% gc time)
	4.598383 seconds (2.94 M allocations: 2.183 GB, 2.09% gc time)

	f4ccc0d 2016-08-31
	2.041250 seconds (1.45 M allocations: 601.044 MB, 1.60% gc time)
	4.605514 seconds (3.11 M allocations: 2.191 GB, 2.07% gc time)

2016-09-01  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# test dict and tuple access.

2016-08-31  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# fix rosenbrock(x) = sum(map((i, j) -> (1 - j)^2 + 100*(i - j^2)^2, x[2:end], x[1:end-1]))
	# improve ungetindex (sparse container?)

	* Rosenbrock:
	# Profile memory overhead, check cost of recording large number of ops
	# rosenbrock(x) = sum(map((i, j) -> (1 - j)^2 + 100*(i - j^2)^2, x[2:end], x[1:end-1]))
	# grad fails with very large inputs.  Debug memory usage. Closures?

	# Vectorized version:
	# rosenbrock2(x) = sum((1-x[1:end-1]).^2 + 100*(x[2:end]-x[1:end-1].^2).^2)

	using AutoGrad
	f1(x)=sum(map((i, j) -> (1 - j)^2 + 100*(i - j^2)^2, x[2:end], x[1:end-1]))
	f2(x)=sum((1-x[1:end-1]).^2 + 100*(x[2:end]-x[1:end-1].^2).^2)
	g1=grad(f1)
	g2=grad(f2)
	x = [ rand(10^i) for i=1:6 ]
	for i=1:6; print("$i "); @time g2(x[i]); end
	#for i=1:6; print("$i "); @time g1(x[i]); end

	# Is it map?  No:
	f3(x)=(s=0; for i=2:length(x); s+=(1-x[i-1])^2 + 100*(x[i]-x[i-1]^2)^2; end; s)
	g3 = grad(f3)
	@time g3(rand(200)) =>  58.218724 seconds (143.24 M allocations: 7.839 GB, 3.70% gc time)
	@time g3(rand(200)) =>    0.020137 seconds (203.65 k allocations: 5.928 MB)
	what is going on the first time?  even if we change the length by
	1 all slows down! memalloc shoots up.  only works fast on the
	exact length it has seen before!

	Changed the way we do sum_outgrads.  Much better performance.
	Still explodes when much more than 10K entries used.  The reason
	is ungetindex.  It creates N dense copies of an N element array,
	one for each element.  Could use sparse arrays?  SparseMatrix only
	covers 2-D.  No gpu implementation Slows everything down.  Still
	cannot handle 100K.  Giving up for now but will think about
	ungetindex improvements.

2016-08-30  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# Profile speed
	# Get rid of closures
	# Make gradients generic
	# Implement broadcast and reduce more efficiently then CUDNN.
	# Test higher order derivatives
	# Update documentation and comments.
	# Problem with getindex.  Fix it, then test small version of mnist.
	# exporting @primitive by itself doesn't work, it needs recorder, getval etc. to be exported to work.
	# remove sum_outgrads by using the first outgrad as the accumulator
	## but sum_outgrads is a primitive and we can only do this after we support overwriting operations.
	## did this by using a functional accumulator
	# optimize for (tuple) loops in core.jl?
	# check and eliminate all tight uses of (?:)

	* CANCELLED:
	# CUBLAS:master already has highlevel.jl implemented with A_mul_B etc. (it is incomplete!) (make sure it doesn't transpose). Update CUDNN as well.  Put the rest of the primitives in CUDArt?  General broadcasting make sense for GPU?
	# Check out https://github.com/timholy/FastAnonymous.jl -- don't use them any more.
	# work on memory: overwriting: 3arg (overwriting) functions (Ax_mul_Bx, broadcast) or julia v0.5 overwrite syntax or InplaceOps.
	# use TypeNode <: Type instead of Node{Type}?


2016-08-29  Deniz Yuret  <dyuret@ku.edu.tr>

	* Timing:

		run	forw	grad
	e3e85c9 1.09	1.82	3.53
	e6b7b49	1.09	1.82	3.30	# fixed useless copy in broadcast.
	@time 1.802994 seconds (3.67 M allocations: 160.433 MB, 1.86% gc time)
	292b854 1.09	1.68	2.99	# fixed recorder
	@time 1.715418 seconds (3.13 M allocations: 127.752 MB, 1.68% gc time)
	8edef74	1.10	1.55	2.96	# eliminated gradient closures
	@time 1.553175 seconds (2.59 M allocations: 110.687 MB)

	* Operations: commented output from nnet shows unnecessary operations:

	sum(((w[3]*max(0,w[1]*x.+w[2]).+w[4])-y).^2)

	187:w			(:rcall,AutoGrad.merge_tapes,:A187_4,:A187_4,:A187_4)
	226:x
	153:y
	119:w3			(:rcall,getindex,:A119_10_64,:A187_4,3)
	581:w1			(:rcall,getindex,:A581_64_784,:A187_4,1)
	506:a1=w1*x		(:rcall,*,:A506_64_100,:A581_64_784,:A226_784_100)
	429:w2			(:rcall,getindex,:A429_64,:A187_4,2)
	373:a2=a1+w2		(:rcall,.+,:A373_64_100,:A506_64_100,:A429_64)
	285:a3=max(0,a2)	(:rcall,max,:A285_64_100,0,:A373_64_100)
	104:a4=w3*a3		(:rcall,*,:A104_10_100,:A119_10_64,:A285_64_100)
	866:w4			(:rcall,getindex,:A866_10,:A187_4,4)
	404:a5=a4+w4		(:rcall,.+,:A404_10_100,:A104_10_100,:A866_10)
	551:a6=a5-y		(:rcall,-,:A551_10_100,:A404_10_100,:A153_10_100)
	271:a7=a6^2		(:rcall,.^,:A271_10_100,:A551_10_100,2)
	000:a8=sum(a7)		(:rcall,sum,332.66202f0,:A271_10_100)

	000.d8=1
	000.413:d7=1.+zeros(a7) ***
	# Leave sum grad as identity, matmul may have a problem!

-	271.544:d6a=a6^1   ***	(:rcall,.^,:A544_10_100,:A551_10_100,1)
#	# Fix gradient of .^ to check for .^2 and .^1
?	271.344:d6b=2*d6a	(:rcall,.*,:A344_10_100,2,:A544_10_100)
+	271.939:d5=d6=d7*d6b *	(:rcall,.*,:A939_10_100,:A413_10_100,:A344_10_100)
-	404.481:d4=d5*1    ***	(:rcall,.*,:A481_10_100,:A939_10_100,1)
-	404.156:g4a=d5*1  ***	(:rcall,.*,:A156_10_100,:A939_10_100,1)
#	# Fix gradient of .+ to be identity
+	404.159:g4=sum(g4a,2)	(:rcall,sum,:A159_10_1,:A156_10_100,2)
+	866.945:g=unget(w,g4,4)	(:rcall,AutoGrad.ungetindex,:A945_4,:A187_4,:A159_10_1,4)
+	104.466:g3=d4*a3'	(:rcall,A_mul_Bc,:A466_10_64,:A481_10_100,:A285_64_100)
+	104.446:d3=w3'*d4	(:rcall,Ac_mul_B,:A446_64_100,:A119_10_64,:A481_10_100)
+	285.853:d2=d3.*(x.==y)	(:rcall,.*,:A853_64_100,:A446_64_100,:A43_64_100)
-	373.091:d1=d2.*1 ***	(:rcall,.*,:A91_64_100,:A853_64_100,1)
-	373.489:g2a=d2.*1 ***	(:rcall,.*,:A489_64_100,:A853_64_100,1)
#	# Same .+ problem.
+	373.048:g2=sum(g2a,2)	(:rcall,sum,:A48_64_1,:A489_64_100,2)
+	429.730:g=unget(w,g2,2)	(:rcall,AutoGrad.ungetindex,:A730_4,:A187_4,:A48_64_1,2)
+	506.636:g1=d1*x'	(:rcall,A_mul_Bc,:A636_64_784,:A91_64_100,:A226_784_100)
+	581.050:g=unget(w,g1,1)	(:rcall,AutoGrad.ungetindex,:A50_4,:A187_4,:A636_64_784,1)
+	119.117:g=unget(w,g3,3)	(:rcall,AutoGrad.ungetindex,:A117_4,:A187_4,:A466_10_64,3)
+	187.993:g=sum		(:rcall,AutoGrad.sum_helper,:A993_4,:A945_4,:A730_4,:A50_4,:A117_4)

2016-08-25  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# Try autograd mnist with own memory allocator for the gpu (to use instead of similar).
	# Finish math.jl.

	* PLAN:
	# exhaust primitives by implementing more models.
	# complete primitive implementations, utilities and release optimized cpu version.
	# try supporting cuda without overwriting, see the speed.
	# if that doesn't work seriously think about overwriting.

2016-08-24  Deniz Yuret  <dyuret@ku.edu.tr>

	* test/profprim.jl: profiling primitives.  See the file for results.
	Fixed inefficient cpu Knet.relu:  (Not good to use (?:) instead of if!)
	profknet.jl:train0(epochs=10,1layer): cpu=2.5444, gpu=1.8276
	profgpu.jl:train0(epochs=10,1layer): cpu=2.7664, gpu=3.5490; # not sure why this changed.
	profknet.jl:train0(epochs=10,2layer): cpu=4.3184, gpu=2.9201
	profgpu.jl:train0(epochs=10,2layer): cpu=5.2096, gpu=7.9503 # why cpu slower than knet? why is gpu slow?

	* timing:
	profile.jl:train0(epochs=1,2layer): 0.6234 using logsumexp, 0.5543 using xentloss; cannot test gpu with weights(64) without relu or max.
	profgpu.jl:train0(epochs=1,2layer): 0.9308 cpu using relu, 0.8403 gpu using relu; why is relu slower than max? because it used (:?) instead of if/else!
	profgpu.jl:train0(epochs=10,1layer): cpu=2.7302, gpu=2.9500; is this the slowdown from gpu alloc?
	profknet.jl:train0(epochs=10,2layer): cpu=7.7560, gpu=2.9201 # cpu slow!

	* DONE:
	# support CudaArrays and compare mnist speed with Knet. (without overwriting at first, fix gc)
	## need gpu *,.+,.-,.*,max,log,exp,sum,sum1 and whatever is needed for their gradients.
	## basically matmul, broadcast, relu and softmax...
	# It might be better to profile with primitive functions with and without allocation: test matmul,badd,relu,softmax on cpu/gpu w/wo alloc.  compare different methods for relu and softmax.  

2016-08-22  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# fixtest should give warning for unknown argtypes, not error.
	# BUG: keyword args mishandled by @primitive.
	## sin{##7516 <: Number}(x::Node{##7516},o...; a=3) = sin(getval(x),o...; a=3)
	# write documentation and publish.

2016-08-21  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# @primitive should take regular argtypes and generate signatures with Nodes.
	# concatenation? cat,hcat,vcat
	## need to figure out where each input ends up in final array.
	## call signature is complicated, vcat is already defined generically, including numbers. need Node inputs.

2016-08-20  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# copying, reshaping.

	* q: Node{Type} vs TypeNode <: Type

	We chose the first (a single parametric type) whereas the Python
	implementation uses the second (many types, I don't think python
	supports parametric types).  In the first, all types are children
	of Node, in the second we can hang them anywhere we want in the
	type hierarchy. Today Emre Yolcu suggested a possible advantage of
	the second approach: in cases where one Julia function calls a
	lower level Julia function in Base, we could get away with
	defining only the lower level function as primitive as long as we
	make AbstractFloatNode a subtype of AbstractFloat, and the Julia
	functions are written for a supertype of AbstractFloat, for
	example.  It is not possible to make Node{Float32} a subtype of
	anything other than Node and Any.  This would make it easier to
	cover groups of functions such as (*)->A_mul_B!->gemm_wrapper! or
	vcat->cat by only letting us define the lowest level one.  We'd
	need to cover arrays, matrices, vectors, tuples, dicts, floats.
	It would also solve the problem of not being able to define
	Node{A<:AbstractArray{T<:Number}}.  The current Node type used by
	core.jl would have to be a big union instead of a supertype, which
	may effect efficiency.  On the downside we may have to define many
	types and the "lowest level" we depend on may change in the next
	Julia version.

	Another point about ArrayNode:

	Currently we can have f{T<:AbstractArray}(x::Node{T})
	This will cover array nodes as well as abstractarray nodes.

	If we switch and have AbstractArrayNode <: AbstractArray and
	ArrayNode <: Array, any function we define for the first will not
	be inherited by the second.  If we use ArrayNode <:
	AbstractArrayNode, then we lose the main advantage of capturing
	function groups defined for Arrays.

2016-08-19  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# removed zeros_like: should use `nothing` in tuples, cell arrays, and dicts and have sum_outgrads understand this: will save 20%!
	# optimize dy.*1 gradients, but be careful not to pass back same array multiple times if we are going to use accumulators. sum_grads already does?
	# remove the need for transpose in matmul.
	# use more specific type signatures instead of ambiguous Nval.
	# get rid of the _name hash: it prevents gc, write a specific dbg printer instead.
	# work on efficiency (are closures efficient? nothing instead of zero arrays?). do closures get garbage collected?
	# check out JuliaDiff/ReverseDiffSource: cannot handle while loops? http://www.juliadiff.org/

	* profiling:

	(17) after getting rid of zeros_like: (compare to #15)
	0. 1.059346 seconds (776.62 k allocations: 309.133 MB)
	1. 1.011051 seconds (774.22 k allocations: 309.096 MB)
	2. 0.267993 seconds (60.78 k allocations: 58.240 MB)
	3. 0.241966 seconds (29.58 k allocations: 49.606 MB)
	4. 0.459659 seconds (454.38 k allocations: 76.330 MB)

   6576 ...d/examples/footime.jl; train0; line: 73  g = gradfun(w, x, y)
    6571 ...AutoGrad/src/core.jl; gradfun; line: 36  backward_pass(forward_pass(fun, args, kwargs, argnum)...)
     437  ...utoGrad/src/core.jl; backward_pass; line: 209  cur_outgrad = sum_outgrads(node.outgrads)
      428 ...utoGrad/src/core.jl; sum_outgrads; line: 541  sum_helper(a...)
       422 ...utoGrad/src/core.jl; sum_helper; line: 549  sum_helper{T}(a::AbstractArray{T},b::AbstractArray{T},c::AbstractArray{T}...) =
     410  ...utoGrad/src/core.jl; backward_pass; line: 213  for (gradfun, parent) in node.parent_grad_ops
      142 ./array.jl; next; line: 277
      171 tuple.jl; indexed_next; line: 21
     2094 ...utoGrad/src/core.jl; backward_pass; line: 215  og = gradfun(cur_outgrad)
      171 .../src/collections.jl; anonymous; line: 30  getindex(::D1,y,x,i...) = dy->ungetindex(x,dy,i...) # y=x[i], dy=df/dy
      713 ...rc/linalg/matmul.jl; anonymous; line: 2
       383 linalg/matmul.jl; A_mul_Bc; line: 187
	# There is some inefficiency here with actual transposing rather than calling transpose gemm!
       295 operators.jl; A_mul_Bc; line: 164
        279 no file; ctranspose; line: 0
         250 ...toGrad/src/core.jl; r; line: 127  result = f(argvals...; kwargs...)
          249 arraymath.jl; ctranspose; line: 414
           239 arraymath.jl; transpose!; line: 323
            155 arraymath.jl; transposeblock!; line: 350
      889 ...utoGrad/src/util.jl; anonymous; line: 22  gexp = :(dy->dy.*$(_d[i]))
       106 no file; .*; line: 0
	# Who is calling this? Probably max.
       212 no file; .==; line: 0
        204 ...utoGrad/src/core.jl; u; line: 408  u(x...; o...)=f(map(getval,x)...; o...)
         190 broadcast.jl; .==; line: 363
          171 broadcast.jl; broadcast!; line: 246
       437 sparse/sparsematrix.jl; .*; line: 1026
     3496 ...utoGrad/src/core.jl; forward_pass; line: 72  end_node = fun(args...; kwargs...)
      1783 ...examples/footime.jl; loss; line: 39  ypred = predict(w, x)
       1142 ...examples/footime.jl; predict; line: 32  x = max(0, w[i]*x .+ w[i+1])
        419 no file; *; line: 0
         337 ...toGrad/src/core.jl; r; line: 127  result = f(argvals...; kwargs...)
          337 linalg/matmul.jl; *; line: 132
        242 no file; .+; line: 0
        161 no file; getindex; line: 0
        319 no file; max; line: 0
         235 ...toGrad/src/core.jl; r; line: 127  result = f(argvals...; kwargs...)
          234 operators.jl; max; line: 391
       601  ...examples/footime.jl; predict; line: 35  return w[i]*x .+ w[i+1]
        157 no file; *; line: 0
        239 no file; .+; line: 0
        205 no file; getindex; line: 0
      729  ...examples/footime.jl; loss; line: 40  ynorm = ypred .- log(sum(exp(ypred),1))
       225 no file; .-; line: 0
       255 no file; exp; line: 0
        153 ...utoGrad/src/core.jl; r; line: 127  result = f(argvals...; kwargs...)
         149 operators.jl; exp; line: 381
       103 no file; log; line: 0
       140 no file; sum; line: 0
      982  ...examples/footime.jl; loss; line: 41  -sum(ygold .* ynorm) / size(ygold,2)
       207 no file; -; line: 0
       339 no file; .*; line: 0
        120 ...utoGrad/src/core.jl; r; line: 127  result = f(argvals...; kwargs...)
         116 ...se/sparsematrix.jl; .*; line: 1026
       257 no file; /; line: 0
       168 no file; sum; line: 0

	(18) after matmul is fixed (no transpose, compare to #17):
	0. 1.001226 seconds (769.42 k allocations: 256.563 MB)
	1. 0.927784 seconds (767.02 k allocations: 256.527 MB)
	2. 0.262618 seconds (60.78 k allocations: 58.240 MB)
	3. 0.241753 seconds (29.58 k allocations: 49.606 MB)
	4. 0.421250 seconds (450.78 k allocations: 76.220 MB)

   5567 ...d/examples/footime.jl; train0; line: 73  g = gradfun(w, x, y)
    5557 ...AutoGrad/src/core.jl; gradfun; line: 36  backward_pass(forward_pass(fun, args, kwargs, argnum)...)
     321  ...utoGrad/src/core.jl; backward_pass; line: 209  cur_outgrad = sum_outgrads(node.outgrads)
      303 ...utoGrad/src/core.jl; sum_outgrads; line: 530  sum_helper(a...)
       296 ...utoGrad/src/core.jl; sum_helper; line: 538  sum_helper{T}(a::AbstractArray{T},b::AbstractArray{T},c::AbstractArray{T}...) =
     359  ...utoGrad/src/core.jl; backward_pass; line: 213  for (gradfun, parent) in node.parent_grad_ops
      108 ./array.jl; next; line: 277
      153 tuple.jl; indexed_next; line: 21
     1740 ...utoGrad/src/core.jl; backward_pass; line: 215  og = gradfun(cur_outgrad)
      181 .../src/collections.jl; anonymous; line: 30  getindex(::D1,y,x,i...) = dy->ungetindex(x,dy,i...) # y=x[i], dy=df/dy
       154 no file; ungetindex; line: 0
      574 ...rc/linalg/matmul.jl; anonymous; line: 2
       440 linalg/matmul.jl; A_mul_Bc; line: 187
        440 linalg/matmul.jl; A_mul_Bt; line: 156
         433 linalg/matmul.jl; gemm_wrapper!; line: 329
          433 linalg/blas.jl; gemm!; line: 633
      680 ...utoGrad/src/util.jl; anonymous; line: 29  gexp = :(dy->dy.*$(_d[i]))
       193 no file; .==; line: 0
        183 ...utoGrad/src/core.jl; u; line: 397  u(x...; o...)=f(map(getval,x)...; o...)
         168 broadcast.jl; .==; line: 363
          129 broadcast.jl; broadcast!; line: 246
       325 sparse/sparsematrix.jl; .*; line: 1026
        197 broadcast.jl; broadcast!; line: 246
         107 broadcast.jl; _F_; line: 97
      102 ...utoGrad/src/util.jl; new_fun; line: 227  return sum(result, d)
     2968 ...utoGrad/src/core.jl; forward_pass; line: 72  end_node = fun(args...; kwargs...)
      1778 ...examples/footime.jl; loss; line: 39  ypred = predict(w, x)
       1177 ...examples/footime.jl; predict; line: 32  x = max(0, w[i]*x .+ w[i+1])
        525 no file; *; line: 0
         447 ...toGrad/src/core.jl; r; line: 127  result = f(argvals...; kwargs...)
          445 linalg/matmul.jl; *; line: 132
           432 linalg/matmul.jl; gemm_wrapper!; line: 329
            432 linalg/blas.jl; gemm!; line: 633
        156 no file; .+; line: 0
        188 no file; getindex; line: 0
        307 no file; max; line: 0
         237 ...toGrad/src/core.jl; r; line: 127  result = f(argvals...; kwargs...)
          235 operators.jl; max; line: 391
       548  ...examples/footime.jl; predict; line: 35  return w[i]*x .+ w[i+1]
        141 no file; *; line: 0
        197 no file; .+; line: 0
        207 no file; getindex; line: 0
      552  ...examples/footime.jl; loss; line: 40  ynorm = ypred .- log(sum(exp(ypred),1))
       198 no file; .-; line: 0
       156 no file; exp; line: 0
       113 no file; sum; line: 0
      635  ...examples/footime.jl; loss; line: 41  -sum(ygold .* ynorm) / size(ygold,2)
       139 no file; -; line: 0
       217 no file; .*; line: 0
       157 no file; /; line: 0
       116 no file; sum; line: 0

	(19) AutoGrad vs Knet timing with/without gpu, 10 epoch, mnist64 on biyofiz-4-1:

	AutoGrad CPU:
	0. 6.171894 seconds (7.69 M allocations: 2.505 GB) forw+rec-back-update
	1. 6.265890 seconds (7.67 M allocations: 2.505 GB) forw+rec-back
	2. 1.673144 seconds (607.80 k allocations: 582.394 MB) forw
	3. 1.471988 seconds (295.80 k allocations: 496.060 MB) forw (no loss)
	4. 3.023412 seconds (4.51 M allocations: 762.204 MB)   forw+rec

	Knet CPU:
	0. 7.661409 seconds (88.90 M allocations: 1.367 GB)  forw-back-update
	1. 8.407676 seconds (108.31 M allocations: 1.653 GB) forw-back
	2. 4.235033 seconds (54.22 M allocations: 845.108 MB) forw

	Knet GPU:
	0. 2.925609 seconds (4.22 M allocations: 189.742 MB) forw-back-update
	1. 2.792329 seconds (4.08 M allocations: 184.254 MB) forw-back
	2. 1.417462 seconds (1.96 M allocations: 86.087 MB)  forw

2016-08-18  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# profile mnist

	* profiling:

	- Hypotheses:
	- closures slow.
	- untyped functions slow.
	- memory allocation slow.
	- two types of memory alloc: user code, grad code.
	- for user code, let the user do it, support overwriting ops (careful about non-Node arrays)
	- for grad code, keep around a tape after it is used?  how about args in closures?

	(1) AutoGrad: Testing 1 epoch h=64 mnist with minibatch=100, best out of 3 on ural from fresh start.
	0. Regular train (record, update)
	1. Forward back (record, no update)
	2. Nonrecording, no update, just forward (computes softloss)
	3. Just predict, no softloss.
	4. forward_pass(loss) (recording version of #2)

	0. 13.944595 seconds (3.95 M allocations: 2.299 GB, 18.09% gc time)
	1. 13.100620 seconds (3.93 M allocations: 1.844 GB, 15.03% gc time)
	2. 4.288226 seconds (64.38 k allocations: 114.196 MB, 0.17% gc time)
	3. 4.242340 seconds (33.18 k allocations: 98.294 MB, 0.12% gc time)

	(2) Problem: mixing Float64 weigths with Float32 data!

	(3) AutoGrad: Use Float64 weights and data
	0. 6.512184 seconds (3.94 M allocations: 2.299 GB, 38.67% gc time)
	1. 5.719212 seconds (3.92 M allocations: 1.843 GB, 34.23% gc time)
	2. 0.483972 seconds (60.78 k allocations: 114.032 MB, 1.31% gc time)
	3. 0.444158 seconds (29.58 k allocations: 98.129 MB, 1.01% gc time)

	(4) AutoGrad: Use Float32 weights and data
	0. 4.275396 seconds (3.94 M allocations: 1.237 GB, 33.43% gc time)
	1. 3.702224 seconds (3.92 M allocations: 1.008 GB, 32.20% gc time)
	2. 0.269955 seconds (60.78 k allocations: 58.239 MB, 1.22% gc time)
	3. 0.242950 seconds (29.58 k allocations: 49.606 MB, 1.13% gc time)

	(5) Continue with Float32.

	(6) AutoGrad: gc_enable(false) while measuring time:
	0. 2.760838 seconds (3.94 M allocations: 1.236 GB)
	1. 2.489116 seconds (3.92 M allocations: 1.007 GB)
	2. 0.266896 seconds (60.78 k allocations: 58.239 MB)
	3. 0.240501 seconds (29.58 k allocations: 49.606 MB)

	(7) AutoGrad: use axpy! during update:
	0. 2.559526 seconds (3.92 M allocations: 1.008 GB)

	(8) Compare with Knet: same task.
	0. 1.128186 seconds (8.94 M allocations: 141.341 MB, 1.63% gc time) Forw-back-update
	1. 1.244718 seconds (10.76 M allocations: 168.832 MB, 1.70% gc time) Forward-back (no update)
	2. 0.590746 seconds (5.39 M allocations: 84.441 MB, 1.41% gc time) Forward only

	(9) broadcast vs vectorized ops. (testing exp(a) where a=rand(10000,10000))
	2.989990 seconds (2 allocations: 762.940 MB): exp(a)
	2.920066 seconds (12 allocations: 762.940 MB, 0.22% gc time): broadcast(exp,a)
	2.725942 seconds (1 allocation: 32 bytes): broadcast!(exp,b,a)
	2.724059 seconds (1 allocation: 32 bytes): broadcast!(exp,a,a)
	Not much difference, vectorized ops to be deprecated in favor of broadcast in v6.0.

	(10) Julia v0.5 is not faster (but maybe it hasn't been compiled optimally).
	0. 2.858207 seconds (3.78 M allocations: 1.011 GB)

	(11) Profile1
    2276 ...AutoGrad/src/core.jl; gradfun; line: 36  backward_pass(forward_pass(fun, args, kwargs, argnum)...)
     242 ...AutoGrad/src/core.jl; backward_pass; line: 208  cur_outgrad = sum_outgrads(node.outgrads...)
      237 ...utoGrad/src/core.jl; sum_outgrads; line: 570
       192 ...utoGrad/src/core.jl; sum_outgrads; line: 570
        187 broadcast.jl; broadcast; line: 253
         142 broadcast.jl; broadcast!; line: 246
     711 ...AutoGrad/src/core.jl; backward_pass; line: 214  og = gradfun(cur_outgrad)
      415 .../src/collections.jl; anonymous; line: 14       getindex(::D1,y,x,i...) = dy->ungetindex(x,dy,i...)
       412 no file; ungetindex; line: 0
        298 ...utoGrad/src/core.jl; r; line: 126
         298 .../src/collections.jl; ungetindex; line: 21
          297 ...toGrad/src/core.jl; fill_internal; line: 546
           295 ...oGrad/src/core.jl; fill_check; line: 549
            291 ...oGrad/src/core.jl; fill_internal; line: 546
             264 array.jl; fill!; line: 193
      150 ...utoGrad/src/util.jl; anonymous; line: 22
     953 ...AutoGrad/src/core.jl; forward_pass; line: 72   end_node = fun(args...; kwargs...)
      519 .../examples/footime.jl; loss; line: 39          ypred = predict(w, x)
       329 ...examples/footime.jl; predict; line: 32       x = max(0, w[i]*x .+ w[i+1])
        108 no file; .+; line: 0
       186 ...examples/footime.jl; predict; line: 35       return w[i]*x .+ w[i+1]
      260 .../examples/footime.jl; loss; line: 40          ynorm = ypred .- log(sum(exp(ypred),1))
      173 .../examples/footime.jl; loss; line: 41          -sum(ygold .* ynorm) / size(ygold,2)

	(12) Focusing on forward:
	2. 0.266896 seconds (60.78 k allocations: 58.239 MB) call loss
	4. 0.939486 seconds (1.96 M allocations: 146.032 MB) call forward_pass(loss) (recording)

	(13) Taking out all the debug calls (turning them into macros)
	0. 1.556437 seconds (1.12 M allocations: 906.322 MB)
	2. 0.267631 seconds (60.78 k allocations: 58.239 MB)
	4. 0.544930 seconds (587.58 k allocations: 82.464 MB)
	At this point recording takes as much time as forward calculation, backward pass takes 4x more.
	Forward is as efficient as can be, dominated by array ops, except for excessive allocation.

	(14) What costs on forward_pass? (profiling 20 epochs)
    10176 ...AutoGrad/src/core.jl; forward_pass; line: 72  end_node = fun(args...; kwargs...)
     7875 .../examples/footime.jl; loss; line: 39  ypred = predict(w, x)
      6595 ...examples/footime.jl; predict; line: 32  x = max(0, w[i]*x .+ w[i+1])
       810  no file; *; line: 0
       3307 no file; .+; line: 0
       1636 no file; getindex; line: 0
       832  no file; max; line: 0
      1231 ...examples/footime.jl; predict; line: 35  return w[i]*x .+ w[i+1]
       294 no file; *; line: 0
       448 no file; .+; line: 0
       484 no file; getindex; line: 0
     1391 .../examples/footime.jl; loss; line: 40  ynorm = ypred .- log(sum(exp(ypred),1))
      463 no file; .-; line: 0
      440 no file; exp; line: 0
      227 no file; log; line: 0
      255 no file; sum; line: 0
     907  .../examples/footime.jl; loss; line: 41  -sum(ygold .* ynorm) / size(ygold,2)
      160 no file; -; line: 0
      331 no file; .*; line: 0
      198 no file; /; line: 0
      215 no file; sum; line: 0

	Compare with regular call:
   2526 ...d/examples/footime.jl; train2; line: 96  z = loss(w, x, y)
    1604 ...examples/footime.jl; predict; line: 32  x = max(0, w[i]*x .+ w[i+1])
     262 broadcast.jl; .+; line: 315
     392 linalg/matmul.jl; *; line: 132
     943 operators.jl; max; line: 391
    201  ...examples/footime.jl; predict; line: 35  return w[i]*x .+ w[i+1]
     152 broadcast.jl; .+; line: 315
     46  linalg/matmul.jl; *; line: 132
    154  broadcast.jl; .-; line: 321
    324  operators.jl; exp; line: 381
    36   operators.jl; log; line: 381
    52   reducedim.jl; sum; line: 264
    132  sparse/sparsematrix.jl; .*; line: 1026

	Most expensive lines in recorder:
	tapes=Set() # convert to a regular array
	for (arg,i) in enumerate(args) # convert to a regular for

	Next iteration:
	109 ...utoGrad/src/core.jl; r; line: 114  in(tape,tapes) || push!(tapes,tape)
	143 ...utoGrad/src/core.jl; r; line: 127  result = f(argvals...; kwargs...)
	101 ...utoGrad/src/core.jl; r; line: 147  for (tape, argnum, parent) in ops
         83 tuple.jl; indexed_next; line: 21
	76  ...utoGrad/src/core.jl; r; line: 149  gradfun = f(Grad{argnum}, result, args...; kwargs...)

	Removing in(tape,tapes), handle duplicates in Node().
	Next iteration: for loops with tuples; maybe we can optimize the single tape case.

       448 no file; .+; line: 0
        91  ...utoGrad/src/core.jl; r; line: 111  for (tape, parent_rnode) in arg.tapes
         74 operators.jl; indexed_next; line: 437
        109 ...utoGrad/src/core.jl; r; line: 127  result = f(argvals...; kwargs...)
        63  ...utoGrad/src/core.jl; r; line: 147  for (tape, argnum, parent) in ops
         51 tuple.jl; indexed_next; line: 21

	(15) latest single epoch times: (compare to #13)
	0. 1.380165 seconds (905.62 k allocations: 896.819 MB)
	1. 1.344440 seconds (903.22 k allocations: 896.783 MB)
	2. 0.267039 seconds (60.78 k allocations: 58.240 MB)
	3. 0.240020 seconds (29.58 k allocations: 49.606 MB)
	4. 0.434085 seconds (454.38 k allocations: 76.330 MB)

	This is better than Knet with recording in forward_pass.  Time to
	optimize the backward_pass.

	(16) Profiling backward_pass with 10 epochs.

   10496 .../examples/footime.jl; train0; line: 73  g = gradfun(w, x, y)
    10484 ...AutoGrad/src/core.jl; gradfun; line: 36  backward_pass(forward_pass(fun, args, kwargs, argnum)...)	
     2018 ...utoGrad/src/core.jl; backward_pass; line: 209  cur_outgrad = sum_outgrads(node.outgrads...)
     290  ...utoGrad/src/core.jl; backward_pass; line: 213  for (gradfun, parent) in node.parent_grad_ops
     4558 ...utoGrad/src/core.jl; backward_pass; line: 215  og = gradfun(cur_outgrad)
     3472 ...utoGrad/src/core.jl; forward_pass; line: 72  end_node = fun(args...; kwargs...)
   136   .../examples/footime.jl; train0; line: 76  Base.axpy!(-lr, g[i], w[i])

	# sum_outgrads spends significant time adding gradients for fanout>1 ops.
	# we can try adding to an accumulator instead of pushing.
	# we can also use the first gradient as the accumulator to avoid allocation but we decided that was dangerous.
	#   the same gradient matrix is passed back multiple times by e.g. +?
	#   actually right now we are multiplying with 1! never passing the original dy?
	#   the main problem is sum_outgrads is a primitive for high-order derivatives and we don't support overwriting operations yet.

     2018 ...utoGrad/src/core.jl; backward_pass; line: 209  cur_outgrad = sum_outgrads(node.outgrads...)
      1996 ...utoGrad/src/core.jl; sum_outgrads; line: 572  sum_outgrads{T}(a::AbstractArray{T},b::AbstractArray{T},c::AbstractArray{T}...) =
       1686 ...utoGrad/src/core.jl; sum_outgrads; line: 572  sum_outgrads{T}(a::AbstractArray{T},b::AbstractArray{T},c::AbstractArray{T}...) =	
        1670 broadcast.jl; broadcast; line: 253
         1493 broadcast.jl; broadcast!; line: 246

     4558 ...utoGrad/src/core.jl; backward_pass; line: 215  og = gradfun(cur_outgrad)

	# This is the main deal where we can save a lot of time.
	# we seem to be spending a lot of time filling arrays with zeros for ungetindex.
	# should use `nothing` in tuples, cell arrays, and dicts and have sum_outgrads understand this.

      2514 .../src/collections.jl; anonymous; line: 14  getindex(::D1,y,x,i...) = dy->ungetindex(x,dy,i...) # y=x[i], dy=df/dy
       2491 no file; ungetindex; line: 0
        2287 ...toGrad/src/core.jl; r; line: 127  result = f(argvals...; kwargs...)
         2279 ...src/collections.jl; ungetindex; line: 21  ungetindex(x::AbstractArray, dy, i...)  = (dx=zeros_like(x);setindex!(dx,dy,i...);dx)
          2264 ...oGrad/src/core.jl; fill_internal; line: 548  fill_internal{T}(x::AbstractArray{T},v,d::ObjectIdDict)=
           2244 ...oGrad/src/core.jl; fill_check; line: 551  fill_check(x,v,d::ObjectIdDict)=(haskey(d,x) ? d[x] : d[x]=fill_internal(x,v,d))
            2232 ...oGrad/src/core.jl; fill_internal; line: 548  fill_internal{T}(x::AbstractArray{T},v,d::ObjectIdDict)=
             2178 array.jl; fill!; line: 193

      348  ...rc/linalg/matmul.jl; anonymous; line: 2

      1019 ...utoGrad/src/util.jl; anonymous; line: 22  gexp = :(dy->dy.*$(_d[i]))
       203 arraymath.jl; .*; line: 125
       157 no file; .*; line: 0
        113 ...utoGrad/src/core.jl; r; line: 127  result = f(argvals...; kwargs...)
         109 ...se/sparsematrix.jl; .*; line: 1026
       150 no file; .==; line: 0
        139 ...utoGrad/src/core.jl; u; line: 408  u(x...; o...)=f(map(getval,x)...; o...)
         126 broadcast.jl; .==; line: 363
       473 sparse/sparsematrix.jl; .*; line: 1026
        430 broadcast.jl; broadcast!; line: 246

      232  ...utoGrad/src/util.jl; new_fun; line: 189  result = gradfun(dy)

      222  ...utoGrad/src/util.jl; new_fun; line: 196  return sum(result, d)


2016-08-17  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# Julia v5.0 compatibility.
	# Julia 5 already supports function types making Fn{} unnecessary: julia> typeof(sin) => Base.#sin
	# Julia 5 in-place syntax does not work for matmul, or .+ (unless you write x .= (+).(x,a)) yet.

2016-08-16  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# fix namespace for runtests.jl.
	# solve testing problem with zerograd args (sum, airy)

2016-08-15  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# extend defgrads so it can handle manual definitions as well.
	# implement reductions: sum, vecnorm
	# implement arraymath functions (transpose etc)
	# test mnist etc. more examples.
	# implement zero-one loss.
	# write mnist loader.
	# implement zerograd for one of the arguments.

2016-08-14  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# reorganize gradients mirroring base.
	# handle broadcast.jl.

2016-08-13  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# use Grad or some other name instead of Val.
	# split functions based on what type of args they accept.
	# implement broadcasting functions (finish unbroadcast).
	# implement matrix multiplication.

2016-08-12  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# write gradcheck.
	# implement unbroadcast
	# implement/test 2arg functions.

2016-08-11  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# @primitive should be type specific.
	# sum_outgrads should not overwrite its arguments: e.g. + may have passed the same dy back to multiple places.
	# separate tests, examples, and gradients.
	# (w,b)=params does not work, implement iterators?