Added parallel wordcount example. #475

Merged (1 commit) on Feb 27, 2012
examples/wordcount.j: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# wordcount.j
#
# Implementation of a parallelized "word count" of a text, inspired by the
# Hadoop WordCount example. Uses @spawn and fetch() to parallelize the
# "map" step; the "reduce" step currently runs on a single process.
#
# To run in parallel on a string stored in variable `text`:
# julia -p <N>
# julia> @everywhere load("<julia_dir>/examples/wordcount.j")
# julia> ...(define text)...
# julia> counts=parallel_wordcount(text)
#
# Or to run on a group of files, writing results to an output file:
# julia -p <N>
# julia> @everywhere load("<julia_dir>/examples/wordcount.j")
# julia> wordcount_files("/tmp/output.txt", "/tmp/input1.txt","/tmp/input2.txt",...)

# "Map" function.
# Takes a string. Returns a HashTable with the number of times each word
# appears in that string.
function wordcount(text)
    words=split(text,(' ','\n','\t','-','.',',',':',';'),false)
    counts=HashTable()
    for w = words
        counts[w]=get(counts,w,0)+1
    end
    return counts
end
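
# A quick sanity check (hypothetical REPL session, assuming the file has
# been loaded as shown above):
#   julia> counts=wordcount("to be or not to be")
#   julia> counts["to"]
#   2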

# "Reduce" function.
# Takes a collection of HashTables in the format returned by wordcount().
# Returns a HashTable in which words that appear in multiple inputs
# have their totals added together.
function wcreduce(wcs)
    counts=HashTable()
    for c = wcs
        for (k,v)=c
            counts[k] = get(counts,k,0)+v
        end
    end
    return counts
end
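
# For example, reducing the counts of two overlapping fragments (a sketch;
# {} builds this era's untyped array):
#   julia> wcreduce({wordcount("to be"),wordcount("not to be")})["be"]
#   2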

# Splits the input string into nprocs() chunks of roughly equal size (the
# chunk size is rounded up, so the last chunk may be shorter), @spawns
# wordcount() on each chunk in parallel, then fetch()es the results and
# combines them with wcreduce().
function parallel_wordcount(text)
    lines=split(text,'\n',false)
    np=nprocs()
    unitsize=ceil(length(lines)/np)
    wcounts={}
    rrefs={}
    # spawn a wordcount task for each chunk of lines
    for i=1:np
        first=unitsize*(i-1)+1
        last=unitsize*i
        if last>length(lines)
            last=length(lines)
        end
        subtext=join(lines[int(first):int(last)],"\n")
        push(rrefs, @spawn wordcount(subtext))
    end
    # fetch the per-chunk results
    while length(rrefs)>0
        push(wcounts,fetch(pop(rrefs)))
    end
    # reduce the partial counts into a single table
    count=wcreduce(wcounts)
    return count
end
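
# Worked chunking example: with 10 lines and np=3, unitsize=ceil(10/3)=4,
# so the spawned chunks cover lines 1:4, 5:8, and 9:10 (the last range,
# nominally 9:12, is clamped to the number of available lines).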

# Takes the name of a result file and a list of input file names.
# Concatenates the contents of all input files, runs parallel_wordcount()
# on the combined string, and writes the resulting counts to result_file.
function wordcount_files(result_file,input_file_names...)
    text=""
    for f = input_file_names
        fh=open(f)
        text=join({text,readall(fh)},"\n")
        close(fh)
    end
    wc=parallel_wordcount(text)
    fo=open(result_file,"w")
    for (k,v) = wc
        with_output_stream(fo,println,k,"=",v)
    end
    close(fo)
end
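
# Each line of result_file has the form <word>=<count>; e.g. a run over a
# file containing only "to be or not to be" would include the line:
#   to=2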