average_length_mapper.py
#!/usr/bin/python
"""
We are interested in whether the length of a post correlates with the length
of its answers.
Write a mapreduce program that would process the forum_node data and output the
length of the post and the average answer (just answer, not comment) length for
each post. You will have to decide how to write both the mapper and the reducer
to get the required result.
"""
import sys
import csv


def mapper(stdin):
    """
    MapReduce Mapper. Output is tab-delimited. Each line gives the question
    ID, a 0/1 flag (0 = question, 1 = answer), the node type
    ("question" or "answer"), and the body length.
    """
    reader = csv.reader(stdin, delimiter='\t')
    # Skip the header row. next() works on Python 2.6+ and Python 3, and the
    # None default avoids a StopIteration on empty input.
    next(reader, None)
    writer = csv.writer(
        sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

    for line in reader:
        # Only process rows with exactly 19 fields.
        if len(line) == 19:
            the_id = line[0]
            body = line[4]
            node_type = line[5]
            if node_type == "question":
                # Key on the question's own ID.
                writer.writerow([the_id, "0", "question", len(body)])
            elif node_type == "answer":
                # Key on the parent ID so answers group with their question.
                parent_id = line[6]
                writer.writerow([parent_id, "1", "answer", len(body)])


if __name__ == "__main__":
    mapper(sys.stdin)
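
The task description asks for a reducer as well, which is not part of this file. The following is only a minimal sketch of what a matching reducer might look like, not the author's implementation. It assumes the reducer input is the mapper's tab-delimited, fully quoted output grouped by key (as Hadoop Streaming's shuffle provides), and it does not assume the question row arrives before its answers. The names `reducer` and `emit`, the output column order, and the choice to report 0.0 for questions without answers are all hypothetical.

```python
import csv


def emit(writer, question_id, question_length, answer_total, answer_count):
    """Write one output row: question ID, question length, mean answer length."""
    # Assumption: skip keys for which no question row was seen.
    if question_id is None or question_length is None:
        return
    # Assumption: a question with no answers reports an average of 0.0.
    average = answer_total / float(answer_count) if answer_count else 0.0
    writer.writerow([question_id, question_length, average])


def reducer(lines, out):
    """Hypothetical reducer for the mapper's (id, 0/1, type, length) rows."""
    reader = csv.reader(lines, delimiter='\t', quotechar='"')
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')

    current_id = None
    question_length = None
    answer_total = 0
    answer_count = 0

    for row in reader:
        if len(row) != 4:
            continue  # Ignore malformed rows.
        key, _flag, node_type, length = row
        if key != current_id:
            # Key changed: flush the previous question and reset the totals.
            emit(writer, current_id, question_length, answer_total, answer_count)
            current_id = key
            question_length = None
            answer_total = 0
            answer_count = 0
        if node_type == "question":
            question_length = int(length)
        elif node_type == "answer":
            answer_total += int(length)
            answer_count += 1

    # Flush the final key.
    emit(writer, current_id, question_length, answer_total, answer_count)
```

In a streaming job this would be invoked as `reducer(sys.stdin, sys.stdout)` from a `__main__` guard, mirroring the mapper. Accumulating per key rather than relying on the 0/1 flag for ordering keeps the sketch correct even when no secondary sort is configured.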