NOTE: This is the Accumulo 1.6.0+ version.
*This build will not work against Accumulo 1.5 or earlier.*
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% D4M: Dynamic Distributed Dimensional Data Model
% Architect: Dr. Jeremy Kepner (kepner@ll.mit.edu)
% MIT Lincoln Laboratory
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% (c) <2010> Massachusetts Institute of Technology
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1. INTRODUCTION
D4M is a library that allows unstructured data to be represented
as triples in sparse matrices (Associative Arrays) and
manipulated using standard linear algebraic operations.
Using D4M it is possible to construct advanced analytics
with just a few lines of code.
D4M also supports parallel computing and connections to
high performance databases (e.g., Accumulo).
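As a conceptual sketch of the triples-to-sparse-matrix idea, the plain Matlab/Octave
snippet below builds a sparse matrix from a few (row key, column key, value) triples
and applies one linear algebra operation to it. It does not use the D4M API; the sample
keys and all variable names are invented for illustration, and D4M's own Assoc class
may store its data differently.

```matlab
% Triples (row key, column key, value). The keys and values are made-up sample data.
rowKeys = {'alice'; 'alice'; 'bob'};
colKeys = {'likes|cats'; 'likes|dogs'; 'likes|cats'};
vals    = [1; 1; 1];

% Map each distinct key to an integer index (unique returns the keys sorted).
[rowNames, unusedR, i] = unique(rowKeys);
[colNames, unusedC, j] = unique(colKeys);

% Sparse matrix whose (i,j) entry holds the value of the matching triple.
A = sparse(i(:), j(:), vals(:), numel(rowNames), numel(colNames));

% Linear algebra on the matrix answers questions about the data:
% entry (p,q) of A*A' counts the column keys shared by row keys p and q.
shared = full(A * A');
disp(shared);
```

Here alice and bob share one column key ('likes|cats'), so the off-diagonal entries
of shared are 1.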
2. DOCUMENTATION
For installation, please read this short (~5 page) document.
For usage, please see the eight-lecture course in the d4m_api/docs directory.
For examples please see the numerous examples (ending in TEST.m) in the d4m_api/examples directory.
When citing D4M in publications please use:
[Kepner et al, ICASSP 2012] Dynamic Distributed Dimensional Data Model (D4M)
Database and Computation System, J. Kepner, W. Arcand, W. Bergeron, N. Bliss,
R. Bond, C. Byun, G. Condon, K. Gregson, M. Hubbell, J. Kurz, A. McCabe, P. Michaleas,
A. Prout, A. Reuther, A. Rosa & C. Yee, ICASSP (International Conference on Acoustics,
Speech, and Signal Processing), Special session on Signal and Information Processing
for "Big Data" (organizers: Bliss & Wolfe), March 25-30, 2012, Kyoto, Japan
3. REQUIREMENTS
D4M (standalone)
  -Requires Matlab (www.mathworks.com/matlab) or GNU Octave 3.2+ (www.octave.org)
D4M Parallel
  -Requires pMatlab (www.ll.mit.edu/pMatlab)
D4M Database
  -Requires D4M database connector jar (see d4m_api/lib)
  -Requires various 3rd party jars (see d4m_api/libext)
  -Requires a running database
    -D4M provides full support for Accumulo (accumulo.apache.org)
    -D4M provides query support for SQL databases via JTDS (jtds.sourceforge.net)
  -GNU Octave < 3.6 requires the Java package
4. LICENSE
D4M follows the highly successful FFTW MIT licensing model (see fftw.org) and
is available via a number of licenses: Free (GPL), U.S. Gov't Agency,
U.S. Gov't Contractor, and Commercial. See additional documentation in the distribution.
5. INSTALLATION
Extract d4m_api.X.X.X.zip in your local directory.
If you want to connect to a database, also download the external libraries file
libext.X.X.X.zip and extract it into the d4m_api/ directory. This should result
in a distribution containing:
d4m_api-X.X.X
  docs/
  examples/
  lib/
  libext/
  matlab_src/
  TEST/
From here on we will refer to the full path to d4m_api-X.X.X as D4M_HOME
and ">>" denotes the Matlab (or GNU Octave) prompt.
6. QUICKSTART
(1) Start Matlab (or GNU Octave)
(2) Add the D4M library to your path by typing
>> addpath('D4M_HOME/matlab_src')
(3) Done.
Display the function reference by typing:
>> help D4M
Run the first example by typing:
>> cd D4M_HOME/examples/1Intro/1AssocIntro
>> AI1_SetupTEST
7. ADDING PARALLEL AND DATABASE CAPABILITIES
It is recommended that the D4M setup be placed in the Matlab ~/matlab/startup.m or GNU Octave ~/.octaverc file.
[Note: Windows users should consult their Matlab/Octave documentation to determine where this should exist.]
Below is a fully commented example of what this file might look like:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
D4M_HOME = '/Users/kepner/SVN/d4m_api'; % SET TO LOCATION OF D4M.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
addpath([D4M_HOME '/matlab_src']); % Add the D4M library.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Assoc('','',''); % Initialize library.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Uncomment the following line to enable the D4M database connector.
%DBinit; % This requires that the libext/ directory is in place.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Uncomment and modify the following four lines for parallel D4M.
%PMATLAB_HOME = '/Users/kepner/SVN/pMatlab'; % SET location of pMatlab.
%addpath([PMATLAB_HOME '/MatlabMPI/src']); % Add MatlabMPI.
%addpath([PMATLAB_HOME '/src']); % Add pMatlab.
%pMatlabGlobalsInit;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
8. TESTING
To run all the examples, start Matlab (or GNU Octave) and type:
>> cd D4M_HOME/examples
>> d4mTestAllExamples
NOTE: Some of the programs in examples/3Scaling/2ParallelDatabase require a valid database connection.
To run in parallel these programs also require pMatlab (www.ll.mit.edu/pMatlab/).
To configure the database, uncomment and modify the DB = DBsetup(...) command in
examples/3Scaling/2ParallelDatabase/DBsetup.m
9. RUNNING IN PARALLEL
Several parallel examples can be found in examples/3Scaling/2ParallelDatabase.
To run in parallel, edit an example (e.g., pDB02_FileTEST.m) by uncommenting the lines
marked "% PARALLEL". To run on 4 processors on your local machine, type:
>> cd D4M_HOME/examples/3Scaling/2ParallelDatabase
>> eval(pRUN('pDB02_FileTEST',4,{}))
10. DATABASE CONNECTION
10.1 Setting up an Accumulo connection
To establish an Accumulo connection in D4M, use the DBserver object.
>> DB = DBserver(host, db_type, instance_name, [username],[password])
DBserver takes the following parameters:
host: ZooKeeper host name
db_type: database type; always use 'Accumulo'
instance_name: Accumulo instance name
username: (optional) user name on the database
password: (optional) password for the user
You will be prompted for a username and password if you don't include them.
As you type the password you will not see anything displayed, so type carefully and hit return.
>> hostname = 'localhost'
>> db_type = 'Accumulo'
>> instance_name = 'Accumulo'
>> DB = DBserver(hostname,db_type,instance_name);
Enter a username:
JoeUser <return>
Enter a password.
<return>
10.2 Create a table or get an existing table in Accumulo
D4M has two flavors of database table objects: DBtable and DBtablePair.
These table objects give you access to the data.
Once you have the DBserver object, you can create a new table or get an existing one
by passing the table name to the DBserver object, which returns a DBtable object.
>> T = DB('MyTableName');
To create a DBtablePair object, pass two table names:
>> TT = DB('MyTable','MyTableTranspose');
10.3 Querying for data
You can query for data via the DBtable or DBtablePair.
The syntax is
>> A = T(row_key,column_key)
The results from the query are contained in an associative array object Assoc.
>> A = T(:,:)
This query returns all the data from T in an Assoc object.
The row_key and column_key must follow a particular format:
: (colon) returns all results.
'cat,fat,hat,' queries for cat, fat, and hat.
Note: the trailing comma is a required delimiter in the query string.
'cat,:,pat,' queries for a range, from cat through pat.
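To make the key-string format concrete, here is a minimal plain Matlab/Octave sketch
that splits such a string into its keys and checks for the range form. It is only an
illustration, not D4M's own parser: it assumes the final character of the string is the
delimiter (as in the comma-terminated examples above), and the variable names are
hypothetical.

```matlab
query = 'cat,fat,hat,';            % sample key string ending in its delimiter

delim  = query(end);               % delimiter taken from the last character
ends   = find(query == delim);     % position of every delimiter
starts = [1, ends(1:end-1) + 1];   % first character of each key

% Cut out the text between consecutive delimiters.
keys = cell(1, numel(ends));
for k = 1:numel(ends)
    keys{k} = query(starts(k):ends(k) - 1);
end

% Three entries with ':' in the middle denote a range (first key through last key).
isRange = (numel(keys) == 3) && strcmp(keys{2}, ':');

disp(keys);
disp(isRange);
```

Setting query = 'cat,:,pat,' instead yields the entries cat, : and pat, and isRange
becomes true. Without the trailing delimiter, the last key would be dropped or the
wrong character would be treated as the delimiter, which is why the ending comma matters.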
10.4 Examples:
This query searches the rows cat, hat, and sat, across all columns.
>> A = T('cat,hat,sat,',:)
This query returns the rows in the range from cat through sat, with all columns.
>> A = T('cat,:,sat,',:);
This query returns all rows with columns of 'cat', 'fat', and 'what'.
>> A = T(:,'cat,fat,what,');
The above query will be much faster if a table pair is used:
>> A = TT(:,'cat,fat,what,');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% D4M: Dynamic Distributed Dimensional Data Model
% Architect: Dr. Jeremy Kepner (kepner@ll.mit.edu)
% Software Engineer: Dr. Jeremy Kepner (kepner@ll.mit.edu)
% MIT Lincoln Laboratory
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% (c) <2010> Massachusetts Institute of Technology
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%