2012-02-29 Adam Moody <moody20@llnl.gov>
* scr_util_mpi.c : Added some functions to send/recv strings via MPI.
* scr_reddesc.c : Step through redundancy descriptor hashes by name
as determined by rank 0 (no longer need to number CKPTDESC in
sequential order).
2012-02-08 Adam Moody <moody20@llnl.gov>
* m4/x_ac_expand_install_dirs.m4 : Fix to address datarootdir variable
added from autoconf 2.59 to 2.63.
2012-02-07 Adam Moody <moody20@llnl.gov>
* scr_param.c : Added hash to return pointers to values returned from
getenv; returning those values directly caused seg faults on Cray
systems.
* scr_globals.c : Removed scr_hop_distance; added scr_group, scr_comm_node,
and scr_comm_node_across.
2012-02-01 Adam Moody <moody20@llnl.gov>
* scr_util.c : Added scr_strdupf.
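The entry above only names scr_strdupf. As a rough illustration of what a printf-style strdup can look like, here is a minimal sketch; the name strdupf_sketch, its signature, and its error handling are assumptions for illustration, not SCR's actual code:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format like printf into a newly allocated string.
 * Returns NULL on failure; the caller frees the result. */
char* strdupf_sketch(const char* fmt, ...)
{
  assert(fmt != NULL);
  va_list args;

  /* first pass: measure the formatted length, excluding the trailing NUL */
  va_start(args, fmt);
  int len = vsnprintf(NULL, 0, fmt, args);
  va_end(args);
  if (len < 0) {
    return NULL;
  }

  char* str = malloc((size_t) len + 1);
  if (str == NULL) {
    return NULL;
  }

  /* second pass: write the formatted text into the new buffer */
  va_start(args, fmt);
  vsnprintf(str, (size_t) len + 1, fmt, args);
  va_end(args);

  return str;
}
```

A caller might build a file name with strdupf_sketch("%s.%d", "ckpt", 7) and free() the result when done.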
2012-01-31 Adam Moody <moody20@llnl.gov>
* : split scr.c into component files and moved global variables to scr_globals.c
* : Dropped scr_comm_local/level and associated parameters and replaced
them with group descriptors (defines high-level groups of processes)
and store descriptors (defines parameters about storage locations).
2011-08-31 Adam Moody <moody20@llnl.gov>
* scr_hash_util.h/c : Added scr_hash_set/get_double.
* scr_transfer.c : Only bother writing to the transfer file if we
successfully opened it.
* scr.c : Added code to support async flush in 1.1-8.
2011-08-03 Adam Moody <moody20@llnl.gov>
* : Recording meta data in filemap and summary files instead of separate
meta data files.
* scr_filemap.h,
scr_filemap.c : Added routines to set/get/unset meta for each file.
2011-07-30 Adam Moody <moody20@llnl.gov>
* : Changed filemap, summary, and index files to refer to dataset ids
rather than checkpoint ids. Rebuild now rebuilds all datasets in cache
instead of just the most recent. Not yet fully tested.
* : Moved halt, flush, and nodes files from control directory to prefix
directory. Now only written by rank 0 rather than all master nodes.
* scr_scavenge.in : Renamed scr_flush to scr_scavenge and changed verbiage
from flush and drain to scavenge where appropriate. Flush was overloaded
to refer to different concepts and drain soon will be.
* scr_hash.h,
scr_hash.c : Moved function to list integer keys from filemap to hash.
Fixed bug which was sorting integer keys as strings in scr_hash_sort_int().
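The sorting bug noted above is easy to reproduce: compared as strings, "10" sorts before "9". The sketch below contrasts a string comparator with a numeric one; all names are hypothetical and the code assumes keys are stored as decimal strings, which the entry implies but does not spell out:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Compare two keys as strings: "10" sorts before "9" (the buggy order). */
int cmp_key_str(const void* a, const void* b)
{
  const char* const* x = a;
  const char* const* y = b;
  return strcmp(*x, *y);
}

/* Compare two keys by integer value: "9" sorts before "10". */
int cmp_key_int(const void* a, const void* b)
{
  const char* const* x = a;
  const char* const* y = b;
  long i = strtol(*x, NULL, 10);
  long j = strtol(*y, NULL, 10);
  return (i > j) - (i < j);
}

/* Sort an array of decimal key strings in ascending numeric order. */
void sort_keys_int(const char** keys, size_t count)
{
  assert(keys != NULL || count == 0);
  qsort(keys, count, sizeof(keys[0]), cmp_key_int);
}
```

The comparator returns (i > j) - (i < j) rather than i - j so that large values cannot overflow.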
2011-06-28 Adam Moody <moody20@llnl.gov>
* scr_env.h,
scr_env.c : Added file to isolate typical machine dependent code.
* scr_dataset.h,
scr_dataset.c : Associates attributes with a set of files. Extends
SCR to handle general output datasets rather than just checkpoints.
* scr_err.h,
scr_err.c,
scr_err_serial.c : Write all messages to stdout (errors were going to stderr).
Include version number in all messages. Added new scr_warn() function.
* scr_io.h,
scr_io.c : Added functions to return current working directory and
to build the absolute path to a file. Also added functions to compress
and decompress files.
* scr_util.h,
scr_util.c : Added functions to pack and unpack integer types into a
buffer in network order.
* scr_hash_util.h,
scr_hash_util.c : Added functions to set/get integer values.
* scr_flush_file.c : Fixed bug that was returning with wrong code if
named checkpoint is invalid.
* scr_hash.c : Modified code to use new pack / unpack functions.
* scr_meta.h,
scr_meta.c : Modified code to use new get / set hash functions. Also added
functions to record original directory specified by user at SCR_Route_file time.
* scr_copy.c : Allow caller to specify parameters using options instead of arg order.
* scr_dataset.h,
scr_dataset.c : Added functions to set / get / unset a dataset hash.
* scr_conf.h,
scr.c : Added partial support for containers (available in flush and fetch,
but still need drain support). Added code to record dataset attributes. Added
code to remember original directories specified by caller, need support to
create them and delete them with checkpoint.
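The scr_util entry in this block mentions packing integer types into a buffer in network order. A minimal sketch of that idea for a 32-bit value follows; the function names and return convention are assumptions, not the SCR routines themselves:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack a 32-bit value into buf in network (big-endian) byte order.
 * Returns the number of bytes written. */
size_t pack_uint32(void* buf, uint32_t val)
{
  assert(buf != NULL);
  uint32_t net = htonl(val);
  memcpy(buf, &net, sizeof(net));
  return sizeof(net);
}

/* Unpack a 32-bit value stored in network byte order.
 * Returns the number of bytes consumed. */
size_t unpack_uint32(const void* buf, uint32_t* val)
{
  assert(buf != NULL && val != NULL);
  uint32_t net;
  memcpy(&net, buf, sizeof(net));
  *val = ntohl(net);
  return sizeof(net);
}
```

Using memcpy avoids unaligned access, and the fixed byte order lets a buffer written on one architecture be read on another.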
2011-06-10 Adam Moody <moody20@llnl.gov>
* scr_err.h,
scr_err_serial.c,
scr.c : Added version string to error messages.
2011-05-23 Adam Moody <moody20@llnl.gov>
* scr_hash.c,
tv_data_display.c,
tv_data_display.h : Updated to latest C++View files to restore printing capability
in TotalView.
2011-04-13 Adam Moody <moody20@llnl.gov>
* scr.c : Added scr_summary_read_v4_to_v5 to read old scr_summary.txt files and
convert them into equivalent v5 hash for backwards compatibility. This enables
v1.1-8 to restart from checkpoints stored in v1.1-7 or earlier.
* scr_hash.c,
scr_hash.h : Added scr_hash_sort to sort keys in hash by string, also modified
integer sort call to accept ascending or descending flag.
* scr_index.c : Added --remove option to drop a directory from the index.
* man/scr_index.1.in : Described format of --list output and added an example.
2011-04-06 Adam Moody <moody20@llnl.gov>
* man/scr_index.1.in : Added man page for scr_index, and removed references to
scr_check_complete.
* doc/scr_users_manual.pdf,
doc/scr_developers_manual.pdf : Updated user's manual and renamed it. Added
developer's manual.
2011-03-24 Adam Moody <moody20@llnl.gov>
* scr.c : Fixed bug regarding leaking group handles.
* scr.c,
scr_hash_mpi.c : Moved MPI related functions for hash from scr.c to scr_hash_mpi.c.
Includes support for a sparse global exchange.
* m4/x_ac_gcs.m4,
configure.ac : Added support for libgcs (generalized comm split library), which
enables one to split processes on arbitrary data. Since this library is only at
LLNL currently, also added --with-gcs configure option and HAVE_LIBGCS compile flag
to enable it.
* scr_hash.h,
scr_hash.c : Added scr_hash_getf(), similar to scr_hash_setf().
Added hash keys for XOR file header.
2010-12-20 Adam Moody <moody20@llnl.gov>
* scr_io.h,
scr_io.c : Modified scr_read and scr_write to take (and print) file name.
* scr_copy_xor.h,
scr_copy_xor.c : Moved what little remained in these files to scr_hash.h,
and dropped these files.
2010-12-16 Adam Moody <moody20@llnl.gov>
* : Simplified Makefile.am code, and better follow package conventions.
Thanks to Todd Gamblin.
* m4/x_ac_yogrt.m4,
m4/x_ac_mysql.m4 : Disable these configure options by default.
2010-12-15 Adam Moody <moody20@llnl.gov>
* : Restructured source layout to place all files in the same
directory again. Quicken build by adding scr_base and scr_cmds
libs which the runtime library and scr commands now depend on.
2010-12-15 Adam Moody <moody20@llnl.gov>
* scripts/scr_flush_files.in,
scripts/scr_retries_halt.in : Dropped these files since commands
are now implemented in C in src/bin.
* scripts/scr_hash.pm : Scripts no longer access hash files
directly, so dropped this file.
* scripts/scr_check_complete.in,
man/ : Replaced with scr_index command.
* m4 : created subdirectory to hold m4 configure scripts,
added script to extract build flags from MPI wrappers.
* : Restructuring build (yet again) to improve build speed.
2010-12-10 Adam Moody <moody20@llnl.gov>
* : Completed conversion of meta data to hash.
* src/bin/scr_nodes_file.c : Added to read nodes file and print
number of nodes used in previous run, called by scr_env --runnodes.
* src/bin/scr_flush_file.c,
src/bin/scr_retries_halt.c : Converted command from perl to C so
that we can drop support for scr_hash.pm.
* src/bin/scr_index.c : Added --list option to list checkpoints
along with flags, flush times, and associated directories.
* src/scr_hash_util.{c,h} : Includes various functions to simplify
certain hash operations.
2010-07-02 Adam Moody <moody20@llnl.gov>
* : Restructured src directory layout.
* cmds/scr_index.c : Added command to replace scr_check_complete,
which builds summary file and adds checkpoint dir to index file
in prefix directory.
* scripts/scr_test_hostlist.in,
scripts/scr_test_runtime.in,
scripts/Makefile.am : Dropped test for Hostlist module now that
scr_hostlist.pm is provided.
2010-06-25 Adam Moody <moody20@llnl.gov>
* src/scr.c : Added scr.index file to track checkpoint flush directories
over time. Dropped scr.old symlink.
Still need to update scr_postrun to update index.
2010-06-23 Adam Moody <moody20@llnl.gov>
* src/scr_meta.h,
src/scr_meta.c : Convert meta file format to hash.
* src/scr.c,
scripts/scr_check_complete.in : Convert scr_summary.txt file to hash.
* src/scrf.h,
src/scrf.c : Added Fortran77 support.
* examples/test_ckpt.C : Added test for C++ codes using the C interface.
2010-06-17 Adam Moody <moody20@llnl.gov>
* src/scr.h : Added extern "C" to improve support for C++ codes.
* src/scr.c : Added check to abort if user calls SCR_Start_checkpoint twice
without a call to SCR_Complete_checkpoint in between.
2010-06-03 Adam Moody <moody20@llnl.gov>
* : tag scr-1.1-7
* src/Makefile.am : Replace straight cp with libtool --mode=install cp to fix
minor make install bug when doing consecutive make installs.
2010-05-11 Adam Moody <moody20@llnl.gov>
* scripts/scr_hostlist.pm,
scripts/scr_list_down_nodes.in,
scripts/scr_flush.in,
scripts/scr_inspect.in,
scripts/scr_glob_hosts.in,
scripts/scr_test_runtime.in,
scripts/README : Replaced Hostlist.pm with scr_hostlist.pm to remove dependency.
* scripts/scr_check_complete.in : Don't add entries for XOR files in scr_summary.txt
to fix bug in scr_fetch_files() which expects only FULL files to be listed.
2010-04-19 Adam Moody <moody20@llnl.gov>
* examples/makefile.examples.in : Removed hardcoded -lyogrt library from link line.
* scripts/scr_test_hostlist.in,
scripts/scr_test_datemanip.in,
scripts/scr_test_runtime.in,
scripts/scr_prerun.in : Add checks for pdsh and dshbak commands, as well as the
Hostlist and Date::Manip perl modules. Called from scr_prerun with fatal error.
* config/x_ac_prog_mpicc.m4 : Dropped check for MPICXX and modified check to
default MPICC to mpicc if nothing is found.
2010-04-06 Adam Moody <moody20@llnl.gov>
* config/x_ac_prog_pdsh : Changed check for pdsh and dshbak commands from fatal errors
to warnings, simply call pdsh and dshbak if not found. It may be a better option to
drop absolute path and just always call "pdsh" or "dshbak".
* scripts/scr_list_down_nodes.in,
scripts/scr_srun.in : Add --log flag and an additional call to scr_list_down_nodes
to print reason to screen. Disable logging unless --log is specified.
* scripts/scr_check_node.in : Added test to touch and rm a file from directories.
2010-04-01 Adam Moody <moody20@llnl.gov>
* src/scr.c : Added logic to support SCR_GLOBAL_RESTART. When enabled, SCR will
distribute files on a restart, flush, and then clear the cache of all checkpoints.
Useful for codes which must restart from the parallel file system. Also fixed
a bug where flush file was not being restored to a consistent state on a rebuild.
* src/scr_io.c,
src/scr_rebuild_xor.c,
src/scr.c : Modified scr_read_pad_n and scr_write_pad_n to return an error
rather than exit upon an I/O failure. Modified XOR rebuild to handle such
errors, with the intent to better detect failures when the parallel file system
is full.
2010-03-30 Adam Moody <moody20@llnl.gov>
* src/scr_io.c,
src/scr_io.h : Added scr_read_attempt and scr_write_attempt to return
error if read or write fails. Have scr_copy_to return error if there
is a failure and also unlink destination file.
2010-03-24 Adam Moody <moody20@llnl.gov>
* : tag scr-1.1-6
* README : merged deps.txt into README.
2010-03-22 Adam Moody <moody20@llnl.gov>
* doc/scr.pdf : Updated user manual to match new implementation.
2010-03-12 Adam Moody <moody20@llnl.gov>
* src/scr_meta.c,
src/scr_meta.h,
src/scr_rebuild_xor.c,
src/scr.c : Moved scr_compute_crc from scr to scr_meta so it could be called by
scr_rebuild_xor. Now compute CRC after an XOR rebuild at drain time.
2010-03-05 Adam Moody <moody20@llnl.gov>
* scripts/scr_param.pm.in,
scripts/scr_list_down_nodes.in,
scripts/scr_cntl_dir.in : Added support to run without system config file. Need
to be sure any compile-time settings in here also match those in scr_conf.h.
Also, param->get_hash() now returns a reference to a hash instead of a hash.
* src/scr.c : Added scr_in_checkpoint flag which gets set to 1 in Start_checkpoint
and back to zero in Complete_checkpoint. Route_file checks this flag to determine
whether it is being called in between Start/Complete or not. If not, Route_file
now returns an error if the routed filename cannot be accessed.
* src/scr_meta.h,
src/scr_meta.c,
src/scr_copy_xor.c,
src/scr.c,
scripts/scr_check_complete.in : Dropped src_ fields from meta data structure
and file, which have not been needed for some time.
* scr.mysql,
src/scr_log.c : Added a types table to enable type string to id lookups,
and added some more indices to events and transfers table to speed expected
queries.
* src/scr_io.c,
src/scr_io.h,
src/scr_crc32.c,
src/scr.c : Added scr_crc32 and scr_compute_crc functions to compute a crc32
given a filename. Added scr_crc_on_copy for LOCAL and XOR (done in copy files)
set crc32 value at Complete checkpoint. Added scr_crc_on_delete to check crc32
on file before deleting from the cache (to monitor for bad drives in cache).
* src/scr_conf.h : Consolidated compile-time defaults to a single file which
all other files include.
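The crc32 entry in this block describes computing a crc32 given a filename. The sketch below assumes the common zlib-compatible CRC-32 (reflected polynomial 0xEDB88320); the changelog does not say which variant or implementation SCR uses, and the function names here are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Update a running CRC-32 with len bytes from buf.
 * Start with crc = 0 and feed successive chunks. */
uint32_t crc32_update(uint32_t crc, const unsigned char* buf, size_t len)
{
  assert(buf != NULL || len == 0);
  crc = ~crc;
  for (size_t i = 0; i < len; i++) {
    crc ^= buf[i];
    for (int bit = 0; bit < 8; bit++) {
      /* shift right; apply the polynomial when the low bit was set */
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

/* Compute the CRC-32 of a file, reading it in chunks.
 * Returns 0 on success and stores the value in *crc_out; -1 on I/O error. */
int crc32_file(const char* path, uint32_t* crc_out)
{
  assert(path != NULL && crc_out != NULL);
  FILE* fp = fopen(path, "rb");
  if (fp == NULL) {
    return -1;
  }

  unsigned char buf[4096];
  uint32_t crc = 0;
  size_t n;
  while ((n = fread(buf, 1, sizeof(buf), fp)) > 0) {
    crc = crc32_update(crc, buf, n);
  }

  int rc = ferror(fp) ? -1 : 0;
  fclose(fp);
  if (rc == 0) {
    *crc_out = crc;
  }
  return rc;
}
```

Storing the value when a file is written and recomputing it before the file is copied or deleted is one way to notice silent corruption, which is the use the entry describes for scr_crc_on_copy and scr_crc_on_delete.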
2010-02-19 Adam Moody <moody20@llnl.gov>
* doc/scr.pdf : Updated scr.pdf to better match current version.
2010-02-09 Adam Moody <moody20@llnl.gov>
* scripts/scr_prerun.in : Don't rm cache and control directories. Didn't seem
to be necessary in practice, and avoid risk that user or admin mistypes the
cache or control base such that an rm -rf is done at the wrong level.
* scripts/scr_srun.in : Fixed bug to handle case when calling scr_env --runnodes
when no node file exists.
* src/scr.c : Fixed bug when writing exit reason to halt file. Fixed minor bugs
when reporting checkpoint number on rebuild.
* : Changed build system to use MPICC on scr library but CC on everything else.
Basic serial utilities programs had MPI library dependency otherwise.
2010-02-05 Adam Moody <moody20@llnl.gov>
* src/scr.c : Fixed bug in Need_checkpoint where only rank 0 was calling broadcast
when checking whether it was time to halt.
2010-02-01 Adam Moody <moody20@llnl.gov>
* : Added support for a user configuration file in addition to compile-time constants
and the system configuration file.
* : Added code to enable multiple checkpoint configurations using multiple caches
within a single run. Reads configuration from user config file.
2010-01-12 Adam Moody <moody20@llnl.gov>
* : Converted filemap, nodes, and halt files to use hash file format.
Added initial code for asynchronous flush support.
Fixed efficiency bug which caused each file to be flushed twice under XOR.
2009-12-21 Adam Moody <moody20@llnl.gov>
* : retag scr-1.1-5
* : Prepared and uploaded tarball for distribution on sourceforge.
2009-12-04 Adam Moody <moody20@llnl.gov>
* : tag scr-1.1-5
* : Converted to use autoconf tools.
* src/scr_param.h,
src/scr_param.c : Added support for configuration file.
* src/scr_log.h,
src/scr_log.c,
src/scr_log_event.c,
src/scr_log_transfer.c : Added support for logging events and file
transfers.
2009-10-01 Adam Moody <moody20@llnl.gov>
* : tag scr-1.1-4
* src/scr.c : Added some logic to SCR_Need_checkpoint based on settings
user sets via SCR_CHECKPOINT_{OVERHEAD,SECONDS,FREQUENCY} variables.
* bin.in/scr_check_limits,
bin.in/scr_cache_dir : Call scr_cache_dir to get cache directory
for checking limits. Override /tmp via SCR_CACHE_BASE.
* bin.in/scr_srun : Added logic to track and record run time of each step
and node failures in syslog.
2009-09-09 Adam Moody <moody20@llnl.gov>
* : Removed more references to slurm and /tmp where found.
* src/scr_io.c : Return SCR_SUCCESS if scr_mkdir() succeeds,
and SCR_FAILURE otherwise.
* src/scr.c : Check return code on mkdir of cache and cntl directories.
2009-09-04 Adam Moody <moody20@llnl.gov>
* : Separated cache into control (job state files) and cache (checkpoint
file cache) so that user can target different directories (i.e., devices)
to use as the cache. Keep the control in a well-known location so SCR
utility scripts know where to look for files.
* test/test_common.h,
test/test_common.c : Consolidated code common to test_correctness.c
and test_correctness_1.1.c.
* bin.in/scr_check_limits: Renamed scr_report_tmp_limits to remove reference
to /tmp.
* : Now set SCR_PREFIX to current working directory if not specified.
2009-08-20 Adam Moody <moody20@llnl.gov>
* : Added logic to makefile to support AIX. Moved bin/in to bin.in and
man/man1/in to man.in/man1 so that "make clean" can simply remove bin
and man directories.
* src/scr.c : Fixed a bug that led to calling MPI_Comm_split with negative
color values during SCR_Init(). Fixed another bug when calling MPI_Bcast
using receive buffers with an upper bound rather than the precise byte
counts. Fixed a bug with calling dirname/basename recursively without
a strdup of the result. Modified flock() calls to adhere to stricter AIX
requirements.
* src/scr_meta.h,
src/scr_meta.c : Extracted meta data functions from scr_io.c into their
own file for consistency.
* src/scr_crc32.c : Implement crc32 command functionality to remove
dependency.
* bin.in/scr_glob_hosts : Uses Hostlist.pm to implement functionality
needed from glob-hosts to remove dependency.
* bin.in/scr_env : Abstracts most uses of SLURM commands to a single
interface, which should reduce the effort in supporting other resource
managers. Still heavy use in scr_halt.
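One of the src/scr.c fixes in this block concerns calling dirname/basename without a strdup of the result. POSIX allows dirname() to modify its argument and to return a pointer to static storage, so a later call can overwrite an earlier result. A defensive wrapper might look like the following sketch (dirname_dup is a hypothetical name, not an SCR function):

```c
/* feature-test macro so strdup() is declared under strict -std modes */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <libgen.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of the directory part of path,
 * or NULL on allocation failure. The caller frees the result. */
char* dirname_dup(const char* path)
{
  assert(path != NULL);

  /* dirname() may modify its argument, so hand it a private copy */
  char* tmp = strdup(path);
  if (tmp == NULL) {
    return NULL;
  }

  /* the result may point into tmp or into static storage,
   * so duplicate it before freeing tmp or calling dirname() again */
  char* dir = strdup(dirname(tmp));
  free(tmp);
  return dir;
}
```

Because each result is an independent allocation, the wrapper can be applied repeatedly to walk up a path.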
2009-07-21 Adam Moody <moody20@llnl.gov>
* : tag scr-1.1-3
* bin/in/scr_retries_halt : Another bug -- forgot to chomp the cache_dir
output to strip the trailing newline.
2009-07-17 Adam Moody <moody20@llnl.gov>
* : tag scr-1.1-2
* src/scr.c,
src/scr_halt.c,
src/scr_halt_cntl.c,
include/scr_halt.h : Protect access to halt file with flock. Added
command to be invoked on node 0 from scr_halt to access and modify halt
file. This lets user change halt settings without worrying about state
of application as both the script and the app coordinate via flock().
* src/scr_hash.c : Fixed heap corruption bug.
* bin/in/scr_retries_halt : Fixed script typo.
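The halt file entry in this block relies on flock() so that the application and the scr_halt command never modify the file at the same time. A minimal sketch of that pattern is below; the helper names and the file mode are assumptions, and the real code presumably reads and rewrites the file while the lock is held:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Open (creating if needed) a file and take an exclusive advisory lock.
 * Blocks until the lock is available. Returns the descriptor or -1. */
int open_locked(const char* path)
{
  assert(path != NULL);
  int fd = open(path, O_RDWR | O_CREAT, 0600);
  if (fd < 0) {
    return -1;
  }
  if (flock(fd, LOCK_EX) != 0) {
    close(fd);
    return -1;
  }
  return fd;
}

/* Release the lock and close the file. Returns 0 on success, -1 otherwise. */
int close_locked(int fd)
{
  int rc = flock(fd, LOCK_UN);
  if (close(fd) != 0) {
    rc = -1;
  }
  return rc;
}
```

The lock is advisory: it only protects the file if every reader and writer goes through the same open_locked/close_locked pairing.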
2009-07-07 Adam Moody <moody20@llnl.gov>
* : tag scr-1.1-1
* doc/manual/scr.tex : Updated user manual with more detail on the interpose
library and the steps needed to integrate the API into an application.
2009-07-01 Adam Moody <moody20@llnl.gov>
* : retag scr-1.1-0
* bin/in/scr_cache_dir: Added script to return checkpoint cache directory name.
Changed default cache location from /tmp/scr to /tmp/username/scr.jobid.
Added code to scr_prerun to delete the cache in a new job (in case user
had a previous run with the same id on a system that does not clear /tmp).
2009-06-29 Adam Moody <moody20@llnl.gov>
* bin/in/scr_flush_file: Added to run queries against new flush file format.
Modified scr_postrun, scr_flush, and scr_copy to call on scr_flush_file.
2009-06-05 Adam Moody <moody20@llnl.gov>
* : Changed the API and updated the implementation. Lots of big changes:
- Removed SCR_Start/Route/Complete.
- Added SCR_Need_checkpoint/Start_checkpoint/Route_file/Complete_checkpoint.
- Added data structures and logic to track multiple files per process per checkpoint.
Now each process may write 0 or more files per checkpoint.
- Can now store a rolling window of checkpoints in the cache, set SCR_CACHE_SIZE.
- Can now run (and restart) with a different number of processes per node.
- Removed first XOR algorithm altogether, changed XOR2 to XOR.
- Updated scr_interpose to use new API and handle multiple files per process.
- Updated test_correctness.c to use new API and created test_correctness_1.1.c.
- Updated scr.pdf to discuss new API and features.
2009-05-04 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0-19
* src/scr_xor2_rebuild.c : Fixed bug in printf statements
where file descriptor was being printed instead of file
name. Led to core dumps on rebuild attempts.
* package.conf : Set OPT to -g -O3 before make to ensure
optimization is enabled for production build.
* src/scr.c : Changed MPI_CHAR to MPI_BYTE.
2009-04-20 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0-18
* src/scr.c : Added XOR2 copy and rebuild logic.
Moved old XOR to XOR1 and set copy_xor2 as new XOR.
Set SCR_XOR_SIZE=8 and SCR_COPY_TYPE=XOR as new defaults.
Fixed SCR_PARTNER_DISTANCE to select different
nodes for xor sets, e.g., every other node, etc.
* src/scr_io.c,
src/scr_err_serial.c,
src/scr_copy_xor2.c,
include/scr_io.h,
include/scr_err.h,
include/scr_copy_xor2.h : Moved shared code to separate files.
* bin/in/scr_check_complete,
src/scr_xor_rebuild.c,
src/scr_xor2_rebuild.c : Added code to rebuild missing files
via XOR2 redundancy scheme. Fixed a bug to rescan .scr files
before writing scr_summary.txt file after a rebuild.
2009-02-23 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0-17
* src/scr.c : Fixed bug in SCR_Route to echo passed in filename when
SCR is disabled. Also have SCR_Init create the directory in this case.
Added crc32 capability to flush/fetch operations, and now try
/old if fetch of /current fails.
* src/scr_interpose.c,
bin/in/srun-scr : Updated to match latest API and scr.h.
* bin/in/scr_srun : Merged srun-scr into scr_srun, sets LD_PRELOAD if
SCR_CHECKPOINT_PATTERN is set.
* bin/scr_list_down_nodes,
bin/in/scr_srun : Added logic to continue to exclude nodes if they
are ever marked down, even if they come back healthy to avoid deleting
a checkpoint set during distribute on a restart.
* bin/in/scr_check_complete,
etc/scr_meta.pm,
bin/in/scr_copy : Added crc32 option to scr_copy to compute crc32 value
of files and update meta data files before copying files to the parallel file system.
Updated scr_check_complete to match new scr_summary.txt format. Added scr_meta.pm
to abstract reading and writing of meta data files.
2009-01-15 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0-16
* src/scr.c : Added SCR_FLUSH_ON_RESTART to force a flush on restart.
Fixed bug where a flush could be missed when restarting.
* bin/in/scr_srun : Added SCR_EXCLUDE_NODES to manually avoid known
problem nodes in scr_srun.
* bin/scr_srun,
bin/scr_prerun,
bin/scr_postrun : Added logic to process SCR_ENABLE.
* bin/scr_srun : Changed SCR_RETRIES to SCR_RUNS which is less confusing.
* man/* : Added man pages for user interface commands.
* package.conf : Copy scripts to /usr/local/bin and man pages to /usr/local/man.
* bin/in : Dropped .in extension from scripts.
2008-12-05 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_15
* src/scr.c : Typo in debug output in fetch and flush.
2008-12-05 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_14
* src/scr.c : Forgot to print debug output in scr_copy_files.
2008-12-05 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_13
* src/scr.c : Fixed some bugs for the flush with XOR. Changed scr_fetch
to be non-fatal per Bert's request and updated SCR_FLUSH to go by global
checkpoint id (also now read in global checkpoint id during scr_fetch).
2008-12-01 Adam Moody <moody20@llnl.gov>
* src/scr.c : Added SCR_Route() to build full path to checkpoint file.
Added support for SCR_FETCH to read files in from Lustre into /tmp
during SCR_Init. Added support for SCR_FLUSH to flush files out to
Lustre with some frequency in SCR_Complete.
* : changed "drain" to "flush" to better match cache terminology.
* bin/in/scr_flush.in : Added handling for a failed pdsh, where scr_flush
was not retrying after a failed node.
2008-11-04 Adam Moody <moody20@llnl.gov>
* : retag scr-1.0_11
* package.conf,
makefile,
docs/manual/makefile : Add scr.pdf as binary file rather than
try to build pdf via latex source during package build.
2008-11-03 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_11
* docs/*,
makefile,
README : Added scr.pdf documentation built from latex via pdflatex.
Point readers of README file to scr.pdf file.
* bin/in/scr_halt.in,
bin/in/scr_retries_halt.in,
setdir : Added tmp prefix to scr_halt.
Added a number of new options to scr_halt, -c, -b, -a, -r.
Created scr_retries_halt to check whether retries should be halted
based on info in halt file.
* test/test_correctness.c : Added sleep option to sleep between checkpoints.
Added usage if given incorrect number of options.
2008-10-27 Adam Moody <moody20@llnl.gov>
* src/scr.c : Now unlink xor segments as they are gathered and appended
one-by-one to the XOR file during the gather. (Need to conserve memory.)
Changed 'SCR aborting' to 'Job exiting' which is less alarming. Rewrote
XOR algorithm to keep data in memory longer during reduce-scatter. Added
checkpoint id to drain.scrinfo file.
* bin/in/scr_check_complete.in,
setdir : Added logic to rebuild missing checkpoint files
from XOR segments if possible. Moved file to /in directory
since it now calls scr_xor_rebuild.
* src/scr_xor_rebuild.c,
makefile : The C program to actually rebuild missing checkpoint
files from XOR segments.
* bin/scr_report_tmp_limits,
bin/scr_prerun : Added "scr_" prefix to report_tmp_limits to be consistent.
* bin/scr_prerun,
bin/scr_postrun : Changed -f (from) and -t (to) options to be -p (prefix)
and -t (tmp), which is less confusing. Dropped old -p (pattern) and -c (ppn)
options since they weren't used anyway.
* bin/scr_halt : Renamed scr_kill to scr_halt. Have script mkdir /tmp/scr in
case it tries to touch halt file before the directory has been created.
* bin/in/scr_drain.in,
bin/scr_copy : When given an --id <id> argument, only copy files from node
if there is a drain file and its id matches the one requested.
* bin/scr_list_down_nodes,
bin/in/scr_srun.in,
bin/in/scr_drain.in : Print list of down (according to SLURM down state)
or unresponsive nodes (according to ping test) to stdout. Used to exclude
nodes in scr_drain and scr_srun.
2008-06-29 Adam Moody <moody20@llnl.gov>
* src/scr.c : Added SCR_Read() and SCR_Write() calls as simple wrappers.
* src/scr.c,
test/test_copy.c,
test/test_correctness.c,
test/test_demo.c,
test/test_interpose.c,
test/test_openclose.c,
test/test_signal.c,
test/time_all.c : Declared internal variables and functions as static to
not export symbols externally.
2008-04-23 Adam Moody <moody20@llnl.gov>
* : retag scr-1.0_10
* package.conf : Had to reorder make / setdir commands since
make includes a clean, which now cleans up the scripts in /bin.
2008-04-23 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_10
* test/test_demo.c : Added simulated application
which repeatedly computes and writes new checkpoints
at each timestep.
* src/scr.c,
bin/scr_copy : Added scr_need_drain() and scr_halt()
functions which write out drain.scrinfo and halt.scrinfo
files respectively in /tmp/scr to pass info to postrun scripts.
* bin/scr_kill : Kill jobs by touching halt.scrinfo file in /tmp/scr
via pdsh command to entire nodeset of job.
* bin/in/scr_srun.in,
setdir : Wrapper of scr_prerun/srun/retries/scr_postrun.
* bin/in/scr_postrun.in : Set SCR_MARK_COMPLETE to be on by default.
* makefile : Created 'scripts' target, clean scripts in make clean,
and added test_demo.c
2008-04-22 Adam Moody <moody20@llnl.gov>
* : retag scr-1.0_9
* src/scr. : Forgot to guard filemap with scr_distribute flag.
Set scr_copy to ignore *.filemap files (since these don't have
unique names yet).
2008-04-22 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_9
* src/scr. : Implemented distribution logic for XOR files.
Moved Async to type=3, now Local=0, which has no copy.
2008-04-10 Adam Moody <moody20@llnl.gov>
* src/scr. : Most functions now return SCR_SUCCESS or something
other than SCR_SUCCESS upon failure. Fixed a bug in
scr_split_path.
* makefile : Added OPT variable to makefile to quickly change
optimization flags. Building with -O3 vs -O0 has a big impact
on the XOR performance.
2008-03-26 Adam Moody <moody20@llnl.gov>
* src/scr.c : Added scr_distributefiles() and related functions
which shuffle files around during MPI_Init() with the goal
of priming up spare nodes to take over for failed nodes.
Also added scr_xorcopy() and related functions which compute
an xor file for a group of checkpoint files and distributes
the blocks of the xor files over the nodes.
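The XOR copy described above rests on a simple property: the XOR of a set of equal-length buffers can regenerate any one missing buffer from the others. The sketch below shows only that principle on in-memory buffers; it does not reproduce how scr_xorcopy() splits files into blocks or distributes them across nodes, and every name in it is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* XOR len bytes of src into dst. */
void xor_into(unsigned char* dst, const unsigned char* src, size_t len)
{
  for (size_t i = 0; i < len; i++) {
    dst[i] ^= src[i];
  }
}

/* Compute parity = bufs[0] ^ bufs[1] ^ ... ^ bufs[count-1]. */
void xor_parity(unsigned char* parity, const unsigned char* const* bufs,
                size_t count, size_t len)
{
  assert(parity != NULL && bufs != NULL);
  memset(parity, 0, len);
  for (size_t k = 0; k < count; k++) {
    xor_into(parity, bufs[k], len);
  }
}

/* Rebuild the buffer at index missing from the parity and the survivors.
 * bufs[missing] is never read, so it may be NULL. */
void xor_rebuild(unsigned char* out, const unsigned char* parity,
                 const unsigned char* const* bufs,
                 size_t count, size_t missing, size_t len)
{
  assert(out != NULL && parity != NULL && bufs != NULL && missing < count);
  memcpy(out, parity, len);
  for (size_t k = 0; k < count; k++) {
    if (k != missing) {
      xor_into(out, bufs[k], len);
    }
  }
}
```

This tolerates the loss of one member per XOR set; losing two members of the same set cannot be recovered from a single parity.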
2008-03-17 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_8
* src/scr.c,
include/scr.h : Added SCR_Open and SCR_Close calls.
Added scr_err and scr_dbg calls to print messages.
Added code to create xor file fragments from checkpoint
files, reduces memory footprint.
* test/test_openclose.c : Added test for SCR_Open/Close.
2008-03-06 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_7
* makefile : Copy .so files to lib instead of lib/shared.
* package.conf : Cleaned up copy of files during install
using $bindir, $libdir, etc from dpkg-mkdeb.
2008-03-05 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_6
* scr.lcc : Renamed scr.lcc.in.
2008-03-05 Adam Moody <moody20@llnl.gov>
* package.conf,
scr.lcc.in : Simplifying LCC plugin so others can use
this as a template to create their own plugins.
2008-02-13 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_5
* package.conf : Changed to use $lccdir and $pkgname which
are macros defined by the dpkg-mkdeb script.
2008-02-11 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_4
* package.conf,
setdir : Write to /usr/local/etc/lcc instead of
/usr/local/etc/lcc-plugins. Set @PREFIX@ to $prefix
since we'll source scr .macros file. Use PKG_DEFAULT=scr.
* scr.plugin.in,
scr.lcc.in : Renamed plugin file.
2008-02-06 Adam Moody <moody20@llnl.gov>
* package.conf : Set short desc to match format of MPI libraries.
2007-12-17 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_3
* bin/scr_flush,
bin/in/presrun,
bin/in/postsrun,
setdir : Cleanup of bin directory: removed scr_flush,
renamed presrun to scr_prerun and postsrun to scr_postrun.
* EXAMPLES,
IMPLEMENTATIONS : Removed during cleanup.
* README : Added note about SCR_USE_SIGNAL, SCR_ABORT_SIGNAL,
and SCR_IMMABORT_SIGNAL. Changed presrun / postsrun references.
* src/scr.c : Added support for SCR_IMMABORT_SIGNAL, which
is a second registered signal that aborts the job immediately.
* src/src_utility.c : Added code to handle EINTR and EAGAIN errors
on read and write.
* bin/scr_kill : A script to safely kill an scr job via sending signals.
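The read/write change in this block handles EINTR and EAGAIN. A common way to do that is to loop until the full request is satisfied, retrying when the call was merely interrupted. The following is a sketch under that assumption, with hypothetical function names; note that retrying on EAGAIN spins on a non-blocking descriptor, so real code may prefer to wait with poll() or limit the retries:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Read up to size bytes, retrying on EINTR and EAGAIN.
 * Returns the bytes read (less than size only at end of file), or -1. */
ssize_t read_retry(int fd, void* buf, size_t size)
{
  assert(buf != NULL || size == 0);
  char* p = buf;
  size_t total = 0;
  while (total < size) {
    ssize_t n = read(fd, p + total, size - total);
    if (n > 0) {
      total += (size_t) n;
    } else if (n == 0) {
      break; /* end of file */
    } else if (errno != EINTR && errno != EAGAIN) {
      return -1; /* real error */
    }
  }
  return (ssize_t) total;
}

/* Write exactly size bytes, retrying on EINTR and EAGAIN.
 * Returns size on success, or -1 on error. */
ssize_t write_retry(int fd, const void* buf, size_t size)
{
  assert(buf != NULL || size == 0);
  const char* p = buf;
  size_t total = 0;
  while (total < size) {
    ssize_t n = write(fd, p + total, size - total);
    if (n >= 0) {
      total += (size_t) n;
    } else if (errno != EINTR && errno != EAGAIN) {
      return -1; /* real error */
    }
  }
  return (ssize_t) total;
}
```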
2007-12-14 Adam Moody <moody20@llnl.gov>
* src/scr.c : Added scr_use_signal_immediate as placeholder.
Want to let the user choose between
1) exit immediately (unless in middle of checkpoint)
2) abort at end of next checkpoint, whenever that may be
Currently only supports option 2, we could catch a second
signal or read variable setting to also enable mode 1.
* makefile,
test/test_signal.c : Link statically. Removed Awake printf.
* src/scr_old.c : Moved to dev since signal code added in main
scr.c file now.
2007-12-14 Adam Moody <moody20@llnl.gov>
* src/scr.c : Added signal handler to safely abort job, i.e.,
don't want to kill it during the middle of a checkpoint.
* src/scr.c : Removed SCR_Flush() code, which was commented out anyway.
* makefile,
test/test_signal.c : Added test program to check signal support.
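The signal handler entry above aims to avoid killing a job in the middle of a checkpoint. One standard way to get that behavior is for the handler to only set a flag, which the library checks once the checkpoint is complete. The sketch below illustrates the pattern; the names are hypothetical and it is not taken from scr.c:

```c
/* feature-test macro so sigaction() is declared under strict -std modes */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler, read by the main code path. */
static volatile sig_atomic_t halt_requested = 0;

/* Handler: record the request and return; no I/O, no exit. */
static void on_halt_signal(int signo)
{
  (void) signo;
  halt_requested = 1;
}

/* Register the handler for signo. Returns 0 on success, -1 on error. */
int install_halt_handler(int signo)
{
  assert(signo > 0);
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = on_halt_signal;
  sigemptyset(&sa.sa_mask);
  return sigaction(signo, &sa, NULL);
}

/* Call at a safe point, e.g. right after a checkpoint completes. */
int halt_pending(void)
{
  return halt_requested != 0;
}
```

An application loop would call halt_pending() after each completed checkpoint and shut down cleanly if it returns nonzero.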
2007-12-10 Adam Moody <moody20@llnl.gov>
* : tag scr-1.0_2
2007-12-03 Adam Moody <moody20@llnl.gov>
* package.conf : Updated to VERSION 2.
* package.conf : Remove src/dev subdirectory during install.
* package.conf : Removed PKG_DEPENDS=lcc (only build-time dependency).