commit 68b88aca6692c75a9f686187e6c4a4e196ae60a9 (HEAD -> master, tag: 0.7.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Apr 7 14:41:44 2020 -0500
Version file update (0.7.0)
commit b04de636c1702e4cb8e7ad82bab3cf43d2dbdfc6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Apr 7 14:37:43 2020 -0500
ReleaseNotes.md update in advance of next version.
Details:
- Updated docs/ReleaseNotes.md in preparation for next version.
commit 2cb604ba472049ad498df72d4a2dc47a161d4c3c (origin/master, origin/dev, origin/amd, origin/HEAD, dev, amd)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Apr 6 16:42:14 2020 -0500
Rename more bli_thread_obarrier(), _obroadcast().
Details:
- Renamed instances of bli_thread_obarrier() and bli_thread_obroadcast()
that were made in the supmt-specific code committed to the 'amd'
branch, which has now been merged with 'master'. Prior to the merge,
'master' received commit c01d249, which applied these renamings to
the existing, non-sup codebase.
commit efb12bc895de451067649d5dceb059b7827a025f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Apr 6 15:01:53 2020 -0500
Minor updates/elaborations to RELEASING file.
commit 2e3b3782cfb7a2fd0d1a325844983639756def7d
Merge: 9f3a8d4d da0c086f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Apr 6 14:55:35 2020 -0500
Merge branch 'master' into amd
commit da0c086f4643772e111318f95a712831b0f981a8
Author: Satish Balay <balay@mcs.anl.gov>
Date: Tue Mar 31 17:09:41 2020 -0500
OSX: specify the full path to the location of libblis.dylib (#390)
* OSX: specify the full path to the location of libblis.dylib so that it can be found at runtime
Before this change:
Application gives runtime error [when linked with blis]
dyld: Library not loaded: libblis.3.dylib
balay@kpro lib % otool -L libblis.dylib
libblis.dylib:
libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0)
After this change:
balay@kpro lib % otool -L libblis.dylib
libblis.dylib:
/Users/balay/petsc/arch-darwin-c-debug/lib/libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0)
* INSTALL_LIBDIR -> libdir as INSTALL_LIBDIR has DESTDIR
Co-Authored-By: Jed Brown <jed@jedbrown.org>
* CREDITS file update.
Co-authored-by: Jed Brown <jed@jedbrown.org>
Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
commit 2bca03ea9d87c0da829031a5332545d05e352211
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Sat Mar 28 22:10:00 2020 +0000
Updates, tweaks to runme.sh in test/1m4m.
Details:
- Made several updates to test/1m4m/runme.sh, including:
- Added missing handling for 1m and 4m1a implementations when setting
the BLIS_??_NT environment variables.
- Added support for using numactl to run the test executables.
- Several other cleanups.
commit c40a33190b94af5d5c201be63366594859b1233f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Thu Mar 26 16:55:00 2020 -0500
Warn user when auto-detection returns 'generic'.
Details:
- Added logic to configure that causes the script to output a warning
to the user if/when "./configure auto" is run and the underlying
hardware feature detection code is unable to identify the hardware.
In these cases, the auto-detect code will return 'generic', which
is likely not what the user expected, and a flag will be set so that
a message is printed at the end of the configure output. (Thankfully,
we don't expect this scenario to play out very often.) Thanks to
Devin Matthews for suggesting this fix (#384).
commit 492a736fab5b9c882996ca024b64646877f22a89
Author: Devin Matthews <damatthews@smu.edu>
Date: Tue Mar 24 17:28:47 2020 -0500
Fix vectorized version of bli_amaxv (#382)
* Fix vectorized version of bli_amaxv
To match Netlib, i?amax should return:
- the lowest index among equal values
- the first NaN if one is encountered
* Fix typos.
* And another one...
* Update ref. amaxv kernel too.
* Re-enabled optimized amaxv kernels.
Details:
- Re-enabled the optimized, intrinsics-based amaxv kernels in the 'zen'
kernel set for use in haswell, zen, zen2, knl, and skx subconfigs.
These two kernels (for s and d datatypes) were temporarily disabled in
e186d71 as part of issue #380. However, the key missing semantic
properties that prompted the disabling of these kernels--returning the
index of the *first* rather than of the last element with largest
absolute value, and returning the index of the first NaN if one is
encountered--were added as part of #382 thanks to Devin Matthews.
Thus, now that the kernels are working as expected once more, this
commit causes these kernels to once again be registered for the
affected subconfigs, which effectively reverts all code changes
included in e186d71.
- Whitespace/formatting updates to new macros in bli_amaxv_zen_int.c.
Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
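The two semantic properties named in this commit (lowest index among equal values, and the index of the first NaN if one is encountered) can be illustrated with a small scalar routine. This is only a sketch of the behavior as the commit message states it, not the BLIS reference or 'zen' kernel; the name amax_index, the zero-based index, and the return value of 0 for an empty vector are assumptions made for illustration.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical illustration (not a BLIS kernel): returns the zero-based
   index of the first NaN in x, or else the lowest index holding the
   largest absolute value. Returns 0 when n == 0 (arbitrary choice). */
size_t amax_index(const double *x, size_t n)
{
    assert(x != NULL || n == 0);

    size_t best = 0;
    double best_abs = -1.0;            /* smaller than any absolute value */

    for (size_t i = 0; i < n; ++i) {
        if (isnan(x[i]))
            return i;                  /* the first NaN wins immediately */

        double a = x[i] < 0.0 ? -x[i] : x[i];
        if (a > best_abs) {            /* strict '>' keeps the lowest index on ties */
            best_abs = a;
            best = i;
        }
    }
    return best;
}
```

The strict comparison is what keeps the first of several equal maxima; a kernel that compares with '>=' (or that reduces vector lanes without tracking indices) would tend to report a later one instead.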
commit e186d7141a51f2d7196c580e24e7b7db8f209db9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Sat Mar 21 18:40:36 2020 -0500
Disabled optimized amaxv kernels.
Details:
- Disabled use of optimized amaxv kernels, which use vector intrinsics
for both 's' and 'd' datatypes. We disable these kernels because the
current implementations fail to observe a semantic property of the
BLAS i?amax_() subroutine, which is to return the index of the
*first* element containing the maximum absolute value (that is, the
first element if there exist two or more elements that contain the
same value). With the optimized kernels disabled, the affected
subconfigurations (haswell, zen, zen2, knl, and skx) will use the
default reference implementations. Thanks to Mat Cross for reporting
this issue via #380.
- CREDITS file update.
commit 9f3a8d4d851725436b617297231a417aa9ce8c6a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Sat Mar 14 17:48:43 2020 -0500
Added missing return to bli_thread_partition_2x2().
Details:
- Added a missing return statement to the body of an early case handling
branch in bli_thread_partition_2x2(). This bug only affected cases
where n_threads < 4, and even then, the code meant to handle cases
where n_threads >= 4 executes and does the right thing, albeit using
more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti
for reporting this bug via issue #377.
- Whitespace changes to bli_thread.c (spaces -> tabs).
commit 8c3d9b9eeb6f816ec8c32a944f632a5ad3637593
Merge: 71249fe8 0f9e0399
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Mar 10 14:03:33 2020 -0500
Merge branch 'amd' of github.com:flame/blis into amd
commit 71249fe8ddaa772616698f1e3814d40e012909ea
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Mar 10 13:55:29 2020 -0500
Merged test/sup, test/supmt into test/sup.
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
to compile and run both single-threaded and multithreaded experiments.
This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
previous test/sup/octave scripts) as well as a test/sup/octave_mt
directory (based on the previous test/supmt/octave scripts). The
octave scripts are slightly different and not easily mergeable, and
thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
the previous test/supmt directory as test/sup/old/supmt.
commit 0f9e0399e16e96da2620faf2c0c3c21274bb2ebd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Thu Mar 5 17:03:21 2020 -0600
Updated sup performance graphs; added mt results.
Details:
- Reran all existing single-threaded performance experiments comparing
BLIS sup to other implementations (including the conventional code
path within BLIS), using the latest versions (where appropriate).
- Added multithreaded results for the three existing hardware types
showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc
(Zen1).
- Various minor updates to the text in docs/PerformanceSmall.md.
- Updates to the octave scripts in test/sup/octave, test/supmt/octave.
commit 90db88e5729732628c1f3acc96eeefab49f2da41
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Mar 2 15:06:48 2020 -0600
Updated sup[mt] Makefiles for variable dim ranges.
Details:
- Updated test/sup/Makefile and test/supmt/Makefile to allow specifying
different problem size ranges for the drivers where one, two, or three
matrix dimensions are large. This will facilitate the generation of
more meaningful graphs, particularly when two dimensions are tiny.
commit 31f11a06ea9501724feec0d2fc5e4644d7dd34fc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Thu Feb 27 14:33:20 2020 -0600
Updates to octave scripts in test/sup[mt]/octave.
Details:
- Optimized scripts in test/sup/octave and test/supmt/octave for use
with octave 5.2.0 on Ubuntu 18.04.
- Fixed stray 'end' keywords in gen_opsupnames.m and plot_l3sup_perf.m,
which were not only unnecessary but also causing issues with versions
5.x.
commit c01d249d7c546fe2e3cee3fe071cd4c4c88b9115
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Feb 25 14:50:53 2020 -0600
Renamed bli_thread_obarrier(), _obroadcast().
Details:
- Renamed two bli_thread_*() APIs:
bli_thread_obarrier() -> bli_thread_barrier()
bli_thread_obroadcast() -> bli_thread_broadcast()
The 'o' was a leftover from when thrcomm_t objects tracked both
"inner" and "outer" communicators. They have long since been
simplified to only support the latter, and thus the 'o' is
superfluous.
commit f6e6bf73e695226c8b23fe7900da0e0ef37030c1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Feb 24 17:52:23 2020 -0600
List Gentoo under supported external packages.
Details:
- Add mention of Gentoo Linux under the list of external packages in
the README.md file. Thanks to M. Zhou for maintaining this package.
commit 9e5f7296ccf9b3f7b7041fe1df20b927cd0e914b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Feb 18 15:16:03 2020 -0600
Skip building thrinfo_t tree when mt is disabled.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
address is equal to either &BLIS_GEMM_SINGLE_THREADED or
&BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
bli_l3_sup_decor_single.c that (by default) disables code that
creates and frees the thrinfo_t tree and instead passes
&BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
sup implementation.
- The net effect of the above changes is that a small amount of
thrinfo_t overhead is avoided when running small/skinny dgemm
problems when BLIS is compiled with multithreading disabled.
commit 90081e6a64b5ccea9211bdef193c2d332c68492f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Feb 17 14:57:25 2020 -0600
Fixed bug(s) in mt sup when single-threaded.
Details:
- Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of
changing the function interface for the thread entry point function
(of type l3supint_t).
- Unfortunately, fixing the interface was not enough, as it caused
a memory leak in the sba at bli_finalize() time. It turns out that,
due to the new multithreading-capable variant code using thrinfo_t
objects--specifically, their calling of bli_thrinfo_grow()--we
have to pass in a real thrinfo_t object rather than the global
objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED.
Thus, I inserted the appropriate logic from the OpenMP and pthreads
versions so that single-threaded execution would work as intended
with the newly upgraded variants.
commit c0558fde4511557c8f08867b035ee57dd2669dc6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Feb 17 14:08:08 2020 -0600
Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
or pthreads). Both variants 1n and 2m now have the appropriate
threading infrastructure, including data partitioning logic, to
parallelize computation. This support handles all four combinations
of packing on matrices A and B (neither, A only, B only, or both).
This implementation tries to be a little smarter when automatic
threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
recalculate the factorization in units of micropanels (rather than
using the raw dimensions) in bli_l3_sup_int.c, when the final
problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
or column-stored matrices. (This is used for the rrc and crc storage
cases.) Previously, copym was used, but that would no longer suffice
because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
instead of from the variant functions. This has the effect of making
the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
and inserted usage of these functions within bli_thrinfo_init(), which
previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
tests, as well as appropriate octave/matlab scripts to plot the
resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
that specifying any BLIS_*_NT variable, even if it is set to 1, will
be considered manual specification for the purposes of determining
whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
commit d7a7679182d72a7eaecef4cd9b9a103ee0a7b42b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Feb 7 17:37:03 2020 -0600
Fixed int-to-packbuf_t conversion error (C++ only).
Details:
- Fixed an error that manifests only when using C++ (specifically,
modern versions of g++) to compile drivers in 'test' (and likely most
other application code that #includes blis.h). Thanks to Ajay Panyala
for reporting this issue (#374).
commit d626112b8d5302f9585fb37a8e37849747a2a317
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Wed Jan 15 13:27:02 2020 -0600
Removed sorting on LDFLAGS in common.mk (#373).
Details:
- Removed a line of code in common.mk that passed LDFLAGS through the
sort function. The purpose was not to sort the contents, but rather
to remove duplicates. However, there is valid syntax in a string of
linker flags that, when sorted, yields different/broken behavior.
So I've removed the line in common.mk that sorts LDFLAGS. Also, for
future use, I've added a new function, rm-dupls, that removes
duplicates without sorting. (This function was based on code from a
stackoverflow thread that is linked to in the comments for that
code.) Thanks to Isuru Fernando for reporting this issue (#373).
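The distinction drawn here, removing duplicates without reordering, can be sketched outside of make. The C routine below is only an illustration of that idea; it is not the rm-dupls function from common.mk, and the names has_word and dedup_words, the space-only word separator, and the caller-supplied output buffer are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the space-separated `list` already contains the word
   w[0..wlen-1], else 0. */
static int has_word(const char *list, const char *w, size_t wlen)
{
    const char *p = list;
    while (*p) {
        size_t len = strcspn(p, " ");
        if (len == wlen && memcmp(p, w, wlen) == 0)
            return 1;
        p += len;
        p += strspn(p, " ");
    }
    return 0;
}

/* Hypothetical helper: copies the words of `flags` into `out` (capacity
   `outsz`), dropping any word that was already emitted while keeping
   first-occurrence order. Returns 0 on success, -1 if `out` is too small. */
int dedup_words(const char *flags, char *out, size_t outsz)
{
    assert(flags != NULL && out != NULL);
    if (outsz == 0)
        return -1;

    size_t used = 0;
    out[0] = '\0';

    const char *p = flags + strspn(flags, " ");
    while (*p) {
        size_t len = strcspn(p, " ");
        if (!has_word(out, p, len)) {
            size_t need = len + (used ? 1 : 0);   /* word plus separator */
            if (used + need + 1 > outsz)
                return -1;
            if (used)
                out[used++] = ' ';
            memcpy(out + used, p, len);
            used += len;
            out[used] = '\0';
        }
        p += len;
        p += strspn(p, " ");
    }
    return 0;
}
```

Because words are emitted in the order they are first seen, the relative order of the surviving flags is unchanged, which sorting does not guarantee.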
commit e67deb22aaeab5ed6794364520190936748ef272
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Jan 14 16:01:34 2020 -0600
CHANGELOG update (0.6.1)
commit 10949f528c5ffc5c3a2cad47fe16a802afb021be (tag: 0.6.1)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Jan 14 16:01:33 2020 -0600
Version file update (0.6.1)
commit 5db8e710a2baff121cba9c63b61ca254a2ec097a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Jan 14 15:59:59 2020 -0600
ReleaseNotes.md update in advance of next version.
Details:
- Updated ReleaseNotes.md in preparation for next version.
commit cde4d9d7a26eb51dcc5a59943361dfb8fda45dea
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Jan 14 15:19:25 2020 -0600
Removed 'attic/windows' (to prevent confusion).
Details:
- Finally removed 'attic/windows' and its contents. This directory once
contained "proto" Windows support for BLIS, but we've since moved on
to (thanks to Isuru Fernando) providing Windows DLL support via
AppVeyor's build artifacts. Furthermore, since 'windows' was the only
subdirectory within 'attic', the directory path would show up in
GitHub's listing at https://github.com/flame/blis, which probably led
to someone being confused about how BLIS provides Windows support. I
assume (but don't know for sure) that nobody is using these files, so
this is admittedly a case of shoot first and ask questions later.
commit 7d3407d4681c6449f4bbb8ec681983700ab968f3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Jan 14 15:17:53 2020 -0600
CREDITS file update.
commit f391b3e2e7d11a37300d4c8d3f6a584022a599f5
Author: Dave Love <dave.love@manchester.ac.uk>
Date: Mon Jan 6 20:15:48 2020 +0000
Fix parsing in vpu_count on workstation SKX (#351)
* Fix parsing in vpu_count on workstation SKX
* Document Skylake-X as Haswell for single FMA
* Update vpu_count for Skylake and Cascade Lake models
* Support printing the configuration selected, controlled by the environment
Intended particularly for diagnosing mis-selection of SKX through
unknown, or incorrect, number of VPUs.
* Move bli_log outside the cpp condition, and use it where intended
* Add Fixme comment (Skylake D)
* Mostly superficial edits to commits towards #351.
Details:
- Moved architecture/sub-config logging-related code from bli_cpuid.c
to bli_arch.c, tweaked names, and added more set/get layering.
- Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c.
- Content, whitespace changes to new bullet in HardwareSupport.md that
relates to single-VPU Skylake-Xs.
* Fix comment typos
Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
commit 5ca1a3cfc1c1cc4dd9da6a67aa072ed90f07e867
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Jan 6 12:29:12 2020 -0600
Fixed 'configure' breakage introduced in 6433831.
Details:
- Added a missing 'fi' (endif) keyword to a conditional block added in
the configure script in commit 6433831.
commit e7431b4a834ef4f165c143f288585ce8e2272a23
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Jan 6 12:01:41 2020 -0600
Updated 1m draft article link in README.md.
commit 6433831cc3988ad205637ebdebcd6d8f7cfcf148
Author: Jeff Hammond <jeff.r.hammond@intel.com>
Date: Fri Jan 3 17:52:49 2020 -0800
blacklist ICC 18 for knl/skx due to test failures
Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>
commit af3589f1f98781e3a94a8f9cea8d5ea6f155f7d2
Author: Jeff Hammond <jeff.science@gmail.com>
Date: Fri Jan 3 13:23:24 2020 -0800
blacklist Intel 19+
Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>
commit 60de939debafb233e57fd4e804ef21b6de198caf
Author: Jeff Hammond <jeff.science@gmail.com>
Date: Wed Jan 1 21:30:38 2020 -0800
fix link to docs
the comment contains an incorrect link, which is trivially fixed here.
@fgvanzee I hope you don't mind that I committed directly to master but this cannot break anything.
commit 52711073789b6b84eb99bb0d6883f457ed3fcf80
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Dec 16 16:30:26 2019 -0600
Fixed bugs in cblas_sdsdot(), sdsdot_().
Details:
- Fixed a bug in sdsdot_sub() that redundantly added the "alpha" scalar,
named 'sb'. This value was already being added by the underlying
sdsdot_() function. Thus, we no longer add 'sb' within sdsdot_sub().
Thanks to Simon Lukas Märtens for reporting this bug via #367.
- Fixed a second bug in order of typecasting intermediate products in
sdsdot_(). Previously, the "alpha" scalar was being added after the
"outer" typecast to float. However, the operation is supposed to first
add the dot product to the (promoted) scalar and THEN downcast the sum
to float. Thanks to Devin Matthews for catching this bug.
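The order of operations described in the second fix can be sketched as follows. This is an illustration of the stated semantics only, not the BLIS or netlib source; the name sdsdot_sketch and the unit-stride interface are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the semantics described above: accumulate
   the dot product in double, add the promoted scalar sb, and only then
   downcast the sum to float. Unit strides are assumed for brevity. */
float sdsdot_sketch(size_t n, float sb, const float *x, const float *y)
{
    assert((x != NULL && y != NULL) || n == 0);

    double dot = 0.0;
    for (size_t i = 0; i < n; ++i)
        dot += (double)x[i] * (double)y[i];   /* products formed in double */

    return (float)((double)sb + dot);         /* add first, downcast once */
}
```

The difference is observable when the dot product needs more than 24 bits of significand: with x = {24929}, y = {673} (product 16777217) and sb = -16777216, adding in double and then downcasting gives 1, whereas downcasting the dot product first rounds it to 16777216 and the sum becomes 0.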
commit fe2560a4b1d8ef8d0a446df6002b1e7decc826e9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Dec 6 17:12:44 2019 -0600
Annotated missing thread-related symbols for export.
Details:
- Added BLIS_EXPORT_BLIS annotation to function prototypes for
bli_thrcomm_bcast()
bli_thrcomm_barrier()
bli_thread_range_sub()
so that these functions are exported to shared libraries by default.
This (hopefully) fixes issue #366. Thanks to Kyungmin Lee for
reporting this bug.
- CREDITS file update.
commit 2853825234001af8f175ad47cef5d6ff9b7a5982
Merge: efa61a6c 61b1f0b0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Dec 6 16:06:46 2019 -0600
Merge branch 'master' into amd
commit 61b1f0b0602faa978d9912fe58c6c952a33af0ac
Author: Nicholai Tukanov <nicholai@utexas.edu>
Date: Wed Dec 4 14:18:47 2019 -0600
Add prototypes for POWER9 reference kernels (#365)
Updates and fixes to power9 subconfig.
Details:
- Register s,c,z reference gemm and trsm ukernels that assume elements
of B have been broadcast.
- Added prototypes for level-3 ukernels that assume elements of B have
been broadcast. Also added prototype for an spackm function that
employs a duplication/broadcast factor of 4.
- Register virtual gemmtrsm ukernels that work with broadcasting of B.
- Disable right-side hemm, symm, trmm, and trmm3 in bli_family_power9.h.
- Thanks to Nicholai Tukanov for providing these updates.
commit efa61a6c8b1cfa48781fc2e4799ff32e1b7f8f77
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Nov 29 16:17:04 2019 -0600
Added missing bli_l3_sup_thread_decorator() symbol.
Details:
- Defined dummy versions of bli_l3_sup_thread_decorator() for OpenMP
and pthreads so that those builds don't fail when performing shared
library linking (especially for Windows DLLs via AppVeyor). For now,
these dummy implementations of bli_l3_sup_thread_decorator() are
merely carbon copies of the implementation provided for
single-threaded execution (i.e., the one found in
bli_l3_sup_decor_single.c).
Thus, an OpenMP or pthreads build will be able to use the gemmsup
code (including the new selective packing functionality), as it did
before 39fa7136, even though it will not actually employ any
multithreaded parallelism.
commit 39fa7136f4a4e55ccd9796fb79ad5f121b872ad9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Nov 29 15:27:07 2019 -0600
Added support for selective packing to gemmsup.
Details:
- Implemented optional packing for A or B (or both) within the sup
framework (which currently only supports gemm). The request for
packing either matrix A or matrix B can be made via setting
environment variables BLIS_PACK_A or BLIS_PACK_B (to any
non-zero value; if set, zero means "disable packing"). It can also
be made globally at runtime via bli_pack_set_pack_a() and
bli_pack_set_pack_b() or with individual rntm_t objects via
bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert
interface of either the BLIS typed or object APIs. (If using the
BLAS API, environment variables are the only way to communicate the
packing request.)
- One caveat (for now) with the current implementation of selective
packing is that any blocksize extension registered in the _cntx_init
function (such as is currently used by haswell and zen subconfigs)
will be ignored if the affected matrix is packed. The reason is
simply that I didn't get around to implementing the necessary logic
to pack a larger edge-case micropanel, though this is entirely
possible and should be done in the future.
- Spun off the variant-choosing portion of bli_gemmsup_ref() into
bli_gemmsup_int(), in bli_l3_sup_int.c.
- Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along
with corresponding headers, in which higher-level packm-related
functions are defined for use within the sup framework. The actual
packm variant code resides in bli_l3_sup_packm_var.c.
- Pass the following new parameters into var1n and var2m: packa, packb
bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now
always NULL), and pointer to a thrinfo_t* (which for now is the address
of the global single-threaded packm thread control node).
- Added panel strides ps_a and ps_b to the auxinfo_t structure so that
the millikernel can query the panel stride of the packed matrix and
step through it accordingly. If the matrix isn't packed, the panel
stride of interest for the given millikernel will be set to the
appropriate value so that the mkernel may step through the unpacked
matrix as it normally would.
- Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate
panel strides (ps_a and ps_b, respectively) instead of computing them
on the fly.
- Spun off the environment variable getting and setting functions into
a new file, bli_env.c (with a corresponding prototype header). These
functions are now used by the threading infrastructure (e.g.
BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing
infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B).
- Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER.
- Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER,
for use within the definition of BLIS_MEM_INITIALIZER.
- Moved the global_rntm object to bli_rntm.c and extern it where needed.
This means that the function bli_thread_init_rntm() was renamed to
bli_rntm_init_from_global() and relocated accordingly.
- Added a new file, bli_pack.c, which serves as the home for
functions that manage the pack_a and pack_b fields of the global
rntm_t, including from environment variables, just as we have
functions to manage the threading fields of the global rntm_t in
bli_thread.c.
- Reorganized naming for files in frame/thread, which mostly involved
spinning off the bli_l3_thread_decorator() functions into their own
files. This change makes more sense when considering the further
addition of bli_l3_sup_thread_decorator() functions (for now limited
only to the single-threaded form found in the _single.c file).
- Explicitly initialize the reference sup handlers in both
bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more
obvious how to customize to a different handler, if desired.
- Removed various snippets of disabled code.
- Various comment updates.
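The environment-variable rule stated in the first bullet of this commit (unset means the default applies, zero means "disable packing", any non-zero value means enable) can be sketched as below. This is not the code in bli_env.c; the names parse_flag and env_flag, the fallback parameter, and the choice to treat non-numeric text as zero are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: interprets the text of an on/off variable.
   NULL (variable unset) yields `fallback`; text that parses to zero
   yields 0; any other integer yields 1. Non-numeric text parses as 0
   here, which is a choice made for this sketch. */
int parse_flag(const char *text, int fallback)
{
    if (text == NULL)
        return fallback;
    return strtol(text, NULL, 10) != 0;
}

/* Hypothetical helper: reads the named variable from the environment,
   e.g. env_flag("BLIS_PACK_A", 0). */
int env_flag(const char *name, int fallback)
{
    assert(name != NULL);
    return parse_flag(getenv(name), fallback);
}
```

Keeping the parsing separate from getenv() makes the rule easy to check without modifying the process environment.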
commit bbb21fd0a9be8c5644bec37c75f9396eeeb69e48
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Thu Nov 21 18:15:16 2019 -0600
Tweaked SIAM/SC Best Prize language in README.md.
commit 043366f92d5f5f651d5e3371ac3adb36baf4adce
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Thu Nov 21 18:13:51 2019 -0600
Fixed typo in previous commit (SIAM/SC prize).
commit 05a4d583e65a46ff2a1100ab4433975d905d91f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Thu Nov 21 18:12:24 2019 -0600
Added SIAM/SC prize to "What's New" in README.md.
commit 881b05ecd40c7bc0422d3479a02a28b1cb48383f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Thu Nov 21 16:34:27 2019 -0600
Fixed blastest failure for 'generic' subconfig.
Details:
- Fixed a subtle and complicated bug that only manifested via the BLAS
test drivers in the generic subconfiguration, and possibly any other
subconfiguration that did not register complex-domain gemm ukernels,
or registered ONLY real-domain ukernels as row-preferential. This is
a long story, but it boils down to an exception to the "transpose the
operation to bring storage of C into agreement with ukernel pref"
optimization in bli_hemm_front.c and bli_symm_front.c sabotaging the
proper functioning of the 1m method, but only when the imaginary
component of beta is zero. See the comments in issue #342 for more
details. Thanks to Dave Love for identifying the commit in which this
bug was introduced, and other feedback related to this bug.
commit 0c7165fb01cdebbc31ec00124d446161b289942f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Thu Nov 14 16:48:14 2019 -0600
Fixed obscure bug in bli_acquire_mpart_[mn]dim().
Details:
- Fixed a bug in bli_acquire_mpart_mdim(), bli_acquire_mpart_ndim(),
and bli_acquire_mpart_mndim() that allowed the use of a blocksize b
that is too large given the current row/column index (i.e., the i/j
argument) and the size of the dimension being partitioned (i.e., the
m/n argument). This bug only affected backwards partitioning/motion
through the dimension and was the result of a misplaced conditional
check-and-redirect to the backwards code path. It should be noted
that this bug never actually manifested within BLIS (thanks to the
callers in BLIS making sure to always pass in the "correct"
blocksize b), but it could have manifested if the functions were
used by 3rd party callers. Thanks to Minh Quan Ho for
reporting the bug via issue #363.
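A minimal sketch of the idea behind this fix, assuming the offset i is counted from the end at which the traversal starts. The helpers clamp_blocksize() and backward_block_start() are hypothetical names for illustration and are not the BLIS functions; the point is only that the blocksize is clamped before the offset is redirected to the backwards path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of BLIS): given a dimension of size m, a
   logical offset i already traversed, and a requested blocksize b, return
   the blocksize that actually fits in what remains. */
static size_t clamp_blocksize(size_t m, size_t i, size_t b)
{
    assert(i <= m);
    size_t remaining = m - i;
    return b < remaining ? b : remaining;
}

/* Hypothetical helper: index (measured from the front of the dimension) at
   which the next block begins when moving backwards. The blocksize is
   clamped first, and only then is the offset converted. */
static size_t backward_block_start(size_t m, size_t i, size_t b)
{
    size_t b_use = clamp_blocksize(m, i, b);
    return m - i - b_use;
}
```

With m = 10, i = 8, and b = 4, only two elements remain, so the clamped blocksize is 2 and the block starts at index 0 rather than underflowing.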
commit fb8bef9982171ee0f60bc39e41a33c4d31fd59a9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Thu Nov 14 13:05:28 2019 -0600
Fixed copy-paste bug in bli_spackm_6xk_bb4_ref().
Details:
- Fixed a copy-paste bug in the new bli_spackm_6xk_bb4_ref() that
manifested as failures in single-precision real level-3 operations.
Also replaced the duplication factor constants with a const-qualified
variable, dfac, so that this won't happen again.
- Changed NC for single-precision real from 4080 to 8160 so that the
packed matrix B will have the same byte footprint in both single
and double real.
commit 8f399c89403d5824ba767df1426706cf2d19d0a7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Nov 12 15:32:57 2019 -0600
Tweaked/added notes to docs/Multithreading.md.
Details:
- Added language to docs/Multithreading.md cautioning the reader about
the nuances of setting multithreading parameters via the manual and
automatic ways simultaneously, and also about how these parameters
behave when multithreading is disabled at configure-time. These
changes are an attempt to address the issues that arose in issue #362.
Thanks to Jérémie du Boisberranger for his feedback on this topic.
- CREDITS file update.
commit bdc7ee3394500d8e5b626af6ff37c048398bb27e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Nov 11 15:47:17 2019 -0600
Various fixes to support packing duplication in B.
Details:
- Added cpp macros to trmm and trmm3 front-ends to optionally force
those operations to be cast so the structured matrix is on the left.
symm and hemm already had such macros, but these too were renamed so
that the macros were individual to the operation. We now have four
such macros:
#define BLIS_DISABLE_HEMM_RIGHT
#define BLIS_DISABLE_SYMM_RIGHT
#define BLIS_DISABLE_TRMM_RIGHT
#define BLIS_DISABLE_TRMM3_RIGHT
Also, updated the comments in the symm and hemm front-ends related to
the first two macro guards, and added corresponding comments to the
trmm and trmm3 front-ends for the latter two guards. (They all
functionally do the same thing, just for their specific operations.)
Thanks to Jeff Hammond for reporting the bugs that led me to this
change (via #359).
- Updated config/old/haswellbb subconfiguration (used to debug issues
related to duplicating B during packing) to register: a packing
kernel for single-precision real; gemmbb ukernels for s, c, and z;
trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukernels for s, c,
and z; and to use non-default cache and register blocksizes for s, c,
and z datatypes. Also declared prototypes for all of the gemmbb,
trsmbb, and gemmtrsmbb ukernel functions within the
bli_cntx_init_haswellbb() function. This should, once applied to the
power9 configuration, fix the remaining issues in #359.
- Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a
duplication factor of 4. This function is defined in the same file as
bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c).
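As an illustration of what packing with a duplication factor means, here is a minimal sketch. It is not the BLIS reference kernel: the function name pack_dup(), its signature, and the row-major source layout are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch (not the BLIS kernel): pack a k x n row-major block
   of B, with leading dimension ldb, into p, writing each element dfac
   times in a row. The caller must provide room for k * n * dfac floats. */
static void pack_dup(const float *b, size_t ldb, size_t k, size_t n,
                     size_t dfac, float *p)
{
    assert(dfac >= 1);
    for (size_t l = 0; l < k; ++l)
        for (size_t j = 0; j < n; ++j)
            for (size_t d = 0; d < dfac; ++d)
                p[(l * n + j) * dfac + d] = b[l * ldb + j];
}
```

With dfac = 4, each single-precision element occupies four consecutive slots in the packed buffer, which is the kind of layout a broadcast-free microkernel could load directly.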
commit 0eb79ca8503bd7b237994335b9687457227d3290
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Nov 8 14:48:48 2019 -0600
Avoid unused variable warning in lread.c (#356).
Details:
- Replaced the line
f = f;
with
( void )f;
for the unused variable 'f' in blastest/f2c/lread.c. (Hopefully)
addresses issue #356, but since we don't use xlc who knows. Thanks
to Jeff Hammond for reporting this.
commit f377bb448512f0b578263387eed7eaf8f2b72bb7
Author: Jérôme Duval <jerome.duval@gmail.com>
Date: Thu Nov 7 23:39:29 2019 +0100
Add Haiku to the known OS list (#361)
commit e29b1f9706b6d9ed798b7f6325f275df4e6be973
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Nov 5 17:15:19 2019 -0600
Fixed failing testsuite gemmtrsm_ukr for power9.
Details:
- Added code that fixes false failures in the gemmtrsm_ukr module of the
testsuite. The tests were failing because the computation (bli_gemv())
that performs the numerical check was not able to properly traverse
the matrix operands bx1 and b11 that are views into the micropanel of
B, which has duplicated/broadcast elements under the power9 subconfig.
(For example, a micropanel of B with duplication factor of 2 needs to
use a column stride of 2; previously, the column stride was being
interpreted as 1.)
- Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride()
static functions in bli_obj_macro_defs.h. (Previously, only the
function bli_obj_set_strides() was defined. Amazing to think that we
got this far without these former functions.)
- Updated/expounded upon comments.
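The stride issue described above can be sketched as follows. The accessor view_at() is a hypothetical helper, not a BLIS function, and the sample buffer assumes a 2x2 micropanel whose elements are each stored twice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accessor (not part of BLIS): logical element (i, j) of a
   matrix view stored with row stride rs and column stride cs. */
static double view_at(const double *p, size_t rs, size_t cs,
                      size_t i, size_t j)
{
    assert(p != NULL);
    return p[i * rs + j * cs];
}
```

For a buffer holding { 1, 1, 2, 2, 3, 3, 4, 4 }, reading element (0, 1) with rs = 4 and cs = 2 yields 2, the intended value, while cs = 1 yields the duplicate of element (0, 0) instead.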
commit 49177a6b9afcccca5b39a21c6fd8e243525e1505
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Nov 4 18:09:37 2019 -0600
Fixed latent testsuite ukr module bugs for power9.
Details:
- Fixed a latent bug in the testsuite ukernel modules (gemm, trsm, and
gemmtrsm) that only manifested once we began running with parameters
that mimic those of power9. The problem was rooted in the way those
modules were creating objects (and thus allocating memory) for the
micropanel operands to the microkernel being tested. Since power9
duplicates/broadcasts elements of B in memory, we needed an easy way
of asking for more than one storage element per logical element in
the matrix. I incorrectly expressed this as:
bli_obj_create( datatype, k, n, ldbp, 1, &bp );
The problem here is that bli_obj_create() is exceedingly efficient
at calculating the size it passes to malloc() and doesn't allocate a
full leading dimension's worth of elements for the last column (or
row, in this example). This would normally not bother anyone since
you're not supposed to access that memory anyway. But here, my
attempted "hack" for getting extra elements was insufficient, and
needed to be changed to:
bli_obj_create( datatype, k, ldbp, ldbp, 1, &bp );
That is, the extra elements needed to be baked into the dimensions of
the matrix object in order to have the intended effect on the number
of elements actually allocated. Thanks to Jeff Hammond for reporting
this bug.
- Fixed a typically harmless memory leak in the aforementioned test
modules (the objects for the packed micropanels were not being freed).
- Updated/expanded a common comment across all three ukr test modules.
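To see why the first bli_obj_create() call above came up short, consider the smallest buffer that covers every addressable element of a strided matrix. The helper min_elems() below is hypothetical, and whether bli_obj_create() computes exactly this quantity is an assumption; the commit only states that a full leading dimension is not allocated for the last row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of BLIS): smallest number of elements that
   covers every address of an m x n matrix with row stride rs and column
   stride cs. */
static size_t min_elems(size_t m, size_t n, size_t rs, size_t cs)
{
    assert(m > 0 && n > 0);
    return (m - 1) * rs + (n - 1) * cs + 1;
}
```

For k = 4, n = 6, and ldbp = 12, a k x n object with row stride ldbp needs only 42 elements under this formula, whereas a k x ldbp object needs the full 48, which is why the extra elements had to be expressed in the dimensions rather than in the stride.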
commit c84391314d4f1b3f73d868f72105324e649f2a72
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Nov 4 13:57:12 2019 -0600
Reverted minor temp/wspace changes from b426f9e.
Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.
commit 4870260f6b8c06d2cc01b7147d7433ddee213f7f
Author: Jeff Hammond <jeff.r.hammond@intel.com>
Date: Mon Nov 4 11:55:47 2019 -0800
blacklist GCC 5 and older for POWER9 (#360)
commit b426f9e04e5499c6f9c752e49c33800bfaadda4c
Author: Nicholai Tukanov <nicholai@utexas.edu>
Date: Fri Nov 1 17:57:03 2019 -0500
POWER9 DGEMM (#355)
Implemented and registered power9 dgemm ukernel.
Details:
- Implemented 12x6 dgemm microkernel for power9. This microkernel
assumes that elements of B have been duplicated/broadcast during the
packing step. The microkernel uses a column orientation for its
microtile vector registers and thus implements column storage and
general stride IO cases. (A row storage IO case via in-register
transposition may be added at a future date.) It should be noted that
we recommend using this microkernel with gcc and *not* xlc, as issues
with the latter cropped up during development, including but not
limited to slightly incompatible vector register mnemonics in the GNU
extended inline assembly clobber list.
commit 58102aeaa282dc79554ed045e1b17a6eda292e15
Merge: 52059506 b9bc222b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Oct 28 17:58:31 2019 -0500
Merge branch 'amd'
commit 52059506b2d5fd4c3738165195abeb356a134bd4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Wed Oct 23 15:26:42 2019 -0500
Added "How to Download BLIS" section to README.md.
Details:
- Added a new section to the README.md, just prior to the "Getting
Started" section, titled "How to Download BLIS". This section details
the user's options for obtaining BLIS and lays out four common ways
of downloading the library. Thanks to Jeff Diamond for his feedback
on this topic.
commit e6f0a96cc59aef728470f6850947ba856148c38a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Oct 14 17:05:39 2019 -0500
Updated README.md to ack Facebook as funder.
commit b9bc222bfc3db4f9ae5d7b3321346eed70c2c3fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Oct 14 16:38:15 2019 -0500
Call bli_syrk_small() before error checking.
Details:
- In bli_syrk_front(), moved the conditional call to bli_syrk_check()
(if error checking is enabled) and the conditional scaling of C by
beta (if alpha is zero) so that they occur after, instead of before,
the call to bli_syrk_small(). This sequencing now matches that of
bli_gemm_small() in bli_gemm_front() and bli_trsm_small() in
bli_trsm_front().
commit f0959a81dbcf30d8a1076d0a6348a9835079d31a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Oct 14 15:46:28 2019 -0500
When manual config is blacklisted, output error.
Details:
- Fixed and adjusted the logic in configure so that a more informative
error message is output when a user runs './configure ... <conf>' and
<conf> is present in the configuration blacklist. Previously, this
particular set of conditions would result in the message:
'user-specified configuration '' is NOT registered!'
That is, the error message mis-identified the targeted configuration
as the empty string, and (more importantly) mis-identified the
problem. Thanks to Tze Meng Low for reporting this issue.
- Fixed a nearby error message somewhat unrelated to the issue above.
Specifically, the wrong string was being printed when the error
message was identifying an auto-detected configuration that did not
appear to be registered.
commit 6218ac95a525eefa8921baf8d0d7057dfacebe9c
Merge: 0016d541 a617301f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Oct 11 11:53:51 2019 -0500
Merge branch 'master' into amd
commit 0016d541e6b0da617b1fae6612d2b314901b7a75
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Oct 11 11:09:44 2019 -0500
Changed -march=znver2 to =znver1 for clang on zen2.
Details:
- In config/zen2/make_defs.mk, changed the -march= flag so that
-march=znver1 is used instead of -march=znver2 when CC_VENDOR is
clang. (The gcc branch attempts to differentiate between various
versions, but the equivalent version cutoffs for clang are not
yet known by us, so we have to use a single flag for all versions
of clang. Hopefully -march=znver1 is new enough. If not, we'll
fall back to -march=bdver4 -mno-fma4 -mno-tbm -mno-xop -mno-lwp.)
This issue was discovered thanks to AppVeyor.
commit e94a0530e5ac4c78a18f09105f40003be2b517f7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Oct 11 10:48:27 2019 -0500
Corrected zen NC that was non-multiple of NR.
Details:
- Updated an incorrectly set cache blocksize NC for single real within
config/zen/bli_cntx_init_zen.c that was not a multiple of the
corresponding value of NR. This issue, which was caught by Travis CI,
was introduced in 29b0e1e.
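A minimal sketch of the constraint involved, using hypothetical helpers and illustrative values rather than the actual zen blocksizes: a cache blocksize NC should divide evenly into register-blocksize (NR) chunks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check (not part of BLIS): is nc a whole multiple of nr? */
static int is_multiple(size_t nc, size_t nr)
{
    assert(nr > 0);
    return nc % nr == 0;
}

/* Hypothetical helper: largest multiple of nr that does not exceed nc. */
static size_t round_down_to_multiple(size_t nc, size_t nr)
{
    assert(nr > 0);
    return nc - nc % nr;
}
```

For example, with nr = 16, a candidate nc of 4090 fails the check and rounds down to 4080.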
commit a2ffac752076bf55eb8c1fe2c5da8d9104f1f85b
Merge: 1cfe8e25 29b0e1ef
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Oct 11 10:31:18 2019 -0500
Merge branch 'amd-master' into amd
commit 29b0e1ef4e8b84ce76888d73c090009b361f1306
Merge: 1cfe8e25 fdce1a56
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Oct 11 10:24:24 2019 -0500
Code review + tweaks to AMD's AOCL 2.0 PR (#349).
Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was