Note: For some releases, credit for individuals' contributions are shown in parentheses.
- Changes in 0.7.0
- Changes in 0.6.1
- Changes in 0.6.0
- Changes in 0.5.2
- Changes in 0.5.1
- Changes in 0.5.0
- Changes in 0.4.1
- Changes in 0.4.0
- Changes in 0.3.2
- Changes in 0.3.1
- Changes in 0.3.0
- Changes in 0.2.2
- Changes in 0.2.1
- Changes in 0.2.0
- Changes in 0.1.8
- Changes in 0.1.7
- Changes in 0.1.6
- Changes in 0.1.5
- Changes in 0.1.4
- Changes in 0.1.3
- Changes in 0.1.2
- Changes in 0.1.1
- Changes in 0.1.0
- Changes in 0.0.9
- Changes in 0.0.8
- Changes in 0.0.7
- Changes in 0.0.6
- Changes in 0.0.5
- Changes in 0.0.4
- Changes in 0.0.3
- Changes in 0.0.2
- Changes in 0.0.1
April 7, 2020
Improvements present in 0.7.0:
Framework:
- Implemented support for multithreading within the sup (skinny/small/unpacked) framework, which previously was single-threaded only. Note that this feature works harmoniously with the selective packing introduced into the sup framework in 0.6.1. (AMD)
- Renamed
bli_thread_obarrier()
andbli_thread_obroadcast()
functions to drop the 'o', which was left over from whenthrcomm_t
objects tracked both "inner" and "outer" communicators. - Fixed an obscure
int
-to-packbuf_t
type conversion error that only affects certain C++ compilers (including g++) when compiling application code that includes the BLIS header fileblis.h
. (Ajay Panyala) - Added a missing early
return
statement inbli_thread_partition_2x2()
, which provides a slight optimization. (Kiran Varaganti)
Kernels:
- Fixed the semantics of the
bli_amaxv()
kernels ('s' and 'd') within thezen
kernel set. Previously, the kernels (incorrectly) returned the index of the last element whose absolute value was largest (in the event there were multiple of equal value); now, it (correclty) returns the index of the first of such elements. The kernels also now return the index of the first NaN, if one is encountered. (Mat Cross, Devin Matthews)
Build system:
- Warn the user at configure-time when hardware auto-detection returns the
generic
subconfiguration since this is probably not what they were expecting. (Devin Matthews) - Removed unnecessary sorting (and duplicate removal) on
LDFLAGS
incommon.mk
. (Isuru Fernando) - Specify the full path to the location of the dynamic library on OSX so that other dynamic libraries that depend on BLIS know where to find the library. (Satish Balay, Jed Brown)
Testing:
- Updated and reorganized test drivers in
test/sup
so that they work for either single-threaded or multithreaded purposes. (AMD) - Updated/optimized octave scripts in
test/sup
for use with octave 5.2.0. - Minor updates/tweaks to
test/1m4m
.
Documentation:
- Updated existing single-threaded sup performance graphs with new data and added multithreaded sup graphs to
docs/PerformanceSmall.md
. - Added mention of Gentoo support under the external packages section of the
README.md
. - Tweaks to
docs/Multithreading.md
that clarify that setting anyBLIS_*_NT
variable to 1 will be considered manual specification for the purposes of determining whether to auto-factorize viaBLIS_NUM_THREADS
. (AMD)
January 14, 2020
Improvements present in 0.6.1:
Framework:
- Added support for pre-broadcast when packing B. This causes elements of B to be repeated (broadcast) in the packed copy of B so that subsequent vector loads will result in the element already being pre-broadcast into the vector register.
- Added support for selective packing to
gemmsup
(controlled via environment variables and/or therntm_t
object). (AMD) - Fixed a bug in
sdsdot_sub()
that redundantly added the "alpha" scalar and a separate bug in the order of typecasting intermediate products insdsdot_()
. (Simon Lukas Märtens, Devin Matthews) - Fixed an obscure bug in
bli_acquire_mpart_mdim()
/bli_acquire_mpart_ndim()
. (Minh Quan Ho) - Fixed a subtle and complicated bug that only manifested via the BLAS test drivers in the
generic
subconfiguration, and possibly any other subconfiguration that did not register complex-domaingemm
ukernels, or registered ONLY real-domain ukernels as row-preferential. (Dave Love) - Always use
sumsqv
to computenormfv
instead of the "dot product trick" that was previously employed for performance reasons. (Roman Yurchak, Devin Matthews, and Isuru Fernando) - Fixed bug in
thrinfo_t
debugging/printing code.
Kernels:
- Implemented and registered an optimized
dgemm
microkernel for thepower9
kernel set. (Nicholai Tukanov) - Pacify a
restrict
warning in thegemmtrsm4m1
reference ukernel. (Dave Love, Devin Matthews)
Build system:
- Fixed parsing in
vpu_count()
on some SkylakeX workstations. (Dave Love) - Reimplemented
bli_cpuid_query()
for ARM to usestdio
-based functions instead ofpopen()
. (Dave Love) - Use
-march=znver1
for clang onzen2
subconfig. - Updated
-march
flags forsandybridge
,haswell
subconfigurations to use newer syntax (e.g.haswell
instead ofcore-avx2
andsandybridge
instead ofcorei7-avx
. - Correctly use
-qopenmp-simd
for reference kernels when compiling with icc. (Victor Eikjhout) - Added
-march
support for select gcc version ranges where flag syntax changes or new flags are added. The ranges we identify are: versions older than 4.9.0; versions older than 6.1.0 (but newer than 4.9.0); versions older than 9.1.0 (but newer than 6.1.0). - Use
-funsafe-math-optimizations
and-ffp-contract=fast
for all reference kernels when using gcc or clang. - Updated MC cache blocksizes used by
haswell
subconfig. - Updated NC cache blocksizes used by
zen
subconfig. - Fixed a typo in the context registration of the
cortexa53
subconfiguration inbli_gks.c
. (Francisco Igual) - Output a more informative error when the user manually targets a subconfiguration that configure places in the configuration blacklist. (Tze Meng Low)
- Set execute bits of shared library at install-time. (Adam J. Stewart)
- Added missing thread-related symbols for export to shared libraries. (Kyungmin Lee)
- Removed (finally) the
attic/windows
directory since we offer Windows DLL support via AppVeyor's build artifacts, and thus that directory was only likely confusing people.
Testing:
- Fixed latent testsuite microkernel module bug for
power9
subconfig. (Jeff Hammond) - Added
test/1m4m
driver directory for test drivers related to the 1m paper. - Added libxsmm support to
test/sup drivers
. (Robert van de Geijn) - Updated
.travis.yml
anddo_sde.sh
to automatically accept SDE license and download SDE directly from Intel. (Devin Matthews, Jeff Hammond) - Updated standalone test drivers to iterate backwards through the specified problem space. This often helps avoid the situation whereby the CPU doesn't immediately throttle up to its maximum clock frequency, which can produce strange discontinuities (sharply rising "cliffs") in performance graphs.
- Pacify an unused variable warning in
blastest/f2c/lread.c
. (Jeff Hammond) - Various other minor fixes/tweaks to test drivers.
Documentation:
- Added libxsmm results to
docs/PerformanceSmall.md
. - Added BLASFEO results to
docs/PerformanceSmall.md
. - Added the page size and location of the performance drivers to
docs/Performance.md
anddocs/PerformanceSmall.md
. (Dave Love) - Added notes to
docs/Multithreading.md
regarding the nuances of setting multithreading parameters the manual way vs. the automatic way. (Jérémie du Boisberranger) - Added a section on reproduction to
docs/Performance.md
anddocs/PerformanceSmall.md
. (Dave Love) - Documented Eigen
-march=native
hack indocs/Performance.md
anddocs/PerformanceSmall.md
. (Sameer Agarwal) - Inserted multithreading links and disclaimers to
BuildSystem.md
. (Jeff Diamond) - Fixed typo in description for
bli_?axpy2v()
indocs/BLISTypedAPI.md
. (Shmuel Levine) - Added "How to Download BLIS" section to
README.md
. (Jeff Diamond) - Various other minor documentation fixes.
June 3, 2019
Improvements present in 0.6.0:
Framework:
- Implemented small/skinny/unpacked (sup) framework for accelerated level-3 performance when at least one matrix dimension is small (or very small). For now, only
dgemm
is optimized, and this new implementation currently only targets Intel Haswell through Coffee Lake, and AMD Zen-based Ryzen/Epyc. (The existing kernels should extend without significant modification to Zen2-based Ryzen/Epyc once they are available.) Also, multithreaded parallelism is not yet implemented, though application-level threading should be fine. (AMD) - Changed function pointer usages of
void*
to new, typedef'ed typevoid_fp
. - Allow compile-time disabling of BLAS prototypes in BLIS, in case the application already has access to prototypes.
- In
bli_system.h
, define_POSIX_C_SOURCE
to200809L
if the macro is not already defined. This ensures that things such as pthreads are properly defined by an application that has#include "blis.h"
but omits the definition of_POSIX_C_SOURCE
from the command-line compiler options. (Christos Psarras)
Kernels:
- None.
Build system:
- Updated the way configure and the top-level Makefile handle installation prefixes (
prefix
,exec_prefix
,libdir
,includedir
,sharedir
) to better conform with GNU conventions. - Improved clang version detection. (Isuru Fernando)
- Use pthreads on MinGW and Cygwin. (Isuru Fernando)
Testing:
- Added Eigen support to test drivers in
test/3
. - Fix inadvertently hidden
xerbla_()
in blastest drivers when building only shared libraries. (Isuru Fernando, M. Zhou)
Documentation:
- Added
docs/PerformanceSmall.md
to showcase new BLIS small/skinnydgemm
performance on Kaby Lake and Epyc. - Added Eigen results (3.3.90) to performance graphs showcased in
docs/Performance.md
. - Added BLIS thread factorization info to
docs/Performance.md
.
March 19, 2019
Improvements present in 0.5.2:
Framework:
- Added support for IC loop parallelism to the
trsm
operation. - Implemented a pool-based small block allocator and a corresponding
configure
option (enabled by default), which minimizes the number of calls tomalloc()
andfree()
for the purposes of allocating small blocks (on the order of 100 bytes). These small blocks are used by internal data structures, and the repeated allocation and freeing of these structures could, perhaps, cause memory fragmentation issues in certain application circumstances. This was never reproduced and observed, however, and remains entirely theoretical. Still, the sba should be no slower, and perhaps a little faster, than repeatedly callingmalloc()
andfree()
for these internal data structures. Also, the sba was designed to be thread-safe. (AMD) - Refined and extended the output enabled by
--enable-mem-tracing
, which allows a developer to follow memory allocation and release performed by BLIS. - Initialize error messages at compile-time rather than at runtime. (Minh Quan Ho)
- Fixed a potential situation whereby the multithreading parameters in a
rntm_t
object that is passed into an expert interface is ignored. - Prevent a redefinition of
ftnlen
in thef2c_types.h
in blastest. (Jeff Diamond)
Kernels:
- Adjusted the cache blocksizes in the
zen
sub-configuration forfloat
,scomplex
, anddcomplex
datatypes. The previous values, taken directly from thehaswell
subconfig, were merely meant to be reasonable placeholders until more suitable values were determined, as had already taken place for thedouble
datatype. (AMD) - Rewrote reference kernels in terms of simplified indexing annotated by the
#pragma omp simd
directive, which a compiler can use to vectorize certain constant-bounded loops. The#pragma
is disabled via a preprocessor macro layer if the compiler is found byconfigure
to not support-fopenmp-simd
. (Devin Matthews, Jeff Hammond)
Build system:
- Added symbol-export annotation macros to all of the function prototypes and global variable declarations for public symbols, and created a new
configure
option,--export-shared=[public|all]
, that controls which symbols--only those that are meant to be public, or all symbols--are exported to the shared library. (Isuru Fernando) - Standardized to using
-O3
in various subconfigs, and also-funsafe-math-optimizations
for reference kernels. (Dave Love, Jeff Hammond) - Disabled TBM, XOP, LWP instructions in all AMD subconfigs. (Devin Matthews)
- Fixed issues that prevented using BLIS on GNU Hurd. (M. Zhou)
- Relaxed python3 requirements to allow python 3.4 or later. Previously, python 3.5 or later was required if python3 was being used. (Dave Love)
- Added
thunderx2
sub-configuration. (Devangi Parikh) - Added
power9
sub-configuration. For now, this subconfig only uses reference kernels. (Nicholai Tukanov) - Fixed an issue with
configure
failing on OSes--including certain flavors of BSD--that contain a slash '/' character in the output ofuname -s
. (Isuru Fernando, M. Zhou)
Testing:
- Renamed
test/3m4m
directory totest/3
. - Lots of updates and improvements to Makefiles, shell scripts, and matlab scripts in
test/3
.
Documentation:
- Added a new
docs/Performance.md
document that showcases single-threaded, single-socket, and dual-socket performance results ofsingle
,double
,scomplex
, anddcomplex
level-3 operations in BLIS, OpenBLAS, and MKL/ARMPL for Haswell, SkylakeX, ThunderX2, and Epyc hardware architectures. (Note: Other implementations such as Eigen and ATLAS may be added to these graphs in the future.) - Updated
README.md
to include new language on external packages. (Dave Love) - Updated
docs/Multithreading.md
to be more explicit about the fact that multithreading is disabled by default at configure-time, and the fact that BLIS will run executed single-threaded at runtime by default if no multithreaded specification is given. (M. Zhou)
December 18, 2018
Improvements present in 0.5.1:
Framework:
- Added mixed-precision support to the 1m method implementation.
- Track internal scalar datatypes in the
obj_t
info bitfield. This allows slightly better handling of scalars during mixed-datatypegemm
computation. - Fixed a bug that allowed execution of 1m with mixed-precision
gemm
, despite such usage not yet being officially supported. (Devangi Parikh) - Added missing internal calls to
bli_init_once()
inbli_thread_set_num_threads()
andbli_thread_set_ways()
. (Ali Emre Gülcü)
Kernels:
- Redefined
packm
kernels to handle edge cases and zero-filling, and updated their APIs accordingly. This was needed in order to fully support the use of non-default/non-reference packm kernels. (Devin Matthews)
Build system:
- Disallow explicit requests to use 64-bit integers in the BLAS API while simultaneously using 32-bit integers in the BLIS API. (Jeff Hammond, Devin Matthews)
- Fixed an msys2/Windows build failure. (Isuru Fernando, Costas Yamin)
- Fixed a MinGW build failure. (Isuru Fernando)
- Disabled
arm32
,arm64
configuration families since we don't yet have logic to choose the correct context at runtime.
Testing:
- Make sure the testsuite fails for
NaN
,Inf
in input operands. (Devin Matthews) - Added
hemm
driver totest/3m4m
. - Minor updates to
test/mixeddt
drivers, matlab scripts. - Added additional matlab plotting scripts to
test/3m4m
.
Documentation:
- Updated
docs/Multithreading.md
to include discussion of setting affinity via OpenMP. - Updated
docs/Testsuite.md
to include discussion of mixed-datatype settings. - Updated
docs/MixedDatatypes.md
to include a brief section on running the testsuite to exercise mixed-datatype functionality, and other minor updates. - Fixed broken links in
docs/KernelsHowTo.md
. (Richard Goldschmidt) - Spelling fixes in FAQ. (Rhys Ulerich)
- Updated 3-clause license comment blocks to refer generically to copyright holders rather than just the original copyright holder, UT-Austin.
October 25, 2018
Improvements present in 0.5.0:
Framework:
- Implemented support for matrix operands of mixed datatypes (domains and precisions) within the
gemm
operation. - Added configure-time option to use slab or round-robin partitioning within JR and IR loops of most level-3 operations' macrokernels.
- Allow parallelism in the JC loop for
trsm_l
, which previously was unnecessarily disabled. (Field Van Zee, Devangi Parikh) - Added Fortran-77/90-compatible APIs for some thread-related functions. (Kay Dewhurst)
- Defined a new level-1d operation
shiftd
, which adds a scalar value to every element along an arbitrary diagonal of a matrix. - Patched an issue (#267) that may arise when linking against OpenMP-configured BLIS from which parallelism is requested at runtime and a level-3 operation (e.g.
gemm
) is called from within an OpenMP parallel region of an application where OpenMP nested parallelism is disabled. (Devin Matthews)
Kernels:
- Imported SkylakeX
dgemm
microkernel fromskx-redux
branch, which contains optimizations (mostly better prefetching on C) over the previous implementation. (Devin Matthews) - Renamed/relocated level-3
zen
microkernels to thehaswell
kernel set. Please see a recent message to blis-devel for more information on this rename [1]. - BG/Q kernel fixes. (Ye Luo)
Build system:
- Added support for building Windows DLLs via AppVeyor [2], complete with a built-in implementation of pthreads for Windows, as well as an implementation of the
pthread_barrier_*()
APIs for use on OS X. (Isuru Fernando, Devin Matthews, Mathieu Poumeyrol, Matthew Honnibal) - Defined a
cortexa53
sub-configuration, which is similar tocortexa57
except that it uses slightly different compiler flags. (Mathieu Poumeyrol) - Added python version checking to
configure
script. - Added a script to automate the regeneration of the symbols list file (now located in
build/libblis-symbols.def
). - Various tweaks in preparation for BLIS's inclusion within Debian. (M. Zhou)
- Various fixes and cleanups.
Testing:
- Added tests for
cortexa15
andcortexa57
in Travis CI. (Mathieu Poumeyrol) - Added tests for mixed-datatype
gemm
and the simulation of application-level threading (salt) in Travis CI. - Add statistics-collecting
irun.py
script. - Include various threading parameters in the initial comment block of testsuite output.
- Various fixes and cleanups.
Documentation:
- Added
MixedDatatypes.md
documentation for mixed-datatypegemm
. - Added example code demonstrating use of mixed-datatype
gemm
(object API only). - Added description of
shiftd
toBLISTypedAPI.md
andBLISObjectAPI.md
. - Added "Known issues" sections to
Multithreading.md
andSandboxes.md
. - Updated
FAQ.md
. - Various other documentation updates.
[1] https://groups.google.com/forum/?fromgroups#!topic/blis-devel/pytWRjIzxVY [2] https://ci.appveyor.com/project/shpc/blis/
August 30, 2018
Improvements present in 0.4.1:
Framework:
- Improved thread safety by homogenizing all critical sections to unconditionally use pthread mutexes. (AMD)
- Fixed
bli_finalize()
, which had become uncallable due to sharingpthread_once_t
objects between the initialization and finalization steps. This manifested as a rather large memory leak (many megabytes) if/when the application manually finalized BLIS in the middle of its execution. (Devangi Parikh, Field Van Zee) - Fixed a minor memory leak in the global kernel structure. (Devangi Parikh, Field Van Zee)
- Replaced extensive use of function "chooser" macros in object API functions with use of a new set of functions using the suffix
_qfp()
("query function pointer"). These functions can be used to query function pointers for most families of typed functions. - Fixed an obscure integer size bug due to improper use of integer literal constants with
va_arg()
. This oddly manifested as LP64 systems using the general stride output case of microkernels even when the output matrix storage matched that of the microkernel output preference. (Devangi Parikh, Field Van Zee)
Kernels:
- Fixed compilation of
armv7a
kernels. (Mathieu Poumeyrol)
Build system:
- Generate makefile fragments within the
obj
directory rather than inconfig
,kernels
,ref_kernels
, andframe
. This allows a user to perform an out-of-tree build even if the BLIS source distribution is read-only. (Devin Matthews) - Allow a dependent sub-project such as example code or the testsuite to compile and link against an installation of BLIS rather than implicitly searching for a local (uninstalled) copy. (Victor Eijkhout, Field Van Zee)
- Fixed a link error that manifested after building only a shared library (e.g.
--disable-static
) and then trying to build a dependent sub-project such as example code or the testsuite. (Sajid Ali) - Changed
test
make target of top-levelMakefile
to behave more likecheck
by printing a color-coded characterization of the test results. - Fixed the
-p
option toconfigure
, which had likely been broken since May 7, 2018. The--prefix
option was unaffected. (Dave Love) - Running
configure
no longer requires a C++ compiler given that a C++ compiler was only ever envisioned for optional use in the sandbox. (Devangi Parikh, Field Van Zee)
Testing:
- Added the ability to "simulate" multiple application-level threads in the testsuite by executing the individual experiments with multiple threads. This should make it easier to test for thread-safety in the future. (AMD)
- Removed borderline useless wall clock time from test drivers' output.
Documentation:
- Updated typed and object API documents to include language on
rntm_t
parameters in the expert interfaces. - Updates to
README.md
, including language on sandboxes. - Added table of make targets to
BuildSystem.md
. - Added missing language to
ConfigurationHowTo.md
on updating the architecture string array inbli_arch.c
. (Devangi Parikh, Field Van Zee)
July 27, 2018
Framework:
- Added support for "sandboxes" for employing alternative
gemm
implementations. A ready-to-use reference C99 sandbox provides developers with a starting point for experimentation. - Separated expert, non-expert typed APIs (levels 1v, 1d, 1f, 1m, 2, and 3, and utility functions).
- Defined new
rntm_t
structure and API to provide a uniform way of storing user-level threading information (equivalent ofBLIS_NUM_THREADS
andBLIS_*_NT
environment variables), and also conveying that information to expert APIs. (Matthew Honnibal, Nathaniel Smith) - Renamed various
obj_t
accessor macros, converted to static functions, and inserted explicit typecasting to facilitate #including blis.h from a C++ application. (Jacob Gorm Hansen) - Cache and reuse
arch_t
architecture query result at runtime. (Devin Matthews) - Implemented object-based functions
bli_projm()
/_projv()
, which project objects from one domain to another (within the same precision), andbli_castm()
/_castv()
, which typecast objects from one datatype to another. - Implemented object-based functions
bli_setrm()
/_setrv()
,bli_setim()
/_setiv()
, which allow the caller to broadcast a scalar to all real elements or all imaginary elements within an object. - Enforce consistent datatypes in most object APIs.
- For native execution, initialize a context's virtual microkernel slots to the function pointers of native microkernels. This simplifies query routines and paves the way for more generalized use of virtual microkernels beyond those for induced methods.
- Various bugfixes. (Devangi Parikh)
Kernels:
- Re-expressed x86_64 microkernels in terms of assembly language macros, which support lower- and upper-case, AT&T and Intel syntax. (Devin Matthews)
- Various bugfixes. (Robin Christ, Francisco Igual, Devangi Parikh, qnerd)
Build system:
- Added support for
--libdir
,--includedir
configure options. (Nico Schlömer) - Adopted Linux-like shared library versioning and enabled building shared libraries by default.
- Improved shared library handling on OS X. (Alex Arslan)
- Added configure support for preset
CFLAGS
,LDFLAGS
. (Dave Love) - Improvements to version file handling.
- Implemented configure option hack for circumventing small/limited values of
ARG_MAX
. - Reorganized
cc
,cc_vendor
detection responsibilities fromMakefile
toconfigure
. (Alex Arslan) - Cross-compilation fixes.
- Preliminary Windows ABI suport using
clang
, appveyor. (Isuru Fernando) - Better support for typical development environment on OpenBSD, FreeBSD. (Alex Arslan)
- Bumped shared library
soname
version number to 1.0.0. - Various build system fixes and cleanups. (Mathieu Poumeyrol, Nico Schlömer, Tony Skjellum)
Testing:
- Rewrote Travis CI testing config file and supporting logic to use Intel's SDE emulator. This allows multiple x86_64 microarchitectures to be tested regardless of what hardware Travis happens to be using at the time. (Devin Matthews)
- Added
docs/studies
hardware-specific test driver directory to track individual performance studies. (Devangi Parikh) - Streamlined
testsuite/input.operations
file format.
Documentation:
- Relocated all wiki documents to a
docs
directory and adjusted all links, andREADME.md
, accordingly. - Added a
CONTRIBUTING.md
file to top-level directory. - Added
docs/CodingConventions.md
. - Added
docs/Sandboxes.md
. - Added
docs/BLISObjectAPI.md
. - Renamed and updated
docs/BLISTypedAPI.md
. - Updated
docs/KernelsHowTo.md
. - Updated
docs/BuildSystem.md
. (Stefanos Mavros) - Updated
docs/Multithreading.md
. - Updated indentation in
docs/ConfigurationHowTo.md
for easier reading. - Added example code for the BLIS typed API in
examples/tapi
. - Expanded existing example code for the object API in
examples/oapi
. - Added links to RHEL/Fedora and Debian packages to
README.md
. - Various cleanups. (Tony Skjellum, Dave Love, Nico Schlömer)
April 28, 2018
- Added
setijm
,getijm
operations for updating and querying individual matrix elements via the object API. - Added
examples/oapi
directory containing a code-based tutorial on using the object-based API in BLIS. - Track separate reference kernel
CFLAGS
for each sub-configuration. - Added support for blacklisting sub-configurations based on the assembler/binutils.
- Added 64-bit support to BLAS test drivers.
- Various bugfixes.
April 4, 2018
- Enable use of new zen kernels in haswell sub-configuration.
- Added row-storage optimizations to zen
dotxf
kernels (now also used by haswell). - Integrated an
f2c
ed version of the BLAS test drivers from netlib LAPACK into BLIS build system (e.g.make testblas
,make checkblas
). See the Testsuite document for more info. Also scheduled these BLAS drivers to execute regularly via Travis CI. - Added a new
make check
target that executes a fast version of the BLIS testsuite as well as the BLAS test drivers (primarily targeting package maintainers). - Allow individual operation overriding in the BLIS testsuite. (This makes it easy to quickly test one or two operations of interest.)
- Added build system support for libmemkind. If present,
hbw_malloc()
is used as the default value forBLIS_MALLOC_POOL
instead ofmalloc()
. It can be disabled via--disable-memkind
. - Tweaks and fixes to BLAS compatibility layer, courtesy of the new BLAS test drivers.
- Output the active sub-configuration in testsuite output header.
- Allow arbitrary nesting of "umbrella" configuration families in
config_registry
, allowing us to define x86_64 in terms of amd64 and intel64. - Added skx and knl to intel64 (and by proxy, x86_64) configuration families.
- Implemented basic support for ARM hardware detection (via
/proc/cpuinfo
). - Various bugfixes.
February 23, 2018
This version contains significant improvements from 0.2.2. Major changes include:
- Real and complex domain (s,d,c,z) assembly-based gemm microkernels for AMD's Zen microarchitecture. (AMD, Field Van Zee)
- Real domain (s,d) assembly-based
gemmtrsm_l
andgemmtrsm_u
microkernels for Zen. (AMD, Field Van Zee) - Real domain (s,d) intrinsics-based
amaxv
,axpyv
,dotv
,dotxv
,scalv
,axpyf
, anddotxf
kernels for Zen. (AMD, Field Van Zee) - Generalized the configuration system to allow multi-configuration builds targeting configuration "families". A single sub-configuration is chosen at runtime via some heuristic, such as querying CPUID (e.g. runtime hardware detection). This change was extensive and required a reorganization of the build system, configuration semantics, reference kernels, a new naming scheme for native kernels, and a rewrite of the global kernel structure (gks). Please see the rewritten Configuration Guide for details.
- Implemented runtime hardware detection for x86_64 hardware.
- Reimplemented configure-time hardware detection in terms of new runtime hardware detection code, which queries for CPU features rather than individual models.
- Implemented library self-initialization by rewriting
bli_init()
in terms ofpthread_once()
and inserting invocations tobli_init()
in key places throughout BLIS. The expectation is that through normal use of any BLIS API (BLAS, typed BLIS, or object-based BLIS), the user no longer needs to explicitly initialize the library, and thatbli_finalize()
should never be called by the user unless he is absolutely sure he no longer needs BLIS functionality. Related to this: global scalar constants (BLIS_ONE
,BLIS_ZERO
, etc.) are now statically initialized and thus ready to use immediately. Collectively, these changes provide improved thread safety at the application level. - Compile with and install a single monolithic (flattened)
blis.h
header to (1) speed up compilation and (2) reduce the number of build product files. - Added a sub-API for setting multithreading environment variables at runtime. For a few examples, please see the Multithreading guide.
- Reimplemented OpenMP/pthread barriers in terms of GNU atomic built-ins.
- Other small changes and fixes.
May 2, 2017
- Implemented the 1m method for inducing complex matrix multiplication. (Please see ACM TOMS publication "Implementing high-performance complex matrix multiplication via the 1m method" for more details.)
- Switched to simpler
trsm_r
implementation. - Relaxed constraints that
MC % NR = 0
andNC % MR = 0
, as this was only needed for the more sophisticatedtrsm_r
implementation. - Automatic loop thread assignment. (Devin Matthews)
- Updates to
.travis.yml
configuration file. (Devin Matthews) - Updates to non-default haswell microkernels.
- Match storage format of the temporary micro-tiles in macrokernels to that of the microkernel storage preference for edge cases.
- Added support for Intel's Knight's Landing. (Devin Matthews)
- Added more flexible options to specify multithreading via the configure script. (Devin Matthews)
- OS X compatibility fixes. (Devin Matthews)
- Other small changes and fixes.
Also, thanks to Elmar Peise, Krzysztof Drewniak, and Francisco Igual for their contributions in reporting/fixing certain bugs that were addressed in this version.
October 5, 2016
- Implemented distributed
thrinfo_t
structure management. (Ricardo Magana) - Redesigned BLIS's level-3 algorithmic control tree structure. (suggested by Tyler Smith)
- Consolidated
gemm
,herk
, andtrmm
blocked variants into one set of three bidirectional variants. - Integrated a new "memory broker" (
membrk_t
) abstraction in place of the previous memory allocator, which allows one set of pools per broker (or, in other words, per memory space). (Ricardo Magana) - Reorganized multithreading APIs, including more consistent namespace prefixes:
bli_thrinfo_*()
,bli_thrcomm_*()
, etc. - Added
randnm
,randnv
operations, which produce random powers of two in a narrow range, and integrated a corresponding option into the testsuite. (suggested by AMD) - Reclassified
amaxv
as a level-1v operation and kernel. - Added complex
gemm
microkernels for haswell, which have register allocations consistent with the existing 6x16sgemm
and 6x8dgemm
microkernels. - Adjusted existing microkernels to work properly when BLIS is configured to use 32-bit integers. (Devin Matthews)
- Relaxed alignment constraints in sandybridge and haswell microkernels. (Devin Matthews)
- Define CBLAS API with
f77_int
instead ofint
, which means the BLAS compatibility integer size is inherited by the CBLAS compatibility layer. (Devin Matthews) - Added an alignment switch to the testsuite to globally enable/disable starting address and leading dimension alignment. (suggested by Devin Matthews)
- Various enhancements to configure script. (Devin Matthews)
- Avoid compiling BLAS/CBLAS compatibility layer when it is disabled via configure. (suggested by Devin Matthews)
- Disabled compilation of object-based blocked partitioning code for level-2 operations, as it was already functionally disabled.
- Fixes and tweaks to POSIX thread support. (Tyler Smith, Jeff Hammond)
- Other small changes and fixes.
April 11, 2016
Most of BLIS 0.2.0's changes are contained within a single commit, 537a1f4 (aka "the big commit"). An executive summary of the most consequential of these changes follows:
- BLIS has been retrofitted with a new data structure, known as a "context," affecting virtually every internal API for every computational operation, as well as many supporting, non-computational functions that must access information within the context.
- In addition to appearing within these internal APIs, the context--specifically, a pointer to a
cntx_t
--is now present within all user-level datatype-aware APIs, e.g.bli_zgemm()
, appearing as the last argument. - User-level object APIs, e.g.
bli_gemm()
, were unaffected and continue to be "context-free." However, these APIs were duplicated so that corresponding "context-aware" APIs now also exist, differentiated with an_ex
suffix (for "expert"). - Contexts are initialized very soon after a computational function is called (if one was not passed in by the caller) and are passed all the way down the function stack, even into the kernels, and thus allow the code at any level to query information about the runtime instantiation of the current operation being executed, such as kernel addresses, microkernel storage preferences, and cache/register blocksizes.
- Contexts are thread-friendly. For example, consider the situation where a developer wishes two or more threads to execute simultaneously with somewhat different runtime parameters. Contexts also inherently promote thread-safety, such as in the event that the original source of the information stored in the context changes at run-time (see next two bullets).
- BLIS now consolidates virtually all kernel/hardware information in a new "global kernel structure" (gks) API. This new API will allow the caller to initialize a context in a thread-safe manner according to the currently active kernel configuration. For now, the currently active configuration cannot be changed once the library is built. However, in the future, this API will be expanded to allow run-time management of kernels and related parameters.
- The most obvious application of this new infrastructure is the run-time detection of hardware (and the implied selection of appropriate kernels). With contexts, kernels may even be "hot swapped" within the gks, and once execution begins on a level-3 operation, the memory allocator will be reinitialized on-the-fly, if necessary, to accommodate the new kernels' blocksizes. If a different application thread is executing with another (previously loaded) kernel, it will finish in a deterministic fashion because its kernel info was loaded into its context before computation began, and also because the blocks it checked out from the memory pools will be unaffected by the newer threads' reinitialization of the allocator.
This version contains other changes that were committed prior to 537a1f4:
- Inline assembly FMA4 microkernels for AMD bulldozer. (Etienne Sauvage)
- A more feature-rich configure script and build system. Certain long-style options are now accepted, including convenient command-line switches for things like enabling debugging symbols. Important definitions were also consolidated into a new makefile fragment,
common.mk
, which can be included by the BLIS build system as well as quasi-independent build systems, such as the BLIS test suite. (Devin Matthews) - Updated and improved armv8 microkernels. (Francisco Igual)
- Define
bli_clock()
in terms ofclock_gettime()
intead ofgettimeofday()
, which has been languishing on my to-do list for years, literally. (Devin Matthews) - Minor but extensive modifications to parts of the BLAS compatibility layer to avoid potential namespace conflicts with external user code when
blis.h
is included. (Devin Matthews) - Fixed a missing BLIS integer type definition (
BLIS_BLAS2BLIS_INT_TYPE_SIZE
) when CBLAS was enabled. Thanks to Tony Kelman reporting this bug. - Merged
packm_blk_var2()
intopackm_blk_var1()
. The former's functionality is used by induced methods for complex level-3 operations. (Field Van Zee) - Subtle changes to treatment of row and column strides in
bli_obj.c
that pertain to somewhat unusual use cases, in an effort to support certain situations that arise in the context of tensor computations. (Devin Matthews) - Fixed an unimplemented
beta == 0
case in the penryn (formerly "dunnington")sgemm
microkernel. (Field Van Zee) - Enhancements to the internal memory allocator in anticipation of the context retrofit. (Field Van Zee)
- Implemented so-called "quadratic" matrix partitioning for thread-level parallelism, whereby threads compute thread index ranges to produce partitions of roughly equal area (and thus computation), subject to the (register) blocksize multiple, even when given a structured rectangular subpartition with an arbitrary diagonal offset. Thanks to Devangi Parikh for reporting bugs related to this feature. (Field Van Zee)
- Enabled use of Travis CI for automatic testing of github commits and pull requests. (Xianyi Zhang)
- New
README.md
, written in github markdown. (Field Van Zee) - Many other minor bug fixes.
Special thanks go to Lee Killough for suggesting the use of a "context" data structure in discussions that transpired years ago, during the early planning stages of BLIS, and also for suggesting such a perfectly appropriate name.
July 29, 2015
This release contains only two commits, but they are non-trivial: we now have configuration support for AMD Excavator (Carrizo) and microkernels for Intel Haswell/Broadwell.
June 19, 2015
- Replaced the static memory allocator used to manage internal packing buffers with one that dynamically allocates memory, on-demand, and then recycles the allocated blocks in a software cache, or "pool". This significantly simplifies the memory-related configuration parameter set, and it completely eliminates the need to specify a maximum number of threads.
- Implemented default values for all macro constants previously found in
bli_config.h
. The default values are now set inframe/include/bli_config_macro_defs.h
. Any value #defined inbli_config.h
will override these defaults. - Initial support for configure-time detection of hardware. By specifying the
auto
configuration at configure-time, the configure script chooses a configuration for you. If an optimized configuration does not exist, the reference implementation serves as a fallback. - Completely reorganized implementations for complex induced methods and added support for new algorithms.
- Added optimized microkernels for AMD Piledriver family of hardware.
- Several bugfixes to multithreaded execution.
- Various other minor tweaks, code reorganizations, and bugfixes.
October 23, 2014
- New complex domain AVX microkernels are now available and used by default by the sandybridge configuration.
- Added new high-level 4m and 3m implementations presently known as "4mh" and "3mh".
- Cleaned up 4m/3m front-end layering and added routines to enable, disable, and query which implementation will be called for a given level-3 operation. The test suite now prints this information in its pre-test summary. 4m (not 4mh) is still the default when complex microkernels are not present.
- Consolidated control tree code and usage so that all level-3 multiplication operations use the same gemm_t structure, leaving only
trsm
to have a custom tree structure and associated code. - Re-implemented micropanel alignment, which was removed in commit c2b2ab6 earlier this year.
- Relaxed the long-standing constraint that
KC
be a multiple ofMR and
NRby allowing the developer to specify target values and then adjusting them up to the next multiple of
MRor
NR, as needed by the affected operations (
hemm,
symm,
trmm, trsm
). - Added a new "row preference" flag that the developer can use to signal to the framework that a microkernel prefers to output micro-tiles of C that are row-stored (rather than column-stored). Column storage preference is still the default.
- Changed semantics of blocksize extensions to instead be "maximum" blocksizes (and thus emphasizing the "extended" values rather than the difference).
- Various other minor tweaks, code reorganizations, and bugfixes.
Thanks go to those whose contributions, feedback, and bug reports led to these improvements--in particular, Tony Kelman, Kevin Locke, Devin Matthews, Tyler Smith, and perhaps others whose feedback I've lost track of.
August 4, 2014
- Added a CBLAS compatibility layer, which can be enabled at configure-time via
BLIS_ENABLE_CBLAS
inbli_config.h
. Enabling the CBLAS layer implicitly forces the BLAS compatibility layer to also be enabled. Once enabled, the application may access CBLAS prototypes viablis.h
orcblas.h
. - Fixed a packing bug for cases when
MR
orNR
(or both) are 1. - Redefined bit field macros in
bli_type_defs.h
with bitshift operator to ease future rearranging, expanding, or adding of info bits.
July 27, 2014
- Added shared library support to build system.
- Preliminary parallelization of
trsm
(Tyler Smith). - Added generic
_void()
microkernel wrappers so that users (or developers) can call the microkernel without knowing the implementation/developer-specific function names, which are specified at configure-time. - Added
bli_info_*()
API for querying general information about BLIS, including blocksizes. - Reimplemented initialization/finalization for thread safety.
- Fixed a possible
Inf
/NaN
issue in several level-3 operations when beta is zero. - Minor fixes to BLAS compatibility layer.
- Added initial support for Emscripten (Marat Dukhan).
June 23, 2014
This is a relatively minor release. The changes can be summarized as:
- Added experimental support for PNaCL (Marat Dukhan).
- Fixed aligned memory allocation on Windows (Tony Kelman).
- Fixed missing version string in build products when downloading tarballs/zip files (Field Van Zee). Thanks to Victor Eijkhout for pointing out this bug.
June 2, 2014
Tyler has been hard at work developing and refining extensions to BLIS that provide multithreading support (currently via OpenMP, though POSIX threads may be supported in the future). These extensions enable multithreading within all level-3 operations except for trsm
. We are pleased to announce that these code changes are now part of BLIS.
February 25, 2014
I. I am excited to announce that BLIS now provides high-performance complex domain support to ALL level-3 operations when ONLY the same-precision real domain equivalent gemm microkernel is present and optimized. In other words, BLIS's productivity lever just got twice as strong: optimize the dgemm
microkernel, and you will get double-precision complex versions of all level-3 operations, for free. Same for sgemm
microkernel and single-precision complex.
II. We also now offer complex domain support based on the 3m method, but this support is ONLY accessible via separate interfaces. This separation is a safety feature, since the 3m method's numerical properties are inherently less robust. Furthermore, we think the 3m method, as implemented, is somewhat performance-limited on systems with L1 caches that have less than 8-way associativity.
We plan on writing a paper on (I) and (II), so if you are curious how exactly we accomplish this, please be patient and wait for the paper. :)
III. The second, user-oriented change facilitates a much more developer-friendly configuration system. This "change" actually represents a family of smaller changes. What follows is a list of those changes taken from the git log:
- We now have standard names for reference kernels (levels-1v, -1f and 3) in the form of macro constants. Examples:
BLIS_SAXPYV_KERNEL_REF
BLIS_DDOTXF_KERNEL_REF
BLIS_ZGEMM_UKERNEL_REF
- Developers no longer have to name all datatype instances of a kernel with a common base name; [sdcz] datatype flavors of each kernel or microkernel (level-1v, -1f, or 3) may now be named independently. This means you can now, if you wish, encode the datatype-specific register blocksizes in the name of the microkernel functions.
- Any datatype instances of any kernel (1v, 1f, or 3) that is left undefined in
bli_kernel.h
will default to the corresponding reference implementation. For example, ifBLIS_DGEMM_UKERNEL
is left undefined, it will be defined to beBLIS_DGEMM_UKERNEL_REF
. - Developers no longer need to name level-1v/-1f kernels with multiple datatype chars to match the number of types the kernel WOULD take in a mixed type environment, as in
bli_dddaxpyv_opt()
. Now, one char is sufficient, as inbli_daxpyv_opt()
. - There is no longer a need to define an obj_t wrapper to go along with your level-1v/-1f kernels. The framework now provides a
_kernel()
function, as inbli_axpyv_kernel()
, which serves as theobj_t
wrapper for whatever kernels are specified (or defaulted to) viabli_kernel.h
. - Developers no longer need to prototype their kernels, and thus no longer need to include any prototyping headers from within
bli_kernel.h
. The framework now generates kernel prototypes, with the proper type signature, based on the kernel names defined (or defaulted to) viabli_kernel.h
. - If the complex datatype x (of [cz]) implementation of the gemm microkernel is left undefined by
bli_kernel.h
, but its same-precision real domain equivalent IS defined, BLIS will enable the automatic complex domain feature described above in (1a) for the datatype x implementations of all level-3 operations, using only the corresponding real domain gemm microkernel. If the complex gemm microkernel for x IS defined, then all complex level-3 operations will be defined in terms of that microkernel.
The net effect of (III) is that your bli_kernel.h
files can be MUCH simpler and less cluttered. (Extreme example: the reference configuration's bli_kernel.h
is now completely empty!) I have updated all configurations and kernels that are currently part of BLIS by stripping out unnecessary/outdated definitions and migrating existing definitions to their new names. (If you ever need to reference the complete list of options and macros, please refer to the bli_kernel.h
inside the template configuration.) Please set aside some time to test and, if necessary, tweak the configurations which you originally developed and submitted. I may have broken some of them. If so, please accept my apologies and contact me for assistance. I will work with you to get them functional again.
The changes mentioned in (I), (II), and (III), along with all other changes since 0.1.0, are included BLIS 0.1.1 (fde5f1fd).
I know these changes may be a little disruptive to some, but I think that most developers will find the new complex functionality very useful, and the new configuration system much easier to use.
November 9, 2013
- Added
sgemm
microkernel for dunnington. - Added
dgemm
microkernels and configurations for sandybridge, bgq, mic, power7, piledriver, loonson3a, which were used to gather performance data in our second ACM TOMS paper. Many thanks to Francisco Igual, Tyler Smith, Mike Kistler, and Xianyi Zhang for developing, testing, and contributing these kernels. - Migrated to signed integer for
dim_t
,inc_t
(to facilitate calling BLIS from Fortran). - Added "template" configuration and kernel set for developers to use as a starting point when developing new kernels from scratch.
- Improvements to test suite, including section overrides and standalone level-1f/level-3 kernel modules.
- Improvements to Windows build system (though it may still not yet be functional out-of-the-box). Thanks to Martin Schatz for his help here.
- Removed support for element "duplication" in level-3 macrokernels.
- Several bug fixes to BLAS compatibility layer. Thanks to Vladimir Sukharev for his numerous bug reports wrt the LAPACK test suite.
- Various other minor bugfixes.
July 18, 2013
- A few algorithmic optimizations and bug fixes to
trmm
andtrsm
. - Parameter checking in the compatibility layer that mimics netlib BLAS.
- Default use of
stdint.h
types (int64_t
,uint64_t
by default). - Optional (and very much untested) C99 built-in complex type/arithmetic support.
Note that bli_config.h
has changed since 0.0.8. Added configuration macros are:
#define BLIS_ENABLE_C99_COMPLEX
#define BLIS_ENABLE_BLAS2BLIS_INT64
#define PASTEF770(name) // ...
The first macro enables C99 built-in complex types. The second causes a Fortran integer to be defined as an int64_t (rather than int32_t
). The third is a macro to name-mangle a full routine name for Fortran (ie: add an underscore) and should be obtained from config/reference/bli_config.h
.
June 12, 2013
This version includes several kernel optimizations and bug fixes.
While neither bli_config.h
nor bli_kernel.h
has changed formats since 0.0.7, make_defs.mk
has changed, so please update your copy of this file when you git-pull. Specifically, we now define a new CFLAGS_KERNELS
variable that allows one to use different compiler flags when compiling kernels. It works like this: At compile time, make will use CFLAGS_KERNELS
to compile any source code that resides in any directory that begins with the name kernels
. My recommendation is to simply apply this naming convention to the symbolic link to your kernels directory that resides in your configuration directory. Thanks to Tyler for suggesting this change.
April 30, 2013
This version incorporates many small fixes and feature enhancements made during our SC13 collaboration.
April 13, 2013
Several changes regarding memory alignment were made since 0.0.5, including modifications to bli_config.h
. Also, this update fixes a few bugs.
March 24, 2013
The most obvious change in this version is the migration to the bli
function (and source code filename) prefix, from the old bl2
prefix, as well as a rename of the main BLIS header (blis2.h
-> blis.h
). The test suite seems to indicate that the change was successful.
A few other much more minor changes were made, one pertaining to a renamed constant in the _config.h
file.
March 15, 2013
The changes included in 0.0.4 mostly relate to the contiguous (static) memory allocator. The previous implementation was intended as a temporary solution that would work for benchmarking purposes, until enough other priorities had been tended to that I could go back and do it right.
I began with the assumption that the benefit of packing matrices into contiguous memory is non-negligible and worth the effort. Furthermore, we assume that:
- the only portable way to acquire contiguous memory is to reserve a region of static memory and manage it ourselves;
- the cache blocksizes used for one level-3 operation will be the same as those used for another level-3 operation, since all of them boil down to some form of matrix-matrix multiplication;
- only three types of contiguous memory will ever be needed (for level-3 operations): a block of matrix A, a panel of matrix B, or a panel of matrix C--and the last case is not commonly used;
- when a block or panel is to be acquired from the allocator, the caller knows which of the three types of memory is needed.
Given these assumptions, I was able to come up with an implementation that is simple, easy to understand, and thread-safe (provided you add OpenMP directives to protect the critical sections, which are clearly marked with comments). It can also both allocate and release in O(1) time. And of course, page-alignment is taken care of behind the scenes. So while it is not a generalized solution by any means, I think it will work very well for our purposes.
Also, note that based on the level of the overall matrix multiplication algorithm at which you parallelize, the minimum number of blocks/panels of each type of contiguous memory will vary. For example, if you want all of your threads to work on different iterations of a single rank-k update (via block-panel multiply), the threads share the packed panel of B, but each one needs memory to hold its own packed block of A. Thus, the memory allocator needs to be initialized so that it contains enough memory for at least one panel of B and at least t blocks of A, where t is the number of threads. All of this can be adjusted at configure-time in bl2_config.h
.
February 22, 2013
The biggest change in this version is that the BLAS-to-BLIS compatibility layer is now available. Virtually every BLAS interface is included, even those corresponding to functionality that BLIS does not implement (such as banded and packed level-2 operations). If the application code attempts to call one of these unimplemented routines, the code aborts with a generic not-yet-implemented error message.
The compatibility layer is enabled via a configuration option in bl2_config.h
. For now, it is enabled by default (provided you have an up-to-date copy of bl2_config.h
).
February 11, 2013
Most notably, this version contains the new test suite I've been working on for the last month.
What is the test suite? It is a highly configurable test driver that allows one to test an arbitrary set of BLIS operations, with an arbitrary set of parameter combinations, and matrix/vector storage formats, as well as whichever datatypes you are interested in. (For now, only homogeneous datatyping is supported, which is what most people want.) You can also specify an arbitrary problem size range with arbitrary increments, and arbitrary ratios between dimensions (or anchor a dimension to a single value), and you can output directly to files which store the output in matlab syntax, which makes it easy to generate performance graphs.
BLIS developers: note that 0.0.2 makes small changes to the configuration files. This new version also contains many bug fixes. (Most of these fixes address bugs which were found using the test suite.)
December 10, 2012
- Added auto-detection of string version (via
git
). - Wrote basic INSTALL, CHANGELOG, AUTHORS, and CREDITS files.
- Updates to standalone
test
directoryMakefile
. - Added initial build system
- Various code reorganizations.