29 Jul 19:13

cliffburdick

fa9e872

v0.9.2 Latest

New operator: interp

Other Additions:

Improvements to sparse support including new batched tri-diagonal solver
Automatic vectorization and ILP support
DLPack updated to 1.1
Many bug fixes
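
For readers unfamiliar with tri-diagonal solves, the sketch below shows what such a solver computes, using the classic Thomas algorithm in plain Python. It is not MatX code and does not reflect how MatX or its backend libraries implement the solver; the function names and the per-row storage of the three diagonals are assumptions made for illustration.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve one tri-diagonal system with the Thomas algorithm.

    lower[i] is the entry left of the diagonal in row i (lower[0] is unused),
    upper[i] is the entry right of the diagonal in row i (upper[-1] is unused).
    No pivoting is performed, so the matrix is assumed to be well behaved
    (for example diagonally dominant).
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward sweep: eliminate the lower diagonal.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def solve_tridiagonal_batched(systems):
    """Hypothetical batched form: solve each independent system in turn."""
    return [solve_tridiagonal(*system) for system in systems]


# The matrix [[2, 1, 0], [1, 2, 1], [0, 1, 2]] applied to [1, 2, 3] gives [4, 8, 8].
system = ([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [4.0, 8.0, 8.0])
solution = solve_tridiagonal(*system)
```

A batched solver handles many such independent systems in one call, which is what makes the problem a good fit for a GPU.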

What's Changed

Fix partial any/all reduction by @simonbyrne in #959
interp1: add support for higher dimensional sample points and values by @simonbyrne in #963
Introduce DIA and SkewDIA format by @aartbik in #964
Refactor MATX_CUDA_CHECK to prevent multiple evaluation by @tmartin-gh in #957
Introduce DIA format factory method by @aartbik in #965
reformat sparse files with clang-format by @aartbik in #966
Implement DIA SpMV kernel by @aartbik in #967
Generalize SpMV from square to m x n DIA by @aartbik in #969
replace static_assert(false) with host-only THROW by @aartbik in #968
Generalize DIA to DIA-I and DIA-J by @aartbik in #972
Avoid name collision with cpu_set_t from sched.h by @tbensonatl in #971
Add axis argument to interp1. by @simonbyrne in #970
Add operator tests back by @cliffburdick in #977
clang-format on sparse tests by @aartbik in #973
Add SpMV test for DIA-I and DIA-J by @aartbik in #974
(re) enable all sparse tests by @aartbik in #979
Let X = solve(A, B) take X and B along rows by @aartbik in #981
Add tri-diagonal solve support by @aartbik in #982
update doc with latest DIA support by @aartbik in #983
minor sparse documentation refinement by @aartbik in #984
Updating Google Test by @cliffburdick in #985
Minor fix in UST level order for DIA by @aartbik in #986
Vectorization and ILP by @cliffburdick in #980
Fixing compile error with FFT conv by @cliffburdick in #989
Fixing another 12.9 compiler bug by @cliffburdick in #991
Removing unused parameter in lambda causing error on clang by @cliffburdick in #992
proper lvl2dim computation for add/sub by @aartbik in #994
add braces to if-then-else by @aartbik in #997
Avoid fmod becoming ambiguous once CCCL specializes it for extended floating point types by @miscco in #996
clang formatting by @aartbik in #998
implement batched tri-diagonal direct solve by @aartbik in #999
add streams to alloc/free in cusparse sequences by @aartbik in #1001
test for batched tri-diag direct solver by @aartbik in #1000
fix minor typos in comments by @aartbik in #1002
DLPack 1.1 update by @cliffburdick in #1004
Fix host compiler errors when using -Wall -Werror by @tmartin-gh in #1006
Fix ARM relocation truncation build errors by @dylan-eustice in #1008
Allocate pinned host memory instead of managed when managed isn't available by @cliffburdick in #1010
Added executor to cache by @cliffburdick in #1009
Remove template parameters in constructor by @cliffburdick in #1012
fix flipud for 1D tensors by @simonbyrne in #1011
Fix warnings in clang19 by @cliffburdick in #1015
Missing unit test syncs by @dylan-eustice in #1013
add convenience constructor for batched tri diag sparse tensor by @aartbik in #1019
Remove runtime checks on memory spaces by @aartbik in #1018
build each test file as a separate executable by @simonbyrne in #1017
use batched sparse solve for interp by @simonbyrne in #1016
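
Several of the changes above concern the DIA (diagonal) sparse format and SpMV on m x n matrices. As a rough illustration of the idea only, the plain-Python sketch below stores a matrix as a list of diagonals and multiplies it by a vector. The row-indexed layout, the `offsets`/`data` names, and the function itself are assumptions of this sketch; they are not MatX's DIA-I or DIA-J layouts or its kernel.

```python
def dia_spmv(offsets, data, x, num_rows):
    """Compute y = A @ x for a matrix A stored by diagonals.

    offsets[d] == k means diagonal d holds the entries A[i][i + k].
    data[d][i] is the entry of that diagonal in row i (row-indexed storage,
    an assumption of this sketch). Slots that fall outside the matrix are
    skipped, which is what lets the same loop handle non-square shapes.
    """
    num_cols = len(x)
    y = [0.0] * num_rows
    for k, diagonal in zip(offsets, data):
        for i in range(num_rows):
            j = i + k
            if 0 <= j < num_cols:
                y[i] += diagonal[i] * x[j]
    return y


# A = [[1, 2, 0],
#      [0, 3, 4]]  (2 x 3) has a main diagonal (1, 3) and a superdiagonal (2, 4).
offsets = [0, 1]
data = [[1.0, 3.0], [2.0, 4.0]]
y = dia_spmv(offsets, data, [1.0, 2.0, 3.0], num_rows=2)
```

Because each diagonal is a dense, regularly strided array, this format suits banded matrices such as the tri-diagonal systems mentioned in this release.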

New Contributors

@miscco made their first contribution in #996
@dylan-eustice made their first contribution in #1008

Full Changelog: v0.9.1...v0.9.2

Contributors

simonbyrne, miscco, and 5 other contributors

14 May 15:43

cliffburdick

v0.9.1

4475c22

v0.9.1

Sparse support + bugfixes

New operators: argminmax, dense2sparse, sparse2dense, interp1, normalize, argsort
Removed requirement for --relaxed-constexpr
Added MatX NVTX domain
Significantly improved speed of svd and inv
Python integration sample
Experimental sparse tensor support (SpMM and solver routines supported)
Significantly reduced FFT memory usage

What's Changed

Moving definition of CUB cache up by @cliffburdick in #771
Added documentation of memory types by @cliffburdick in #770
Cleaning up non-const operator() to avoid code duplication by @cliffburdick in #769
Switch to CUB/Thrust backend for cuda executor argmax by @tmartin-gh in #772
Refactor cub argmax to generic cub reduce, use for argmin. Fixes #774. by @tmartin-gh in #776
Change any() and all() to use CUB's reduce by @tmartin-gh in #777
Add argminmax operator by @tmartin-gh in #778
Fix matx::HostExecutor segfault with argmin/argmax by @tmartin-gh in #780
Added new cusolverDnXsyevBatched API for batched eigen calls for CTK 12.6.2 and up by @cliffburdick in #781
cub.h CUDACC guards for custom ops by @nvjonwong in #782
Add example compiled with host compiler to catch regressions. by @tmartin-gh in #783
Remove relaxed constexpr by @cliffburdick in #775
Cleanup versions.json so jq can parse it. by @alliepiper in #785
Allow rapids-cmake's version file to be overridden. by @alliepiper in #786
Update rapids-cmake (branch-24.12@03ec7ef) by @alliepiper in #787
Created MatX NVTX domain by @cliffburdick in #784
Update docs github action by @tmartin-gh in #789
Update docs github action by @tmartin-gh in #790
Work around compiler parser bug by @cliffburdick in #791
Updating developer documentation by @cliffburdick in #793
Modify concat op to enable concatenating float3. by @nvjonwong in #792
Fix rapids cmake by @alliepiper in #799
Switched to getRs instead of getRi for faster inverse by @cliffburdick in #797
Update CMakeLists.txt by @cliffburdick in #801
Support half precision R2C transforms by @cliffburdick in #796
Fix gcc13 erroneous warning by @cliffburdick in #802
fixed missing forwarding code for allocate by @aartbik in #804
Fix bug with eye, and also zero workspace before LU factorization by @cliffburdick in #807
Change shape_type for the remap op by @nvjonwong in #806
Faster batched SVD for small sizes by @cliffburdick in #805
Fixing broadcasting in all operator() by @cliffburdick in #795
Add a better error on memory allocation failure by @cliffburdick in #808
Fix solver interfaces to use executor in cache by @cliffburdick in #809
Python integration sample by @tmartin-gh in #812
Fixes for clang17 errors/warnings by @cliffburdick in #815
Misc Cleanup by @tmartin-gh in #814
frexp_fix by @cliffburdick in #817
Adding structures needed for sparse support by @cliffburdick in #819
fix missing newline at EOF (to avoid future diff issues) by @aartbik in #822
add size() to container storage by @aartbik in #824
minor edit for sparse (layout and proper swap def) by @aartbik in #820
add a to-string method for memory space by @aartbik in #823
Cleanup cmake usage when MatX is a dependent project by @tmartin-gh in #827
Fixing warnings issues by clang-19, both host and device by @cliffburdick in #825
Update build_docs actions to newest. Add CI_RUN_DATETIME in version.rst by @tmartin-gh in #829
introduce a versatile sparse tensor type to MatX (experimental) by @aartbik in #821
Add initial tiff support by @tmartin-gh in #831
Make dim2lvl translation for printing more in the style of MatX by @aartbik in #832
Expose tensor format (and lvl specs) to sparse tensor data by @aartbik in #833
Add cross product operator by @mfzmullen in #818
remove LVL depth restriction with constexpr templating by @aartbik in #834
Guard all DIM/LVL recursion against completely empty format by @aartbik in #835
Adjust half-type threshold for cross product unit tests by @mfzmullen in #838
Added fp32 version of normcdf by @cliffburdick in #839
Changing black scholes to float and improving performance by @cliffburdick in #840
Implement the () operator on sparse tensors by @aartbik in #837
Support operators into einsum interface by @cliffburdick in #845
Add print function with nonzero dim args by @tbensonatl in #844
Updated CCCL to fix regression in newer CTK versions by @cliffburdick in #846
First version of MATX SpMM (using dispatch to cuSPARSE) by @aartbik in #843
Moved sparse operator() into tensor_impl_t by @cliffburdick in #841
Adding timing metrics to CUDA and host executors by @cliffburdick in #842
Remove dense "testers" from the sparse tensor format type by @aartbik in #847
cuDSS by @cliffburdick in #848
Update deprecated CUB types by @cliffburdick in #851
Renamed versatile into universal for sparse tensor types by @aartbik in #850
Ignore incorrect gcc warning in einsum by @cliffburdick in #853
Added documentation on integrating with existing software by @cliffburdick in #852
Add compile-time check for minimum CUDA arch by @tbensonatl in #855
First version of MATX Sparse-Direct-Solve (using dispatch to cuDSS) by @aartbik in #849
First version of MATX sparse2dense conversion (dispatch to cuSPARSE) by @aartbik in #856
Improve cuFFT errors by @cliffburdick in #860
workaround for CTAD bug in NVC++ by @cliffburdick in #859
Add note about host-allocated memory to external guide by @cliffburdick in #862
Cleanup to use pass-by-reference more consistently by @aartbik in #861
Move empty storage construction to inline helper method by @aartbik in #857
Make CCCL copy false by @cliffburdick in #865
Remove test for free memory on FFTs by @cliffburdick in #864
Fix initializer list order by @tmartin-gh in #867
Initialize host cuRAND API when using host compiler by @cliffburdick in #866
Add user-friendly assertions to make_sparse_tensor by @aartbik in #869
Add "zero" matrix factory methods for COO, CSR, CSC by @aartbik in #870
First version of MATX dense2sparse conversion (dispatch to cuSPARSE) by @aartbik in #868
Add sparse factory method tests by @aartbik in #871
Enforce library restrictions on MatX transformations by @aartbik in #872
Add sparse conversion tests (dense2sparse, sparse2dense) by @aartbik in #873
Add sparse direct-solver tests by @aartbik in #874
Add SpMM tests by @aartbik in #875
Refactored OperatorTests.cu for faster compilation time by @cliffburdick in #876
Test feeding dense output as intermediate for the new sparse ops by @aartbik in #877
Use transitive include in benchmarks cmake by @cliffburdick in #880
Remove const qualifier on input to thrust ...

Contributors

alliepiper, simonbyrne, and 8 other contributors

15 Oct 18:12

cliffburdick

v0.9.0

af55b57

v0.9.0

Version v0.9.0 adds comprehensive support for more host CPU transforms such as BLAS and LAPACK, including multi-threaded versions.

Beyond the CPU support, there are many more minor improvements:

Added several new operators, including vector_norm, matrix_norm, frexp, diag, and more
Many compiler fixes to support a wider range of older and newer compilers
Performance improvements to avoid overhead of permutation operators when unnecessary
Much more!
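
As a reminder of what a frexp-style operation returns, the sketch below uses Python's standard `math.frexp`, which splits a float into a mantissa and a power-of-two exponent. It is assumed here, not confirmed by these notes, that the MatX operator follows the same C-library convention; the helper name is hypothetical.

```python
import math


def split_and_rebuild(x):
    """Split x into (mantissa, exponent) and rebuild it with ldexp."""
    mantissa, exponent = math.frexp(x)      # x == mantissa * 2**exponent
    rebuilt = math.ldexp(mantissa, exponent)
    return mantissa, exponent, rebuilt


parts = split_and_rebuild(8.0)
```

For nonzero finite inputs the mantissa has magnitude in [0.5, 1), so 8.0 splits into 0.5 and an exponent of 4.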

A full changelist is below

What's Changed

Update pybind11 to v2.12.0. Fixes issue #591. by @tmartin-gh in #604
Change print macro to matx namespaced function by @tmartin-gh in #607
Added frexp() operator by @cliffburdick in #609
Disable CUTLASS compile option by @cliffburdick in #610
Created dimensionless versions of ones() and zeros() by @cliffburdick in #611
Add smem-based polyphase channelizer kernel by @tbensonatl in #613
Eigen guide by @tylera-nvidia in #612
Multithreaded docs build Fix by @tylera-nvidia in #614
Fixed issues with static tensor unit tests compiling by @cliffburdick in #615
Implement csqrt by @tylera-nvidia in #619
Automatic Enumeration of NVTX Range IDs by @tylera-nvidia in #616
Fixing Clang errors to compile with clang-17 by @cliffburdick in #621
Update to CCCL 2.4.0 and fix CMake to not use system includes by @cliffburdick in #623
Remove options that nvc++ doesn't support by @cliffburdick in #624
Fixing some warnings on certain compilers by @cliffburdick in #625
More nvc++ warning fixes. Increase minimum supported CUDA to 11.5 by @cliffburdick in #627
More nvc++ fixes + code coverage generation by @cliffburdick in #628
fixed printing 0D tensors by @tylera-nvidia in #618
Remove conversion for double to half by @cliffburdick in #631
Add NVTX Tests for Code Coverage by @tylera-nvidia in #632
Feature/add complex cast operators by @tbensonatl in #633
Avoid array indices passthrough in matxOpTDKernel by @tbensonatl in #634
Add mixed precision support for channelize_poly by @tbensonatl in #640
Add test cases for stride kernels by @cliffburdick in #641
Basic synchronization support with sync() by @aayushg55 in #642
Converting old std:: types to cuda::std:: types by @cliffburdick in #629
Fix pybind iterator bug on newer g++ by @cliffburdick in #643
Initialize NVTX variable by @cliffburdick in #644
Fixed remaining nvc++ warnings by @cliffburdick in #645
Change cmake option/project order by @raplonu in #649
Change check on build type to avoid short circuiting by @cliffburdick in #647
Add complex cast operators for split inputs by @tbensonatl in #650
Added norm() operator by @cliffburdick in #620
Add zero-copy interface from MatX to NumPy by @cliffburdick in #653
Added host multithreading support for FFTW by @aayushg55 in #652
Fixed OpenMP compiler flags by @aayushg55 in #654
Fixed issue with operator types used as both lvalue/rvalue not assigning by @cliffburdick in #655
Smaller FFT test sizes for faster CI/CD by @aayushg55 in #656
Docs for matrix/vector norm by @cliffburdick in #657
Change matmul to use tensor_t temp until issue with impl is fixed by @cliffburdick in #658
Added plan caching for FFTW host plans by @aayushg55 in #659
Fixed fftw guards and temp allocation by @aayushg55 in #660
Fixed fftw guards to be fine-grained by @aayushg55 in #661
Enabled FFT conv for host by @aayushg55 in #662
NVPL BLAS Support by @aayushg55 in #665
Change supported CUDA to 11.8 by @cliffburdick in #670
enh: add macro to define cuda functions accessible at global scope by @mfzmullen in #668
Add workaround for pre-11.8 CTK smem init errors by @tbensonatl in #673
Fix to ConvCorr tests to skip host tests when host not enabled by @aayushg55 in #674
Expanded Host BLAS support by @aayushg55 in #675
Update README.md by @HugoPhibbs in #676
Improved the error messages when sizes are incompatible by @cliffburdick in #682
Added toeplitz operator by @cliffburdick in #683
Simplified cmake file so no definitions are required by default by @cliffburdick in #684
fix type for permuted ops in norm. by @luitjens in #696
Fix c++20 warning by @cliffburdick in #698
Update Cub Cache Creation to new Method by @tylera-nvidia in #694
Fixed base operator types by @cliffburdick in #703
Update slice.rst by @HugoPhibbs in #704
Fixed issues with host compiler with C++17 and C++20 modes by @cliffburdick in #706
NVPL LAPACK Solver Support on ARM by @aayushg55 in #701
Add detail:: namespace to CUB struct by @cliffburdick in #708
OpenBLAS LAPACK Solver Support for x86 by @aayushg55 in #709
Exclude examples/cmake_sample_project/build* from doxygen search by @tmartin-gh in #711
Fixed random pre/post run signature by @cliffburdick in #715
Rapids cmake 24 06 package by @cliffburdick in #716
Add support for UINT Generation by @tylera-nvidia in #695
Update svd docstring by @cliffburdick in #717
Solver SVD Optimizations and Improved cuSolver batching by @aayushg55 in #721
MATX_EN_CUTENSOR / MATX_ENABLE_CUTENSOR Unified Variable by @tylera-nvidia in #720
mtie should output the correct rank and size for the output operator. by @luitjens in #726
Update bug_report.md by @HugoPhibbs in #729
eliminate auto spills in permute by @luitjens in #731
Revert accidental commit to main by @cliffburdick in #734
Host Solver workspace query fix by @aayushg55 in #733
Add in-place transform support for inv() by @tbensonatl in #736
Allow access to Data() pointer from device by @tmartin-gh in #738
Use cublasmatinvBatched() for N <= 32 by @tbensonatl in #739
Added new pinv() operator and updated Reduced SVD by @aayushg55 in #740
optimize our iterator to avoid an unnecessary constructor call by @luitjens in #741
Updated Solver documentation by @aayushg55 in #742
Updated documentation for CPU support by @aayushg55 in #743
Slice optimizations to reduce spills by @cliffburdick in #732
Fixing shadow declaration by @cliffburdick in #745
Workaround for constexpr bug inside lambda in CUDA 11.8 by @cliffburdick in #671
Added diag operator taking 1D operator to generate 2D operator by @cliffburdick in #746
Add normcdf docs by @cliffburdick in #747
Refactor template arguments to reductions to force no permutes when unnecessary by @cliffburdick in #749
Adding workarounds for false positives on gcc14 by @cliffburdick in #751
Visibility fix for cache static deinit issue by @nvjonwong in #752
Don't allow in-place make_tensor to change ownership by @cliffburdick in #753
Fix for erroneous errors on gcc14.1 by @cliffburdick in #755
Create temp contiguous tensors if needed for sor...

Contributors

jjomier, raplonu, and 9 other contributors

04 Apr 17:27

cliffburdick

v0.8.0

7719779

v0.8.0

Release highlights:

Features
- Updated cuTENSOR and cuTensorNet versions
- Added configurable print formatting
- ARM FFT support via NVPL
- New operators: abs2(), outer(), isnan(), isinf()
- Many more CPU unit tests
Bug fixes for matmul on Hopper, 2D FFTs, and more
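
To make two of the new operators concrete, the plain-Python sketch below shows the math behind a squared absolute value and an outer product. The function names mirror the operator names for readability only; this is not the MatX API, and the list-based signatures are assumptions.

```python
def abs2(z):
    """Squared magnitude |z|^2, computed without the square root in abs()."""
    z = complex(z)
    return z.real * z.real + z.imag * z.imag


def outer(a, b):
    """Outer product: result[i][j] = a[i] * b[j]."""
    return [[x * y for y in b] for x in a]


squared = abs2(3 + 4j)
table = outer([1, 2], [3, 4])
```

Skipping the square root is the reason to offer a squared form separately: power computations on complex data rarely need the magnitude itself.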

Full changelist:

What's Changed

Increase cublas workspace to 32 MiB for Hopper+ by @tbensonatl in #545
matmul bug fixes. by @luitjens in #547
Added missing synchronization by @luitjens in #552
Refine some file I/O functions' doxygen comments by @AtomicVar in #549
Update docs by @tmartin-gh in #551
Export used environment variables in sphinx config by @tmartin-gh in #553
Import os by @tmartin-gh in #554
Add version info by @tmartin-gh in #555
Fix typo by @tmartin-gh in #556
Adds IsNan and IsInf Operators by @nvjonwong in #557
Use cmake project version info in sphinx config by @tmartin-gh in #560
outer() operator for outer product by @cliffburdick in #559
Fix nans in QR and SVD. by @luitjens in #558
Update CMakeLists.txt by @cliffburdick in #548
Fix CMake to allow multiple rapids-cmake to coexist by @cliffburdick in #562
Return 0D arrays for 0D shape in operators by @cliffburdick in #561
Fix NVTX3 include path by @AtomicVar in #564
Add .npy File I/O by @AtomicVar in #565
SVD & QR improvements by @luitjens in #563
chore: Fix typo s/whereever/wherever/ by @hugo-syn in #566
Add rapids-cmake-dir, if defined, to CMAKE_MODULE_PATH by @tbensonatl in #567
Add abs2() operator for squared abs() by @tbensonatl in #568
Fixed issue on g++13 with nullptr dereference that cannot happen at r… by @cliffburdick in #571
Force max(min) size of direct convolution dimension to be < 1024 by @cliffburdick in #573
Remove incorrect warning check for any compiler other than gcc by @cliffburdick in #577
stream memory cleanup by @cliffburdick in #579
Update reshape indices by @cliffburdick in #580
Update matlabpython.rst by @cliffburdick in #583
Prevent potential oob read in matxOpTDKernel by @tbensonatl in #586
Broadcast lower-rank tensors during batched matmul by @tbensonatl in #585
Fix bugs in 2D FFTs and add tests by @benbarsdell in #587
Added ARM FFT Support by @cliffburdick in #576
Various bug fixes for older compilers by @cliffburdick in #588
Renamed rmin/rmax functions to min/max and element-wise are now minimum/maximum to match Python by @cliffburdick in #589
Fix clang macro by @cliffburdick in #592
Fix misplaced sentence in README by @lucifer1004 in #594
Add configurable print formatting types by @tmartin-gh in #593
Fixing return types to allow either prvalue or lvalue in operator() by @cliffburdick in #598
Rework einsum for new cache style. Fix for issue #597 by @tmartin-gh in #599
Updated cutensornet to 24.03 and cutensor to 2.0.1 by @cliffburdick in #600
adding file name and line number to ease debug by @bhaskarrakshit in #601
Updating versions and notes for v0.8.0 by @cliffburdick in #602
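
The rename in #589 separates two different operations that are easy to confuse. The plain-Python sketch below shows the distinction using hypothetical helper names; it illustrates the semantics described in the PR title, not the MatX call signatures.

```python
def reduce_max(values):
    """Reduction: collapse a whole sequence to its single largest value."""
    return max(values)


def elementwise_maximum(a, b):
    """Element-wise: compare two sequences position by position."""
    return [max(x, y) for x, y in zip(a, b)]


a = [1, 7, 3]
b = [4, 2, 9]
largest = reduce_max(a)
pairwise = elementwise_maximum(a, b)
```

Code written against the old names needs a review after upgrading, since the reduction and the element-wise form now follow the Python-style naming.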

New Contributors

@hugo-syn made their first contribution in #566
@benbarsdell made their first contribution in #587
@lucifer1004 made their first contribution in #594
@bhaskarrakshit made their first contribution in #601

Full Changelog: v0.7.0...v0.8.0

Contributors

benbarsdell, luitjens, and 8 other contributors

04 Jan 21:06

cliffburdick

v0.7.0

13076b0

v0.7.0

Features

Convert libcudacxx to CCCL by @cliffburdick in #501
Add PreRun and tests for at/clone/diag operators by @tbensonatl in #502
Add explicit FFT length to fft_conv example by @tbensonatl in #503
Add Pre/PostRun support for collapse, concat ops by @tbensonatl in #506
polyval operator by @cliffburdick in #508
Optimize resample poly kernels by @tbensonatl in #512
Allow negative indexing on slices by @cliffburdick in #516
Automatically publish docs to GH Pages on merge to main by @tmartin-gh in #520
Add configurable precision support of print(). by @AtomicVar in #521
Make matxHalf trivially copyable by @tbensonatl in #513
Added operator for matvec by @cliffburdick in #514
New rapids and nvbench by @cliffburdick in #529
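
Polynomial evaluation of the kind a polyval operator performs is usually done with Horner's rule, sketched below in plain Python. The coefficient order (highest degree first, as in NumPy's `polyval`) is an assumption of this sketch and should be checked against the MatX documentation.

```python
def polyval(coeffs, x):
    """Evaluate a polynomial at x with Horner's rule.

    coeffs is ordered from the highest-degree term to the constant term
    (an assumption of this sketch).
    """
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result


# 2*x^2 + 3*x + 1 evaluated at x = 2
value = polyval([2.0, 3.0, 1.0], 2.0)
```

Horner's rule needs one multiply and one add per coefficient, so it avoids computing explicit powers of x.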

Fixes

Add FFT1D tensor size checks by @tbensonatl in #499
Fix errors that caused some unit tests to fail to compile. by @AtomicVar in #504
Fix upsample output size by @cliffburdick in #507
removing print characters accidentally left behind by @tylera-nvidia in #510
Renamed host executor and prepared for multi-threaded additions by @cliffburdick in #511
removing old hardcoded limit for repmat rank size by @tylera-nvidia in #515
Avoid async alloc in some Cholesky decomp cases by @tbensonatl in #517
Workaround for maybe_unused parse bug in old gcc by @tbensonatl in #522
Fix matvec output dims to match A rather than B by @tbensonatl in #523
Remove CUDA system include by @cliffburdick in #525
Zero-initialize batches field in CUB params by @tbensonatl in #527
Fixing host include guard on resample poly by @cliffburdick in #528
Update device.h for host compiler by @cliffburdick in #530
Made allocator an inline function by @cliffburdick in #532
Build and publish documentation on merge to main by @tmartin-gh in #533
Remove doxygen parameter to match tensor_t constructor signature by @tmartin-gh in #534
Update iterator.h by @cliffburdick in #536
Update Bug Report Issue Template by @AtomicVar in #539
Fix CCCL libcudacxx path by @cliffburdick in #537
Check matmul types and error at compile-time if the backend doesn't support them by @cliffburdick in #540
Fix batched cov transform by @tbensonatl in #541
Update caching for transforms to fix all leaks reported by compute-sanitizer by @cliffburdick in #542
Update docs for v0.7.0 by @cliffburdick in #544

Full Changelog: v0.6.0...v0.7.0

Contributors

ZJUGuoShuai, cliffburdick, and 3 other contributors

02 Oct 16:50

cliffburdick

v0.6.0

7b69822

v0.6.0

Notable Updates

Transforms as operators by @cliffburdick in #452
resample_poly optimizations and operator support by @tbensonatl in #465

Full changelog below:

What's Changed

Added upsample and downsample operators by @cliffburdick in #442
Added lvalue semantics to operators that needed it by @cliffburdick in #443
Added operator support to solver functions by @cliffburdick in #444
Added shapeless version of diag() and eye() by @cliffburdick in #445
Deprecated random interface by @cliffburdick in #446
Updated cuTENSOR/cuTensorNet and added example for trace by @cliffburdick in #447
Fixing host compilation where device code snuck in by @cliffburdick in #453
Added Protections for Shift Operator inputs and fixed issues with size/Shape returns for certain input sizes by @tylera-nvidia in #454
Added isclose and allclose functions by @cliffburdick in #448
Adds normalization options for fft and ifft by @nvjonwong in #456
Updated 0D tensor syntax and expanded simple radar pipeline by @cliffburdick in #458
Add initial polyphase channelizer operator by @tbensonatl in #459
Fixed inverse from stomping on input by @cliffburdick in #461
Fix cache issue with strides by @cliffburdick in #460
Added const to Pre/PostRun by @cliffburdick in #462
Revert inv by @cliffburdick in #463
Added proper LHS handling for transforms by @cliffburdick in #464
Updated incorrect license by @cliffburdick in #466
Use device mem instead of managed for fft workbuf by @tbensonatl in #467
Added at() and percentile() operators by @cliffburdick in #471
Add overlap operator by @cliffburdick in #472
Support stride 0 A/B batches for GEMMs by @cliffburdick in #473
Added FFT-based convolution to conv1d() by @cliffburdick in #475
Documentation cleanup by @tmartin-gh in #477
Adding FFT convolution benchmarks by @cliffburdick in #476
Fixed rank of output in matmul operator when A/B had 0 stride by @cliffburdick in #478
Updating header image by @cliffburdick in #480
Add pwelch operator by @tmartin-gh in #479
Docs cleanup. Enforce warning-as-error for doxygen and sphinx. by @tmartin-gh in #481
Fixes for CUDA 12.3 compiler by @cliffburdick in #483
Update pwelch.h by @cliffburdick in #486
Fixes for new compiler issues by @cliffburdick in #488
Fixing sample Cmake Project by @tylera-nvidia in #489
Update base_operator.h by @cliffburdick in #490
Add window operator input to pwelch by @tmartin-gh in #491
Add PreRun methods for slice/fftshift operators by @tbensonatl in #493
PreRun support for r2c and other fft related fixes by @tbensonatl in #494
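
The isclose and allclose functions added in #448 are named after their NumPy counterparts. The plain-Python sketch below shows the NumPy tolerance test; the default tolerances and the asymmetric formula are NumPy's, and it is an assumption that MatX uses the same ones.

```python
def isclose(a, b, rtol=1e-5, atol=1e-8):
    """NumPy-style closeness test: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)


def allclose(xs, ys, rtol=1e-5, atol=1e-8):
    """True only if every pair of elements passes isclose."""
    return all(isclose(x, y, rtol, atol) for x, y in zip(xs, ys))


near = isclose(1.0, 1.0 + 1e-9)
far = isclose(1.0, 1.1)
```

Note that the relative term scales with `b` only, so swapping the arguments can change the result for values near the tolerance boundary.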

New Contributors

@tmartin-gh made their first contribution in #477

Full Changelog: v0.5.0...v0.6.0

Contributors

cliffburdick, tmartin-gh, and 3 other contributors

03 Jul 21:38

cliffburdick

v0.5.0

7457329

v0.5.0

Notable Updates

Documentation rewritten to include working examples for every function based on unit tests
Polyphase resampler based on SciPy/cuSignal's resample_poly

Full changelog below:

What's Changed

Modifies TensorViewToNumpy and NumpyToTensorView for rank = 5 by @nvjonwong in #427
NumpyToTensorView overload which returns new TensorView by @nvjonwong in #428
Added fftfreq() generator by @cliffburdick in #430
Latest NumpyToTensorView function requires complex conversion for complex types by @nvjonwong in #431
Fixed print function to work on device in certain cases by @cliffburdick in #436
Fixed unused variable warning by @cliffburdick in #435
Adding initial polyphase resampler transform by @tbensonatl in #437
Revamped documentation by @cliffburdick in #438
Fixing typo in Cholesky docs by @cliffburdick in #439
Added broadcasting documentation by @cliffburdick in #440
Broadcast docs by @cliffburdick in #441
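
For context on the fftfreq() generator added in #430, the plain-Python sketch below produces the FFT bin frequencies using the NumPy ordering (non-negative frequencies first, then negative ones). The signature and ordering are NumPy's and are assumed, not confirmed, to match the MatX generator.

```python
def fftfreq(n, d=1.0):
    """Sample frequencies for an n-point FFT with sample spacing d."""
    num_non_negative = (n - 1) // 2 + 1
    bins = list(range(num_non_negative)) + list(range(-(n // 2), 0))
    return [k / (n * d) for k in bins]


freqs = fftfreq(4)
```

With four samples at unit spacing this yields 0, 0.25, -0.5 and -0.25 cycles per sample, in that order.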

New Contributors

@nvjonwong made their first contribution in #427

Full Changelog: v0.4.1...v0.5.0

Contributors

cliffburdick, tbensonatl, and nvjonwong

02 Jun 15:17

cliffburdick

v0.4.1

e751134

v0.4.1

This is a minor release mostly focused on bug fixes for different compilers and CUDA versions. The one major addition is host support for all reductions through a single-threaded executor. Multi-threaded executor support is coming soon.

What's Changed

Host reductions by @cliffburdick in #385
Reduced cuBLASLt workspace size by @cliffburdick in #404
Fix benchmarks that broke with new executors by @cliffburdick in #405
All operator tests converted to use host and device, and improved 16b by @cliffburdick in #403
Add single argument copy() and copy() tests by @tbensonatl in #407
Add rank0 tensor remap support by @tbensonatl in #408
Add Mutex to support multithread NVTX markers by @tylera-nvidia in #406
Fix a few issues highlighted by linters/clang by @tbensonatl in #409
Fixed compilation for Pascal by @cliffburdick in #412
Fixed issue with constructor when passing strides and sizes by @cliffburdick in #413
CMake fixes found by user by @cliffburdick in #416
Update libcudacxx to 2.1.0 by @cliffburdick in #417
Fixed cupy check for unit tests, default constructors, and file IO by @cliffburdick in #419
Added delta degrees of freedom on var() to mimic Python by @cliffburdick in #421
Adding correct license on files that were wrong by @cliffburdick in #423
Fixed two issues with release mode and DLPack and reductions on the host by @cliffburdick in #424
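
The delta degrees of freedom option added to var() in #421 changes the divisor of the variance. The plain-Python sketch below shows the effect; the `ddof` parameter name follows NumPy and is an assumption about how the MatX argument is spelled.

```python
def var(values, ddof=0):
    """Variance with a configurable divisor of (n - ddof)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


samples = [1.0, 2.0, 3.0, 4.0]
population_variance = var(samples)        # divides by n
sample_variance = var(samples, ddof=1)    # divides by n - 1
```

Using `ddof=1` gives the unbiased sample estimate, while `ddof=0` gives the population variance.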

Full Changelog: v0.4.0...v0.4.1

Contributors

cliffburdick, tbensonatl, and tylera-nvidia

03 Apr 15:55

cliffburdick

v0.4.0

c9a2521

v0.4.0

New Features

slice optimization to use builtin tensor function when possible by @luitjens in #360
Slice support for std::array shapes by @luitjens in #363
svd power iteration example, benchmark and unit tests. by @luitjens in #366
matmul: support real/complex tensors by @kshitij12345 in #362
Adding sign/index operators by @luitjens in #369
optimized cast and conj op to return a tensor view when possible. by @luitjens in #371
implement QR for small batched matrices. by @luitjens in #373
Implement block power iteration (qr iterations) for svd by @luitjens in #375
Added output iterator support for CUB sums, and converted all sum() by @cliffburdick in #380
Removing inheritance from std::iterator by @cliffburdick in #381
DLPack support by @cliffburdick in #392
Adding ref-count for DLPack by @cliffburdick in #394
updating cub optimization selection for >= 2.0 by @tylera-nvidia in #395
Refactored make_tensor to allow lvalue init by @cliffburdick in #397
Updated notebook documentation and refactored some code by @cliffburdick in #398
Allow 0-stride dimensions for cublas input/output by @tbensonatl in #400
16-bit float reductions + updated softmax by @cliffburdick in #399

Bug Fixes

Fix Duplicate Print and remove member prints by @tylera-nvidia in #364
cublasLT col major detection fix. by @luitjens in #368
Fixes for 32b mode by @cliffburdick in #388
Fixed a bogus maybe-uninitialized warning/error in release mode by @cliffburdick in #389
Fixed issue with using const pointers by @cliffburdick in #393
Generator Printing Patch by @tylera-nvidia in #370

New Contributors

@kshitij12345 made their first contribution in #362
@tbensonatl made their first contribution in #400

Full Changelog: v0.3.0...v0.4.0

Contributors

luitjens, kshitij12345, and 3 other contributors

20 Jan 19:43

cliffburdick

v0.3.0

20e00a2

v0.3.0

v0.3.0 marks a major release with over 100 features and bug fixes. Releases will be more frequent from now on to support users who do not track HEAD.

What's Changed

Added squeeze operator by @cliffburdick in #163
Change name of squeeze to flatten by @cliffburdick in #164
Updated version of cuTENSOR and fixed paths by @cliffburdick in #166
Added reduction example with einsum by @cliffburdick in #168
Fixed bug with wrong type on argmin/max by @cliffburdick in #170
Fixed missing return on operator() for sum by @cliffburdick in #171
Fixed error with reduction with invalid indices. Only shows up on Jetson by @cliffburdick in #172
Fixed bug with matmul use-after-free by @cliffburdick in #173
Added test for batched GEMMs by @cliffburdick in #174
Throw an exception if using SetVals on non-managed pointer by @cliffburdick in #176
Added missing assert in release mode by @cliffburdick in #178
Fixed einsum in release mode by @cliffburdick in #179
Updates to docs by @cliffburdick in #180
Added unit test for transpose and fixed bug with grid size by @cliffburdick in #181
Fix grid dimensions for transpose. by @galv in #182
Added missing include by @cliffburdick in #184
Remove CUB from sum reduction while bug is being investigated by @cliffburdick in #186
Fix for cub reductions by @luitjens in #187
Reenable CUB tests by @cliffburdick in #188
Fixing incorrect parameter to CUB sort for 2D tensors by @cliffburdick in #190
Remove 4D restriction on Clone by @cliffburdick in #191
Added support for N-D convolutions by @cliffburdick in #189
Download RAPIDS.cmake only if it does not exist. by @cwharris in #192
Fix 11.4 compilation issues by @cliffburdick in #195
Improve FFT batching by @cliffburdick in #196
Fixed argmax initialization value by @cliffburdick in #198
Fix issue #199 by @pkestene in #200
Fix type on concatenate by @cliffburdick in #201
Fix documentation typo by @dagardner-nv in #202
Missing host annotation on some generators by @cliffburdick in #203
Fixed TotalSize on cub operators by @cliffburdick in #204
Implementing remap operator. by @luitjens in #205
Update reverse/shift APIs by @luitjens in #207
batching conv1d across filters. by @luitjens in #208
Added Print for operators by @cliffburdick in #211
Complex div by @cliffburdick in #213
Added lcollapse and rcollapse operator by @luitjens in #212
Baseops by @luitjens in #214
Only allow View() on contiguous tensors. by @luitjens in #215
Remove caching on some CUB types temporarily by @cliffburdick in #216
Fixed convolution mode SAME and added unit tests by @cliffburdick in #217
Added convolution VALID support by @cliffburdick in #218
Allow operators on cumsum by @cliffburdick in #219
Using async allocation in median() by @cliffburdick in #220
Various CUB fixes -- got rid of offset pointers (async allocation + copy), allowed operators on more types, and fixed caching on sort by @cliffburdick in #222
Fixed memory leak on CUB cache bypass by @cliffburdick in #223
Update to pipe type through for scalars on set operation by @tylera-nvidia in #225
Added complex version of mean and variance by @cliffburdick in #227
Fixed FFT batching for non-contiguous tensors by @cliffburdick in #228
Added fmod operator by @cliffburdick in #230
Fmod by @cliffburdick in #231
Changing name to fmod by @cliffburdick in #232
Cloneop by @luitjens in #233
Making the shift parameter in shift an operator by @luitjens in #234
Change sign of shift to match python/matlab. by @luitjens in #235
Changing output operator type to by value to allow temporary operators to be used as an output type. by @luitjens in #236
Adding slice() operator. by @luitjens in #237
Fix cuTensorNet workspace size by @leofang in #241
adding permute operator by @luitjens in #239
Cleaning up operators/transforms. by @luitjens in #243
Rapids cmake no fetch by @cliffburdick in #245
Cleanup of include directory by @luitjens in #246
Fixed conv SAME mode by @cliffburdick in #248
Use singleton on GIL interpreter by @cliffburdick in #249
make owning a runtime parameter by @luitjens in #247
Fixed bug with batched 1D convolution size by @cliffburdick in #250
Adding 2d convolution tests by @luitjens in #251
Properly initialize pybind object by @cliffburdick in #252
Fixed sum() using wrong iterator type by @cliffburdick in #253
g++11 fixes by @cliffburdick in #254
Fixed size on conv and added benchmarks by @cliffburdick in #256
Adding unit tests for collapse with remap by @luitjens in #255
Collapse tests by @luitjens in #257
adding madd function to improve convolution throughput by @luitjens in #258
Conv opt by @luitjens in #259
Fixed compiler errors in release mode by @cliffburdick in #261
Add streaming make_tensor APIs. by @luitjens in #262
adding random benchmark by @luitjens in #264
remove deprecated APIs in make_tensor by @luitjens in #266
Host unit tests by @luitjens in #267
Fixed bug with FFT size shorter than length of tensor by @cliffburdick in #270
removing unused pybind call made before pybind initialize by @tylera-nvidia in #271
Fixed visualization tests by @cliffburdick in #275
Fix cmake function check_python_libs. by @pkestene in #274
Support CubSortSegmented by @tylera-nvidia in #272
Executor cleanup. by @luitjens in #277
Transpose operators changes by @luitjens in #278
Remove Deprecated Shape and add metadata to Print by @tylera-nvidia in #280
Update Documentation by @tylera-nvidia in #282
NVTX Macros by @tylera-nvidia in #276
Adding throw to file reading by @tylera-nvidia in #281
Adding str() function to generators and operators by @luitjens in #283
Added reshape op by @luitjens in #287
0D tensor printing was broken since they don't have a stride by @cliffburdick in #289
Allow hermitian to take any rank by @cliffburdick in #292
Hermitian nd by @cliffburdick in #293
Fixed batched inverse by @cliffburdick in #294
Added 4D matmul unit test and fixed batching bug by @cliffburdick in #297
Fixing batched half precision complex GEMM by @cliffburdick in #298
Rename simple_pipeline to simple_radar_pipeline for added clarity by @awthomp in #299
Remove cuda::std::min/max by @cliffburdick in #301
Fixed chained concatenations by @cliffburdick ...

Contributors

cwharris, galv, and 8 other contributors

Releases: NVIDIA/MatX
