
LU info #111

Merged · 7 commits · Oct 21, 2023

Conversation

@mgates3 (Collaborator) commented Sep 17, 2023

[Depends on #115]

Adds info error handling to LU and Aasen symmetric indefinite factorization and solves. Abbreviated output [outdated]:

```
slate/test> mpirun -np 4 ./tester --matrix rand,identity,one,zero,rand_zerocol0,rand_zerocol1.0,rand_zerocol1,rand_zerocol13,rand_zerocol0.5 --dim 137 --nb 32 --ib 8 getrf
% SLATE version 2023.08.25, id 61d7ba6c
% input: ./tester --matrix rand,identity,one,zero,rand_zerocol0,rand_zerocol1.0,rand_zerocol1,rand_zerocol13,rand_zerocol0.5 --dim 137 --nb 32 --ib 8 getrf
% 2023-09-16 22:00:32, 4 MPI ranks, CPU-only MPI, 4 OpenMP threads per MPI rank

A    m    n  nb  ib  p  q  la  pt     error  status
1  137  137  32   8  2  2   1   2  5.10e-19  pass
2  137  137  32   8  2  2   1   2  0.00e+00  pass
3  137  137  32   8  2  2   1   2  7.25e-03  FAILED  info = 2
4  137  137  32   8  2  2   1   2       inf  FAILED  info = 1
5  137  137  32   8  2  2   1   2  6.14e-03  FAILED  info = 1
6  137  137  32   8  2  2   1   2  6.16e-03  FAILED  info = 137
7  137  137  32   8  2  2   1   2  6.24e-03  FAILED  info = 2
8  137  137  32   8  2  2   1   2  6.48e-03  FAILED  info = 14
9  137  137  32   8  2  2   1   2  6.38e-03  FAILED  info = 69

% Matrix kinds:
%  1: rand, cond unknown
%  2: identity, cond = 1
%  3: one, cond = inf
%  4: zero, cond = inf
%  5: rand_zerocol0, cond = inf
%  6: rand_zerocol1.0, cond = inf
%  7: rand_zerocol1, cond = inf
%  8: rand_zerocol13, cond = inf
%  9: rand_zerocol0.5, cond = inf

% 7 tests FAILED: getrf
```

Currently, one inconsistency is that zerocolN takes a 0-based index N in [ 0, n-1 ], while the returned info is a 1-based index in [ 1, n ]; info = 0 is generally taken to mean "no error". For instance, above, rand_zerocol13 yields info = 14.

I think if info != 0, the test should be marked as "pass", since the routine is correctly catching the singularity. Also, the tester should inspect U and verify that U( i, i ) == 0. [Updated]

@mgates3 (Collaborator, Author) commented Sep 17, 2023

One bug discovered and fixed: gbtrf and hetrf were calling internal::getrf_panel with pivot_threshold = max_panel_threads, max_panel_threads = priority_1, and priority = 0 (the default), because pivot_threshold had been inserted as a new argument while the call sites still passed arguments positionally. The default values were removed to avoid this. Code change in gbtrf:

```diff
                 internal::getrf_panel<Target::HostTask>(
-                    A.sub(k, i_end-1, k, k), diag_len, ib, pivots.at(k),
-                    max_panel_threads, priority_1 );
+                    A.sub(k, i_end-1, k, k), diag_len, ib, pivots.at(k),
+                    pivot_threshold, max_panel_threads, priority_1, tag_0, &info );
```

(I moved pivots.at(k) up a line for clarity here.)

@neil-lindquist (Contributor) left a comment

The big picture looks good to me, although you forgot to add src/internal/internal_reduce_info.cc.

For the inconsistency between info and _zerocolN, one option would be to make info 0-indexed and define a constant like factor_success = -1. I don't think _zerocolN should be 1-indexed, since the generators are documented as 0-indexed (except orthog, but I'll fix that, since I just noticed another issue with riemann).

Review threads:
- src/gbtrf.cc (outdated; resolved)
- src/internal/Tile_getrf.hh (resolved)
- test/matrix_generator.cc (outdated; resolved)
@neil-lindquist (Contributor) left a comment

I found a few more places where this can be improved.

Review threads:
- src/gesv_mixed.cc (resolved)
- src/getrf.cc (outdated; resolved)
- src/internal/Tile_getrf_nopiv.hh (outdated; resolved)
- src/internal/internal_getrf.cc (outdated; resolved)
- src/internal/internal_getrf_tntpiv.cc (outdated; resolved)
- src/internal/internal_getrf_tntpiv.cc (resolved)
@neil-lindquist (Contributor) left a comment

Looks good to me.

@mgates3 added the "failing" label (CI checks are known to be failing) on Oct 10, 2023
@neil-lindquist (Contributor) commented Oct 12, 2023

I was able to un-stick the deadlock by calling internal::getrf_panel in hetrf with max_panel_threads = 1. It looks like the argument-shift issue with the pivot threshold meant that the master branch was effectively setting max_panel_threads = priority_1 = 1. (See this minor change, which results in a successful CI run.)

But I'm not sure why using multiple threads in hetrf's panel causes a deadlock. Presumably it didn't before threshold pivoting was added and scrambled what was being passed in for max_panel_threads.

@mgates3 (Collaborator, Author) commented Oct 21, 2023

The gpu_nvidia error was out of memory, in function stream_create, most likely an error on the CI machine due to a user allocating the whole GPU. Needs a rerun.
...
Passing now.

@mgates3 merged commit 8354553 into icl-utk-edu:master on Oct 21, 2023 (7 of 8 checks passed)
@mgates3 mentioned this pull request on Oct 23, 2023