Quasi-Newton convergence acceleration/stabilization of discrete adjoints #1020
Conversation
```cpp
 * \file CQuasiNewtonDriver.hpp
 * \brief Implements a method to accelerate and stabilize the convergence
 * of fixed-point iterations. The history of past iterates is used to compute
 * a least-squares approximation of the inverse of the Jacobian, which is then
 * used to correct the natural solution of the fixed-point iteration.
 * \note Based on the IQN-ILS method, see DOI 10.1007/s11831-013-9085-5 and
 * references therein.
```
All the implementation, explanation and references are in this file.
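For readers who do not want to open the reference right away, this is roughly the scheme (my notation, condensed from the IQN-ILS literature; H stands for the fixed-point operator, i.e. one adjoint iteration):
```latex
% One fixed-point step and its residual:
\tilde{x}^k = H(x^k), \qquad r^k = \tilde{x}^k - x^k
% Differences of past residuals/iterates w.r.t. the current ones:
V = [\, r^0 - r^k \;\cdots\; r^{k-1} - r^k \,], \qquad
W = [\, \tilde{x}^0 - \tilde{x}^k \;\cdots\; \tilde{x}^{k-1} - \tilde{x}^k \,]
% Least-squares coefficients and the corrected iterate:
\alpha = \operatorname*{arg\,min}_{\alpha} \| V\alpha + r^k \|_2, \qquad
x^{k+1} = \tilde{x}^k + W\alpha
```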
```cpp
/* DESCRIPTION: Number of samples for quasi-Newton methods. */
addUnsignedShortOption("QUASI_NEWTON_NUM_SAMPLES", nQuasiNewtonSamples, 0);
```
The feature is activated by setting a number of samples greater than 1.
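For example, in the config file (20 samples is the number suggested in the PR description below; 0 or 1 leaves the feature off):
```
% Quasi-Newton (IQN-ILS) stabilization of the discrete adjoint
QUASI_NEWTON_NUM_SAMPLES= 20
```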
```cpp
if (QNDriver.size()) {
  /*--- Feed the result of the latest fixed-point (adjoint) iteration
   * to the driver, then overwrite it with the corrected iterate. ---*/
  GetAllSolutions(ZONE_0, true, QNDriver.FPresult());
  SetAllSolutions(ZONE_0, true, QNDriver.compute());
}
```
And this is basically how it is used after running one discrete adjoint iteration.
```cpp
void shiftHistoryLeft() {
  for (Index i = 1; i < X.size(); ++i) {
    /*--- Swap instead of moving to re-use the memory of the first sample.
     * This is why X and R are not stored as contiguous blocks of mem. ---*/
    std::swap(X[i-1], X[i]);
    std::swap(R[i-1], R[i]);
  }
}
```
For the benefit of those less familiar with C++11: there are these things called move operations. This operation of shifting the matrices to the left is nearly zero cost, because swapping two su2matrices only swaps their pointers and sizes, it does not copy the actual data.
That is why, in this case, X and R are not stored as a single chunk of memory.
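A minimal sketch of why the swap is cheap (the Matrix type here is a hypothetical stand-in for su2matrix, not the actual class):
```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an SU2 matrix: it owns a heap buffer.
struct Matrix {
  double* data = nullptr;
  std::size_t rows = 0, cols = 0;
};

// std::swap exchanges the three members (one pointer, two sizes);
// the heap buffers themselves are never touched, so each swap is O(1)
// no matter how large the matrices are.
void shiftLeft(std::vector<Matrix>& hist) {
  for (std::size_t i = 1; i < hist.size(); ++i)
    std::swap(hist[i-1], hist[i]);
}
```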
Alright, swapping them X.size() times is in fact the obvious replacement for std::deque... :)
```cpp
/*--- Tiled part of the loop. ---*/
Index begin = 0;
while (end-begin >= BLOCK_SIZE) {
  computeNormalEquations<BLOCK_SIZE>(mat, rhs, begin);
  begin += BLOCK_SIZE;
}
```
For those interested in learning about performance techniques, I use something here called loop tiling (also known as loop blocking or strip mining).
We need to compute A^T A where A is tall and narrow. The natural "row dot column" algorithm is very inefficient because it makes poor use of the CPU cache: after doing the dot product of column 0 with itself, we go back to the beginning of column 0 to dot it with column 1, but by then the beginning of column 0 has been evicted from cache (and getting data from main memory is orders of magnitude slower than getting it from cache).
By tiling the loop we go over all the different combinations of columns without evicting anything from the cache, since we work on much smaller chunks of rows at a time.
For general matrix multiplication we would need to tile both rows and columns.
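A self-contained sketch of the technique on plain arrays (BLOCK plays the role of BLOCK_SIZE above and would be tuned to the cache size; this is an illustration, not the SU2 implementation):
```cpp
#include <algorithm>
#include <vector>

// Accumulate B = A^T * A for a tall, narrow, row-major A (n x m, n >> m),
// one block of rows at a time so the active rows stay resident in cache.
constexpr int BLOCK = 1024;

void normalEquationsMatrix(const std::vector<double>& A, int n, int m,
                           std::vector<double>& B) {
  B.assign(m * m, 0.0);
  for (int i0 = 0; i0 < n; i0 += BLOCK) {
    const int i1 = std::min(i0 + BLOCK, n);
    /*--- All column-pair combinations are formed from rows [i0, i1) only;
     * by the time we revisit a column, its block is still in cache. ---*/
    for (int j = 0; j < m; ++j)
      for (int k = j; k < m; ++k) {   // B is symmetric, fill upper triangle
        double sum = 0.0;
        for (int i = i0; i < i1; ++i)
          sum += A[i*m + j] * A[i*m + k];
        B[j*m + k] += sum;
      }
  }
  /*--- Mirror the upper triangle into the lower one. ---*/
  for (int j = 1; j < m; ++j)
    for (int k = 0; k < j; ++k)
      B[j*m + k] = B[k*m + j];
}
```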
```cpp
template<Index StaticSize>
void computeNormalEquations(su2vector<Scalar>& mat,
                            su2vector<Scalar>& vec,
                            Index start,
                            Index dynSize = 0) const {
  /*--- Select either the static or dynamic size, optimizes inner loop. ---*/
  const auto blkSize = StaticSize? StaticSize : dynSize;
```
Because we tiled the loop, we have many work chunks with a static size and one remainder chunk with a dynamic size (not known during compilation).
This method is templated on the static size so that we can "generate" two versions. For the version where the compiler knows the size of the loop, it can generate much more efficient code: the size is a multiple of the SIMD length (and likely also of the loop-unrolling count), which means only the necessary instructions are generated (better use of the instruction cache) and size checks are not needed (less branching).
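Putting the two pieces together, the call pattern is roughly as follows (a sketch based on the tiled loop shown earlier, not the exact source):
```cpp
/*--- Full tiles: the size is a compile-time constant, so the inner loop
 * is vectorized/unrolled with no remainder handling. ---*/
Index begin = 0;
while (end - begin >= BLOCK_SIZE) {
  computeNormalEquations<BLOCK_SIZE>(mat, rhs, begin);
  begin += BLOCK_SIZE;
}
/*--- Remainder chunk: StaticSize = 0 makes blkSize fall back to the
 * runtime argument, so the generic code path runs at most once. ---*/
if (end > begin)
  computeNormalEquations<0>(mat, rhs, begin, end - begin);
```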
Great stuff, thanks for taking the first step to integrate quasi-Newton techniques into the adjoint fixed-point iterations. I'm reviewing this over the next couple of days. Just one question right away so that I'm on the right track: it seems that this implementation is based on equation (130) from Degroote's paper, right? Though you're not using any QR decomposition to solve it, so what's the approach then?
Yes, equation (130). The approach to solve it is to form its normal equations (the ones mentioned in the code above) and solve those directly, no QR decomposition needed.
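That is, the correction coefficients come from the standard normal-equations form of the small least-squares problem (my notation, consistent with the sketch above):
```latex
\min_{\alpha} \| V\alpha + r \|_2
\;\;\Longleftrightarrow\;\;
(V^{\mathsf{T}} V)\,\alpha = -V^{\mathsf{T}} r
```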
oleburghardt left a comment:
Hi Pedro, just some small comments below. In general, I think it's good to be directly merged into the code. I like that the functionalities are well split, so that one can use this as a basis for other quasi-Newton stabilizations (that require keeping track of the solution and residual history).
Thanks for the review @oleburghardt, corrections coming up.
```cpp
  q.compute();
}

TEST_CASE("QN-ILS", "[Toolboxes]") {
```
@clarkpede @talbring, I've joined the unit-test bandwagon, let me know if the naming and so on is appropriate (I borrowed inspiration from the other tests, so I guess it's alright).
Proposed Changes
So I took a common method that people use to converge multiphysics problems expressed as fixed-point iterations (IQN-ILS) and applied it to the inner iterations of the discrete adjoint drivers (single and multizone). It seems to work well: I was having some issues in optimization edge cases where the primal solver does not converge so well, and this keeps the adjoint from diverging.
I'll post some results at some point.
Other than storing a number of solution snapshots (20 seems like a good number), the overhead is minimal (provided you have LAPACK/MKL, and compile with OpenMP, fast-math, and AVX support).
Related Work
Already includes #1015
Resolves #1021
Resolves #1025
Resolves #1029
PR Checklist