-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorized algorithms not implemented for 1D and 2D geometry (even with the non-adaptive method) #428
Comments
In fact, vectorization is available in 2D, but not in the
|
I'm unlucky. |
There is a reason why it has not been done yet. We estimate that the performance improvement would probably not be significant. We could be wrong of course. We are currently working on vectorization for AM geometry but there is no schedule for vectorization in 1D for the moment. |
"There is a reason why it has not been done yet. We estimate that the performance improvement would probably not be significant. We could be wrong of course." The way I see it, it should accelerate every loops even in 1D Cartesian geometry! "We are currently working on vectorization for AM geometry but there is no schedule for vectorization in 1D for the moment." Ok. I'll have a look about what has been done in 2D Cartesian case and I'll try to extend it to the 1D one. En te remerciant, |
Ciao Michaël, |
Coucou Micka, Yes, it would be wonderful if it is implemented in 1D. I'm running expensive 1D-3V PIC simulations of TNSA using SMILEI resolving the few nm of rear side layer of Aluminum oxide contaminated by water vapor and hydrocarbures diffusion inside it. I've already run the simulations with laser pulses from 30 fs FWHM to few hundreds of fs FWHM interacting with a 10 microns thick Aluminum target on tesseract of DIRAC supercomputing center in UK. I wanted to run the less expensive simulations of 3 microns thick targets with the same pulse durations as mentioned before on my nice computer that has the AMD threadripper CPU (64 cores). The vectorization would make it more or less ~4 times faster the simulations in both cases according to my experience using OSIRIS using AVX or AVX2 and GNU/OpenMPI compilers concerning other kind of simulations (mainly relativistic collisionless shocks). I fully understand the problem of human resources. That is why I proposed myself to implement it if it was not plan on the short time scale. I would however earn time if you tell me the source files that should be modified (1D Yee scheme, 1D Esirkepov, 1D current deposit and 1D interpolations of fields and related input deck parameters management). I can basically copy / past what has already been done for the 2D Cartesian case. For sure, it won't be with the "adaptative mode" in a first time such as the already implemented 2D Cartesian case. Wishing you a good afternoon, |
Hi Michael, I can assist you in this work. First I would like to mention few things:
Where you need to work: The main operators that you have to vectorize are the interpolator and the projector. The pusher is already vectorized for all geometries. Maxwell solver is partially vectorized and quick enough. You will have to work in the directories If we look at the Interpolator directory, you will see that there is a class per geometry (1D, 2D 3D...), order (2 and 4) and level of optimization (scalar or vecto). For instance You will therefore have to create a class called Then to be able to use this class, you have to modify the factory: else if( ( params.geometry == "3Dcartesian" ) && ( params.interpolation_order == 2 ) ) {
if( !vectorization ) {
Interp = new Interpolator3D2Order( params, patch );
}
#ifdef _VECTO
else {
Interp = new Interpolator3D2OrderV( params, patch );
}
#endif you can see that there is a if-condition specific to vectorization, you have to do the same for 1D. The projector works the same way. |
Hello Xxirii, Thank you very much for all these important pieces of information. Do you confirm https://doi.org/10.1016/j.cpc.2019.05.001 is the paper that you mentioned? |
Hi, Do not hesitate to ask any question. Here or in Element. I confirm this is the paper I was mentioning. |
Merci beaucoup |
For your information, I will test in the next few days the possibility of 1D vectorized operators. I will let you know about the efficiency. |
So, I have tried a vectorized version of the interpolator. You can switch the content of the function void Interpolator1D2Order::fieldsWrapper( ElectroMagn *EMfields, Particles &particles,
SmileiMPI *smpi, int *istart, int *iend, int ithread, int ipart_ref )
{
std::vector<double> *Epart = &( smpi->dynamics_Epart[ithread] );
std::vector<double> *Bpart = &( smpi->dynamics_Bpart[ithread] );
int *iold = &( smpi->dynamics_iold[ithread][0] );
double *delta = &( smpi->dynamics_deltaold[ithread][0] );
int nparts = particles.size();
double * __restrict__ position_x = particles.getPtrPosition(0);
double * __restrict__ Epart_x= &( smpi->dynamics_Epart[ithread][0*nparts] );
double * __restrict__ Epart_y= &( smpi->dynamics_Epart[ithread][1*nparts] );
double * __restrict__ Epart_z= &( smpi->dynamics_Epart[ithread][2*nparts] );
double * __restrict__ Bpart_x= &( smpi->dynamics_Bpart[ithread][0*nparts] );
double * __restrict__ Bpart_y= &( smpi->dynamics_Bpart[ithread][1*nparts] );
double * __restrict__ Bpart_z= &( smpi->dynamics_Bpart[ithread][2*nparts] );
double * __restrict__ Ex = EMfields->Ex_->data();
double * __restrict__ Ey = EMfields->Ey_->data();
double * __restrict__ Ez = EMfields->Ez_->data();
double * __restrict__ Bx = EMfields->Bx_m->data();
double * __restrict__ By = EMfields->By_m->data();
double * __restrict__ Bz = EMfields->Bz_m->data();
int idx; //dual index computed
int ipx; //prim index computed
double xjn;
double xjmxi2;
double coeffd[3];
double coeffp[3];
#pragma omp simd private(xjn, xjmxi2, coeffd, coeffp, idx, ipx)
for( int ipart=*istart ; ipart<*iend; ipart++ ) {
//Interpolation on current particle
// Particle position (in units of the spatial-step)
xjn = position_x[ipart]*dx_inv_;
// Dual
idx = round( xjn+0.5 ); // index of the central point
xjmxi = xjn - ( double )idx +0.5; // normalized distance to the central node
xjmxi2 = xjmxi*xjmxi; // square of the normalized distance to the central node
// 2nd order interpolation on 3 nodes
coeffd[0] = 0.5 * ( xjmxi2-xjmxi+0.25 );
coeffd[1] = ( 0.75-xjmxi2 );
coeffd[2] = 0.5 * ( xjmxi2+xjmxi+0.25 );
idx -= index_domain_begin;
// Primal
ipx = round( xjn ); // index of the central point
xjmxi = xjn -( double )ipx; // normalized distance to the central node
xjmxi2 = xjmxi*xjmxi; // square of the normalized distance to the central node
// 2nd order interpolation on 3 nodes
coeffp[0] = 0.5 * ( xjmxi2-xjmxi+0.25 );
coeffp[1] = ( 0.75-xjmxi2 );
coeffp[2] = 0.5 * ( xjmxi2+xjmxi+0.25 );
ipx -= index_domain_begin;
// // Interpolate the fields from the Dual grid : Ex, By, Bz
Epart_x[ipart] = coeffd[0] * Ex[idx-1] + coeffd[1] * Ex[idx] + coeffd[2] * Ex[idx+1];
Bpart_y[ipart] = coeffd[0] * By[idx-1] + coeffd[1] * By[idx] + coeffd[2] * By[idx+1];
Bpart_z[ipart] = coeffd[0] * Bz[idx-1] + coeffd[1] * Bz[idx] + coeffd[2] * Bz[idx+1];
// Interpolate the fields from the Primal grid : Ey, Ez, Bx
Epart_y[ipart] = coeffp[0] * Ey[ipx-1] + coeffp[1] * Ey[ipx] + coeffp[2] * Ey[ipx+1];
Epart_z[ipart] = coeffp[0] * Ez[ipx-1] + coeffp[1] * Ez[ipx] + coeffp[2] * Ez[ipx+1];
Bpart_x[ipart] = coeffp[0] * Bx[ipx-1] + coeffp[1] * Bx[ipx] + coeffp[2] * Bx[ipx+1];
//Buffering of iol and delta
iold[ipart] = ipx;
delta[ipart] = xjmxi;
}
} Note that this version is supposed to update the scalar version and you don't have to activate the vectorization to use it. I did some tests using the Intel compiler. You have a substantial gain between 1 and 128 particles per cell. The speed-up is then reduced for larger numbers. This is not the typical behavior but here we do not use the cell sorting as it is usually the case in vecto. I suspect the particles to generate more cache misses when loading the fields from the grids a they get mixed. The performance results may depend on the patch size and the case. Smaller patch will increase the patch coherency but increase the time spent in particle synchronization. The code is also written to provide good performance with GNU and LLVM compilers. But I did not test them yet. |
Did we ever try such large numbers of particles per cell in 2D or 3D? |
Thank you very much @xxirii. I will test it this week and/or next week using GNU and OpenMPI compilers together with AVX and AVX2 vectorization compiling options on my threadripper desktop. I agree with you, it should depend on the case, the number of patches and the decomposition domain strategy, especially if no particle sorting is performed. What was the test case that you were simulating to perform this computational time scaling per particle per iteration per socket. If you were simulating a periodic collisionless plasma at Maxwell-Boltzmann equilibrium, it should have avoid this re-increase of computational time. I agree with you @mccoys : in 2D, when resolving the Debye length to simulate a collisionless plasma, a number of macroparticles per cell higher than 128 is quite scarce so in 3D... However, when simulating collisional plasmas using the multiple binary Coulomb collision Monte Carlo module, it is quite common to need more than 128 macroparticles per cell. For my part, I recently had to consider more than 128 macroparticles per cell even for collisionless plasmas when simulating relativistic collisionless shocks on very large time scales in order to limit the numerical heating. Thanking you a lot again, |
@mccoys : Yes, I did. In this case the time per particle in vectorized mode stabilizes or slightly increases but the difference with the scalar mode stays high. @michaeltouati I performed this test using a homogeneous well-balanced Maxwellian plasma. I always use such a case for preliminary performance tests. |
@xxirii Is this issue partly fixed now? |
It is now really an issue but a need for features. It is partially addressed with extension of the vectorization in 2D and vectorization of the interpolator in 1D. I think we can close it and keep it in our roadmap. |
Do you plan to add vectorization of SMILEI for 1D and 2D Cartesian geometry soon?
Since it is already implemented for simulation using a 3D Cartesian geometry, I think it wouldn't take a lot of time to extend it to the simpler 1D and 2D Cartesian geometry cases.
Waiting for your answer,
Best wishes,
Michaël
The text was updated successfully, but these errors were encountered: