new version gives error on parsing *.JSON inputs (cuda/10.2.89) #99
Hi @Jaberh ,
It seems that it cannot parse your config as JSON for some reason, and then it cannot parse it as a plain-text config either (because it is not one), which results in the failure. It is odd that Release and Debug behave differently in this manner. Which host compiler do you use? Do you want to use config files or config strings in your final code?
This happens because AMGX parsed this config as plain text, which has slightly different syntax. The grid and solver stats parameters need to be explicitly specified for a solver. Again, I strongly recommend using JSON configs, but if you wish, I can explain how to use plain-text configs.
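For reference, a rough sketch of what an equivalent plain-text config with solver-scoped stats parameters might look like. The parameter names come from this thread; the `solver(name)=TYPE` / `name:param=value` scoping syntax is an assumption based on AMGX's plain-text config format and should be checked against the AMGX Reference:

```
config_version=2
solver(main)=AMG
main:algorithm=AGGREGATION
main:selector=SIZE_2
main:max_iters=1000
main:monitor_residual=1
main:print_solve_stats=1
main:print_grid_stats=1
```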
Hi Marsaev, my first choice was also the JSON file. The issue is that when I integrate AMGX into the bigger code, it does not iterate when using a JSON file. I used the config file from your example: standalone it works perfectly, but added to the bigger code the number of iterations stays at 0, using the same JSON file that works standalone. I think by now I have almost memorized your entire AMGX_Reference.pdf. I am using gcc/8.2.0 with mpich/3.2.1 for MPI and cuda/10.2.89.
You can also provide the JSON as a string. Here is an example of an extension of
You can also hardcode the config; for example, here I copy-pasted
and the output is also identical to what's above. For everything here I used the same gcc and a Release build. Would any of those options work for you in your app?
Interesting. I copied your hardcoded version and the error is still the same as before: AMGX ERROR: file *.cu line 215
Just to check, when you build Release AMGX, can you check that
I will check it now, but I remember the CMakeLists.txt enforces it to one; I will double-check again. Btw, spasiba for all your help. We did a fresh build; now for a simple test case the first option works and the second one still gives the same error. The only change I made to your config file is adding "store_res_history": 1. Now, with your proposed alternatives and the new build, I do not get the read error. However, if I use,
Did you get a chance to look at the above issue?
Hey Jaberh, I suspect none of your configs is parsed as expected. It would definitely help to print the exact solver components during solver construction just to confirm the solver structure, but there is no such functionality at the moment. Can we try to identify the issue step by step?
For the same config and input data, but for two ranks, it should be:
Hi Marsaev
Is it possible to load this example matrix into your industrial code to check whether the same solve can be repeated in your app?
replace it with:
and check that each rank solves this system identically and that each rank's AMGX output is similar to the output of the standalone example. If your distributed setup is homogeneous, there shouldn't be any major differences (IIRC one possible difference is the parallel reduction result, but it shouldn't affect the solve drastically). If the output is somehow different, can you try logging every AMGX API call just to check the order of calls and parameters? Sorry, there is no built-in logging of API calls, but if you wrap AMGX calls with error checking you can do something like:
Hi Marsaev
It's great that there are no errors, but it would also be good to see the actual order of the calls with parameters, hence the macro expansion with the hashtag.
Here is the output with parameters, and here is the same case called from a unit test and its output.
Just to check - are there
Yes, a singleton class is responsible for initialization and cleanup. As this library will be used by other developers, I wanted to prevent multiple calls. I did not put safe_call around initialize and finalize; that is why they are not shown there.
Alright, those API calls look good. I made some progress today: there are nested try/catch blocks inside the code, and the real issue is caught within the code, but the error reported to the C API is incorrectly identified.
Do you suspect that this might be an issue related to something lower level than the programming, such as the build, CUDA installation, drivers, ...?
Hi marsaev, I am git pulling it right now and will let you know.
I built this one, which has the latest commit (7b4d431). Unfortunately nothing changed: the unit test still works fine and the integrated version does not iterate. Same output as above.
Got it. Can you check what solver status is returned? Also, can you try adding "solver_verbose" = 1 for both the solver and the smoother in the config, so something like this for AGGREGATION_JACOBI:
You should see something like this in the output:
The info for Jacobi should be repeated for each AMG level.
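A sketch of how such a config might look. The structure is assumed from AMGX's AGGREGATION_JACOBI example and this thread; the exact key placement is not confirmed here:

```json
{
  "config_version": 2,
  "solver": {
    "solver": "AMG",
    "algorithm": "AGGREGATION",
    "selector": "SIZE_2",
    "solver_verbose": 1,
    "max_iters": 1000,
    "monitor_residual": 1,
    "smoother": {
      "solver": "BLOCK_JACOBI",
      "solver_verbose": 1
    }
  }
}
```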
Hi Marat, sure. I have been checking the solver return value, which is 0, and here is the output using your new config file.
Looks all right.
Exactly, the global initializer is called.
I don't have any ideas why it might be happening without hands-on access to the code, considering that the standalone execution functions properly. Is it for both Release and Debug? If it happens in Debug, can you try tracing through the solve process with gdb and see where the early exit happens? The functions of interest would be AMGX/base/src/solvers/solver.cu line 589 in d0019e5 and AMGX/base/src/solvers/solver.cu line 798 in d0019e5.
OK, this is very odd to me as well; I will build a debug version and monitor that function! It might be beneficial to share this with a build specialist at NVIDIA as well. I am building a debug version and digging into this. Spasiba
For debugging purposes you can set the coarse solver and smoother to NOSOLVER, so that only the AMG solver goes through this code path. AMG would still iterate in that case, but the residual should not decrease; this should be enough to debug the issue where it's not iterating at all. For example:
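A rough sketch of such a debugging config. The key names are assumptions based on AMGX's JSON format (the thread does not show the exact file), so treat this as illustrative only:

```json
{
  "config_version": 2,
  "solver": {
    "solver": "AMG",
    "algorithm": "AGGREGATION",
    "max_iters": 100,
    "monitor_residual": 1,
    "smoother": { "solver": "NOSOLVER" },
    "coarse_solver": { "solver": "NOSOLVER" }
  }
}
```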
I quickly tried this, no luck. I will try this in the debugger as well; I was going to put the breakpoint in the loop you mentioned above.
Hi Marat.
It seems that the produced code is too large for the linker to process. There are a number of ways to reduce the amount of generated code, but that might be a little too adventurous :) I can try to provide you a debug build on centos:centos6.9 with devtoolset-8 - would that work? Which cuda/mpi are you compiling against?
openmpi/3.1.3, cuda/10.2.89 |
Here is the comparison of what the unit test and the real code output when calling the same method:
I also noticed that this function
Assuming that initial residual is the same for both cases, converged() should return the same boolean for both cases, |
The solver_verbose option does not print out the tolerance, so I tracked down why that function returns differently. Apparently,
Great progress! Still, it would be good to understand where the incorrect read comes from: from the JSON parser, or whether the value is modified somewhere. I have sent you a debug binary if you are willing to track this issue further using gdb.
Sure, I am happy to help debug the issue, as I am past overdue to get this to work.
Basically, all the doubles are wrong.
Hi Marat,
Sorry, I was away for a holiday. I got your email from the GitHub commit logs, but I think it is wrong, since I got a not-delivered notification. You can get the binary here: https://drive.google.com/file/d/1qB1Q5SpqtsG54JVJrOst6S3lmFw56Lcd/view?usp=sharing
So, the value in
Here is the issue with RapidJSON; here is the buggy function in RapidJSON. I guess we have to debug third-party libs as well.
Was the N passed to that function wrong? I wonder why it worked in one case but not the other.
No, the real code uses a lot of static global stuff, so I think at some point it runs out. I don't like the lookup table there; bad practice.
I'm glad that you were able to identify the issue. I'm still not sure about the real reason for what's happening, but I agree that in our case the possible performance benefit of the lookup table is negligible and we can safely use pow. Since rapidjson is third-party code, let me clarify a few things about making changes and perform a few tests.
Thanks for the follow-up. I have one more issue to resolve: for certain cases, some of my ranks have 0 elements, which leads to a failure at matrix construction. What is the best way to go about this? By the way, the following might be better than good ole pow: inline double Pow10(int n) {
One more question: is it possible to disable the printout of "Using Normal MPI (Hostbuffer) communicator..."?
Yep, I will move it to a higher verbosity level. Thanks.
Hi Marat,
We don't handle such cases specifically, so the result would be unpredictable.
I think the most robust way to handle this is via communicators; it happens a lot in internal combustion engine simulations.
Yes, this definitely would be robust with regard to the solver; however, there would also be a toll for creating/releasing communicators for each such change. We also have plans to experiment with additional logic behind communications and exploiting topology; this might add load to the initialization for each MPI comm.
I'll close this issue for now. Please let me know if you have other questions.
I have one more question,
If I comment out AMGX_SAFE_CALL(AMGX_resources_destroy(m_resources)); then the error changes, which makes sense, as it detects the non-freed resources handle.
Hopefully this is the last test case; I would like your advice on this before debugging further.
It is probably related to
I am trying to integrate AMGX into an industrial code. If I build it in Debug mode, there are no parsing issues, but in Release mode it gives the following error; the input file is AGGREGATION_JACOBI.json:
Converting config string to current config version
Error parsing parameter string: Incorrect config entry (number of equal signs is not 1) : "config_version": 2
To make stuff more interesting, if I use AMGX_config_create(&m_config, "config_version=2,algorithm=AGGREGATION,selector=SIZE_2,print_grid_stats=1,max_iters=1000,monitor_residual=1,obtain_timings=1,print_solve_stats=1,print_grid_stats=1");
it refuses to print the grid and solver stats but accepts the rest of the inputs.