Jetson TX2 issues running DeDe #278
Comments
Hi, the error involving the FYI, DD runs fine on TX1, so there shouldn't be too many issues with TX2. The Jetson is nice for prediction, but the GPU may not have enough memory for training the default architectures, for images for instance. I would recommend you start by trying out https://deepdetect.com/tutorials/imagenet-classifier/ if you haven't already, as well as https://deepdetect.com/tutorials/object-detector/. Please provide the output of Btw, thank you for using the issue template, it makes things much cleaner and faster for us to understand!
./ut_caffeapi
Running main() from gtest_main.cc
INFO - 18:43:24 - Device id: 0
INFO - 18:43:24 - Creating layer / name=mnist / type=Data
Not sure if the above helps or not. I will give the imagenet tests a try and report back.
The TX2 has a Pascal GPU, so you need to compile DD / Caffe with the correct CUDA compute capability code. The compute capability for TX1 and its Maxwell GPU is To specify the compute capability, look at the README, but basically you'd need to add
to your cmake call. Make sure to run
I was able to dig it up: it seems that the TX2 is 62, and when I tried
I saw it dropping back to 62 as well, so I compiled with Same results from ctest and from ./ut_caffeapi as above. I tried my same text test from before, and at least saw it get to about 20% training progress before the process stopped. Then I tried your suggested changes: no hard crash, but the job failed with the following message at about the 15% mark:
INFO - 23:02:44 - Ignoring source layer inputl
I will be doing the image test next, once the images are done downloading. In the meantime, is there anything else I should check or verify?
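For later readers: the 62 value can be cross-checked against the device lines that the server log in this issue prints ("Major revision number: 6", "Minor revision number: 2"). Below is a minimal Python sketch of that check. The helper name `compute_capability` is my own and is not part of DD or Caffe; it only assumes the log keeps the wording shown in this issue.

```python
import re

# Two device lines copied from the server log posted in this issue.
DEVICE_LOG = """\
INFO - 18:11:04 - Major revision number: 6
INFO - 18:11:04 - Minor revision number: 2
"""


def compute_capability(log_text):
    """Return the compute capability as a two-digit string, e.g. '62'.

    Hypothetical helper: it just joins the major and minor revision
    numbers found in the log text.
    """
    major = re.search(r"Major revision number:\s*(\d+)", log_text)
    minor = re.search(r"Minor revision number:\s*(\d+)", log_text)
    if major is None or minor is None:
        raise ValueError("device revision lines not found in log")
    return major.group(1) + minor.group(1)


print(compute_capability(DEVICE_LOG))
```

If the value printed here differs from the one passed to the build, the binary was probably compiled for another GPU generation.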
Have you tried the image classifier test? You don't need to download much for this.
You can also provide the full build log. We have some of the newest cards but no TX2 handy yet.
Here's a full build log (including me updating vars to get things to build, and showing 63 vs 62, etc.) I ran out of time today, and will try the image tests tomorrow when I'm awake.
So I was doing the full training for the image side. It was going well for a good long while (a couple of hours), and then I got this in the server output:
INFO - 21:26:00 - This network produces output loss3/top-1
ERROR - 21:26:00 - {"code":500,"msg":"InternalError","dd_code":1007,"dd_msg":"src/caffe/util/im2col.cu:61 / Check failed (custom): (error) == (cudaSuccess)"}
From the watch script on the example page: DD has stayed running, but the training aborts there.
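When comparing failures like this one across runs, it can help to pull the JSON payload out of the ERROR line and look at `dd_code` and `dd_msg` separately. A minimal Python sketch, assuming the payload always starts at the first `{` on the line; the function name `parse_dd_error` is made up for illustration and is not a DD API.

```python
import json

# Error line copied verbatim from the server output above.
ERROR_LINE = (
    'ERROR - 21:26:00 - {"code":500,"msg":"InternalError","dd_code":1007,'
    '"dd_msg":"src/caffe/util/im2col.cu:61 / Check failed (custom): '
    '(error) == (cudaSuccess)"}'
)


def parse_dd_error(line):
    """Decode the JSON object embedded in a server log line.

    Hypothetical helper: assumes the JSON starts at the first '{'.
    """
    start = line.index("{")
    return json.loads(line[start:])


err = parse_dd_error(ERROR_LINE)
print(err["dd_code"], err["dd_msg"])
```

The `dd_msg` field carries the Caffe source file and line of the failed check, which is usually the most useful part to quote in a report.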
You should really start with the prediction tutorials from pre-trained models. CudaSuccess errors are usually due to a lack of memory on your GPU. This is very likely happening at the testing phase. Try setting the
Sorry for the delay. I just loaded the prebuilt clothing model from the examples and it seems to be working. With the blogspot example link, a response takes around 2.7-3.1s. Would this be a typical response time? I do see the CPU spiking to around 50-70% when I run a query.
You are certainly not running on the GPU.
That was my fear, it seems a bit sluggish. Is there anything I can check? I'd be happy to run any tests, or anything you'd like.
Can you post the full API calls and the server logs? Basically there are three leads: the GPU is not activated, the build is incorrect, or there's a missing option in the calls. I can't find docs on the Caffe and TF builds on TX2, which may mean it's not a problem. On TX1 there's a script in the home directory that activates the GPU at full speed. Look for the .sh files.
Here are the calls and their respective outputs. I've run jetson clocks, which is the only .sh I can find, and I have run a couple of the examples that ship with JetPack, and they seem to work okay. Let me know if you'd like to look at anything else. I think at this point I'm going to reflash it with the latest JetPack again since I've tinkered with it so much, just to make sure the environment is sane before I push too much further.
Qdd You don't need to reflash IMO.
Hmm, I'll play around with compiling Caffe outside of DD, see what I can dig up, and report back. Thanks again!
If you can temporarily share access to your device, you can join gitter and PM me for details so I may help directly.
I was actually about to suggest that myself. I did some other tests, gave myself a clean start again just to make sure I didn't break anything with my tinkering, and rebuilt everything. Building Caffe directly seems to pass all the tests and I can see it using the CUDA cores, but a rebuild of DD with non-static and so forth seemed to give the same results. Let me work on getting the unit reachable from the outside world, and then I'll PM you the details. I have to be up in a few hours for a meeting and still haven't slept, so I'll be back as soon as I can :)
Build on TX2 that is working fine for me (~58ms on single image prediction with Googlenet and Caffe):
It seems it was working for me, so closing for now.
Configuration
Ubuntu 16.04.2 LTS on Jetson TX2
b42115e
Your question / the problem you're facing:
Error message (if any) / steps to reproduce the problem:
https://deepdetect.com/tutorials/txt-training/
If I follow the directions exactly, leave the archive in models/n20, and execute the two wget API calls, the server consumes 100% of the CPU and eventually crashes, with no output to the console (Example 1 below).
After removing the archive, cleaning and rerunning, it looks like it's processing, and I can clearly see it mentioning the Jetson CUDA cores, but the training results in an error (Example 2 below).
Example 1:
INFO - 18:07:27 - Creating layer / name=inputl / type=MemoryData
INFO - 18:07:27 - Creating Layer inputl
INFO - 18:07:27 - inputl -> data
INFO - 18:07:27 - inputl -> label
INFO - 18:07:27 - Setting up inputl
INFO - 18:07:27 - Top shape: 359 88631 1 1 (31818529)
INFO - 18:07:27 - Top shape: 359 (359)
INFO - 18:07:27 - Memory required for data: 127275552
INFO - 18:07:27 - Creating layer / name=ip0 / type=InnerProduct
INFO - 18:07:27 - Creating Layer ip0
INFO - 18:07:27 - ip0 <- data
Killed
Example 2:
DeepDetect [ commit b42115e ]
INFO - 18:10:13 - Running DeepDetect HTTP server on localhost:8080
loaded vocabulary of size=88631
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0317 18:10:38.373723 14722 caffelib.cc:120] instantiating model template mlp
I0317 18:10:38.373878 14722 caffelib.cc:124] source=../templates/caffe//mlp/
I0317 18:10:38.373900 14722 caffelib.cc:125] dest=models/n20/mlp.prototxt
INFO - 18:10:38 - Fri Mar 17 18:10:38 2017 UTC - 127.0.0.1 "PUT /services/n20" 201 2099
INFO - 18:10:38 - Fri Mar 17 18:10:38 2017 UTC - 127.0.0.1 "POST /train" 201 0
I0317 18:10:38.557075 14736 txtinputfileconn.cc:68] txtinputfileconn: list subdirs size=20
I0317 18:10:51.864900 14736 txtinputfileconn.cc:186] vocabulary size=88631
data split test size=3770 / remaining data size=15078
vocab size=88631
I0317 18:11:04.393987 14736 caffelib.cc:2583] user batch_size=300 / inputc batch_size=15078
I0317 18:11:04.394039 14736 caffelib.cc:2620] batch_size=359 / test_batch_size=290 / test_iter=13
INFO - 18:11:04 - Device id: 0
INFO - 18:11:04 - Major revision number: 6
INFO - 18:11:04 - Minor revision number: 2
INFO - 18:11:04 - Name: GP10B
INFO - 18:11:04 - Total global memory: 8235577344
INFO - 18:11:04 - Total shared memory per block: 49152
INFO - 18:11:04 - Total registers per block: 32768
INFO - 18:11:04 - Warp size: 32
INFO - 18:11:04 - Maximum memory pitch: 2147483647
INFO - 18:11:04 - Maximum threads per block: 1024
INFO - 18:11:04 - Maximum dimension of block: 1024, 1024, 64
INFO - 18:11:04 - Maximum dimension of grid: 2147483647, 65535, 65535
INFO - 18:11:04 - Clock rate: 1300500
INFO - 18:11:04 - Total constant memory: 65536
INFO - 18:11:04 - Texture alignment: 512
INFO - 18:11:04 - Concurrent copy and execution: Yes
INFO - 18:11:04 - Number of multiprocessors: 2
INFO - 18:11:04 - Kernel execution timeout: No
INFO - 18:11:04 - Initializing solver from parameters:
INFO - 18:11:04 - Creating training net specified in net_param.
INFO - 18:11:04 - The NetState phase (0) differed from the phase (1) specified by a rule in layer inputlt
INFO - 18:11:04 - The NetState phase (0) differed from the phase (1) specified by a rule in layer losst
INFO - 18:11:04 - Initializing net from parameters:
INFO - 18:11:04 - Creating layer / name=inputl / type=MemoryData
INFO - 18:11:04 - Creating Layer inputl
INFO - 18:11:04 - inputl -> data
INFO - 18:11:04 - inputl -> label
INFO - 18:11:04 - Setting up inputl
INFO - 18:11:04 - Top shape: 359 88631 1 1 (31818529)
INFO - 18:11:04 - Top shape: 359 (359)
INFO - 18:11:04 - Memory required for data: 127275552
INFO - 18:11:04 - Creating layer / name=ip0 / type=InnerProduct
INFO - 18:11:04 - Creating Layer ip0
INFO - 18:11:04 - ip0 <- data
INFO - 18:11:04 - ip0 -> ip0
INFO - 18:11:04 - Setting up ip0
INFO - 18:11:04 - Top shape: 359 200 (71800)
INFO - 18:11:04 - Memory required for data: 127562752
INFO - 18:11:04 - Creating layer / name=act0 / type=ReLU
INFO - 18:11:04 - Creating Layer act0
INFO - 18:11:04 - act0 <- ip0
INFO - 18:11:04 - act0 -> ip0 (in-place)
INFO - 18:11:04 - Setting up act0
INFO - 18:11:04 - Top shape: 359 200 (71800)
INFO - 18:11:04 - Memory required for data: 127849952
INFO - 18:11:04 - Creating layer / name=drop0 / type=Dropout
INFO - 18:11:04 - Creating Layer drop0
INFO - 18:11:04 - drop0 <- ip0
INFO - 18:11:04 - drop0 -> ip0 (in-place)
INFO - 18:11:04 - Setting up drop0
INFO - 18:11:04 - Top shape: 359 200 (71800)
INFO - 18:11:04 - Memory required for data: 128137152
INFO - 18:11:04 - Creating layer / name=ip1 / type=InnerProduct
INFO - 18:11:04 - Creating Layer ip1
INFO - 18:11:04 - ip1 <- ip0
INFO - 18:11:04 - ip1 -> ip1
INFO - 18:11:04 - Setting up ip1
INFO - 18:11:04 - Top shape: 359 200 (71800)
INFO - 18:11:04 - Memory required for data: 128424352
INFO - 18:11:04 - Creating layer / name=act1 / type=ReLU
INFO - 18:11:04 - Creating Layer act1
INFO - 18:11:04 - act1 <- ip1
INFO - 18:11:04 - act1 -> ip1 (in-place)
INFO - 18:11:04 - Setting up act1
INFO - 18:11:04 - Top shape: 359 200 (71800)
INFO - 18:11:04 - Memory required for data: 128711552
INFO - 18:11:04 - Creating layer / name=drop1 / type=Dropout
INFO - 18:11:04 - Creating Layer drop1
INFO - 18:11:04 - drop1 <- ip1
INFO - 18:11:04 - drop1 -> ip1 (in-place)
INFO - 18:11:04 - Setting up drop1
INFO - 18:11:04 - Top shape: 359 200 (71800)
INFO - 18:11:04 - Memory required for data: 128998752
INFO - 18:11:04 - Creating layer / name=ip2 / type=InnerProduct
INFO - 18:11:04 - Creating Layer ip2
INFO - 18:11:04 - ip2 <- ip1
INFO - 18:11:04 - ip2 -> ip2
INFO - 18:11:04 - Setting up ip2
INFO - 18:11:04 - Top shape: 359 20 (7180)
INFO - 18:11:04 - Memory required for data: 129027472
INFO - 18:11:04 - Creating layer / name=loss / type=SoftmaxWithLoss
INFO - 18:11:04 - Creating Layer loss
INFO - 18:11:04 - loss <- ip2
INFO - 18:11:04 - loss <- label
INFO - 18:11:04 - loss -> loss
INFO - 18:11:04 - Creating layer / name=loss / type=Softmax
INFO - 18:11:04 - Setting up loss
INFO - 18:11:04 - Top shape: (1)
INFO - 18:11:04 - with loss weight 1
INFO - 18:11:04 - Memory required for data: 129027476
INFO - 18:11:04 - loss needs backward computation.
INFO - 18:11:04 - ip2 needs backward computation.
INFO - 18:11:04 - drop1 needs backward computation.
INFO - 18:11:04 - act1 needs backward computation.
INFO - 18:11:04 - ip1 needs backward computation.
INFO - 18:11:04 - drop0 needs backward computation.
INFO - 18:11:04 - act0 needs backward computation.
INFO - 18:11:04 - ip0 needs backward computation.
INFO - 18:11:04 - inputl does not need backward computation.
INFO - 18:11:04 - This network produces output loss
INFO - 18:11:04 - Network initialization done.
INFO - 18:11:04 - Creating test net (#0) specified by net_param
INFO - 18:11:04 - The NetState phase (1) differed from the phase (0) specified by a rule in layer inputl
INFO - 18:11:04 - The NetState phase (1) differed from the phase (0) specified by a rule in layer loss
INFO - 18:11:04 - Initializing net from parameters:
INFO - 18:11:04 - Creating layer / name=inputlt / type=MemoryData
INFO - 18:11:04 - Creating Layer inputlt
INFO - 18:11:04 - inputlt -> data
INFO - 18:11:04 - inputlt -> label
INFO - 18:11:04 - Setting up inputlt
INFO - 18:11:04 - Top shape: 290 88631 1 1 (25702990)
INFO - 18:11:04 - Top shape: 290 (290)
INFO - 18:11:04 - Memory required for data: 102813120
INFO - 18:11:04 - Creating layer / name=ip0 / type=InnerProduct
INFO - 18:11:04 - Creating Layer ip0
INFO - 18:11:04 - ip0 <- data
INFO - 18:11:04 - ip0 -> ip0
INFO - 18:11:04 - Setting up ip0
INFO - 18:11:04 - Top shape: 290 200 (58000)
INFO - 18:11:04 - Memory required for data: 103045120
INFO - 18:11:04 - Creating layer / name=act0 / type=ReLU
INFO - 18:11:04 - Creating Layer act0
INFO - 18:11:04 - act0 <- ip0
INFO - 18:11:04 - act0 -> ip0 (in-place)
INFO - 18:11:04 - Setting up act0
INFO - 18:11:04 - Top shape: 290 200 (58000)
INFO - 18:11:04 - Memory required for data: 103277120
INFO - 18:11:04 - Creating layer / name=drop0 / type=Dropout
INFO - 18:11:04 - Creating Layer drop0
INFO - 18:11:04 - drop0 <- ip0
INFO - 18:11:04 - drop0 -> ip0 (in-place)
INFO - 18:11:04 - Setting up drop0
INFO - 18:11:04 - Top shape: 290 200 (58000)
INFO - 18:11:04 - Memory required for data: 103509120
INFO - 18:11:04 - Creating layer / name=ip1 / type=InnerProduct
INFO - 18:11:04 - Creating Layer ip1
INFO - 18:11:04 - ip1 <- ip0
INFO - 18:11:04 - ip1 -> ip1
INFO - 18:11:04 - Setting up ip1
INFO - 18:11:04 - Top shape: 290 200 (58000)
INFO - 18:11:04 - Memory required for data: 103741120
INFO - 18:11:04 - Creating layer / name=act1 / type=ReLU
INFO - 18:11:04 - Creating Layer act1
INFO - 18:11:04 - act1 <- ip1
INFO - 18:11:04 - act1 -> ip1 (in-place)
INFO - 18:11:04 - Setting up act1
INFO - 18:11:04 - Top shape: 290 200 (58000)
INFO - 18:11:04 - Memory required for data: 103973120
INFO - 18:11:04 - Creating layer / name=drop1 / type=Dropout
INFO - 18:11:04 - Creating Layer drop1
INFO - 18:11:04 - drop1 <- ip1
INFO - 18:11:04 - drop1 -> ip1 (in-place)
INFO - 18:11:04 - Setting up drop1
INFO - 18:11:04 - Top shape: 290 200 (58000)
INFO - 18:11:04 - Memory required for data: 104205120
INFO - 18:11:04 - Creating layer / name=ip2 / type=InnerProduct
INFO - 18:11:04 - Creating Layer ip2
INFO - 18:11:04 - ip2 <- ip1
INFO - 18:11:04 - ip2 -> ip2
INFO - 18:11:04 - Setting up ip2
INFO - 18:11:04 - Top shape: 290 20 (5800)
INFO - 18:11:04 - Memory required for data: 104228320
INFO - 18:11:04 - Creating layer / name=losst / type=Softmax
INFO - 18:11:04 - Creating Layer losst
INFO - 18:11:04 - losst <- ip2
INFO - 18:11:04 - losst -> losst
INFO - 18:11:04 - Setting up losst
INFO - 18:11:04 - Top shape: 290 20 (5800)
INFO - 18:11:04 - Memory required for data: 104251520
INFO - 18:11:04 - losst does not need backward computation.
INFO - 18:11:04 - ip2 does not need backward computation.
INFO - 18:11:04 - drop1 does not need backward computation.
INFO - 18:11:04 - act1 does not need backward computation.
INFO - 18:11:04 - ip1 does not need backward computation.
INFO - 18:11:04 - drop0 does not need backward computation.
INFO - 18:11:04 - act0 does not need backward computation.
INFO - 18:11:04 - ip0 does not need backward computation.
INFO - 18:11:04 - inputlt does not need backward computation.
INFO - 18:11:04 - This network produces output label
INFO - 18:11:04 - This network produces output losst
INFO - 18:11:04 - Network initialization done.
I0317 18:11:05.002513 14736 caffelib.cc:1614] filling up net prior to training
INFO - 18:11:04 - Solver scaffolding done.
ERROR - 18:13:09 - service n20 training status call failed
ERROR - 18:13:09 - {"code":500,"msg":"InternalError","dd_code":1007,"dd_msg":"./include/caffe/syncedmem.hpp:26 / Check failed (custom): *ptr"}
INFO - 18:13:09 - Fri Mar 17 18:13:09 2017 UTC - 127.0.0.1 "GET /train?service=n20&job=1" 200 2
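The "Memory required for data" figures in this log can be reproduced by hand: each value is the running count of elements in the top blobs created so far, times 4 bytes per float. A small Python sketch of that arithmetic follows. The shapes are copied from the log. The `ip0` weight estimate at the end is my own assumption (a 200 x 88631 float32 weight matrix plus 200 biases); it is not a number the log reports, and it leaves out gradients and solver history.

```python
BYTES_PER_FLOAT = 4  # the log figures are consistent with 4-byte floats


def blob_bytes(*shapes):
    """Total bytes needed for blobs with the given shapes."""
    total_elements = 0
    for shape in shapes:
        count = 1
        for dim in shape:
            count *= dim
        total_elements += count
    return total_elements * BYTES_PER_FLOAT


# Training net input layer: data blob plus label blob.
train_input = blob_bytes((359, 88631, 1, 1), (359,))
# Running total once the first InnerProduct output (359 x 200) is added.
after_ip0 = train_input + blob_bytes((359, 200))
# Test net input layer: data blob plus label blob.
test_input = blob_bytes((290, 88631, 1, 1), (290,))
# Assumption, not in the log: ip0 weights (200 x 88631) and 200 biases.
ip0_weights = blob_bytes((200, 88631), (200,))

print(train_input)   # 127275552, as in the log
print(after_ip0)     # 127562752, as in the log
print(test_input)    # 102813120, as in the log
print(ip0_weights)   # 70905600, estimate only
```

So the input blobs alone take roughly 127 MB for training and 103 MB for testing, and both scale linearly with the batch size and with the vocabulary size of 88631.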
I'm very new to deep learning/machine learning, and I'm hoping to learn more by getting everything running on my Jetson dev kit and using it via C# from my code. Even though I was able to get everything compiled, it seems like something might still be amiss.
ctest
Test project /usr/src/deepdetect/build
Start 1: ut_apidata
1/6 Test #1: ut_apidata ....................... Passed 0.50 sec
Start 2: ut_conn
2/6 Test #2: ut_conn ..........................***Failed 1.40 sec
Start 3: ut_jsonapi
3/6 Test #3: ut_jsonapi ....................... Passed 1.64 sec
Start 4: ut_caffe_mlp
4/6 Test #4: ut_caffe_mlp ..................... Passed 0.14 sec
Start 5: ut_caffeapi
5/6 Test #5: ut_caffeapi ......................***Exception: Other 1.82 sec
Start 6: ut_httpapi
6/6 Test #6: ut_httpapi .......................***Exception: Other 0.46 sec
50% tests passed, 3 tests failed out of 6
Total Test time (real) = 5.97 sec
The following tests FAILED:
2 - ut_conn (Failed)
5 - ut_caffeapi (OTHER_FAULT)
6 - ut_httpapi (OTHER_FAULT)
Errors while running CTest
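To compare ctest runs between builds without reading the whole output each time, the per-test lines can be reduced to a list of failures. A rough Python sketch, assuming the line layout shown in this issue; the function name `ctest_results` and the regular expression are my own and are not part of CTest.

```python
import re

# Per-test lines copied from the ctest output above.
CTEST_OUTPUT = """\
1/6 Test #1: ut_apidata ....................... Passed 0.50 sec
2/6 Test #2: ut_conn ..........................***Failed 1.40 sec
3/6 Test #3: ut_jsonapi ....................... Passed 1.64 sec
4/6 Test #4: ut_caffe_mlp ..................... Passed 0.14 sec
5/6 Test #5: ut_caffeapi ......................***Exception: Other 1.82 sec
6/6 Test #6: ut_httpapi .......................***Exception: Other 0.46 sec
"""

# Captures the test name and its status text, e.g. "Passed" or "***Failed".
LINE_RE = re.compile(
    r"^\s*\d+/\d+ Test\s+#\d+:\s+(\S+)\s+\.+\s*(.*?)\s+[\d.]+ sec$",
    re.MULTILINE,
)


def ctest_results(text):
    """Return a dict mapping test name to status string."""
    return {name: status for name, status in LINE_RE.findall(text)}


results = ctest_results(CTEST_OUTPUT)
failed = [name for name, status in results.items() if status != "Passed"]
print(failed)
print(f"{100 * (len(results) - len(failed)) // len(results)}% tests passed")
```

Running this on the output above lists the same three tests that ctest reports as failed.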
Any help would be greatly appreciated,
Thank you!