
Wish to run omp_kmeans on 100G dataset #1

Open
ghost opened this issue Oct 20, 2012 · 8 comments

Comments


ghost commented Oct 20, 2012

Hi, I am planning to run this program on a dataset of almost 100 GB on my server (more than 200 GB of memory).
Could you please tell me how to make this work? I constantly get a 'segmentation fault' error once memory usage exceeds 4 GB.
I have checked that the BLAS and LAPACK libraries are all 64-bit versions.
omp_kmeans is also compiled with a 64-bit gcc compiler.

Thank you for your kindness.

serban (Owner) commented Oct 21, 2012

Hi there.

The build does not require any BLAS or LAPACK libraries, so don't worry about those.

Are you trying to use the CUDA version? If so, your dataset must fit in the RAM available to your GPU, which typically maxes out at 4 to 8 GB. A dataset of 100 GB is simply too large. I'd go so far as to say that CUDA won't bring you much benefit if you can't fit your dataset in the GPU memory because the time it takes to copy the data back and forth between the CPU memory and GPU memory would cripple the performance of the application.

Let me know if I can be of any more help.

Serban


ghost (Author) commented Oct 22, 2012

Hi,
Thank you for replying.

I ran the following command:

./omp_main -i ~/feature.txt -n 50 -p 12 -o

and got the following output once total RAM usage exceeded 3.7 GB:

Segmentation fault

feature.txt contains 87 GB of data; each vector has about 300,000 features.
I am not using CUDA.
Could you please tell me how to fix this error?

Thank you in advance.

ghost (Author) commented Oct 24, 2012

Hi there,

I think I found what caused the error.

In file_io.c, the program mallocs one large contiguous block of memory for all objects; the error occurs when that single contiguous allocation is too large.

Instead, I allocated memory for each object separately, roughly as in the sketch below.
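
A minimal sketch of the idea rather than my exact patch (alloc_objects, numObjs, and numCoords are illustrative names, and the 32-bit overflow note is my guess at why the big allocation failed):

#include <stdlib.h>

/* Sketch: allocate one row per object instead of one contiguous block.
 * Names are illustrative, not the exact file_io.c code. Computing
 * numObjs * numCoords * sizeof(float) in 32-bit int arithmetic would
 * overflow for an 87 GB dataset, which is one plausible cause of the
 * crash. */
float **alloc_objects(size_t numObjs, size_t numCoords)
{
    float **objects = malloc(numObjs * sizeof(float *));
    if (objects == NULL)
        return NULL;
    for (size_t i = 0; i < numObjs; i++) {
        /* Each request is only numCoords * sizeof(float) bytes, so no
         * single allocation has to cover the whole dataset. */
        objects[i] = malloc(numCoords * sizeof(float));
        if (objects[i] == NULL) {
            while (i > 0)
                free(objects[--i]);
            free(objects);
            return NULL;
        }
    }
    return objects;
}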

Thank you anyway for replying.


cvnerds commented Oct 6, 2016

Nvidia is ramping up their deep learning efforts, and you can now get up to 96 GB of graphics memory. It would be really cool if you could consider eliminating the 32-bit restriction in the CUDA code. For example, I noticed on a g2.2xlarge machine with the NVIDIA CUDA AMI (https://aws.amazon.com/marketplace/pp/B01LZMLK1K) that the read call in cuda_io.cu (for binary files) was limited to 2^31 bytes. That is a bit odd, because the machine supports 64-bit.
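
For what it's worth, looping over read() in chunks would sidestep that limit. A rough sketch, assuming the plain POSIX read() interface (read_all and CHUNK are illustrative names, not the actual cuda_io.cu code):

#include <errno.h>
#include <unistd.h>

/* Read len bytes in a loop: a single read() call transfers at most
 * about 2^31 bytes on Linux, so large files must be read in chunks. */
ssize_t read_all(int fd, void *buf, size_t len)
{
    const size_t CHUNK = (size_t)1 << 30; /* 1 GiB per read() call */
    size_t done = 0;
    char *p = buf;
    while (done < len) {
        size_t want = len - done;
        if (want > CHUNK)
            want = CHUNK;
        ssize_t n = read(fd, p + done, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted; retry the same chunk */
            return -1;      /* real I/O error */
        }
        if (n == 0)
            break;          /* unexpected end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}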


dinvlad commented Oct 6, 2016

FWIW g2.2xlarge uses a single GPU with 4GB of RAM:

High-performance NVIDIA GPUs, each with 1,536 CUDA cores and 4GB of video memory


dinvlad commented Oct 6, 2016

Also, 64-bit performance would suffer heavily because it needs the double-precision unit. Most GPUs (except the newest/upcoming Teslas) have minuscule capabilities for that, so a server CPU may easily outperform them. I agree the upcoming Pascals will be better suited for 64-bit, though (currently ~5 TFlops: http://www.nvidia.com/object/tesla-p100.html).

EDIT: the GPU can in fact handle 64-bit values with multi-instruction sequences, but that again may decrease performance: https://developer.nvidia.com/cuda-faq

EDIT2: double-precision performance refers to the ALU; for integers you'd need to rely on multi-instruction sequences.

EDIT3: well, double precision can be used to manipulate any integers up to 2^53 without loss of precision. That is more of a hack, though, and may not be well suited to memory addressing.


cvnerds commented Oct 6, 2016

AWS also recently released P2 instances, although the rollout doesn't seem to have finished in practice: https://aws.amazon.com/ec2/instance-types/p2/

vmarkovtsev commented:

Sorry for the shameless promotion, but anyone stuck with the 4 GB memory limit should try https://github.com/src-d/kmcuda. It supports as much memory as your GPU has, runs on multiple GPUs in parallel, and can handle data in float16 format with Kahan summation (hence double the effective data size). Still, 100 GB is too much, of course. I would do the following: pick the "best" X GB from the 100 GB, where X is the amount of memory your GPU has, cluster that subset, and then use the resulting centroids to assign the rest of the dataset (see the sketch below).
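
That final assignment step is just a nearest-centroid search, which can be streamed over the full dataset chunk by chunk. A rough sketch in C, with illustrative names and squared Euclidean distance assumed:

#include <float.h>
#include <stddef.h>

/* Sketch of the assignment step: label each remaining object with its
 * nearest centroid by squared Euclidean distance. Names are
 * illustrative; call this repeatedly on chunks of the 100 GB file. */
void assign_to_centroids(const float *objects,   /* numObjs x numCoords, row-major */
                         const float *centroids, /* numClusters x numCoords */
                         size_t numObjs, size_t numClusters, size_t numCoords,
                         int *membership)        /* out: one label per object */
{
    for (size_t i = 0; i < numObjs; i++) {
        float best = FLT_MAX;
        int bestIdx = 0;
        for (size_t k = 0; k < numClusters; k++) {
            float dist = 0.0f;
            for (size_t d = 0; d < numCoords; d++) {
                float diff = objects[i * numCoords + d]
                           - centroids[k * numCoords + d];
                dist += diff * diff;
            }
            if (dist < best) {
                best = dist;
                bestIdx = (int)k;
            }
        }
        membership[i] = bestIdx;
    }
}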
