Some program problems when using my own dataset #1

Open
GoooDte opened this issue Jul 19, 2022 · 13 comments
Labels
question Further information is requested

Comments

GoooDte commented Jul 19, 2022

Hello! This is really excellent work! Thanks for releasing the Hugging Face Transformers-based version. Recently, I have been running experiments on some other datasets. Unfortunately, I ran into some problems when running the code:

  1. When clustering the datastore, kmeans.train() at kmeans.py line 41 reports the following error:
     [error screenshot; image not recoverable]
     My cudatoolkit version is 11.0 and my faiss-gpu version is 1.7.2.
  2. When getting the kNNs, the kNN search process reports the following error:
     [error screenshot; image not recoverable]

Maybe I'm not very familiar with the faiss-gpu package, so could you please help me figure out how to solve the above problems? Thanks a lot!

@GoooDte GoooDte added the question Further information is requested label Jul 19, 2022

urialon (Collaborator) commented Jul 20, 2022

Hi @GoooDte,
Thank you for your interest in our work!

Thank you for reporting these problems.

  1. I am not sure. Can you share your keys and the exact command line, and I will try running it myself?

  2. I am guessing that you are referring to our knn-transformers version, right? I just fixed the --k flag to be of type int rather than float. Can you please git pull and try again? I believe that will solve this problem. By the way, how are you using the code: with our example scripts, or did you modify them?

Best,
Uri
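For context, the fix described above is the usual one-line argparse change. A hypothetical sketch (the --k flag name is from this thread; the default value here is an assumption, not necessarily the repo's actual default):

```python
import argparse

# Hypothetical sketch of the fix described above: --k must be parsed as
# an int, since it is later used as a count of nearest neighbors.
parser = argparse.ArgumentParser()
parser.add_argument('--k', type=int, default=1024,
                    help='number of nearest neighbors to retrieve')

args = parser.parse_args(['--k', '32'])
assert isinstance(args.k, int)  # with type=float, args.k would be 32.0
```

With type=float, downstream code that slices or indexes with args.k raises a TypeError, which matches the kind of crash described above.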


GoooDte commented Jul 20, 2022

Thanks very much, Uri!

Your suggestion works: my second problem has been solved. I mainly use the code on some other datasets, so I only modified the preprocessing step a little.

I have searched online for my first error. The most likely cause is a version mismatch among faiss, faiss-gpu, cudatoolkit, and CUDA. So could you please tell me your exact versions of these packages?

Thanks a lot!


urialon commented Jul 20, 2022

I'm using:

  • CUDA 11.2
  • faiss-gpu 1.7.2
  • python 3.9
  • not sure about cudatoolkit; I run it on a shared server and I can't find the cudatoolkit version.

Questions for you:

  1. What is your python version?
  2. What is your operating system?
  3. Do you have both faiss and faiss-gpu installed? You should install only one of them. Did you install it using pip or conda? I found that on Linux it works best if you install it using pip, and on a Mac it works better if you install it using conda.

Best,
Uri


GoooDte commented Jul 20, 2022

I'm using python 3.6 on a Linux operating system.

I tried uninstalling faiss and using pip to reinstall faiss-gpu 1.7.2, but it still reports the error below:
Faiss assertion 'err == CUBLAS_STATUS_SUCCESS' failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<float, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<IndexType, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with AT = float; BT = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at /project/faiss/faiss/gpu/utils/MatrixMult-inl.cuh:265; details: cublas failed (13): (512, 512) x (13, 512)' = (512, 13) gemm params m 13 n 512 k 512 trA T trB N lda 512 ldb 512 ldc 13


urialon commented Jul 20, 2022

Can you try:

  1. Set gpu=False here: https://github.com/neulab/retomaton/blob/main/kmeans.py#L39
  2. I vaguely remember that faiss-gpu requires a higher Python version. Can you try creating a virtual environment or a conda environment with Python 3.7 or 3.9, and then reinstalling faiss-gpu in the new environment?
  3. If you are working with small sets, you can use faiss-cpu instead of faiss-gpu.

Please let me know how it goes.
Uri
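Regarding suggestion 3, if the datastore is small the clustering step can run entirely on the CPU. As a rough illustration of what that step computes, here is a minimal pure-NumPy Lloyd's-algorithm sketch; the function name and parameters are illustrative, not the repo's actual API:

```python
import numpy as np

# Minimal CPU k-means (Lloyd's algorithm) sketch: alternate between
# assigning each key to its nearest centroid and recomputing centroids
# as cluster means. This mirrors what faiss's kmeans.train() does.
def kmeans_cpu(keys, ncentroids, niter=10, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids from a random subset of the keys
    centroids = keys[rng.choice(len(keys), ncentroids, replace=False)]
    for _ in range(niter):
        # squared L2 distance from every key to every centroid
        d = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # recompute each centroid as the mean of its assigned keys
        for c in range(ncentroids):
            members = keys[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    return centroids, assign

keys = np.random.default_rng(1).normal(size=(200, 8)).astype('float32')
centroids, assign = kmeans_cpu(keys, ncentroids=4)
assert centroids.shape == (4, 8)
```

This brute-force version is quadratic in memory (keys x centroids), so it only makes sense for small datastores; that is exactly the case where faiss-cpu is a reasonable substitute for faiss-gpu.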


GoooDte commented Jul 25, 2022

Sorry for the long delay.

I tried your first suggestion on a small part of the wikitext dataset. It works, but I still needed to comment out lines 54-58 in kmeans.py.

Due to the large size of wikitext, clustering on the CPU is impractical. I tried installing CUDA 11.2 and constructing a new virtual environment with Python 3.9 and faiss-gpu 1.7.2, but it still reports the same error. So could you please provide detailed environment information (such as a requirements.txt file)?

Many thanks!


urialon commented Jul 25, 2022

What kind of GPU do you have?

I have read a bit online about the error you're getting, and some suggested that there's not enough GPU memory.


GoooDte commented Jul 25, 2022

My GPU is an NVIDIA GeForce RTX 3090.


urialon commented Jul 25, 2022

Can you take a look at the list of flags, and verify that all of them are correct?

There might be default values that I set which do not match your setup, such as the dimensions or the size of the datastore.

Another question: if you set the --sample flag to a much smaller value like 1000 - does anything change?


GoooDte commented Jul 25, 2022

My flag settings are below:
[screenshot of flag settings; image not recoverable]

I tried setting --sample to 100, and it adds a WARNING:
WARNING clustering 100 points to 13 centroids: please provide at least 507 training points


urialon commented Jul 25, 2022

Yeah, it just means that clustering 100 examples into 13 clusters is likely to result in "bad" clusters.

But does it work without errors? At what sample size does it crash?

I suspect that maybe the GPU memory is the limitation.

What is your overall datastore size? Only 1341?
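As an aside, the 507 in that warning comes from faiss's minimum-points heuristic: its k-means asks for at least 39 training points per centroid (the min_points_per_centroid default in its clustering parameters). A quick check:

```python
# faiss warns when there are fewer than 39 training points per centroid
# (min_points_per_centroid in its clustering parameters), which is
# where the 507 in the warning above comes from: 39 * 13 centroids.
MIN_POINTS_PER_CENTROID = 39

def min_training_points(ncentroids: int) -> int:
    return MIN_POINTS_PER_CENTROID * ncentroids

print(min_training_points(13))  # 507, matching the warning message
```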


GoooDte commented Jul 31, 2022

Sorry for the late reply again.

I solved my problem, somewhat by accident, by switching to another Linux server. The problem was indeed caused by the environment, but I still don't know exactly which environment runs 100% successfully. My new environment is CUDA 10.2, faiss-gpu 1.7.2, and Python 3.7.

Thank you for your attention to this issue for so long!


urialon commented Jul 31, 2022

Great, I'm glad to hear!
Let me know if you have any more questions.
