Cannot create dataset COVIDx8 #173

Closed
GliozzoJ opened this issue May 13, 2021 · 7 comments

Comments

@GliozzoJ

Dear all,

I would like to reproduce the results obtained with the model COVIDNet-CXR-2 (results reported here).
To this end, I am trying to create the dataset COVIDx8.
After downloading all the databases listed here, I used the script create_ricord_dataset.ipynb to pre-process the ricord images.
The first issue I found is at line 28 of create_ricord_dataset.ipynb, which I had to change from:

study_dir = os.path.join(ricord_dir, 'MIDRC-RICORD-1C-{}'.format(mrn), '*-{}'.format(uid))
to

study_dir = os.path.join(ricord_dir, 'MIDRC-RICORD-1C-{}'.format(mrn), '*{}'.format(uid))

This was necessary to match the hierarchy of the folders automatically created by the NBIA Data Retriever that I used to download the ricord database.
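
To illustrate why this one-character change matters, here is a minimal sketch using Python's `fnmatch`, which applies the same wildcard rules that `glob` applies to a single path component. The folder names below are hypothetical; the actual names produced by the NBIA Data Retriever may differ.

```python
from fnmatch import fnmatch

uid = '36965'

# Hypothetical study folder names (assumptions for illustration only).
with_hyphen = '12-29-1999-XR CHEST-36965'     # a hyphen directly precedes the UID
without_hyphen = '12-29-1999-XR CHEST 36965'  # no hyphen before the UID

old_pattern = '*-{}'.format(uid)  # pattern as originally written in the notebook
new_pattern = '*{}'.format(uid)   # modified pattern from this issue

for name in (with_hyphen, without_hyphen):
    # Print whether each folder name matches the old and the new pattern.
    print(name, fnmatch(name, old_pattern), fnmatch(name, new_pattern))
```

The looser pattern accepts both layouts, but it would also match a folder whose name merely ends in the same digits (for example a UID of 136965), so it may be worth checking that each pattern resolves to exactly one folder.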

Using this modified script, 24 images are removed from the ricord dataset because they are "in position LL". The script prints the following messages:

Image from MRN-419639-001634 Date-12-27-2003 UID-16722 in position LL
Image from MRN-419639-001686 Date-02-09-2004 UID-47369 in position LL
Image from MRN-419639-003089 Date-03-27-2005 UID-37417 in position LL
Image from MRN-419639-003089 Date-03-30-2005 UID-54764 in position LL
Image from MRN-SITE2-000045 Date-12-01-2005 UID-76077 in position LL
Image from MRN-SITE2-000046 Date-02-02-2002 UID-62756 in position LL
Image from MRN-SITE2-000078 Date-01-21-2006 UID-37750 in position LL
Image from MRN-SITE2-000101 Date-12-29-1999 UID-36965 in position LL
Image from MRN-SITE2-000129 Date-12-15-2005 UID-45395 in position LL
Image from MRN-SITE2-000148 Date-04-09-2000 UID-44042 in position LL
Image from MRN-SITE2-000149 Date-04-29-2001 UID-71428 in position LL
Image from MRN-SITE2-000176 Date-03-04-2000 UID-81030 in position LL
Image from MRN-SITE2-000186 Date-06-21-2005 UID-47380 in position LL
Image from MRN-SITE2-000190 Date-05-31-2008 UID-46535 in position LL
Image from MRN-SITE2-000190 Date-06-01-2008 UID-19302 in position LL
Image from MRN-SITE2-000199 Date-12-12-2003 UID-92778 in position LL
Image from MRN-SITE2-000199 Date-12-06-2003 UID-54181 in position LL
Image from MRN-SITE2-000199 Date-12-25-2003 UID-91718 in position LL
Image from MRN-SITE2-000199 Date-12-08-2003 UID-55518 in position LL
Image from MRN-SITE2-000210 Date-05-15-2004 UID-51719 in position LL
Image from MRN-SITE2-000237 Date-03-17-2005 UID-52517 in position LL
Image from MRN-SITE2-000248 Date-09-27-2005 UID-77857 in position LL
Image from MRN-SITE2-000249 Date-12-17-2003 UID-49234 in position LL
Image from MRN-SITE2-000267 Date-02-10-2002 UID-81231 in position LL
Created 1072 files

I used the obtained ricord images to create my final COVIDx8 dataset, using the script create_COVIDx_binary.ipynb, but I obtain the following error:

`---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
in
74 cv2.imwrite(os.path.join(savepath, 'train', patient[1]), gray)
75 else:
---> 76 copyfile(os.path.join(ds_imgpath[patient[3]], patient[1]), os.path.join(savepath, 'train', patient[1]))
77 train.append(patient)
78 train_count[patient[2]] += 1

~/anaconda3/envs/covid-net_py3.6/lib/python3.6/shutil.py in copyfile(src, dst, follow_symlinks)
118 os.symlink(os.readlink(src), dst)
119 else:
--> 120 with open(src, 'rb') as fsrc:
121 with open(dst, 'wb') as fdst:
122 copyfileobj(fsrc, fdst)

FileNotFoundError: [Errno 2] No such file or directory: '/home/jessica/prin_DNN_compression/dataset_COVIDxV8B/ricord_images/MIDRC-RICORD-1C-SITE2-000101-36965-0.png'`

It seems that one of the images removed by the script create_ricord_dataset.ipynb is actually necessary to build the dataset.

Moreover, I suspect that one or more of these images are present in the training and test label files (train_COVIDx8B.txt and test_COVIDx8B.txt), which I need in order to reproduce the results obtained by the model COVIDNet-CXR-2 using the script eval.py. In particular, the images I highlighted in bold in the preceding list are actually present in the file test_COVIDx8B.txt and are required to reproduce your results.
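
A quick way to confirm this kind of suspicion is to compare the filenames listed in a label file with the files actually present in the image folder. The sketch below is a hypothetical helper, not part of the repository. It assumes each label line is whitespace-separated with the filename in the second field (consistent with `patient[1]` in the traceback above) and that filenames contain no spaces. The patient ID, class, and source values in the sample lines are made up.

```python
def missing_images(label_lines, available_files):
    """Return filenames referenced by label lines but absent from available_files."""
    available = set(available_files)
    missing = []
    for line in label_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        filename = fields[1]  # assumption: the filename is the second field
        if filename not in available:
            missing.append(filename)
    return missing


# Illustrative data only: the first filename comes from the error above,
# everything else is invented for the example.
sample_labels = [
    'SITE2-000101 MIDRC-RICORD-1C-SITE2-000101-36965-0.png positive ricord',
    'patient-x example-image.png negative some-source',
]
sample_folder = ['example-image.png']

print(missing_images(sample_labels, sample_folder))
```

With real data, `label_lines` could come from `open('labels/test_COVIDx8B.txt')` and `available_files` from `os.listdir(...)` on the corresponding image folder.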

As a final attempt, I tried to simply remove this block of code from the script create_ricord_dataset.ipynb:

# Verify orientation
if ds.ViewPosition != 'AP' and ds.ViewPosition != 'PA':
    print('Image from MRN-{} Date-{} UID-{} in position {}'.format(mrn, date, uid, ds.ViewPosition))
    continue

I do not know if removing the verification of image orientation is safe, but in this way I can create the COVIDx8 dataset.
However, when I use the script eval.py to obtain the reported performance of the model COVIDNet-CXR-2 I obtain:

!python eval.py \
    --weightspath /home/jessica/prin_DNN_compression/COVID-Net-CXR-2 \
    --metaname model.meta \
    --ckptname model \
    --n_classes 2 \
    --testfile ./labels/test_COVIDx8B.txt \
    --testfolder /home/jessica/prin_DNN_compression/dataset_COVIDxV8B/data/test \
    --out_tensorname norm_dense_2/Softmax:0

[[194.   6.]
 [ 10. 190.]]
Sens Negative: 0.97000, Positive: 0.95000
PPV Negative: 0.95098, Positive: 0.96939
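
Regarding the orientation check discussed above: instead of deleting it, one option is to keep the check and record what it skips, so that the skipped images can later be compared against the label files. The following is a minimal sketch under stated assumptions: the view positions have already been read as strings (for example from the DICOM `ViewPosition` attribute), and the function name and record layout are hypothetical, not taken from the notebook.

```python
FRONTAL_VIEWS = {'AP', 'PA'}  # the two views accepted by the original check


def split_by_view(records):
    """Split (image_id, view_position) pairs into kept IDs and skipped pairs."""
    kept, skipped = [], []
    for image_id, view_position in records:
        if view_position in FRONTAL_VIEWS:
            kept.append(image_id)
        else:
            # Remember what was skipped and why, instead of only printing it.
            skipped.append((image_id, view_position))
    return kept, skipped


# Illustrative records: the LL entry mirrors one message from the log above,
# the other two identifiers are invented.
records = [
    ('SITE2-000101-36965', 'LL'),
    ('example-a', 'AP'),
    ('example-b', 'PA'),
]
kept, skipped = split_by_view(records)
print(kept)
print(skipped)
```

In the DICOM standard, `LL` denotes a left lateral view, which is why the notebook treats it differently from the frontal `AP` and `PA` views.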

These results are a bit different from the ones reported here.

Can you tell me if I am missing something to reproduce your results for the model COVIDNet-CXR-2?

Thank you in advance for your help.

All the best,
Jessica

@haydengunraj
Collaborator

Hi Jessica,

Thanks for the detailed explanation of the issue! We'll have to take a closer look into what's going on here, but in the meantime I'd recommend downloading the prepared version of the dataset to save yourself some time and headache. The data preparation scripts can be a little finicky, but the prepared version will exactly match the training and testing data used for CXR-2.

As a check, I re-downloaded the Kaggle data and ran eval.py with the CXR-2 model, yielding:

[[194.   6.]
 [  9. 191.]]
Sens Negative: 0.970, Positive: 0.955
PPV Negative: 0.956, Positive: 0.970

which matches the reported results.
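
For readers comparing the two outputs in this thread, the sketch below recomputes sensitivity and PPV from each confusion matrix. It assumes rows are true labels and columns are predictions, which is consistent with the values printed above; the function name is hypothetical and this is not the code used in eval.py.

```python
def sens_ppv(matrix):
    """Per-class sensitivity and PPV; rows are true labels, columns are predictions."""
    n = len(matrix)
    # Sensitivity: correct predictions divided by all samples of that true class.
    sens = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
    # PPV: correct predictions divided by all predictions of that class.
    ppv = [matrix[i][i] / sum(row[i] for row in matrix) for i in range(n)]
    return sens, ppv


prepared = [[194, 6], [9, 191]]   # matrix from the prepared Kaggle dataset
rebuilt = [[194, 6], [10, 190]]   # matrix from the rebuilt dataset in the first comment

for name, matrix in (('prepared', prepared), ('rebuilt', rebuilt)):
    sens, ppv = sens_ppv(matrix)
    print(name, [round(s, 3) for s in sens], [round(p, 3) for p in ppv])
```

The two matrices differ by a single positive test image (9 versus 10 false negatives), which accounts for the gap between 0.955 and 0.950 in positive sensitivity.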

@baranaldemir

Hello @haydengunraj
Is there a prepared version of the dataset for the 3-class version too? The dataset you shared above seems to be missing some images when I try to run train_tf.py for the 3-class version.

@haydengunraj
Collaborator

Hi @baranaldemir

On my end I seem to be able to run train_tf.py for 3-class prediction using the dataset from Kaggle. Note that you should use one of the 3-class models (e.g., CXR4-A) rather than CXR-2 (2-class). The command I used was:

python train_tf.py \
    --datadir /path/to/COVIDx8 \
    --weightspath /path/to/COVIDNet-CXR4-A \
    --ckptname model-18540 \
    --trainfile labels/train_COVIDx8A.txt \
    --testfile labels/test_COVIDx8A.txt \
    --n_classes 3 \
    --out_tensorname norm_dense_1/Softmax:0 \
    --logit_tensorname norm_dense_1/MatMul:0 \
    --training_tensorname batch_normalization_1/keras_learning_phase:0

The training gets through the first epoch without issue. For the baseline eval of CXR4-A I get:

[[ 94.   6.   0.]
 [  5.  95.   0.]
 [  4.   1. 195.]]
Sens Normal: 0.940, Pneumonia: 0.950, Covid-19: 0.975
PPV Normal: 0.913, Pneumonia: 0.931, Covid-19: 1.000

@GliozzoJ
Author

GliozzoJ commented Jun 3, 2021

Thank you for your reply.

In the end I managed to create the COVIDx8 dataset using the provided scripts and to reproduce the results of the model COVIDNet-CXR-2. I think I had a slightly old version of the repository, which hindered the creation of the dataset.
I wasn't aware that a prepared version of the dataset is available on Kaggle. I believe there is no mention of it in the documentation, or at least I did not find it.
I would suggest mentioning the availability of this dataset in the documentation, or making it more evident :)

Is there also an available prepared version of older datasets?

All the best,
Jessica

@GliozzoJ GliozzoJ closed this as completed Jun 3, 2021
@baranaldemir

baranaldemir commented Jun 4, 2021

@haydengunraj I used the same command and the Kaggle dataset, but the output says there are some missing files. I also created the dataset with create_COVIDx.ipynb, and it gave me the same output.

[image: screenshot attached in the original issue]

@haydengunraj
Collaborator

@GliozzoJ , we don't currently have prepared versions of the previous datasets, although we may have archived versions of them available internally. We may also be able to help with generating past versions from source if you have a particular version in mind.

@baranaldemir , are you using the most recent master branch of the repository? The results I mentioned above were obtained from a fresh pull of both the Kaggle data and codebase, so on our end everything appears to be working correctly. If the most recent master branch does not fix your issue (or if you're already using it), I'll try downloading everything again to see if I can reproduce the issue.

@baranaldemir

Yeah, the latest master branch did work. I'm so sorry for replying this late.
