Plate list inconsistency #2

Closed
xiaohk opened this issue Dec 1, 2018 · 15 comments

@xiaohk

xiaohk commented Dec 1, 2018

Hello, thanks so much for providing a script to download all images.

It seems the plate lists from different sources are quite different. The paper and download_cil_images.sh list 406 plates. However, the Cell Image Library states there are 375 plates but gives only 373 links. It also lists many 20*** plates that are not in download_cil_images.sh. GigaDB also states 406 plates, but the provided md5sum.txt has only 349 entries. I also noticed that download_cil_images.sh has been edited, and GigaDB mentions 7 excluded plates.
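For anyone reproducing this comparison, here is a minimal sketch of how the plate lists could be checked against each other. It assumes each source has already been reduced to a list of plate IDs; the function name `compare_plate_lists`, the source labels, and the IDs `00001` and `00002` are placeholders for illustration, not values from any of the sources above.

```python
from typing import Dict, Iterable, Set


def compare_plate_lists(sources: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """For each source, return the plate IDs found in some other source but not in it."""
    plate_sets = {name: set(ids) for name, ids in sources.items()}
    # Union of every plate ID seen in any source.
    all_plates = set().union(*plate_sets.values())
    return {name: all_plates - ids for name, ids in plate_sets.items()}


# Placeholder IDs for illustration only; real lists would be parsed from each source.
example_sources = {
    "download_script": ["00001", "00002", "25575"],
    "gigadb_md5sum": ["00001", "25575"],
}
missing = compare_plate_lists(example_sources)
for name, ids in missing.items():
    print(name, "is missing", sorted(ids))
```

Comparing against the union rather than against one reference list avoids deciding in advance which source is authoritative.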

@shntnu Which plate list do you recommend using for analysis?

@shntnu
Contributor

shntnu commented Dec 5, 2018

Thanks for reporting this. tl;dr - I recommend using the plate list reported in the paper and download_cil_images.sh (n=406)

The numbers on the landing page of http://www.cellimagelibrary.org/pages/project_20269 are certainly misleading because they were later updated in the GigaDB paper, so you can ignore them (I will ask them to update the page). But all 406 plates should nonetheless be available via the Cell Image Library download link (which is in download_cil_images.sh). Let me know if that's not the case.

Regarding md5sum.txt - can you point me to that file?

@xiaohk
Author

xiaohk commented Dec 5, 2018

Thanks for the reply.

You can find md5sum.txt in the "file" section of http://gigadb.org/dataset/view/id/100351 . Its FTP url is ftp://parrot.genomics.cn/gigadb/pub/10.5524/100001_101000/100351/md5sum.txt .

This checksum file is very helpful for checking whether the downloaded metadata files are complete. It would be awesome if GigaDB could upload a similar checksum file for the raw image zip files as well.

Regarding md5sum.txt, it seems the md5 fingerprint for Plate_25575.tar.gz is incorrect. Our downloaded Plate_25575.tar.gz has the same fingerprint as the one listed, but it fails to extract.
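A matching checksum only shows that the download equals the hosted file, not that the hosted file is a valid archive. A minimal sketch of a separate readability check, using Python's standard `tarfile` module, could look like this; the function name `is_archive_readable` is hypothetical, and this is not necessarily how the failure above was detected.

```python
import tarfile
import zlib


def is_archive_readable(path: str) -> bool:
    """Return True if every regular file in a .tar.gz can be read to its end."""
    try:
        with tarfile.open(path, "r:gz") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                handle = archive.extractfile(member)
                # Read in 1 MiB chunks so large plates do not need to fit in memory.
                while handle.read(1024 * 1024):
                    pass
        return True
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        # Truncated or corrupt gzip/tar data surfaces as one of these errors.
        return False
```

Reading each member instead of extracting it keeps the check free of side effects on disk.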

@shntnu
Contributor

shntnu commented Dec 5, 2018

Ah, so those md5's correspond to the (per-plate) processed data. Is that what you are referring to as metadata files? I can look up why only 349 show up.

The md5's for the images would need to be provided by the Cell Image Library. I can request them but can't guarantee.

@xiaohk
Author

xiaohk commented Dec 5, 2018

Yes, those md5's correspond to the processed data.

@shntnu
Contributor

shntnu commented Dec 5, 2018

Got it

Do the md5's match up for the 349 that are available (except for Plate_25575, although the issue there seems to be different - md5 is correct but cannot extract, right?)

@xiaohk
Author

xiaohk commented Dec 5, 2018

Yes, all downloaded files whose md5 fingerprint matches md5sum.txt can be extracted (except 25575).

If a downloaded file has a different fingerprint, it was likely corrupted during download, so we just download it again.
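The check described here can be sketched roughly as follows, assuming the expected md5 for each file has already been read from md5sum.txt. The helper names `md5_of_file` and `needs_redownload` are hypothetical; the download step itself is left out.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_redownload(path: str, expected_md5: str) -> bool:
    """Return True when the local file's md5 differs from the expected value."""
    return md5_of_file(path) != expected_md5.strip().lower()
```

A caller would loop over the expected entries and fetch again any file for which `needs_redownload` returns True.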

@shntnu
Contributor

shntnu commented Dec 5, 2018

md5_of_tar_gz_files.txt
Can you check whether the md5's here are correct? This is the full 406, so I guess you'd need to check only 406-349
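To narrow down which entries need checking, the two checksum listings can be compared directly. The sketch below assumes both files use the usual `md5sum` output layout (digest, whitespace, file name), which has not been confirmed for either file. The names `parse_md5_listing` and `diff_listings`, the zero and one digests, and `Plate_00001.tar.gz` are placeholders.

```python
from typing import Dict, List, Tuple


def parse_md5_listing(text: str) -> Dict[str, str]:
    """Map file name to md5 digest from md5sum-style lines."""
    entries = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        # md5sum marks binary mode with a leading '*' on the file name.
        entries[name.strip().lstrip("*")] = digest.lower()
    return entries


def diff_listings(old: Dict[str, str], new: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (files only in the new listing, files whose digest changed)."""
    added = sorted(set(new) - set(old))
    changed = sorted(name for name in set(old) & set(new) if old[name] != new[name])
    return added, changed


# Placeholder listings; only the Plate_25575 digest in new_text comes from this thread.
old_text = "00000000000000000000000000000000  Plate_25575.tar.gz\n"
new_text = (
    "fc10288f8826d8d15a73edbbc0e6b214  Plate_25575.tar.gz\n"
    "11111111111111111111111111111111  Plate_00001.tar.gz\n"
)
added, changed = diff_listings(parse_md5_listing(old_text), parse_md5_listing(new_text))
```

Files in `added` have no earlier checksum to compare against, while files in `changed` are the ones where the two listings disagree.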

Meanwhile, I'll try to figure out 25575

@shntnu
Contributor

shntnu commented Dec 5, 2018

Plate_25575.tar.gz md5 should be fc10288f8826d8d15a73edbbc0e6b214, as listed in #2 (comment)

The one at http://gigadb.org/dataset/view/id/100351 is wrong; I will fix it once you confirm that the rest are good.

@shntnu
Contributor

shntnu commented Dec 5, 2018

LMK if you are able to download from https://s3.amazonaws.com/imaging-platform-collaborator/CDRP/Plate_25575.tar.gz
and verify the md5

@xiaohk
Author

xiaohk commented Dec 7, 2018

Thanks for providing md5_of_tar_gz_files.txt. The newly added md5's match our downloaded files, after redownloading a few plates.

The md5 for the new Plate_25575.tar.gz mentioned in #2 (comment) matches fc10288f8826d8d15a73edbbc0e6b214, and it can now be extracted.

@shntnu
Contributor

shntnu commented Dec 7, 2018

Great - thanks for confirming

For our records:

Shantanu has emailed gigascience to sort this out
Chris Hunter chris@gigasciencejournal.com
S.C. Edmunds scott@gigasciencejournal.com
GigaScience Journal editorial@gigasciencejournal.com
"GigaScience MS GIGA-D-16-00012 (morphological profiles)"
Dec 7, 2018, 10:42 AM

@xiaohk
Author

xiaohk commented Dec 7, 2018

Thank you so much, Shantanu. I would also appreciate it if you could request a similar md5sum file from the Cell Image Library.

@xiaohk xiaohk closed this as completed Dec 7, 2018
@only1chunts
Member

Hi,
Sorry for being late to the party! Every file in GigaDB has the md5sum value of that file available as file metadata. This can be viewed on the dataset page, file table (expand to view file attributes using the table settings if not already visible), or you can use the API to get the details e.g. http://gigadb.org/api/file?doi=100351 . API help page = http://gigadb.org/site/help#interface .
I have now updated the corrupt file (Plate_25575.tar.gz). Thank you, @shntnu.

@shntnu
Contributor

shntnu commented Dec 19, 2018

Thanks @only1chunts
Can you also update ftp://parrot.genomics.cn/gigadb/pub/10.5524/100001_101000/100351/md5sum.txt with https://github.com/gigascience/paper-bray2017/files/2650868/md5_of_tar_gz_files.txt?

@shntnu
Contributor

shntnu commented Mar 13, 2019

To close the loop on this, I just verified that #2 (comment) has been addressed. Thanks @only1chunts !
