CIF data as .json files #29

Open
maak-sdu opened this issue Jun 10, 2021 · 5 comments
@maak-sdu
Collaborator

We should capture this somewhere in our planning.

This is a direction we want to go in (document backend). It would involve modifying our load_cif capabilities. I would rather put this off for now, but maybe we should make an issue so we can return to it later. Could one of you please do that, and copy and paste the email contents into it? Perhaps we should also archive the data somewhere safe (Columbia Google Drive, with the URL captured in the issue, maybe?)

Thx

S

---------- Forwarded message ---------
From: Simon Westrip simonwestrip@btinternet.com
Date: Fri, Jun 4, 2021 at 11:29 AM
Subject: MRLP - some more test data
To: Nicola Ashcroft na@iucr.org, Simon Billinge sb2896@columbia.edu, Berrak Ozer bo2220@columbia.edu, Martin Karlsen maak@sdu.dk, Peter Strickland ps@iucr.org, Brian McMahon bm@iucr.org, Koh Song Sang koh@iucr.org, Dave Holden dh@iucr.org

Hopefully the following may be of use:
The zip file at https://publbio.iucr.org/services/tools/pdcifplot/cache/json.zip contains a collection of .json files which each contain an array of objects that contain 'pdcifplot' data, i.e. each object contains processed data corresponding to a datablock in a source CIF.

Although these objects are tailored for use in the online pdcifplot application, they could form the basis of a database of powder patterns, but obviously with additional fields optimized for input to the pydatarecognition script and for search purposes...

The json structure for each datablock is outlined below (this is not 'set in stone'):

{
// minimal metadata:

"_pdcifplot_article_id": "xx0000",           // the IUCr code for the associated article
"_pdcifplot_article_doi": "10.1107/...",     // doi of the article
"_pdcifplot_cif_id": "xx0000Isup2.rtv",      // the name of the supplementary CIF containing the source data
"_pdcifplot_block_id": "I",                  // the 'label' of the datablock in the source CIF

// 'x' data:

"_pdcifplot_xaxis_options": [                // a string list of 'keys' that identify the json properties containing the actual arrays of numeric data
                                             // ('keys' listed below will only be included if the data are in the source CIF)
   
  "_pdcifplot_pd_meas_2theta_range_",        // derived from CIF items _pd_meas_2theta_range_min/max/inc 
  "_pdcifplot_pd_meas_2theta_range_,Q",      // Q calculated from _pdcifplot_pd_meas_2theta_range_ 
  "_pdcifplot_pd_meas_2theta_range_,d",      // d space calculated from _pdcifplot_pd_meas_2theta_range_
  "_pdcifplot_pd_proc_2theta_range_",        // derived from CIF items _pd_proc_2theta_range_min/max/inc 
  "_pdcifplot_pd_proc_2theta_range_,Q",      // ...
  "_pdcifplot_pd_proc_2theta_range_,d", 
  "_pdcifplot_pd_meas_2theta_scan",          // CIF item _pd_meas_2theta_scan
  "_pdcifplot_pd_meas_2theta_scan,Q",        // Q calculated from _pd_meas_2theta_scan
  "_pdcifplot_pd_meas_2theta_scan,d",        // d space calculated from _pd_meas_2theta_scan
  "_pdcifplot_pd_meas_2theta_corrected",     // ...
  "_pdcifplot_pd_meas_2theta_corrected,Q",   
  "_pdcifplot_pd_meas_2theta_corrected,d",   
  "_pdcifplot_pd_proc_2theta_corrected",     
  "_pdcifplot_pd_proc_2theta_corrected,Q",   
  "_pdcifplot_pd_proc_2theta_corrected,d",  
  "_pdcifplot_pd_proc_d_spacing",
  "_pdcifplot_pd_proc_d_spacing,Q",
  "_pdcifplot_pd_proc_recip_len_q",
  "_pdcifplot_pd_proc_recip_len_q,d",
  "_pdcifplot_pd_meas_time_of_flight",
  "_pdcifplot_pd_proc_energy_incident",
  "_pdcifplot_pd_proc_energy_detection"
  
],

// the properties specified by "_pdcifplot_xaxis_options" (arrays of numeric data):

"_pdcifplot_pd_meas_2theta_range_": [ 5.019, 5.038, 5.057, ... ], 
"_pdcifplot_pd_meas_2theta_range_,Q": [ 0.356942, 0.358292, 0.359643, ... ],
"_pdcifplot_pd_meas_2theta_range_,d": [ 17.6051, 17.5388, 17.4729, ... ],
...

// 'y' data separated into observed and calculated data sets:

"_pdcifplot_yobs_options": [                // a string list of 'keys' that identify the json properties containing the actual arrays of numeric data
                                            // ('keys' listed below will only be included if the data are in the source CIF)
  "_pdcifplot_pd_meas_counts_total",        // correspond to the CIF items (su's removed)
  "_pdcifplot_pd_meas_intensity_total",
  "_pdcifplot_pd_proc_intensity_net",
  "_pdcifplot_pd_proc_intensity_total",
  "_pdcifplot_pd_meas_intensity_net"
],

// the properties specified by "_pdcifplot_yobs_options" (arrays of numeric data):

"_pdcifplot_pd_meas_counts_total": [ 82700, 81800, 82500, ...],
...


"_pdcifplot_ycalc_options": [               // a string list of 'keys' that identify the json properties containing the actual arrays of numeric data
                                            // ('keys' listed below will only be included if the data are in the source CIF)
  "_pdcifplot_pd_calc_intensity_total",     // correspond to the CIF items (su's removed)
  "_pdcifplot_pd_calc_intensity_net"
],

// the properties specified by "_pdcifplot_ycalc_options" (arrays of numeric data):

"_pdcifplot_pd_calc_intensity_total": [ 82600, 82300, 82000, ... ],
 ...
 

// other data useful for 'pdcifplot' purposes:

// 'y' background

"_pdcifplot_ybkg_options": [                // a string list of 'keys' that identify the json properties containing the actual arrays of numeric data
                                            // ('keys' listed below will only be included if the data are in the source CIF)
    "_pd_meas_counts_background",           // correspond to the CIF items (su's removed)
    "_pd_meas_counts_container",
    "_pd_meas_intensity_background",
    "_pd_meas_intensity_container",
    "_pd_proc_intensity_bkg_calc",
    "_pd_proc_intensity_bkg_fix"
],

// the properties specified by "_pdcifplot_ybkg_options" (arrays of numeric data):

"_pd_meas_counts_background": [ ... ],
...

// 'y' standard uncertainties
    
"_pdcifplot_ysu_options": [                 // a string list of 'keys' that identify the json properties containing the actual arrays of numeric data
                                            // ('keys' listed below will only be included if the data are in the source CIF)
  "_pdcifplot_pd_meas_counts_total_su",     // any su's extracted from the corresponding 'y' data items
  ...
  "_pdcifplot_pd_proc_ls_weight"            // corresponds to the CIF item _pd_proc_ls_weight
],

// the properties specified by "_pdcifplot_ysu_options" (arrays of numeric data):

"_pdcifplot_pd_meas_counts_total_su": [ ... ],
...

//  reflection positions by phase (each phase in own array)

"_pdcifplot_refln_ticks": [         
  [ 6.655586, 3.382725, 3.327793, ... ],
  [ 2.777435, 2.563201, 2.539053, ... ], ...
],

// wavelength used in calculations:

"_pdcifplot_diffrn_radiation_wavelength": 1.54188

}

NB This is not 'CIF-JSON'. In CIF-JSON data blocks are named properties of a CIF-JSON object rather than an array of objects. In addition CIF-JSON specifies that all properties describing CIF data items are arrays and all numeric data should be quoted as strings.
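
As a rough illustration of how the `_options` lists in the outline above could be used, the sketch below picks the first x array and the first observed-y array that a datablock object actually provides. It assumes the real files are plain JSON (the `//` annotations in the outline being explanatory only) and that each file holds an array of such objects. The function `pick_axes` and the toy record are hypothetical; the numeric values are truncated from the outline.

```python
import json


def pick_axes(block, x_suffix=None):
    """Return (x_key, x_values, y_key, y_values) for one datablock object.

    x_suffix optionally restricts the x keys, e.g. ",Q" or ",d".
    Returns None when no usable x or observed-y array is present.
    """
    x_options = block.get("_pdcifplot_xaxis_options", [])
    y_options = block.get("_pdcifplot_yobs_options", [])
    if x_suffix is not None:
        x_options = [key for key in x_options if key.endswith(x_suffix)]
    # Only accept keys whose data array is really present in the object.
    x_key = next((key for key in x_options if key in block), None)
    y_key = next((key for key in y_options if key in block), None)
    if x_key is None or y_key is None:
        return None
    return x_key, block[x_key], y_key, block[y_key]


# Toy file content: an array with a single datablock object (assumed layout).
toy_file = """
[
  {
    "_pdcifplot_block_id": "I",
    "_pdcifplot_xaxis_options": [
      "_pdcifplot_pd_meas_2theta_range_",
      "_pdcifplot_pd_meas_2theta_range_,Q"
    ],
    "_pdcifplot_pd_meas_2theta_range_": [5.019, 5.038, 5.057],
    "_pdcifplot_pd_meas_2theta_range_,Q": [0.356942, 0.358292, 0.359643],
    "_pdcifplot_yobs_options": ["_pdcifplot_pd_meas_counts_total"],
    "_pdcifplot_pd_meas_counts_total": [82700, 81800, 82500]
  }
]
"""

blocks = json.loads(toy_file)
result = pick_axes(blocks[0], x_suffix=",Q")
print(result)
```

Going through the option lists rather than hard-coding key names keeps the lookup tolerant of datablocks that carry only some of the possible arrays.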

Cheers

Simon

@maak-sdu
Collaborator Author

@sbillinge: @berrakozer and I inspected the .json files provided by Simon Westrip. For now, we simply used pandas.read_json() to read each json into a pandas DataFrame, though we may load them differently down the road. We also did some initial plotting to work out how to access values in the DataFrame.

The architecture and content of the jsons can definitely serve as inspiration for the object we create ourselves. However, I guess we would like more 'consistency' in our own object, e.g. 495 cifs don't contain Q-values. We found this by identifying all keys that contain either 'Q' or 'q':
_pdcifplot_pd_proc_2theta_corrected,Q
_pdcifplot_pd_proc_d_spacing,Q
_pdcifplot_pd_meas_2theta_range_,Q
_pdcifplot_pd_proc_2theta_range_,Q
_pdcifplot_pd_meas_2theta_scan,Q
_pdcifplot_pd_meas_2theta_corrected,Q

If interested, I have attached the .py file used.
json_plot.zip
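
A minimal sketch of the kind of check described above (not the attached script): it counts which Q keys occur across a list of datablock objects and how many objects advertise none. Treating a key as a Q key when it ends in `,Q` or equals `_pdcifplot_pd_proc_recip_len_q` is an assumption, as are the helper names and the toy records. The last helper fills in Q from 2θ using the standard relation Q = 4π sin(θ)/λ, assuming 2θ in degrees and λ in Å; whether pdcifplot computes Q in exactly this way is not confirmed here.

```python
import math
from collections import Counter


def q_keys(block):
    """List the x-axis keys of one datablock that hold Q values (assumed rule)."""
    options = block.get("_pdcifplot_xaxis_options", [])
    return [
        key for key in options
        if key.endswith(",Q") or key == "_pdcifplot_pd_proc_recip_len_q"
    ]


def q_key_census(blocks):
    """Return (Counter of Q keys, number of datablocks without any Q key)."""
    counts = Counter()
    missing = 0
    for block in blocks:
        keys = q_keys(block)
        if not keys:
            missing += 1
        counts.update(keys)
    return counts, missing


def two_theta_to_q(two_theta_deg, wavelength):
    """Convert 2theta (degrees) to Q = 4*pi*sin(theta)/wavelength."""
    return [
        4.0 * math.pi * math.sin(math.radians(tt / 2.0)) / wavelength
        for tt in two_theta_deg
    ]


# Toy datablocks: one advertises a Q array, one does not.
blocks = [
    {"_pdcifplot_xaxis_options": ["_pdcifplot_pd_meas_2theta_scan",
                                  "_pdcifplot_pd_meas_2theta_scan,Q"]},
    {"_pdcifplot_xaxis_options": ["_pdcifplot_pd_meas_time_of_flight"]},
]

counts, missing = q_key_census(blocks)
print(dict(counts), "datablocks without Q:", missing)
```

For datablocks that have a 2θ array and a wavelength but no Q array, `two_theta_to_q` could supply the missing values, which is one way to get the consistency mentioned above.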

@sbillinge
Collaborator

sbillinge commented Jun 10, 2021 via email

@maak-sdu
Collaborator Author

@sbillinge: got it. We will leave it for now.

@berrakozer
Collaborator

@sbillinge @maak-sdu: I added the .json file to my LionMail Google Drive and sent invites to you both.
The Google Drive URL is:
https://drive.google.com/drive/u/1/folders/1QkuqjRSSlcmC6kVa7KjJWvmjgDGOoYw6

@sbillinge
Collaborator

sbillinge commented Jun 11, 2021 via email

@berrakozer berrakozer added this to the mongo_backend milestone Jul 1, 2021
3 participants