Support for DataStore #4
Thanks for the suggestion @rgrp. Will do.
@rgrp Hmmm. I don't think I understand how the datastore differs from what I've already implemented. To be able to test pushing, is there a test datastore I can use? Can you point to some example endpoints that use the datastore?
http://datahub.io has DataStore turned on, but you'll need to get an account. http://demo.ckan.org/ may work more easily.
Thanks!
@rgrp I added some functionality for datastore. If you have time, could you see if the functions work for you? See
@sckott I ran this, which appeared to work. Up to now, I've mainly been using the Python 'ckanapi' package (https://github.com/ckan/ckanapi) for interacting with the CKAN datastore, and wanted to see if some of the more granular functionality available through that mechanism could be added to ckanr, if it isn't already in there, that is. As with your example, I first establish a connection to a running CKAN instance.
For a new table, I may predefine field types, such as date / value / text fields. Subsetted example below.
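The subsetted example referred to here appears to have been lost in the page extraction. As a sketch of what such a predefined field-type list looks like (the field names below are made up for illustration; the type strings are CKAN's documented DataStore types):

```python
# Hypothetical DataStore field definitions, in the shape CKAN's
# datastore_create "fields" parameter expects. Field names are
# illustrative only; the types are valid DataStore column types.

def make_fields():
    """Return a 'fields' list predeclaring column types for a new table."""
    return [
        {"id": "obs_date", "type": "date"},    # a date column
        {"id": "obs_value", "type": "float"},  # a numeric value column
        {"id": "obs_status", "type": "text"},  # a free-text flag column
    ]

fields = make_fields()
```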
I'm generally pushing data directly from a pandas dataframe (akin to the R data.frame) to CKAN, which requires some massaging to get it into an acceptable JSON form. I generally upload a large dataframe in two steps: first the fields and a single row, to set the table up as it were, and then a follow-on 'upsert' of the bulk of the dataframe, chunking it into parts where necessary to get around memory issues. As per the last line below, I'm supplying a resource_id to my datastore_create call, where your example uses a name and generates the resource on the fly. I'm doing this so I can use datastore_create on the first slice and push the second slice with datastore_upsert.
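The two-step workflow described above (create the table with its field definitions plus a single seed row, then upsert the remainder in chunks) can be sketched as follows. The resource id, field names, and chunk size are illustrative assumptions; in practice the payloads would be sent via ckanapi's datastore_create and datastore_upsert actions rather than built locally:

```python
# Sketch of the two-step push: a datastore_create payload carrying the
# field definitions and one seed row, then the remaining rows split into
# upsert-sized chunks. Nothing here touches the network.

def build_create_payload(resource_id, fields, first_row):
    """Payload for datastore_create: schema plus a single seed record."""
    return {
        "resource_id": resource_id,
        "force": True,
        "fields": fields,
        "records": [first_row],  # one row is enough to set the table up
    }

def chunk_records(records, size):
    """Split the remaining rows into chunks for successive upserts."""
    return [records[i:i + size] for i in range(0, len(records), size)]

fields = [{"id": "date", "type": "date"}, {"id": "value", "type": "float"}]
rows = [{"date": "2015-01-0%d" % d, "value": float(d)} for d in range(1, 8)]

create_payload = build_create_payload("abc-123", fields, rows[0])
upsert_chunks = chunk_records(rows[1:], 3)  # 6 remaining rows -> 2 chunks
```

Each chunk would then go out as one datastore_upsert call, which is how the memory issues mentioned above are avoided.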
Does your
You asked, "What else should be added?"
On first working with CKAN, I struggled a bit with the naming conventions in their data model. AFAIK, an organisation contains any number of datasets, which are akin to data packages. Each dataset can in turn contain resources: files such as CSVs, PDFs etc. Some of these resources (such as CSVs, XLSs etc.) can be ingested into the datastore (postgres) for added functionality, hence the ckanapi function naming. My experience with R is somewhat limited, but I would love to see a fully-functional ckanr library, so let me know if I can be of further help.
@Analect Thanks for your comments. To address the questions:
hmmm, trying to work on adding functions for http://docs.ckan.org/en/latest/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_create and http://docs.ckan.org/en/latest/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_upsert, but the terminology is so confusing, and I need much more time to understand this - will come back later.
@sckott what's confusing there? It would be super useful to have this support, so if we can help clarify, let me know. Basically this is about creating tables and then inserting data into them. If you want a worked example in Python, we have several.
@rgrp Thanks for the offer to help. Some examples would be great. It seems like there are multiple ways to use http://docs.ckan.org/en/latest/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_create and I don't quite understand the different use cases yet.
@sckott I'm sure @rgrp can supply you with a whole lot more relevant stuff, but here are a few links that may help, beyond the very detailed docs at docs.ckan.org, that is! CKAN data model visualisation: ckan/ckan#2053. Simple Python interaction with CKAN examples: Other examples of Python integration with the datastore: As I said, @rgrp may be able to point you to other publicly available stuff, some of which may be more relevant to some of the new functionality in ver 2.3 (resource views etc.). Rgds,
@Analect Thanks very much for your help on this. Sorry it has been so long since I've worked on this. I will start hacking on this again soon ... the python examples will surely help.
@sckott
@Analect thanks! will do
@sckott First things first, here's a simple R file to download some data from the ECB, partially clean it, and save it locally as a CSV. Ideally, this file would be able to leverage your ds_create function to push directly to CKAN, but I'm not there yet, as I'll elaborate further down: https://gist.github.com/Analect/378a61704941359e3e5a As a test, I tried to push the dataframe (a local dataframe in the dplyr sense) in R directly to CKAN, but it hung for a long time and finally timed out with a broken pipe; some sort of chunking mechanism is probably needed. It looks like the convert function is doing the right transformation on the records from the ts_df_mod dataframe, but the fact that I'm not explicitly adding fields in this call (and setting these up ahead of time) probably doesn't help.
I found myself reverting back to Python and leveraging ckanapi in order to get this done. Being able to stay in R would be nice. Here's the mechanism I've used in Python. Probably not elegant ... but it works! I'm going to use the demo.ckan.org endpoint as you have done, and have created a dataset in there called 'test-ckanr'.

On reflection ... OK, having tried this, it seems that demo.ckan.org has some limitation on the amount of data that can be pushed there, as I'm getting a 'client error (413): Request Entity Too Large'. If I push the same data to my own CKAN instance, it's fine. In any case, here's the code, which might work if you just push a portion of the dataframe for testing: https://gist.github.com/Analect/ea38dd75c282d8312a65

If you want to test with larger datasets, then maybe I can make contact with you directly via email and set you up with a test CKAN instance on AWS.

As you can see from the Python script, I'm having to set up field types ahead of pushing data to the datastore. That's because using datapusher to do that correctly is not always reliable, as you can see from the screenshot below, where it ended up interpreting many fields as timestamps.

Some functionality gaps I see (which are perhaps there but I'm just not using the package correctly) are:
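The "massaging to get in an acceptable JSON form" mentioned in this thread mostly comes down to shaping tabular rows into DataStore record dicts with JSON-safe values. A minimal, hypothetical sketch of that conversion, with NaN (which is not valid JSON and trips up pushes from pandas) mapped to null:

```python
# Sketch of converting tabular rows into DataStore-style records.
# Column and value choices below are made up for illustration.
import math

def to_records(columns, rows):
    """Convert parallel row tuples into DataStore record dicts,
    replacing NaN with None so the payload serialises as valid JSON."""
    records = []
    for row in rows:
        rec = {}
        for name, value in zip(columns, row):
            if isinstance(value, float) and math.isnan(value):
                value = None  # NaN is not valid JSON; null is accepted
            rec[name] = value
        records.append(rec)
    return records

records = to_records(["date", "value"],
                     [("2015-01-01", 1.5), ("2015-01-02", float("nan"))])
```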
Let me know if I can be of further help here.
thanks @Analect - I'll get to this soon ⌚
On your numbered list:
@sckott Thanks.
@Analect It is in the works. See any functions starting with ... I'll also start a label to put on issues that have to do with datastore items.
This pkg is a big project, and lots of other pkgs to work on. And lots of travel this summer :) |
Sorry all, this isn't quite feature-complete yet, but I want to push a first version to CRAN to start getting this more widely distributed, get more users, more feedback, etc. Hope to get this sorted out soon; moving to the next milestone.
I noticed a long time ago that the resource datastore creation process is poorly documented. Here is the structure of a simple HTTP (curl) call showing how the request should look:

curl -X POST https://CKAN_INSTANCE_URL/api/3/action/datastore_create \
  -H "Authorization: API_KEY" \
  -d '{"force": "true",
       "resource": {"package_id": "PACKAGE ID"},
       "primary_key": ["name"],
       "fields": [
         {"id": "field1", "type": "one of text, json, date, time, timestamp, int, float, bool"},
         {"id": "field2", "type": "optional, but must be a valid type"}
       ]}'

I noted that it is important to specify the primary key.
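For comparison, the same request body can be assembled in Python before sending. CKAN_INSTANCE_URL, API_KEY, "PACKAGE ID", and the field names are the same placeholders as in the curl call above; this sketch only builds and serialises the payload, nothing is sent over the network:

```python
# Rebuild the datastore_create request body from the curl example above.
# All identifiers are placeholders carried over from that example.
import json

def datastore_create_payload(package_id, primary_key, fields):
    """Assemble the JSON body that curl's -d flag would carry."""
    return {
        "force": "true",
        "resource": {"package_id": package_id},
        "primary_key": primary_key,  # important to specify, per the comment
        "fields": fields,
    }

payload = datastore_create_payload(
    "PACKAGE ID",
    ["name"],
    [{"id": "field1", "type": "text"},
     {"id": "field2", "type": "int"}],
)
body = json.dumps(payload)  # the serialised request body
```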
thanks for the help @mattfullerton |
Initial support is here in the pkg; let's open new issues as needed for the various parts of datastore.
See http://docs.ckan.org/en/latest/maintaining/datastore.html
This would allow you to pull (and push) data between CKAN and R (data frames).
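For the pull direction, a datastore_search result (per the CKAN API docs, a dict containing "fields" and "records") flattens naturally into dataframe input. A sketch using a hand-made stand-in response, not real API output:

```python
# Flatten a datastore_search-style result into (columns, rows), roughly
# what a client would hand to a dataframe constructor. The fake_result
# below is a hand-made stand-in mimicking the documented response shape.

def response_to_table(result):
    """Turn a datastore_search 'result' dict into column names and rows."""
    columns = [f["id"] for f in result["fields"]]
    rows = [tuple(rec.get(c) for c in columns) for rec in result["records"]]
    return columns, rows

fake_result = {
    "fields": [{"id": "date", "type": "date"},
               {"id": "value", "type": "float"}],
    "records": [{"date": "2015-01-01", "value": 1.5},
                {"date": "2015-01-02", "value": 2.0}],
}
columns, rows = response_to_table(fake_result)
```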