Export of metadata/data from Uwazi to csv #958

whyfrycek · 2017-05-05T10:58:18Z

A user just wrote: "It would be great if we could also export data/metadata stored in Uwazi database to csv so that we can use the data for other purposes."

I have asked them for a bit more context and, if possible, to join the discussion here. If the team has questions, feel free to ask.

txau · 2017-12-11T17:04:16Z

The design decision here is how to translate a heterogeneous selection of data (including different data models) into a single CSV. I guess we can do this in a pretty straightforward way:

Each card is translated as a row.
We always add the UID of each exported element.
If there are common fields, we export them together.
Non-common fields are exported as separate columns.
If users feel like they are getting too many empty columns, they can do a sub-selection of data to get rid of the unwanted columns.
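
The rules above could be sketched roughly as follows. This is only an illustration in Python, not Uwazi code: the `cards_to_csv` helper and the sample field names are hypothetical, and each card is assumed to be a flat dict with a `uid` key.

```python
import csv
import io


def cards_to_csv(cards):
    """Write heterogeneous cards to one CSV, one row per card.

    Hypothetical helper: each card is assumed to be a flat dict with a "uid" key.
    """
    # The UID column always comes first; other columns follow in first-seen
    # order, so a field shared by several cards maps to a single column.
    columns = ["uid"]
    for card in cards:
        for field in card:
            if field not in columns:
                columns.append(field)

    buffer = io.StringIO()
    # restval="" leaves a cell empty when a card lacks that field.
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(cards)
    return buffer.getvalue()


# Made-up sample data: "title" is a common field, the others are not.
cards = [
    {"uid": "a1", "title": "Report", "country": "Kenya"},
    {"uid": "b2", "title": "Ruling", "court": "High Court"},
]
print(cards_to_csv(cards))
```

Exporting a narrower selection (for example only the first card) would drop the `court` column entirely, which is the sub-selection idea in the last point.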

CESRadmin · 2018-05-04T18:29:25Z

It would be a valuable feature for users to have the option to download the metadata of multiple documents into a .csv or .xlsx file, rather than having to download all of the full documents.

whyfrycek · 2019-06-27T08:22:02Z

This was the feedback from another organisation that is evaluating use. Asked them for details on use case.

whyfrycek · 2019-06-27T09:27:10Z

Additional context:

Goal is to compare part of the data with other organisations to see who holds what on whom.

This means they would want to use only some of the information from templates, so export by template would be useful as a minimum. Field-based export could be even more useful, but redaction can also happen in Excel afterwards.

bertver · 2019-07-24T12:43:17Z

Another organisation is interested and would like to know how much time it takes to develop the export feature.

txau · 2019-07-24T12:56:12Z

@bertver this was estimated at 10 to 15 days, and that included a table view as an alternative to the card list view as part of the user interface.

kjantin · 2020-02-25T16:42:48Z

We need to define the MVP for this feature.

txau · 2020-02-25T17:29:15Z

Probably showing only the visible data is the simplest solution. The problem is that it may lead users to simply add more columns to the "show in card" group so that they get that column exported.
The other option is adding all columns from all templates. Maybe it requires more work on our end (maybe not so much? an Uwazi dev should answer that question). The problem is that the number of columns could eventually explode and become clutter.
We also probably need to export as extra columns the template name and the unique id.
How are dates going to be exported? We store them as timestamps. If we want actual dates we need a formatter there.
How are multi-fields (multi-date, multi-select, etc) going to be exported? A CSV inside a CSV? Probably we need to mimic whatever we are doing in the import feature.
How to handle media files? Do we export them as filename + we also add the file as something you download?
How do we handle relational fields (selects, multiselects)? In import we reconcile by name and assume that conflicts are unimportant (less safe, more user friendly). Another option is export by id.

txau · 2020-02-25T17:29:57Z

Probably showing only the visible data is the simplest solution. The problem is that it may lead users to simply add more columns to the "show in card" group so that they get that column exported.
Another option is adding all columns from all templates. Maybe it requires more work from our end (maybe not so much? an Uwazi dev should answer that question). The problem is that the number of columns could eventually explode and become clutter.
We also probably need to export as extra columns the template name and the unique id.
How are dates going to be exported? We store them as timestamps. If we want actual dates we need a formatter there.
How are multi-fields (multi-date, multi-select, etc) going to be exported? A CSV inside a CSV? Probably we need to mimic whatever we are doing in the import feature.
How to handle media files? Do we export them as filename + we also add the file as something you download?
What do we do with relational fields (selects, multiselects)? In import we reconcile by name and assume that conflicts are unimportant (less safe, more user friendly). Another option is export by id, à la relational db.
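
For the date and multi-field questions above, a formatter could look something like the sketch below. It assumes the stored timestamps are Unix seconds in UTC and that ISO dates are acceptable output; neither is confirmed in this thread. The names `format_date` and `format_multi_date` are hypothetical, and the `|` separator is only a placeholder.

```python
from datetime import datetime, timezone


def format_date(timestamp):
    """Turn a stored timestamp into a YYYY-MM-DD string.

    Assumption: the timestamp is Unix seconds in UTC.
    """
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


def format_multi_date(timestamps, separator="|"):
    """Join several dates into one cell instead of nesting a CSV in a CSV."""
    return separator.join(format_date(ts) for ts in timestamps)


print(format_date(1582588800))
print(format_multi_date([1582588800, 1587340800]))
```

Whatever separator is chosen, it needs to match what the import feature expects so that an exported file can be re-imported.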

bdittes · 2020-03-10T14:46:17Z

Design doc: https://docs.google.com/document/d/1t5fqpCwhm24M2xeKKICOH3ZLLWFCCDiT-wUgqjGD0Dw/edit#heading=h.u4zhs5qr8hms

Branch: export-csv

txau · 2020-03-27T13:57:03Z

Keep an eye on huge queries crashing the server. Related: #2841
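
One way to keep a large export from loading everything at once is to page through the results and write rows as they arrive. The sketch below is illustrative only and written in Python: `fetch_page` stands in for whatever query the server would actually run, and the page size is arbitrary.

```python
import csv
import io


def iter_entities(fetch_page, page_size=100):
    """Yield entities page by page so only one page is held in memory.

    `fetch_page(offset, limit)` is a hypothetical callable returning a list.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        offset += len(page)


def export_csv(fetch_page, columns, out, page_size=100):
    """Stream rows to `out` as they are fetched; returns the row count."""
    writer = csv.DictWriter(
        out,
        fieldnames=columns,
        restval="",
        extrasaction="ignore",  # skip keys that are not exported columns
        lineterminator="\n",
    )
    writer.writeheader()
    count = 0
    for entity in iter_entities(fetch_page, page_size):
        writer.writerow(entity)
        count += 1
    return count


# Fake data source standing in for the real search backend.
DATA = [{"uid": str(i), "title": f"Doc {i}"} for i in range(250)]


def fake_fetch(offset, limit):
    return DATA[offset:offset + limit]


buffer = io.StringIO()
print(export_csv(fake_fetch, ["uid", "title"], buffer))
```

In a real server `out` would be the response stream rather than an in-memory buffer, so memory use stays bounded by the page size.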

RafaPolit · 2020-04-07T14:54:14Z

This will probably be done by the end of the week. Pending QA and deployment.

fnocetti · 2020-04-16T19:14:26Z

I'll be tagging this for QA today.
The implementation includes:

- An export button on the sidebar for triggering the export.
- Export of public or non-public documents (not together, and depending on whether the user is logged in).
- It exports all entities matching the user's search, up to the Elasticsearch limit of 10000, or, if the user selected particular documents, the selected documents.
- Support for translations in headers and contents. It exports in the language the user is working in at the moment of the export.
- For generating the table, it calculates the deduplicated union of the metadata fields for all exported types, based on the field label. That is: if two fields are labeled identically in two different templates, they constitute a single column in the exported CSV. This computation is done greedily, before entity processing starts, based on the aggregation information available. That means it uses the fields of every template for which an entity matches the search criteria, so in particular cases (like entity selection) there might be columns without data. If the user filtered by template, that information is used to compute the headers.
- The export supports all field types except:
  - Nested fields
  - Preview fields
  - Markdown fields: they were excluded because they can clutter the CSV file. If we need them, it should be fairly simple to add support for them. Maybe it is a good idea to open an issue.
- Apart from the metadata fields, each row will include:
  - Title
  - Template
  - Date added
  - Documents: as a relative path to the file, e.g. "/files/docname.pdf"
  - Attachments: same as documents
  - Published: a field indicating whether the entity is published or unpublished
- All multivalue fields are formatted by including every value separated by a pipe character.
  - Date ranges are formatted as two dates separated by a tilde character ( ~ ).
- Link fields follow the "Label|URL" convention.
- No progress report is implemented at the moment. It will require further UI work.
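
To make the header computation and the cell conventions above concrete, here is a rough Python sketch. It is not the code from the branch: the template shape and the names `compute_headers`, `format_multi`, `format_date_range` and `format_link` are assumptions made for illustration. Whether spaces surround the tilde is not specified above, so the sketch uses none.

```python
def compute_headers(templates):
    """Deduplicated union of field labels, in first-seen order.

    Assumed shape: each template is {"name": str, "fields": [label, ...]}.
    """
    headers = []
    seen = set()
    for template in templates:
        for label in template["fields"]:
            # Identically labeled fields collapse into a single column.
            if label not in seen:
                seen.add(label)
                headers.append(label)
    return headers


def format_multi(values):
    """Multivalue fields: every value separated by a pipe character."""
    return "|".join(str(value) for value in values)


def format_date_range(start, end):
    """Date ranges: two already-formatted dates separated by a tilde."""
    return f"{start}~{end}"


def format_link(label, url):
    """Link fields follow the "Label|URL" convention."""
    return f"{label}|{url}"


# Made-up templates: "Country" and "Date" are shared labels.
templates = [
    {"name": "Report", "fields": ["Country", "Date", "Author"]},
    {"name": "Ruling", "fields": ["Country", "Court", "Date"]},
]
print(compute_headers(templates))
print(format_multi(["Kenya", "Uganda"]))
print(format_date_range("2020-01-01", "2020-03-31"))
print(format_link("Source", "https://example.org/doc"))
```

Because headers are computed from the templates before any entity is processed, a column such as `Court` appears even when none of the exported entities has a value for it.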

txau · 2020-04-20T13:34:05Z

> Documents: as a relative path to the file. e.g "/files/docname.pdf"
>
> Attachments: same as documents

Are we exporting the files as well as the paths?

fnocetti · 2020-04-20T13:38:36Z

> Are we exporting the files as well as the paths?

No, we are not exporting the actual documents/attachments files. We are including the file path in a documents (or attachment) field in the CSV.
That's what we agreed for the MVP, am I right?

txau · 2020-04-20T13:44:19Z

> No, we are not exporting the actual documents/attachments files. We are including the file path in a documents (or attachment) field in the CSV.
> That's what we agreed for the MVP, am I right?

It's ok for the MVP.

RafaPolit · 2020-04-20T14:11:14Z

> Markdown fields: they were excluded because they can clutter the csv file. If we need them, it should be fairly simple to add support for them. Maybe it is a good idea to open an issue.

This may be a problem: some models rely heavily on rich text fields to hold their data.

The bigger issue would be to 'explain' to the user that this field is not available while all others are. I'll continue with the QA, but I'm not sure if we need to tackle this before merging this into DEV.

fnocetti · 2020-04-20T17:50:17Z

@RafaPolit makes sense. I'll talk to the PMs and work on it right away if needed.

fnocetti · 2020-04-22T02:44:26Z

Support for rich text fields is already implemented.

kjantin added the NEW label May 5, 2017

kjantin added this to the Ability to import/export data milestone Jun 12, 2017

kjantin removed the NEW label Jun 12, 2017

kjantin added the Priority: Low label Aug 18, 2017

txau removed the Type: Enhancement label Feb 6, 2018

kjantin added Partner Request Priority: Medium and removed Priority: Low labels Feb 4, 2020

konzz self-assigned this Feb 25, 2020

konzz added the Sprint label Feb 25, 2020

kjantin modified the milestones: Ability to import/export data, Roadmap for March/April/May 2020, Roadmap for Q2 2020 (Apr, May, Jun), Roadmap for Q1 2020 (Jan, Feb, Mar) Feb 25, 2020

txau unassigned konzz Mar 2, 2020

nwyu assigned samschaevitz Mar 2, 2020

bdittes assigned fnocetti and unassigned samschaevitz Mar 10, 2020

fnocetti mentioned this issue Apr 2, 2020

958 export csv #2850 (Merged, 7 tasks)

RafaPolit mentioned this issue Apr 28, 2020

Library Pagination #2892 (Closed)

RafaPolit closed this as completed in #2850 May 4, 2020
