Export of metadata/data from Uwazi to csv #958

Closed
whyfrycek opened this issue May 5, 2017 · 19 comments · Fixed by #2850
Comments

@whyfrycek
Contributor

A user just wrote: "It would be great if we could also export data/metadata stored in Uwazi database to csv so that we can use the data for other purposes."

I have asked them for a bit more context and, if possible, to join the discussion here. If the team has questions, feel free to ask.

@txau
Collaborator

txau commented Dec 11, 2017

The design decision here is how to translate a heterogeneous selection of data (including different data models) into a single CSV. I guess we can do this in a fairly straightforward way:

  • Each card is translated as a row.
  • We always add the UID of each exported element.
  • If there are common fields, we export them together.
  • Non-common fields are exported as separate columns.
  • If users feel like they are getting too many empty columns, they can do a sub-selection of data to get rid of the unwanted columns.
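
To make the proposal above concrete, here is a minimal sketch of the row/column mapping in Python. The card shape (a `uid` plus a `fields` dict) and the function name `cards_to_csv` are assumptions made up for illustration; they are not Uwazi's actual data model or API.

```python
import csv
import io


def cards_to_csv(cards):
    """Translate heterogeneous cards into one CSV string (hypothetical helper)."""
    # The UID column always comes first; the other columns are the union of
    # all field names, in first-seen order, so common fields share a column.
    columns = ["uid"]
    for card in cards:
        for field in card["fields"]:
            if field not in columns:
                columns.append(field)

    buffer = io.StringIO()
    # restval="" leaves a cell empty when a card lacks that field.
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for card in cards:
        writer.writerow({"uid": card["uid"], **card["fields"]})
    return buffer.getvalue()


# Two cards from different (made-up) templates sharing only "Title".
cards = [
    {"uid": "a1", "fields": {"Title": "Case 12", "Country": "Kenya"}},
    {"uid": "b2", "fields": {"Title": "Judge Doe", "Court": "High Court"}},
]
print(cards_to_csv(cards))
```

The shared `Title` field ends up in one column, while `Country` and `Court` become separate columns that stay empty for the card that lacks them. Exporting a sub-selection of cards shrinks the column set accordingly.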

@CESRadmin

It would be a valuable feature for users to have the option to download the metadata of multiple documents into a .csv or .xlsx file, rather than having to download all of the full documents.

@whyfrycek
Contributor Author

This was feedback from another organisation that is evaluating Uwazi. I have asked them for details on their use case.

@whyfrycek
Contributor Author

Additional context:

The goal is to compare part of the data with other organisations to see who holds what on whom.

This means they would want to use only some of the information from templates, so export by template would be useful as a minimum. Field-based export could be even more useful, but redaction can also happen in Excel afterwards.

@bertver

bertver commented Jul 24, 2019

Another organisation is interested and would like to know how long it would take to develop the export feature.

@txau
Collaborator

txau commented Jul 24, 2019

@bertver this was estimated at 10 to 15 days and included a table view as an alternative to the card list view as part of the user interface.

@kjantin
Contributor

kjantin commented Feb 25, 2020

We need to define the MVP for this feature.

@txau
Collaborator

txau commented Feb 25, 2020

Probably showing only the visible data is the simplest solution. The problem is that it may lead users to simply add more columns to the "show in card" group so that they get that column exported.
The other option is adding all columns from all templates. Maybe it requires more work from our end (maybe not so much? an Uwazi dev should answer that question). The problem is that the number of columns could eventually explode and become clutter.
We also probably need to export the template name and the unique id as extra columns.
How are dates going to be exported? We store them as timestamps. If we want actual dates we need a formatter there.
How are multi-fields (multi-date, multi-select, etc.) going to be exported? A CSV inside a CSV? Probably we need to mimic whatever we are doing in the import feature.
How to handle media files? Do we export them as a filename and also add the file as something you download?
How do we handle relational fields (selects, multiselects)? In import we reconcile by name and assume that conflicts are unimportant (less safe, more user friendly). Another option is export by id.

@txau
Collaborator

txau commented Feb 25, 2020

  • Probably showing only the visible data is the simplest solution. The problem is that it may lead users to simply add more columns to the "show in card" group so that they get that column exported.
  • Another option is adding all columns from all templates. Maybe it requires more work from our end (maybe not so much? an Uwazi dev should answer that question). The problem is that the number of columns could eventually explode and become clutter.
  • We also probably need to export the template name and the unique id as extra columns.
  • How are dates going to be exported? We store them as timestamps. If we want actual dates we need a formatter there.
  • How are multi-fields (multi-date, multi-select, etc.) going to be exported? A CSV inside a CSV? Probably we need to mimic whatever we are doing in the import feature.
  • How to handle media files? Do we export them as a filename and also add the file as something you download?
  • What do we do with relational fields (selects, multiselects)? In import we reconcile by name and assume that conflicts are unimportant (less safe, more user friendly). Another option is export by id, à la relational DB.
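
On the date question, a formatter could be as small as the sketch below. It assumes the stored timestamps are Unix seconds interpreted in UTC, which the thread does not state; the function name `format_timestamp` and the default pattern are illustrative only.

```python
from datetime import datetime, timezone


def format_timestamp(seconds, pattern="%Y-%m-%d"):
    """Format a Unix timestamp (assumed: seconds, UTC) as a date string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(pattern)


print(format_timestamp(0))           # 1970-01-01
print(format_timestamp(1587340800))  # 2020-04-20
```

Pinning the conversion to UTC keeps the exported date independent of the server's local time zone. If the stored values turn out to be milliseconds, they would need dividing by 1000 first.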

@kjantin kjantin modified the milestones: Ability to import/export data, Roadmap for March/April/May 2020, Roadmap for Q2 2020 (Apr, May, Jun), Roadmap for Q1 2020 (Jan, Feb, Mar) Feb 25, 2020
@txau txau unassigned konzz Mar 2, 2020
@bdittes
Contributor

bdittes commented Mar 10, 2020

Design doc: https://docs.google.com/document/d/1t5fqpCwhm24M2xeKKICOH3ZLLWFCCDiT-wUgqjGD0Dw/edit#heading=h.u4zhs5qr8hms

Branch: export-csv

@bdittes bdittes assigned fnocetti and unassigned samschaevitz Mar 10, 2020
@txau
Collaborator

txau commented Mar 27, 2020

Keep an eye on huge queries crashing the server. Related: #2841

@fnocetti fnocetti mentioned this issue Apr 2, 2020
7 tasks
@RafaPolit
Member

This will probably be done by the end of the week. Pending QA and deployment.

@fnocetti
Contributor

fnocetti commented Apr 16, 2020

I'll be tagging this for QA today.
The implementation includes:

  • An export button on the sidebar for triggering the export.
  • Export of public or non-public documents (not together, and depending on whether the user is logged in).
  • It exports all entities matching the user's search, up to the Elasticsearch limit of 10000, or, if the user selected particular documents, the selected documents.
  • Support for translations in headers and contents. It exports in the language the user is working in at the moment of the export.
  • For generating the table, it calculates the deduplicated union of the metadata fields for all exported types, based on the field label. That is, if two fields are labeled identically in two different templates, they will form a single column in the exported CSV. This computation is done greedily, before entity processing starts, based on the aggregation information available. That means it uses the fields of every template for which an entity matches the search criteria, so in particular cases (like entity selection) there might be columns without data. If the user filtered by template, then that information is used to compute the headers.
  • The export supports all the field types but:
    • Nested fields
    • Preview fields
    • Markdown fields: they were excluded because they can clutter the CSV file. If we need them, it should be fairly simple to add support for them. Maybe it is a good idea to open an issue.
  • Apart from the metadata fields, each row will include:
    • Title
    • Template
    • Date added
    • Documents: as a relative path to the file, e.g. "/files/docname.pdf"
    • Attachments: same as documents
    • Published: a field indicating if the entity is either published or unpublished
  • All multivalue fields are formatted by including every value separated by a pipe character
    • Date ranges are formatted as two dates separated by a tilde character ( ~ )
  • Link fields follow the "Label|URL" convention
  • No progress report is implemented at the moment. It will require further UI work.
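
As an illustration of the conventions listed above (label-based header union, pipe-separated multivalues, tilde-separated date ranges, "Label|URL" links), here is a rough Python sketch. It is not the code from the export-csv branch: every function name is made up for this example, the date format and the assumption that timestamps are Unix seconds in UTC are guesses, and the tilde is written without surrounding spaces, which may differ from the real output.

```python
from datetime import datetime, timezone


def fmt_date(seconds):
    # Assumption: timestamps are Unix seconds, rendered as UTC dates.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def fmt_multi(values):
    # Multivalue fields: every value separated by a pipe character.
    return "|".join(values)


def fmt_range(start, end):
    # Date ranges: two dates separated by a tilde character.
    return fmt_date(start) + "~" + fmt_date(end)


def fmt_link(label, url):
    # Link fields follow the "Label|URL" convention.
    return label + "|" + url


def union_headers(templates):
    """Deduplicated union of field labels across templates, first-seen order."""
    headers = []
    for labels in templates.values():
        for label in labels:
            if label not in headers:
                headers.append(label)
    return headers


# Hypothetical templates that both define a field labeled "Country".
templates = {"Case": ["Country", "Date"], "Person": ["Country", "Role"]}
print(union_headers(templates))          # ['Country', 'Date', 'Role']
print(fmt_multi(["Kenya", "Uganda"]))    # Kenya|Uganda
print(fmt_range(0, 86400))               # 1970-01-01~1970-01-02
print(fmt_multi([fmt_range(0, 86400), fmt_range(172800, 259200)]))
print(fmt_link("Uwazi", "https://uwazi.io"))
```

Because the headers are merged by label only, two fields with the same label but different types would share a column in this sketch. A multi date range field comes out as several `start~end` pairs joined by pipes.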

@txau
Collaborator

txau commented Apr 20, 2020

  • Documents: as a relative path to the file, e.g. "/files/docname.pdf"
  • Attachments: same as documents

Are we exporting the files as well as the paths?

@fnocetti
Contributor

Are we exporting the files as well as the paths?

No, we are not exporting the actual document/attachment files. We are including the file path in a documents (or attachments) field in the CSV.
That's what we agreed for the MVP, am I right?

@txau
Collaborator

txau commented Apr 20, 2020

No, we are not exporting the actual document/attachment files. We are including the file path in a documents (or attachments) field in the CSV.
That's what we agreed for the MVP, am I right?

It's ok for the MVP.

@RafaPolit
Member

  • Markdown fields: they were excluded because they can clutter the CSV file. If we need them, it should be fairly simple to add support for them. Maybe it is a good idea to open an issue.

This may be a problem: some models rely heavily on rich text fields to hold their data.

The bigger issue would be to 'explain' to the user that this field is not available while all others are. I'll continue with the QA, but I'm not sure if we need to tackle this before merging this into DEV.

@fnocetti
Contributor

@RafaPolit makes sense. I'll talk to the PMs and work on it right away if needed.

@fnocetti
Contributor

Support for rich text fields is already implemented.
