automated update of resources and index #40

Open
SchSascha opened this issue May 8, 2017 · 17 comments

Comments

@SchSascha
Collaborator

No description provided.

@SchSascha SchSascha added this to the 0.2 milestone May 8, 2017
@SchSascha
Collaborator Author

  • aim: daily medline update
  • pmc: look for best solution (really full download each time?)

@SchSascha SchSascha modified the milestones: 0.3, 0.2 Jun 26, 2017
@SchSascha
Collaborator Author

A current request from our partners includes being able to find the most recent hits (e.g. Sep 2017), which are likely to include a recognisable event.
This shows not only how urgently we need an up-to-date data basis but also how much we need this issue resolved. What is the state of things here? What is the estimated workload? We need to discuss!

@SchSascha SchSascha modified the milestones: 0.4, 0.1.49 Dec 4, 2017
@khituras
Member

khituras commented Dec 4, 2017 via email

@khituras
Member

Status PMC: Benjamin needs to give me an update mechanism for PMC, because we have to download it completely every time even though not every document is new. He knows what to do and should be reminded on a regular basis.

@khituras
Member

Status PMC: Benjamin gives an "unknown" deadline due to other work. Thus, for the time being we will just have to reprocess completely for updates.
Or, as a dirty workaround, we only add new documents and ignore the possibility that old documents have undergone a change.

@SchSascha
Collaborator Author

Unknown sounds alarming.
Adding only new documents is fine for the moment.

@khituras
Member

Benjamin: on the afternoon of Thursday January 23 2018 he can say whether he can do it or not.

@SchSascha
Collaborator Author

He can't, can he?
So what is the latest timeline here? What do you think?

@khituras
Member

khituras commented Jun 7, 2018

No, he can't. I will do it next. The facts:

  • PMC does not offer update packages but is always updated as a whole.
  • PMC documents can be downloaded individually but this takes ages.

The plan:

Init:

  1. Create a directory A from where PMC documents will be read and processed in the future.
  2. Download the latest PMC distribution to A.

Update:

  1. Download PMC into directory B.
  2. Create two files HA and HB that contain the filename of each document in A and B, respectively, together with its MD5 hash.
  3. Compute the differences between HA and HB. Perform the appropriate insertions, replacements or deletions from B to A. Create a new file PA listing the newly inserted or replaced documents, and another file DA listing the documents that have been deleted.
  4. Apply the changes to the PMC archives in A.
  5. Delete B if desired.
  6. Apply the changes to the database (Postgres).
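
The hash comparison in update steps 2 and 3 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual updater script: it assumes the documents are plain files below A and B, that HA and HB can be held as in-memory dictionaries, and that PA and DA are text files with one relative filename per line. The function names `build_manifest`, `diff_manifests` and `write_change_lists` are made up for this sketch.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to the MD5 hex digest of its content."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        md5 = hashlib.md5()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files need not fit into memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                md5.update(chunk)
        manifest[path.relative_to(root).as_posix()] = md5.hexdigest()
    return manifest


def diff_manifests(ha, hb):
    """Compare HA (current state of A) with HB (fresh download in B)."""
    inserted = sorted(name for name in hb if name not in ha)
    replaced = sorted(name for name in hb if name in ha and ha[name] != hb[name])
    deleted = sorted(name for name in ha if name not in hb)
    return inserted, replaced, deleted


def write_change_lists(inserted, replaced, deleted, pa_path, da_path):
    """Write PA (inserted or replaced) and DA (deleted), one filename per line.

    The one-name-per-line layout is an assumption made for this sketch.
    """
    Path(pa_path).write_text("".join(f"{name}\n" for name in inserted + replaced))
    Path(da_path).write_text("".join(f"{name}\n" for name in deleted))
```

Calling `diff_manifests(build_manifest("A"), build_manifest("B"))` would then give the three lists needed for steps 3 and 4; the directory names are placeholders.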

I will give a report on Monday the 11th on how it went, if there were any issues and how long it will still take. I will be able to judge better then than now.

@khituras
Member

Update: I began writing the script. I think I will have finished it by Wednesday.

@khituras
Member

Update: Well, it was not quite finished ;-) I fixed some bugs, and now all steps up to step 5 are working. I can't say when step 6 will be working as well, since the TREC PM 2018 deadline (August 5) is approaching.

@khituras
Member

See https://github.com/JULIELab/jedis-pmc-updater for a new repository where I store the respective files and describe them.

@khituras khituras modified the milestones: 0.1.17-49, 0.2 Sep 25, 2018
@SchSascha
Collaborator Author

Is this issue still open?

@khituras
Member

It is. I have made progress here, but the last step is still missing (having the PMC UIMA reader understand what to (re)process and what not).
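
As a rough illustration of that missing step, a reader-side filter might look like the sketch below. It is written in Python rather than as the actual UIMA reader, and it assumes that PA is a plain text file with one relative filename per line; the function names `load_names` and `documents_to_process` are hypothetical.

```python
from pathlib import Path


def load_names(list_file):
    """Read one document filename per line, skipping blank lines."""
    lines = Path(list_file).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def documents_to_process(root, pa_file):
    """Yield only the files below root whose relative path is listed in PA."""
    wanted = load_names(pa_file)
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.relative_to(root).as_posix() in wanted:
            yield path
```

Documents listed in DA would be handled separately, by removing them from the database instead of reading them.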

@SchSascha
Collaborator Author

Approximate time until it is resolved?

@khituras
Member

khituras commented Oct 18, 2018 via email

@SchSascha
Collaborator Author

OK.
I believe that, besides recovering output as input for another round, having full-text results is essential as well.
Hence, I also consider the automatic update critical, especially with respect to NAR.
