Bug: Recent dataset list on the main page is not reflective of new dataset #542

Closed
2 tasks done
astrochun opened this issue Dec 11, 2023 · 12 comments · Fixed by #557
Assignees
Labels
question Further information is requested

Comments

@astrochun

astrochun commented Dec 11, 2023

UPDATE: Sounds like we don't yet have quite the right definition of what should be on the Recently Published feed. Working with @astrochun and @matthewjchandler to figure out what that should be.

On Friday, December 8, we published a new dataset in PDC:

However, on the main Discovery page this does not show up under "Recently published".

@hectorcorrea points out that the sort is probably done by year, so it does not capture the correct order among datasets published in the same year.

[Screenshot, 2023-12-11 11:36 AM]

Acceptance criteria

  • Sort Recently Published items on the front page by the created_at timestamp from PDC Describe
  • Exclude anything that was migrated from DSpace
@hectorcorrea
Member

For reference: there were 17 works published this year (2023) and we only show 5 on the recent works page. The sort is done by year, so there is no easy way to determine which of those 17 works is the most recent.
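
To make the ambiguity concrete, here is a minimal sketch of what a year-only sort key does. The titles and dates are invented for illustration; only the idea that the sort key is the year comes from this thread.

```python
from datetime import date

# Hypothetical works; titles and dates are invented for illustration.
works = [
    {"title": "A", "published": date(2023, 3, 1)},
    {"title": "B", "published": date(2023, 12, 8)},
    {"title": "C", "published": date(2023, 7, 15)},
]

def sort_by_year(items):
    # All three works share the key 2023. Python's sort is stable, so tied
    # items simply keep their input order, whatever that happens to be.
    return sorted(items, key=lambda w: w["published"].year, reverse=True)

def sort_by_date(items):
    # A full date breaks the ties and gives a true newest-first order.
    return sorted(items, key=lambda w: w["published"], reverse=True)

by_year = [w["title"] for w in sort_by_year(works)]
by_date = [w["title"] for w in sort_by_date(works)]
```

With a year-only key the newest work ("B") lands wherever the input order puts it, which matches the symptom described above.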

@astrochun
Author

> For reference: there were 17 works published this year (2023) and we only show 5 on the recent works page. The sort is done by year, so there is no easy way to determine which of those 17 works is the most recent.

Is there a published metadata date that we can use?

@bess bess added the bug label Jan 8, 2024
@bess bess self-assigned this Jan 8, 2024
@bess
Contributor

bess commented Jan 9, 2024

We can use either created_at or updated_at, which exist on the database record but not in the DataCite record. For the referenced dataset, these are:

 created_at: Fri, 17 Nov 2023 14:10:06.509405000 EST -05:00,
 updated_at: Fri, 08 Dec 2023 11:41:34.649789000 EST -05:00,

I'm going to index them both so we can try out which of these works better in the UI.
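
As a rough illustration of what indexing these values could involve, the sketch below converts the two timestamps above into UTC ISO 8601 strings. The function name `to_utc_iso` and the choice of output format are assumptions, not the actual indexing code; the fractional seconds and the zone abbreviation are dropped and only the numeric offset is used.

```python
import re
from datetime import timezone
from email.utils import parsedate_to_datetime

# Matches e.g. "Fri, 17 Nov 2023 14:10:06.509405000 EST -05:00"
STAMP = re.compile(
    r"(\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2})"  # RFC 2822-style date and time
    r"\.\d+ \w+ "                                    # fractional seconds, zone name (dropped)
    r"([+-]\d{2}):(\d{2})"                           # numeric UTC offset
)

def to_utc_iso(stamp: str) -> str:
    """Convert a database-style timestamp to a UTC ISO 8601 string (hypothetical helper)."""
    match = STAMP.fullmatch(stamp.strip().rstrip(","))
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    date_time, offset_hours, offset_minutes = match.groups()
    # parsedate_to_datetime expects the offset without a colon, e.g. "-0500".
    local = parsedate_to_datetime(f"{date_time} {offset_hours}{offset_minutes}")
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

created_at = to_utc_iso("Fri, 17 Nov 2023 14:10:06.509405000 EST -05:00")
updated_at = to_utc_iso("Fri, 08 Dec 2023 11:41:34.649789000 EST -05:00")
```

Storing both values in UTC keeps string order and chronological order identical, which makes either field usable as a sort key.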

@astrochun
Author

Great. I suspect created date makes the most sense

@leefaisonr
Contributor

[Screenshot, 2024-01-17 4:11 PM]

@leefaisonr
Contributor

Done

@bess bess reopened this Jan 18, 2024
@bess
Contributor

bess commented Jan 18, 2024

@astrochun asked us to look into why the dataset he mentioned isn't on the front page.
The dataset in question is this one: https://datacommons.princeton.edu/describe/works/201
Its created_at date is 17 Nov 2023.
On 18 Jan 2024, the datasets in the Recently Added list on the front page of PDC Discovery production have these dates for when they were added to PDC Describe:

["2024-01-08T11:51:38Z",
 "2023-12-22T14:00:38Z",
 "2023-12-22T13:47:37Z",
 "2023-12-22T13:43:37Z",
 "2023-12-22T13:35:55Z",
 "2023-12-22T13:26:33Z",
 "2023-12-22T13:20:44Z",
 "2023-12-22T12:37:29Z",
 "2023-12-22T12:24:52Z",
 "2023-12-22T12:13:31Z"]

All of which are more recent than 17 Nov 2023. Do we instead need some combination of "publication date" and "created_at"?
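
The comparison can be checked with a short sketch. The ten dates are the ones listed above; the created_at of work 201 is converted to UTC by hand from 14:10:06 at offset -05:00, and the variable names are illustrative only.

```python
from datetime import datetime, timezone

# The ten "added to PDC Describe" dates shown on the front page on 18 Jan 2024.
front_page = [
    "2024-01-08T11:51:38Z",
    "2023-12-22T14:00:38Z",
    "2023-12-22T13:47:37Z",
    "2023-12-22T13:43:37Z",
    "2023-12-22T13:35:55Z",
    "2023-12-22T13:26:33Z",
    "2023-12-22T13:20:44Z",
    "2023-12-22T12:37:29Z",
    "2023-12-22T12:24:52Z",
    "2023-12-22T12:13:31Z",
]

def parse_utc(stamp: str) -> datetime:
    # The trailing "Z" marks UTC, so attach the UTC timezone explicitly.
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

# created_at of work 201: 17 Nov 2023 14:10:06 -05:00, i.e. 19:10:06 UTC.
work_201_created = datetime(2023, 11, 17, 19, 10, 6, tzinfo=timezone.utc)

oldest_shown = min(parse_utc(s) for s in front_page)
# The work only makes the list if it is at least as new as the oldest entry shown.
would_appear = work_201_created >= oldest_shown
```

Here `would_appear` is false, so the front page is behaving consistently with a pure created_at sort; the question is whether that is the right rule.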

@bess bess added the question Further information is requested label Jan 18, 2024
@astrochun
Author

> All of which are more recent than 17 Nov 2023. Do we instead need some combination of "publication date" and "date added"?

@bess perhaps. The issue is that we did not migrate the data in chronological order and had to publish new datasets. A lot of those aren't recent datasets but from a few years back. I know this is a challenging one to fix since the metadata is a bit limited. If we can filter out those that have a publication date on or before 2022, that should capture more of the recent datasets. I think once we have more datasets, this will resolve itself.
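
A minimal sketch of that filter, assuming each record carries a numeric publication year. The field name `publication_year` and the sample records are hypothetical, not taken from PDC.

```python
# Hypothetical records; the field name and values are assumptions for illustration.
records = [
    {"id": "a", "publication_year": 2019},
    {"id": "b", "publication_year": 2023},
    {"id": "c", "publication_year": 2022},
    {"id": "d", "publication_year": 2024},
]

CUTOFF_YEAR = 2022  # anything published in or before this year is filtered out

def recent_only(items, cutoff=CUTOFF_YEAR):
    # Keep only records published strictly after the cutoff year.
    return [r for r in items if r["publication_year"] > cutoff]

recent_ids = [r["id"] for r in recent_only(records)]
```

One drawback of a fixed cutoff is that it has to be revisited every year, which is part of why other options come up later in the thread.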

@bess
Contributor

bess commented Jan 18, 2024

Another idea: Maybe we exclude anything that was migrated?

@matthewjchandler

Here's my two cents: "recently published" should sort in reverse-chronological order by the date of first issue (not update/edit, and not migration); and once we get past the migration phase, I don't expect much confusion about what recently went into PDC vs. what was recently published for the first time.

@bess
Contributor

bess commented Jan 19, 2024

I talked to @matthewjchandler on Slack, and after discussion he now agrees we should sort by the PDC created_at timestamp (since date of issue is not granular enough for meaningful sorting) but exclude migrated works.
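
As a sketch of the agreed rule, assuming each work exposes a `created_at` datetime and a boolean `migrated` flag (both names, the sample works, and the default limit are assumptions, not the real schema):

```python
from datetime import datetime, timezone

def utc(year, month, day, hour=0, minute=0):
    return datetime(year, month, day, hour, minute, tzinfo=timezone.utc)

# Hypothetical works; ids, flags, and dates are invented for illustration.
works = [
    {"id": "migrated-1", "created_at": utc(2023, 12, 22, 14, 0), "migrated": True},
    {"id": "native-1", "created_at": utc(2023, 11, 17, 19, 10), "migrated": False},
    {"id": "native-2", "created_at": utc(2024, 1, 8, 11, 51), "migrated": False},
    {"id": "migrated-2", "created_at": utc(2023, 12, 22, 12, 13), "migrated": True},
]

def recently_published(items, limit=5):
    # Step 1: drop anything that came in through the migration.
    native = [w for w in items if not w["migrated"]]
    # Step 2: newest first, by the created_at timestamp.
    native.sort(key=lambda w: w["created_at"], reverse=True)
    # Step 3: keep only as many as the front page shows.
    return native[:limit]

front_page_ids = [w["id"] for w in recently_published(works)]
```

With the migrated works excluded, a work created in November 2023 is no longer pushed off the list by the batch of December migrations.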

@leefaisonr
Contributor

[Screenshot, 2024-01-19 1:39 PM]
