Several minor SEO issues in HTML <meta> tags #1197

Closed
tdonohue opened this issue May 18, 2021 · 5 comments · Fixed by #1228
Labels
bug component: SEO Search Engine Optimization e/8 Estimate in hours high priority testathon Reported by a tester during Community Testathon

Comments

@tdonohue
Member

tdonohue commented May 18, 2021

Describe the bug
There are 5 minor issues in our HTML <meta> tags at this time. I decided to copy them all into a single ticket as they are most easily tackled by one person. All of these issues appear to be in our MetadataService which generates these tags.

  1. The citation_pdf_url has incorrect logic as it requires the bitstream be PDF format. Google Scholar previously told us to change that logic in DS-1483 and then update it in DS-3127 -- better logic for 7.x is listed below.
  2. The citation_abstract_html_url often points at localhost URLs (as it uses the ui settings in your environment.*.ts). This is reproducible on the demo site, e.g. https://demo7.dspace.org/entities/publication/3149d355-7c13-4abb-8537-1852c181d9b2 (use "inspect" on the page).
  3. The og:title and og:description tags should be removed. These are hardcoded to reference DSpace (in general) and are unnecessary at this time since we don't support other "open graph" meta tags. Plus we already have a "generator" tag to specify that the site is DSpace.
  4. The citation_date tag should be renamed to citation_publication_date as that's the new tag that Google Scholar uses. See also https://scholar.google.com/intl/en/scholar/inclusion.html#indexing
  5. Add citation_publisher tag to list the value of dc.publisher (if field exists)
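
As a rough illustration of items 4 and 5, the tag generation could look something like the sketch below. This is not the MetadataService code: `MetaTag`, `ItemMetadata`, `firstValue` and `buildCitationTags` are hypothetical names, and reading the publication date from dc.date.issued is an assumption (the issue does not say which field feeds the date tag).

```typescript
// Hypothetical shapes for illustration only; not the DSpace models.
interface MetaTag {
  name: string;
  content: string;
}

type ItemMetadata = { [field: string]: string[] | undefined };

// Returns the first value of a metadata field, or undefined if the field is absent or empty.
function firstValue(metadata: ItemMetadata, field: string): string | undefined {
  const values = metadata[field];
  return values !== undefined && values.length > 0 ? values[0] : undefined;
}

function buildCitationTags(metadata: ItemMetadata): MetaTag[] {
  const tags: MetaTag[] = [];

  // Item 4: emit citation_publication_date instead of citation_date.
  // Assumption: the date comes from dc.date.issued.
  const issued = firstValue(metadata, 'dc.date.issued');
  if (issued !== undefined) {
    tags.push({ name: 'citation_publication_date', content: issued });
  }

  // Item 5: emit citation_publisher only when dc.publisher exists.
  const publisher = firstValue(metadata, 'dc.publisher');
  if (publisher !== undefined) {
    tags.push({ name: 'citation_publisher', content: publisher });
  }

  return tags;
}
```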

Expected behavior

  1. The citation_pdf_url should use the following logic (based on the two old JIRA tickets above):
    • Create a whitelist of preferred formats. Per old JIRA ticket DS-3127, those formats should be:
      • application/pdf (PDF)
      • application/postscript (PS)
      • application/msword (DOC)
      • application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)
      • text/richtext (RTF)
      • application/epub+zip (EPUB)
    • If the Item has one bitstream (in ORIGINAL bundle), add it to citation_pdf_url (regardless of format -- this overrides the whitelist)
    • If the Item has more than one bitstream, but the ORIGINAL bundle has one flagged as primary, add it to citation_pdf_url (regardless of format -- this overrides the whitelist)
    • If the Item has more than one bitstream, and none are flagged as primary, look for the first bitstream matching any of the preferred formats above.
  2. The citation_abstract_html_url should not use the environment.ui settings. It likely can just use the value of dc.identifier.uri (which is the handle or public URL stored in metadata). If that value isn't found, it could be built similar to the citation_pdf_url.
  3-5 are self-explanatory.
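
The citation_pdf_url rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: `BitstreamInfo` and `pickCitationBitstream` are hypothetical names (not the DSpace models or the MetadataService API), and the input is assumed to be the list of bitstreams already taken from the ORIGINAL bundle.

```typescript
// Hypothetical minimal bitstream shape for illustration; not the DSpace model.
interface BitstreamInfo {
  id: string;
  mimeType: string;
  isPrimary: boolean;
}

// Preferred formats as listed in this issue (per DS-3127).
const PREFERRED_FORMATS: string[] = [
  'application/pdf',
  'application/postscript',
  'application/msword',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  'text/richtext',
  'application/epub+zip',
];

// Picks the bitstream to expose in citation_pdf_url, or undefined if none qualifies.
// `original` is assumed to hold the bitstreams of the ORIGINAL bundle, in order.
function pickCitationBitstream(original: BitstreamInfo[]): BitstreamInfo | undefined {
  if (original.length === 0) {
    return undefined;
  }
  // A single bitstream is used regardless of format.
  if (original.length === 1) {
    return original[0];
  }
  // A primary bitstream is used regardless of format.
  const primary = original.filter((b) => b.isPrimary)[0];
  if (primary !== undefined) {
    return primary;
  }
  // Otherwise fall back to the first bitstream in a preferred format.
  return original.filter((b) => PREFERRED_FORMATS.indexOf(b.mimeType) !== -1)[0];
}
```

If no bitstream qualifies, the function returns undefined, in which case the caller would presumably omit the tag.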
@tdonohue tdonohue added this to the 7.0 milestone May 18, 2021
@tdonohue tdonohue added the testathon Reported by a tester during Community Testathon label May 18, 2021
@tdonohue
Member Author

@artlowel : Assigning this to you for your team to look at. These were found during testathon by a tester & I've verified them & suggested possible fixes.

@artlowel
Member

This can be fixed in an estimated 4 hours.

For the citation_pdf_url rules, just to make sure, what needs to happen if there is only one ORIGINAL bitstream, but its format isn't in the whitelist? Do we omit the citation_pdf_url tag?

@artlowel artlowel assigned tdonohue and unassigned artlowel May 19, 2021
@tdonohue
Member Author

@artlowel : I think we should err on the side of including the citation_pdf_url if we aren't sure. So, if there is only one ORIGINAL Bitstream, include it in the citation_pdf_url by default. I'll change that rule in the description to be clearer.

Thanks for the estimate. I'll add it & assign back to you for your team to work on when you are ready.

@tdonohue tdonohue added e/4 Estimate in hours and removed Estimate TBD labels May 19, 2021
@tdonohue tdonohue assigned artlowel and unassigned tdonohue May 19, 2021
@artlowel
Member

artlowel commented May 31, 2021

@tdonohue

  1. The citation_abstract_html_url should not use the environment.ui settings. It likely can just use the value of dc.identifier.uri (which is the handle or public URL stored in metadata). If that value isn't found, it could be built similar to the citation_pdf_url.

I noticed that currently citation_pdf_url doesn't contain the origin e.g.

<meta property="citation_pdf_url" content="/bitstreams/619c1973-3f91-4612-8c43-887f5e32672f/download">

I would have expected:

<meta property="citation_pdf_url" content="https://demo7.dspace.org/bitstreams/619c1973-3f91-4612-8c43-887f5e32672f/download">

I'm ok with leaving the origin out for citation_abstract_html_url as well, but are we sure that works for Google Scholar?

Alternatively, we could use HardRedirectService.getRequestOrigin() to get the origin used by the request. This isn't based on config, and should work automatically no matter how the site is hosted, because it just checks where the request is coming from and uses that origin in the link.

@tdonohue
Member Author

tdonohue commented Jun 1, 2021

@artlowel : Good catch, I guess I overlooked that the URL in the HTML is not an absolute URL. My understanding is it is supposed to contain the Origin. According to https://scholar.google.com/intl/en/scholar/inclusion.html#indexing

please specify the locations of all full text versions using citation_pdf_url or DC.identifier tags. The content of the tag is the absolute URL of the PDF file

So, both citation_abstract_html_url and citation_pdf_url should be absolute URLs. Your approach of using HardRedirectService.getRequestOrigin() sounds fine to me.
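
A minimal sketch of how both tags could end up with absolute URLs is below. The helper names `toAbsoluteUrl` and `abstractHtmlUrl` are hypothetical, and the origin is passed in as a plain string here; in the application it would presumably come from HardRedirectService.getRequestOrigin(), whose exact signature is not shown in this thread.

```typescript
// Hypothetical helper: prefixes a relative path with the request origin.
// Paths that are already absolute http(s) URLs are returned unchanged.
function toAbsoluteUrl(origin: string, path: string): string {
  if (/^https?:\/\//i.test(path)) {
    return path;
  }
  const base = origin.replace(/\/+$/, ''); // drop trailing slashes on the origin
  const rel = path.charAt(0) === '/' ? path : '/' + path;
  return base + rel;
}

// Hypothetical helper for citation_abstract_html_url: prefer the first non-empty
// dc.identifier.uri value, otherwise build the URL from the origin and the item path.
function abstractHtmlUrl(identifierUris: string[], origin: string, itemPath: string): string {
  const uri = identifierUris.filter((u) => u.trim().length > 0)[0];
  return uri !== undefined ? uri : toAbsoluteUrl(origin, itemPath);
}
```

For example, `toAbsoluteUrl('https://demo7.dspace.org', '/bitstreams/619c1973-3f91-4612-8c43-887f5e32672f/download')` yields the absolute form of the citation_pdf_url shown earlier in this thread.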

@ybnd ybnd mentioned this issue Jun 14, 2021
13 tasks
@tdonohue tdonohue added e/8 Estimate in hours and removed e/4 Estimate in hours labels Jun 30, 2021
2 participants