Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sitemap & Google Scholar #379

Merged
merged 23 commits into from
Jan 30, 2018
Merged

Sitemap & Google Scholar #379

merged 23 commits into from
Jan 30, 2018

Conversation

pgwillia
Copy link
Member

@pgwillia pgwillia commented Jan 15, 2018

Replaces #340 Closes #9

  • Get the modified date indexed into Solr
  • Item visibility (filter out private items)
  • Add a static robots.txt file?
  • Other requirements from Google Scholar requirements: Google Scholar optimization #234
  • Change SitemapPolicy to simply inherit from ApplicationPolicy.
  • Add an application specific property on fileset to include hash, file size, and mimetype information.
  • Theses.
  • splitting into sitemapindex and sitemap

@pgwillia pgwillia changed the title Sitemap Sitemap & Google Scholar Jan 15, 2018
@@ -1,5 +1,7 @@
<% page_title(@item.title) %>

<%= render partial: 'google_scholar_metadata' %>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These meta tags will be in the <body>. Should probably inject them into the <head> of the page instead?

Requires some additional work to make this happen. Look into content_for and yield if you need help figuring out how to do this.

@pgwillia pgwillia force-pushed the sitemap branch 2 times, most recently from 7e30e59 to 88e13fc Compare January 17, 2018 23:12

def communities
@communities = Community.all
raise 'sitemap should contain less than 50,000 targets' if @communities.count > 50_000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we'll want to start serving Google a 500 page if/when any of these creeps above 50,000 without us noticing. Maybe replace raise with logger.warn? (actually even better would be to push a warning to Rollbar, but we don't have that set up yet. Maybe add a TODO: about that as well?)

@@ -134,6 +148,10 @@ def display_attribute_names
def valid_visibilities
super + [VISIBILITY_EMBARGO]
end

def public
Item.where(visibility: JupiterCore::VISIBILITY_PUBLIC)
Copy link
Contributor

@mbarnett mbarnett Jan 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this concern is shared with Thesis, I don't think this quite works (Thesis.public will return public Items, instead). self.where(visibility: JupiterCore::VISIBILITY_PUBLIC) should do the appropriate thing in both classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also love to see this get exercised in the test suite. Probably just add another assertion to these two existing test cases: https://github.com/ualbertalib/jupiter/blob/master/test/models/thesis_test.rb#L5 and https://github.com/ualbertalib/jupiter/blob/master/test/models/item_test.rb#L5

<meta name="citation_issue" content=""/>
<meta name="citation_firstpage" content=""/>
<meta name="citation_lastpage" content=""/>
-->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually want to output this whole set of comments into the HTML, or do we want this comment not to render at all? (Use <%# %> is the latter).

"#{unlocked_fileset.original_file.checksum.value}\" \
length=\"#{unlocked_fileset.original_file.size}\" \
type=\"#{unlocked_fileset.original_file.mime_type}\"\
/>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than have to escape all of these newlines, this is probably a great use-case for a heredoc: https://infinum.co/the-capsized-eight/multiline-strings-ruby-2-3-0-the-squiggly-heredoc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually more like a run-on sentence situation. There should be no newlines or extra spaces/tabs between the values. My understanding is that heredoc would maintain newlines. Is there a solution to make something like that more readable?

@mbarnett
Copy link
Contributor

mbarnett commented Jan 18, 2018

One other thing I'd love to see would be an added system test exercising the sitemap. But this looks really good other than the small things noted above.

@@ -1 +1,2 @@
# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
Sitemap: https://era.library.ualberta.ca/sitemap.xml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming out of a discussion with Kenton this morning, it sounds like it would be useful to make this dynamic and serve

User-agent: *
Disallow: /

based on a config key in the Rails environment config files, so that we can tell crawlers to go away during the beta launch phase.

(Basically, set config.some_key_name in the various env files, and then do if Rails.configuration.some_key_name == true here

should 'show url, last modified date, and content attributes' do
#visit items_sitemap_path
#assert_selector 'url', count: 4
#assert_selector 'loc', text: 'sitemap-items.xml'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/LeadingCommentSpace: Missing space after #. (https://github.com/bbatsov/ruby-style-guide#hash-space)

end
should 'show url, last modified date, and content attributes' do
#visit items_sitemap_path
#assert_selector 'url', count: 4

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/LeadingCommentSpace: Missing space after #. (https://github.com/bbatsov/ruby-style-guide#hash-space)

assert_empty schema.validate(document)
end
should 'show url, last modified date, and content attributes' do
#visit items_sitemap_path

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/LeadingCommentSpace: Missing space after #. (https://github.com/bbatsov/ruby-style-guide#hash-space)


should 'not show private items' do
#assert_select 'url', count: 2
assert_select 'loc', {count: 0, text: item_url(@private_item)}, 'private items shant appear in the sitemap'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/SpaceInsideHashLiteralBraces: Space inside { missing. (https://github.com/bbatsov/ruby-style-guide#spaces-operators)
Layout/SpaceInsideHashLiteralBraces: Space inside } missing. (https://github.com/bbatsov/ruby-style-guide#spaces-operators)

end

should 'not show private items' do
#assert_select 'url', count: 2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/LeadingCommentSpace: Missing space after #. (https://github.com/bbatsov/ruby-style-guide#hash-space)

assert_select 'lastmod', @item.updated_at.to_s
@item.file_sets.first.unlock_and_fetch_ldp_object do |uo|
assert_select "rs|ln[href=?]", url_for(controller: :file_sets,
action: :download,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/AlignHash: Align the elements of a hash literal if they span more than one line.

assert_select 'loc', item_url(@item)
assert_select 'lastmod', @item.updated_at.to_s
@item.file_sets.first.unlock_and_fetch_ldp_object do |uo|
assert_select "rs|ln[href=?]", url_for(controller: :file_sets,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/StringLiterals: Prefer single-quoted strings when you don't need string interpolation or special symbols. (https://github.com/bbatsov/ruby-style-guide#consistent-string-literals)

assert_select 'lastmod'
assert_select 'changefreq'
assert_select 'priority'
assert_select "rs|md[type=?]", 'text/html'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/StringLiterals: Prefer single-quoted strings when you don't need string interpolation or special symbols. (https://github.com/bbatsov/ruby-style-guide#consistent-string-literals)

publication_status: [CONTROLLED_VOCABULARIES[:publication_status].published],
license: CONTROLLED_VOCABULARIES[:license].attribution_4_0_international,
subject: ['Items'])
.unlock_and_fetch_ldp_object do |uo|

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/MultilineMethodCallIndentation: Align .unlock_and_fetch_ldp_object with .new_locked_ldp_object on line 39.

item_type: CONTROLLED_VOCABULARIES[:item_type].article,
publication_status: [CONTROLLED_VOCABULARIES[:publication_status].published],
license: CONTROLLED_VOCABULARIES[:license].attribution_4_0_international,
subject: ['Items'])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/AlignHash: Align the elements of a hash literal if they span more than one line.

one failing bc Item.public is returning private items.
@@ -86,4 +86,7 @@

# Which ActiveStorage service to use
config.active_storage.service = (ENV['ACTIVE_STORAGE_SERVICE'] || :local).to_sym

# we can tell crawlers to go away during the beta launch phase
config.allow_crawlers = ENV['RAILS_ALLOW_CRAWLERS'].present?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference:
Could have defined this in config/application.rb once instead of doing this in every environment especially when they are the same.

@@ -0,0 +1,147 @@
<?xml version="1.0" encoding="UTF-8"?>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this resourcesync stuff living in test/controllers? Is it being used as a file? Maybe test/fixtures/file a better place?

@@ -0,0 +1,114 @@
<?xml version="1.0" encoding="UTF-8"?>
Copy link
Contributor

@murny murny Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for this one? Don't see anything reference these two files in your PR either (EDIT: I see sitemap.xsd being used but not resourcesync.xsd)?

get items_sitemap_url
end
should 'be valid sitemap xml' do
schema = Nokogiri::XML::Schema(File.open(File.join(File.dirname(__FILE__), 'sitemap.xsd')))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah your using it here. Yeah should look into file_fixtures and put them in test/fixtures/files

@@ -86,4 +86,5 @@

# Which ActiveStorage service to use
config.active_storage.service = (ENV['ACTIVE_STORAGE_SERVICE'] || :local).to_sym

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/EmptyLinesAroundBlockBody: Extra empty line detected at block body end. (https://github.com/bbatsov/ruby-style-guide#empty-lines-around-bodies)

@@ -42,4 +42,5 @@

# Which ActiveStorage service to use
config.active_storage.service = (ENV['ACTIVE_STORAGE_SERVICE'] || :test).to_sym

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/EmptyLinesAroundBlockBody: Extra empty line detected at block body end. (https://github.com/bbatsov/ruby-style-guide#empty-lines-around-bodies)

@@ -83,4 +83,5 @@

# Do not dump schema after migrations.
config.active_record.dump_schema_after_migration = false

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/EmptyLinesAroundBlockBody: Extra empty line detected at block body end. (https://github.com/bbatsov/ruby-style-guide#empty-lines-around-bodies)

@@ -86,4 +86,5 @@

# Which ActiveStorage service to use
config.active_storage.service = (ENV['ACTIVE_STORAGE_SERVICE'] || :local).to_sym

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/EmptyLinesAroundBlockBody: Extra empty line detected at block body end. (https://github.com/bbatsov/ruby-style-guide#empty-lines-around-bodies)

@@ -54,4 +54,5 @@

# Which ActiveStorage service to use
config.active_storage.service = (ENV['ACTIVE_STORAGE_SERVICE'] || :local).to_sym

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout/EmptyLinesAroundBlockBody: Extra empty line detected at block body end. (https://github.com/bbatsov/ruby-style-guide#empty-lines-around-bodies)

get items_sitemap_url
end
should 'be valid sitemap xml' do
schema = Nokogiri::XML::Schema(File.open(Rails.root + 'test/fixtures/files/sitemap.xsd'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nokogiri isn't defined in our Gemfile (might be getting this from another gem, but regardless we should add it in case this other gem removes it down the road and now our test suite fails).

@@ -58,7 +58,7 @@ def before_all
get sitemapindex_url
end
should 'be valid sitemapindex xml' do
schema = Nokogiri::XML::Schema(File.open(File.join(File.dirname(__FILE__), 'siteindex.xsd')))
schema = Nokogiri::XML::Schema(File.open(Rails.root + 'test/fixtures/files/siteindex.xsd'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can get even easier here. You should be able to just call:

file_fixture('siteindex.xsd')

http://api.rubyonrails.org/v5.1/classes/ActiveSupport/Testing/FileFixtures.html

@pgwillia pgwillia force-pushed the sitemap branch 3 times, most recently from 98f94f6 to d619391 Compare January 29, 2018 20:53
end
should 'show location and last modified' do
assert_select 'loc', community_collection_url(@collection.community, @collection)
assert_select 'lastmod', @collection.updated_at.to_s\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, why the \ at the end of updated_at.to_s? Happens a couple times in this file

Copy link
Member Author

@pgwillia pgwillia Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo?! Copy-paste remnant.

config/routes.rb Outdated
get 'sitemap-theses.xml', to: 'sitemap#theses', defaults: { format: :xml }

# Dynamic robots.txt
get 'robots.:format' => 'robots#robots'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just limit it to robots.txt here instead of allowing any format extension? I assume it doesn't matter since in the controller action for this it only has a responds_to :text anyways?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, doesn't matter but the only thing that makes sense is robots.txt -- I'll change it.

@@ -0,0 +1,147 @@
<?xml version="1.0" encoding="UTF-8"?>
Copy link
Contributor

@murny murny Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last comment I have on this PR, this file is never used anywhere? Is this something being kept around for future? Or If it's not needed just remove it?

Maybe I am wrong, looks like sitemap.xsd imports it maybe? From looks of it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is being used. It's included by the sitemap.xsd.

Copy link
Contributor

@murny murny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me now.

gif

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants