Could perma.cc help PH keep weblinks sustainable? #2030

Closed
hawc2 opened this issue Feb 5, 2021 · 50 comments

Comments

@hawc2
Contributor

hawc2 commented Feb 5, 2021

I came across perma.cc today and was wondering if it could be useful for the Programming Historian to ensure its weblinks are more 'permanent.'

After chatting with @walshbr and @ZoeLeBlanc I'm opening this issue so we could do some research and see if we should be using it instead of or in conjunction with the web archive

@ZoeLeBlanc
Member

Matt Lincoln also helpfully sent this article on Robustifying Links To Combat Reference Rot https://journal.code4lib.org/articles/15509 (not tagging Matt so that we don't bug him but also wanna give him kudos).

Definitely think we should discuss this all at our next tech team meeting, which @hawc2 you're welcome to attend

@acrymble

This ticket needs someone assigned to it. Otherwise it will stay open forever. @hawc2 are you planning on taking this forward?

@hawc2 hawc2 self-assigned this Mar 22, 2021
@hawc2
Contributor Author

hawc2 commented Mar 22, 2021

Yeah, I just assigned it to myself, and the plan was @ZoeLeBlanc will bring it up at the meeting this Wednesday. There are some organizational decisions to make, but this seems like a pretty viable and sustainable option, if we can get a sponsor library on board

@drjwbaker
Member

drjwbaker commented Mar 24, 2021

On perma.cc: having had a look, the following people are at institutions that are already partners, though I note that in some cases the partner may be a specific (law) library that does not provide support to all faculty.

  • Princeton University Library, Princeton University @ZoeLeBlanc
  • Arthur J. Morris Law Library, University of Virginia School of Law @walshbr
  • Temple University Beasley School of Law Library @hawc2
  • Tarlton Law Library, Jamail Center of Legal Research, The University of Texas @JoshuaGOB

Signing up is free for academic libraries, so I've asked Sussex as well. My instinct is that if we want to move to perma.cc we need a number of us at institutions where our libraries have signed up. So there are two actions here:

  • those named above to ask their librarians if they can use the service.
  • consider if we need to ask all PH members at eligible institutions to ask their library to sign up.

@ZoeLeBlanc
Member

I'll reach out to Princeton, but I was planning to see about getting UIUC to join IPP anyway, so I will ask them about this too!

@ZoeLeBlanc
Member

Just to clarify @hawc2 & @drjwbaker is perma.cc free if a library sponsors us? Or does the library already need to be a member and then we just use it with their account? Mostly just wondering how much this costs the sponsoring library. Thanks!

@drjwbaker
Member

My read is that it is free for academic libraries to join, and then any faculty can use the account for any purpose. But my scan may be wrong! It may just be worth starting by asking your library about their perma.cc membership and how you can use it.

@hawc2
Contributor Author

hawc2 commented Mar 24, 2021 via email

@walshbr
Contributor

walshbr commented Mar 24, 2021

That's funny @hawc2 - I didn't realize it was a common thing. But the UVA Law Library is separate institutionally from the rest of our Library, and I similarly would not have access to their account.

@hawc2
Contributor Author

hawc2 commented Mar 24, 2021 via email

@drjwbaker
Member

Yeah, I guess law libraries at US universities might be separate things, but thought I'd ping you all anyway just in case :)

@hawc2
Contributor Author

hawc2 commented Mar 30, 2021

Good news regarding perma.cc.

  1. I can access perma.cc through Temple Law Library. @walshbr I'm curious what you'll find out about UVA's Law Library.

  2. I heard back from perma.cc and it does sound like as long as one library, such as Sussex, is both an IPP for PH and a member of Perma.cc, then @drjwbaker could create an Org account for PH and add any of us PH editors as Org administrators for PH's perma.cc instance. Even if editors don't have access to perma.cc through their own academic libraries, they can be added as Org administrators and create perma links for PH.

It's possible we could do it with any of our libraries, and that they don't need to be Institutional Partners of PH to use perma.cc for the journal. I've followed up with perma.cc's support team to ask about long-term sustainability in terms of what happens if the relevant staff at the hosting institution were to leave either PH or their academic institution.

@hawc2
Contributor Author

hawc2 commented Mar 30, 2021

Update on migrating between institutions, from perma.cc support: "If you'd like to migrate an org from one registrar to another, you would just need to send in that request to the perma team and get permission from both the existing registrar and the intended registrar."

@drjwbaker
Member

@hawc2 Great digging! Will you reply on our behalf via Temple Law Library (ideally using programminghistorian@gmail.com, though I appreciate you probably don't have access - but you can have it)? Or would you like me to? (if I can via your library)

@hawc2
Contributor Author

hawc2 commented Mar 31, 2021

@drjwbaker Do you mean I should set up an Org account for PH through Temple's account?

If I can have access to the gmail account, I'm happy to begin a separate conversation directly with perma.cc user support about the various options we're considering for using their service for the journal.

@drjwbaker
Member

Okay. I'll email you the gmail details. If you could do it now(ish) I can be sure to approve the login when the big WARNING sign flashes up on my phone :) (google authentication has caused problems before when sharing access)

@drjwbaker
Member

@hawc2 How are you getting on with this? Need a hand?

@hawc2
Contributor Author

hawc2 commented May 3, 2021 via email

@drjwbaker
Member

Okay.

Too time consuming to be worth it? I guess what we are suggesting here is a) all future articles use perma.cc for links; b) when link rot occurs in published articles, perma.cc is used to fix the links (that is, we aren't going to go through and make perma.cc links for all published articles).

Right?

@hawc2
Contributor Author

hawc2 commented May 5, 2021 via email

@ZoeLeBlanc
Member

Thanks for getting this set up, Alex 👏🏽 !

There's no set number on how often this happens, but I do think it's easily once every month or so that we find a broken link for various reasons.

I agree that focusing on future and current breaking links is the right direction and that we can over time move all lessons to using perma.cc. I think an additional next step is writing up documentation for editors to use perma.cc. Right now our tech documentation is long and not broken up easily by topic, so I would recommend potentially starting a new page for fixing broken links and we can work on archiving the existing instructions.

Let me know if you need help with this Alex and thanks again for taking the lead on this 🙌🏽

@drjwbaker
Member

Two thoughts (that came up in an email thread with @hawc2 ):

  1. we should align this with @rivaquiroga's Lesson Maintenance Workflow (#2058)

  2. maybe the correct approach here is to ask MEs for a list of active editors who need accounts? (and then add getting an account to the onboarding process) Alternatively, might editors log into perma.cc via the gmail?

@drjwbaker
Member

And thanks @ZoeLeBlanc for contributing. I'm aware that it is often @programminghistorian/technical-team members who resolve issues with broken links.

@hawc2
Contributor Author

hawc2 commented May 10, 2021

Update on providing access to perma.cc: all PH members can now access our perma.cc account through our programminghistorian@gmail.com account. @drjwbaker has the account access info.

Agree with @ZoeLeBlanc we should create documentation this summer for using perma.cc.

For now, we'll plan to test it out on specific broken links?

I'm happy to help lead the effort but will need some onboarding on how we're handling the problem currently. It makes sense to me to integrate this with @rivaquiroga's Lesson Maintenance Workflow.

@drjwbaker
Member

Is this making progress, @hawc2? (and do we know what the steps that look like progress are?)

@hawc2
Contributor Author

hawc2 commented Jun 15, 2021

Now that we have general access to use it, could we have a meeting to discuss how to proceed, both with testing and implementation?

I am still learning the ropes of some PH processes, so I'm not sure who should be involved and what are the most efficient ways to integrate perma.cc into our workflows.

It shouldn't be a hard tool to use, but as @acrymble mentioned, there are some complex decisions to consider, and it will be very time-consuming to remediate old lessons.

@drjwbaker
Member

drjwbaker commented Jun 15, 2021

It feels like the aim is to get it into the author/editor guidelines as our preferred implementation of URLs where we do not expect the content at those URLs to change / be usefully dynamic (as @acrymble notes). A route to implementation might be to test this with a live article submission (perhaps one you edit?) but that decision is better made by a Managing Editor than me. Perhaps we can add this as a discussion point for our next Project Team Call: @mariajoafana will this be in July?

@anisa-hawes
Contributor

Hello @hawc2. I'd like to be part of this conversation!

@hawc2
Contributor Author

hawc2 commented Jul 28, 2021

Per our team meeting discussion on July 28 #2159, @Anisa-ProgHist and I will test out perma.cc for a PH lesson using the one I just finished editing, currently under copyedit stage with Anisa, issue #325 in ph-submissions: https://programminghistorian.github.io/ph-submissions/lessons/clustering-with-scikit-learn-in-python.

As we finalize this lesson for publication, we'll try to develop some basic standards for use of perma.cc to deal with link 'rot' and 'growth' for further editors. We'll also track how long the process takes us.

While the copyediting stage makes most sense for integrating perma.cc, decisions still need to be made about who will do this labor regularly going forward.

@anisa-hawes
Contributor

As part of our pilot implementation of perma.cc on Submission #325, I have collated a list of all links which appear in the lesson (numbers represent the line/paragraph where the links appear).

This list includes links featured in tables, links referenced within code, and links in footnotes.

Note to self: I would be interested to know if the number of links included here (~75) is roughly representative of a 'standard' lesson.

LINKS

PLUS, ADDITIONAL LINKS SUGGESTED BY COPY-EDITOR @Anisa-ProgHist
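
A minimal sketch of how a list like this could be collated automatically, assuming the lesson is a Markdown file read into a string. The function name `collect_links` and the URL pattern are illustrative assumptions, not part of any PH tooling. The pattern is deliberately simple: it will cut off URLs that themselves contain a closing parenthesis (some Wikipedia links do), so the output would still need a manual check.

```python
import re

# Hypothetical pattern: an http(s) URL runs until whitespace or a character
# that usually closes a Markdown link, table cell, code span or string.
URL_PATTERN = re.compile(r'https?://[^\s<>)\]|`"]+')


def collect_links(markdown_text):
    """Return (line_number, url) pairs for every URL found, in reading order."""
    found = []
    for line_number, line in enumerate(markdown_text.splitlines(), start=1):
        for match in URL_PATTERN.finditer(line):
            # Trim punctuation that normally belongs to the sentence, not the URL.
            url = match.group(0).rstrip(".,;:'")
            found.append((line_number, url))
    return found


if __name__ == "__main__":
    sample = "See [docs](https://example.org/a).\n\n[^1]: Paper at http://example.org/b.pdf."
    for line_number, url in collect_links(sample):
        print(line_number, url)
```

Because it scans every line, this picks up links in tables, code and footnotes as well as in running prose.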

@anisa-hawes
Contributor

anisa-hawes commented Aug 11, 2021

Another example which I think may be useful to consider: Submission #348

LINKS

Additional links suggested by copyeditor @Anisa-ProgHist

@anisa-hawes
Contributor

anisa-hawes commented Aug 13, 2021

Thinking about citations, and wondering whether it would be useful to include both the original URL and the perma.cc URL in our bibliographies/footnotes.

e.g., http://ceur-ws.org/Vol-2253/paper22.pdf archived at https://perma.cc/---

Looking at the recently published lesson Detecting Text Reuse with Passim, I notice that the citation format used doesn't expose the original URL, rather embeds it within the word 'Link'.

Greta Franzini, Maria Moritz, Marco Büchler, Marco Passarotti. Using and evaluating TRACER for an Index fontium computatus of the Summa contra Gentiles of Thomas Aquinas. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). (2018). Link

Going forward, I feel that especially when a link isn't archived at perma.cc, it is useful if we can expose original URLs (this may include considering a system for truncating excessively long URLs/those which include queries) because URLs give readers information about sources.

Q: Are we still aiming to use the Chicago Manual of Style format as our template?
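
To make the idea of exposing both URLs concrete, here is a rough sketch, assuming we simply join the original URL and the archive URL with the words "archived at", as in the example above. The function names, the 60-character limit and the decision to drop query strings for display are all assumptions for illustration. This is not a Chicago Manual of Style rule.

```python
from urllib.parse import urlsplit, urlunsplit


def display_url(url, max_length=60):
    """Shorten an over-long URL for display by dropping its query and fragment.

    The shortened form is for reading only and may no longer resolve.
    """
    if len(url) <= max_length:
        return url
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def cite_link(original_url, archived_url=None):
    """Show the original URL, followed by the archive URL when one exists."""
    shown = display_url(original_url)
    if archived_url:
        return f"{shown} archived at {archived_url}"
    return shown


if __name__ == "__main__":
    # "https://perma.cc/---" is the placeholder used in the comment above.
    print(cite_link("http://ceur-ws.org/Vol-2253/paper22.pdf", "https://perma.cc/---"))
```

Keeping the original URL visible means a reader can still see where a source came from, even when the link they follow is the archived copy.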

@anisa-hawes
Contributor

Also, this guide looks useful: https://guides.law.stanford.edu/c.php?g=588091&p=4063422

It shows how it is possible to 'batch create' links, and organise links within folders. Both these features will be useful to us.

PDFs can also be archived. This could be useful for an example such as that given above (http://ceur-ws.org/Vol-2253/paper22.pdf) of conference proceedings which don't have a DOI.

@anisa-hawes
Contributor

I suspect that Submission #348 is an unusual case, but it does raise some interesting challenges.

It included several links to interactive games which are currently playable on the live web. Perma.cc cannot effectively render this kind of complex content, so I think readers would be dissatisfied upon following the link. However, readers could choose either to click through to ‘See the Screenshot View’ to see a page that looks like the original webpage, or to click through to ‘View the Live Page’, from where they will be able to get started playing the game(s) for as long as it/they exist(s) on the web.

In case anyone following this thread is interested, those instances are as follows:

Interestingly,

  • Line 408 September 7th, 2020 is an example of where perma.cc has achieved a successful capture, which you can interact with here so I have included this.

Links to YouTube playlists are also problematic. The page ‘looks’ right, but each individual video has a unique URL (in fact, they have multiple URLs, depending upon whether the Playlist is played through start to finish, or if a video is selected individually)

@hawc2
Contributor Author

hawc2 commented Sep 5, 2021 via email

@anisa-hawes
Contributor

Ah! Yes! I almost included in my previous comment that, when I am not at PH, I am a freelance web archivist and I use Webrecorder daily! It is my tool of choice: brilliantly powerful. It is definitely capable of capturing these interactive games, and I have tested it to archive several similarly complex sites/artefacts in the past. I know the web archivists at the British Library very well, including those involved in the Collecting Interactive Digital Narratives project, and those who launched the research that became the Emerging Formats initiative. Capturing individual YouTube videos via their canonical URLs works well, and it is also possible to capture YT embeds on other websites, but Playlists pose particular challenges because of the number of URLs associated with each individual video (can be 10 or more). I would be happy to share some examples and more information.
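
To illustrate the point about one video having many URLs, here is a small sketch that reduces the common variants (a `watch` URL carrying extra `list` and `index` parameters, or a `youtu.be` short link) to a single canonical `watch?v=` form. The function name and the video and playlist IDs are made up for illustration, and only these URL shapes are handled. Embed URLs and other variants are ignored.

```python
from urllib.parse import parse_qs, urlsplit


def canonical_youtube_url(url):
    """Return the canonical watch URL for a single video, or None if none is found."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    video_id = None
    if host == "youtu.be":
        # Short links carry the video ID as the path.
        video_id = parts.path.lstrip("/") or None
    elif host in ("youtube.com", "m.youtube.com") and parts.path == "/watch":
        # Watch links carry the ID in "v"; "list" and "index" only give playlist context.
        ids = parse_qs(parts.query).get("v")
        video_id = ids[0] if ids else None

    if video_id is None:
        return None
    return f"https://www.youtube.com/watch?v={video_id}"


if __name__ == "__main__":
    # Hypothetical IDs, for illustration only.
    print(canonical_youtube_url("https://www.youtube.com/watch?v=abc123&list=PLxyz&index=2"))
    print(canonical_youtube_url("https://youtu.be/abc123"))
```

A playlist page itself (a `/playlist?list=...` URL) returns `None` here, which reflects the difficulty described above: the playlist has to be captured video by video.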

@anisa-hawes
Contributor

The developers of Webrecorder are among my direct contacts, and I'd be delighted to chat with them about our use case ✨

@drjwbaker
Member

Per @anisa-hawes @hawc2 introduction at #2223 given the labour involved in using perma.cc is there a case with future new articles for a) encouraging authors to only include essential links, b) discouraging authors from pointing to complicated links (e.g. YouTube playlists). Both these can be justified under our sustainability criteria https://programminghistorian.org/en/reviewer-guidelines#sustainability

@hawc2
Contributor Author

hawc2 commented Sep 30, 2021

@anisa-hawes how much additional time would you say perma.cc linking added to the copyedit stage? given that was your first time, how much faster do you think it could become?

@drjwbaker the perma.cc process definitely made it apparent a number of ways we could clarify guidelines for authors/editors on when to use links and what kind. Reducing links overall isn't a bad idea, and we could ask people to avoid some kinds of unnecessary links to dynamic sites. But I don't think the jury is out on our ability to preserve interactive media like games, so I think we should investigate further first

@anisa-hawes
Contributor

Here is a brief summary of what I said (although did not express as clearly as I would have liked) at today's Project Team Meeting:

  • Our first lesson including Perma.cc links has now been published.

    • the Perma links were generated as part of the copyediting workflow
    • it was a straightforward process, involving making a list of all links included in the lesson (one URL per line), batch creating the archive links (this takes a while...), then replacing all original links in the .md file with the new Perma links. Care is needed with this final task in particular, to ensure that all new links are put in the correct places. GitHub's version control means that we can view the original links within the file's history, so we do not lose them.
    • Perma.cc's UI offers us the option to create a folder structure to organise our links. My suggestion is that we create a folder per lesson, following a naming convention such as 'EN-name-of-lesson'
    • I could personally do this work across all four language journals, to avoid us asking freelance copyeditors to do it.
    • in itself, I think this work could contribute to improving the sustainability of our lessons. During the course of collating the links to archive for the two lessons I tested this on, I identified several already broken and was able to liaise with authors to find alternatives ahead of publication.
  • Positives:

    • Perma.cc's header banner clearly indicates that the page is an archived record of a web page, created at a particular time, on a particular date. It also allows a reader to 'click through' to View the Live Page, so we preserve links from rot, without losing the opportunity for knowledge growth within the life of the lesson, as discussed above. This observation informed our decision to archive all links, including Wikipedia, except in cases where the context requires a link to the live web, for example, "visit [this page], and perform [this action]".
  • Challenges:

    • When a reader visits the page, they will find that they cannot browse or 'link hop' onwards. But, if they make a secondary click, they will encounter a page that gives them the option to view the link as it exists on the live web.
    • Perma.cc cannot capture complex interactive content or video, but we can archive those elements of content using the Webrecorder tool suite. I encountered some initial difficulties with playback of the interactive narratives I captured for the second lesson in our pilot, but with the support of WR's Lead Developer, these issues have been fixed in the latest release. If you are interested to see how my capture of the Twine game functions, you can run the game here. It's in my s3 bucket for now, but we could consider hosting web archive files like this one (the format is .warc or .wacz) on our own server. YouTube Playlists present some particular challenges to web archiving, but in my past work I have developed a method for capturing them: I capture each video individually, then reconstruct the playlist so that it is presented as a list. At the moment, WR's UI doesn't offer this facility, but it will be re-introduced soon. For now, you can see the way I propose we could prepare such collections for access via Conifer (Rhizome's WR instance), https://conifer.rhizome.org/collect_curate/twine-21-tutorials.
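
The final, error-prone step of the workflow described above (swapping original links for Perma links in the .md file) could be sketched roughly as follows. This assumes the archive links have already been batch created in Perma.cc's UI and paired with their originals by hand in a dictionary. The names `replace_links` and `folder_name`, and the URL pattern, are illustrative assumptions. Nothing here talks to Perma.cc itself.

```python
import re

# Hypothetical pattern: an http(s) URL runs until whitespace or a character
# that usually closes a Markdown link, table cell, code span or string.
URL_PATTERN = re.compile(r'https?://[^\s<>)\]|`"]+')


def folder_name(language_code, lesson_slug):
    """Build a folder name following the suggested 'EN-name-of-lesson' convention."""
    return f"{language_code.upper()}-{lesson_slug}"


def replace_links(markdown_text, archive_map):
    """Swap each original URL for its archive URL.

    Returns the new text plus a sorted list of original URLs that were
    expected (present in archive_map) but never found in the text.
    """
    replaced = set()

    def swap(match):
        token = match.group(0)
        url = token.rstrip(".,;:'")      # sentence punctuation is not part of the URL
        trailing = token[len(url):]
        if url in archive_map:
            replaced.add(url)
            return archive_map[url] + trailing
        return token                      # whole-URL match only, so longer URLs stay intact

    new_text = URL_PATTERN.sub(swap, markdown_text)
    missing = sorted(set(archive_map) - replaced)
    return new_text, missing


if __name__ == "__main__":
    text = "See [docs](https://example.org/a) and https://example.org/a/b."
    # "https://perma.cc/---" stands in for a real Perma link.
    new_text, missing = replace_links(text, {"https://example.org/a": "https://perma.cc/---"})
    print(new_text)
    print(missing)
```

Matching whole URLs rather than using a plain string replace means that a link which happens to be the beginning of a longer link is not changed by accident, and the returned list of unmatched originals gives a quick check that every new link found its place.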

@anisa-hawes
Contributor

> Per @anisa-hawes @hawc2 introduction at #2223 given the labour involved in using perma.cc is there a case with future new articles for a) encouraging authors to only include essential links, b) discouraging authors from pointing to complicated links (e.g. YouTube playlists). Both these can be justified under our sustainability criteria https://programminghistorian.org/en/reviewer-guidelines#sustainability

Yes, I think this is something we could consider... In one of the two lessons I read, I found that the author had doubled up on links multiple times, rather than defining it/providing a link upon first mention only. Elsewhere, in that lesson I found myself suggesting additional links to define technical terms. I wonder how typical these two lessons were in terms of the number of links they included?

@drjwbaker
Member

Thanks for the summary @anisa-hawes. I think..

> I think this work could contribute to improving the sustainability of our lessons. During the course of collating the links to archive for the two lessons I tested this on, I identified several already broken and was able to liaise with authors to find alternatives ahead of publication.

..is ultimately the key positive. So long as we have an infrastructure where updates fail because a link on another part of the site has gone down, perma.cc has the advantage of reducing our exposure to that, thus gradually making working with the site much easier.
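
For anyone unfamiliar with the kind of check that fails when an external link goes down, here is a rough sketch using only Python's standard library. It is not PH's actual build check: the function names are assumptions, and some servers reject `HEAD` requests, so a real checker would need a fallback.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response was received."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return None


def find_broken(urls, get_status=http_status):
    """Return (url, status) pairs for links that failed or returned a 4xx/5xx status."""
    broken = []
    for url in urls:
        status = get_status(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken


if __name__ == "__main__":
    # A stand-in status lookup, so the example runs without network access.
    known = {"https://example.org/ok": 200, "https://example.org/gone": 404}
    print(find_broken(list(known), get_status=known.get))
```

Every live external link is one more thing for a check like this to trip over, which is why pointing at archived copies reduces exposure.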

@drjwbaker
Member

drjwbaker commented Sep 30, 2021

> > Per @anisa-hawes @hawc2 introduction at #2223 given the labour involved in using perma.cc is there a case with future new articles for a) encouraging authors to only include essential links, b) discouraging authors from pointing to complicated links (e.g. YouTube playlists). Both these can be justified under our sustainability criteria https://programminghistorian.org/en/reviewer-guidelines#sustainability
>
> Yes, I think this is something we could consider... In one of the two lessons I read, I found that the author had doubled up on links multiple times, rather than defining it/providing a link upon first mention only. Elsewhere, in that lesson I found myself suggesting additional links to define technical terms. I wonder how typical these two lessons were in terms of the number of links they included?

Personally, I think some authors use links in our articles as they would in a blog rather than a journal, because we've always encouraged it, and we are now seeing the downside as links break and cause work. Now, I don't want to encourage the inflexibility of journal policies towards links/URLs, but I think we could advise more parsimonious use of links and/or a use of links that is clearly justifiable/justified.

@anisa-hawes
Contributor

anisa-hawes commented Sep 30, 2021

> @anisa-hawes how much additional time would you say perma.cc linking added to the copyedit stage? given that was your first time, how much faster do you think it could become?
>
> @drjwbaker the perma.cc process definitely made it apparent a number of ways we could clarify guidelines for authors/editors on when to use links and what kind. Reducing links overall isn't a bad idea, and we could ask people to avoid some kinds of unnecessary links to dynamic sites. But I don't think the jury is out on our ability to preserve interactive media like games, so I think we should investigate further first

I estimate that it added another couple of hours to copyediting, but it felt worthwhile for the reasons explained above. But, you are right to observe that the process can be speeded up as I become more familiar with the workflow.

I'm not certain how often authors link out to YouTube Playlists / individual videos or exceptionally complex content (e.g. the interactive narratives), but I think it's good if we have a workflow in place for if they do – because this content isn't robust. Indeed, the author of the interactive narratives commented on their instability.

@anisa-hawes
Contributor

That's an interesting thought, @drjwbaker. Thank you!

@anisa-hawes
Contributor

In another recent Issue, we were talking about updates to the research/investigacion/recherche/pesquisa pages. I note that links on these pages break frequently. Perhaps these are good candidates for perma.cc overhauls too!

@anisa-hawes
Contributor

I am currently finalising a draft of revised Editorial Guidelines (to be tested in an Onboarding pilot study with the English team this autumn) which include detailed steps for the Copyediting phase of the workflow. My draft integrates step-by-step instructions for link archiving using perma.cc, but recognises that it doesn't have to be the same person who undertakes both tasks. For example, I could perform the link archiving task across all four languages.

Going forwards, I think we could consider integrating use of Webrecorder tools to stabilise (and ensure sustainable access to) the kinds of complex online content (interactives, video, 3D models, etc.) we are likely to encounter more frequently in the future. I've added this as an idea for one of our Longer-term Goals within our shared planning document.

Following this successful pilot study, I am closing this Issue.
