This is an open-source, self-hosted, web-based tool for correcting transcripts that were automatically generated using speech-to-text software. It was originally developed by NYPL Labs in partnership with The Moth and Pop Up Archive with generous support from the Knight Foundation.
Since 2018, GBH Archives has adopted and extended this project to provide a crowdsourced transcript correction platform for GBH and for the public media community at large.
If you would like to help improve access to public media audiovisual materials, please correct some transcripts on our public FIXIT+ site! ( https://fixitplus.americanarchive.org/ )
- Setting up the editor
- Importing your transcripts
- Customizing your project
- Transcript Consensus
- Deploying your project
- Managing your project
- Retrieving your finished transcripts
- Export UIDs by project
- Contributing
- License
- Attribution
You will need to have the following installed to run this project on your machine.
- Git
- Ruby - this app has been developed using 2.3.0. Older versions may not work
- PostgreSQL
Once everything is installed, clone this repository
cd /my/projects/folder
git clone https://github.com/WGBH/transcript-editor.git
cd transcript-editor
If you forked this repository, replace the URL with your repository
- Create
config/database.yml
based on config/database.sample.yml - update this file with your own database credentials - Create
config/application.yml
based on config/application.sample.yml - this file contains all your private config credentials such as Pop Up Archive or Google accounts. The only required configuration to start is:
- SECRET_KEY_BASE. You can generate this value by running
rake secret
- PROJECT_ID. A project id that will be used to identify this project (e.g. my-project). Must be alphanumeric; no spaces or periods; underscores and dashes okay
- Copy the folder
project/sample-project
and rename it to the PROJECT_ID from the previous step (e.g.project/my-project
). This folder will contain all the configuration, content, and language for your project.
Your project folder has the following structure:
my-project/
+-- assets/
| +-- css/
| +-- img/
| +-- js/
+-- data/
+-- layouts/
+-- pages/
+-- transcripts/
+-- project.json
The primary place for project configuration the file project.json
. For now, we can keep everything as defaults. We will cover the details of this folder in later steps.
- Run
bundle
- this will install all the necessary gems for this app - Run
rake db:setup
to setup the database based onconfig/database.yml
- Run
rake db:seed RAILS_ENV=test
to populate the test database for Minitest suite. - Run
rake project:load['my-project']
to load your project folder (replace my-project with your project name) - Run
rails s
to start your server. Go to http://localhost:3000/ to view your project
Your project should load, but since there's no transcripts, all you'll see is a header and blank screen! The next step is to seed the app with some data.
This section will assume you are using the American Archive of Public Broadcasting as the data source for importing transcripts.
All of the rake tasks in "Creating a manifest file", "Downloading transcripts", "Converting transcripts", and "Importing transcripts" have been wrapped in to a single process for WGBH AAPB transcripts.
If you are importing AAPB transcripts in to a project, you can run one rake task:
bundle exec rake aapb:ingest_guids['/path/to/file.txt','my-project']
This task will create a manifest file for AAPB records, download, convert, and import the transcripts.
The Collections for the transcript editor need to be created prior to uploading any transcripts or additional data.
In your project folder, you should find an empty .csv file: project/my-project/data/collections_seeds.csv.
Create a manifest file where the uid matches the
<pbcoreAnnotation annotationType="organization">
from the AAPB PBCore. Title is WGBH's only other required field. Load them into the app with this command:
rake collections:load['my-project','collections_seeds.csv']
Replace my-project
with your project id and collections_seeds.csv
with your file's name.
You can always re-run this script with new data and manifest files.
New audio files and transcripts can be added to this app by creating manifest files in .csv format. These manifest files will contain basic information about your audio, e.g. an internal id, title, description, url to audio file, etc. These files will be used to perform a number of tasks such as uploading new audio for transcription, download processed transcripts, and updating information about your audio.
In your project folder, you should find an empty .csv file: project/my-project/data/transcripts_seeds.csv. It contains the following columns:
Column | Description | Required? | Example |
---|---|---|---|
uid | a unique identifier for your audio file. Must be alphanumeric; no spaces or periods; underscores and dashes okay. Case sensitive. | Yes | podcast-123, podcast_123, 123 |
title | the title that will be displayed for this audio file | Yes | Podcast About Cats |
description | a description that will be displayed for this audio file | No | This is basically teh best podcast about cats; no dogs allowed |
url | a URL that will link back to where the audio is being presented on your website | No | http://mywebsite.com/podcast-123 |
audio_url | a public URL to your audio file | Yes, unless you already uploaded audio to Pop Up Archive or already have transcripts | http://mywebsite.com/podcast-123.mp3 |
image_url | a public URL to an image representing your audio; square and ~400px is preferred | No | http://mywebsite.com/podcast-123.jpg |
collection | the unique identifier for the collection this audio belongs to (see below for more on this) | No | cat-collection |
vendor | the vendor that will be doing the transcription | Yes, unless you already produced transcripts yourself | pop_up_archive |
vendor_identifier | if already uploaded the audio to Pop Up Archive, put the item id here | Only if you already uploaded audio to Pop Up Archive | 41326 |
notes | any extra notes that will only be used internally (not public) | No | this audio contains explicit material |
Lucky for you, we've got a process to handle all this. The first thing you'll need is a text file of AAPB GUIDs.
Then you can create the manifest file using this command:
rake aapb:create_seeds['ids_file_path','my-project']
Replace ids_file_path
with the path to your text file of AAPB GUIDs and my-project
with your project id. This process creates a csv file in the project's data directory named today's date. The data scraped from .pbcore view of the AAPB web application.
After the file is created, you can load them into the app with this command:
rake transcripts:load['my-project','transcripts_seeds.csv']
Replace my-project
with your project id and transcripts_seeds.csv
with your newly created file's name. You can run this command any number of times after editing the manifest file or with new manifest files. The script will check if the transcript already exists using the uid
column value.
Most of our transcripts are phrase level JSON stored in an online object store, so we've created an AAPB transcript downloader. Again you'll need a text file of AAPB GUIDS to run the rake take.
Download the transcripts using this command:
rake aapb:download_transcripts['ids_file_path','my-project']
Replace ids_file_path
with the path to your text file and 'my-project' with your project id. This process will: create an aapb
directory in the project's transcripts' directory if one does not already exist, use the GUID to look for a transcript in the online object store and download it to the 'aapb' directory.
The original transcript-editor was written to support the WebVTT format, a W3C standard for displaying timed text in connection with the HTML5 <track>
element. This format is extended from the popular SRT format.
Since the AAPB transcripts are phrase level JSON we wrote a transcript converter so that they'd play function within the current infrastructure.
Run this task to convert AAPB's phrase level JSON transcripts to WebVTT for the transcript-editor:
rake transcripts:convert['transcript_files_path', 'directory' , 'to_format']
Replace transcript_files_path
with the path to a directory containing all the transcripts that you'd like to convert, directory
with the path of the directory to which you'd like the converted transcripts saved, and to_format
with the file format to which you'd like the transcripts converted. In our process this would simply be vtt
.
The converter classes were purposefully written to be fairly easily extended, but right now it only has a reader for AAPB JSON and a converter for WebVTT. But new readers and writers can be plugged in the same process. Any new reader should return a hash in the following format:
{ "file_name" => "file_name",
"parts" => [
{ text: "transcript phrase",
start_time: "start time of phrase in hundreths of a second", end_time: "end time of phrase in hundreths of a second"
}...
]
}
And any new converters should be written to process this same format.
- Place all your
.vtt
files in folder/project/my-project/transcripts/webvtt/
- Create a manifest file
- Refer to this section to setup a manifest file
- In the
vendor
column, enterwebvtt
for each transcript - In the
vendor_identifier
column, enter the name of the.vtt
file, e.g.transcript_1234.vtt
- If you have not already, run the
rake transcripts:load['my-project','the_manifest_file.csv']
task on your manifest file to create entries for your transcripts
- Finally, run
rake webvtt:read['my-project']
which will import all the.vtt
files that have not already been processed
To make it easier to mass delete AAPB transcripts that have been erroneously uploaded, we created a rake task to remove a list of AAPB GUIDS in a text file.
bundle exec rake aapb:delete_guids['/path/to/file.txt']
All project customization should happen within your project directory (e.g. /project/my-project/
). Changes made anywhere else may result in code conflicts when updating your app with new code.
Whenever you make a change to your project directory, you must run the following rake task to see it in the app:
rake project:load['my-project']
This app currently supports logging in through Google or Facebook accounts (via OAuth2). You can activate this by the following:
- Log in to your Google account and visit https://console.developers.google.com/; complete any registration steps required
- Once you are logged into your Developer dashboard, create a project
- In your project's dashboard click enable and manage Google APIs. You must enable at least Contacts API and Google+ API
- Click the Credentials tab of your project dashboard, Create credentials for an OAuth client ID and select Web application
- You should make at least two credentials for your Development and Production environments (you can also create one for a Test environment)
- For development, enter
http://localhost:3000
(or whatever your development URI is) for your Authorized Javascript origins andhttp://localhost:3000/omniauth/google_oauth2/callback
for your Authorized redirect URIs - For production, enter the same values, but replace
http://localhost:3000
with your production URI e.g.https://myproject.com
- Open up your
config/application.yml
- For each development and production, copy the values listed for Client ID and Client secret into the appropriate key-value entry, e.g.
development:
GOOGLE_CLIENT_ID: 1234567890-abcdefghijklmnop.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET: aAbBcCdDeEfFgGhHiIjKlLmM
production:
GOOGLE_CLIENT_ID: 0987654321-ghijklmnopabcdef.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET: gGhHiIjKlLmMaAbBcCdDeEfF
- Google login is now enabled in the Rails app. Now we need to enable it in the UI. Open up
project/my-project/project.json
. Underauth_providers
enter:
"authProviders": [
{
"name": "google",
"label": "Google",
"path": "/auth/google_oauth2"
}
],
- Run
rake project:load['my-project']
to refresh this config in the interface - Finally, restart your server and visit
http://localhost:3000
. Now you should see the option to sign in via Google.
Notice: Speaker functionality is not being used in our instance, but retaining documentation for potential future use.
Sometimes your audio will contain multiple speakers. If you already know who are speaking, you can seed your transcripts with the speakers. Your users could then choose which speaker is speaking on any given line.
In your project folder, you should find an empty .csv file: project/my-project/data/speakers_seeds.csv. It contains just two columns: transcript_uid and name. Simply create one speaker per line, where transcript_uid is the transcript's uid and name is the name of the speaker.
Once you fill out the speakers this file, you can load them into the app with this command:
rake speakers:load['my-project','speakers_seeds.csv']
This app let's you create an arbitrary number of pages that you may link from the navigation menu or within other pages. All pages are found within:
project/
+-- my-project/
| +-- pages/
- All pages are written in Markdown, but since Markdown supports HTML, you can use HTML syntax as well.
- If you create a page called
faq.md
, you can access it via URLhttp://localhost:3000/page/faq
- Subdirectories are supported, but the URL will always respond to just the filename, e.g. for the file
project/my-project/pages/misc/faq.md
, the URL will still behttp://localhost:3000/page/faq
- You can embed assets in your markdown. For example
- Place an image in assets folder like
project/my-project/assets/img/graphic.jpg
- You can refer to it in a page like this:
<img src="/my-project/assets/img/graphic.jpg" />
- Place an image in assets folder like
- There are a few pages that the app comes with:
home.md
- contains the content that shows up on the homepagetranscript_edit.md
- contains the content that shows up on the top of all transcript editor pagestranscript_conventions.md
- contains the transcript conventions that show up in the drop-down on all transcript editor pages
In your project/my-project/project.json
file, there is an entry called menus
. These will contain all the available menus that will be displayed in the app. Here are the available menus:
header
- this is the persistent menu that shows up on the top of all pagestranscript_edit
- this is the menu that shows up below the main header menu if you are on a transcript editor pagefooter
- this is the persistent menu that shows up on the bottom of all pages
Each menu will contain a number of entries (or no entries). It may look like this:
"header": [
{"label": "Browse", "url": "/"},
{"label": "About", "url": "/page/about"},
{"label": "Main Website", "url": "http://otherwebsite.com/"}
],
The label
is what will show up in the menu, and the URL is what that label links to. It can link to a page within the app or an external page.
Sometimes you only want to have a link show up on certain pages. You can accomplish this like so:
"header": [
{"label": "Browse", "url": "/"},
{"label": "About", "url": "/page/about"},
{"label": "Help", "url": "/page/help", "validRoutes": ["transcripts/:id"]}
],
In the above case, the Help
link will only show up on transcript editor pages. You can see a list of available routes in the app's router.js file
Sometimes you don't want to redirect a user to a different page, but want to have the content show up in a pop-up modal. You can define modals in your project.json
file like this:
"modals": {
"help_modal": {
"title": "A Brief Guide",
"doneLabel": "Close",
"page": {"file": "help.md"}
},
"tutorial_modal": {
"title": "A Brief Tutorial",
"doneLabel": "Finished",
"pages": [
{"label": "Editing", "file": "tutorial_1.md"},
{"label": "Conventions", "file": "tutorial_2.md"},
{"label": "Commands", "file": "tutorial_3.md"}
]
}
},
This will create two modals:
help_modal
which contains the content of just one page:project/my-project/pages/help.md
tutorial_modal
which contains tabbed content of three pages
You can invoke a modal from within a menu like so:
"menus": {
"header": [
{"label": "Browse", "url": "/"},
{"label": "About", "url": "/page/about"},
{"label": "Help", "modal": "help_modal"}
],
...
},
You would probably want to customize the look and feel of your app. You can accomplish this by overriding the default CSS styling with a project CSS file:
project/
+-- my-project/
| +-- assets/
| +-- css/
| +-- styles.css
These styles will override any existing styles in the app. Similarly, you can add additional javascript functionality via custom js:
project/
+-- my-project/
| +-- assets/
| +-- js/
| +-- custom.js
Sometimes you may want to include additional files or tags in your app such as custom external font services, analytics, or meta tags. You can simply edit this page:
project/
+-- my-project/
| +-- layouts/
| +-- index.html
Be careful not to edit the existing app structure within the #app
element. Also, there are a few javascript and css files that the app depends on that you shouldn't delete.
Be sure to run the project rake task if you make any changes:
rake project:load['my-project']
Even though this app could be used by just one individual, this app was designed to allow multiple users to correct the same transcript simultaneously. The goal is to improve quality by having many users look at each line of text. This app contains a configurable "consensus" algorithm that looks for where users agree and disagree, and makes an informed decision about which submission is best.
Consensus is essentially when a certain threshold of users agree that a certain line of a transcript contains the correct text. This is a bit more complicated than it seems. Consider the following three lines of text submitted by three separate users:
- "Hi, my name is Jane Doe."
- "hi my name is Jane Doe"
- "Hi, my name is Jane Do."
Let's assume we need more than 50% of users to agree on a particular line of text--in this case, we need 2 of the 3 users to agree. The first two lines essentially say the same thing, but the first line contains capitalization and punctuation. The first thing the consensus algorithm does is "normalize" the text by removing punctuation and making everything the same case (it also removes extra whitespace and removes hesitations like "uhh" and "ummm"). So then the submissions will look like this:
- "hi my name is jane doe"
- "hi my name is jane doe"
- "hi my name is jane do"
So based on our requirement for at least 2 of 3 users to agree, it looks like the first 2 users agree and we have reached consensus. Now we have to choose which submission is "better" out of the 2:
- "Hi, my name is Jane Doe."
- "hi my name is Jane Doe"
Even though this can get rather subjective in some cases, the algorithm favors the following:
- line contains capitalization
- line contains punctuation
- line contains numbers (since the default convention is to transcribe a number and numerals, e.g. "3" not three)
- line is submitted by a registered user
- line contains hesitations (since the default convention is to transcribe as you hear it, e.g. uhh, umm)
Therefore, between our 2 options, the chosen text will be "Hi, my name is Jane Doe.". This line now has reached consensus and is now considered "complete."
The previous example demonstrates the ideal case where a line automatically reaches consensus. But there will be many cases where users simply don't agree and arbitration will be required. Since this app is built for large audio collections where the project creator may not have the resources to manually settle disputes, arbitration is built into the app itself.
The project creator can set a limit to how many submissions are allowed for a particular line. Let's say that number is 5 and the submissions are as follows:
- "Hi, my name is John Hardtospell."
- "hi my name is John hearttospell"
- "Hi, my name is John Harttospell."
- "Hi, my name is John Hard2spell."
- "hi my name is John Hardtospel"
Since this line reached 5 submissions but did not reach consensus (like in the previous example), it will go into "review mode". When this happens, users can no longer submit new transcriptions but must choose the best among the 5 existing choices/submissions.
Let's say that a 6th user chooses option 1; their "vote" counts as another submission with that options's text. So let's say more users vote as follows:
- "Hi, my name is John Hardtospell." (+5 votes)
- "hi my name is John hearttospell"
- "Hi, my name is John Harttospell." (+1 vote)
- "Hi, my name is John Hard2spell."
- "hi my name is John Hardtospel"
If we still have the same consensus rules as before (50% must agree), option 1 will be chosen as the correct text since it has 6 of the 11 total submissions.
There are generally 4 stages a line of text can possibly go through:
- Initialized: this contains the original text that the speech-to-text software created
- Edited: the line has received submissions, but not enough to reach consensus
- In Review: the line has reached enough submissions for consensus, but not enough users agree to reach consensus, so it must be reviewed by others
- Complete: the line has reached consensus and is no longer editable.
Similarly, an entire transcript has similar stages, but is generally just an aggregate of the lines that it is made up of. So a transcript is only Complete when all of its lines are Complete.
All of a project's consensus rules are defined in the project.json
file like so:
"consensus": {
"maxLineEdits": 5,
"minLinesForConsensus": 3,
"minPercentConsensus": 0.34,
"minLinesForConsensusNoEdits": 5,
"lineDisplayMethod": "original"
},
...
Here are what each property means:
- maxLineEdits - This is the maximum number of submissions for a particular line. If a line did not reach consensus when it reaches this number, it will go into review mode. Since the line will be locked to new submissions after this number is reached, the number should be reasonably high enough so least one of the submissions is probably correct.
- minLinesForConsensus - The minimum number of submissions required for a line to reach consensus. If you are expecting a lot of spam, you can increase this number. If you are using this tool internally and/or you trust all users of this tool, you can set this number to 1, meaning a single user's submission is final.
- minPercentConsensus - The threshold for determining if a line reached consensus. For example, if this number is 0.5, at least 50% of users must agree in order for a line to reach consensus.
- minLinesForConsensusNoEdits - Sometimes users submit a line with no edits, i.e. presses enter on a line that contains the original computer text. This may be intended or not: the original text may be correct or the user may just be navigating away from the line. In order to reduce false positives, this property allows you to configure how many "no-edit" submissions are required for it to be considered for consensus. For example, if the number is 5, at least 5 people must say that the original computer text was correct before it is considered for consensus. Submissions that contain edits are always preferred over no-edit submissions.
- lineDisplayMethod - This can either be original or guess. If this property is original, the user will see the original computer text when they are editing a line that is not locked. If this property is guess, the user will see what the app thinks is the "best guess" thus far based on the existing submissions. The advantage of showing the original is that all submissions are made independently since they do not see other users' submissions. It also ensures spam submissions are not seen by others. The advantage of guess is that consensus is more likely to be reached quicker since users are essentially looking at user submissions which would likely be more quality that the computer transcriptions.
Be sure to run the project rake task if you make any changes:
rake project:load['my-project']
If you edit the consensus rules after you have received edits, you can retroactively apply the new rules to the existing edits with the following tasks:
rake transcript_lines:recalculate
rake transcripts:recalculate
The full release script is currently saved in the WGBH MLA wiki in the MLA Staff Pages section. The document is called "NYPL Transcript Editor Deployment."
Broad overview steps are:
- Provision a server on AWS.
- Run the MLA Playbook for transcript-editor
- Install and set up Postgres 9.6.
- Run the [transcript-editor_deploy] Capistrano deployment](https://github.com/WGBH/transcript-editor_deploy)
- Intall and run Gulp.
Notice: We are not deploying our code with Heroku, but keeping this section for potential future use.
This example will use Heroku to deploy the app to production, though the process would be similar for other hosting solutions. The commands assume you have Heroku Toolbelt installed.
Before you start, if you used Pop Up Archive to generate your transcripts, make sure your manifest files are up-to-date to make sure your production server knows how to download the transcripts from Pop Up Archive. Run these commands:
rake collections:update_file['my-project,'collections_seeds.csv']
rake transcripts:update_file['my-project','transcripts_seeds.csv']
Replace my-project
and .csv
files with your project key and manifest files. Commit the updated manifest files to your repository and continue.
-
Create a new Heroku app:
heroku apps:create my-app-name heroku git:remote -a my-app-name
(Only run the 2nd command if you already have an app setup)
-
Provision a PostgreSQL database:
heroku addons:create heroku-postgresql:hobby-dev heroku pg:wait heroku config -s | grep HEROKU_POSTGRESQL
Replace
hobby-dev
with your database plan of choice. This example uses the free "Hobby Dev" plan. Note that you should choose a higher plan (e.g.standard-0
) for production; Hobby Dev has a row limit of 10,000 and a maximum of 20 connections. You can always upgrade an existing database. -
Update your environment variables
figaro heroku:set -e production
This sets environment variables from
config/application.yml
in your production environment -
Deploy the code and run rake tasks
git push heroku master heroku run rake db:migrate heroku run rake db:seed
-
Next you'll need to populate your transcripts. The last command will download your transcripts from Pop Up Archive. You can run these commands however many times you like if you update your manifest file or transcripts become available.
heroku run rake collections:load['my-project','collections_seeds.csv'] heroku run rake transcripts:load['my-project','transcripts_seeds.csv'] heroku run rake pua:download['my-project']
This tool has the concept of "user roles", though currently only supports three types of users:
- Guests - These are anonymous users that did not register. They can make contributions to transcripts, but they will not be able to track their progress between sessions. Their contributions also weigh slightly less than registered users during consensus calculations.
- Registered Users - These are the same as guests, but they are able to track their editing progress between any session or computer if they are signed in. They do this by signing in via Google or Facebook. Their contributions weigh slightly more than anonymous users during consensus calculations.
- Admins - These are users that are explicitly configured to be administrators of the tool. Their privileges/capabilities include:
- All their edits are considered "complete" and do not have to go through the usual process of consensus.
- They can edit lines that have status "completed"
- They can resolve flags. If a user thinks that a line that is completed/locked still contains errors, they can flag it with a type (misspelling, repeated word, missing word, etc) and comment.
- They have access to the admin dashboard. This gives them basic statistics about overall edits, user registrations, top users, and flags.
You can add admins by editing your project.json
file like so:
"adminEmails": [
"janedoe@myorg.org",
"johndoe@myorg.org",
...
],
...
Make sure these email addresses correspond to the email addresses that the user signs in with (i.e. Google or Facebook.) When the user signs in using this email address, they will be able to access their dashboard by clicking the drop-down menu on the top right side of the app.
Sometimes you just need all the transcript UIDs for a project, so we built a rake task for that: rake transcripts:export_uids['my-project']
.
Each transcript will be available to download in three formats: plain text, captions (WebVTT - based on SRT), and JSON. These are available at any stage of the transcription process regardless of how "complete" it is. They are available via the UI by clicking "Download this transcript." Users can choose whether they want to include speakers, timestamps, or raw edits depending on which format they choose.
You can enable/disable transcript downloads by setting allowTranscriptDownload
to true
or false
in your project.json
file like so:
"allowTranscriptDownload": true,
...
You can also enable/disable individual transcript downloads by setting can_download
to 1
or 0
in the transcripts
table in the database.
The app has an endpoint that enables programmatic access to transcripts:
GET /transcript_files.json?updated_after=yyyy-mm-dd&page=1
This will get all the transcript files that were updated after a certain date. This is useful if you want to periodically update transcripts that you display on another website.
A collection of words, arranged in a particular order that conveys meaning. This collective meaning corresponds to a certain audio or video program.
A grouping of Transcripts by 'Station Name'
String that appears in Fixit UI as 'Collection Name'
Stored in Collection#uid
field
Used as value for "organization" annotation that links each record to its Collection
A string used throughout Fixit+ to differentiate configuration for different fixit instances (fix_it_plus, riverside)
Must have pbcore annotation of type 'organization'
<pbcoreAnnotation annotationType="organization">[containing Collection's uid/Station Name value in Fixit+]</pbcoreAnnotation>
Must have pbcore annotation of type 'Transcript Status'
<pbcoreAnnotation annotationType="Transcript Status">[ 'Uncorrected' or 'Correcting' or 'Correct' ]</pbcoreAnnotation>
Must have transcript file stored at s3://americanarchive.org/transcripts/[guid]/[guid]-transcript.json
Must be (re)ingested into production AAPB after transcript file was placed in S3, to create
<pbcoreAnnotation annotationType="Transcript URL">[ts location in aapb s3]</pbcoreAnnotation>
:uid
field set to Station Name based on the station's 'Contributing Organization':vendor_id
field set to 0 (this is an internal reference to the transcript file format, but we only use VTT):project_uid
set to Project ID (i.e.fix_it_plus
orriverside
)
- SSH into Fixit+ instance as ec2-user
- Navigate to the Fixit+ app directory:
cd /var/www/transcript-editor/current
- Start a Rails Console:
RAILS_ENV=production bundle exec rails c
- Create the Collection using the station information:
Collection.create(uid: "This is my station name", vendor_id: 0, project_id: "example_project_id")
- Text file containing the guid for each record, separated by newlines
- Previously-created
Collection
with correct Project ID, Vendor ID (always 0), and Station Name - All records must have
<pbcoreAnnotation annotationType="organization">Example Station Name Value</pbcoreAnnotation>
in AAPB - All records must have
<pbcoreAnnotation annotationType="Transcript Status">Uncorrected</pbcoreAnnotation>
in AAPB (value can be Uncorrected/Correcting/Correct) - All records must have transcript file stored at the conventional AAPB S3 location
s3://americanarchive.org/transcripts/EXAMPLEGUID/EXAMPLEGUID-transcript.json
- All records must have been ingested or reindexed into production AAPB after the transcript file was stored in S3, and after the required annotations have been added. This creates the required "Transcript URL" annotation.
- Use scp to copy the guids text file into ec2-user's home directory on the destination Fixit+ instance:
scp -i ~/.ssh/transcript-editor.pem exampleguidsfile.txt ec2-user@examplefixit.host.name:/home/ec2-user/exampleguidsfile.txt
- SSH into the destination Fixit+ instance:
ssh -i ~/.ssh/transcript-editor.pem ec2-user@examplefixit.host.name
- Navigate to the Fixit+ app directory:
cd /var/www/transcript-editor/current
- Run the ingest rake task:
bundle exec rake aapb:ingest_guids['/home/ec2-user/exampleguidsfile.txt','example_project_id']
Releasing a transcript from Fixit+ to AAPB simply requires replacing the transcript file in AAPB S3.
No reindexing is required, because all required annotations must have already been in place to bring the record into Fixit+.
Currently, the scripts included in Fixit+ can only either:
- Release all 100% complete transcripts in a Fixit instance
- Release all transcripts in a Fixit instance
- SSH into the destination Fixit+ instance:
ssh -i ~/.ssh/transcript-editor.pem ec2-user@examplefixit.host.name
- Navigate to the Fixit+ app directory:
cd /var/www/transcript-editor/current
- Run the release script:
RAILS_ENV=production bundle exec ruby scripts/workflow.rb --complete
See LICENSE.
Though it’s not required, if you would like to credit us as the source we recommend using the following statement and links:
Powered by the Open Transcript Editor created by The New York Public Library with generous support from the Knight Prototype Fund.