Iahmad/migrate transcripts s3#17605
Conversation
|
Your PR has finished running tests. |
2 similar comments
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
mushtaqak
left a comment
There was a problem hiding this comment.
Did initial review. Things are looking good.
Miner comments on logging and single quotes across py file.
| $ ./manage.py cms migrate_transcripts --all-courses --settings=devstack_docker | ||
| $ ./manage.py cms migrate_transcripts 'edX/DemoX/Demo_Course' --settings=devstack_docker | ||
| """ | ||
| args = '<course_id course_id ...>' |
There was a problem hiding this comment.
Should we not have a --commit argument that will actually execute the command. If --commit is not provided the command will draft run and would print out the expected transcripts migrated?
|
|
||
|
|
||
| VIDEO_DICT_STAR = dict( | ||
| client_video_id="TWINKLE TWINKLE", |
There was a problem hiding this comment.
I would prefer single quotes here
| data={'data': video_sample_xml} | ||
| ) | ||
|
|
||
| save_to_store(SRT_FILEDATA, "subs_grmtran1.srt", 'text/srt', self.video_descriptor.location) |
| **kwargs | ||
| ) | ||
| except InvalidKeyError as exc: | ||
| raise CommandError(u'Invalid Course Key: ' + unicode(exc)) |
| kwargs[key] = options[key] | ||
|
|
||
| try: | ||
| enqueue_async_migrate_transcripts_tasks( |
There was a problem hiding this comment.
Should we not check if the course exists with this key ? e.g
if not modulestore().get_course(course_key):
There was a problem hiding this comment.
Added the check in def async_migrate_transcript
There was a problem hiding this comment.
Added this check in def async_migrate_transcript
| revision=ModuleStoreEnum.RevisionOption.published_only, include_orphans=False): | ||
| all_videos.append(video) | ||
|
|
||
| for video in store.get_items(course_key, qualifiers={'category': 'video'}, |
There was a problem hiding this comment.
This will be unneccessary if we do not provide revision argument
There was a problem hiding this comment.
The all revision option does not seem to work ...
cms/djangoapps/contentstore/tasks.py
Outdated
| return all_videos | ||
|
|
||
|
|
||
| def is_transcript_content_srt(transcript_content): |
cms/djangoapps/contentstore/tasks.py
Outdated
| try: | ||
| srt_subs_obj = SubRipFile.from_string(transcript_content.data.decode('utf-8-sig')) | ||
| if len(srt_subs_obj) > 0: | ||
| LOGGER.info("SRT file format detected") |
cms/djangoapps/contentstore/tasks.py
Outdated
| transcript_name = args[2] | ||
| force_update = args[3] | ||
| result = None | ||
| LOGGER.info("Start migrating %s transcript", language_code) |
There was a problem hiding this comment.
lets use better text for this log as well
cms/djangoapps/contentstore/tasks.py
Outdated
| try: | ||
| transcript_content = Transcript.asset(video.location, transcript_name, language_code) | ||
| if video.edx_video_id: | ||
| LOGGER.info("Found edx_video_id= %s via first fetch asset method", video.edx_video_id) |
There was a problem hiding this comment.
Add log prefix so as to make debugging easier
|
Your PR has finished running tests. |
201bf55 to
049ef73
Compare
|
Your PR has finished running tests. |
1 similar comment
|
Your PR has finished running tests. |
muhammad-ammar
left a comment
There was a problem hiding this comment.
@irfanuddinahmad I am done with first pass of the review. Let me know once feedback is addressed.
cms/djangoapps/contentstore/tasks.py
Outdated
| video = args[0] | ||
| language_code = args[1] | ||
| transcript_name = args[2] | ||
| force_update = args[3] |
There was a problem hiding this comment.
we can do something like
video, language_code, transcript_name, force_update = args
cms/djangoapps/contentstore/tasks.py
Outdated
| )) | ||
| else: | ||
| LOGGER.info("video.sub is empty") | ||
| if any(other_lang_transcripts): |
There was a problem hiding this comment.
I think we don't need if other_lang_transcripts:. for loop will not run if other_lang_transcripts is empty.
There was a problem hiding this comment.
I think code under if english_transcript: and for lang, name in other_lang_transcripts.items(): can be merged like below. As per my understanding we don't need separate handling for sub and transcripts in this case.
all_language_transcripts = dict({'en': video.sub}, **other_lang_transcripts)
for lang, name in all_language_transcriptss.items():
...
cms/djangoapps/contentstore/tasks.py
Outdated
| try: | ||
| transcript_content, transcript_name, Transcript_mime_type = get_transcript_from_contentstore( | ||
| video, language_code, Transcript.SJSON) | ||
| if video.edx_video_id: |
There was a problem hiding this comment.
I think we can replace the whole if...else... with something like below
if not clean_video_id(video.edx_video_id):
video.edx_video_id = create_external_video('external-video')
video.save_with_metadata(user=User.objects.get(username='staff'))
result = push_to_s3(
video.edx_video_id,
language_code,
transcript_content,
Transcript.SJSON,
force_update
)
cms/djangoapps/contentstore/tasks.py
Outdated
|
|
||
|
|
||
| @task(base=PersistOnFailureTask) | ||
| def async_migrate_transcript(*args, **kwargs): |
There was a problem hiding this comment.
what about def async_migrate_transcript(course_key, **kwargs):?
cms/djangoapps/contentstore/tasks.py
Outdated
| kwargs = { | ||
| 'force_update': force_update, | ||
| 'commit': commit | ||
| } |
There was a problem hiding this comment.
indentation is bit off
| self.assertEqual(len(languages), 0) | ||
|
|
||
| # now call migrate_transcripts command and check the transcript availability | ||
| call_command('migrate_transcripts', unicode(self.course.id)) |
There was a problem hiding this comment.
I think we should verify the output of command in this case to check if the correct message is returned.
There was a problem hiding this comment.
The command output is checked by the following test
def test_migrate_transcripts_logging
There was a problem hiding this comment.
Yes we should check to have been migrated transcripts here.
| save_to_store(SRT_FILEDATA, 'subs_grmtran1.srt', 'text/srt', self.video_descriptor.location) | ||
| save_to_store(CRO_SRT_FILEDATA, 'subs_croatian1.srt', 'text/srt', self.video_descriptor.location) | ||
|
|
||
| def test_migrated_transcripts_count(self): |
There was a problem hiding this comment.
please add docstring to explain what we are testing/verifying in this test.
| languages = api.get_available_transcript_languages(self.video_descriptor.edx_video_id) | ||
| self.assertEqual(len(languages), 2) | ||
|
|
||
| def test_migrated_transcripts_without_commit(self): |
There was a problem hiding this comment.
same comment as above.
There was a problem hiding this comment.
The command output is checked by the following test
def test_migrate_transcripts_logging
| """ | ||
| Test migrating transcripts | ||
| """ | ||
| translations = self.video_descriptor.available_translations(self.video_descriptor.get_transcripts_info()) |
There was a problem hiding this comment.
Don't we need to pass include_val_transcripts to get_transcripts_info and available_translations?
@Qubad786 what do you say?
| call_command('migrate_transcripts', unicode(self.course.id), '--force-update', '--commit') | ||
|
|
||
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'hr')) | ||
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'ge')) |
There was a problem hiding this comment.
can we do the above in a loop?
There was a problem hiding this comment.
the command arguments change for subsequent calls ... a loop will decrease readability
|
@Qubad786 You also need to look into this PR. |
|
Your PR has finished running tests. |
3 similar comments
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
a59ff2a to
e6d5216
Compare
|
Your PR has finished running tests. |
cms/djangoapps/contentstore/tasks.py
Outdated
| )) | ||
| LOGGER.info("[Transcript migration] process for video %s ended", video.location) | ||
| callback = task_status_callback.s() | ||
| status = chord(sub_tasks)(callback) |
There was a problem hiding this comment.
@nasthagiri I have used a celery task for each video inside a celery task for each course. Do you think this could create too many celery tasks, which could potentially swamp the task queue? The benefit of a 'task per video' could be a more granular failure tracking and retry.
There was a problem hiding this comment.
@irfanuddinahmad That should be fine given our other recent usage of celery tasks:
- we create a celery task for each learner that we send an email to via ACE
- we create a celery task for each problem submission by a learner
However, you should let devops know so they are aware in case they see spikes in tasks at certain times. Also, the team will want to monitor this the first time you roll it out.
Things to consider:
-
Is there a time_limit placed on the task? If your task is calling the network, you'll want to timeout appropriately (and exponentially backoff) in case of a persistent network failure.
-
Are you using our celery-utils, which provides functionality for logging and persistent retries? You may find this useful if you want to persist failures (after celery's retries give up) and automatically retry them via a Jenkins job.
You can see some of this in action in the grades tasks code.
There was a problem hiding this comment.
@nasthagiri Thanks. Very useful feedback (as usual! :) )
I am already using the PersistOnFailureTask ... let me try the
celeryutils.chordable_django_backend and LoggedPersistOnFailureTask also
mushtaqak
left a comment
There was a problem hiding this comment.
@irfanuddinahmad Done with second pass of the review. This is looking good.
| u'Without this flag, the command will return the transcripts discovered for migration ', | ||
| ) | ||
|
|
||
| def handle(self, *args, **options): |
| raise CommandError('At least one course or --all-courses must be specified.') | ||
|
|
||
| kwargs = {} | ||
| for key in ('all_courses', 'force_update', 'commit'): |
There was a problem hiding this comment.
May be we can write this as:
kwargs = {key: options[key] for key in ['all_courses', 'force_update', 'commit'] if options.get(key)}
|
|
||
| 1 | ||
| 00:00:02,720 --> 00:00:05,430 | ||
| Ja, ich spreche Deutsch |
There was a problem hiding this comment.
Lets also add one more line so as to check unicode characters.
2
00:00:6,500 --> 00:00:08,600
可以用“我不太懂艺术 但我知道我喜欢什么”做比喻.
| client_video_id='TWINKLE TWINKLE', | ||
| duration=42.0, | ||
| edx_video_id='test_edx_video_id', | ||
| status="upload", |
|
|
||
| def test_invalid_course(self): | ||
| with self.assertRaises(CommandError): | ||
| call_command('migrate_transcripts', "invalid-course") |
| self.assertEqual(len(languages), 0) | ||
|
|
||
| # now call migrate_transcripts command and check the transcript availability | ||
| call_command('migrate_transcripts', unicode(self.course.id)) |
There was a problem hiding this comment.
Yes we should check to have been migrated transcripts here.
| languages = api.get_available_transcript_languages(self.video_descriptor.edx_video_id) | ||
| self.assertEqual(len(languages), 2) | ||
|
|
||
| def test_migrated_transcripts_without_commit(self): |
| # now call migrate_transcripts command again and check the transcript availability | ||
| call_command('migrate_transcripts', unicode(self.course.id), '--commit') | ||
|
|
||
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'hr')) |
There was a problem hiding this comment.
we should also check the count before and after the command run.
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'hr')) | ||
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'ge')) | ||
|
|
||
| def test_migrate_transcripts_logging(self): |
There was a problem hiding this comment.
We haven't checked for any exception logs. We need to check them as well in this test or a separate test
| call_command('migrate_transcripts', "invalid-course") | ||
|
|
||
|
|
||
| class MigrateTranscripts(ModuleStoreTestCase): |
|
Your PR has finished running tests. |
5 similar comments
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
8a5cf1f to
239b4ad
Compare
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
|
@nasthagiri Kindly advise regarding the trigger for our video transcript management command. Some options are discussed in the comments of the following issue: |
|
Your PR has finished running tests. |
1 similar comment
|
Your PR has finished running tests. |
|
@irfanuddinahmad https://openedx.atlassian.net/browse/ENT-641 takes me to a SuccessFactors ticket. Did you intend another ticket number? |
|
"@irfanuddinahmad https://openedx.atlassian.net/browse/ENT-641 takes me to a SuccessFactors ticket. Did you intend another ticket number?" |
|
@irfanuddinahmad ah. Thanks. I agree with Fred that there are valid reasons to not have Jenkins jobs be fully parameterized. To Backfill Grades, we created a management command for the following reasons:
We then created a Jenkins Job that would be manually triggered to run the management command. Following devOps practices, we configured the management command via Django Admin. Now, for video migrations, you can do the same thing we did for Grades: Or, you can go with Option #1 specified in the ticket: |
|
@nasthagiri Thanks. We will have a Jenkins job configured via Django Admin. |
|
Your PR has finished running tests. |
|
@irfanuddinahmad Ok. Please inform your devops contact person so s/he is aware of your path forward. |
|
Your PR has finished running tests. |
2 similar comments
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
| default=DEFAULT_ALL_COURSES, | ||
| help=u'Migrates transcripts to S3 for all courses.' | ||
| ) | ||
| parser.add_argument( |
| """ | ||
| force_update = BooleanField(default=False) | ||
| commit = BooleanField(default=False) | ||
| course_ids = TextField( |
There was a problem hiding this comment.
Don't we need all_courses option here?
|
|
||
| force_update = BooleanField( | ||
| default=False, | ||
| help_text="Flag to update transcripts in django storage even if already present." |
There was a problem hiding this comment.
Flag to force migrate transcripts for the requested courses, overwrite if already present.
| '--course-id', '--course_id', | ||
| dest='course_ids', | ||
| action='append', | ||
| help=u'Migrates transcripts for list of courses.' |
There was a problem hiding this comment.
nit: Migrates transcripts for the list of courses
| action='store_true', | ||
| default=DEFAULT_COMMIT, | ||
| help=u'Commits the discovered video transcripts to django storage. ' | ||
| u'Without this flag, the command will return the transcripts discovered for migration ' |
There was a problem hiding this comment.
nit: Remove extra space in transcripts discovered for migration probably add a dot 😃
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'hr')) | ||
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'ge')) | ||
|
|
||
| def test_migrate_transcripts_logging(self): |
cms/djangoapps/contentstore/tasks.py
Outdated
| def async_migrate_transcript_subtask(self, *args, **kwargs): | ||
| #pylint: disable=unused-argument | ||
| """ | ||
| Migrates a transcript of a given video in a course as a new celery task. |
There was a problem hiding this comment.
indentation is a bit off here
cms/djangoapps/contentstore/tasks.py
Outdated
| dict({'file_format': file_format}), | ||
| ContentFile(transcript_content) | ||
| ) | ||
| LOGGER.info("[Transcript migration] Push_to_storage %s for %s with create_or_update method", |
There was a problem hiding this comment.
nit: LOGGER.info("[Transcript migration] save_transcript_to_storage: transcript video %s is %s over-written", '' if result else 'not', edx_video_id)
cms/djangoapps/contentstore/tasks.py
Outdated
| file_format, | ||
| ContentFile(transcript_content) | ||
| ) | ||
| LOGGER.info("[Transcript migration] Push_to_storage %s for %s with create method", result, edx_video_id) |
There was a problem hiding this comment.
might need to change to save_transcript_to_storage
cms/djangoapps/contentstore/tasks.py
Outdated
| LOGGER.info("[Transcript migration] Push_to_storage %s for %s with create method", result, edx_video_id) | ||
| return result | ||
| except ValCannotCreateError as err: | ||
| LOGGER.exception("[Transcript migration] Push_to_storage_failed: %s", err) |
There was a problem hiding this comment.
might need to change to save_transcript_to_storage
|
Your PR has finished running tests. |
cms/djangoapps/contentstore/tasks.py
Outdated
There was a problem hiding this comment.
above two variables are unused?
There was a problem hiding this comment.
these are from the branched off master code ...
cms/djangoapps/contentstore/tasks.py
Outdated
There was a problem hiding this comment.
I think we can do clean_video_id only once like this.
edx_video_id = clean_video_id(video.edx_video_id)
if not edx_video_id:
video.edx_video_id = create_external_video('external-video')
video.save_with_metadata(user=User.objects.get(username='staff'))
if edx_video_id:
result = save_transcript_to_storage(
edx_video_id,
language_code,
transcript_content,
Transcript.SJSON,
force_update
)And we need to remove clean_video_id from save_transcript_to_storage as we not making any decision on edx_video_id in it.
mushtaqak
left a comment
There was a problem hiding this comment.
Looks good to me, just one outstanding question remaining though about all_courses option in model.
After that's addressed. Let's rebase and merge. Good work 👍
There was a problem hiding this comment.
Should we not have all_courses option here?
|
Your PR has finished running tests. |
7bad68d to
fabdb1d
Compare
|
Your PR has finished running tests. |
fabdb1d to
3e426ac
Compare
|
Your PR has finished running tests. |
|
EdX Release Notice: This PR has been deployed to the staging environment in preparation for a release to production on Friday, April 13, 2018. |
|
EdX Release Notice: This PR has been deployed to the production environment. |
|
EdX Release Notice: This PR has been rolled back from the production environment. |
Implementation of the design at:
https://openedx.atlassian.net/wiki/spaces/EDUCATOR/pages/162172272/Migration+from+contentstore+to+s3