This repository has been archived by the owner on Dec 14, 2023. It is now read-only.
Source metadata (e.g. end date and num_stories_X) incorrect #726
The media_health data is generated by a daily cron job. The job must not be
running. I will take a look.
…-hal
On Wed, Sep 23, 2020 at 9:27 AM Dennis Jen ***@***.***> wrote:
We've found a number of inconsistencies relating to end dates in source
manager:
- mediacloud/web-tools#1953
<https://github.com/mitmedialab/MediaCloud-Web-Tools/issues/1953>
- mediacloud/web-tools#1991
<https://github.com/mitmedialab/MediaCloud-Web-Tools/issues/1991>
I think I've zeroed in on where the problem may lie. The mediaHealth
endpoint returns a different end date from storyCount. For example:
mc.mediaHealth(1363086)
{'coverage_gaps': 1,
'coverage_gaps_list': [{'expected_sentences': 569.03,
'expected_stories': 2.69,
'media_id': 1363086,
'num_sentences': 102.71,
'num_stories': 3.0,
'stat_week': '2020-02-10 00:00:00-05:00'}],
'end_date': '2020-02-17 00:00:00-05:00',
'expected_sentences': 569.03,
'expected_stories': 2.69,
'has_active_feed': False,
'is_healthy': False,
'media_health_id': 1115839766,
'media_id': 1363086,
'num_sentences': 0,
'num_sentences_90': 0,
'num_sentences_w': 0,
'num_sentences_y': 224.09,
'num_stories': 0,
'num_stories_90': 0,
'num_stories_w': 0,
'num_stories_y': 1.44,
'start_date': '2019-02-11 00:00:00-05:00'}
fq='publish_day:[2010-01-01T00:00:00Z TO 2020-09-23T00:00:00Z]'
q='media_id:1363086 AND NOT tags_id_stories:8875452'
mc.storyCount(solr_query=q, solr_filter=fq, split=True)
{'counts': [{'count': 1, 'date': '2019-02-13 00:00:00'},
{'count': 1, 'date': '2019-02-15 00:00:00'},
{'count': 1, 'date': '2019-02-20 00:00:00'},
...
{'count': 9, 'date': '2020-09-18 00:00:00'},
{'count': 1, 'date': '2020-09-20 00:00:00'},
{'count': 3, 'date': '2020-09-21 00:00:00'},
{'count': 7, 'date': '2020-09-22 00:00:00'}]}
Note that mediaHealth reports an end date of 2020-02-17 00:00:00-05:00 and
num_stories_90 = 0, while the final date in the split story count is
2020-09-22 00:00:00.
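A minimal sketch of how the mismatch could be checked programmatically, assuming the two responses have the shapes shown above. The helper name end_date_gap_days is hypothetical and not part of the mediacloud client; the dicts below are trimmed copies of the responses in this issue, so no API call is made.

```python
from datetime import datetime


def end_date_gap_days(health, story_counts):
    """Return the days between the latest storyCount date and mediaHealth's end_date.

    Hypothetical helper: compares calendar dates only and ignores the
    UTC offset that mediaHealth includes in its timestamps.
    """
    health_end = datetime.fromisoformat(health["end_date"]).date()
    last_story = max(
        datetime.fromisoformat(row["date"]).date()
        for row in story_counts["counts"]
    )
    return (last_story - health_end).days


# Trimmed copies of the responses quoted in this issue.
health = {"end_date": "2020-02-17 00:00:00-05:00", "num_stories_90": 0}
story_counts = {
    "counts": [
        {"count": 3, "date": "2020-09-21 00:00:00"},
        {"count": 7, "date": "2020-09-22 00:00:00"},
    ]
}

gap = end_date_gap_days(health, story_counts)
print(f"storyCount runs {gap} days past the mediaHealth end_date")
```

A positive gap means stories exist after the date media health claims coverage ended, which is the symptom described here.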
This was referenced Sep 25, 2020
The media health job has finished running and seems to have caught the data up. Can you please take a look and tell me if it looks better now?
Sorry, I hit the endpoint and the dates still don't match up. 😢