Skip to content

incorporating other changes prior to new column importation #1450

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 28, 2024

Conversation

melange396
Copy link
Collaborator

based on #1449 with techniques similar to those used in #1442 ...

This adds some previously missing signal "Description" fields, removes {}-style variable substitution and replaces it with appropriate text in other "Description" fields, and adds previously missing "Available Geography" fields.

After this is merged, i will run the google-doc syncronization GH action once more, and then verify that it it is only adding new fields.

@melange396 melange396 added documentation chore data quality user experience things that affect users of our software or services labels May 28, 2024
Copy link

Quality Gate Passed Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

@melange396
Copy link
Collaborator Author

this code, similar to that in #1442, verifies and extracts the changes to existing fields:

import pandas as pd

base_url = 'https://github.com/cmu-delphi/delphi-epidata/raw/{}/src/server/endpoints/covidcast_utils/db_signals.csv'

current = pd.read_csv(base_url.format('dev'), na_filter=False)
proposed = pd.read_csv(base_url.format('bot/update-docs'), na_filter=False)

# this code assumes columns have not been removed or renamed,
# and that no new rows have been added or had their ordering changed

new_cols = set(proposed.columns) - set(current.columns)
print("new cols:", sorted(new_cols))

non_matching = (proposed[current.columns] != current)
diffs_per_col = non_matching.apply(sum)
print(diffs_per_col)

mismatched_time = pd.concat([current[['Source Subdivision', 'Signal']], non_matching[['Description']]], axis=1)
print("Descriptions changed:")
print(mismatched_time[mismatched_time['Description']])

# produce alpha-sorted "Available Geography" from each csv for accurate comparison purposes:
current_ag_norm = current['Available Geography'].apply(lambda g: ','.join(sorted(g.split(','))))
proposed_ag_norm = proposed['Available Geography'].apply(lambda g: ','.join(sorted(g.split(','))))
diff_geos = (current_ag_norm != proposed_ag_norm)
print("Number of rows with different geographies:", sum(diff_geos))
mismatched_geos = pd.concat([current[['Source Subdivision', 'Signal']], diff_geos], axis=1)
print(mismatched_geos[mismatched_geos['Available Geography']])
# This shows that these 44 rows with different geographies are accounted for by these rows that have empty geos:
print(
      (current.iloc[380:394]['Available Geography']=='').all(),
      (current.iloc[419:449]['Available Geography']=='').all(),
      (394-380)+(449-419)==44
      ) 




intermediate = current.copy()
intermediate.iloc[380:394]['Available Geography'] = proposed.iloc[380:394]['Available Geography']
intermediate.iloc[419:449]['Available Geography'] = proposed.iloc[419:449]['Available Geography']
intermediate['Description'] = proposed['Description']
intermediate.to_csv('intermediate.csv', index=False)
import os
for _ in range(2):
    os.system("sed -i 's/,False,/,FALSE,/g' intermediate.csv")
    os.system("sed -i 's/,True,/,TRUE,/g' intermediate.csv")

@melange396 melange396 requested a review from tinatownes May 28, 2024 19:58
Copy link

@carlynvandyke carlynvandyke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@melange396 melange396 merged commit 1559870 into dev May 28, 2024
7 checks passed
@melange396 melange396 deleted the csv_update_prep branch May 28, 2024 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore data quality documentation user experience things that affect users of our software or services
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants