-
Notifications
You must be signed in to change notification settings - Fork 0
Add README instructions for publishing repo to GitHub #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
❌ Deploy Preview for coruscating-pastelito-60da0c failed. Why did it fail? →
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| def _combine_source_ids(values: Iterable[str]) -> str: | ||
| unique = sorted({v for v in values if v}) | ||
| return ";".join(unique) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop null source_ids before sorting
_combine_source_ids (merge.py lines 169‑171) builds a set with {v for v in values if v} and then sorts it. Catalogue rows come in with source_id values that are pd.NA/NaN whenever the export lacks an ID (the column is injected by _ensure_columns and never coerced afterward), so the truth test if v either raises ValueError: boolean value of NA is ambiguous (for pd.NA) or lets float('nan') into the set. As soon as a dedupe group contains both a real ID (string) and one of these nulls, sorted(...) raises TypeError: '<' not supported between instances of 'float' and 'str', and build_master_bibliography() aborts before producing any master table. Filter nulls with pd.notna/math.isnan and coerce the remaining IDs to strings before sorting so missing IDs don’t blow up the deduplication step.
Useful? React with 👍 / 👎.
| def standardize_language_label(label: Optional[str]) -> Optional[str]: | ||
| """Map language codes and descriptors to canonical names.""" | ||
|
|
||
| if not label: | ||
| return None | ||
| cleaned = unidecode(label).lower().strip() | ||
| if cleaned in LANGUAGE_MAP: | ||
| return LANGUAGE_MAP[cleaned] | ||
| if cleaned.startswith("lat"): | ||
| return "Latin" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
standardize_language_label fails on pandas NAs
standardize_language_label (normalize.py 106‑115) assumes every label is a truthy string (if not label followed by unidecode(label)). _load_catalogue() (merge.py line 140) applies this directly to the raw language column, which still contains pandas NA/NaN values for rows whose language is blank or for columns created by _ensure_columns. Calling bool(pd.NA) raises ValueError: boolean value of NA is ambiguous, and passing a float NaN into unidecode raises TypeError, so the pipeline crashes whenever a catalogue row lacks language metadata—well before the Latin-only filter runs. Guard against pandas nulls (e.g., use pd.isna or label is None before the if not label and unidecode calls, or sanitize the column before applying the function) so empty language cells don’t terminate the build.
Useful? React with 👍 / 👎.
Summary
Testing
Codex Task