BibTexNanny is a tool to check the consistency of BibTex files, fix common mistakes and generate simplified versions of a bibliography.
BibTexNanny uses biblib to parse and generate BibTex files.
The following fixes and changes should be made to biblib:
- Add BibDesk-compatibility mode for BibTex output
- Fix issues with loading bad month information
- Can't replicate issue anymore, not sure what changed.
- Add ability to handle duplicate keys
- Prevent BibTex Parser from dropping metadata and comment lines
- BibTexNanny internal work-around
- When names are parsed, curly braces need to be handled correctly
- Find duplicates
- Duplicate keys
- added biblib work-around to load files with duplicate keys.
- Duplicate paper titles
- Grade badness of duplicate by how much of the rest matches
- Consider cases where duplicates might be acceptable
- Pairs of entries for presentation and paper (what is the entry type for the presentation).
- Allow users to define entry types that should be ignored when looking for duplicate titles. This way you can, for example, model presentations as `@misc` entries and have them be ignored.
- Pre-print and published version of paper.
- Author who actually named different papers differently (in what cases would this happen?)
- Different editions of a book.
- Possibly paper and extended version of it as journal article.
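The duplicate-title check with ignorable entry types could look roughly like the sketch below. It assumes entries are plain dicts with `key`, `type` and `title` items (a stand-in, not biblib's own entry objects); `normalise_title` and `find_duplicate_titles` are hypothetical names.

```python
import re
from collections import defaultdict


def normalise_title(title):
    """Lower-case a title and drop braces, punctuation and extra spaces."""
    title = re.sub(r"[{}]", "", title).lower()
    title = re.sub(r"[^\w\s]", " ", title)
    return " ".join(title.split())


def find_duplicate_titles(entries, ignored_types=("misc",)):
    """Group entry keys by normalised title, skipping ignored entry types.

    `entries` is assumed to be a list of dicts with "key", "type" and
    "title" items; this is an illustration, not the parser's real output.
    """
    groups = defaultdict(list)
    for entry in entries:
        if entry["type"].lower() in ignored_types:
            continue
        groups[normalise_title(entry["title"])].append(entry["key"])
    return {title: keys for title, keys in groups.items() if len(keys) > 1}


entries = [
    {"key": "mouse2020", "type": "inproceedings", "title": "{A} Study of Cheese"},
    {"key": "mouse2021", "type": "article", "title": "A study of cheese"},
    {"key": "mouse2020talk", "type": "misc", "title": "A Study of Cheese"},
]
duplicates = find_duplicate_titles(entries)
```

Grading the badness of a duplicate would then be a second step that compares the remaining fields of the grouped entries.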
- Warnings for missing fields
- Optional warning for optional fields
- Tex-Unicode conversion
- LaTeX to Unicode conversion
- Fix losing curly braces
- Unicode to BibTeX conversion
- Check if URLs require special handling
- Warnings for bad formatting
- Warning for non-standard entry type
- Warning for fields whose value has no curly braces, but is not a known macro
- Warnings for non-secured capitalisation in name field
- Warnings for unnecessary curly braces
- Curly braces are not only for uppercase characters but also for encoding special characters, e.g. `\'{e}` to get é
- Allow user preference for wrapping characters or whole words.
- What is the difference between single and double braces?
- Warnings for badly formatted page numbers
- Find badly formatted names (author and editor fields)
- All-caps names
- Bad use of latex commands
- Missing spaces between initials
- Other bad formattings
- Warning for all-caps texts
- Notice bad months
- Check if desired key format is followed (see entry key format)
- Warnings for inconsistent formatting
- Different names for conferences (see dictionary of conference names)
- Name formatting
- Names or parts of names written in all caps (`MICKEY MOUSE` or `Mickey MOUSE`)
- Identify when an all-caps name part is actually initials written without period or whitespace
- Name initials
- Initial written without period (`Mickey D Mouse`)
- Multiple initials written without whitespace (`Mickey A.B. Mouse`)
- Multiple initials written without periods or whitespace (`Mickey AD Mouse`)
- Warning when first names are only initials
- Warning when only some names of a paper are full and some have initials
- Location names
- Indicate when there is a country without a city
- Indicate when there is a city without a country
- States missing from US locations
- Inferrable information for conferences/journals is inconsistent
- Allow limiting search to citations found in aux file
- Infer fields from other entries
- Basic inference functionality
- Add more inferrable fields (see Field Inference)
- Add functionality for mapping information across types (e.g. from proceeding to inproceedings)
- Infer full names
- Infer full name form of initials when the full name is used elsewhere
- Infer proper non-ASCII spelling of a name when it is used elsewhere
- Fix inconsistent fields
- Replace conference name variations with main name (see dictionary of conference names)
- Expand name initials to full names
- Infer full name form of initials when the full name is used elsewhere
- Infer proper non-ASCII spelling of a name when it is used elsewhere
- Make locations more informative (City, [State], Country)
- Add missing country
- Add missing city
- Add state (USA only)
- Extend state initials to full state name
- Have consistent file order
- Fix formatting
- Replace non-ASCII characters in keys
- Add wraps around capitalised characters in name field
- Add option to wrap entire words instead of only the capitalised characters
- Remove unnecessary {}-wraps
- Fix badly formatted page numbers
- Fix all-caps text (but not single all caps words)
- Separate handling for names
- Fix bad but understandable months (e.g. numbers)
- Correct handling for escaped sequences
- Escaped by curly braces
- Escaped by math mode
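As an illustration of fixing "bad but understandable" months, the sketch below maps numbers and English month names (or prefixes of them) to three-letter abbreviations. The function name `normalise_month` and the choice to return `None` for unclear values are assumptions made for this example.

```python
MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]


def normalise_month(value):
    """Return a three-letter month abbreviation, or None if `value` is unclear."""
    # Drop surrounding whitespace, braces, quotes and a trailing period.
    text = value.strip().strip('{}"').strip().lower().rstrip(".")
    if text.isdecimal():
        number = int(text)
        return MONTH_NAMES[number - 1][:3] if 1 <= number <= 12 else None
    # Accept any prefix of an English month name that has at least three letters.
    for name in MONTH_NAMES:
        if len(text) >= 3 and name.startswith(text):
            return name[:3]
    return None
```

Values that come back as `None` would be reported as warnings or handed to a prompt rather than fixed automatically.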
- Name formatting
- Change format of name to non-ambiguous "Last, First" format
- Fix special character formatting
- Use consistent braces format (e.g. write `{\"o}` instead of `\"{o}`)
- Replace LaTeX commands (e.g. replace `\textasciicaron{}e` with `{e}`)
- Fix all-caps names (`MICKEY MOUSE` or `Mickey MOUSE`)
- Fix initials format
- Initials must be followed by a period
- Multiple initials must be separated by spaces
- Test if text starts with "and"
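The two rules for initials (a period after each initial, spaces between multiple initials) can be sketched as follows. The sketch assumes it is given only the first-name part of a name, treats any token of one to three ASCII capitals as initials, and ignores LaTeX-encoded or non-ASCII letters; `fix_initials` is a hypothetical name.

```python
import re

# One to three capital letters, each optionally followed by a period.
INITIALS = re.compile(r"(?:[A-Z]\.?){1,3}")


def fix_initials(first_names):
    """Rewrite initials so each has a period and they are space-separated."""
    fixed = []
    for token in first_names.split():
        if INITIALS.fullmatch(token):
            # Split "A.B." or "AD" into single letters and add the periods.
            fixed.extend(letter + "." for letter in re.findall(r"[A-Z]", token))
        else:
            fixed.append(token)
    return " ".join(fixed)
```

With this, `Mickey D`, `Mickey A.B.` and `Mickey AD` become `Mickey D.`, `Mickey A. B.` and `Mickey A. D.`. Short all-caps first names would be misread as initials, which is why the all-caps check should run separately.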
- Rename entry keys
- Provide a format to specify the desired key names
- Key format might differ for different entry types.
- Key format should consist of only ASCII characters
- Multi-bibliography merger
- Identify entries that are the same
- Option 1: Same key
- Option 2: Match on major fields (e.g. name plus authors?)
- Merge
- Identical fields are accepted
- Fields available in only one version are accepted
- Fields that clash cause user prompt or trigger other fixer functions
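The merge rules listed above could be sketched like this, assuming the two versions of an entry have already been matched and are available as plain field dicts. `merge_entries` is a hypothetical helper; clashing fields are only collected here, and prompting the user or calling other fixers is left out.

```python
def merge_entries(first, second):
    """Merge two field dicts and return (merged_fields, clashing_fields)."""
    merged, clashes = {}, {}
    for field in sorted(set(first) | set(second)):
        if field not in second:
            # Field available in only one version: accept it.
            merged[field] = first[field]
        elif field not in first or first[field] == second[field]:
            # Only in the other version, or identical in both: accept it.
            merged[field] = second[field]
        else:
            # Clash: keep both values so a prompt or fixer can decide.
            clashes[field] = (first[field], second[field])
    return merged, clashes


merged, clashes = merge_entries(
    {"title": "Cheese", "year": "2020", "pages": "1-5"},
    {"title": "Cheese", "year": "2021", "publisher": "ACME"},
)
```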
- Simplify conference names
- Use dictionary of conference names
- allow regex or sed replacement
- Simplify Names
- Turn full first names into initials
- Turn full middle names into initials
- Simplify Locations
- Drop entirely
- Drop city
- Drop state
- Shorten state to initials
- Copy location to address (even though technically it is incorrect)
- Allow full name, name variation, short name
- Names should allow for number placeholder
- How to link regularly named conferences with years where they were held in conjunction with something?
- Additional script to suggest possible name variations
- There might already be an open-source system for standardising BibTex keys, one that is also used by Zotero. Need to check that out.
- First author last name
- capitalised
- lower case
- Year
- Word from Title
- capitalised
- lower case
- all caps
- Disambiguating characters
- lowercase a,b,c
- lastnameYEAR
- LastnameYEAR
- LastnameYEARkeyword
- LastnameYEARdisambig
- lastname_keyword_year
- TITLEWORD
- LastnameYEAR or KEYWORD
- Number of hardcoded options
- Easy to implement, little flexibility
- RegEx
- Easy to implement, flexible, but limited functionality (can't check other fields)
- Actually, if you use named groups, you could use those names to trigger additional checks for them.
- Custom format
- Lots of work to implement, full functionality, probably quite flexible
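To illustrate the regex option with named groups, here is a sketch for the `LastnameYEARdisambig` format. The pattern, the function `check_key` and the assumption that the author field is already in "Last, First" form are all made up for this example.

```python
import re

# Hypothetical pattern for keys such as "Mouse2020" or "Mouse2020a".
KEY_PATTERN = re.compile(
    r"(?P<lastname>[A-Z][a-z]+)(?P<year>\d{4})(?P<disambig>[a-z]?)"
)


def check_key(key, fields):
    """Return a list of problems found in `key` for the given entry fields."""
    match = KEY_PATTERN.fullmatch(key)
    if match is None:
        return ["key does not follow the LastnameYEAR format"]
    problems = []
    # The named groups trigger extra checks against the matching fields.
    first_author = fields["author"].split(" and ")[0]
    lastname = first_author.split(",")[0].strip()
    if match["lastname"].lower() != lastname.lower():
        problems.append("last name in key does not match first author")
    if match["year"] != fields.get("year"):
        problems.append("year in key does not match year field")
    return problems
```

This keeps the regex as the only user-facing format while still allowing checks against other fields, which addresses the limitation noted for plain regular expressions.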
- article: journal + year + volume => month
- article: journal + year + month => volume
- book: booktitle + year + volume/number => inbook: author, editor, publisher, series, edition, month
- book: booktitle + year + volume/number => incollection: editor, publisher, series, edition, month
- conference: booktitle + year => address, month, editor, organization, publisher
- inbook: title + year => address, month, editor, publisher
- incollection: booktitle + year => address, month, editor, publisher
- inproceedings: booktitle + year => address, month, editor, organization, publisher
- proceedings: booktitle + year => inproceedings: address, month, editor, organization, publisher
- If proceedings title contains an index (e.g. "Proceedings of the 5th Conference on Examples") we can infer year and all other pieces of information from it.
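Rules of the form "identifying fields => inferable fields" could be applied as in the following sketch. It covers only same-type inference for two of the rules above, assumes entries are dicts with `key` and `type` items next to their fields, and skips a field when the matching entries disagree on its value. `RULES`, `rule_identity` and `infer_fields` are hypothetical names.

```python
RULES = {
    "article": (("journal", "year", "volume"), ("month",)),
    "inproceedings": (("booktitle", "year"),
                      ("address", "month", "editor", "organization", "publisher")),
}


def rule_identity(entry):
    """Return (identity, inferable_fields) for an entry, or None if no rule fits."""
    rule = RULES.get(entry["type"])
    if rule is None or not all(field in entry for field in rule[0]):
        return None
    identity = (entry["type"],) + tuple(entry[field] for field in rule[0])
    return identity, rule[1]


def infer_fields(entries):
    """Fill in missing fields in place; return a list of (key, field, value)."""
    known = {}
    for entry in entries:  # First pass: collect all values seen per identity.
        info = rule_identity(entry)
        if info:
            for field in info[1]:
                if field in entry:
                    known.setdefault((info[0], field), set()).add(entry[field])
    inferred = []
    for entry in entries:  # Second pass: copy values that are unambiguous.
        info = rule_identity(entry)
        if info:
            for field in info[1]:
                values = known.get((info[0], field), set())
                if field not in entry and len(values) == 1:
                    entry[field] = next(iter(values))
                    inferred.append((entry["key"], field, entry[field]))
    return inferred
```

The same `known` table could also drive the consistency check: any identity with more than one value for a field is a candidate for an "inconsistent inferable information" warning.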
- Use Python's configparser, which allows INI-like config files
- Dict
- Straightforward, but need to keep the key strings straight
- Custom object with lots of boolean fields
- More design effort, but probably more flexible
- Should have different class for each Nanny component
- As the tasks overlap considerably, there should be a NannyConfig superclass and inheriting classes for the components.
- Accessing config info should be done via functions, not fields, to allow custom processing of the stored information
- True (check value)
- False (don't check value)
- True/Autofix/Auto (autofix value)
- Tryfix/Try (autofix if trivial, otherwise prompt to fix)
- Promptfix/Prompt (Prompt to fix)
- False (don't check value)
How information for both scripts can be given in the same config file
- Single value for both (Try and Prompt are treated as True)
- Tuple: `False,Tryfix` (CONSISTENCY,FIXER)
- Variables for only one of the two configs, e.g. `duplicateKeys-consistency`
- Different sections for giving instructions for both or just either
Should have separate config files.
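If a shared file were used after all, the single-value and tuple notations could be read with Python's `configparser` roughly as below. The section name `checks`, the option names, the internal mode names and `read_setting` are assumptions for this sketch; the per-script suffix variant is not covered.

```python
import configparser

CONFIG_TEXT = """
[checks]
duplicateKeys = False,Tryfix
badMonths = True
allCapsNames = Prompt
"""

# Accepted spellings mapped to one internal fixer mode each.
FIX_MODES = {
    "false": "off",
    "true": "auto", "autofix": "auto", "auto": "auto",
    "tryfix": "try", "try": "try",
    "promptfix": "prompt", "prompt": "prompt",
}


def read_setting(section, option):
    """Return (check, fix_mode) for "VALUE" or "CONSISTENCY,FIXER" notation."""
    parts = [part.strip().lower() for part in section[option].split(",")]
    if len(parts) == 1:
        parts = parts * 2  # A single value applies to both scripts.
    consistency, fixer = parts
    # For the consistency checker, anything but False counts as True.
    check = FIX_MODES[consistency] != "off"
    return check, FIX_MODES[fixer]


parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
checks = parser["checks"]
```

Note that `configparser` lower-cases option names by default, so `duplicateKeys` and `duplicatekeys` refer to the same option unless `optionxform` is overridden.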
- Blacklist: List fields that should be removed
- Whitelist: List only the fields that are wanted
- Variables for conversion functions
- Argument calls
- set list of wanted fields (if None, all are wanted)
- Set list of unwanted fields (optional)
- Config files
- allows for templates
- More complex to set up
- Prompts during processing, asking for user decisions
- Could also be used to auto-generate config files
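Whichever way the lists are supplied, the blacklist and whitelist options reduce to a small filter like the one sketched here. It assumes an entry's fields are available as a dict; `filter_fields` and its parameter names are hypothetical.

```python
def filter_fields(fields, wanted=None, unwanted=()):
    """Return a copy of `fields` restricted by a whitelist and a blacklist.

    `wanted=None` means that all fields are wanted.
    """
    return {
        name: value
        for name, value in fields.items()
        if (wanted is None or name in wanted) and name not in unwanted
    }


fields = {"author": "Mouse, Mickey", "title": "Cheese", "abstract": "...", "url": "..."}
short = filter_fields(fields, wanted=["author", "title", "url"], unwanted=["url"])
```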
- .bst: BibTex format file (difficult to parse)
- .sty: LaTeX style file (can this contain the bst info?)
- .cls: LaTeX class file (can this contain the bst info?)
- .aux: Lists citations and labels
- Single line to parse: `\citation{citationlabel}`
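Collecting the cited keys from an aux file could be as simple as the sketch below. It works on the file's text, also splits comma-separated keys in case a `\citation` line carries several of them, and uses the hypothetical name `cited_keys`.

```python
import re

CITATION = re.compile(r"\\citation\{([^}]*)\}")


def cited_keys(aux_text):
    """Return the set of citation keys found in the text of an aux file."""
    keys = set()
    for line in aux_text.splitlines():
        match = CITATION.match(line)
        if match:
            # One line may list several keys separated by commas.
            keys.update(key.strip() for key in match.group(1).split(",") if key.strip())
    return keys


example = r"""\relax
\citation{mouse2020}
\citation{duck2019,goofy2018}
\bibdata{refs}
"""
found = cited_keys(example)
```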
- Dictionary of conference names
- Style config file
- Tool config files
- Consistency checker config file
- Fixer config file
- Simplifier config file
We need to be able to check the following aspects for fields:
- What type of entry are we looking at?
- What are the generally required and optional fields for this entry?
- This bit can be hardcoded as it is always true for all BibTex files
- Look up BibTex documentation to determine these values
- For a particular bibliography type, which are the required and optional fields, which fields are ignored?
- Easy solution: Manually create a config file that lists fields as mandatory, optional and ignored
- Requires config file design
- Better solution: Load style files to automatically extract this kind of information.
- Are there python tools that can load sty and cls files for us?
- Design a config file that allows users to set which info they want to drop and which they need enforced
- List by entry type
- Allow defining fields for more than one entry type at once
- Define fields as mandatory, optional, unused and maybe as hidden
- Three layer approach:
- In-built BibTex entry definitions
- Config file for bibliography style requirements
- Config file for simplification requirements
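The three layers could be combined so that a later layer overrides an earlier one, for example with `collections.ChainMap`. The sketch below hardcodes tiny, abbreviated example tables in place of the real built-in definitions and config files; the table contents for the style and simplification layers and the name `field_status` are purely illustrative.

```python
from collections import ChainMap

# Layer 1: in-built BibTex definitions (abbreviated for the example).
BUILTIN = {"article": {"author": "mandatory", "title": "mandatory",
                       "journal": "mandatory", "year": "mandatory",
                       "volume": "optional", "pages": "optional",
                       "month": "optional"}}
# Layer 2: what a (made-up) bibliography style actually uses.
STYLE = {"article": {"month": "unused"}}
# Layer 3: what a (made-up) simplification config wants dropped.
SIMPLIFY = {"article": {"pages": "unused"}}


def field_status(entry_type, field):
    """Look up a field's status; more specific layers win over general ones."""
    layers = ChainMap(
        SIMPLIFY.get(entry_type, {}),
        STYLE.get(entry_type, {}),
        BUILTIN.get(entry_type, {}),
    )
    return layers.get(field, "unused")
```

With these tables, `field_status("article", "author")` is `"mandatory"`, while `month` and `pages` both resolve to `"unused"` because a later layer overrides the built-in `"optional"`.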