Features

Marc Schulder edited this page Aug 18, 2019 · 4 revisions

BibTeX file consistency checker

  • Find duplicates
    • Duplicate keys
      • Added biblib workaround to load files with duplicate keys.
    • Duplicate paper titles
      • Grade badness of duplicate by how much of the rest matches
      • Consider cases where duplicates might be acceptable
        • Pairs of entries for presentation and paper (what is the entry type for the presentation?).
          • Allow users to define entry types that should be ignored when looking for duplicate titles. This way you can, for example, model presentations as @misc entries and have them be ignored.
        • Pre-print and published version of paper.
        • Authors who actually gave different papers the same title.
        • Different editions of a book.
        • Possibly paper and extended version of it as journal article.
  • Warnings for missing fields
    • Optional warning for optional fields
  • TeX-Unicode conversion
    • LaTeX to Unicode conversion
      • Fix loss of curly braces
    • Unicode to BibTeX conversion
      • Check if URLs require special handling
  • Warnings for bad formatting
    • Warning for non-standard entry type
    • Warning for fields whose value has no curly braces, but is not a known macro
    • Warnings for non-secured capitalisation in name field
    • Warnings for unnecessary curly braces
      • Curly braces are not only for uppercase characters but also for encoding special characters, e.g. \'{e} to get é
      • Allow user preference for wrapping characters or whole words.
      • What is the difference between single and double braces?
    • Warnings for badly formatted page numbers
    • Find badly formatted names (author and editor fields)
      • All-caps names
      • Bad use of LaTeX commands
      • Missing spaces between initials
      • Other bad formatting
    • Warning for all-caps texts
    • Notice bad months
    • Check if desired key format is followed (see entry key format)
  • Warnings for inconsistent formatting
    • Different names for conferences (see dictionary of conference names)
    • Name formatting
      • Names or parts of names written in all caps (MICKEY MOUSE or Mickey MOUSE)
        • Identify when an all-caps name part is actually initials written without period or whitespace
      • Name initials
        • Initial written without period (Mickey D Mouse)
        • Multiple initials written without whitespace (Mickey A.B. Mouse)
        • Multiple initials written without periods or whitespace (Mickey AD Mouse)
        • Warning when first names are only initials
        • Warning when only some names of a paper are full and some have initials
    • Location names
      • Indicate when there is a country without a city
      • Indicate when there is a city without a country
      • States missing from US locations
    • Inferrable information for conferences/journals is inconsistent
  • Allow limiting search to citations found in aux file
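The duplicate-title check with ignorable entry types could look roughly like the sketch below. It assumes entries are already loaded into plain dicts with `type` and `title` fields; the function names and the sample entries are hypothetical and not part of the actual tool.

```python
from collections import defaultdict


def normalise_title(title):
    """Lowercase a title and drop braces and surplus whitespace."""
    stripped = title.replace("{", "").replace("}", "")
    return " ".join(stripped.lower().split())


def find_duplicate_titles(entries, ignored_types=()):
    """Return lists of keys whose entries share a normalised title.

    `entries` maps entry key -> dict with at least "type" and "title".
    Entries whose type is listed in `ignored_types` are skipped.
    """
    ignored = {entry_type.lower() for entry_type in ignored_types}
    by_title = defaultdict(list)
    for key, entry in entries.items():
        if entry.get("type", "").lower() in ignored:
            continue
        title = entry.get("title")
        if title:
            by_title[normalise_title(title)].append(key)
    return [keys for keys in by_title.values() if len(keys) > 1]


# Hypothetical sample data: a paper, its presentation (@misc) and another paper.
entries = {
    "mouse2019paper": {"type": "inproceedings", "title": "On {C}heese"},
    "mouse2019talk": {"type": "misc", "title": "On Cheese"},
    "duck2018": {"type": "article", "title": "On cheese"},
}
duplicates = find_duplicate_titles(entries, ignored_types=["misc"])
print(duplicates)
```

With `misc` ignored, the presentation entry no longer counts as a duplicate of the paper, while the two remaining entries are still reported together.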

BibTeX Fixer

  • Infer fields from other entries
    • Basic inference functionality
    • Add more inferable fields (see Field Inference)
    • Add functionality for mapping information across types (e.g. from proceedings to inproceedings)
  • Infer full names
    • Infer full name form of initials when the full name is used elsewhere
    • Infer proper non-ASCII spelling of a name when it is used elsewhere
  • Fix inconsistent fields
    • Replace conference name variations with main name (see dictionary of conference names)
    • Expand name initials to full names
      • Infer full name form of initials when the full name is used elsewhere
      • Infer proper non-ASCII spelling of a name when it is used elsewhere
    • Make locations more informative (City, [State], Country)
      • Add missing country
      • Add missing city
      • Add state (USA only)
      • Extend state initials to full state name
    • Have consistent file order
  • Fix formatting
    • Replace non-ASCII characters in keys
    • Add wraps around capitalised characters in name field
      • Add option to wrap entire words instead of only the capitalised characters
    • Remove unnecessary {}-wraps
    • Fix badly formatted page numbers
    • Fix all-caps text (but not single all caps words)
      • Separate handling for names
    • Fix bad but understandable months (e.g. numbers)
    • Correct handling for escaped sequences
      • Escaped by curly braces
      • Escaped by math mode
    • Name formatting
      • Change format of name to non-ambiguous "Last, First" format
      • Fix special character formatting
        • Use consistent braces format (e.g. write {\"o} instead of \"{o})
        • Replace LaTeX commands (e.g. replace \textasciicaron{}e with {e})
      • Fix all-caps names (MICKEY MOUSE or Mickey MOUSE)
      • Fix initials format
        • Initials must be followed by a period
        • Multiple initials must be separated by spaces
      • Test if text starts with "and"
  • Rename entry keys
    • Provide a format to specify the desired key names
    • Key format might differ for different entry types.
    • Key format should consist of only ASCII characters
  • Multi-bibliography merger
    • Identify entries that are the same
      • Option 1: Same key
      • Option 2: Match on major fields (e.g. name plus authors?)
    • Merge
      • Identical fields are accepted
      • Fields available in only one version are accepted
      • Fields that clash cause user prompt or trigger other fixer functions
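The merge rules for the multi-bibliography merger can be sketched as follows. This is only an illustration under the assumption that two versions of an entry are available as plain field dicts; `merge_entries` is a hypothetical name, and clashes are simply returned so that a caller could prompt the user or trigger other fixer functions.

```python
def merge_entries(first, second):
    """Merge two versions of the same entry.

    Returns (merged, clashes):
    - identical fields and fields present in only one version go into `merged`
    - fields with differing values go into `clashes` as (first, second) pairs
    """
    merged = {}
    clashes = {}
    for field in sorted(set(first) | set(second)):
        if field in first and field in second:
            if first[field] == second[field]:
                merged[field] = first[field]
            else:
                clashes[field] = (first[field], second[field])
        else:
            merged[field] = first[field] if field in first else second[field]
    return merged, clashes


# Hypothetical sample data
version_a = {"title": "On Cheese", "year": "2019", "pages": "1-10"}
version_b = {"title": "On Cheese", "year": "2018", "address": "Exampleville"}
merged, clashes = merge_entries(version_a, version_b)
print(merged)
print(clashes)
```

Here `title`, `pages` and `address` are accepted directly, while `year` is reported as a clash.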

BibTeX Simplifier

  • Simplify conference names
    • Use dictionary of conference names
    • Allow regex or sed replacement
  • Simplify Names
    • Turn full first names into initials
    • Turn full middle names into initials
  • Simplify Locations
    • Drop entirely
    • Drop city
    • Drop state
    • Shorten state to initials
    • Copy location to address (even though technically it is incorrect)
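As an illustration of the name simplification, the sketch below turns first and middle names into initials. It assumes names are already in the unambiguous "Last, First" format and does not handle "Last, Jr, First" or hyphenated given names; `abbreviate_name` is a hypothetical helper.

```python
def abbreviate_name(name):
    """Turn 'Last, First Middle' into 'Last, F. M.'.

    Names without a comma are returned unchanged, since they are not
    in the "Last, First" format this sketch relies on.
    """
    if "," not in name:
        return name
    last, given = (part.strip() for part in name.split(",", 1))
    initials = " ".join(part[0].upper() + "." for part in given.split())
    return f"{last}, {initials}" if initials else last


def abbreviate_author_field(authors):
    """Apply abbreviate_name to every name in a BibTeX ' and '-separated list."""
    return " and ".join(abbreviate_name(name) for name in authors.split(" and "))


print(abbreviate_author_field("Mouse, Mickey David and Duck, Donald"))
```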

Auxiliary

Dictionary of conference names

  • Allow full name, name variation, short name
  • Names should allow for number placeholder
  • How to link regularly named conferences with years where they were held in conjunction with something?
  • Additional script to suggest possible name variations
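One possible shape for such a dictionary is sketched below. The `{n}` placeholder syntax, the dictionary layout and the "ConfEx" conference are assumptions made purely for illustration; each pattern is assumed to contain at most one placeholder.

```python
import re

# Matches ordinals such as "5th" or "21st" in place of the "{n}" placeholder.
ORDINAL = r"(?P<n>\d+(?:st|nd|rd|th))"

# Hypothetical dictionary: short name -> full name and known variations.
CONFERENCES = {
    "ConfEx": {
        "full": "Proceedings of the {n} Conference on Examples",
        "variations": ["Proc. of the {n} Conf. on Examples", "{n} ConfEx"],
    },
}


def pattern_to_regex(pattern):
    """Compile a name pattern, treating '{n}' as an ordinal placeholder."""
    escaped = re.escape(pattern).replace(re.escape("{n}"), ORDINAL)
    return re.compile(escaped, re.IGNORECASE)


def canonical_name(booktitle, dictionary):
    """Return (short name, full name) for a known variation, else None."""
    for short, names in dictionary.items():
        for pattern in [names["full"], *names["variations"]]:
            match = pattern_to_regex(pattern).fullmatch(booktitle.strip())
            if match:
                number = match.groupdict().get("n") or ""
                return short, names["full"].replace("{n}", number)
    return None


print(canonical_name("Proc. of the 5th Conf. on Examples", CONFERENCES))
```

The same lookup could serve the consistency checker (to flag variations), the fixer (to replace them with the full name) and the simplifier (to use the short name).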

Key formatting

  • There might already be an open-source system for standardising BibTeX keys, which is also used by Zotero. This still needs to be checked.

Relevant factors for key formatting

  • First author last name
    • capitalised
    • lowercase
  • Year
  • Word from Title
    • capitalised
    • lowercase
    • all caps
  • Disambiguating characters
    • lowercase a,b,c

Common formats

  1. lastnameYEAR
  2. LastnameYEAR
  3. LastnameYEARkeyword
  4. LastnameYEARdisambig
  5. lastname_keyword_year
  6. TITLEWORD
  7. LastnameYEAR or KEYWORD

How to choose format

  1. Number of hardcoded options
    • Easy to implement, little flexibility
  2. RegEx
    • Easy to implement, flexible, but limited functionality (can't check other fields)
    • Actually, if you use named groups, you could use those names to trigger additional checks for them.
  3. Custom format
    • Lots of work to implement, full functionality, probably quite flexible
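The named-groups idea from the RegEx option could work roughly as sketched below: the group names in the key pattern select which extra checks run against the entry's other fields. The pattern (format `LastnameYEARdisambig`), the `EXPECTED` table and the helper names are hypothetical, and authors are assumed to be in "Last, First" format.

```python
import re

# Key format "LastnameYEAR" with an optional disambiguating letter.
KEY_FORMAT = re.compile(r"(?P<lastname>[A-Z][a-z]+)(?P<year>\d{4})(?P<disambig>[a-z]?)")


def first_author_lastname(entry):
    """Last name of the first author, assuming 'Last, First' names."""
    first_author = entry.get("author", "").split(" and ")[0]
    return first_author.split(",")[0].strip()


# Group name -> function returning the value the entry implies for that group.
EXPECTED = {
    "lastname": first_author_lastname,
    "year": lambda entry: entry.get("year", ""),
}


def check_key(key, entry, key_format=KEY_FORMAT):
    """Return a list of problems with `key`; an empty list means it is fine."""
    match = key_format.fullmatch(key)
    if match is None:
        return [f"{key!r} does not follow the key format"]
    problems = []
    for group, value in match.groupdict().items():
        expected_fn = EXPECTED.get(group)
        if expected_fn is None:
            continue  # no additional check registered for this group
        expected = expected_fn(entry)
        if (value or "").lower() != expected.lower():
            problems.append(f"{group}: key has {value!r}, entry has {expected!r}")
    return problems


entry = {"author": "Mouse, Mickey and Duck, Donald", "year": "2019"}
print(check_key("Mouse2019a", entry))
print(check_key("Duck2019", entry))
```

A different key format only needs a different pattern, as long as it reuses the group names for which checks are registered.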

Field Inference

  • article: journal + year + volume => month
  • article: journal + year + month => volume
  • book: booktitle + year + volume/number => inbook: author, editor, publisher, series, edition, month
  • book: booktitle + year + volume/number => incollection: editor, publisher, series, edition, month
  • conference: booktitle + year => address, month, editor, organization, publisher
  • inbook: title + year => address, month, editor, publisher
  • incollection: booktitle + year => address, month, editor, publisher
  • inproceedings: booktitle + year => address, month, editor, organization, publisher
  • proceedings: booktitle + year => inproceedings: address, month, editor, organization, publisher
  • If proceedings title contains an index (e.g. "Proceedings of the 5th Conference on Examples") we can infer year and all other pieces of information from it.
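A few of the rules above can be expressed as data and applied generically, as in the following sketch. It assumes entries are plain dicts, covers only same-type inference (not the cross-type rules), and only infers a value when all other matching entries agree on it; `RULES` and `infer_fields` are hypothetical names.

```python
# (entry type, identifying fields, fields that may be inferred)
RULES = [
    ("article", ("journal", "year", "volume"), ("month",)),
    ("article", ("journal", "year", "month"), ("volume",)),
    ("inproceedings", ("booktitle", "year"),
     ("address", "month", "editor", "organization", "publisher")),
]


def infer_fields(entries):
    """Return {entry key: {field: inferred value}} without modifying entries."""
    inferred = {}
    for entry_type, given, inferable in RULES:
        # Entries of the right type that have all identifying fields.
        candidates = {
            key: entry for key, entry in entries.items()
            if entry.get("type") == entry_type and all(f in entry for f in given)
        }
        # Collect the known values per (identifying values, field).
        known = {}
        for entry in candidates.values():
            signature = tuple(entry[f] for f in given)
            for field in inferable:
                if field in entry:
                    known.setdefault((signature, field), set()).add(entry[field])
        # Fill gaps only where exactly one value is known.
        for key, entry in candidates.items():
            signature = tuple(entry[f] for f in given)
            for field in inferable:
                values = known.get((signature, field), set())
                if field not in entry and len(values) == 1:
                    inferred.setdefault(key, {})[field] = next(iter(values))
    return inferred


# Hypothetical sample data
entries = {
    "a": {"type": "inproceedings", "year": "2019", "address": "Exampleville",
          "booktitle": "Proceedings of the 5th Conference on Examples"},
    "b": {"type": "inproceedings", "year": "2019",
          "booktitle": "Proceedings of the 5th Conference on Examples"},
}
print(infer_fields(entries))
```

Returning the inferred values separately, rather than writing them into the entries, leaves it to the caller to decide whether to apply them automatically or prompt first.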

Input Parameters

Input methods

  • Use Python's configparser, which allows INI-like config files
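A minimal sketch of reading such an INI-like file with `configparser` is shown below. The section names and most option names are made up for illustration; only `duplicateKeys` is borrowed from the notes on this page.

```python
import configparser

# Hypothetical config file content
CONFIG_TEXT = """
[consistency]
duplicateKeys = True
duplicateTitles = True
ignoredTypes = misc, unpublished

[fixer]
duplicateKeys = Tryfix
"""

config = configparser.ConfigParser()
config.optionxform = str  # keep option names case-sensitive (default lowercases them)
config.read_string(CONFIG_TEXT)

# getboolean understands true/false, yes/no, on/off and 1/0
check_titles = config.getboolean("consistency", "duplicateTitles")
# Comma-separated lists have to be split by hand
ignored = [t.strip() for t in config.get("consistency", "ignoredTypes").split(",")]
fixer_mode = config["fixer"]["duplicateKeys"]

print(check_titles, ignored, fixer_mode)
```

For a real file, `config.read(path)` would replace `read_string`.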

Internal processing

  1. Dict
    • Straightforward, but need to keep the key strings straight
  2. Custom object with lots of boolean fields
    • More design effort, but probably more flexible
    • Should have different class for each Nanny component
      • As the tasks overlap considerably, there should be a NannyConfig superclass and inheriting classes for the components.
      • Accessing config info should be done via functions, not fields, to allow custom processing of the stored information

Required states for custom variables

Consistency checker

  • True (check value)
  • False (don't check value)

Fixer

  • True/Autofix/Auto (autofix value)
  • Tryfix/Try (autofix if trivial, otherwise prompt to fix)
  • Promptfix/Prompt (Prompt to fix)
  • False (don't check value)

Consistency + Fixer

How information for both scripts can be given in the same config file

  • Single value for both (Try and Prompt are treated as True)
  • Tuple: False,Tryfix (CONSISTENCY,FIXER)
  • Variables for only one of the two configs, e.g. duplicateKeys-consistency
  • Different sections for giving instructions for both or just either
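The single-value and tuple variants could be parsed along the lines of the sketch below. The `FixMode` enum, the alias table and `parse_setting` are hypothetical; the accepted spellings follow the state lists above.

```python
from enum import Enum


class FixMode(Enum):
    OFF = "off"
    AUTO = "auto"
    TRY = "try"
    PROMPT = "prompt"


# Accepted spellings (compared case-insensitively)
ALIASES = {
    "false": FixMode.OFF,
    "true": FixMode.AUTO, "autofix": FixMode.AUTO, "auto": FixMode.AUTO,
    "tryfix": FixMode.TRY, "try": FixMode.TRY,
    "promptfix": FixMode.PROMPT, "prompt": FixMode.PROMPT,
}


def parse_setting(raw):
    """Return (check, fix_mode) for a raw config value.

    A single value applies to both scripts; for the consistency checker,
    anything other than False counts as True. A "CONSISTENCY,FIXER" tuple
    sets the two independently. Unknown spellings raise KeyError.
    """
    parts = [part.strip().lower() for part in raw.split(",")]
    if len(parts) == 1:
        mode = ALIASES[parts[0]]
        return mode is not FixMode.OFF, mode
    if len(parts) == 2:
        consistency, fixer = parts
        return ALIASES[consistency] is not FixMode.OFF, ALIASES[fixer]
    raise ValueError(f"expected one value or CONSISTENCY,FIXER, got {raw!r}")


print(parse_setting("Prompt"))
print(parse_setting("False,Tryfix"))
```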

Simplifier

Should have separate config files.

  • Blacklist: List fields that should be removed
  • Whitelist: List only the fields that are wanted
  • Variables for conversion functions
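Blacklist and whitelist filtering might be sketched as follows, assuming an entry's fields are held in a plain dict. `simplify_fields` is a hypothetical name, and treating the two lists as mutually exclusive is a design assumption rather than something decided on this page.

```python
def simplify_fields(entry, blacklist=None, whitelist=None):
    """Return a copy of `entry` with unwanted fields removed.

    - whitelist: keep only the listed fields
    - blacklist: drop the listed fields, keep everything else
    """
    if blacklist is not None and whitelist is not None:
        raise ValueError("use either a blacklist or a whitelist, not both")
    if whitelist is not None:
        keep = set(whitelist)
        return {field: value for field, value in entry.items() if field in keep}
    drop = set(blacklist or ())
    return {field: value for field, value in entry.items() if field not in drop}


# Hypothetical sample data
entry = {"author": "Mouse, Mickey", "title": "On Cheese", "year": "2019",
         "abstract": "A long abstract.", "url": "https://example.org"}
print(simplify_fields(entry, blacklist=["abstract", "url"]))
print(simplify_fields(entry, whitelist=["author", "title", "year"]))
```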