syllable count seems not always correct #94

icezee · 2020-11-05T04:37:49Z

>>> import syllapy
>>> syllapy.count('feature')
2
>>> syllapy.count('features')
3

Hathaway2010 · 2021-01-26T20:40:22Z

I found a pretty good way of dealing with "es" and "ed" endings (and a couple other issues) using regular expressions! I'm extremely new to open-source, though — are you open to pull requests now?

(I'm thinking of using syllapy or something of the sort in a poetry analysis app!)

eyaler · 2022-03-10T20:08:34Z

@Hathaway2010 can you share your solution?

hszhai · 2022-03-11T02:58:46Z

I later switched to 'pronouncing'
https://github.com/aparrish/gen-text-workshop/blob/master/cmu_pronouncing_dictionary_notes.md

I'll update the repo and add a license so it becomes more useful. Thanks for the feedback

Hathaway2010 · 2022-04-24T18:59:52Z

@eyaler belatedly:
Here's a short and nasty one:

import re


def syllables(word):
    """Guess syllable count of word not in database.
    Parameters
    ----------
    word : str
        word not found in database
    
    Returns
    -------
    count : int
        estimated number of syllables
    
    See also
    --------
    tests/test_scan.py to clarify regular expressions
    """
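    # Each run of consecutive vowels is counted as one syllable to start with.
    vowels_or_clusters = re.compile("[AEÉIOUaeéiouy]+")
    # Vowel clusters that are usually pronounced as two syllables.
    vowel_split = re.compile("[aiouy]é|ao|eo[^u]|ia[^n]|[^ct]ian|iet|io[^nu]|[^c]iu|[^gq]ua|[^gq]ue[lt]|[^q]uo|[aeiouy]ing|[aeiou]y[aiou]") # exceptions: Preus, Aida, poet, luau
    final_e = re.compile("e$")
    # "-ed" / "-es" endings where the "e" is usually silent.
    silent_final_ed_es = re.compile("[^aeiouydlrt]ed$|[^aeiouycghjlrsxz]es$|thes$|[aeiouylrw]led$|[aeiouylrw]les$|[aeiouyrw]res$|[aeiouyrw]red$")
    # Consonant + "ely", as in "lonely", where the "e" is silent.
    lonely = re.compile("[^aeiouy]ely$")
    # Cases where a final "e" should not be subtracted. Note that the last two
    # alternatives are not anchored with "$", so they match anywhere in the word.
    audible_final_e = re.compile('[^aeiouylrw]le$|[^aeiouywr]re$|[aeioy]e|[^g]ue')
    word_lower = word.lower()
    voc = re.findall(vowels_or_clusters, word_lower)
    count = len(voc)
    # Drop a silent final "e".
    if final_e.search(word_lower) and not audible_final_e.search(word_lower):
        count -= 1
    # Drop a silent "e" in "-ed", "-es", or "-ely" endings.
    if silent_final_ed_es.search(word_lower) or lonely.search(word_lower):
        count -= 1
    # Add one syllable for each cluster that likely splits in two.
    likely_splits = re.findall(vowel_split, word_lower)
    if likely_splits:
        count += len(likely_splits)
    # Every word has at least one syllable.
    if count == 0:
        count += 1
    return count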

I wound up using this to guess any words not in Webster's Unabridged Dictionary from 1913, downloaded from Project Gutenberg and parsed into a database. Neither the dictionary nor this function is remotely infallible (the dictionary thinks the word "every" has three syllables, and the function doesn't know how to distinguish between "seneschal" -- three syllables -- and "sometimes" -- two), but I do think it's a refinement. I got the basic approach from syllapy and would be delighted to contribute this back to the repo :) If you want to see an expanded version that makes stronger efforts to be human readable, you can check out https://github.com/Hathaway2010/poetry-meter/blob/95d5fdbe7ffb8cde2191b4fd417010240060ea05/recurse_final.py#L89
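The "dictionary first, guess only for unknown words" arrangement described above could be sketched roughly as follows. This is only an illustration: `KNOWN_COUNTS`, `naive_guess`, and `count_syllables` are hypothetical names, the hand-entered counts stand in for a real dictionary database, and the fallback is a deliberately crude vowel-run counter rather than the function posted above.

```python
import re

# Hypothetical stand-in for a dictionary database; counts are hand-entered.
KNOWN_COUNTS = {"every": 2, "seneschal": 3, "sometimes": 2}

_VOWEL_RUNS = re.compile(r"[aeiouy]+")


def naive_guess(word):
    """Crude fallback: one syllable per run of vowels, never fewer than one."""
    return max(1, len(_VOWEL_RUNS.findall(word.lower())))


def count_syllables(word, known=KNOWN_COUNTS, guess=naive_guess):
    """Look the word up first; fall back to the heuristic only if it is missing."""
    key = word.lower()
    if key in known:
        return known[key]
    return guess(key)


if __name__ == "__main__":
    for w in ["Every", "sometimes", "banana", "tsk"]:
        print(w, count_syllables(w))
```

Passing a better heuristic as `guess` swaps out the fallback without touching the lookup.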

Hathaway2010 · 2022-04-24T19:01:32Z

> I later switched to 'pronouncing' https://github.com/aparrish/gen-text-workshop/blob/master/cmu_pronouncing_dictionary_notes.md
>
> I'll update the repo and add a license so it becomes more useful. Thanks for the feedback

"Pronouncing" looks splendid :) I should be using this too probably.

eyaler · 2022-04-24T20:26:43Z

i am using this table for some manual fixes:
https://raw.githubusercontent.com/harrisj/nyt-haiku-python/master/nyt_haiku/data/syllable_counts.csv

in @mholtzscher's writeup for syllapy: https://mholtzscher.github.io/2018/05/29/syllables/
he mentions: "The closest thing I found was the CMU Pronouncing Dictionary. However, this database shows the phonemes for the words rather than syllables. In some cases the phonemes align with syllables but this is not always the case."

maybe @mholtzscher can advise regarding the issues you saw with CMU?

peterchinman · 2024-08-05T03:51:40Z

I know this is two years later, but I am curious about @mholtzscher's phoneme/syllable misalignments. I couldn't think of an example where counting the ARPAbet vowels from cmudict didn't give an accurate syllable count. (Though there are some instances where different pronunciations of the same word have competing syllable counts.)
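For anyone following along, here is a minimal sketch of what "counting the ARPAbet vowels" means. It relies on the cmudict convention that vowel phonemes carry a trailing stress digit (0, 1, or 2) and consonants do not. The pronunciation strings below were transcribed by hand for illustration and should be checked against the actual dictionary; the function names are made up for this example.

```python
def phoneme_count(pronunciation):
    """Number of phonemes in a space-separated ARPAbet string."""
    return len(pronunciation.split())


def syllable_count(pronunciation):
    """Number of vowel phonemes, i.e. those ending in a stress digit."""
    return sum(1 for phone in pronunciation.split() if phone[-1].isdigit())


# Hand-transcribed examples (an assumption, not read from cmudict here).
EXAMPLES = {
    "feature": ["F IY1 CH ER0"],
    "features": ["F IY1 CH ER0 Z"],
    "every": ["EH1 V ER0 IY0", "EH1 V R IY0"],  # competing pronunciations
}

if __name__ == "__main__":
    for word, pronunciations in EXAMPLES.items():
        for p in pronunciations:
            print(word, "phonemes:", phoneme_count(p), "syllables:", syllable_count(p))
```

The phoneme count is always at least as large as the vowel count, so using the former as a syllable count would overestimate it.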

mholtzscher · 2024-08-05T04:03:50Z

Hi @peterchinman, I can't recall the exact issues I ran into with CMU, but if I remember correctly, it was that CMU usually had more phonemes than syllables for some words. For the readability work I was doing, this would inflate the syllable count and so greatly affect the readability scores.
