syllable count seems not always correct #94

Open
icezee opened this issue Nov 5, 2020 · 8 comments

@icezee

icezee commented Nov 5, 2020

>>> import syllapy
>>> syllapy.count('feature')
2
>>> syllapy.count('features')
3

@Hathaway2010

Hathaway2010 commented Jan 26, 2021

I found a pretty good way of dealing with "es" and "ed" endings (and a couple other issues) using regular expressions! I'm extremely new to open-source, though — are you open to pull requests now?

(I'm thinking of using syllapy or something of the sort in a poetry analysis app!)

@eyaler

eyaler commented Mar 10, 2022

@Hathaway2010 can you share your solution?

@hszhai

hszhai commented Mar 11, 2022

I later switched to 'pronouncing'
https://github.com/aparrish/gen-text-workshop/blob/master/cmu_pronouncing_dictionary_notes.md

I'll update the repo and add a license so it becomes more useful. Thanks for the feedback

@Hathaway2010

@eyaler belatedly:
Here's a short and nasty one:

import re

def syllables(word):
    """Guess syllable count of word not in database.

    Parameters
    ----------
    word : str
        word not found in database

    Returns
    -------
    count : int
        estimated number of syllables

    See also
    --------
    tests/test_scan.py to clarify regular expressions
    """
    # Each run of consecutive vowels counts as one syllable to start with.
    vowels_or_clusters = re.compile("[AEÉIOUaeéiouy]+")
    # Vowel runs that usually span two syllables; each match adds one.
    vowel_split = re.compile("[aiouy]é|ao|eo[^u]|ia[^n]|[^ct]ian|iet|io[^nu]|[^c]iu|[^gq]ua|[^gq]ue[lt]|[^q]uo|[aeiouy]ing|[aeiou]y[aiou]") # exceptions: Preus, Aida, poet, luau
    final_e = re.compile("e$")
    # Endings in -ed/-es that usually add no syllable.
    silent_final_ed_es = re.compile("[^aeiouydlrt]ed$|[^aeiouycghjlrsxz]es$|thes$|[aeiouylrw]led$|[aeiouylrw]les$|[aeiouyrw]res$|[aeiouyrw]red$")
    # Consonant + "ely", where the middle e is usually silent (as in "lonely").
    lonely = re.compile("[^aeiouy]ely$")
    # Patterns that stop a final e from being dropped.
    audible_final_e = re.compile('[^aeiouylrw]le$|[^aeiouywr]re$|[aeioy]e|[^g]ue')
    word_lower = word.lower()
    count = len(vowels_or_clusters.findall(word_lower))
    if final_e.search(word_lower) and not audible_final_e.search(word_lower):
        count -= 1
    if silent_final_ed_es.search(word_lower) or lonely.search(word_lower):
        count -= 1
    count += len(vowel_split.findall(word_lower))
    # Every word has at least one syllable.
    if count == 0:
        count += 1
    return count

I wound up using this to guess any words not in Webster's Unabridged Dictionary from 1913, downloaded from Project Gutenberg and parsed into a database. Neither the dictionary nor this function is remotely infallible (the dictionary thinks the word "every" has three syllables, and the function doesn't know how to distinguish between "seneschal" -- three syllables -- and "sometimes" -- two), but I do think it's a refinement. I got the basic approach from syllapy and would be delighted to contribute this back to the repo :) If you want to see an expanded version that makes stronger efforts to be human readable, you can check out https://github.com/Hathaway2010/poetry-meter/blob/95d5fdbe7ffb8cde2191b4fd417010240060ea05/recurse_final.py#L89
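For readers who want to try the same arrangement, here is a minimal sketch of "look the word up first, guess only when it is missing". It is not the poetry-meter code: `count_syllables`, `known_counts`, and `naive_guess` are hypothetical names made up for this illustration, the tiny dict stands in for a real dictionary database, and `naive_guess` is a deliberately crude placeholder for a rule-based guesser such as `syllables` above.

```python
import re
from typing import Callable, Mapping


def naive_guess(word: str) -> int:
    """Crude placeholder guesser: count runs of vowels, minimum one."""
    return len(re.findall(r"[aeiouy]+", word.lower())) or 1


def count_syllables(
    word: str,
    known_counts: Mapping[str, int],
    guess: Callable[[str], int],
) -> int:
    """Prefer a stored count; fall back to the guesser for unknown words."""
    key = word.lower()
    if key in known_counts:
        return known_counts[key]
    return guess(key)


# Hypothetical stand-in for a dictionary-derived table (hand-corrected entry).
known_counts = {"every": 2}

in_table = count_syllables("Every", known_counts, naive_guess)       # from the table
not_in_table = count_syllables("banana", known_counts, naive_guess)  # from the guesser
```

Passing the guesser in as an argument keeps the lookup logic separate from whichever heuristic you trust, so a regex-based function can be swapped in without touching the wrapper.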

@Hathaway2010

> I later switched to 'pronouncing' https://github.com/aparrish/gen-text-workshop/blob/master/cmu_pronouncing_dictionary_notes.md
>
> I'll update the repo and add a license so it becomes more useful. Thanks for the feedback

"Pronouncing" looks splendid :) I should be using this too probably.

@eyaler

eyaler commented Apr 24, 2022

i am using this table for some manual fixes:
https://raw.githubusercontent.com/harrisj/nyt-haiku-python/master/nyt_haiku/data/syllable_counts.csv

in @mholtzscher's writeup for syllapy (https://mholtzscher.github.io/2018/05/29/syllables/), he mentions: "The closest thing I found was the CMU Pronouncing Dictionary. However, this database shows the phonemes for the words rather than syllables. In some cases the phonemes align with syllables but this is not always the case."

maybe @mholtzscher can advise regarding the issues you saw with CMU?

@peterchinman

I know this is two years later, but I am curious about @mholtzscher's phoneme/syllable misalignments. I couldn't think of an example where counting the ARPAbet vowels from the cmudict didn't give an accurate syllable count. (Though there are some instances where different pronunciations of a word give competing syllable counts.)
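To make the vowel-counting idea concrete, here is a small sketch. It assumes the cmudict convention that every vowel phoneme ends in a stress digit (0, 1, or 2) and consonants carry none. The function name `count_syllables_from_phones` is made up for this example, and the two pronunciation strings are typed in by hand for illustration, so check them against the dictionary before relying on them.

```python
def count_syllables_from_phones(phones: str) -> int:
    """Count phonemes that end in a stress digit, i.e. the vowels."""
    return sum(1 for phone in phones.split() if phone[-1].isdigit())


# Hand-typed pronunciations in cmudict style (assumed, not loaded from the dictionary).
feature = "F IY1 CH ER0"
features = "F IY1 CH ER0 Z"

feature_syllables = count_syllables_from_phones(feature)
features_syllables = count_syllables_from_phones(features)

# Total phonemes, for contrast: consonants are included in this number.
features_phonemes = len(features.split())
```

Under these assumptions the vowel count for both words is 2, while the raw phoneme count for "features" is 5. The `pronouncing` package mentioned earlier exposes `phones_for_word` and `syllable_count` for doing the same thing against the real dictionary.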

@mholtzscher
Owner

Hi @peterchinman, I can't recall the exact issues I ran into with CMU, but if I remember correctly, it had more phonemes than syllables for some words. For the readability work I was doing, this would inflate the syllable count and greatly affect the readability scores.
