IdeaPad.md

Useful links

Regex

^([^\s]+ )+\[([a-zA-Z0-9 ]+)\] /([^/]+)/
^([^\[#]+) \[([a-zA-Z0-9 ]+)\]
^([^\[#]+) \[([^\] ]+)\] /(.+)/$
^([^\[#]+) \[([^\]]+)\] /(.+)/$

val chPhonemesPattern = "^([^\\[#]+) \\[([^\\]]+)\\] /(.+)/$".r

// Strip tone digits 1-4 and square brackets from a pinyin string.
def removeTonesBrackets(ph: String): String =
  ph.filterNot(c => c == '1' || c == '2' || c == '3' || c == '4' || c == '[' || c == ']')

// Parse one dictionary line into a ChineseEntry; None if the line does not match.
def parseCnLinePhonemes(line: String): Option[ChineseEntry] =
  line match {
    case chPhonemesPattern(word, pinyin, definition) =>
      Some(ChineseEntry(word, pinyin, definition, line))
    case _ => None
  }

A A [A] /(slang) (Tw) to steal/
AA制 AA制 [A A zhi4] /to split the billto go Dutch/
AB制 AB制 [A B zhi4] /to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two
P民 P民 [P min2] /(slang) shitizen/commoner/hoi polloi/
Q Q [Q] /cute (loanword)/(of food) having a pleasant chewiness (like mochi, tapioca pearls, taro balls etc - foods with a springy or gel-like mouthfeel)/
T字帳 T字帐 [T zi4 zhang4] /T-account (accounting)/
T字褲 T字裤 [T zi4 ku4] /thong (underwear)/
T恤 T恤 [T xu4] /T-shirt/
T裇 T裇 [T xu1] /T-shirt/
〧 〧 [qi1] /numeral 7 in Suzhou numeral system 蘇州碼子|苏州码子[Su1 zhou1 ma3 zi5]/
〨 〨 [ba1] /numeral 8 in Suzhou numeral system 蘇州碼子|苏州码子[Su1 zhou1 ma3 zi5]/
〩 〩 [jiu3] /numeral 9 in Suzhou numeral system 蘇州碼子|苏州码子[Su1 zhou1 ma3 zi5]/
㐄 㐄 [kua4] /component in Chinese characters, mirror image of 夂[zhi3]/

Future ideas

  • Create a command-line interface that takes a sentence and returns it with an annotation for each word that has a Chinese pun
  • Deal with dirty data
  • Figure out how to handle conversion to phonemes better
  • Search needs to be more flexible than a plain map lookup; it needs wildcards
  • A Chinese phoneme can map to more than one English phoneme, e.g. chinese("an") -> english("AE", "AO"), and vice versa.
  • What to do when there's no exact match
  • English words with multiple pronunciations?
  • Deal with Chinese -i final which depends on what it follows, e.g. zhi vs qi
  • Use WordNet to determine meaning of words and identify cognates (since those are less interesting)
  • Computational Humor: https://en.wikipedia.org/wiki/Computational_humor
    • Apparently there is a "That's what she said" joke detector
  • Mash-up with Twitter feed
    • Take a tweet and check each of its words for a cross-language pun
  • Type/implicit to make "english sentence" toPhonemes possible. Detects that the input is English and then translates
  • Spoonerism = sentence where two consecutive words' first letters can be swapped and produce another interesting sentence.
  • Ability to add to a list of rules to ignore, e.g. ignore word2 = "ING", ignore matching suffix LY when word is LEIGH or LEE
  • Need to find an association dictionary to lookup word and get related words. Like word association cloud.
  • When doing searches by word, e.g. find ("NAZI")
  • http://www.datamuse.com/api/ for a word finding API
  • http://www.rhymer.com/
  • nearby word search needs to be smarter about which phonemes can be inserted, removed, or patched. Soundex has possibilities.
  • as I'm looking through results, it'd be nice to be able to file away ideas as they come up
  • should make sure that when I file ideas, I can store the formula or search I used to find it
  • always have the formula
  • consider using machine learning to determine proximity of words rather than imperative
  • use lex
  • use notebook interface for exploring output
  • use reddit as target (and then upvotes helps give feedback)
  • use punoff.com as target
  • https://api.datamuse.com/words?rel_trg=canoe
  • Observation that 1 and 2 syllable words are easier to build from
  • WAGGING is two phonemes from WAGON, but still quite close
  • FIX: index on Seq[Phoneme], not Seq[Option[Phoneme]]
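
The "nearby word search" idea and the WAGGING/WAGON observation above can be made concrete with an edit distance computed over phonemes instead of letters. The following is only a minimal sketch under stated assumptions: `Phoneme` is taken to be a plain ARPAbet-style string, `PhonemeDistance`, `distance`, and `nearby` are hypothetical names, and every insertion, removal, or substitution costs 1 (a smarter version would weight similar-sounding phonemes lower).

```scala
object PhonemeDistance {
  // Assumption: phonemes are ARPAbet-style symbols such as "AE" or "NG".
  type Phoneme = String

  // Levenshtein distance over phoneme symbols rather than characters.
  def distance(a: Seq[Phoneme], b: Seq[Phoneme]): Int = {
    val av = a.toVector
    val bv = b.toVector
    // prev is the previous row of the dynamic-programming table.
    var prev = Vector.range(0, bv.length + 1)
    for (i <- 1 to av.length) {
      var curr = Vector(i)
      for (j <- 1 to bv.length) {
        val substitution = prev(j - 1) + (if (av(i - 1) == bv(j - 1)) 0 else 1)
        val deletion = prev(j) + 1
        val insertion = curr(j - 1) + 1
        curr = curr :+ math.min(substitution, math.min(deletion, insertion))
      }
      prev = curr
    }
    prev.last
  }

  // Hypothetical helper: dictionary words within maxDistance of the query,
  // closest first (ties broken alphabetically).
  def nearby(query: Seq[Phoneme],
             dict: Map[String, Seq[Phoneme]],
             maxDistance: Int): Seq[(String, Int)] =
    dict.toSeq
      .map { case (word, phonemes) => (word, distance(query, phonemes)) }
      .filter { case (_, d) => d <= maxDistance }
      .sortBy { case (word, d) => (d, word) }
}
```

With WAGGING entered as `W AE G IH NG` and WAGON as `W AE G AH N`, `distance` counts two substitutions, which matches the "two phonemes from" observation. Note that the sketch works on `Seq[Phoneme]` directly, in line with the FIX item about not indexing on `Seq[Option[Phoneme]]`.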

Things I've learned

  • Regex
    • Can unapply the regex to parse a line
    • IntelliJ ctrl-f find can be used to figure out the regex
  • Use collect with an anonymous partial function to only get the lines that match
  • Promote stuff out of main into their own classes so I can work with them in console
  • Should record puns and use machine learning to help score quality
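
The regex-unapply and `collect` points above fit together naturally. This is a minimal sketch, assuming a `ChineseEntry` case class with the four fields used in the parsing snippet (its real definition is not shown in these notes); `CollectSketch` and `parseAll` are hypothetical names.

```scala
object CollectSketch {
  // Assumed shape of ChineseEntry; the real class is not shown in these notes.
  final case class ChineseEntry(word: String, pinyin: String, definition: String, line: String)

  // Same pattern as chPhonemesPattern in the Regex section.
  val chPhonemesPattern = "^([^\\[#]+) \\[([^\\]]+)\\] /(.+)/$".r

  // collect takes an anonymous partial function: lines the regex extractor
  // does not match (comments, malformed entries) are simply dropped.
  def parseAll(lines: Seq[String]): Seq[ChineseEntry] =
    lines.collect {
      case line @ chPhonemesPattern(word, pinyin, definition) =>
        ChineseEntry(word, pinyin, definition, line)
    }
}
```

For example, `CollectSketch.parseAll(Seq("# comment", "Q Q [Q] /cute (loanword)/"))` keeps only the second line. With this pattern the first capture group holds both the traditional and simplified forms (`"Q Q"`), so splitting them would be a separate step.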

Patterns

  • PATTERN = PAT URN
  • TIDE => T + I'D => THOUGHT + I'D
  • VAMPIRE => RE + VAMP + IRE => REVAMP + IRE
  • STAR => STARCH => STARCH WRECK => STAR TREK
  • LONELY => (R for L) => RONERY
  • PREREQ => PRE-WRECK
  • CANOE => RUDDER => READY OR NOT, HERE I COME => RUDDER OR NOT, HERE I COME
  • PUT THAT ECLIPSE WHERE THE SUN DON'T SHINE
  • I'M ACHING OF MY CASTLE
  • Formula = [find pun], then create story in which pun makes sense
  • TAN GENTS = FILIPINO KINGS OF DISTRACTION = TAN GENTS
  • [PHONEMES OF WORD ENDING WITH AH N] - [AH N] sounds like a [WORD ENDING WITH IH NG]
    • ACCORDION => ACCORDING: "ACCORDING to my sources, tango music..."
    • ALIEN => ALING
    • IF the US got invaded by Martians, we'd have an ALIEN NATION
    • Input: Erring -> EH R IH NG
    • EH R IH NG -> EH R AH N
    • EndsWith: EH R AH N, sorted by closest (fewest phonemes different)
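
The last pattern (swap an `IH NG` ending for `AH N`, or the reverse, then search by ending) could be sketched as below. Everything here is an assumption for illustration: `SuffixSwap`, `swapSuffix`, and `endsWithMatches` are hypothetical names, phonemes are plain ARPAbet-style strings, the sample entries are hand-entered rather than taken from a pronouncing dictionary, and "closest" is approximated as the fewest extra phonemes before the matched ending.

```scala
object SuffixSwap {
  // Assumption: phonemes are ARPAbet-style symbols.
  type Phoneme = String

  // If phonemes ends with `from`, replace that ending with `to`; otherwise None.
  def swapSuffix(phonemes: Seq[Phoneme],
                 from: Seq[Phoneme],
                 to: Seq[Phoneme]): Option[Seq[Phoneme]] =
    if (phonemes.endsWith(from)) Some(phonemes.dropRight(from.length) ++ to)
    else None

  // Words whose phonemes end with target, paired with the number of extra
  // leading phonemes, fewest first (ties broken alphabetically).
  def endsWithMatches(target: Seq[Phoneme],
                      dict: Map[String, Seq[Phoneme]]): Seq[(String, Int)] =
    dict.toSeq
      .collect { case (word, ph) if ph.endsWith(target) => (word, ph.length - target.length) }
      .sortBy { case (word, extra) => (extra, word) }

  // Illustrative, hand-entered entries only; not a real dictionary.
  val sampleDict: Map[String, Seq[Phoneme]] = Map(
    "AARON"  -> Seq("EH", "R", "AH", "N"),
    "BARREN" -> Seq("B", "EH", "R", "AH", "N"),
    "WAGON"  -> Seq("W", "AE", "G", "AH", "N")
  )
}
```

Starting from Erring (`EH R IH NG`), `swapSuffix` with `from = IH NG` and `to = AH N` gives `EH R AH N`, and `endsWithMatches` on that target ranks AARON ahead of BARREN in the sample entries. Passing the endings the other way round covers the ACCORDION => ACCORDING direction.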

Notes for Presentation

  • Observation: I can generate lots of candidates, but how to filter them?
    • Filter out sentences that have unrecognized words
    • Rules that filter out boring things like possessives
  • How to generate phrases in a category
  • Ideas: highlight the word
  • Ideas: how to show results better
  • Ideas: how to find "on topic phrases"
  • http://punoff.com/traditional-topics-for-punslingers/
  • Observation: It's not sufficient to have a sentence that's related to the topic. The substitution must be relevant
  • Resource: https://github.com/niderhoff/nlp-datasets
  • Learning: using itertools is helpful for dealing with large data
  • Learning: chaining maps like in Scala isn't pythonic and is awkward
  • Learning: be careful with iterators/generators because you can accidentally empty them
  • Evolution: instead of just looking for whether a sentence puns with my word, look for whether it contains
    • Synonyms
  • Evolution: spoonerisms
  • Evolution: parallelization
  • Problem: how to cull candidate sets that explode combinatorially
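
The spoonerism idea, combined with the "filter out unrecognized words" observation, might look roughly like this. It is a sketch only: `Spoonerism`, `swapFirstLetters`, `candidates`, and `recognized` are hypothetical names, the vocabulary is whatever word set the caller supplies, and it swaps letters as the earlier definition says (swapping leading phonemes would be closer to how spoonerisms sound).

```scala
object Spoonerism {
  // Swap the first letters of two words, e.g. ("jelly", "beans") -> ("belly", "jeans").
  def swapFirstLetters(a: String, b: String): (String, String) =
    if (a.isEmpty || b.isEmpty) (a, b)
    else (s"${b.head}${a.tail}", s"${a.head}${b.tail}")

  // One candidate sentence per pair of consecutive words, with that pair swapped.
  def candidates(sentence: String): Seq[String] = {
    val words = sentence.split("\\s+").filter(_.nonEmpty).toVector
    (0 until words.length - 1).map { i =>
      val (a, b) = swapFirstLetters(words(i), words(i + 1))
      words.updated(i, a).updated(i + 1, b).mkString(" ")
    }
  }

  // Keep candidates that differ from the input and contain only known words.
  def recognized(sentence: String, vocabulary: Set[String]): Seq[String] =
    candidates(sentence).filter { c =>
      c != sentence && c.split(" ").forall(w => vocabulary.contains(w.toLowerCase))
    }
}
```

For instance, `Spoonerism.candidates("jelly beans")` yields `belly jeans`, and `recognized` keeps it only if both words are in the supplied vocabulary. Since each sentence of n words produces at most n - 1 candidates, this step on its own does not add to the combinatorial explosion noted above.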