Python API for retrieving American-English child-directed speech transcripts, ordered by the age of the target child.
from aochildes.dataset import AOChildesDataSet
transcripts = AOChildesDataSet().load_transcripts()
from aochildes.dataset import AOChildesDataSet
transcripts = AOChildesDataSet(sex='male').load_transcripts()  # also excludes the many transcripts that have no sex annotation
Retrieve sets of entities, like fictional characters mentioned during child-language interactions (e.g. book reading):
from aochildes.persons import FICTIONAL
print(FICTIONAL)
A variety of parameters can be set to control how much processing is performed on the raw transcripts.
These parameters are defined in params.py and should be edited there directly.
For example, one parameter determines whether all utterances containing the Unicode symbol '�', 'xxx', or 'yyy' are discarded.
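To illustrate what such a parameter might control, here is a minimal sketch of this kind of filter. It is not the code used by this package: the names BAD_SYMBOLS and discard_bad_utterances are hypothetical, and the sketch assumes that utterances are plain strings and that a simple substring match is enough to detect the symbols.

```python
from typing import Iterable, List

# Hypothetical names for illustration only; they are not taken from params.py.
# '\ufffd' is the Unicode replacement character, which is displayed as '�'.
BAD_SYMBOLS = ('\ufffd', 'xxx', 'yyy')


def discard_bad_utterances(utterances: Iterable[str],
                           discard: bool = True,
                           ) -> List[str]:
    """Return the utterances, dropping any that contain a bad symbol if discard is True."""
    if not discard:
        return list(utterances)
    # Assumption: a substring match is sufficient to detect each symbol.
    return [u for u in utterances if not any(s in u for s in BAD_SYMBOLS)]


utterances = ['look at the doggy', 'xxx the ball', 'what is \ufffd that', 'yyy']
kept = discard_bad_utterances(utterances)
print(kept)  # ['look at the doggy']
```

Passing discard=False returns every utterance unchanged, which mirrors switching the parameter off.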
Developed on Ubuntu 18.04 and Python 3.7.