Children's Conversational Training Data for Machine Learning

While I have written quite a bit about the potential uses of a chatbot in educating young children, I am not the first person to have the idea. Indeed, the limitations in this area seem to stem less from a shortage of ideas than from practical factors.

One such limitation, at least for smaller entities and startups building chatbots, is the lack of publicly available annotated conversations (training data) from young children. Such data is essential for training NLP tools to correctly identify the meaning behind early childhood language. Without it, any chatbot geared towards young children would be of little use: without understanding the purpose of the child's words, it would fail to give an adequate response, no matter how well thought out that response is in itself.

While there is plenty of children's conversational data lying around the internet, several factors make much of it unusable in practice. First are university ethics guidelines, which usually require that conversational data from children be collected specifically for research, rather than sold to researchers as an afterthought. Second, such data must be cleaned and/or transcribed, which is harder when the speech is messy or unintelligible, as children's speech often is. Finally, with children, small age differences have big implications for speech, so it is essential that any dataset include metadata such as the child's age (or be limited to a narrow age bracket altogether). Gender could potentially be relevant as well.
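The age-bracket requirement above can be sketched in a few lines. This is a hypothetical example: the record fields (`age_months`, `utterance`) are assumptions for illustration, not the schema of any real corpus.

```python
# Hypothetical sketch: filtering transcript records by child-age metadata.
# The field names ("age_months", "utterance") are assumed, not taken from
# any real corpus schema.

def filter_by_age_bracket(records, min_months, max_months):
    """Keep only records whose child falls inside a narrow age bracket."""
    return [r for r in records if min_months <= r["age_months"] <= max_months]

transcripts = [
    {"age_months": 30, "utterance": "doggie go"},
    {"age_months": 48, "utterance": "I want the red one"},
    {"age_months": 90, "utterance": "We learned about planets today"},
]

# Keep only two-to-four-year-olds (24-59 months).
preschool = filter_by_age_bracket(transcripts, 24, 59)
print(len(preschool))  # prints 2
```

The point of the sketch is simply that without per-record age metadata, a filter like this is impossible, and the dataset can only be used as one undifferentiated blob.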

“A surprisingly small number of corpora have been produced which specifically contain child and/or teenage language”

Children Online: a survey of child language and CMC corpora (Baron, Rayson, Greenwood, Walkerdine and Rashid)

Even accounting for these challenges, one study finds that a "surprisingly small number of corpora have been produced which specifically contain child and/or teenage language." It is worth noting that the study's assessment was skewed by its specific application, the protection of children online, and by the authors' status as a British university: datasets that were otherwise quite usable but consisted mostly of American speakers had that listed as a drawback, when for a chatbot it might actually be an advantage to be most fluent in a relatively generic American vernacular. On the flip side, the study may not have emphasized enough the lack of datasets focused on younger children (many corpora were broadly K-12 or covered only late teens).

One corpus that I found separately, and that was also mentioned in the study, was CHILDES, a database of child language drawn primarily from children aged 5 and younger. It stood out to me for the breadth of its data and the precise age range of its conversations, and I did not find the low number of British English speakers to be as much of a problem as the researchers did. I will certainly explore this corpus further and start training with it.
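As a first step toward working with CHILDES, here is a minimal sketch of pulling child utterances out of a transcript in its CHAT format, where each utterance line begins with a starred speaker code (the target child is conventionally `CHI`). The sample transcript below is invented for illustration; real CHILDES files carry much richer annotation (e.g. `%mor` morphology tiers, continuation lines), which dedicated libraries such as pylangacq handle properly.

```python
# Minimal sketch: extracting a child's utterances from a CHAT-format
# transcript. SAMPLE is an invented toy example; real CHILDES files are
# richer and better handled by a dedicated CHAT parser.

SAMPLE = """\
@Begin
@Participants:\tCHI Target_Child, MOT Mother
*MOT:\twhat is that ?
*CHI:\tdoggie .
%mor:\tn|doggie .
*CHI:\tdoggie go ?
@End
"""

def child_utterances(chat_text, speaker="CHI"):
    """Return the utterance lines spoken by the given speaker code."""
    prefix = f"*{speaker}:"
    utterances = []
    for line in chat_text.splitlines():
        if line.startswith(prefix):
            # Drop the speaker marker and surrounding whitespace,
            # keeping only the utterance itself.
            utterances.append(line[len(prefix):].strip())
    return utterances

print(child_utterances(SAMPLE))  # prints ['doggie .', 'doggie go ?']
```

Even this crude pass shows why the corpus is attractive: the speaker codes and header tiers make it straightforward to separate the child's speech from the adult's before any model training begins.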