Children's Conversational Training Data for Machine Learning

While I have written quite a bit about the potential uses of a chatbot in educating young children, I am hardly the first person to have the idea. Indeed, the limitations in this area do not seem to stem primarily from a shortage of ideas, but from other practical factors.

One such limitation, at least for smaller entities and startups building chatbots, is the lack of publicly available annotated conversations (training data) from young children. Such data is essential for training NLP tools to correctly identify the meaning behind early childhood language. Without it, any chatbot geared towards young children would not be very useful: if it cannot understand the purpose of the child's words, it will fail to give an adequate response no matter how well thought out that response is in itself.
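
To make the requirement concrete, here is a rough sketch of what a minimally useful annotated child-conversation dataset might look like. The utterances, intent labels, and field names below are all invented for illustration, not drawn from any real corpus:

```python
# Hypothetical example of annotated child-utterance training data.
# Field names, labels, and utterances are invented for illustration;
# a real corpus would need many thousands of examples per intent.
annotated_utterances = [
    {
        "utterance": "i maked a big tower and it felled down",
        "intent": "share_event",   # what the child is trying to accomplish
        "age_months": 42,          # age metadata matters a lot at this stage
        "speaker": "CHI",
    },
    {
        "utterance": "where my doggy go",
        "intent": "ask_question",
        "age_months": 38,
        "speaker": "CHI",
    },
    {
        "utterance": "i dont wanna do it no more",
        "intent": "refuse_activity",
        "age_months": 50,
        "speaker": "CHI",
    },
]

# A chatbot's language-understanding component would be trained to map the
# raw utterance to the intent label, despite non-adult grammar
# ("maked", "felled") and non-standard spelling.
```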

While plenty of children's conversational data is lying around the internet, several factors make much of it unusable in practice. First are university ethics guidelines, which usually require that conversational data from children be collected specifically for research, rather than sold to researchers as an afterthought. Then the data must be cleaned up and/or transcribed, which is harder when children's speech is messy or unintelligible. In addition, with children, small age differences have big implications for speech, so it is essential that any dataset include metadata such as child age (or be limited to a narrow age bracket altogether). Gender could potentially be relevant as well.

“A surprisingly small number of corpora have been produced which specifically contain child and/or teenage language”

Children Online: a survey of child language and CMC corpora (Baron, Rayson, Greenwood, Walkerdine and Rashid)

Even accounting for these challenges, one survey finds that a "surprisingly small number of corpora have been produced which specifically contain child and/or teenage language." It is worth noting that the survey's perspective was shaped by its specific application (protecting children online) and by its authors' affiliation with a British university: datasets that were otherwise quite usable but consisted mostly of American speakers had that listed as a drawback, when for my purposes a chatbot most fluent in a relatively generic American vernacular might actually be a good thing. On the flip side, the survey may not have emphasized enough the lack of datasets focused on younger children (many were broadly K-12 or limited to late teens).

One corpus that I found separately, but which was also mentioned in the survey, is CHILDES, a database of conversations involving children who are primarily five and younger. It stood out to me for the breadth of its data and the precise age information attached to the conversations, and I do not consider the low proportion of British English speakers to be as much of a problem as the researchers did. I will certainly explore this corpus further and start training with it.
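
As a first step, CHILDES transcripts in their XML form can be loaded with NLTK's CHILDESCorpusReader. The sketch below assumes the Eng-USA XML collection has already been downloaded from the CHILDES website and unpacked locally; the directory path and the "Valian" corpus name are placeholders for whatever subset I end up using:

```python
import nltk
from nltk.corpus.reader import CHILDESCorpusReader

# Assumes the CHILDES XML data (e.g. the Eng-USA collection) has been
# downloaded from childes.talkbank.org and unpacked under this root;
# the path and corpus name below are placeholders.
corpus_root = nltk.data.find("corpora/childes/data-xml/Eng-USA/")
corpus = CHILDESCorpusReader(corpus_root, r"Valian/.*\.xml")

for fileid in corpus.fileids()[:5]:
    age_in_months = corpus.age(fileid, month=True)        # per-transcript age metadata
    child_words = corpus.words(fileid, speaker=["CHI"])   # child utterances only
    print(fileid, age_in_months, child_words[:10])
```

The age metadata is exactly what makes CHILDES attractive here: transcripts can be filtered down to a narrow age bracket before any training happens.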

Looking Behind the Surface for Child-Oriented Chatbots

Previously, I mentioned that a chatbot designed for children has to treat its interactions fundamentally differently than one made for adults. The exigence of most adult-robot communication is, in most cases, "I need help" or "I was redirected here instead of to human support." The exigence of most child-robot communication is different: a child can't reasonably be expected to be trying to get anything out of what he or she probably sees as a conversation with a robotic friend. This makes the job of a child-oriented chatbot all the more challenging when it attempts to deal with, or otherwise account for, a child's emotional issues.

Of course, this applies to ordinary chatbots to some extent. One example I discussed previously was Woebot, which is aimed at psychological health. Its website mentions that Woebot establishes "a bond with users that appears to be non-inferior to the bond created between human therapists and patients." This implies that, at least in part, Woebot gauges emotion from the patient explicitly stating his or her feelings, as would happen in a therapist-patient relationship. Indeed, the bot's exigence is built in: it is downloaded specifically for mental-health purposes.

Child-oriented chatbots wouldn't have this same luxury. Setting aside the fact that few children I know could adequately express their feelings even if they wanted to, a chatbot that adopts the persona of a friend or mentor has a harder time establishing any need to express feelings, since children wouldn't talk to the bot in anything but a casual way. A chatbot can always just ask "how are you feeling?", but that most likely wouldn't yield accurate answers all of the time (imagine being asked this question yourself). Instead, a chatbot would have to infer emotions from the language used.

Given adequately labelled data, natural language models can identify both stress levels and emotion in text. However, it's unclear whether the same method used in the study can be applied to the language of young children: with a smaller vocabulary (and therefore fewer emotionally charged word choices), much of a human's ability to interpret a young child's emotions (mine, anyway) rests on non-verbal cues and vocal inflections that can't be fed into a text-based chatbot.
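
As a rough illustration of the labelled-data approach (a generic text-classification sketch, not the specific method from the study), a classifier can be trained on utterances tagged with emotion labels. The utterances and labels below are invented, and whether a model that only sees surface text transfers to early-childhood language is exactly the open question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled data; real training would need thousands of annotated
# child utterances, which is precisely the missing resource.
texts = [
    "i dont wanna go to school no more",
    "my tummy hurts and i want mommy",
    "look i builded a huge rocket ship",
    "we goed to the park and i swinged so high",
]
labels = ["distress", "distress", "excited", "excited"]

# Surface-level lexical features only -- no access to tone of voice,
# facial expression, or the other non-verbal cues mentioned above.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["i felled down and it hurted"]))  # hopefully "distress"
```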