With the explosion of social media have come new ways in which children can get into unsafe situations online. How can we encourage youth to explore freely while also making sure they stay safe from cyberbullying and other online abuse?
In a recent project, Strong Analytics tackled this problem in collaboration with a fast-growing technology company. We helped them leverage state-of-the-art machine learning and natural language processing to identify abusive content to which children were being exposed online.
Unique challenges of scale and style
The scale and the style of children's online content each posed a significant challenge in this project.
When it comes to scale, it's no secret that younger generations have a voracious appetite for online content. The average teenager spends almost 9 hours per day online. Successfully applying machine learning to this problem meant building a system that handles tens of millions of pieces of content — both media and natural language — per day.
Moreover, the style of children's online communication is unique and ever-evolving. Traditional language models and text processing pipelines would not suffice here, as they would risk missing abusive content in the nuance and complexity of these data.
We worked with our client to design and build a custom, end-to-end research and modeling platform that used several custom-trained, interwoven models to identify abusive text and media.
We deployed a pipeline with independent autoscaling for content triage, web content analysis, media content analysis, and text content analysis to ensure that we could identify abusive content in near real-time.
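To make the triage step concrete, here is a minimal sketch of how incoming content might be routed to the web, media, and text analysis stages. It is an illustration only, not the production system: the queue names, the `Item` type, the extension list, and the routing rules are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical queue names mirroring the three analysis stages named above.
WEB, MEDIA, TEXT = "web", "media", "text"

# Illustrative list only; a real system would inspect content types, not just extensions.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp4")


@dataclass
class Item:
    body: str  # raw content: a URL or a plain message


def triage(item: Item) -> str:
    """Pick the analysis queue for one piece of content (assumed rules)."""
    parsed = urlparse(item.body.strip())
    if parsed.scheme in ("http", "https"):
        # Links to media files go to media analysis, other links to web analysis.
        if parsed.path.lower().endswith(MEDIA_EXTENSIONS):
            return MEDIA
        return WEB
    # Everything that is not a link is treated as natural language.
    return TEXT


def route(items):
    """Group items by queue so each stage can be consumed and scaled on its own."""
    queues = defaultdict(list)
    for item in items:
        queues[triage(item)].append(item)
    return queues
```

Keeping a separate queue per stage is what lets each analysis service scale with its own backlog, so a spike in media uploads does not slow down text analysis.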
To deal with the nuances of children's online communication, we delivered more than just a model: we built a research platform that enables adaptive model updating and continuous integration into production. Our modeling approach included new foundational models that first learn about the nuances of multi-modal child communication, e.g., comprising language, emojis, and media. We then implemented a real-time feedback loop for model improvement, strategically selecting content to be annotated by a team of experts and then feeding those new data back into model training for continuous improvements.
Finally, we developed a suite of performance monitoring tools that keep a close eye not only on model accuracy and speed but, just as importantly, on the quality of data received via human feedback for continuous model tuning.
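One common way to select content strategically for annotation is uncertainty sampling: send experts the items the current model is least sure about. The sketch below shows that idea under stated assumptions. The selection strategy, the function name `select_for_annotation`, and the score format are ours for illustration, and this is not necessarily the strategy used in the project.

```python
def select_for_annotation(scores, budget):
    """Return up to `budget` content ids whose scores are closest to 0.5.

    `scores` maps a content id to the model's predicted probability that
    the content is abusive (an assumed format). A score near 0.5 means the
    model is least certain, so an expert label is likely most informative.
    """
    ranked = sorted(scores, key=lambda content_id: abs(scores[content_id] - 0.5))
    return ranked[:budget]


# Hypothetical scores for four pieces of content.
scores = {"a": 0.98, "b": 0.52, "c": 0.03, "d": 0.41}
to_annotate = select_for_annotation(scores, budget=2)
```

In this example the confident predictions ("a" and "c") are skipped and the two borderline items are queued for expert review; their labels would then be added to the training set for the next model update.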
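As one example of what monitoring the quality of human feedback can look like, the sketch below computes Cohen's kappa, a standard measure of agreement between two annotators that corrects for agreement expected by chance. Choosing this metric and the label names are assumptions for illustration; the monitoring tools described above may track different measures.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates consistent labeling, while a value near 0 means the annotators agree no more often than chance, which is a signal to review the annotation guidelines before those labels reach model training.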