Applying machine learning to natural language is a uniquely challenging task. If computer vision can be thought of as opening a window onto the physical world, the field of Natural Language Processing (NLP) opens a window onto a speaker’s mind. One could also argue it’s less of a window and more of an abstract code to be deciphered by the listener.
We know it’s possible to learn language; after all, babies learn to speak just as they learn to see. Most babies around the world learn even more than one language at a time!
Nevertheless, it is for good reason that babies’ vision and physical reasoning proceed more quickly than their language development: language is tricky. The same sentence can mean different things depending on context. Words evolve over time as speakers fundamentally change the meanings of words and how they are used (literally!). And, more often than you would think, speakers play with language in jokes or hyperbole, or even say the exact opposite of what they believe in sarcasm.
We believe that understanding and addressing these challenges through strong NLP research and engineering can lead to more robust, generalizable language models. Here we offer five tips for NLP engineers — or organizations considering a project in NLP — to consider in their approach.
Try to break your model
The best-case scenarios you present to your model in training often aren’t the most likely ones it will encounter out in the wild. This is truer in some domains than in others. If you’re simply processing New York Times articles to determine if they’re about Canadian politics, you can rely on mostly similar input formats from document to document. But if, say, you’re classifying Facebook posts to determine whether a user is depressed, or you’re analyzing data from multiple corpora, things get less predictable.
First, the input might break your model entirely by throwing an error. If you’ve trained your model on nothing but social media posts of 200-300 words and you suddenly force it to process a document of 5,000 emojis (this is not a hypothetical; we’re speaking from experience), your model could be unable to process such a large array or deal with one made up of characters that it has no familiarity with.
Second, your input could be unfamiliar in content but similar in structure such that the model produces a result, but an undesirable one. For example, if you have a topic modeling algorithm that sorts every single document into financial, sports, or political news, and then feed it a piece of celebrity gossip, the topic assigned will almost necessarily be incorrect.
To guard against both, NLP engineers should think similarly to information security architects who conduct penetration testing by trying to break what they build. It’s important to regularly stress-test your model, both in the initial training and after deployment, by feeding it unexpected inputs. In some cases, we’ve gone as far as writing quick scripts to generate gibberish, feeding it to the model, and adjusting the model based on how it responds. Additionally, we’ve often found it helpful to swap out words for their synonyms and see if our application still works as intended. For example, when classifying violent text, if “hit” is swapped out for “strike,” do we still get the result we want?
These are just a couple of the many small ways to make your model more robust. But regardless of exactly how you disrupt your model’s sense of regularity, you should never let your application get too comfortable.
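As a rough illustration of both checks, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the `SYNONYMS` table, the helper names, and the `toy_classify` stand-in, which takes the place of whatever model you actually deploy.

```python
import random
import string

# Hypothetical synonym table for illustration only; a real project
# would likely draw on a much larger thesaurus resource.
SYNONYMS = {"hit": ["strike", "punch"], "kill": ["murder"]}


def gibberish(n_tokens, seed=0):
    """Return n_tokens random tokens mixing letters, digits, punctuation, and emoji."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + string.punctuation + "😀🔥💯"
    return " ".join(
        "".join(rng.choice(alphabet) for _ in range(rng.randint(1, 12)))
        for _ in range(n_tokens)
    )


def synonym_variants(text):
    """Yield copies of text with one known word swapped for each of its synonyms."""
    words = text.split()
    for i, word in enumerate(words):
        for alt in SYNONYMS.get(word.lower(), []):
            yield " ".join(words[:i] + [alt] + words[i + 1:])


def stress_test(classify, labeled_texts, n_gibberish=5):
    """Collect inputs that crash the classifier or flip its label under a synonym swap."""
    failures = []
    # Check 1: increasingly long gibberish should not raise an error.
    for i in range(n_gibberish):
        sample = gibberish(n_tokens=50 * (i + 1), seed=i)
        try:
            classify(sample)
        except Exception as exc:
            failures.append(("crash", sample[:40], repr(exc)))
    # Check 2: swapping a word for a synonym should not change the label.
    for text, expected in labeled_texts:
        for variant in synonym_variants(text):
            try:
                got = classify(variant)
            except Exception as exc:
                failures.append(("crash", variant, repr(exc)))
                continue
            if got != expected:
                failures.append(("label_flip", variant, got))
    return failures


def toy_classify(text):
    """Stand-in classifier that only knows the word 'hit'."""
    return "violent" if "hit" in text.lower().split() else "nonviolent"


for failure in stress_test(toy_classify, [("I will hit you", "violent")]):
    print(failure)
```

The toy classifier survives the gibberish but fails both synonym swaps. That is the kind of brittleness this test is meant to surface before deployment.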
As language changes, so should your model
Much of language’s ambiguity stems from its shifting usage and meaning over time. For NLP engineers, this means that applications are doomed if the models they rely on are too static and aren’t able to adapt to new words, syntax, and constructions.
The past year provides ample examples. A model trained on text from 2018 and deployed to classify text from 2019 would reasonably have placed a document containing the word “lockdown” close to somewhat threatening content in vector space, which could have a number of implications depending on the application. By mid-2020, however, “lockdown” had become semi-synonymous with “quarantine”; a model built to classify violent text would be spinning out of control with false positives.
In cases like that, where the word was uncommon enough that it had not previously been part of everyday speech, training the model on a few additional examples is a feasible way to adjust its weights. In such cases, you can even sometimes simply make manual, rule-based adjustments (e.g., if a text contains “lockdown” and also contains “coronavirus,” classify it as nonviolent).
In other situations, where an extremely common word changes meaning (say, if there were a new $100-billion company called Water) or where entirely new words and slang become unignorable, more comprehensive re-training is required. In fact, it’s ideal to introduce new data to your model on a fairly regular basis to train it on words you might not even be aware of (at this point, your application should be able to process “sus”). In any event, use your instinct: if you sense that your model is producing results that diverge significantly from what you expect, it might be time for an adjustment.
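The parenthetical rule above can be expressed as a small post-processing step. This is only a sketch: the `OVERRIDE_RULES` table and the `apply_overrides` function are hypothetical names, and the tokenization is deliberately simple.

```python
import re

# Each rule pairs a set of words that must all appear in the text
# with the label to force when they do. Hypothetical example rule.
OVERRIDE_RULES = [
    ({"lockdown", "coronavirus"}, "nonviolent"),
]


def apply_overrides(text, predicted_label):
    """Return the forced label if a rule matches, else the model's own label."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    for required_words, forced_label in OVERRIDE_RULES:
        if required_words <= tokens:  # every required word is present
            return forced_label
    return predicted_label


print(apply_overrides("City extends coronavirus lockdown", "violent"))
print(apply_overrides("School lockdown after threat", "violent"))
```

Keeping rules like this in one visible table, separate from the model, makes them easy to audit and to retire once the model has been retrained.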
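Instinct can be backed up with a simple number. One possible signal, sketched below, is the share of incoming tokens that the model never saw in training. The function names and the 5% threshold are assumptions for illustration, not recommended values.

```python
import re


def oov_rate(texts, known_vocab):
    """Fraction of tokens in texts that are absent from the training vocabulary."""
    tokens = [tok for text in texts for tok in re.findall(r"[a-z']+", text.lower())]
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in known_vocab)
    return unknown / len(tokens)


def needs_attention(texts, known_vocab, threshold=0.05):
    """Flag a batch whose out-of-vocabulary rate exceeds an assumed threshold."""
    return oov_rate(texts, known_vocab) > threshold


training_vocab = {"that", "is", "very", "strange"}
print(oov_rate(["that is sus"], training_vocab))
print(needs_attention(["that is sus"], training_vocab))
```

A rising rate does not say what changed. It is only a prompt to look at recent inputs and decide whether a few new examples or a fuller re-training is called for.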
Context matters
As many NLP engineers know, a rose is not always a rose. The context of a text matters, and this is what makes detecting traits like sarcasm and humor one of the field’s harder problems. It also means potentially incorrect results in a classifier. In the case of our example model that detects violent language, this means, for instance, taking care that “i want to kill him” and “Sarah thought to herself, ‘I want to kill him’” aren’t analyzed in a way that elides their grammatical and syntactical nuances. To a human reader, the first sentence is more obviously the product of day-to-day speech patterns, which makes it more likely that a person actually typed it when communicating with someone else. The second sentence bears a clear literary quality that may be less concerning.
It’s easy to wind up with models that zero in on a single word or ngram without adjusting for context. Some preprocessing is designed to minimize the sheer number of distinct words or characters your model has to remember, and that can mean removing punctuation and rendering all words either lowercase or uppercase. In our example, it’s precisely the kind of information that’s often processed as extraneous that would help to negatively classify the latter sentence.
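To make the point concrete, the sketch below runs both example sentences through an aggressive normalization step, then extracts a few of the surface cues that step throws away. The function names and the choice of cues are assumptions for illustration; a real feature set would be richer.

```python
import re

# Characters treated as quotation marks. The plain and right-curly
# apostrophes are left out because they also appear in contractions.
QUOTE_CHARS = '"“”‘'


def aggressive_normalize(text):
    """Lowercase the text and drop everything but letters, digits, and spaces."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()


def context_features(text):
    """Record surface cues that aggressive normalization would discard."""
    first_letter = next((ch for ch in text if ch.isalpha()), "")
    return {
        "quote_marks": sum(text.count(ch) for ch in QUOTE_CHARS),
        "capitalized_start": first_letter.isupper(),
        "has_comma": "," in text,
    }


typed = "i want to kill him"
narrated = "Sarah thought to herself, ‘I want to kill him’"

# After normalization the threatening phrase looks identical in both.
print(aggressive_normalize(typed))
print(aggressive_normalize(narrated))

# The discarded cues are exactly what tells the two apart.
print(context_features(typed))
print(context_features(narrated))
```

One option is to keep the normalized text for the vocabulary-based model and pass cues like these alongside it, so nothing useful is lost.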
Other information can factor into contextualization as well. Is there an unusual degree of italicization and bolding? Is there preceding or succeeding information, for example a reply, that modulates certain qualities that might otherwise seem like clear signals? Contextualizing a document is one of the perennially difficult tasks in NLP, not least because there are so many possible ways to do so (and just as changes in language necessitate adjustments to the model, so do new contexts emerge and shift). But taking these contexts into consideration is necessary to minimize sloppiness and maximize accuracy and precision when your application is ultimately deployed.
Don’t necessarily rely on just one model
Sometimes the best way to improve a model is to build a new one entirely and integrate them together in a single pipeline. This can be particularly helpful when trying to contextualize a document. For example, one model can classify a document as violent or non-violent looking just at the vocabulary itself. If it’s labeled as violent, it can then be passed through another model that examines all the data clipped from the first, like punctuation, style, and so on, to determine if the document’s context changes the text’s intent.
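The two-step flow described above might be wired together as follows. Both models here are trivial stand-ins invented for the example (a keyword check and a quotation-mark check); in practice each stage would be a trained model.

```python
import re

# Hypothetical keyword list standing in for a trained vocabulary model.
VIOLENT_WORDS = {"kill", "hit", "strike"}


def vocab_model(text):
    """Stage one stand-in: judges vocabulary only, ignoring case and punctuation."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return "violent" if VIOLENT_WORDS & words else "nonviolent"


def context_model(text):
    """Stage two stand-in: treats text containing quotation marks as narration."""
    quoted = any(ch in text for ch in '"“”‘')
    return "nonviolent" if quoted else "violent"


def two_stage_classify(text, stage_one=vocab_model, stage_two=context_model):
    """Run the context model only on documents the vocabulary model flags."""
    if stage_one(text) != "violent":
        return "nonviolent"
    return stage_two(text)


print(two_stage_classify("i want to kill him"))
print(two_stage_classify("Sarah thought to herself, ‘I want to kill him’"))
print(two_stage_classify("nice weather today"))
```

Because the second stage runs only on flagged documents, it can be slower or more specialized without adding cost to the bulk of the traffic.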
By divvying up jobs among multiple models, your application benefits in two ways. First, each model can focus more granularly on its task and perform better than a single model trying to do everything: a machine learning division of labor. Second, it prevents runaway complexity in any single model, which would make adjustments more difficult (and resource intensive) and the cause of a given result harder to pinpoint.
Deploying an additional model is especially beneficial when it can be trained on open or existing data that’s already labeled (or is very easy to label) so that building it doesn’t require you to effectively double the time spent labeling data. If, say, you wanted to include a step in a pipeline that first determines the language of the document, there’s no shortage of pre-labeled corpora to begin with. But even less straightforward examples abound, and it’s worth doing a search as you build to see if an existing dataset is available for you to add an additional model more easily than you might initially expect.
Keep humans in the loop
Lastly, never forget what you’re ultimately analyzing is the output of humans — as such, humans themselves are necessary to ensure your model is doing what it’s supposed to, flag false positives, and overall provide quality assurance.
If your application is processing millions of documents every day, you obviously won’t be able to review every single classification or output. But having a few people regularly read through a sample of the results goes a long way to spotting any red flags. In certain applications, you might also pass every single document that’s assigned a particular label to human review so that you don’t end up mistakenly automating an action that deserves rigorous discretion. As a rule of thumb, the more consequential your model’s results are on people’s lives, the more critical human review is.
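Both review strategies, spot-checking a sample and escalating every document with a particular label, can be combined in one routing step. The sketch below assumes predictions arrive as `(doc_id, label)` pairs; the function name, the label names, and the 1% default rate are hypothetical.

```python
import random


def select_for_review(predictions, always_review=("violent",), sample_rate=0.01, seed=None):
    """Build a human review queue from (doc_id, label) pairs.

    Every document whose label is in always_review is queued; the rest
    are queued at random with probability sample_rate.
    """
    rng = random.Random(seed)
    queue = []
    for doc_id, label in predictions:
        if label in always_review:
            queue.append((doc_id, label, "mandatory"))
        elif rng.random() < sample_rate:
            queue.append((doc_id, label, "spot_check"))
    return queue


predictions = [(1, "nonviolent"), (2, "violent"), (3, "nonviolent")]
print(select_for_review(predictions, sample_rate=0.0))
```

The labels that trigger mandatory review should track how consequential the automated action is, in line with the rule of thumb above.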