Strategy, Technical | 06/16/2021

Designing Object Tracking Systems

Francisco Gonzalez
Lead ML Scientist
Jacob Zweig
Co-Founder, Principal Consultant

Object tracking involves a distinct set of challenges and trade-offs that make it one of the most demanding specialties within computer vision. Video data is typically heavier than static images used in classification models. Speed is often paramount, as many models are deployed to track objects in real time, necessitating tight optimizations for latency. And unlike other computer vision models, object trackers often have to keep their eyes on more than one object while maintaining those objects’ unique identities.

The complex balancing act has meant that object tracking was relatively unreliable until recently. But the combination of advances in deep learning and the proliferation of edge devices has meant not only that object tracking models have grown more robust, but that there are now viable roads to deployable applications. From traffic to security to inventory management, the sophistication of newer models and hardware has led the discipline to an inflection point and ushered in scores of engineers to tackle problems in object tracking.

If you’re thinking about building object tracking into your application, there’s a lot to consider. Here are five high-level points to get you oriented and hopefully refine your product road map.

Decide whether you need to track objects or just detect them

Unless you’re tracking objects aerially on an ice rink, chances are that the line of sight between your sensor and the objects being tracked will suffer from interference. Whether or not you build a model to continually maintain the identity of those objects depends on your application.

If you’re tracking horses in a race, for instance, where jockeys pull ahead and fall behind one another throughout the race and persistently occlude each other, maintaining the horses’ identities is central to the application: what matters isn’t that some horse is in third, but that Bag O’ Beans is. On the other hand, a system built to monitor the availability of parking spaces doesn’t need to know which cars are occupying a lot, just how many of them there are.
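To make the distinction concrete, here is a minimal sketch in Python. It assumes detections arrive as (x1, y1, x2, y2) boxes from some upstream detector; the function names and the 0.3 overlap threshold are illustrative assumptions, not part of any particular library. Counting needs only the boxes in the current frame, while tracking has to carry identities from one frame to the next, done here with a simple greedy overlap match.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_objects(detections):
    """Detection-only use case (parking lot): the answer is the number of boxes."""
    return len(detections)


def associate(tracks, detections, min_iou=0.3):
    """Tracking use case (horse race): carry each ID to its best-overlapping box.

    tracks maps a track ID to its last known box. Returns a dict mapping each
    matched track ID to its new box. Greedy matching, highest overlap first.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matched, used = {}, set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # every remaining pair overlaps even less
        if tid in matched or j in used:
            continue
        matched[tid] = detections[j]
        used.add(j)
    return matched
```

Greedy overlap matching is only one possible association rule, and it breaks down under exactly the kind of persistent occlusion described above, which is why identity-critical applications usually need more than this.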

If possible, determine how many objects you need to track ahead of time

A self-driving car can’t know ahead of time how many objects will be in its line of sight at a given moment, and a surveillance system probably won’t be deployed to monitor a predetermined number of people. But if you do know how many objects you’ll be tracking, you’ll reduce the need for algorithmic guesswork, greatly simplifying the application.

Consider a model built to track people in a soccer game. The model can be told that there are 23 people total, and no more: 11 players from each team, and one referee. Constraining the system this way helps avoid false positives (the unruly fan on the field can’t be identified as a twelfth player on a team) and mitigates false negatives. It also improves the accuracy of identification: even if your model assigns a high probability of being the referee to two separate people, only one of them can be, which means the other has to be someone else.

In other words, by knowing how many of each object you need to identify, the very identification of an object assists in identifying the others in a way that’s not quite true when the quantities are more ambiguous.
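The referee example can be sketched as a small assignment problem. The snippet below is an illustration under stated assumptions: the per-person role probabilities are made-up numbers standing in for a model’s output, and the brute-force search over permutations only works for a handful of people. For 23 people you would use a proper assignment solver, such as SciPy’s linear_sum_assignment.

```python
from itertools import permutations


def assign_roles(scores, roles):
    """Pick the one-to-one assignment of people to roles with the highest total score.

    scores[i][r] is the (hypothetical) model probability that person i has role r.
    roles lists every available slot once, so a role that exists once is used once.
    Brute force over permutations: fine for a toy example, too slow for a full pitch.
    """
    best_total, best = float("-inf"), None
    for perm in permutations(roles, len(scores)):
        total = sum(person[role] for person, role in zip(scores, perm))
        if total > best_total:
            best_total, best = total, perm
    return list(best)


# Hypothetical scores: the model thinks two different people look like the referee.
scores = [
    {"referee": 0.9, "player": 0.1},
    {"referee": 0.8, "player": 0.2},
    {"referee": 0.1, "player": 0.9},
]
roles = ["referee", "player", "player"]

# Labeling each person independently yields two referees.
independent = [max(person, key=person.get) for person in scores]

# Enforcing the known counts leaves only one referee slot to hand out.
constrained = assign_roles(scores, roles)
```

Here the independent labels contain two referees, while the constrained assignment gives the referee slot to the stronger candidate and forces the other person into a player slot.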

Determine if your app needs to run in real-time or offline

Building real-time tracking into your application is sometimes non-negotiable. Self-driving cars need to know whether those things in front of them are people, and security systems have to identify threats before, not after, they happen. But tracking objects in real time comes not only with the typical challenges of minimizing latency (i.e., cost and resources), but also, in some object-tracking applications, with the added hurdle of having to perform the task in an unusual, localized setting, like on a moving vehicle or a robot that might require specialized edge devices.

Offline tracking lets you use the future to inform the past, which can help resolve ambiguities.

In cases where the advantages of real-time tracking are less clear (say, in tracking shopper behavior to figure out how to redesign a floorplan), running models offline can be enormously beneficial. Since speed is less critical, you can run heavier models that wouldn’t be feasible in real-time apps without enormous outlay. But offline data is analytically useful as well. With historical data, time no longer has to move in one direction: the object can be tracked from second 0 to second 10, and in reverse, from second 10 back to second 0. If your application relies heavily on identification, letting the tracker move in both directions can help mitigate problems with occlusion, since you have more information with which to triangulate the specific object.
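As a rough illustration of using the future to inform the past, the sketch below fills an occlusion gap by interpolating between the last position seen before the gap and the first position seen after it. It assumes a track stored as one (x, y) position per frame with None for occluded frames, and straight-line motion through the gap; both are simplifying assumptions for the example. A real-time tracker could only extrapolate from the earlier observations.

```python
def fill_gaps(track):
    """Fill occluded frames using the observations on both sides of each gap.

    track is a list with one entry per frame: an (x, y) position, or None when
    the object was occluded. Gaps at the very start or end have an observation
    on only one side and are left as None.
    """
    filled = list(track)
    known = [i for i, pos in enumerate(track) if pos is not None]
    for left, right in zip(known, known[1:]):
        (x0, y0), (x1, y1) = track[left], track[right]
        span = right - left
        for i in range(left + 1, right):
            t = (i - left) / span  # fraction of the way through the gap
            filled[i] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return filled


# The object disappears for three frames and reappears further along.
observed = [(0.0, 0.0), None, None, None, (8.0, 4.0)]
smoothed = fill_gaps(observed)
```

Linear interpolation is the simplest possible choice here; the same idea extends to running a tracker forward and backward over the footage and reconciling the two passes.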

But even if running offline is ideal, the choice isn’t all-or-nothing. Part of your application might need to track objects as events happen, with less time-sensitive analytics run afterwards for more accurate or more complex measurements. Whether a car was driving at 100 mph or 115 mph at a given point in a race, along with all its instantaneous changes in velocity, is much easier to determine offline, and is probably less interesting to lay viewers than to the car’s owner.
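One reason such measurements are easier offline is that a speed estimate for a given frame can draw on the frames that follow it. The sketch below uses a central difference, which needs both the previous and the next position. It assumes positions have already been converted to meters and that frames are evenly spaced; the function name and units are illustrative assumptions.

```python
def speeds(positions, fps):
    """Estimate speed at each interior frame with a central difference.

    positions is a list of (x, y) coordinates in meters, one per frame, and
    fps is the frame rate. Returns one speed in meters per second for every
    frame that has both a previous and a next frame.
    """
    dt = 2.0 / fps  # time between the previous and the next frame
    result = []
    for prev, nxt in zip(positions, positions[2:]):
        dist = ((nxt[0] - prev[0]) ** 2 + (nxt[1] - prev[1]) ** 2) ** 0.5
        result.append(dist / dt)
    return result
```

For a car that moves 1.5 meters per frame at 30 frames per second, this returns values of about 45 meters per second, roughly 100 mph.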

Minimize the efforts spent on annotation wherever possible

If you’re at the very beginning of road mapping your application, you probably already know that object tracking presents some of the greatest annotation challenges in all of machine learning. Not only do video frames have to be annotated with bounding boxes, identities, and sometimes image segments, but sequences between frames often have to be annotated as well. Depending on your application, you could wind up with a dozen annotations on a single frame to train your model. The fact that annotators have to look at mostly minor variations of the same image over and over doesn’t exactly boost morale.

Sometimes, doing it all frame by frame, one annotation at a time, is the prerequisite for your application. But things aren’t always as dire as they may seem. Consider an application that tracks objects in an industrial setting, like boxes moving on a conveyor belt. The belt will typically move at a steady speed, and so boxes will move at predictable velocities in different segments of the data. In such cases, when you don’t expect objects in your real-world use case to behave in substantially different ways, you can augment your data by replicating it and slightly altering the copies. That captures the most essential features needed to train your tracking model and helps avoid overfitting, all while reducing the amount of manual annotation required.
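A minimal sketch of that kind of augmentation, applied to the annotations only: each copy of an annotated track is shifted by a small random offset, so the labels carry over without anyone redrawing boxes. The box format, function name, and shift size are assumptions for illustration. In a real pipeline the video frames would need the same transformation applied to the pixels.

```python
import random


def augment_track(track, n_copies=3, max_shift=2.0, seed=0):
    """Create jittered copies of an annotated track.

    track is a list of (x1, y1, x2, y2) boxes, one per frame. Each copy shifts
    the whole track by a single small random offset, so box sizes and
    frame-to-frame motion are preserved and the original labels still apply.
    """
    rng = random.Random(seed)  # seeded so the augmented set is reproducible
    copies = []
    for _ in range(n_copies):
        dx = rng.uniform(-max_shift, max_shift)
        dy = rng.uniform(-max_shift, max_shift)
        copies.append(
            [(x1 + dx, y1 + dy, x2 + dx, y2 + dy) for x1, y1, x2, y2 in track]
        )
    return copies
```

Shifting is only one of many possible alterations. Small changes in scale, belt speed, or lighting follow the same pattern, as long as the altered copies stay within what the deployed system will actually see.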

And some tracked objects within the same dataset may require less detailed annotation than others. In training a self-driving car, the car should probably be able to recognize objects like pedestrians with pinpoint accuracy, which can involve annotating those objects not just with bounding boxes, but also with more detailed (and often more effortful) image segmentations. But inanimate objects, like signage and trees, may not prompt the same concerns for accuracy. Identifying when maximum accuracy is actually critical and when it’s not can save substantial time on annotation.

Adding more sensors can help, but they come at a cost

Increasing the amount and variety of data used by your model will generally, and predictably, increase its accuracy and help with common issues such as occlusion. Using multiple cameras to track objects makes it more likely that if the line of sight to an object is cut off in one camera, the object is picked up by another. Adding sensors can also help moving objects localize themselves in space, either by multi-view geometric approaches from multiple cameras, or by using data from different sources, like depth information from LiDAR.

Adding sensors can significantly boost the robustness and performance of your application. But that robustness comes at a cost: those additional sensors introduce algorithmic complexity, heavier hardware burdens, and compute cost.

Deciding if the trade-offs are worth it depends entirely on your use case. If integrating new sensors increases the safety of a self-driving car, more sensors are probably warranted. But if they simply add marginal accuracy to velocity measurements of shipping containers, that gain has to be weighed against costs that are specific to your organization. Sometimes the same accuracy can be achieved in less costly, more artful ways, like thoroughly analyzing offline data.
