Whether it’s an advanced virtual reality headset, a household robot, or simply an autopilot for our flying cars, a huge proportion of the dreams of science fiction require computers to see. A machine can’t interact with the physical world if it doesn’t know what’s in that world, but engineers long struggled to teach computers how to extract a useful spatial understanding from two-dimensional, and eventually even three-dimensional, images. Now, machine learning is well on the way to solving that problem by giving computers the power of sight.
The breakthrough came with the switch away from the direct insights of software engineers and toward the slow evolutionary process made possible only with neural networks. Suddenly the job of a computer vision developer changed from designing the underlying rules of sight to building datasets that allow the development of those same rules through machine learning.
By leaving the actual development of sight to the slow, iterative learning process and focusing instead on providing the resources needed by this process, developers suddenly found that computers could not just see, but even begin to assign some measure of understanding to what they see.
Using Datasets to Give Machines the Power of Vision
In machine learning, a dataset is a curated collection of information that’s organized to allow useful learning on a specific topic. So, when Google famously wanted to teach a program to identify videos of cats, it first had to create a number of datasets to be used by its nascent cat-finding neural network. The dataset has to contain not only videos of cats and non-cats but also metadata that specifies the true answer for each one: cat-containing or non-cat-containing.
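To make the idea of "videos plus the true answer" concrete, here is a minimal Python sketch of what one labeled record might look like, and how those labels let us score a set of guesses. The `LabeledClip` class, the file names, and the `accuracy` helper are all hypothetical, invented for illustration; this is not how Google structured its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledClip:
    """One dataset entry: the raw item plus its human-supplied answer."""
    path: str           # hypothetical file name of the video
    contains_cat: bool  # the "true answer" recorded during curation


# A tiny, made-up dataset of cats and non-cats.
dataset = [
    LabeledClip("clip_001.mp4", True),
    LabeledClip("clip_002.mp4", False),
    LabeledClip("clip_003.mp4", True),
    LabeledClip("clip_004.mp4", False),
]


def accuracy(predictions, dataset):
    """Fraction of clips where the guess matches the annotated label."""
    correct = sum(
        guess == clip.contains_cat
        for guess, clip in zip(predictions, dataset)
    )
    return correct / len(dataset)


# Guesses from some imaginary model; the second one is wrong.
guesses = [True, True, True, False]
score = accuracy(guesses, dataset)
```

Without the `contains_cat` field there would be nothing to compare the guesses against, which is the whole point of the metadata.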
Without this curation of the dataset, the neural network has no way to know whether a given run was successful in its cat-guessing or not. And it’s the feedback from correct and incorrect guesses that provides the context for machine learning algorithms to restructure a neural network to be better at solving a given problem.
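The feedback loop can be sketched in a few lines of Python. This example swaps in a single perceptron, about the simplest learning unit there is, as a stand-in for a real neural network, and it assumes each item has already been boiled down to two made-up numeric features. Every name and number here is an assumption for illustration only.

```python
def predict(weights, bias, features):
    """Guess 1 (cat) if the weighted sum is positive, otherwise 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train(samples, epochs=20, learning_rate=0.1):
    """Perceptron learning rule: adjust weights only after a wrong guess."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            # error is 0 for a correct guess, +1 or -1 for a wrong one
            error = label - predict(weights, bias, features)
            weights = [
                w + learning_rate * error * x
                for w, x in zip(weights, features)
            ]
            bias += learning_rate * error
    return weights, bias


# Hypothetical features: (has_whiskers, has_wheels); label 1 means "cat".
samples = [
    ((1.0, 0.0), 1),
    ((1.0, 0.0), 1),
    ((0.0, 1.0), 0),
    ((0.0, 0.0), 0),
]

weights, bias = train(samples)
predictions = [predict(weights, bias, features) for features, _ in samples]
```

Notice that the label appears in exactly one place, the `error` calculation. Take the labels away and the model has no signal telling it which way to adjust.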
So, the creation of high-quality, highly accurate datasets is a huge concern in the development of neural network models in general, and computer vision models in particular. With a well-formed dataset in hand, along with a well-chosen machine learning algorithm, a developer can largely sit back and wait for their program to improve.
Today, more than five years after Google’s computers began reliably identifying cats, the computer vision space has evolved considerably. Where computers once struggled to identify human faces even in ideal conditions, many home security systems now offer automated facial recognition of visitors (or intruders) in real time. Even seemingly small companies like Bitmoji are digging into the potential of computer vision technology to let users auto-create an avatar that looks like them.
Technologies that Emerge from Computer Vision Technology
Computer vision can start to identify not just general types of objects, but also more nuanced details and nested informational content. This extends from optical character recognition (OCR), which reinterprets the outlines of visible letters as readable text, to algorithmic lip reading, which does much the same for spoken language. With the advent of cheap three-dimensional range-finding cameras like the one in Microsoft’s Kinect, developers have even more possible avenues through which to mine data from human behavior.
In particular, computer vision technology has recently graduated beyond quick face detection, the simple finding of faces within frames, to quick facial identification: the checking of a detected face against a database of known individuals. An even more advanced, and difficult, area of study has to do with so-called “sentiment analysis,” in which the program takes a guess not just at a person’s spoken words but at their emotional affect as well.
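Checking a face against a database of known individuals can be sketched as a nearest-match lookup. This Python example assumes some vision model (not shown) has already turned each face into a short list of numbers, often called an embedding. The vectors, the names, and the 0.9 threshold are all made up for illustration, and real systems differ in the details.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two vectors by direction: 1.0 means identical."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(face_vector, known_faces, threshold=0.9):
    """Return the best-matching known name, or None if nobody is close."""
    best_name, best_score = None, threshold
    for name, reference in known_faces.items():
        score = cosine_similarity(face_vector, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical database of known individuals (vectors are invented).
known_faces = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.1, 0.8, 0.5],
}

visitor = identify([0.88, 0.12, 0.31], known_faces)  # close to "alice"
stranger = identify([0.1, 0.1, 0.99], known_faces)   # close to nobody
```

The threshold is what separates a "visitor" from an "intruder": a face that resembles no stored vector closely enough comes back as `None` rather than as the least-bad match.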
Now, however, the biggest potential source of change for computer vision is based in hardware: the high-quality cameras now found in virtually every smartphone, along with an increasingly powerful computer built right in. This means not only that users will be able to capture photos and video in a wider array of situations, but also that the analysis of those images can be increasingly done without the need to upload the data to a remote server, greatly increasing the possible mobile applications.
Whether it’s an address book that remembers people’s faces or an augmented reality game in the style of Pokemon Go, the advent of quick, ubiquitous computer vision products for mobile devices will change the role of the technology forever.
Computer Vision and Data Collection Will Take Us Into the Future
At Globalme, one of our specialties is the collection, tagging, and overall curation of datasets. Whether it’s for market research or scientific insight, law enforcement or product development, any machine learning product requires a dataset specifically tailored to be the baseline needed for the job at hand.
In the future, computer vision will be tasked with two things: increasingly complex analysis of photos and video, and increasingly quick analysis of the same. Each of these seemingly contradictory goals will require increasingly artful image datasets that allow machine learning algorithms to falsify guesses along increasingly subtle lines. That means that Google and other search engines may soon allow a wide variety of content-based searches for photos and videos; that is to say, searches for images based on what those images contain, rather than on what their titles or metadata tags say.
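Searching by what images contain might look something like the Python sketch below. It assumes a vision model has already labeled each image with the objects it detected; the file names and labels are invented, and the simple index here is only one illustrative way to do the lookup, not a description of how any search engine works.

```python
from collections import defaultdict

# Hypothetical output of a vision model: what each image contains.
detected_contents = {
    "IMG_0001.jpg": {"cat", "sofa"},
    "IMG_0002.jpg": {"dog", "beach"},
    "IMG_0003.jpg": {"cat", "beach"},
}


def build_index(detected):
    """Map each detected label to the set of images that contain it."""
    index = defaultdict(set)
    for filename, labels in detected.items():
        for label in labels:
            index[label].add(filename)
    return index


def search(index, *terms):
    """Return the images that contain every requested term, sorted."""
    matches = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*matches)) if matches else []


index = build_index(detected_contents)
cat_photos = search(index, "cat")
cat_on_beach = search(index, "cat", "beach")
```

None of these file names mention a cat or a beach, yet both searches still find the right photos, because the lookup runs on detected contents rather than on titles or tags.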
But the truly incredible new applications will come through direct application by users. These applications include everything from simultaneous tracking and analysis of crowds of hundreds or even thousands, to always-on wearable facial ID hardware and services, to seemingly impossible goals like accurate, algorithmic lie detection. The ability of wearables to incorporate “always-on” functionality will also be revolutionary, and is only possible thanks to increases in mobile computing power and decreases in the electricity demands of modern neural network architectures.
Possibilities When Providing Machines With the Right Data
In the end, better AI products and services will only come about with the continued evolution of the datasets that facilitate their creation. Some datasets require human annotation of emotions to provide essential context, while others require complex computational changes to advanced image formats. The possibilities for machine learning applications are nearly endless — meaning that the need for properly structured machine learning datasets will continue to be equally vast.
Collecting and curating training data is already one of the most difficult aspects of producing an AI product, and there is every indication that the trend will continue for the foreseeable future. Developers will need more and more specialized experts to help them produce these crucial tools, so machine vision can continue down the path to revolution, a path it has already been walking for more than half a decade.