Sometime around 2013 and 2014, deep learning was going through a revolution that required pretty much everyone to reset their expectations as to how things worked, and leveled the playing field for what people were doing with computer vision.
At least that's the philosophy that Pinterest engineer Andrew Zhai and his team have taken, because around that time he and a few others began working on an internal moonlight project to build computer vision models within Pinterest. Machine learning tools and techniques had been around for some time, but thanks to revelations in how deep learning worked and the increasing use of GPUs, the company was able to take a fresh look at computer vision and see how it would work in the context of Pinterest.
"From a computer vision perspective, we have a lot of images where visual search makes sense," Zhai said. "There's this product/data-set fit. Users that come to Pinterest, they're often in this visual discovery experience mode. We were in the right place at the right time where the technology was in the middle of a revolution, and we had our data set, and we're very focused on iterating as quickly as we can and getting user feedback as fast as we can."
The end result was Lens, a product Pinterest launched earlier this month that allows users to point their camera at an object in the real world and get back search results from Pinterest. While a semi-beta was launched last year, Lens was the result of years of scrapped prototypes and product experimentation that eventually produced something that would hopefully turn the world into a collection of pins searchable through your camera, creative lead Albert Pereta said.
When a user looks at something through Lens, Pinterest's visual detection kicks in and determines what objects are in the photo. Pinterest's technology can then frame the image around, say, a chair, and use that to issue a query using Pinterest's existing search technology. It uses certain heuristics, like a confidence score for what kind of object it is, and the object's context, like whether it is the dominant object, the largest one, the one most in focus or something along those lines. Zhai said part of the priority was leveraging as much of Pinterest's existing technology, like search, as possible to construct its visual search products.
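The heuristics described above can be sketched roughly as follows. This is only an illustration, not Pinterest's implementation: the `Detection` fields, the `pick_query_object` function, the linear scoring and the weights are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical structure, not Pinterest's)."""
    label: str
    confidence: float  # detector confidence, 0..1
    box: tuple         # (x, y, width, height) in pixels
    sharpness: float   # 0..1, higher means more in focus


def area_fraction(det, image_w, image_h):
    """Share of the image covered by the detection's bounding box."""
    _, _, w, h = det.box
    return (w * h) / (image_w * image_h)


def pick_query_object(detections, image_w, image_h, weights=(0.5, 0.3, 0.2)):
    """Pick the object to search for by blending confidence, size and focus.

    The weights are arbitrary illustrative values.
    """
    if not detections:
        return None
    w_conf, w_area, w_sharp = weights

    def score(det):
        return (w_conf * det.confidence
                + w_area * area_fraction(det, image_w, image_h)
                + w_sharp * det.sharpness)

    return max(detections, key=score)


detections = [
    Detection("chair", 0.9, (100, 100, 400, 600), 0.8),
    Detection("lamp", 0.6, (700, 50, 100, 200), 0.4),
]
best = pick_query_object(detections, image_w=1000, image_h=1000)
```

The winning detection's bounding box would then be the crop handed to the existing search stack.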
Pinterest had collected a lot of data from users initially cropping objects in their images in order to search for them, drawing bounding boxes for their searches. The company had positive feedback loops to determine if those searches were correct: if users engaged with results for a chair, then it was probably a chair. With that, the company had lots of ways to initially train these deep learning algorithms in order to shift the process over to camera photos and try to do the same thing. All that paid off later, as the initially janky projects gave the company the critical data set to build something more robust.
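One way to picture that feedback loop is as a way of producing weak training labels from engagement. The sketch below is hypothetical: the event format, the `weak_labels` function and both thresholds are assumptions for illustration, not details from Pinterest.

```python
from collections import Counter, defaultdict


def weak_labels(events, min_engagements=3, min_share=0.6):
    """Turn engagement events into tentative labels for user-drawn crops.

    events: iterable of (crop_id, result_label, engaged) tuples.
    A crop is labeled only when enough users engaged and most of them
    engaged with results for the same label.
    """
    counts = defaultdict(Counter)
    for crop_id, label, engaged in events:
        if engaged:
            counts[crop_id][label] += 1

    labels = {}
    for crop_id, counter in counts.items():
        label, top = counter.most_common(1)[0]
        total = sum(counter.values())
        if total >= min_engagements and top / total >= min_share:
            labels[crop_id] = label
    return labels


events = (
    [("crop-1", "chair", True)] * 3
    + [("crop-1", "table", True)]
    + [("crop-2", "lamp", True)]
)
labels = weak_labels(events)
```

Here `crop-1` gets the label "chair", while `crop-2` has too little engagement to be trusted and stays unlabeled.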
Pinterest's goal was to emulate the service's core user experience: that sort of putzing around and discovering new products or concepts on Pinterest. Just getting the literal results you might expect from a Google visual search wasn't enough to extend the Pinterest experience, beyond its typical search with keywords and concepts, to what you're doing with your camera. There are other ways to get to that literal result, like reading the label on a bottle or asking someone what kind of shoes they are wearing.
"If I'm in my kitchen and have an avocado in front of me, if we point at that and we return a million photos of avocados, that's close to as useless as you can get," Pereta said. "When someone taps an avocado on Pinterest, what they expect is to wander about. It can go from cooking a recipe to health benefits and growing one in a garden. You know the related pins, you don't quite understand why they're there but sometimes they feel like exactly what you want to see."
One of the biggest challenges Pinterest faced was figuring out how to jump from user-generated content like low-quality photos to results that included more professional, high-quality photography. It was easy to map from low-quality photos, like ones that are blurry or without great lighting, to other low-quality photos, visual search engineering director Dmitry Kislyuk said. That's primarily what the results were returning in the first demos that the team was working on, so the team had to figure out how to get to higher-quality results. Each kind of photo clustered with its own kind, so the company had to basically force them to deliver the same semantic results and bucket them together.
Collectively, these all piece together into a strong argument that Pinterest is trying to be a leader in visual search. That's largely been considered one of Pinterest's biggest strengths. Because its big data set lends itself so neatly to products, each part of an image can easily be broken out into searches for other products. These searches existed early on at Pinterest, but only in limited form, and users couldn't figure out what to do with them, but in the past years they've started to mature more and more. The pitch is an example of what's made Pinterest attractive to advertisers, though it needs to ensure it makes the jump from a curiosity baked into an innovation budget to a mainstay product alongside Facebook (and soon potentially Snapchat).
The origins and much of the success of Pinterest's modern visual search dovetail almost perfectly with the rise of GPU usage for deep learning. The processors had existed for a long time, but GPUs are great at running processes in parallel, such as rendering pixels on a screen, and doing it very quickly. CPUs have to be more versatile, but GPUs are specialized for running these kinds of processes in parallel, enabling the actual math that's happening in the background to execute faster. (This revolution has also rewarded NVIDIA, one of the largest GPU makers in the world, by more than tripling its stock price in the past year and turning it into a critical component in the future of deep learning and autonomous driving.)
"Methods for deep learning existed for 10 or 20 years, but it was this one paper around 2013 and 2014 that showed when you ran those methods on a GPU you can get amazing accuracy and results," Zhai said. "It's actually because of the GPU itself, without that this revolution probably wouldn't happen. GPUs only care about these specific things like matrix multiplication, and you can do it really fast."
The actual process is a careful dance between what happens on the phone and what happens online, in order to build a more seamless user experience. For example, when a user looks at something through their phone, the annotations for Lens are returned quickly while the company finishes doing the image search on the back-end. Managing perceived latency that way helps smooth out the experience and makes it feel more real-time. That will be important going forward as Pinterest begins to expand internationally and has to start grappling with problems like high-latency areas, potentially moving more operations to the phone.
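The pattern of showing quick annotations while a slower search finishes can be sketched with concurrent tasks. This is a minimal illustration under stated assumptions: `fast_annotations`, `search_backend`, `lens_query` and the simulated delays are hypothetical stand-ins, not Pinterest's APIs or timings.

```python
import asyncio


async def fast_annotations(photo):
    """Stand-in for the quick annotation step (delay is simulated)."""
    await asyncio.sleep(0.01)
    return ["chair"]


async def search_backend(photo):
    """Stand-in for the slower back-end image search (delay is simulated)."""
    await asyncio.sleep(0.05)
    return ["pin-1", "pin-2"]


async def lens_query(photo, show):
    # Start the slow search right away so it runs in the background.
    search_task = asyncio.create_task(search_backend(photo))

    # Show annotations as soon as they arrive, before the search is done.
    annotations = await fast_annotations(photo)
    show("annotations", annotations)

    # Then fill in the full results once the back-end responds.
    results = await search_task
    show("results", results)
    return annotations, results


def run_demo():
    """Run one query and record the order in which things were shown."""
    events = []
    asyncio.run(lens_query(b"photo-bytes",
                           lambda kind, payload: events.append((kind, payload))))
    return events


events = run_demo()
```

Because the search task starts before the annotations are awaited, total wait time is roughly that of the slower step, while the user sees something almost immediately.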
Pinterest's results were partly the product of a lot of new learnings, and part luck, in that everyone's teams had to scrap and re-learn all their approaches to deep learning. Beyond that, Pinterest has billions of images, largely high-quality ones that lend themselves to being naturally searchable, an archive of data that other companies or academics might not have. The whole "move fast, break things" approach kind of fits with Pinterest, which was trying to get versions in front of users in order to figure out what worked best, because the team (of fewer than a dozen) felt like it was inventing new user behavior.
There are plenty of other attempts by other companies to weaponize this technology into something commercial, with startups like Clarifai raising a lot of capital and building metadata-driven visual search that they make available to retailers and other industries. Google is always a looming beast with its vast amount of data, though whether that translates into a commercial product is another story. Pinterest, meanwhile, hopes that its focus on returning related ideas rather than direct one-to-one image results, and the tech behind it, is something that'll continue to differentiate it going forward.
"We're trying to use camera to turn your world into Pinterest," Pereta said. "It's not that we're creating some completely new experience for a user. It feels like when we nailed it, it's when you feel like the entire world is made of pins. That thing, I take a photo of that chair, it's not only that chair in similar styles but also it in context. If you were to find that chair on Pinterest, that's exactly what you'd expect to find. That wandering, that discovering. When we do a really good job with camera, it's gonna feel like the world is made of pins."