Diffbot Adds Page Classifier API to Help Developers Categorize the Web

Diffbot, the visual robotic software tool that uses machine learning to crawl the web and identify types of content, today released a new beta API for developers. Dubbed the “Page Classifier API,” this new hook for developers provides an analysis engine that’s designed to recognize pages across the entire web and sort them into just 20 basic page types. Back when BetaKit covered Diffbot’s $2 million seed round in May, its engine could only spot two different types of content: front pages and articles. Now, the startup is betting it has the bulk of the web covered, though it’s also open to suggestions.

“We’d released commercial APIs for only articles and front pages, so we had a lot of existing customers pass stuff into our article API that weren’t articles, because they were just piping things in from their users,” Diffbot founder and CEO Michael Tung said in an interview. “If you pass an image into the article API, you’re not going to get a good experience. So one immediate use is for existing customers to redirect the flow of URLs, to only send the article URLs to the article API, and to send the other ones elsewhere.”
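As a rough illustration of the routing Tung describes, the sketch below sorts incoming URLs by page type so that only articles reach an article endpoint. It does not use Diffbot's actual API: `route_urls`, `fake_classify`, and the type labels are hypothetical stand-ins, and a real integration would swap in a call to the classifier service.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def route_urls(
    urls: Iterable[str],
    classify: Callable[[str], str],
    supported_types: Iterable[str] = ("article", "frontpage"),
) -> Dict[str, List[str]]:
    """Group URLs by page type; unsupported types land in an 'other' bucket.

    `classify` is a hypothetical callable that returns a page-type label
    for a URL. In practice it would wrap a request to a classifier service.
    """
    supported = set(supported_types)
    routes: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        page_type = classify(url)
        bucket = page_type if page_type in supported else "other"
        routes[bucket].append(url)
    return dict(routes)


def fake_classify(url: str) -> str:
    """Toy stand-in for a real classifier, keyed on URL shape only."""
    if url.endswith((".jpg", ".png")):
        return "image"
    if "/news/" in url:
        return "article"
    return "frontpage"


if __name__ == "__main__":
    incoming = [
        "http://example.com/",
        "http://example.com/news/story-1",
        "http://example.com/cat.jpg",
    ]
    for bucket, bucket_urls in route_urls(incoming, fake_classify).items():
        print(bucket, bucket_urls)
```

Passing the classifier in as a function keeps the routing logic independent of any particular service, so the toy classifier here can be replaced without touching the rest of the code.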

That’s how the new API will benefit existing customers, but in general, the product is aimed at making sense of a massive amount of incoming URL information that’s not organized by content type or category to begin with. Diffbot is providing a great example of how that might work in practice with its own Chrome extension, which is also being released today. That Chrome extension allows users to see exactly what types of links are being shared on Twitter.com without having to click through to a page. They can just click the link tag, and tweets will expand to show detailed information from the source, including article text if it’s an article, or a photo or video if it’s classified as either of those types of pages (Diffbot has an infographic of information it has found using the extension about what we share on Twitter available here).

Unlike some other solutions, the Diffbot engine can parse a page with multiple elements and decide what the primary focus of that page is. So if you’re looking at a video with an accompanying short article, it’ll recognize that the video is the intended showpiece of the page and serve that up via the API. The Twitter use case is a particularly strong one, since it shows what Twitter client apps could also accomplish using the Diffbot API. Diffbot VP of Product John Davi explained that aside from being able to provide an instant reading view for articles, and viewing windows for image and video pages, another possible application is creating ecommerce windows that recognize product pages at online stores and include a ‘buy-it-now’ link, which could lead to additional monetization opportunities for people aggregating and serving a lot of links from users and other sources.

While this is a big step toward Diffbot’s goal of categorizing the entire web, there’s still plenty left to map. The startup says that the next step is drilling down into the additional individual content types it has identified, and then creating more sophisticated tools for gathering even more specific information. The next page type Diffbot will tackle with an individual API will be photos, Tung said, since they represent a massive percentage of what’s being shared on the web. It’s easy to imagine the applications of being able to identify and relate information about what type of picture is on a given page, especially for tools that use family filters.

On Wednesday, Hopper announced a significant funding round to help it tackle the task of categorizing just one subset of the web’s pages, specifically in the travel vertical. Diffbot is after much bigger fish, so it’ll be interesting to see how it fares with a broadly focused approach compared to those taking on much smaller chunks of the mass of information that makes up the web.