Today, one company, Google, controls nearly all of the world’s access to information on the Internet. Its search monopoly means that for billions of people, their gateway to knowledge, to products, and to the wider Web is in the hands of a single company. Most agree that this lack of competition in search is bad for individuals, communities, and democracy.
Unbeknownst to many, one of the biggest barriers to competition in search is the lack of crawl neutrality. The only way to build an independent search engine, and to have a chance of competing fairly with Big Tech, is to first crawl the Internet effectively and efficiently. However, the web is an actively hostile environment for upstart search engine crawlers: most websites allow only Google’s crawler and discriminate against other search engine crawlers like Neeva’s.
This critically important, yet often overlooked, issue does a great deal to prevent emerging search engines like Neeva from providing users with genuine alternatives, further reducing competition in search. Just as we needed net neutrality, today we need crawl neutrality. Without a change in policy and behavior, search competitors will continue to fight with one hand tied behind their backs.
Let’s start at the beginning. Building a comprehensive web index is a prerequisite for competition in search. In other words, the first step of building the Neeva Search Engine is “downloading the Internet” through Neeva’s crawler, called Neevabot.
This is where the trouble begins. For the most part, websites give unimpeded access only to crawlers from Google and Bing, while discriminating against other crawlers like Neeva’s. These sites either disallow everything else in their robots.txt files or (more commonly) say nothing in robots.txt but return errors instead of content to other crawlers. The intention may be to screen out malicious actors, but the consequence is to throw the baby out with the bathwater. And you can’t serve search results if you can’t crawl the web.
This forces startups to spend inordinate amounts of time and resources finding workarounds. For example, Neeva implements a policy of “crawling a site as long as the robots.txt allows GoogleBot and does not specifically prohibit Neevabot”. Even after a workaround like this, parts of the web that contain useful search results remain inaccessible to many search engines.
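To make that policy concrete, here is a minimal Python sketch of how such a check might look. It is an illustration, not Neevabot’s actual code: the function names `names_agent` and `may_crawl`, the default agent strings, and the reading of “specifically prohibit” as “a robots.txt group names the crawler and disallows the URL” are all assumptions. It relies only on Python’s standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser


def names_agent(robots_txt: str, agent: str) -> bool:
    """Return True if any User-agent line in robots.txt mentions `agent`."""
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            if agent.lower() in value.strip().lower():
                return True
    return False


def may_crawl(robots_txt: str, url: str,
              own_agent: str = "Neevabot",
              reference_agent: str = "Googlebot") -> bool:
    """Hypothetical policy: crawl if the reference agent is allowed and
    our own agent is not explicitly singled out and disallowed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Condition 1: robots.txt must allow the reference crawler.
    if not parser.can_fetch(reference_agent, url):
        return False

    # Condition 2: honor any rule that targets our crawler by name.
    # A blanket "User-agent: *" block does not count as a specific ban.
    if names_agent(robots_txt, own_agent) and not parser.can_fetch(own_agent, url):
        return False

    return True
```

Under this sketch, a site that allows Googlebot and blocks everyone else through a `User-agent: *` group would still be crawled, while a site that names Neevabot in a `Disallow` group would be left alone.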
As a second example, many websites nominally allow a non-Google crawler via robots.txt but block it in other ways, either by returning various types of errors (503s, 429s, …) or by rate limiting. Crawling these sites requires deploying workarounds such as “obfuscate by crawling using a bank of proxy IP addresses that rotate periodically.” Legitimate search engines like Neeva are loath to deploy adversarial workarounds like this.
These barriers are often aimed at malicious bots, but they have the effect of stifling legitimate search competition. At Neeva, we put a lot of effort into building a well-behaved crawler that respects rate limits and crawls at the minimum rate needed to build a great search engine. Meanwhile, Google has carte blanche. It crawls the web at 50 billion pages a day. It visits every page on the web once every three days and taxes network bandwidth on all websites. This is the Internet monopolist’s tax.
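What “well-behaved” means in practice varies from crawler to crawler. As one hypothetical illustration (not a description of Neevabot’s internals), a crawler might choose how long to wait before its next request to a host from the HTTP status code and the standard `Retry-After` header, backing off exponentially when a site answers with 429 or 503. The function name `polite_delay` and every default value below are assumptions made for this sketch.

```python
from typing import Optional


def polite_delay(status: int,
                 retry_after: Optional[str] = None,
                 attempt: int = 0,
                 base_seconds: float = 1.0,
                 max_seconds: float = 300.0) -> float:
    """Seconds to wait before the next request to the same host.

    `attempt` counts consecutive throttled responses seen so far.
    """
    # Normal responses: keep to the baseline per-host delay.
    if status not in (429, 503):
        return base_seconds

    # Prefer the server's own hint when it is given in seconds.
    if retry_after is not None:
        try:
            hinted = float(retry_after)
            return min(max(hinted, base_seconds), max_seconds)
        except ValueError:
            pass  # Retry-After may also be an HTTP date; ignored in this sketch

    # Otherwise back off exponentially, up to a cap.
    return min(base_seconds * 2 ** (attempt + 1), max_seconds)
```

The point of the sketch is that slowing down on request costs a legitimate crawler very little, which is why being blocked outright, rather than throttled, is the harder problem.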
For the lucky crawlers among us, a group of well-meaning supporters, webmasters, and publishers can help get your bot whitelisted. Thanks to them, Neeva’s crawl now runs at hundreds of millions of pages per day, on track to reach billions of pages per day soon. Even so, this still requires identifying the right people to talk to at each company, sending cold emails and making cold calls, and hoping for goodwill from webmasters on webmaster aliases, which are generally ignored. It is a stopgap that does not scale.
Getting permission to crawl shouldn’t be about who you know. There should be a level playing field for everyone who competes and follows the rules. Google is a search monopoly, and websites and webmasters face an impossible choice: let Google crawl them, or fail to appear prominently in Google results. As a result, Google’s search monopoly causes the Internet as a whole to reinforce that monopoly by giving Googlebot preferential access.
The Internet should not be allowed to discriminate between search engine crawlers based on who they are. Neeva’s crawler is capable of crawling the web at the speed and depth of Google’s. The limits are not technical; they are anti-competitive market forces that make it harder to compete fairly. And if it is too much extra work for webmasters to tell bad bots that slow down their websites apart from legitimate search engines, then those with carte blanche, like GoogleBot, should be required to share their data with responsible actors.
Regulators and policymakers must step in if they care about competition in search. The market needs crawl neutrality, similar to net neutrality.
Vivek Raghunathan is co-founder of Neeva, a private ad-free search engine. Asim Shankar is Neeva’s Chief Technology Officer.