In the rapidly changing landscape of generative AI, the dynamic between AI companies and news publishers is becoming increasingly complex. While it’s still too early to predict the long-term implications of recent partnerships, one trend is clear: the share of top news outlets blocking OpenAI’s web crawlers is shrinking.

The Data Gold Rush and Publishers’ Responses
The generative AI boom has ignited a fierce competition for data, leading many publishers to take protective measures to safeguard their content. Numerous news organizations have sought to block AI crawlers to prevent their work from being used as training data without consent. A notable example occurred when Apple introduced a new AI agent this summer; several leading news outlets quickly opted out of Apple’s web scraping by utilizing the Robots Exclusion Protocol (robots.txt). With an influx of AI bots, publishers have often felt like they were playing a game of whack-a-mole to manage the situation.
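Opting out this way takes only a few lines in a site’s robots.txt file. The sketch below is illustrative rather than any particular publisher’s actual file; GPTBot is OpenAI’s published crawler user agent, and the wildcard rule is an assumed default for all other bots:

```text
# Illustrative robots.txt: disallow OpenAI's training crawler sitewide
User-agent: GPTBot
Disallow: /

# All other crawlers fall through to a permissive default
User-agent: *
Allow: /
```

Because robots.txt is matched per user agent, a publisher can block one AI crawler this way while leaving search-engine bots untouched.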
OpenAI’s GPTBot, with its high profile, has faced more blocking attempts than competitors such as Google’s AI crawlers. Following GPTBot’s launch in August 2023, the number of prominent media websites using robots.txt to disallow it surged. An analysis by Originality AI, an Ontario-based AI-detection startup, indicated that at the peak, over a third of the news websites the firm tracks had blocked the bot. That figure has since dropped to about a quarter. Among a smaller group of the most prominent news outlets, the block rate remains above 50 percent, though it has fallen from a high of nearly 90 percent earlier this year.
A significant change occurred last May when Dotdash Meredith announced a licensing deal with OpenAI, leading to a notable drop in blocking rates. This decline continued after Vox announced a similar arrangement, and again in August when Condé Nast, the parent company of WIRED, struck a deal. It appears that the trend toward increased blocking has plateaued, at least for now.
The Shift in Publisher Strategies
These fluctuations are understandable. Once companies strike partnerships and grant permission for their data to be used, they no longer have an incentive to barricade it, so they update their robots.txt files to permit crawling. Some outlets, like The Atlantic, unblocked OpenAI’s crawlers the same day they announced their partnership, while others, such as Vox, took several weeks to follow suit.
While robots.txt is not legally binding, it has historically served as the standard governing web crawler behavior. For much of the internet’s existence, webmasters have expected compliance with these files. The situation escalated earlier this summer when WIRED’s investigation found that AI startup Perplexity was likely ignoring robots.txt commands, prompting Amazon’s cloud division to investigate possible violations. The optics of ignoring robots.txt could explain why many prominent AI companies, including OpenAI, explicitly state they adhere to these guidelines. Originality AI CEO Jon Gillham believes this urgency drives OpenAI’s aggressive pursuit of partnerships: “It’s clear that OpenAI views being blocked as a threat to their future ambitions,” he states.
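That compliance is purely voluntary is easy to see in practice: robots.txt is just a text file the crawler is expected to consult before fetching. As a minimal sketch, Python’s standard library ships a parser for it; the rules and URLs below are illustrative, not taken from any real site:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: disallow OpenAI's GPTBot, allow everything else
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# A well-behaved crawler checks before fetching; nothing enforces the answer
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article")) # True
```

Nothing stops a crawler from skipping the check entirely, which is exactly the behavior Perplexity was accused of.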
Current Trends and Future Speculations
To date, OpenAI has formed partnerships with 12 publishers, most of which have adjusted their robots.txt files accordingly. However, there are exceptions. Time magazine, for example, still blocks GPTBot, and the publication did not respond to WIRED’s request for comment on the continued blocking. According to OpenAI spokesperson Kayla Wood, once these agreements are established, traditional crawling methods become less significant: “We leverage direct feeds,” she explains.
Interestingly, some notable media outlets have unblocked OpenAI’s web crawler without announcing any partnership. Data journalist Ben Welsh, who tracks this behavior, observed a slight decline in block rates a few weeks ago; among the outlets that flipped were Alex Jones’ Infowars and the recently revived comedy staple The Onion.
Does this indicate potential undisclosed deals or negotiations with OpenAI? “Absolutely not,” says Onion CEO Ben Collins, explaining that the unblocking likely resulted from the outlet migrating its website to a new hosting service and content management system. “We are not doing any business with the Plagiarism Machine.”
Infowars has not responded to inquiries, but OpenAI has confirmed that it does not have a partnership with the site.
While the initial rush to block OpenAI’s bots seems to have subsided, it’s uncertain whether this lull will persist. Gillham suggests that there may be future spikes in blocking if publishers begin to see it as a bargaining tactic. “Is step one in a negotiation with OpenAI to block them? Does that bring them to the table?” he wonders. Whatever unfolds, this moment is revealing: as publishers initially reacted to the rise of AI scraping bots with a collective impulse to block them, OpenAI’s proactive approach to forming partnerships has cooled this industry-wide drive.