Reddit stands firm against AI companies scraping content for training without paying

A hot potato: Reddit has been making moves as part of a crackdown on companies indiscriminately scraping the website for AI training purposes. Its philosophy is that AI companies stand to make millions or billions on large language models they are developing with resources they do not own. It’s analogous to someone taking two-by-fours from a lumberyard to build their house just because the yard doesn’t have a locked gate. But the issue goes way beyond Reddit and is central to how the open web has worked so far.

The Robots Exclusion Protocol is a web standard for managing crawler and bot access to websites. Implemented through a robots.txt file at the root of a site, it tells search engines and other crawlers which parts of the site they may crawl, helping webmasters keep bots away from certain content and manage traffic. However, it works on the honor system: compliance is voluntary, and there are few ways to enforce it.
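To make the honor-system point concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The rules, bot names, and URL below are hypothetical and chosen only for illustration; they are not Reddit's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only (not Reddit's real rules):
# one named crawler is allowed everywhere, every other crawler is blocked.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleLicensedBot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/r/some-forum/"  # placeholder URL
    for agent in ("ExampleLicensedBot", "SomeOtherBot"):
        print(agent, is_allowed(EXAMPLE_ROBOTS_TXT, agent, url))
```

Nothing in this check stops a crawler from skipping it and fetching the page anyway, which is why the protocol depends on bots choosing to comply.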

Last week, Ars Technica reported that Reddit posts were not appearing in any search engine except Google. It’s no big mystery why: Reddit already signed a $60 million licensing deal with Alphabet to use its content for training. Meanwhile, Reddit has increasingly ranked at the top of Google searches over the past year (quid pro quo, or maybe not…).

The company also recently notified users that it changed its robots.txt file to exclude bots and crawlers that didn’t have permission to access its data. Reddit CEO Steve Huffman said he believes in an open internet but that companies now use search engine web crawlers to scrape information for profit, a far cry from their historical use. “I think the traditional value exchange from search engines has changed,” Huffman told The Verge.

“Search and summarization and training are merging, and the value exchange of crawling in exchange for traffic back is becoming muddied.”

To this point, Huffman said that blocking companies unwilling to pay for data harvesting has been “a real pain in the ass,” prompting the changes to Reddit’s robots.txt. For the most part, companies have respected Reddit’s wishes, and several, including Microsoft, Anthropic, and Perplexity, have entered negotiations to license its content.

Huffman said that the biggest thorn in his side is that some companies scraping Reddit data are turning around and selling it to other AI firms via their APIs. He specifically called out Microsoft AI CEO Mustafa Suleyman for recently comparing all public data on the internet to “freeware.”

“We’ve had Microsoft, Anthropic, and Perplexity act as though all of the content on the internet is free for them to use,” said Huffman. “That’s their real position.” While Microsoft Bing has been gracious in respecting Reddit’s decision to block its crawlers, the company managed to slip in a denigrating remark.

Microsoft AI CEO Mustafa Suleyman: the social contract for content that is on the open web is that it’s “freeware” for training AI models pic.twitter.com/FN1xrqnJC0

– Tsarathustra (@tsarnick) June 26, 2024

“Reddit has blocked Bing from crawling their site for search, favoring another search engine and impacting competition from Bing and Bing-powered engines,” Microsoft spokesperson Caitlin Roulston said last week. “We honor the directions provided by websites that do not want content on their pages to be used with our generative AI models.”

So far, Google and OpenAI are the only companies on Reddit’s whitelist. If other search engines return anything but outdated Reddit content, then they are not abiding by the website’s robots.txt file.

Reddit profiting from user-generated content through these licensing deals is still a hot potato. On the one hand, the lucrative fees do not go into the pockets of the community who make up Reddit’s forums. On the other hand, these licensing deals are not much different from those of other companies.

OpenAI already pays licensing fees to large publishers like Dotdash Meredith, Axel Springer, the Associated Press, and The Atlantic. It is unconfirmed but doubtful that these publications pass those profits to their writers via raises or bonuses. Does that make it right? No, and the courts are still trying to decide about this unprecedented activity. However, it’s par for the course at this point.

This issue is not limited to Reddit; it affects all online publishers, big and small. In the race against AI training abuse, Reddit is one of the few with the muscle and influence to call out AI companies. While big media companies try to monetize and reach agreements, the rest of the internet is struggling. In fact, some subreddits have their own bots that copy and paste entire articles from original sources and display them as the first comment in the thread, which means Reddit is effectively copying that content and then selling it to AI companies.

Until there are governing regulations, the AI gold rush will be like the California gold rush of 1848. Artificial intelligence firms will continue flocking to shovel AI products down everyone’s throats for profit or to gather more data. Meanwhile, companies like Reddit and Vox will keep handing them the shovels.

Image credit: Jernej Furman
