We have experienced an avalanche of AI news in recent months, not least due to the perceived chatbot wars between the likes of OpenAI, Google Gemini, Anthropic, and others. But the sheer volume of media output concerning AI innovation can be exhausting, and it also means that important news stories can get lost in the mix. Arguably, a case in point is the growing legal battle between legacy news media and some AI companies, particularly OpenAI and Perplexity, over data scraping and web crawlers.
For example, both Forbes and Wired Magazine have been beating the drum against Perplexity of late. For those not aware, Perplexity AI is most noted for its “answer engine”, a kind of hybrid of a chatbot and a search engine that provides direct answers. Perplexity, which is now valued at over $3 billion, has been accused by those media companies of plagiarism and outright theft. Wired claims Perplexity’s web crawlers ignored the “robots.txt” file guardrail, which essentially acts as a gatekeeper to bots trying to access data from a particular website.
Web crawlers make the internet tick
Web crawling is, of course, essential to the modern internet, both for established internet gatekeepers and for new AI companies. The phrase is often talked about in negative terms when the practice is frequently benign. For example, Google uses web crawlers to index the web, which is what makes search (and the whole discipline of SEO) possible. It might crawl an online casino website, for instance, to make its pages searchable on the internet, but it would stop short of monitoring online slots players’ individual activity. In short, web crawlers are useful and make the web tick.
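To make the robots.txt “gatekeeper” idea concrete, here is a minimal sketch of how a well-behaved crawler checks a site’s rules before fetching a page. The bot name, paths, and URLs below are illustrative placeholders, and the example uses Python’s standard `urllib.robotparser` module:

```python
# Sketch: how a polite crawler consults robots.txt before fetching a page.
# "ExampleBot" and the example.com URLs are hypothetical placeholders.
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# ExampleBot is barred from /private/ but may crawl everything else.
print(parser.can_fetch("ExampleBot", "https://example.com/private/data"))    # False
print(parser.can_fetch("ExampleBot", "https://example.com/articles/news"))   # True
```

The catch at the heart of the Wired allegation is that robots.txt is purely advisory: nothing technically stops a crawler from skipping this check and fetching the page anyway.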
The problem, of course, is that legacy media is not happy that some AI companies seem – and we should stress the word “seem” for legal reasons – to be bypassing guardrails to train on the data contained within their websites. Some of these platforms hold a de facto record of the world’s economic, social, political, and cultural history, and they do not simply want to hand it over without serious financial recompense. How serious? Well, consider that the New York Times is suing OpenAI and Microsoft for “billions of dollars.”
Legacy media will likely go to court
The lawsuit from the “Paper of Record” might sound somewhat frivolous, but it is indicative of a growing concern among legacy media outlets that their business models may become defunct. Consider the following: how likely is it that we will one day soon ask Gemini or ChatGPT for a summary of the daily news rather than open up the websites of the New York Times, CNN, or the BBC? It’s certainly not that far away, and, to some extent, it is already possible.
The key point for legacy media is that it believes it has provided the foundation for the training of AI LLMs, and it now looks enviously at the billions of dollars pouring into AI companies. Despite all the advances in AI, we must remember that LLMs don’t think: they ‘predict’ a series of words based on probabilities, and the only way they can do this is through access to data. The Times, Wired, Forbes, and others like the Guardian believe they should be financially compensated for providing that data – both in the past and in the future.
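The “predicting words based on probabilities” idea can be shown with a toy sketch. Real LLMs use neural networks trained on vast datasets, not simple word counts, but the underlying principle – the next word is whatever the training data makes most probable – is the same. The tiny corpus below is purely illustrative:

```python
# Toy next-word prediction: count which word most often follows each word
# in a tiny corpus, then "predict" by picking the most probable continuation.
# This is a deliberately simplified stand-in for what LLMs learn at scale.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

# Tally bigram counts: how often each word follows each other word.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the most frequently observed next word, or None if unseen."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" -- it follows "the" twice, "mat" only once
```

The lesson for the legal dispute is plain: the model is only as good as the text it counted, which is exactly why publishers argue their archives have real monetary value.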
Of course, there are solutions. Look at the huge deal that Google struck with Reddit for LLM training. Reddit, which is sometimes a bit eclectic, is nonetheless a repository for some of the most important questions and answers on the internet. If you want to know a Star Wars fan theory or learn a trick to change a tire quickly, Reddit is the place to go, so it’s natural that AI companies will want to train on that very ‘human’ data.
Yet, there is a difference between the publicly available data on Reddit and the information contained within a paywalled article. Even if the article is free to all, these legacy outlets believe it should be read on their platforms, allowing them to recoup money through advertising and the like. Deals can – and will – be struck. But make no mistake about it: this row is starting to bubble over, and we may see huge legal battles in the coming months and years over how and where AI models are trained.