Australian news sites open to AI crawlers


Joseph Brookes
Senior Reporter

Most of the world’s biggest online publishers are not blocking bots from crawling their content to train generative AI, with some notable exceptions including the ABC and Australia’s regional publishers.

Australia’s most read and most prolific news websites, including news.com.au, Daily Mail, the Nine newspapers, The Australian and The Guardian, appear not to have moved to block the bots by disallowing a crawler that feeds data to ChatGPT.

Regional publisher Australian Community Media has blocked the bot from crawling its publications like The Canberra Times and Newcastle Herald, as has the national broadcaster.

Earlier this month, OpenAI allowed websites to block its web crawler from scraping content to train GPT models by disallowing the crawler in site robots.txt files.
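
For readers curious how such a block works in practice, the following is a minimal sketch, not any publisher's actual configuration. It assumes a site's robots.txt names the GPTBot user agent and disallows the whole site, and it uses Python's standard-library robots.txt parser to check what that rule permits. The example.com URL, the "SomeOtherBot" name and the is_allowed helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (an assumption for this sketch,
# not copied from any publisher's site).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL and crawler names used only for illustration.
gptbot_ok = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/news/story")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/news/story")
print(gptbot_ok, other_ok)
```

Under these assumptions, the rule shuts out GPTBot while leaving other crawlers unaffected. Compliance is voluntary: robots.txt only tells a crawler what the site asks of it, and it is up to the crawler to honour the request.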

The company is backed by Microsoft in a multi-billion dollar deal and uses bots to crawl web pages for content that is then used to train its AI models. OpenAI says it does not use content behind paywalls or personally identifiable information.

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI’s own website says.

The New York Times – the biggest US news site by monthly visits – quickly disallowed the GPTBot crawler by updating its robots.txt file this month. It has also updated its terms of service to forbid its content from being used in the development of “any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system”.

The next most visited US news site, CNN, has blocked GPTBot as well.

The change at the New York Times was first reported by The Verge, a US tech site that has also blocked the GPTBot crawler.

Several other major US news sites have not, however, including the Wall Street Journal, Washington Post, MSN, and Fox News.

International outlet Reuters has disallowed GPTBot, but state media the BBC, RT and Al Jazeera have not.

In Australia, the ABC is the only major national news site to have blocked the crawler so far. The national broadcaster was contacted for comment on Thursday but did not provide a response.

Australian Community Media also appears to have updated the robots.txt files on its news sites like the Canberra Times, Newcastle Herald, and Western Advocate.

But Australia’s regular leader of reader rankings, news.com.au, has not blocked ChatGPT’s crawler. Nor have the online versions of the Sydney Morning Herald, The Age, Australian Financial Review, or The Australian.

The Guardian, which also reported the uneven approach earlier on Friday, still allows GPTBot to crawl its site.

InnovationAus.com has not disallowed the bot but has recently introduced a paywall, in part to stop its articles being accessed by AI.

OpenAI’s ChatGPT is underpinned by a large language model that requires massive amounts of data to function and improve. The company reportedly fed its GPT-3 model 570GB of writing from the internet – around 300 billion words – to train the market-leading chatbot.

But experts see the trawling and exploiting of data as a “privacy nightmare” for individuals because no users have ever consented to having their data train an AI model. It also raises the prospect of copyright breaches and of creators going uncompensated.

The Copyright Agency, a not-for-profit representing Australian creators, has told the current federal inquiry into AI in education that it is concerned that content is being used by AI platforms without permission from the creators and rightsholders.

“This gives rise to serious ethical and legal issues, including concerns about copyright infringement. These issues extend to the use of AI outputs that are derived from unauthorised uses of content,” the submission said.

Australia’s tech industry lobby group, representing large US firms with major investments in generative AI, this week called for clarity on how AI regulation may intersect with copyright, but said tightening Australia’s copyright regime would be a mistake.

“This may limit our potential to design, develop, and train AI/ML models locally, while also potentially posing barriers to inward AI investment and adoption,” the Tech Council said in its submission to a separate consultation on AI regulation.

Google has previously argued that Australia’s copyright laws are stifling innovation by not allowing tech companies exceptions to mine websites for information when training AI tools.

Do you know more? Contact James Riley via Email.
