
Amazon Web Services is looking into allegations that Perplexity AI may be violating its rules by using a crawler hosted on its servers that ignores the Robots Exclusion Protocol. The protocol is a web standard: by placing a robots.txt file at the root of a domain, site operators can indicate which pages automated crawlers are allowed to access. Compliance is voluntary, but most reputable companies have respected these instructions since the standard emerged in the mid-’90s.
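For readers unfamiliar with the mechanics, a compliant crawler fetches a site’s robots.txt and checks each URL against its rules before requesting the page. The sketch below illustrates this with Python’s standard-library urllib.robotparser; the user-agent string and site here are hypothetical placeholders for illustration, not Perplexity’s actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler user-agent and target site, for illustration only.
USER_AGENT = "ExampleBot"
SITE = "https://example.com"

# Fetch and parse the site's robots.txt once per domain.
parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()

# A well-behaved crawler checks each URL against the rules before fetching it.
url = f"{SITE}/articles/some-story"
if parser.can_fetch(USER_AGENT, url):
    print(f"Allowed to crawl {url}")
else:
    print(f"robots.txt disallows {url}; skipping")
```

Nothing technically prevents a crawler from skipping this check; the allegation against Perplexity is precisely that its bot fetched pages without honoring these rules.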

In a recent report, Wired found that a virtual machine hosted on an Amazon Web Services server with the IP address 44.221.181.252 was bypassing robots.txt instructions on websites. The machine is believed to be operated by Perplexity AI and has been scraping content from various publications, including Condé Nast properties, The Guardian, Forbes, and The New York Times. In one experiment, Wired entered article headlines or short descriptions into Perplexity’s chatbot, which returned closely paraphrased versions of the articles without proper attribution.

While Reuters reported that Perplexity is not the only AI company bypassing robots.txt files to gather content for training language models, Amazon’s investigation is currently focused on Perplexity AI. An Amazon spokesperson emphasized that customers using its services must respect robots.txt instructions when crawling websites and must otherwise operate within the law.

In response to Amazon’s inquiries, Perplexity spokesperson Sara Platnick stated that PerplexityBot, which runs on AWS, respects robots.txt instructions. She admitted, however, that the bot may ignore robots.txt in certain situations, such as when a user includes a specific URL in a chatbot query. Perplexity CEO Aravind Srinivas denied accusations that the company ignores the Robots Exclusion Protocol and insisted it is not violating AWS’s Terms of Service.

The episode underscores why web crawlers need to comply with widely accepted standards such as robots.txt: even though compliance is voluntary, ignoring it can put an AI company at odds with publishers and, as in this case, with its own hosting provider’s terms of service. Amazon’s investigation into Perplexity AI is a reminder that ethical, transparent crawling practices remain essential to maintaining trust with consumers and other businesses.