Amazon Investigates Perplexity AI over Scraping Abuse Allegations

June 27, 2024


Amazon’s cloud division is investigating whether Perplexity AI, an AI search startup, is violating Amazon Web Services rules by scraping websites that have tried to block it. Websites commonly use the Robots Exclusion Protocol, a long-standing web standard, to indicate which pages automated bots and crawlers should not access. While the protocol is not legally binding, most companies have traditionally respected it.
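As a rough illustration of how the protocol works in practice, the Python sketch below checks a URL against a robots.txt file using the standard library's `urllib.robotparser`. The robots.txt contents, the `example.com` URLs, the `OtherBot` agent name, and the `is_allowed` helper are all hypothetical, chosen only for this example; they do not reproduce any publisher's actual rules or Perplexity's crawler code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site (not a real publisher's file).
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The named bot is blocked from the whole site.
    print(is_allowed(ROBOTS_TXT, "PerplexityBot", "https://example.com/news/story"))
    # Other bots fall under the wildcard rule and are only kept out of /private/.
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/news/story"))
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/draft"))
```

Nothing in this check is enforced by the website: a crawler has to run it voluntarily before each request, which is why the protocol depends on good-faith compliance.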

Perplexity AI, which is backed by the Jeff Bezos family fund and Nvidia and was recently valued at $3 billion, has been accused of relying on content scraped from websites that have explicitly forbidden access through the Robots Exclusion Protocol. This has raised concerns about scraping abuse and plagiarism by systems linked to Perplexity’s AI-powered search chatbot.

Forbes reported that Perplexity stole at least one of its articles, a claim that WIRED later confirmed. Further investigation found that Perplexity’s crawler had accessed Condé Nast properties from an unpublished IP address, despite being blocked by a robots.txt file. The IP address associated with Perplexity was also detected on the servers of news websites such as the Guardian, Forbes, and The New York Times, indicating widespread crawling of sites that prohibit bot access.

Perplexity CEO Aravind Srinivas initially dismissed the allegations, stating that there was a misunderstanding of how the company operates. He later claimed that the IP address observed scraping websites belonged to a third-party company providing web crawling and indexing services. However, he refused to disclose the name of the company due to a nondisclosure agreement.

In response to Amazon’s investigation, a Perplexity spokesperson stated that the company had not made any changes to its operations and that its crawler, PerplexityBot, respects robots.txt. However, it later emerged that the bot ignores robots.txt in certain instances, contradicting that claim.

Digital Content Next, a trade association for the digital content industry, has expressed concerns about potential copyright violations by AI companies like Perplexity. CEO Jason Kint emphasized that AI companies should not assume they have the right to take and reuse publishers’ content without permission. If Perplexity is found to be disregarding terms of service and robots.txt, it raises serious red flags about the company’s practices.

The investigation into Perplexity’s scraping abuse allegations highlights the importance of upholding ethical standards when using AI technology to access and utilize online content. It serves as a reminder for companies to respect the protocols and guidelines set by websites to protect their intellectual property and ensure fair use of information.