Spider Traps and AI Scraping Bots
Learn why bots get stuck in loops while indexing content, how that impacts your site, and how to mitigate the problem.
With the rapid adoption of AI and Large Language Model (LLM) technologies, websites are increasingly targeted by sophisticated web crawlers designed to gather content for AI training datasets. Unlike conventional bots, these newer crawlers often:
- Disregard `robots.txt` directives.
- Ignore HTML attributes such as `rel="nofollow"` and `noindex`.
- Conceal their activities by mimicking regular browser and OS user agents.
- Originate from numerous, varied IP addresses, complicating traditional blocking techniques.
These advanced crawlers can significantly affect your site’s performance and stability.
Understanding Spider Traps
Any site crawler’s primary goal is to scan the content on your website. A crawler accomplishes this by visiting your site, scanning a page for links it has not yet visited, and then repeating the process on each newly discovered page. This becomes problematic on sites that use dynamic content features, such as search or filters, because each unique combination of search terms and filters generates a distinct URL. Crawlers get stuck scanning these dynamic pages in what is called a “spider trap”, which can result in a crawler requesting your site exponentially more times, potentially causing hundreds of thousands of hits or more per crawler (the sketch after the list below shows how quickly these combinations multiply). This can lead to:
- Increased server load due to the high volume of uncached, resource-intensive requests.
- Real users experiencing degraded performance.
- Potential downtime due to overwhelming system resources.
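To make the scale concrete, the following sketch estimates how many distinct URLs a small set of facets can generate. The facet names and value counts are hypothetical, chosen only to illustrate the multiplication; real faceted search pages often expose far more options.

```python
from math import prod

# Hypothetical facets on a filtered listing page; the names and counts are
# illustrative assumptions, not taken from any particular site.
facets = {
    "color": ["red", "blue", "green", "black"],
    "size": ["s", "m", "l", "xl"],
    "brand": [f"brand-{i}" for i in range(20)],
    "sort": ["price_asc", "price_desc", "newest"],
}

# Each facet can be left unset or set to one of its values, and every
# combination produces a distinct, usually uncached, URL a crawler can follow.
unique_urls = prod(len(values) + 1 for values in facets.values())
print(f"Distinct filter URLs: {unique_urls:,}")  # 5 * 5 * 21 * 4 = 2,100
```

Add pagination or free-text search terms to the mix and the count grows without bound, which is why a single trapped crawler can generate hundreds of thousands of requests.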
For a broader overview of bots on the platform, including how to identify them in logs and how Pantheon handles indexing, refer to Bots and Indexing on Pantheon.
Strategies for Managing Spider Traps and Unwanted Crawlers
Pantheon recommends several strategies for effectively handling spider traps and problematic AI-driven crawlers:
- Drupal Solutions: Utilize dedicated modules like Facet Bot Blocker, designed specifically to detect and block excessive crawler activity around faceted searches.
- Edge Solutions: Pantheon’s Advanced Global CDN customers can set up specific rules to monitor and limit traffic patterns typical of bot activity.
- External Security Measures: Employ an external Web Application Firewall (WAF) to detect and manage bot traffic through sophisticated rule sets. For example, Cloudflare users might implement rules to detect and mitigate traffic patterns associated with spider traps.
- Monitoring and Log Analysis: Regularly review your server logs and site traffic to identify abnormal spikes or patterns indicative of bot traffic (see the log-parsing sketch after this list).
- Use of Robots.txt and Nofollow Attributes: For well-behaved bots, such as major search engines (Google, Bing, etc.), ensure you use clear `robots.txt` directives and the `nofollow` attribute on links that crawlers should not follow. This can effectively reduce unnecessary crawling and indexing by compliant bots; an example `robots.txt` follows this list.
By adopting these targeted measures, Pantheon administrators can effectively minimize the impact of unwanted crawler traffic, protecting performance and ensuring stability for legitimate users.