list-of-ai-web-crawlers/README.md
2024-07-05 20:57:12 +00:00

📜 List of AI Web-Crawlers

Welcome! This repository maintains a list of AI web-crawlers to help website owners block specific crawlers through their robots.txt file. It is useful for anyone who wants to control which AI-driven bots may access their website.

🚫 robots.txt

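The robots.txt file in this repository tells web crawlers which parts of a site they may visit, and it explicitly disallows a list of known AI-driven bots. Note that robots.txt is a guideline: it only works for crawlers that choose to honor it.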

Blocked AI Web-Crawlers

The robots.txt file disallows a broad list of AI web-crawlers, including:

  • Google bots (including AdsBot and Google-Extended)
  • Applebot
  • FacebookBot
  • Amazonbot
  • OpenAI bots (GPTBot and ChatGPT-User)
  • Anthropic Claude bots (ClaudeBot and Claude-Web)
  • PerplexityBot
  • Anthropic's general bot
  • Cohere's bot
  • Diffbot
  • img2dataset crawler
  • FriendlyCrawler and Bytespider (ByteDance's crawler)
  • CCBot
  • Omgili crawlers
  • Peer39 crawlers
  • Russian state-sponsored crawlers (e.g., Awakari)
  • YouBot

For the exact rules and bot names, please refer to the robots.txt file in this repository.
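To check that a set of rules actually disallows a given bot, the standard-library `urllib.robotparser` module can be used. The following is a minimal sketch, not part of this repository: `SAMPLE_RULES` is a hypothetical excerpt (the real rules live in the robots.txt file here), and `is_blocked` and the `https://example.com/` URL are illustrative names only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt for illustration; the authoritative rules are in
# this repository's robots.txt and may be formatted differently.
SAMPLE_RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def is_blocked(rules: str, user_agent: str, url: str = "https://example.com/") -> bool:
    """Return True if `rules` disallow `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse from memory, no network access
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "CCBot", "SomeOtherBot"):
        status = "blocked" if is_blocked(SAMPLE_RULES, agent) else "allowed"
        print(f"{agent}: {status}")
```

A bot that is not named in the rules (here the made-up `SomeOtherBot`) is reported as allowed, because the sample contains no wildcard `User-agent: *` entry.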

🛠 Contributing

We welcome contributions to enhance and expand this list. If you know of any AI web-crawlers that should be added, follow these steps to contribute:

  1. Fork the repository.
  2. Create a new branch.
  3. Add your entry to the robots.txt file.
  4. Open a merge request.
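Step 3 above could be scripted roughly as follows. This is a sketch under assumptions: it supposes each entry is a `User-agent:` line followed by `Disallow: /`, which may not match the exact layout of the robots.txt in this repository, and `add_crawler` and `ExampleBot` are hypothetical names.

```python
def add_crawler(robots_txt: str, user_agent: str) -> str:
    """Return `robots_txt` with a disallow-all block for `user_agent` appended.

    The text is returned unchanged if the user agent is already listed
    (compared case-insensitively).
    """
    # Collect every user agent that already has a "User-agent:" line.
    existing = {
        line.split(":", 1)[1].strip().lower()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("user-agent:")
    }
    if user_agent.lower() in existing:
        return robots_txt

    # Assumed entry layout: one agent per block, everything disallowed.
    block = f"User-agent: {user_agent}\nDisallow: /\n"
    if not robots_txt.strip():
        return block
    # Separate the new block from the previous one with a blank line.
    return robots_txt.rstrip("\n") + "\n\n" + block


if __name__ == "__main__":
    current = "User-agent: GPTBot\nDisallow: /\n"
    print(add_crawler(current, "ExampleBot"))
```

Checking for an existing entry first keeps the file free of duplicate blocks when several contributors propose the same crawler.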

📧 Need an Account?

If you would like to contribute but don't have an account on git.cyberwa.re, please contact @revengeday@corteximplant.com on the Fediverse to request an account.