# 📜 List of AI Web-Crawlers
Welcome to the repository dedicated to maintaining a list of AI web-crawlers, aimed at helping website owners manage and block specific crawlers using the `robots.txt` file. This resource is valuable for anyone wishing to control access to their website by various AI-driven bots.
## 🚫 robots.txt
This `robots.txt` file serves as a guideline for web crawlers, explicitly blocking access for a list of known AI-driven bots.
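Each rule in the file pairs a `User-agent` line with a `Disallow` directive. A minimal sketch of the pattern (the bot names shown are examples from the list below; the actual file in this repository contains many more):

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

`Disallow: /` asks the named crawler to stay away from the entire site. Note that `robots.txt` is advisory: well-behaved crawlers honor it, but it does not technically enforce the block.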
## ⛔ Blocked AI Web-Crawlers
The `robots.txt` file blocks a comprehensive list of AI web-crawlers, such as:
- Google bots (including AdsBot and Google-Extended)
- Applebot
- FacebookBot
- Amazonbot
- OpenAI bots (GPTBot and ChatGPT-User)
- Anthropic Claude bots (ClaudeBot and Claude-Web)
- PerplexityBot
- Anthropic's general bot
- Cohere's bot
- Diffbot
- img2dataset crawler
- FriendlyCrawler
- Bytespider (ByteDance)
- CCBot
- Omgili crawlers
- Peer39 crawlers
- Russian state-sponsored crawlers (e.g., Awakari)
- YouBot
For the exact rules and bot names, please refer to the `robots.txt` file in this repository.
## 🛠 Contributing
We welcome contributions to enhance and expand this list. If you know of any AI web-crawlers that should be added, follow these steps to contribute:
1. Fork the repository.
2. Create a new branch.
3. Add your entry to the `robots.txt` file.
4. Open a merge request.
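A new entry should follow the same pattern as the existing rules. For example, to block a hypothetical crawler named `ExampleBot` (the name here is a placeholder, not a real bot):

```
User-agent: ExampleBot
Disallow: /
```

When adding an entry, use the bot's exact user-agent token as published by its operator, since `robots.txt` matching is based on that token.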
## 📧 Need an Account?
If you would like to contribute but don't have an account on git.cyberwa.re, please contact @revengeday@corteximplant.com on the Fediverse to request an account.