commit f4fd4aeac6132f9dcab060e860e5ff726c5d87ae
Author: revengeday
Date:   Fri Jul 5 20:57:12 2024 +0000

    Add README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..2b55d86
--- /dev/null
+++ b/README.md
@@ -0,0 +1,46 @@
+# 📜 List of AI Web-Crawlers
+
+Welcome to the repository dedicated to maintaining a list of AI web-crawlers, aimed at helping website owners manage and block specific crawlers using the `robots.txt` file. This resource is valuable for anyone who wants to control which AI-driven bots may access their website.
+
+## 🚫 robots.txt
+
+The `robots.txt` file in this repository serves as a guideline for web crawlers, explicitly blocking access for a list of known AI-driven bots.
+
+
+## ⛔ Blocked AI Web-Crawlers
+
+The `robots.txt` file blocks a comprehensive list of AI web-crawlers, such as:
+
+- Google bots (including AdsBot and Google-Extended)
+- Applebot
+- FacebookBot
+- Amazonbot
+- OpenAI bots (GPTBot and ChatGPT-User)
+- Anthropic Claude bots (ClaudeBot and Claude-Web)
+- PerplexityBot
+- Anthropic's general bot
+- Cohere's bot
+- Diffbot
+- img2dataset crawler
+- Various friendly crawlers (e.g., FriendlyCrawler and Bytespider)
+- CCBot
+- Omgili crawlers
+- Peer39 crawlers
+- Russian state-sponsored crawlers (e.g., Awakari)
+- YouBot
+
+For the exact rules and bot names, please refer to the `robots.txt` file in this repository.
+
+## 🛠 Contributing
+
+We welcome contributions to enhance and expand this list. If you know of an AI web-crawler that should be added, follow these steps to contribute:
+
+1. **Fork** the repository.
+2. **Create** a new branch.
+3. **Add** your entry to the `robots.txt` file.
+4. **Open** a merge request.
+
+## 📧 Need an Account?
+
+If you would like to contribute but don't have an account on [git.cyberwa.re](https://git.cyberwa.re), please contact [@revengeday@corteximplant.com](https://corteximplant.com/@revengeday) on the Fediverse to request an account.
+
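For reference, a minimal sketch of the blocking pattern the README describes is shown below. The user-agent tokens are taken from the list in the README; the repository's actual `robots.txt` is the authoritative source and may list different or additional user agents.

```txt
# Illustrative sketch only: disallow a few of the AI crawlers named above.
# See the repository's robots.txt for the complete, authoritative rule set.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Note that `robots.txt` is advisory: well-behaved crawlers honor these directives, but they are not an enforcement mechanism.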