On Wed, Sep 17, 2025 at 09:33:25AM -0400, Paul Koning via cctalk wrote:
> A web crawler that does not obey robots.txt is not a law abiding outfit.
> Best would be to block it entirely. If they are that dismissive of
> honesty, they are also unlikely to pay attention to such matters as
> copyright and intellectual property ownership.
So, you want to block the whole of the Internet, including every AI company, since
they all ignore robots.txt?
All of the AI companies have already been sued by the book publishers for outright
pirating all of Z-Lib and Scilib. Several have settled.
Meta admitted it, arguing "we only torrented the books, we didn't share them" (not
how that works).
Besides Cloudflare (which already has a vested interest in this), the constant AI
scraping has prompted solutions such as
https://anubis.techaro.lol/
which forces browsers to do proof-of-work before connecting, so sites can protect
their content.
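For what it's worth, the trick is basically hashcash: the server hands the browser a
random challenge, a bit of JavaScript grinds SHA-256 hashes until it finds a nonce
with enough leading zero bits, and only then does the page get served. A rough
sketch of the idea in Python (illustrative only, not Anubis's actual code; the
function names and difficulty setting are mine):

    import hashlib
    import itertools
    import os

    DIFFICULTY_BITS = 20  # illustrative: roughly a million hash attempts on average

    def issue_challenge() -> str:
        """Server side: hand the visitor a random challenge string."""
        return os.urandom(16).hex()

    def solve(challenge: str) -> int:
        """Client side: grind nonces until the hash has enough leading zero bits."""
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0:
                return nonce

    def verify(challenge: str, nonce: int) -> bool:
        """Server side: one cheap hash to confirm the visitor did the work."""
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

    if __name__ == "__main__":
        c = issue_challenge()
        n = solve(c)        # takes a moment for one human visitor
        print(verify(c, n)) # True, checked with a single hash on the server

The cost is negligible for someone loading a page or two, but it adds up fast for a
crawler hammering every URL on the site.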