bug#52338: Crawler bots are downloading substitutes
From: Tobias Geerinckx-Rice
Subject: bug#52338: Crawler bots are downloading substitutes
Date: Thu, 09 Dec 2021 16:42:24 +0100
Mathieu Othacehe writes:
> Hello Leo,
>
> + (nginx-location-configuration
> +   (uri "/robots.txt")
It's a micro-optimisation, but it can't hurt to generate ‘location
= /robots.txt’ instead of ‘location /robots.txt’ here.
> +   (body
> +    (list
> +     "add_header Content-Type text/plain;"
> +     "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))
Use \r\n instead of \n, even if \n happens to work.
There are many ‘buggy’ crawlers out there, and it's in their own
interest to be fussy whilst still claiming to respect robots.txt.
The less you deviate from the most basic norm imaginable, the better.
I tested whether embedding raw \r\n bytes in nginx.conf strings
like this works, and it seems to, even though a human would
probably not do so.
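Putting both of the suggestions above together (the exact-match
location and the CRLF line endings), the snippet from the patch would
look roughly like this:

  (nginx-location-configuration
   (uri "= /robots.txt")          ; exact match, as suggested above
   (body
    (list
     "add_header Content-Type text/plain;"
     ;; Guile turns \r\n into raw CR LF bytes in the generated nginx.conf.
     "return 200 \"User-agent: *\r\nDisallow: /nar/\r\n\";")))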
> Nice, the bots are also accessing the Cuirass web interface, do you
> think it would be possible to extend this snippet to prevent it?
You can replace ‘/nar/’ with ‘/’ to disallow everything:
Disallow: /
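Written out in full, the robots.txt body that the location would then
return is simply:

  User-agent: *
  Disallow: /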
If we want crawlers to index only the front page (so people can
search for ‘Guix CI’, I guess), that's possible:
Disallow: /
Allow: /$
Don't confuse ‘$’ (which just anchors the end of the URL here) with
full regexp support. Buggy bots might not understand it and fall
back to plain ‘Disallow: /’.
This is where it gets ugly: nginx doesn't support escaping ‘$’ in
strings. At all. It's insane.
  geo $dollar { default "$"; }  # stackoverflow.com/questions/57466554

  server {
    location = /robots.txt {
      return 200 "User-agent: *\r\nDisallow: /\r\nAllow: /$dollar\r\n";
    }
  }
*Obviously.*
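With that workaround, ‘$dollar’ expands to a literal ‘$’, so the body
crawlers actually receive is the intended:

  User-agent: *
  Disallow: /
  Allow: /$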
An alternative to that is to serve a real on-disc robots.txt.
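As a rough sketch (the ‘/srv/http’ path is only an example, not
something from the patch), that could look like:

  ;; Sketch: point the exact-match location at a robots.txt stored on
  ;; disc; nginx then serves /srv/http/robots.txt for /robots.txt.
  (nginx-location-configuration
   (uri "= /robots.txt")
   (body (list "root /srv/http;")))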
Kind regards,
T G-R