
bug#52338: Crawler bots are downloading substitutes


From: Tobias Geerinckx-Rice
Subject: bug#52338: Crawler bots are downloading substitutes
Date: Thu, 09 Dec 2021 16:42:24 +0100

Mathieu Othacehe writes:

> Hello Leo,
>
> +           (nginx-location-configuration
> +             (uri "/robots.txt")

It's a micro-optimisation, but it can't hurt to generate ‘location = /robots.txt’ instead of ‘location /robots.txt’ here.
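
Something like this should do it, assuming the uri field is pasted verbatim after the ‘location’ keyword in the generated nginx.conf, so the modifier can simply live in the string (the body is unchanged from the patch):

  (nginx-location-configuration
    (uri "= /robots.txt")   ; emits ‘location = /robots.txt { … }’
    (body
      (list
        "add_header  Content-Type  text/plain;"
        "return 200 \"User-agent: *\nDisallow: /nar/\n\";")))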

> +             (body
> +               (list
> +                 "add_header  Content-Type  text/plain;"
> +                 "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))

Use \r\n instead of \n, even if \n happens to work.

There are many ‘buggy’ crawlers out there. It's in their own interest to be fussy whilst claiming to respect robots.txt. The less you deviate from the most basic norm imaginable, the better.

I did test that embedding raw \r\n bytes in the nginx.conf strings like this works, and it seems to, even though a human writing the file by hand probably wouldn't do it that way.
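
If we'd rather keep raw control bytes out of nginx.conf, the escapes can be spelled out on the Scheme side by doubling the backslashes.  A sketch, assuming nginx decodes \r and \n escapes inside its quoted strings:

  (body
    (list
      "add_header  Content-Type  text/plain;"
      ;; "\\r\\n" in Scheme writes a literal ‘\r\n’ into nginx.conf,
      ;; leaving the decoding to nginx rather than embedding raw bytes.
      "return 200 \"User-agent: *\\r\\nDisallow: /nar/\\r\\n\";"))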

> Nice, the bots are also accessing the Cuirass web interface, do you
> think it would be possible to extend this snippet to prevent it?

You can replace ‘/nar/’ with ‘/’ to disallow everything:

 Disallow: /

If we want crawlers to index only the front page (so people can search for ‘Guix CI’, I guess), that's possible:

 Disallow: /
 Allow: /$

Don't confuse ‘$’ support with full regexp support: buggy bots that don't understand the anchor may simply fall back to ‘Disallow: /’ and skip the front page as well.

This is where it gets ugly: nginx doesn't support escaping ‘$’ in strings. At all. It's insane.

  geo $dollar { default "$"; }   # stackoverflow.com/questions/57466554

  server {
    location = /robots.txt {
      return 200
        "User-agent: *\r\nDisallow: /\r\nAllow: /$dollar\r\n";
    }
  }

*Obviously.*
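
In the Guix configuration that might translate to something like the following sketch.  I'm assuming the extra-content field of nginx-configuration ends up inside the http block, which is where geo has to live, and the server name here is only illustrative:

  (nginx-configuration
    ;; Defines the $dollar variable holding a literal ‘$’.
    (extra-content "geo $dollar { default \"$\"; }")
    (server-blocks
      (list
        (nginx-server-configuration
          (server-name '("ci.guix.gnu.org"))   ; illustrative host
          (locations
            (list
              (nginx-location-configuration
                (uri "= /robots.txt")
                (body
                  (list
                    "add_header  Content-Type  text/plain;"
                    "return 200 \"User-agent: *\\r\\nDisallow: /\\r\\nAllow: /$dollar\\r\\n\";")))))))))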

An alternative to that is to serve a real on-disc robots.txt.
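
A sketch of that, assuming we're happy to expose it as /etc/robots.txt via etc-service-type; the wiring is illustrative, not an actual patch:

  ;; Install the file as /etc/robots.txt…
  (simple-service 'robots.txt etc-service-type
                  (list `("robots.txt"
                          ,(plain-file "robots.txt"
                                       "User-agent: *\r\nDisallow: /\r\nAllow: /$\r\n"))))

  ;; …and point nginx at it from the relevant server block.
  (nginx-location-configuration
    (uri "= /robots.txt")
    (body (list "alias /etc/robots.txt;")))

No ‘$’ gymnastics needed there, since nginx sends the file's contents verbatim instead of parsing them as configuration.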

Kind regards,

T G-R


