I was wrong about robots.txt

KarlHeinzSchwuke@feddit.org · 5 months ago


General_Effort@lemmy.world · 5 months ago

What did he think a crawler is? Why was he surprised that not allowing companies to use his data led to them not using his data? Looks like he has another surprise coming when he notices that search engines no longer index his blog.

Archr@lemmy.world · edit-2 5 months ago

I feel like most casual users would not make the connection between “crawlers” and the link previews that they talk about in the article.

Sure, if you understand that robots.txt applies to all robots, then it makes sense. But that is not how general news media has been talking about robots.txt.
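
To make the "all robots" point concrete, here is a rough sketch using Python's standard `urllib.robotparser`. The two robots.txt bodies, the `allowed` helper, and the example URL are made up for illustration; `facebookexternalhit` and `GPTBot` are just example user-agent names for a link-preview fetcher and an AI crawler. It only shows how a well-behaved client would read the rules, not what any particular company's bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks every robot, including link-preview fetchers.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical alternative: block one named crawler, allow everything else.
TARGETED = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative URL; nothing is fetched over the network.
URL = "https://example.com/blog/post"


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a rule-abiding client with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for name, rules in (("block all", BLOCK_ALL), ("targeted", TARGETED)):
        for agent in ("facebookexternalhit", "GPTBot"):
            print(f"{name:10} {agent:20} allowed={allowed(rules, agent, URL)}")
```

With the blanket rule, the preview fetcher is shut out along with everything else. Naming the crawler you actually want to block leaves previews alone, assuming the bots involved honor the file at all.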

General_Effort@lemmy.world · 5 months ago

“that is not how general news media has been talking about robots.txt.”

Ahh, yes. I think there is a lesson there.

thedruid@lemmy.world · 5 months ago

So. If I can add something here for everyone’s benefit:

No search engine really obeys robots.txt.

Their publicly acknowledged crawlers do, but they have other crawlers that aren’t known and that ignore the file.

Google knows every inch of your site, allowed or not.

See, just because a search engine says it doesn’t know, doesn’t mean it hasn’t crawled. It just doesn’t display the results, based on your settings.