Google-Selected Canonical Blocked by robots.txt

I think we’re on the way to resolving this one, but it seemed worth posting so others are aware of it, and to share my disbelief that this could happen.

I’m working with a client who’s on Magento, so we’ve got a fairly robust robots.txt in place to manage the flood of automatically created pages. One of the rules is:

Disallow: *?*=

From what I understand, this is fairly standard for Magento (though I’d be happy to hear other perspectives if anyone has a better approach).
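For anyone less familiar with Magento setups, that directive sits under the usual wildcard user-agent block. The sketch below shows roughly how it fits; the other Disallow lines are just illustrative examples of common Magento exclusions, not our exact file:

User-agent: *
# Block any URL containing a query-string key=value pair (layered navigation, sorting, page limits, etc.)
Disallow: *?*=
# Examples of other typical Magento exclusions
Disallow: /checkout/
Disallow: /customer/
Disallow: /catalogsearch/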

About a month ago, the site’s visibility in Ahrefs absolutely plummeted, seemingly coinciding with the recent spam update, so all attention has been on that: reviewing backlinks, updating content, page titles, and so on.

Then we discovered something in Search Console: all of the category pages have a “Google-selected canonical” that includes ?product_list_limit=all.

So /category-url is being ignored, and Google has selected /category-url?product_list_limit=all as the canonical. However, because of the rule in robots.txt, that page is blocked from crawling.

This isn’t a new rule, so it leaves me wondering why Google has suddenly started favoring pages it can’t even crawl over the pages it has had indexed for a long time. The canonical tags on the site are set up as expected.

For now, I’ve added the rule below. The pages are showing as crawlable again in Search Console, and I’m just waiting for them to be reindexed:

Allow: *?product_list_limit=all
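In case it helps anyone hitting the same thing: as I understand Google’s robots.txt handling, when Allow and Disallow rules both match a URL, the more specific (longer) rule wins, so this Allow should override the broader Disallow for these URLs even though the Disallow stays in place. Assuming the rules sit under the usual wildcard user-agent block, the relevant part of the file now effectively reads:

User-agent: *
# Still blocks parameterized URLs in general
Disallow: *?*=
# Longer, more specific rule, so it takes precedence and reopens these pages to crawling
Allow: *?product_list_limit=all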

I’m curious whether anyone has any thoughts on this. It feels like an error on Google’s part: I can understand the logic that they want to show the “all products” version of the page, but surely not when that version is blocked from crawling.