
Disallow Directives in Robots.txt

The disallow directive (added within a website’s robots.txt file) is used to instruct search engines not to crawl a page on a site. Because the page cannot be crawled, this will normally also keep it out of search results, although a disallowed URL can still be indexed if Google finds links pointing to it.

Within the SEO Office Hours recaps below, we share insights from Google Search Central on how they handle disallow directives, along with SEO best practice advice and examples.

For more on disallow directives, check out our article, Noindex, Nofollow & Disallow.
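As a simple illustration (the hostname and paths here are placeholders, not recommendations), a disallow rule in robots.txt looks like this:

```
# https://www.example.com/robots.txt
User-agent: *          # applies to all crawlers
Disallow: /checkout/   # block crawling of the checkout section
Disallow: /search      # block any URL whose path begins with /search

User-agent: Googlebot-Image
Disallow: /            # block Google's image crawler entirely
```

Rules are grouped by user agent, and each Disallow line matches URL paths by prefix.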

Use rel=”canonical” or robots.txt instead of nofollow tags for internal linking

A question was asked about whether it was appropriate to use the nofollow attribute on internal links to avoid unnecessary crawl requests for URLs that you don’t wish to be crawled or indexed.

John replied that it’s an option, but it doesn’t make much sense to do this for internal links. In most cases, it’s recommended to use the rel=canonical tag to point at the URLs you want to be indexed instead, or use the disallow directive in robots.txt for URLs you really don’t want to be crawled.

He suggested figuring out if there is a page you would prefer to have indexed and, in that case, use the canonical — or if it’s causing crawling problems, you could consider the robots.txt. He clarified that with the canonical, Google would first need to crawl the page, but over time would focus on the canonical URL instead and begin to use that primarily for crawling and indexing.
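As a sketch (the URLs are hypothetical), the two options John describes look like this:

```html
<!-- On https://www.example.com/shoes?sort=price, point at the
     version you prefer to have indexed: -->
<link rel="canonical" href="https://www.example.com/shoes">
```

```
# Or, in robots.txt, stop the parameterised URLs being crawled at all:
User-agent: *
Disallow: /*?sort=
```

With the canonical, Google still has to crawl the duplicate at least once; with the disallow, it never crawls the URL, but it also can’t see any canonical or noindex placed on it.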

22 Jun 2022

APIs & Crawl Budget: Don’t block API requests if they load important content

An attendee asked whether a website should disallow subdomains that are sending API requests, as they seemed to be taking up a lot of crawl budget. They also asked how API endpoints are discovered or used by Google.

John first clarified that API endpoints are normally used by JavaScript on a website. When Google renders the page, it will try to load the content served by the API and use it for rendering the page. It might be hard for Google to cache the API results, depending on your API and JavaScript set-up — which means Google may crawl a lot of the API requests to get a rendered version of your page for indexing. 

You could help avoid crawl budget issues here by making sure the API results are cached well and don’t contain timestamps in the URL. If you don’t care about the content being returned to Google, you could block the API subdomains from being crawled, but you should test this out first to make sure it doesn’t stop critical content from being rendered. 

John suggested making a test page that doesn’t call the API (or uses a broken URL for it) and seeing how the page renders in the browser, and for Google.
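If you do decide the API content isn’t needed for indexing, the block goes in the robots.txt of the API host itself, since robots.txt applies per hostname (api.example.com here is a placeholder):

```
# https://api.example.com/robots.txt
User-agent: *
Disallow: /
```

John’s caveat applies: test first (for example by comparing the rendered HTML in the URL Inspection tool) to confirm no critical content disappears from the rendered page.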

22 Jun 2022

Either Disallow Pages in Robots.txt or Noindex, Not Both

Noindexing a page and also blocking it in robots.txt means the noindex will never be seen, as Googlebot won’t be able to crawl the page to read it. Instead, John recommends using one or the other.
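The conflicting combination looks like this (example paths are placeholders). While the robots.txt rule is in place, Googlebot never fetches the page, so the meta tag goes unread:

```
# robots.txt
User-agent: *
Disallow: /old-page
```

```html
<!-- On /old-page itself; never seen while the rule above exists -->
<meta name="robots" content="noindex">
```

To noindex the page reliably, remove the Disallow rule so Googlebot can crawl the page and see the meta tag.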

23 Aug 2019

Disallowed Pages With Backlinks Can be Indexed by Google

Pages blocked by robots.txt cannot be crawled by Googlebot. However, if a disallowed page has links pointing to it, Google may determine it is worth indexing despite being unable to crawl the page.

9 Jul 2019

Google Supports X-Robots Noindex to Block Images for Googlebot

Google respects an X-Robots-Tag: noindex directive served in image response headers, so it can be used to keep images out of Google’s index.
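As an illustration, the response header on an image would look like this; the Apache snippet below it is an assumed setup (it requires mod_headers), and how you add the header depends on your server:

```
HTTP/1.1 200 OK
Content-Type: image/jpeg
X-Robots-Tag: noindex
```

```
# Apache example: add the header to common image types
<FilesMatch "\.(jpg|jpeg|png|gif|webp)$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```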

21 Dec 2018

Focus on Search Console Data When Reviewing Links to Disavow

If you choose to disavow links, use the data in Google Search Console as this will give you an accurate picture of what you need to focus on.

21 Aug 2018

Block Videos From Search By Adding Video URL & Thumbnail to Robots.txt or Setting Expiration Date in Sitemap

You can signal to Google for a video not to be included in search by blocking the video file and thumbnail image in robots.txt or by specifying an expiration date using a video sitemap file.
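A sketch of both options (URLs are placeholders; the sitemap entry assumes the xmlns:video="http://www.google.com/schemas/sitemap-video/1.1" namespace is declared on the enclosing urlset):

```
# robots.txt: block the video file and its thumbnail
User-agent: *
Disallow: /videos/launch.mp4
Disallow: /thumbs/launch.jpg
```

```xml
<!-- Video sitemap entry with an expiration date -->
<url>
  <loc>https://www.example.com/videos/launch</loc>
  <video:video>
    <video:thumbnail_loc>https://www.example.com/thumbs/launch.jpg</video:thumbnail_loc>
    <video:title>Product launch</video:title>
    <video:description>Recording of the product launch.</video:description>
    <video:content_loc>https://www.example.com/videos/launch.mp4</video:content_loc>
    <video:expiration_date>2018-12-31T23:59:59+00:00</video:expiration_date>
  </video:video>
</url>
```

After the expiration date passes, the video is no longer eligible to appear in search results.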

13 Jul 2018

Don’t Rely on Unsupported Robots Directives in Robots.txt Being Respected By Google

Don’t rely on noindex directives in robots.txt, as they aren’t officially supported by Google. John says it’s fine to use robots directives in robots.txt, but make sure you have a backup in case they don’t work.

13 Jul 2018

Google Uses the Most Specific Matching Rule in Robots.txt

When different levels of detail exist in robots.txt Google will follow the most specific matching rule.
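For example, given these rules (paths are placeholders), Google applies the most specific matching rule, i.e. the one with the longest matching path:

```
User-agent: *
Disallow: /folder/
Allow: /folder/page    # more specific, so this URL can still be crawled
```

Here /folder/page is crawlable even though everything else under /folder/ is blocked.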

12 Jan 2018

Check Robots.txt Implementation if Disallowed URLs Accessed by Googlebot

Googlebot doesn’t explicitly ignore URLs in robots.txt files. If Googlebot is crawling disallowed pages, check whether the robots.txt file has been set up incorrectly on the server side, using Google’s robots.txt tester. Also check that nothing on the server is logging URLs as accessed in one form when they were actually requested in another; this can happen with URL rewriting.
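One quick way to sanity-check the rules themselves (separate from any server-side logging issue) is Python’s built-in robots.txt parser. Note it implements the original first-match behaviour rather than Google’s most-specific-rule matching, so it is only a rough check for simple files; the file contents below are an example:

```python
from urllib.robotparser import RobotFileParser

# Parse an example robots.txt from a string (normally you would fetch
# the live file with rp.set_url("https://www.example.com/robots.txt")
# followed by rp.read()).
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "https://www.example.com/private/page"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/public/page"))   # True
```

If the parser says a URL is disallowed but your logs show Googlebot fetching it, the problem is more likely the serving or logging setup than the rules themselves.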

12 Jan 2018
