Disallow Directives in Robots.txt
The disallow directive (added within a website’s robots.txt file) instructs search engines not to crawl a page on a site. This will normally also prevent the page from appearing in search results.
Within the SEO Office Hours recaps below, we share insights from Google Search Central on how they handle disallow directives, along with SEO best practice advice and examples.
For more on disallow directives, check out our article, Noindex, Nofollow & Disallow.
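As a quick refresher, a disallow rule is just a line in the robots.txt file served at the root of a host. The paths below are placeholder examples:

    # https://example.com/robots.txt
    User-agent: *
    Disallow: /private-folder/
    Disallow: /internal-search.html

The User-agent: * line applies the rules to all crawlers; a section such as User-agent: Googlebot would scope them to Google only.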
Use rel=”canonical” or robots.txt instead of nofollow tags for internal linking
A question was asked about whether it was appropriate to use the nofollow attribute on internal links to avoid unnecessary crawl requests for URLs that you don’t wish to be crawled or indexed.
John replied that it’s an option, but it doesn’t make much sense to do this for internal links. In most cases, it’s recommended to use the rel=canonical tag to point at the URLs you want to be indexed instead, or use the disallow directive in robots.txt for URLs you really don’t want to be crawled.
He suggested figuring out if there is a page you would prefer to have indexed and, in that case, use the canonical — or if it’s causing crawling problems, you could consider the robots.txt. He clarified that with the canonical, Google would first need to crawl the page, but over time would focus on the canonical URL instead and begin to use that primarily for crawling and indexing.
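As a rough sketch of the two options John describes (all URLs here are made up), a rel=canonical points Google at the version you want indexed, while a robots.txt rule stops the unwanted URLs from being crawled at all:

    <!-- On https://example.com/shoes?sort=price, point Google at the preferred URL -->
    <link rel="canonical" href="https://example.com/shoes">

    # robots.txt alternative: stop the parameterised URLs being crawled at all
    # (uses Google's wildcard matching)
    User-agent: *
    Disallow: /*?sort=

As John notes, the canonical still requires the duplicate URL to be crawled at least occasionally, whereas the robots.txt rule prevents crawling but also means Google never sees what is on those URLs.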
APIs & Crawl Budget: Don’t block API requests if they load important content
An attendee asked whether a website should disallow subdomains that are sending API requests, as they seemed to be taking up a lot of crawl budget. They also asked how API endpoints are discovered or used by Google.
John first clarified that API endpoints are normally used by JavaScript on a website. When Google renders the page, it will try to load the content served by the API and use it for rendering the page. It might be hard for Google to cache the API results, depending on your API and JavaScript set-up — which means Google may crawl a lot of the API requests to get a rendered version of your page for indexing.
You could help avoid crawl budget issues here by making sure the API results are cached well and don’t contain timestamps in the URL. If you don’t care about the content being returned to Google, you could block the API subdomains from being crawled, but you should test this out first to make sure it doesn’t stop critical content from being rendered.
John suggested making a test page that doesn’t call the API, or uses a broken URL for it, and seeing how the page renders in the browser (and for Google).
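If you do decide to block the API, remember that robots.txt applies per hostname, so the rule would sit on the API subdomain itself. A hypothetical set-up for api.example.com might look like this, and should only go live once you’ve confirmed pages still render correctly without the API responses:

    # https://api.example.com/robots.txt
    User-agent: *
    Disallow: /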
Either Disallow Pages in Robots.txt or Noindex Them, Not Both
Noindexing a page and also blocking it in robots.txt means the noindex will never be seen, as Googlebot won’t be able to crawl the page. John recommends using one or the other, not both.
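For example, the combination below is self-defeating, because the disallow stops Googlebot from ever fetching the page that carries the noindex (URLs are placeholders):

    # robots.txt
    User-agent: *
    Disallow: /old-landing-page/

    <!-- On /old-landing-page/ - never seen, because the page can't be crawled -->
    <meta name="robots" content="noindex">

If the goal is to keep the page out of the index, use the meta noindex (or an X-Robots-Tag header) and leave the URL crawlable; if the goal is purely to save crawling, use the disallow on its own.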
Disallowed Pages With Backlinks Can Be Indexed by Google
Pages blocked by robots.txt cannot be crawled by Googlebot. However, if a disallowed page has links pointing to it, Google can determine that it is worth indexing despite not being able to crawl the page.
Google Supports X-Robots-Tag Noindex to Block Images for Googlebot
Google respects an X-Robots-Tag: noindex header served with image responses.
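As a sketch, a response like the one below (with a placeholder image) would keep the file out of Google Images without needing a robots.txt block:

    HTTP/1.1 200 OK
    Content-Type: image/png
    X-Robots-Tag: noindex

How the header is added depends on your server or CDN; the key point is that it is sent with the image response itself.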
Focus on Search Console Data When Reviewing Links to Disavow
If you choose to disavow links, use the data in Google Search Console as this will give you an accurate picture of what you need to focus on.
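If you do go ahead with a disavow, the file uploaded to Google’s disavow tool is a plain text list with one entry per line; the domains and URLs below are invented examples:

    # Individual URLs you want Google to ignore
    https://spammy-site.example/paid-links/page-1.html

    # Or disavow every link from a whole domain
    domain:spammy-site.example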
Block Videos From Search By Adding Video URL & Thumbnail to Robots.txt or Setting Expiration Date in Sitemap
You can signal to Google that a video should not be included in search by blocking the video file and thumbnail image in robots.txt, or by specifying an expiration date in a video sitemap file.
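The two approaches might look roughly like this; the file names, URLs, and date are placeholders, and the sitemap entry is kept minimal:

    # robots.txt: block the video file and its thumbnail
    User-agent: *
    Disallow: /videos/product-demo.mp4
    Disallow: /images/product-demo-thumb.jpg

    <!-- Video sitemap entry with an expiration date -->
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
      <url>
        <loc>https://example.com/product-demo</loc>
        <video:video>
          <video:title>Product demo</video:title>
          <video:description>Short walkthrough of the product.</video:description>
          <video:thumbnail_loc>https://example.com/images/product-demo-thumb.jpg</video:thumbnail_loc>
          <video:content_loc>https://example.com/videos/product-demo.mp4</video:content_loc>
          <video:expiration_date>2023-01-01T00:00:00+00:00</video:expiration_date>
        </video:video>
      </url>
    </urlset>

After the expiration date passes, Google stops showing the video in search results.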
Don’t Rely on Unsupported Robots Directives in Robots.txt Being Respected By Google
Don’t rely on noindex directives in robots.txt, as they aren’t officially supported by Google. John says it’s fine to use robots directives in robots.txt, but make sure you have a backup in case they don’t work.
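For illustration, an unofficial line like the first snippet below shouldn’t be relied on; a supported signal on the page itself acts as the backup John describes (the path is a placeholder):

    # robots.txt - unsupported directive, may be ignored
    Noindex: /old-campaign/

    <!-- Supported backup on the page itself -->
    <meta name="robots" content="noindex">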
Google Uses the Most Specific Matching Rule in Robots.txt
When rules with different levels of specificity exist in robots.txt, Google will follow the most specific matching rule.
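For example, with the hypothetical rules below, URLs under /shop/sale/ can still be crawled by Google, because the longer, more specific Allow rule outweighs the shorter Disallow:

    User-agent: Googlebot
    Disallow: /shop/
    Allow: /shop/sale/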
Check Robots.txt Implementation if Disallowed URLs Accessed by Googlebot
Googlebot doesn’t simply ignore URLs that are disallowed in robots.txt files. If disallowed pages are being crawled, check whether the robots.txt file has been set up incorrectly on the server side using Google’s robots.txt tester. Also check that there’s nothing on the server logging URLs as being accessed in one way when they are actually requested in a different way; this can occur with URL rewriting.