Indexing
For web pages to appear in search results, they must be in Google’s index. Search engine indexing is a complex topic that depends on a number of factors. Our SEO Office Hours Notes on indexing cover a range of best practices and compile the indexability advice Google has shared in its Office Hours sessions, to help ensure your website’s important pages are indexed by search engines.
“Indexed, though blocked by robots.txt” pages aren’t always an issue
One user asked about having hundreds of pages showing as ‘Indexed, though blocked by robots.txt’ in GSC. This only really becomes a problem if the blocked pages are ranking in place of the content you want indexed. Much of the time, pages showing this status in GSC can only be found via a site: search, and even then many of them are omitted from the initial results. It’s highly unlikely that users would ever come across these, so digging into how and why they’re being discovered and indexed becomes a much lower priority. If they are showing up in place of actual content, you need to question why Google isn’t prioritizing the desired version in the way you’d expect it to.
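As a quick, hypothetical illustration (example.com and the /filter/ path are placeholders, not from the discussion), checking whether these blocked pages are surfacing at all might look like:

    site:example.com inurl:/filter/

If the blocked URLs only appear for narrow operator queries like this, and never for the searches your audience actually uses, they are unlikely to be worth prioritizing.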
Use the new “indexifembedded” robots meta tag to control indexing of embedded content
A user asked how to block embedded videos from being indexed separately. John recommends using the new “indexifembedded” robots tag (in conjunction with a standard noindex robots tag) to control which versions of the embedded videos get indexed.
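As a minimal sketch of the combination John describes, the standalone page that hosts the embeddable video would be served with both directives, either in the HTML or as an HTTP response header (the setup here is an assumption about a typical embed page):

    <!-- On the standalone embed page: keep it out of the index on its own,
         but allow it to be indexed as part of the pages that embed it -->
    <meta name="robots" content="noindex, indexifembedded">

Or, as an HTTP header on the same response:

    X-Robots-Tag: noindex, indexifembedded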
Mismatch in number of indexed URLs shown in site:query vs. GSC
One interesting question was why the Google search results of a site:query don’t match what Search Console shows for the same website. John responded that there are slightly different optimizations for site:query.
When site:query is used in Google search to determine the number of indexed URLs, Google just wants to return a number as quickly as possible and this can be a very rough approximation. If you need an exact number of URLs that are indexed, he clarified that you should use Search Console to get this information. GSC is where Google provides the numbers as directly and clearly as possible. These can fluctuate, but overall the number shown in Search Console for the indexing report is the number of URLs you have indexed for a website — and is likely to be more accurate than the site:query results shown in the SERPs.
Use rel=”canonical” or robots.txt instead of nofollow tags for internal linking
A question was asked about whether it was appropriate to use the nofollow attribute on internal links to avoid unnecessary crawl requests for URLs that you don’t wish to be crawled or indexed.
John replied that it’s an option, but it doesn’t make much sense to do this for internal links. In most cases, it’s recommended to use the rel=canonical tag to point at the URLs you want to be indexed instead, or use the disallow directive in robots.txt for URLs you really don’t want to be crawled.
He suggested figuring out whether there is a page you would prefer to have indexed and, if so, using the canonical; if the issue is crawling, you could consider the robots.txt instead. He clarified that with the canonical, Google would first need to crawl the page, but over time it would focus on the canonical URL and begin to use that primarily for crawling and indexing.
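As a rough sketch (the URLs and the sort parameter are illustrative, not from the discussion), the two options look like this:

    <!-- On https://example.com/shoes?sort=price, point at the version you want indexed -->
    <link rel="canonical" href="https://example.com/shoes">

    # robots.txt: stop parameterized URLs from being crawled at all
    User-agent: *
    Disallow: /*?sort=

The trade-off is the one John outlines: the canonical still requires the duplicate URL to be crawled at least occasionally, while the robots.txt rule stops crawling entirely but also stops Google from seeing anything on those URLs, including the canonical itself.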
There are several possible reasons a page may be crawled but not indexed
John explains that pages appearing as ‘crawled, not indexed’ in GSC should be relatively infrequent. The most common scenarios are when a page is crawled and then Google sees an error code, or the page is crawled and then a noindex tag is found. Alternatively, Google might choose not to index content after it’s crawled if it finds a duplicate of the page elsewhere. Content quality may also play a role, but Google is more likely to avoid crawling pages altogether if they believe there is a clear quality issue on the site.
If URLs that are blocked by robots.txt are getting indexed by Google, it may point to insufficient content on the site’s accessible pages
Why might an eCommerce site’s faceted or filtered URLs that are blocked by robots.txt (and have a canonical in place) still get indexed by Google? Would adding a noindex tag help? John replied that the noindex tag would not help in this situation, as the robots.txt block means the tag would never be seen by Google.
He pointed out that URLs might get indexed without content in this situation (as Google cannot crawl them with the block in robots.txt), but they would be unlikely to show up for users in the SERPs, so should not cause issues. He went on to mention that, if you do see these blocked URLs being returned for practical queries, then it can be a sign that the rest of your website is hard for Google to understand. It could mean that the visible content on your website is not sufficient for Google to understand that the normal (and accessible) pages are relevant for those queries. So he would first recommend looking into whether or not searchers are actually finding those URLs that are blocked by robots.txt. If not, then it should be fine. Otherwise, you may need to look at other parts of the website to understand why Google might be struggling to understand it.
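To illustrate the conflict described here (the paths and parameter are hypothetical), a disallowed faceted URL can carry a noindex that Googlebot simply never fetches:

    # robots.txt
    User-agent: *
    Disallow: /category/*?colour=

    <!-- This tag sits on /category/shoes?colour=red, but because of the
         Disallow rule above the page is never crawled, so it has no effect -->
    <meta name="robots" content="noindex">

This is why the blocked URLs can end up indexed without content: Google knows they exist from links, but cannot fetch the page to see either the noindex or the canonical.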
503s can help prevent pages dropping from the index due to technical issues
One user described seeing a loss of pages from the index after a technical issue caused their website to be down for around 14 hours. John suggests that the best way to safeguard your site against outages like this is to have a 503 rule ready for when things go wrong. That way, Google will see that the issue is temporary and will come back later to check whether it’s been resolved. Returning a 404 or another error status code instead means that Google could interpret the outage as pages being removed permanently, which is why some pages drop out of the index so quickly when a site is down temporarily.
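As a minimal sketch of what that looks like on the wire (the Retry-After value of one hour is an arbitrary hint, not a figure from the session), the outage or maintenance response would be:

    HTTP/1.1 503 Service Unavailable
    Retry-After: 3600
    Content-Type: text/html

    <html><body><h1>We'll be back shortly.</h1></body></html>

The key point is the status code itself: 503 tells Googlebot the condition is temporary, whereas a 404 reads as the page being gone.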
Regularly changing image URLs can impact Image Search
A question was asked about whether query strings for cache validation at the end of image URLs would impact SEO. John replied that it wouldn’t affect SEO but explained that it’s not ideal to regularly change image URLs as images are recrawled and reprocessed less frequently than normal HTML pages.
Regularly changing the image URLs means that it would take Google longer to re-find them and add them to the image index. He specifically mentioned avoiding very frequent URL changes, such as appending a session ID or today’s date, because in those cases the URLs would likely change more often than Google reprocesses them, meaning the images would not be indexed. Regular image URL changes should be avoided where possible if Image Search is important for your website.
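For illustration (the file names and parameters are hypothetical), the difference is between a cache-busting value that only changes when the image itself changes and one that changes on every visit or every day:

    <!-- Reasonable: the version number only changes when the file is updated -->
    <img src="/images/product-hero.jpg?v=3" alt="Product hero">

    <!-- Risky for Image Search: the URL changes more often than Google
         reprocesses images, so it may never settle in the image index -->
    <img src="/images/product-hero.jpg?session=8f2c1a&amp;date=2024-05-01" alt="Product hero">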
There’s generally no SEO benefit to repurposing an old or expired domain
When asked about using old, parked domains for new sites, John clarifies that users will still need to put the work in to get the site re-established. If the domain has been out of action for some time and comes back into focus with different content, there generally won’t be any SEO benefit to gain. In the same vein, it typically doesn’t make sense to buy expired domains if you’re only doing so in the hopes of a visibility boost. The amount of work needed to establish the site would be similar to using an entirely new domain.
Best practices for canonicals on paginated pages can depend on your wider internal linking structure
John tackled one of the most common questions asked of SEOs: how should we be handling canonical attributes on paginated pages? Ultimately, it depends on the site architecture. If internal linking is strong enough across the wider site, it’s feasible to canonicalize all paginated URLs to page 1 without content dropping from the index. However, if you rely on Google crawling pages 2, 3, and so on to find all of the content you want to be crawled, make sure that paginated URLs self-canonicalize.
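As a sketch of the two approaches on page 2 of a listing (the URLs are hypothetical):

    <!-- Option 1: canonicalize to page 1. Only safe if everything on the deeper
         pages is also reachable through other internal links -->
    <link rel="canonical" href="https://example.com/blog/">

    <!-- Option 2: self-canonicalize. Use this if Google needs to crawl the
         paginated pages themselves to discover all of your content -->
    <link rel="canonical" href="https://example.com/blog/page/2/">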