
Sitemap Audits and Advanced Configuration


While Google and other search engines are getting better at finding pages on their own, Sitemaps can help by giving them extra information about your pages and helping them to crawl more efficiently.

In this post we’ll cover general Sitemap rules and advanced configuration. For a full guide to Sitemap implementation, please see Google Webmaster support.
 

Who needs Sitemaps?

 

Large sites:

If you have a large number of pages that constantly churn, with old pages expiring and new ones being created every day, search engines might have to crawl through thousands of existing pages to find the few hundred new pages created. Sitemaps can help them find the new content quickly.
 

Publishing sites:

If your site is set up to be indexed in Google News, an XML Sitemap containing content less than 48 hours old, with additional metadata, can significantly improve the indexing of that content, even if the web crawler has problems reaching it.
 

Uncrawlable Sites:

In the early days of the web, many websites were built with content accessible through forms, which search engines could not crawl. Sitemaps were a way to help work around this problem. However, most websites have been completely rebuilt since this problem was understood so it’s been solved in most cases with good internal linking.
 

Everyone?

A Sitemap might not be required if you have a small or optimized site, but having one does mean you’ll get extra Webmaster Tools reports that give great feedback on indexing problems. Consider implementing one as a way to get more information on how your site is performing.
 

Creating Sitemaps: A Quick Guide

 

What to Include

  • Do not include pages that return a non-200 status code (you can see and export a list of these pages using Lumar’s Non-200 Status Code report).
  • While not desirable, including disallowed pages is OK as they will be ignored.
  • Including redirecting URLs is OK as a one-off exercise to show Google the redirects. However, they should not be left in Sitemaps as a general rule as their inclusion causes unnecessary crawling or can result in Sitemaps being rejected.
  • Do not include any non-indexable pages, including those with a noindex tag or those that are canonicalised to another URL.
  • Do not mix URLs from different domains.
  • A last modified date (<lastmod>) is optional; if used, it should match any date/time in the HTML to avoid confusion.
  • Tags for <changefreq> and <priority> are also optional and can be left out.
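The inclusion rules above can be sketched as a small filter in front of a Sitemap generator. This is a minimal illustration, not a definitive implementation: the Page record, its field names and the example.com URLs are assumptions made for the example, and the page data would normally come from a crawl export.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass
class Page:
    """Hypothetical crawl record; the field names are assumptions."""
    url: str
    status: int
    noindex: bool = False
    canonical: Optional[str] = None  # None means no canonical tag was found
    lastmod: Optional[str] = None    # W3C date string, e.g. "2024-01-31"


def is_sitemap_candidate(page: Page, host: str) -> bool:
    """Return True only for indexable, 200-status pages on the given host."""
    if page.status != 200:
        return False  # non-200 pages (errors and redirects) stay out
    if page.noindex:
        return False  # noindexed pages are not indexable
    if page.canonical is not None and page.canonical != page.url:
        return False  # canonicalised to another URL
    return urlsplit(page.url).netloc == host  # do not mix domains


def build_sitemap(pages, host: str) -> str:
    """Serialise the pages that pass the filter as a <urlset> document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        if not is_sitemap_candidate(page, host):
            continue
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = page.url
        if page.lastmod:  # <lastmod> is optional
            ET.SubElement(url_el, "lastmod").text = page.lastmod
    return ET.tostring(urlset, encoding="unicode")
```

Calling build_sitemap(pages, "www.example.com") returns the XML as a string; <changefreq> and <priority> are deliberately left out because they are optional.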

 

Formatting

  • Sitemaps don’t need to be in XML format: text files are fine if you don’t want to use the additional attributes that XML allows.
  • Google requires Sitemaps to be UTF-8 encoded; in XML Sitemaps, use entity escape codes for ampersands, single quotes, double quotes and greater than/less than symbols.
  • Always use absolute URLs.
  • Do not include URLs with any additional tracking parameters.
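As a rough sketch of the formatting rules above, the following helpers reject relative URLs, drop tracking parameters and entity-escape a URL for use inside an XML <loc> element. The list of tracking parameters (utm_*, gclid, fbclid) is an assumption; adjust it to whatever your own analytics setup appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from xml.sax.saxutils import escape

# Assumed tracking parameters; not an exhaustive or official list.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"gclid", "fbclid"}


def strip_tracking(url: str) -> str:
    """Return the absolute URL with tracking parameters removed."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {url!r}")
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith(TRACKING_PREFIXES)
        and name.lower() not in TRACKING_NAMES
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def xml_escape_loc(url: str) -> str:
    """Entity-escape &, <, >, single and double quotes for an XML Sitemap."""
    return escape(url, {"'": "&apos;", '"': "&quot;"})


def text_sitemap(urls) -> bytes:
    """Build a plain-text Sitemap: one URL per line, UTF-8 encoded."""
    return "\n".join(strip_tracking(u) for u in urls).encode("utf-8")
```

Entity escaping applies only to XML Sitemaps; the text format simply lists one cleaned URL per line.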

 

Thresholds

 

Internal Linking & Sitemaps: Identifying Gaps

Pages can exist in the Sitemaps but not be linked internally, or they can be linked internally but not included in Sitemaps. Whether accidental or deliberate, both scenarios are a problem and should be fixed. Either improve your internal linking structure to include all pages in the Sitemap, or update your Sitemap(s) to include all pages that are linked within the site.

If you have linked URLs or URLs in Sitemaps that don’t generate traffic or are no longer required, disallow or delete them to minimize your crawl space.
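The comparison described above amounts to two set differences. A minimal sketch, assuming you already have one list of URLs extracted from your Sitemaps and another of URLs found by following internal links (both inputs and the function name are hypothetical):

```python
def find_gaps(sitemap_urls, linked_urls):
    """Compare Sitemap URLs with internally linked URLs."""
    sitemap_set = set(sitemap_urls)
    linked_set = set(linked_urls)
    return {
        # In a Sitemap but never linked from within the site.
        "only_in_sitemaps": sorted(sitemap_set - linked_set),
        # Linked internally but absent from every Sitemap.
        "missing_from_sitemaps": sorted(linked_set - sitemap_set),
    }
```

Both lists should be normalised in the same way (scheme, trailing slashes, parameters) before comparing, otherwise identical pages will show up as false gaps.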
 

Other Considerations

  • If you have multi-language/country websites, you should be using hreflang. Sitemaps are the best way to implement hreflang because the information is only required by search engines, so including it in the HTML of pages adds unnecessary weight to the page.
  • Having content duplicated in sitemaps is OK, but it makes the site data you get from your Sitemaps unreliable.
  • If you have separate sub-domains for mobile and desktop versions of your site, you should also use a rel="alternate" tag to help Google identify each version. This can also be done in the Sitemap. For more information, see Google’s mobile configuration guide.

For more information on creating Sitemaps, please visit Google Webmaster support.
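To illustrate hreflang in a Sitemap, here is a minimal sketch that writes one <url> entry per language version, each listing every alternate (including itself) as an xhtml:link element. The function name, the input shape and the example.com URLs are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_hreflang_sitemap(alternate_groups) -> str:
    """alternate_groups: list of dicts mapping hreflang code -> URL.

    Each dict describes one page in all of its language/country versions.
    """
    urlset = ET.Element(
        "urlset", {"xmlns": SITEMAP_NS, "xmlns:xhtml": XHTML_NS}
    )
    for group in alternate_groups:
        for url in group.values():
            url_el = ET.SubElement(urlset, "url")
            ET.SubElement(url_el, "loc").text = url
            # Every version lists all alternates, itself included.
            for lang, alt_url in group.items():
                ET.SubElement(
                    url_el,
                    "xhtml:link",
                    {"rel": "alternate", "hreflang": lang, "href": alt_url},
                )
    return ET.tostring(urlset, encoding="unicode")
```

For a page available in English and German, the input would be a single dict such as {"en": "https://www.example.com/en/", "de": "https://www.example.com/de/"}, which produces two <url> entries with two alternates each.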

 

Naming Your Sitemaps

How you name your Sitemaps depends on how public you want them to be: some sites choose to keep them private so that competitors can’t access data about their site’s structure.
 

Public:

If you want to make your Sitemap or index Sitemap accessible to everyone, name it sitemap.xml. Include all of your Sitemap index URLs, or individual Sitemaps, in your robots.txt file so that Google can find them.
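As a quick check that your robots.txt declares the Sitemaps you expect, the sketch below parses a robots.txt body with Python's standard urllib.robotparser module (site_maps() needs Python 3.8 or later). The robots.txt content and URLs shown are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; the paths and URLs are assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-news.xml
"""


def declared_sitemaps(robots_txt: str) -> list:
    """Return the Sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []
```

Running declared_sitemaps(ROBOTS_TXT) lists both declared Sitemap URLs; an empty list means search engines will not discover your Sitemaps from robots.txt alone.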
 

Private to anyone without a link:

To hide your Sitemaps from competitors, consider naming them something that could not be guessed. Remove the Sitemap URL from the robots.txt file and submit your Sitemap(s) manually so that Google can find them.
 

Only accessible by search engines (advanced):

Do a reverse DNS lookup on the IP address of each request to confirm that it comes from a genuine search engine crawler, and block access for everyone else. Submit your Sitemap(s) manually.
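A reverse DNS check is usually paired with a forward lookup, so that a spoofed PTR record is not enough to pass. The sketch below shows the idea for Googlebot only. The function name and the injectable lookup parameters are assumptions for illustration, the domain list covers only the googlebot.com and google.com hostnames, and the default forward lookup handles IPv4 only.

```python
import socket

# Hostname suffixes accepted as Google crawlers (assumed, Googlebot only).
CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm it.

    The lookup functions are injectable so the logic can be tested
    without network access; by default the socket module is used.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
    except OSError:  # no PTR record or lookup failure
        return False
    if not hostname.lower().rstrip(".").endswith(CRAWLER_DOMAINS):
        return False
    try:
        # The hostname must resolve back to the requesting IP.
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

A request handler for the Sitemap URL could call is_verified_crawler(request_ip) and return a 403 or 404 when it is False. DNS lookups are slow, so results are worth caching per IP address.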
 

MULTIPLE SITEMAPS FOR THE SAME SITE

Using one Sitemap for a very large site can be unwieldy: errors become more likely, and you’ll waste time sifting through a large amount of data. Splitting it into multiple Sitemaps can help.
 

USE A DIFFERENT SITEMAP FOR DIFFERENT TYPES OF CONTENT:

Generally, it’s useful to split your URLs across several Sitemaps, broken down by content type. For example, one for product pages, one for new product pages and one for category pages.

You can also use an extra Sitemap for different purposes, such as:

  • Unlinked content
  • New content (anything less than a couple of days old)
  • Indexing stats
  • Redirected content, only to show Google the redirects

 

INDEX SITEMAPS:

Index Sitemaps allow you to build multiple Sitemaps and submit them to Google together. The structure of index Sitemaps should be two levels deep: don’t nest index Sitemaps within other index Sitemaps.
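A two-level structure can be sketched as follows: split the URL list into child Sitemaps, then list only those child Sitemaps in a single index. The function names and file naming are assumptions; the default of 50,000 URLs per file is the limit in the sitemaps.org protocol rather than something stated in this article.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk_urls(urls, max_urls=50000):
    """Split a URL list into chunks, one chunk per child Sitemap."""
    if max_urls < 1:
        raise ValueError("max_urls must be at least 1")
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]


def build_sitemap_index(child_sitemap_urls) -> str:
    """Build a <sitemapindex> that points at child Sitemaps only.

    Pass URLs of regular Sitemaps here, never URLs of other index files,
    so that the structure stays two levels deep.
    """
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in child_sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(root, encoding="unicode")
```

Each chunk from chunk_urls would be written out as its own Sitemap file, and the URLs of those files are what build_sitemap_index receives.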
 

MULTI-DIMENSIONAL SITEMAPS (ADVANCED):

Multi-dimensional Sitemaps allow you to include the same URL in multiple Sitemaps. For example, for an ecommerce site, you could use a set with your products broken down into main categories. With multi-dimensional Sitemaps, you could also include additional Sitemaps with the products grouped by those in stock, and those out of stock.

Using this method, you might be able to identify a pattern of pages out of stock that are not being indexed, which you wouldn’t necessarily spot from the category Sitemaps.
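The ecommerce example above can be sketched by assigning each product URL to two Sitemaps, one per dimension. The product dictionary keys and the Sitemap file names are hypothetical.

```python
from collections import defaultdict


def group_into_sitemaps(products):
    """Map Sitemap file names to URL lists along two dimensions.

    Each product is a dict with 'url', 'category' and 'in_stock' keys
    (assumed field names). Every URL lands in exactly two Sitemaps.
    """
    sitemaps = defaultdict(list)
    for product in products:
        # Dimension 1: main category.
        sitemaps[f"sitemap-category-{product['category']}.xml"].append(product["url"])
        # Dimension 2: stock status.
        stock = "in-stock" if product["in_stock"] else "out-of-stock"
        sitemaps[f"sitemap-{stock}.xml"].append(product["url"])
    return dict(sitemaps)
```

Comparing the indexed counts reported for sitemap-out-of-stock.xml against those for the category Sitemaps is how a pattern such as unindexed out-of-stock pages would become visible.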
 

SITEMAP AUDITS WITH LUMAR (formerly Deepcrawl): USEFUL REPORTS

 

CRAWL TYPE: UNIVERSAL CRAWL

Run a Universal Crawl to check all the URLs in your Sitemaps and compare them to the rest of your site. Once the crawl has finished, you will be able to drill down to each URL (select your report and click the URL you want to analyze).


A Universal Crawl with a full crawl of the website will also reveal gaps in the Sitemap or internal linking structure by showing you where they don’t match: you can see which URLs are in the Sitemap but not linked, and those that are linked but not contained in your Sitemap.

Lumar (formerly Deepcrawl) will automatically detect your Sitemaps. If you need to manually identify the Sitemaps for Lumar, then the same is probably true for Google.

Remember that, like Google, Lumar will only crawl your Sitemap two levels deep.
 

1. XML SITEMAPS

Navigate to Universal > XML Sitemaps in your report to view the HTTP status, type, errors and the number of URLs for each discoverable Sitemap on your site.
 

2. ALL URLS IN XML SITEMAPS

Check that your Sitemaps contain all the URLs intended by going to Universal > All URLs in XML Sitemaps in your report. Click a URL to see all the information about that URL in one dashboard.

 

3. BROKEN XML SITEMAPS

Check for XML Sitemaps that return a 4XX or 5XX error using Universal > Broken XML Sitemaps in your report.
 

4. MISSING FROM SITEMAPS

Use Universal > Missing In Sitemaps to find URLs that are linked internally but that aren’t in your Sitemaps. Add these URLs to maximise your indexable space.
 

5. ONLY IN SITEMAPS

Use Universal > Only In Sitemaps within your report to find pages that are included in your Sitemaps, but that aren’t linked internally. If this is a mistake, you can use this information to link these pages from within your site.

Click a URL to see all the information about that page; from here you can view the Sitemaps In tab to see all the Sitemaps the URL is in.

 

6. REDIRECTING SITEMAP URLS

The report at Universal > Redirecting sitemap URLs will show you all URLs that are included in the Sitemap but that are returning a 3XX status. These should be removed from the Sitemap, or replaced with the new URL if this is not already included.
 

7. HREFLANG IN SITEMAPS

View all the hreflang tags contained in Sitemaps for a particular URL, along with any conflicts.

To get to this information, select any report under Universal and click a URL you wish to examine. Click the HREFLANG tab in the subsequent screen to see the hreflang configuration for that page.


Tristan Pirouz

Marketing Strategist

Tristan is an SEO enthusiast, strategist, and the former Head of Marketing at Lumar.
