There are several things to consider when creating sitemaps, and it helps to understand how search engines treat them. Our SEO Office Hours Notes below cover a range of these topics, along with best practice recommendations and Google’s advice on sitemaps.
For more on sitemaps and SEO, check out our article: How to Improve Website Crawlability with Sitemaps.
It’s okay if the same URL appears in multiple sitemap files
It’s fine to have the same URL included in multiple sitemap files. The only caveat is that the sitemaps must not provide conflicting information about that URL. For example, having a URL in a ‘regular’ sitemap and an hreflang-specific sitemap (for different language versions of your site) is perfectly acceptable, as long as any hreflang annotations given to that page are consistent across both sitemaps.
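As a rough illustration of that consistency check, the sketch below compares the hreflang annotations a URL receives in different sitemap files. The function names (`hreflang_map`, `conflicting_urls`) and the example.com URLs are hypothetical, and it assumes that a sitemap listing a URL with no annotations at all (a ‘regular’ sitemap) does not count as a conflict.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

def hreflang_map(sitemap_xml: str) -> dict[str, dict[str, str]]:
    """Return {page URL: {hreflang code: alternate URL}} for one sitemap."""
    root = ET.fromstring(sitemap_xml)
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        result[loc] = {
            link.get("hreflang"): link.get("href")
            for link in url.findall("xhtml:link", NS)
            if link.get("rel") == "alternate"
        }
    return result

def conflicting_urls(sitemaps: list[str]) -> set[str]:
    """URLs whose hreflang annotations differ between sitemap files."""
    seen: dict[str, dict[str, str]] = {}
    conflicts = set()
    for xml_text in sitemaps:
        for loc, alternates in hreflang_map(xml_text).items():
            if not alternates:
                continue  # assumption: no annotations means nothing to conflict with
            if loc in seen and seen[loc] != alternates:
                conflicts.add(loc)
            seen.setdefault(loc, alternates)
    return conflicts

# Hypothetical sample data: A and B disagree on the German alternate.
HEAD = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
        'xmlns:xhtml="http://www.w3.org/1999/xhtml">')
HREFLANG_A = HEAD + """<url><loc>https://example.com/en/</loc>
<xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/"/>
</url></urlset>"""
HREFLANG_B = HEAD + """<url><loc>https://example.com/en/</loc>
<xhtml:link rel="alternate" hreflang="de" href="https://example.com/german/"/>
</url></urlset>"""
REGULAR = HEAD + "<url><loc>https://example.com/en/</loc></url></urlset>"

print(conflicting_urls([HREFLANG_A, HREFLANG_B]))
print(conflicting_urls([HREFLANG_A, REGULAR]))
```

The first call flags the shared URL because the two files point its German alternate at different pages; the second finds nothing to flag.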
It’s possible to host sitemap files on a separate domain
One user asked whether they could host their sitemap files externally (perhaps on a separate server or a staging site). Google’s John Mueller explained that this is possible as long as the sitemaps are handled correctly. This means either having both domains verified in Google Search Console (GSC), or including a link to the sitemap file within the site’s robots.txt. If the sitemap has moved, redirecting the old sitemap URL to the new location is also a best practice. Note that some reporting issues may occur in GSC if the sitemaps are on a different domain, but this shouldn’t impact the functionality of the sitemap file itself.
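To see which sitemap references in a robots.txt file point at another host, something like the following sketch could be used. It relies on Python’s standard `urllib.robotparser` (its `site_maps()` method needs Python 3.8 or later); the helper name `external_sitemaps` and the hostnames are made up for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for example.com, with sitemaps hosted elsewhere.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://sitemaps.example.org/example.com/sitemap-index.xml
"""

def external_sitemaps(robots_txt: str, site_host: str) -> list[str]:
    """Return Sitemap URLs from robots.txt that live on a different host."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    sitemap_urls = parser.site_maps() or []
    return [u for u in sitemap_urls if urlsplit(u).hostname != site_host]

print(external_sitemaps(ROBOTS_TXT, "example.com"))
```

Any URL this returns is a sitemap hosted off-site, which is where the GSC reporting quirks mentioned above are most likely to appear.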
Robots.txt file size doesn’t impact SEO, but smaller files are recommended
John confirmed that the size of a website’s robots.txt file has no direct impact on SEO. He did, however, point out that larger files can be more difficult to maintain, which may in turn make it harder to spot errors when they arise.
Keeping your robots.txt file to a manageable size is therefore recommended where possible. John also stated that there’s no SEO benefit to linking to sitemaps from robots.txt. As long as Google can find them, it’s perfectly fine to just submit your sitemaps to GSC. That said, linking to sitemaps from robots.txt is a good way to ensure that other search engines and crawlers can find them.
Image sitemaps can be useful for sites that use lazy loading
When “lazy loading” images on a page in a way that doesn’t include defined image elements, it’s recommended to have a backup in the form of structured data or an image sitemap. That way, Google will know to associate those images with the page even before they’re loaded.
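A minimal sketch of generating such an image sitemap with Python’s standard library is shown here. It uses the sitemap protocol namespace and Google’s image sitemap extension namespace; the function name `build_image_sitemap` and the page and image URLs are placeholders.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

def build_image_sitemap(pages: dict[str, list[str]]) -> str:
    """Build an image sitemap from {page URL: [image URLs on that page]}."""
    # Serialise with a default namespace and an "image:" prefix.
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            # One <image:image> entry per lazy-loaded image on the page.
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical gallery page whose images are lazy loaded.
print(build_image_sitemap({
    "https://example.com/gallery": [
        "https://example.com/img/a.jpg",
        "https://example.com/img/b.jpg",
    ],
}))
```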
Use Sitemaps Ping, Last Modified and Separate Sitemaps to Index Updated Content
To help Google index updated content more quickly, ping Google when a sitemap has been updated, use last modified dates in sitemaps, and use a separate sitemap for updated content so it can be crawled more frequently.
Specify Timezone Formats Consistently Across Site & Sitemaps
Google is able to understand different timezone formats, for example, UTC vs GMT. However, it’s important to use one timezone format consistently across a site and its sitemaps to avoid confusing Google.
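One way to keep timestamps consistent is to route every date through a single formatting helper before it is written to a page or a sitemap. The sketch below normalises everything to UTC in the W3C Datetime format that the sitemap protocol uses for last modified dates. The helper name `to_lastmod` and the choice of UTC are assumptions, not something specified in the notes.

```python
from datetime import datetime, timedelta, timezone

def to_lastmod(moment: datetime) -> str:
    """Format a timezone-aware datetime as a UTC W3C Datetime string."""
    if moment.tzinfo is None:
        # A naive datetime has no known offset, so refuse to guess one.
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")

# A page edited at 09:30 in a UTC+2 timezone is written as 07:30 UTC.
edited = datetime(2024, 3, 1, 9, 30, tzinfo=timezone(timedelta(hours=2)))
print(to_lastmod(edited))  # 2024-03-01T07:30:00+00:00
```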
Include Most Recently Changed Content in Separate Sitemap
Rather than submitting all of your sitemaps regularly to get Googlebot to find and crawl newly updated pages, John recommends adding recently changed pages into a separate sitemap which can be submitted more frequently, while leaving more stable, unchanged pages in existing sitemaps.
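The split John describes could be sketched roughly as follows. The function name `split_by_recency`, the seven-day window and the example URLs are illustrative assumptions; the right window depends on how often a site actually changes.

```python
from datetime import date, timedelta

def split_by_recency(
    pages: dict[str, date], today: date, window_days: int = 7
) -> tuple[list[str], list[str]]:
    """Split {URL: last modified date} into (recent, stable) URL lists."""
    cutoff = today - timedelta(days=window_days)
    recent = sorted(url for url, modified in pages.items() if modified >= cutoff)
    stable = sorted(url for url, modified in pages.items() if modified < cutoff)
    return recent, stable

# Hypothetical pages and their last modified dates.
pages = {
    "https://example.com/news/launch": date(2024, 3, 10),
    "https://example.com/about": date(2024, 1, 5),
}
recent, stable = split_by_recency(pages, today=date(2024, 3, 12))
print(recent)  # goes in the frequently submitted sitemap
print(stable)  # stays in the existing sitemaps
```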
Use the Last Modified Date to Provide a Hierarchy of Changes Made to a Site
John recommends using the last modified date in sitemaps in a reasonable way to provide a clear hierarchy of the changes that have been made on a site. This helps Google to understand which pages are important and ensures it focuses on crawling these first.
“Discovered Not Indexed” Pages May Show in GSC When Only Linked in Sitemap
Pages may show as “Discovered Not Indexed” in GSC if they have been submitted in a sitemap but aren’t linked to within the site itself.
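Given a list of sitemap URLs and a crawl of the site’s internal links, the pages at risk can be found with a simple set difference. This is only a sketch: the function name `sitemap_only_urls` and the shape of the crawl data (each page mapped to the set of URLs it links to) are assumptions about how such data might be stored.

```python
def sitemap_only_urls(
    sitemap_urls: list[str], internal_links: dict[str, set[str]]
) -> set[str]:
    """Return sitemap URLs that no crawled page links to."""
    # Collect every link target found anywhere on the site.
    linked = set().union(*internal_links.values())
    return set(sitemap_urls) - linked

# Hypothetical crawl: /orphan is in the sitemap but nothing links to it.
crawl = {
    "https://example.com/": {"https://example.com/products"},
    "https://example.com/products": {"https://example.com/"},
}
sitemap = [
    "https://example.com/",
    "https://example.com/products",
    "https://example.com/orphan",
]
print(sitemap_only_urls(sitemap, crawl))
```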
Google Has a Separate User Agent For Crawling Sitemaps & For GSC Verification
Google has a separate user agent that fetches the sitemap file, as well as one to crawl for GSC verification. John recommends making sure you are not blocking these.