Most SEOs and marketers are familiar with the issues around duplicate content, and are probably already using a tool like DeepCrawl to identify pages with duplicate body content, titles, and descriptions.
But when you start to think about what duplication really means, things stop being black and white, and get really messy.
Here’s our guide to advanced techniques for detecting, measuring, and solving duplicate content issues.
What is Duplicate Content?
Duplication isn’t a simple concept, and no two people would define it in exactly the same way. It can refer to any of the following:
- Exact character for character duplicate pages
- Near duplicate
- Duplicate titles and descriptions
- Duplicate body content
- Duplicate search results/tag pages (ordered and unordered, including pagination)
- International/Local duplication
- Shared source content
- Hierarchy aliases and indistinct categorisation, e.g. nearby locations
- Architecture duplication, e.g. shared platforms
The amount of unique content on a page matters more than the amount of duplicate content. Even the duplicate content itself can add value to a page.
How can Duplicate Content Have Value?
A page with duplicate content can rank for search terms containing words from either the unique content or the duplicated content on the page. For example, a duplicated product description combined with unique text listing colors allows the page to rank for both product and color terms.
Focus on the unique aspects of content on pages when looking into duplicate content.
Unless you are the original source of the content, you can’t expect to rank on duplicate content alone.
Where Does Duplication Occur?
Duplicate content either exists within a single website or spans multiple websites.
Detecting duplicate content anywhere on the internet requires a global database of all web content. Duplicate content within a single site is usually much easier to find.
Which Version is the Original?
No single instance of duplicated content is automatically the primary version.
Google tries to establish the original source of content, which is presumably based at least partially on discovery date.
This is not possible when you run a limited crawl of your site, because the crawl doesn’t have the full history of every page.
Advanced Duplicate Detection Methods
Unique Text Search
Finding the amount of unique text on a page, and any other copies on the web, requires a full web crawl.
The best tool to do this is CopyScape. However, you can also try searching for strings of text inside double quotes in Google. There are a few alternatives as well:
Duplicated Content Items
Sometimes pages have different titles and breadcrumbs but identical search results. These would not appear in many duplicate reports because they contain some variation.
DeepCrawl’s duplication system does allow some variation and still detects and reports pages as duplicated. However, it’s hit and miss depending on the level of variance and the duplication setting.
A good detection method is to combine all the IDs of the content being displayed, e.g. the product IDs on a product results listing page, then use the combined value as a hash to detect duplicates.
If the IDs are numeric, sum them together. This gives you a key for identifying other pages with identical results, and because addition ignores order, it also matches the same results shown in a different order. Two pages with different results can share the same sum (1 + 4 and 2 + 3, for example), but in practice this is unlikely.
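The summing approach can be sketched in a few lines of Python. This is a minimal illustration, not DeepCrawl functionality: the `pages` mapping, the URLs, and the function names are hypothetical, and it assumes you have already extracted the numeric IDs for each page.

```python
from collections import defaultdict


def summed_key(ids):
    """Return an order-independent key: the sum of the numeric IDs on a page."""
    return sum(ids)


def group_by_key(pages):
    """Group URLs by summed key and keep only keys shared by several pages."""
    groups = defaultdict(list)
    for url, ids in pages.items():
        groups[summed_key(ids)].append(url)
    return {key: urls for key, urls in groups.items() if len(urls) > 1}


# Hypothetical extracted data: URL -> product IDs shown on that page.
pages = {
    "/shoes?sort=price": [1001, 1002, 1003],
    "/footwear": [1003, 1001, 1002],  # same products, different order
    "/boots": [2001, 2002],
}

duplicates = group_by_key(pages)
print(duplicates)
```

Any key that maps to more than one URL is a candidate for manual review, since a matching sum suggests identical results but does not prove it.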
Use DeepCrawl custom extraction to pull IDs from search results, or pass them to your web analytics package.
You can also extract other dimensions around your pages, such as number of results, content length, and other potential characteristics shared by duplicate pages.
If you sort a list of pages by each of these metrics, you can find pages with identical attributes, which can also highlight similar pages.
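As a rough sketch of this idea, the Python below groups pages that share the same combination of attributes rather than sorting by eye. The field names (`url`, `result_count`, `content_length`) and the sample rows are assumptions standing in for whatever your crawl export actually contains.

```python
from collections import defaultdict


def group_by_attributes(rows, keys=("result_count", "content_length")):
    """Group URLs whose chosen attributes are all identical."""
    groups = defaultdict(list)
    for row in rows:
        signature = tuple(row[k] for k in keys)
        groups[signature].append(row["url"])
    # Only signatures shared by more than one page are interesting.
    return {sig: urls for sig, urls in groups.items() if len(urls) > 1}


# Hypothetical rows from a crawl export.
rows = [
    {"url": "/tag/red-shoes", "result_count": 24, "content_length": 5120},
    {"url": "/category/shoes/red", "result_count": 24, "content_length": 5120},
    {"url": "/tag/blue-shoes", "result_count": 18, "content_length": 4300},
]

similar = group_by_attributes(rows)
print(similar)
```

Matching attributes only highlight candidates; pages in the same group still need to be compared before you treat them as duplicates.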
Sometimes content is duplicated across multiple paginated pages, and sometimes the same content is returned in a different order. These cases are much harder to detect.
You might be able to get your CMS to output a hash across the entire results set, even if you’re looking at just the first 10 items. This allows you to detect duplication for full sets of results spanning multiple pages.
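One way such a hash could be computed is shown below, as a sketch only. It assumes the CMS can see the IDs of the full result set, and it swaps in SHA-256 over the sorted IDs so that the same set in a different order produces the same value. The function name and sample IDs are hypothetical.

```python
import hashlib


def result_set_hash(ids):
    """Hash the full set of result IDs, ignoring the order they are shown in."""
    # Sorting first makes the hash identical for reordered result sets.
    canonical = ",".join(str(i) for i in sorted(ids))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical full result sets behind three paginated listings.
full_set_a = [1001, 1002, 1003, 1004]
full_set_b = [1004, 1003, 1002, 1001]  # same results, reversed order
full_set_c = [1001, 1002, 1003, 1005]  # one different result

print(result_set_hash(full_set_a) == result_set_hash(full_set_b))
print(result_set_hash(full_set_a) == result_set_hash(full_set_c))
```

If the CMS prints this value into every page of the listing, for example in a meta tag, a crawler can extract it and group pages by it.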
Sometimes a site has several category or tag pages targeting the same topic without actually sharing any content. These pages are fighting over the same keywords, and should usually be consolidated by redirecting the weakest versions to the strongest.
Another sign of duplicate content is that it’s not indexed by Google.
If you submit detailed Sitemaps, broken down into as many separate sections of the site as possible, you can spot patterns of low indexing, which may be caused by duplication.
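Spotting those patterns comes down to comparing submitted and indexed counts per Sitemap. The sketch below assumes you have collected those two numbers yourself for each Sitemap; the file names, counts, and the 50% threshold are all hypothetical.

```python
def low_index_sitemaps(stats, threshold=0.5):
    """Return (sitemap, indexed ratio) pairs below the threshold, lowest first."""
    flagged = []
    for name, (submitted, indexed) in stats.items():
        if submitted == 0:
            continue  # nothing submitted, so no ratio to compute
        ratio = indexed / submitted
        if ratio < threshold:
            flagged.append((name, round(ratio, 2)))
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical counts: sitemap -> (URLs submitted, URLs indexed).
stats = {
    "sitemap-products.xml": (1000, 900),
    "sitemap-tags.xml": (400, 80),
    "sitemap-search.xml": (200, 90),
}

flagged = low_index_sitemaps(stats)
print(flagged)
```

A low ratio is only a prompt to investigate that section, as poor indexing can have causes other than duplication.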
Preventing Duplicate Content
Duplicate content can be devastating to organic traffic and site ranking. Monitoring your website’s structure and content with a tool like DeepCrawl allows you to quickly identify problem areas such as duplicate pages, titles, and descriptions.