Duplicate content occurs when identical, or near-identical, content appears on two or more web pages. The content does not have to be exactly the same: as Google and other search engines become more sophisticated, they are getting better at spotting spun content and treating it as duplicate content. For example, if you have simply reworded an article, this can count as duplicate content even though it is not an exact replica of another page. Either way, duplicate content can reduce the authority of the individual page and, if widespread, the credibility of an entire website. It can then result in a loss of rankings for individual pages or the entire site. Google defines duplicate content as “substantive blocks of content within or across domains that either completely match other content or are appreciably similar”.
Whilst duplicate content can be a deliberately spammy tactic, there are multiple forms of it that occur entirely by accident. As such, duplicate content is often an issue faced by oblivious webmasters who never consciously created it. Duplicate content has been around for a very long time, but ever since the Panda update of 2011 and its various iterations thereafter, it has been a hot topic.
Does Duplicate Content Incur Penalties?
This is a confusing one. For a long time people thought that Google penalised duplicate content – yet Google employees stated that it simply reduces the chances of your page ranking rather than incurring a direct penalty. Those ‘in the know’ then declared the duplicate content penalty a huge myth. Or is it? It would appear that the masses were in fact correct all along. Google’s webmaster guidelines state that for sites whose duplicate content has seemingly manipulative intent, “the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results”. Sounds like a penalty to us! So who is right? In the vast majority of duplicate content cases there is very little malicious intent, so Google simply tries to work out which version of the content is the original, and no penalty is incurred. However, if the duplicate content is highly spammy then it can incur a penalty. Penalty or no penalty, solving your website’s duplicate content issues is a necessary part of SEO. Here are the potential ramifications of duplicate content:
- Google has to decide which page to rank for a given search term. If you have multiple instances of duplicate content on your site, Google may choose a less preferred page to return to searchers as a search result.
- Search Engines want to offer a variety of results for a given search term. Duplicate content can therefore reduce the ranking power of a given webpage or website.
- In spammy, manipulative cases Google can penalise the website by reducing rankings or de-indexing the site.
- Duplicate content (especially duplicate URLs) can cause ambiguity for third party websites. As a result, they may link to a duplicate version which will result in a loss of link juice for your website.
- Excessive duplicate content on a website can be an indication of poor quality and therefore reduce the authority of your website. In turn this will affect the ability of your website to rank for given search terms.
- The concept of a crawl limit (i.e. the number of pages that Google will crawl on your website, dependent on your authority) and a limit to the number of pages that Google is willing to index for your site raises interesting questions about whether duplicate content will significantly impair how regularly Google crawls and/or indexes your entire site. Regardless of whether these limits are fixed and true, it really isn’t worth the risk.
What are common duplicate content issues?
Pages with similar content
This is also referred to as keyword cannibalisation. It occurs when two or more webpages on the same site have very similar content – or, in this world of latent semantic indexing and the knowledge graph, content that tackles the same issue or point! This type of duplicate content can be very confusing for Google, so it is important that your content creation strategy is well structured and that there is a clear hierarchy for the existing pages on your site. For pages with similar content it is worth merging them, using rel="prev"/"next" where they are really a paginated series, or simply rethinking the structure of your site. Pages with similar content not only cause duplicate content issues for search engines but can also create a more difficult user flow through the website and/or dilute link juice throughout the site as external websites are unsure which page to link to.
Printer Friendly Versions
Printer friendly versions of webpages can sometimes be indexed by Google and therefore appear as direct replicas of the original pages. Whilst it is pretty obvious to Google that these are printer versions, it is not worth the risk of leaving them available to robots.
Session IDs, Search Filters and URL Parameters
URLs will often be generated for different purposes, whether that be to track a particular session for a user or as a result of the search or filter function being used on the website. These generated URLs can sometimes count as duplicate content and should therefore be addressed.
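To make this concrete, here is a minimal sketch in Python of the kind of normalisation an audit script can apply to collapse parameter-generated duplicates. The parameter names are illustrative assumptions – audit your own site to build the real list:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Parameters that commonly create duplicate URLs without changing the content.
# This list is a hypothetical example, not a definitive set.
TRACKING_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign", "sort"}

def normalise_url(url: str) -> str:
    """Strip session/tracking parameters so duplicate URLs collapse to one."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))
```

Run over a crawl export, this quickly shows how many distinct URLs actually resolve to the same underlying page.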
Duplicate URLs
Duplicate URLs are a very common type of duplicate content. If you have not chosen a canonical URL (or preferred domain in Google’s Search Console), there will effectively be one or more duplicate versions of every page on your website. Luckily Google is reasonably good at choosing a canonical URL itself. However, duplicate URLs can result in a loss of link juice as other websites link to the duplicate versions rather than the canonical URLs. Duplicate URLs should be 301 redirected to the canonical version with a canonical tag applied: Google does not recommend noindexing the duplicate content, but instead simply permanently redirecting each page to the canonical (original) version. We would also recommend indicating the canonical URL by choosing a preferred domain in Google’s Search Console.
Default Server Settings & Duplicate URLs
Servers can often create multiple versions of a webpage through their default settings. Most are easy to simply 301 redirect to the canonical version, although be careful not to create an infinite redirect loop with /index.html.
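As a sketch of how to avoid that loop, the rewrite logic must leave already-canonical paths untouched so the server never redirects a page to itself. A minimal Python illustration (the no-trailing-slash convention here is an assumption – pick one convention and apply it consistently):

```python
def canonical_path(path: str) -> str:
    """Map default-document and trailing-slash variants to one canonical path."""
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]      # /about/index.html -> /about/
    if path != "/" and path.endswith("/"):
        path = path.rstrip("/")                # assumed convention: no trailing slash
    return path or "/"

def needs_redirect(path: str) -> bool:
    # Only redirect when the path is NOT already canonical - this is what
    # prevents the infinite /index.html loop.
    return canonical_path(path) != path
```

Because `needs_redirect` returns False for the canonical form, the 301 fires exactly once per duplicate.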
Multiple Categories and Products
Duplicate URLs can be created through complex information architecture (site structure) or on websites with hundreds of products. E-commerce sites, for example, often suffer from duplicate content issues when products appear in multiple categories: a pair of red Nike trainers could be returned for men’s trainers, running trainers, Nike trainers, red trainers, size searches, etc. Each of these listings could have its own unique URL and therefore create duplicate content.
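One common fix is to give every product a single category-free URL and point each category-specific variant’s canonical tag at it. A hypothetical sketch (the SKU, paths and domain are invented for illustration):

```python
# Hypothetical catalogue: one canonical URL per product, independent of
# whichever category the visitor browsed through.
CANONICAL_PRODUCT_URLS = {
    "nike-red-42": "/products/nike-red-42",
}

def canonical_link_tag(sku: str) -> str:
    """Emit the canonical tag every category listing of this product should carry."""
    path = CANONICAL_PRODUCT_URLS[sku]
    return f'<link rel="canonical" href="https://example.com{path}" />'
```

Whether the shopper arrived via /running-trainers/ or /red-trainers/, the page declares the same canonical, so search engines consolidate the signals onto one URL.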
Boilerplate content and widgets
Excessive amounts of boilerplate or widgetised content can count as duplication. After all, each web page should have its own unique content, and if too much of a page’s content appears on multiple other pages via widgets or boilerplate, it makes sense that it would count as duplicate content. Google advises that boilerplate be reduced by providing a synopsis and a link to a page with the detailed content, such as terms and conditions or copyright. Boilerplate can also be marked up with the no content class attribute, covered later in this article.
Different TLDs (usually with international sites)
International companies can often have separate domains for each country in which they operate, whether that be .co.uk for the United Kingdom, .es for Spain, .de for Germany, .fr for France, etc. Duplicate content issues occur when each website uses the same or similar content. As with any content we advise that each country has as much unique content as possible.
Websites with content in multiple languages do not automatically count as duplicate content. It is completely fine if the website’s content has been professionally translated by a human being, because the linguistic nuances will differentiate the languages. However, if you have used an automated system to translate the content (such as Google Translate), this can not only count as duplicate content but will also be of poorer quality than professionally translated text, doubly impacting your website’s ability to rank.
Duplicate Meta Data
This is very common indeed, especially on larger sites. Best practice dictates that each page has its own unique title tag and meta description. It can be a laborious job, but if you want your SEO to be perfect then it is a necessary one. We would advise starting by ensuring that there are no duplicate title tags, since Google claim to no longer take meta descriptions into account.
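A crawl export makes duplicate title tags easy to find programmatically. A small sketch, assuming the export has been loaded as a list of (url, title) pairs:

```python
from collections import defaultdict

def find_duplicate_titles(pages):
    """Group pages by title tag (ignoring case and extra whitespace) and
    return only the titles that appear on more than one URL."""
    by_title = defaultdict(list)
    for url, title in pages:
        by_title[" ".join(title.split()).lower()].append(url)
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}
```

Each entry in the result is a cluster of pages competing with the same title, i.e. a rewrite (or merge) candidate.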
CMS-Generated Duplicate Content
Some content management systems automatically generate multiple versions of a webpage and, with them, duplicate content. It is worth identifying whether this is happening and making the necessary changes to prevent it.
Pagination
This usually occurs with search results or articles split across multiple pages. Google can have difficulty telling that the pages are connected, and (in the case of an article) the multiple pages covering the same topic can count as duplicate content.
Identifying Duplicate Content
Know the rules
Unfortunately, for some duplicate content issues you simply need to know the rules. Luckily for you, this article should give you all the information you need to make sure that you are duplicate content free!
Scraped Content
There are loads of sites that will scrape your content and repost it. This is not something to worry too much about, because such sites are usually of low authority and tend to repost with links, so it is fairly obvious to Google that yours is the original. However, if you are worried about a higher-authority website reposting an article you wrote and outranking you for it, use Google to keep an eye out for the content!
Google’s Search Console is a quick way to identify duplicate meta data. Simply sign in, go to ‘Search Appearance’ and click on ‘HTML Improvements’.
Screaming Frog’s SEO Spider can filter its results by duplicate Title Tags, Meta Descriptions and URLs.
Solving Duplicate Content Issues
Don’t Scrape or Duplicate Content
The most obvious way to prevent duplicate content issues is not to copy content from another website and publish it on your site in the first place! This not only creates duplicate content issues but will also damage the credibility of your website in both users’ and Google’s eyes – high quality unique content should always be the aim! Make sure too that you are not creating content similar to that which already exists on your site and therefore falling foul of keyword cannibalisation.
Canonical Tags
If you do come across an article which is simply just too good not to include on your own blog, you should give credit where credit is due. A canonical tag on a piece of content effectively tells Google that you have full knowledge that it is a duplicate and indicates where the original piece lives. For more information on canonical tags see here.
301 Redirects
The 301 redirect is the most effective tool for solving duplicate content issues. A 301 redirect not only indicates to Google that a page has permanently moved (essentially merging the two pages) but also automatically sends the user to the correct page and passes the majority of any misdirected link juice to it. Moving forward, users are highly unlikely to link to the duplicated page, and Google will identify the canonical page as the one that each duplicate has been redirected to.
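In practice this is just a lookup from each duplicate URL to its canonical version, answered with a 301 status. A minimal sketch (the paths here are invented; a real site would load the map from server configuration):

```python
# Hypothetical map from duplicate paths to their canonical versions.
REDIRECTS = {
    "/print/red-trainers": "/red-trainers",
    "/red-trainers/index.html": "/red-trainers",
}

def respond(path: str):
    """Return the (status, location) pair the server should send for a path."""
    if path in REDIRECTS:
        return 301, REDIRECTS[path]  # permanent redirect: merges the pages
    return 200, path                 # canonical page: serve as normal
```

Because the canonical path itself is never a key in the map, the redirect fires once and cannot loop.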
Merge Similar Content
For pages with keyword cannibalisation issues it is advisable to merge the two pages by 301 redirecting one to the other. This may require some thought about the rest of the user flow, but in the long run it will be far more beneficial for the site as it should streamline your user journey and improve UX.
Parameter Handling Tool
Note that the Parameter Handling Tool is much like the Data Highlighter in that it is a Google tool and therefore only works for Google. Rules set out in the Parameter Handling Tool will have no effect on Bing or Yahoo! The tool allows webmasters to identify which URL parameters they want indexed and which they do not – for filtered searches, for example. For more information on the Parameter Handling Tool and duplicate content, see Google’s guide (https://support.google.com/webmasters/answer/6080548?hl=en).
No content Class Attribute
The no content class attribute is used on boilerplate content. It is very similar to the noindex attribute placed on whole pages, but instead indicates that a particular piece of content within a page is not to be indexed. Note that it is a Yahoo! feature (the robots-nocontent class) rather than one Google supports. Pretty neat stuff nonetheless.
Set a Preferred Domain
Again, much like the Parameter Handling Tool and Data Highlighter, setting your preferred domain through Google’s Search Console will have no effect on your search rankings in Yahoo! or Bing. Regardless, even if you have 301 redirected and placed canonical tags on all of your duplicate URLs to identify your canonical domain, there is no harm in also setting your preferred domain in the Search Console.
Complete Removal / 404
Completely removing a duplicate page is an option, especially for content which users rarely, if ever, visit. However, we would still advise 301 redirecting the removed duplicate page to the original, because you never know who might stumble across a link in an email or may have previously linked to that page. Also, 404s are just annoying to come across, so let’s keep the UX as good as possible!
Search Console URL Removal
The URL removal request has changed somewhat since the days of Webmaster Tools, when you had to 404 and noindex a page prior to requesting its removal. Nowadays you simply log into the Search Console, click on the ‘Google Index’ option, go to ‘Remove URLs’ and enter the URL you want removed. However, the removal only lasts for 90 days, and we would always advise 301 redirecting the page to the canonical anyway.
Rel="prev" and Rel="next"
These link attributes can be used to connect paginated pages and tell Google that they are part of the same article or search.
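A sketch of the tags each page in a paginated series would carry, generated in Python for illustration (the ?page= parameter is an assumed URL scheme):

```python
def pagination_link_tags(base: str, page: int, last: int) -> list:
    """Build the rel="prev"/rel="next" link tags for page `page` of `last`
    (1-indexed). The first page carries no prev tag, the last no next tag."""
    tags = []
    if page > 1:
        tags.append(f'<link rel="prev" href="{base}?page={page - 1}" />')
    if page < last:
        tags.append(f'<link rel="next" href="{base}?page={page + 1}" />')
    return tags
```

The chain of prev/next references lets Google treat the series as one piece of content rather than several pages competing on the same topic.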