How do bots crawl the web?
So, how do search engine bots actually work? They use algorithms to determine which sites to crawl, how frequently to crawl them, and how many pages on each site to fetch – that’s a lot of crawling. Once they’ve followed the links and arrived at a web page, the bots read the content of the page to try to understand it and work out which search terms the page should be displayed for. They do this by reading any text, tags and alt attributes on a page. However, they are generally unable to read the content of rich media, such as video or audio files.
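To make this concrete, here is a minimal sketch of what a text-based crawler can pick up from a page. It uses Python’s standard `html.parser` module; the `CrawlerView` class and the sample page are made up for illustration and are not how any real search engine is implemented.

```python
from html.parser import HTMLParser


class CrawlerView(HTMLParser):
    """Collects the parts of a page that a text-based crawler can read."""

    def __init__(self):
        super().__init__()
        self.text = []        # visible text fragments, including the <title>
        self.alt_texts = []   # alt attributes from <img> tags
        self.links = []       # href values from <a> tags, i.e. pages to crawl next
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a script or stylesheet.
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())


# Hypothetical page: note that the <video> element carries no readable text.
SAMPLE_PAGE = """
<html><head><title>Blue widgets</title></head>
<body>
  <h1>Hand-made blue widgets</h1>
  <img src="widget.jpg" alt="A blue widget on a workbench">
  <video src="demo.mp4"></video>
  <a href="/widgets/care-guide">Widget care guide</a>
</body></html>
"""

view = CrawlerView()
view.feed(SAMPLE_PAGE)
print(view.text)       # text the bot can read
print(view.alt_texts)  # image descriptions the bot can read
print(view.links)      # links the bot could follow next
```

The video contributes nothing to any of the three lists, which is why descriptive text and alt attributes around rich media matter.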
Contrary to what some people believe, Google does not allow websites to pay to have their sites (or any specific web pages) crawled more deeply or more frequently. Websites can request a crawl, otherwise known as a ‘fetch’, via Google’s Search Console, but again this is at Google’s discretion. As a rule of thumb, the more popular your website, the more regularly it will be crawled by bots.
What you want to ensure is that when these bots do come to crawl your site, however regularly that may be, it is as easy as possible for these digital spiders to read it. Make sure you have a clear sitemap and information architecture. Help bots move through your site by providing useful internal links, and structure pages so that they are easy to understand. Remember that focussing on relevance and value for users will probably result in all of these things happening anyway.
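If your CMS doesn’t already generate a sitemap for you, a basic XML sitemap is simple to produce. The sketch below follows the sitemaps.org format using Python’s standard library. The `build_sitemap` helper and the example.com URLs are placeholders for illustration only.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal XML sitemap listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder URLs: swap in the pages you want bots to find.
sitemap = build_sitemap([
    "https://www.example.com/",
    "https://www.example.com/blog/how-bots-crawl",
])
print(sitemap)
```

The output would typically be saved as sitemap.xml at the root of the site and submitted through Google’s Search Console.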
SEO is all about relevance, so you want the search engine bots arriving at your site to read and index the most relevant content for your users. Sometimes that requires you to give them a little nudge on where, or where not, to look. This is where robots.txt comes in.
robots.txt is a plain-text file placed in the root directory of a website that gives instructions to the bots that arrive at your website. Using your robots.txt file incorrectly can have major negative consequences for your website: theoretically, a single misplaced rule could block Google from crawling your entire site and see you drop out of their searches. But don’t worry, when used correctly, robots.txt can be one of the most important tools in your SEO arsenal.
You can use robots.txt beneficially for:
- Adding an XML sitemap to your site, making it easier for bots to crawl
- Preventing duplicate content appearing on SERPs
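- Keeping crawlers out of areas of your website that you’d rather keep out of search (bear in mind that robots.txt is publicly readable, so it is not a security measure)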
- Keeping internal search result pages from appearing on SERPs
- Preventing search engines from indexing certain files, such as images, PDFs, etc.
- Specifying crawl delays to prevent servers becoming overloaded when crawlers load up multiple pieces of content at once
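As a rough illustration of several of the uses above, here is a sample robots.txt checked with Python’s built-in `urllib.robotparser`. The paths and the example.com domain are invented for this sketch, and `site_maps()` needs Python 3.8 or later. Note that robots.txt rules are advisory, and not every crawler honours `Crawl-delay` (Googlebot, for one, ignores it).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths below are examples, not recommendations.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /private/
Disallow: /downloads/brochure.pdf
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# A normal content page is allowed for any bot.
print(parser.can_fetch("*", "https://www.example.com/blog/post"))
# Internal search result pages are blocked.
print(parser.can_fetch("*", "https://www.example.com/search/?q=widgets"))
# The requested delay between requests, in seconds.
print(parser.crawl_delay("*"))
# Sitemap locations declared in the file.
print(parser.site_maps())
```

Running your own rules through a parser like this before publishing them is a cheap way to catch a `Disallow` line that blocks more than you intended.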
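To check if you have a robots.txt file on your website, add /robots.txt to the end of your domain in your browser’s address bar, for example https://www.example.com/robots.txt (with your own domain in place of the placeholder).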
If you don’t have a robots.txt file, Google have produced this easy-to-understand guide to setting one up.
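The same check can be scripted. This is a small sketch using only Python’s standard library; `robots_txt_url` and `has_robots_txt` are hypothetical helper names, and example.com stands in for your own domain.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_txt_url(page_url):
    """Build the robots.txt address for whichever site page_url belongs to."""
    parts = urlsplit(page_url)
    # robots.txt always sits at the root of the host, whatever the page path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def has_robots_txt(page_url, timeout=10):
    """Return True if the site answers the robots.txt request with HTTP 200."""
    try:
        with urlopen(robots_txt_url(page_url), timeout=timeout) as response:
            return response.status == 200
    except (HTTPError, URLError):
        return False


print(robots_txt_url("https://www.example.com/blog/how-bots-crawl?page=2"))
# prints https://www.example.com/robots.txt
```

Calling `has_robots_txt` with one of your own URLs makes a live request. Some servers return a normal-looking page for missing files, so it is worth opening the address in a browser as well.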
What are malicious bots?
As we’ve mentioned, while conducting SEO on your site, the most common and most important bots you’ll encounter will be search engine bots. However, they’re not the only bots out there. Black hat SEO techniques often make use of more nefarious bots to spam, scrape and hack your site, potentially seriously damaging how well your page ranks in search engines. Contrary to popular belief, there are very few manual cyber-attacks carried out by ‘hackers’ in their basements. In fact, around 90% of cyber-threats are bot-based, so it’s important for webmasters to be aware of the threat.
So what do malicious bots do to your site? The most prominent methods used by bots to disrupt your SEO campaigns are spamming, scraping and hacking. Spam bots will attempt to generate as many links as possible across comments on thousands of websites. However, thanks to Google’s anti-spam campaigns, only around 0.5% of web traffic comes from spam bots. Fortunately, the link farming used by spam bots is likely to do considerably more damage to the website they link to than to any sites that they leave comments on. An easy way to combat comment spam is to make sure that you have reCAPTCHA installed in the comments section.
Scrapers, on the other hand, can be a real problem for webmasters. Scraper bots lift content from your own site and duplicate it elsewhere. The main issue with this is that the plagiarised content may lead to Google punishing your site for having duplicate content, even though it was duplicated from you in the first place! There is no direct way to combat this except to make sure that your website is popular enough to be crawled regularly – in this way Google will likely see your content published on your website first, prior to these nasty scrapers. Furthermore, trust that scraper websites are usually low quality and therefore less trustworthy than yours!
Finally, hacking bots can inject malware into your website’s code that you likely won’t even notice. However, if Google detects that there is malware on your site, they will warn users that your site may be infected, leading to a slow but sure decrease in ranking. If malware remains on your site, it stands to reason that both the search engines and potential customers will lose confidence.
Reportedly, over a quarter of all web traffic is made up of malicious bots, so knowing how to protect yourself from them is important for any webmaster. This article from Moz is a great guide on how to fight back and protect yourself from the major methods used by malicious bots. However, it boils down to this: Be proactive and keep an eye on your site, utilise the best anti-spam programs, and always take action against malicious bots as soon as possible.