
Crawl Budget: A Non-SEO's Guide to Ensuring Your Website's Available to Google

By Ncedo Vilakazi
[Image: a spider carefully crawling on its web]
SEO 08.04.2022

“Crawl budget” is one of those bits of jargon that SEO specialists throw at their clients either to make a point about what content should be placed on the site – or to reaffirm their expertise and justify a retainer renewal during an annual status review!

 

Either way, a discussion about crawl budget can easily become overly technical or grossly over-simplified, with the result that business owners end up misunderstanding the concept or not knowing why or how it should be applied.

 

In this article, we're giving you a digestible definition of what crawl budget is and how to think about it. We'll also work through a couple of examples of how it works, and lastly, we'll share a few ideas for making search engine visits to your site as concise and effective as possible, so that all your most important content gets indexed.

How does Google define crawl budget?

So, what is a crawl budget? According to the search giant, crawl budget is the number of URLs Google has both the ability to visit and the desire to visit, based on what they term “Crawl Capacity (Rate Limit)” and “Crawl Demand”. 

 

  • Crawl Capacity/Rate Limit – The maximum rate at which Google’s bots can simultaneously access and evaluate your website content.
  • Crawl Demand – How popular your website pages are, how often you add fresh content, how often your website is updated, how active it is, etc.

 

When you combine these two factors, you can see the general size of Google's appetite to visit your site as well as the frequency of those visits. In other words, even if Google's crawl rate limit isn't reached, low crawl demand will cause Google to visit your website less, potentially resulting in some content not being picked up in a timely manner, if at all.
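Google doesn't publish a formula for this, but the relationship described above can be sketched as a toy model: the effective crawl rate is capped by your server's capacity, yet driven by Google's demand, so the lower of the two wins. The function name and numbers below are purely illustrative assumptions, not anything Google documents.

```python
# Toy model only: Google publishes no formula for crawl budget.
# This sketch just captures the min(capacity, demand) relationship
# described above. All numbers are invented for illustration.

def effective_crawl_rate(crawl_capacity: int, crawl_demand: int) -> int:
    """Hypothetical pages-per-day Google might fetch from a site.

    crawl_capacity -- max fetches/day the server can sustain (rate limit)
    crawl_demand   -- fetches/day Google "wants", based on popularity/freshness
    """
    return min(crawl_capacity, crawl_demand)

# A fast server hosting little fresh, popular content still gets few visits:
print(effective_crawl_rate(crawl_capacity=10_000, crawl_demand=50))   # 50
# A popular site on a slow server is capped by capacity instead:
print(effective_crawl_rate(crawl_capacity=200, crawl_demand=5_000))   # 200
```

The takeaway: raising one side alone doesn't help. A site needs both a responsive server and content Google wants before crawl frequency increases.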

A closer look at why crawl capacity/rate limit matters

If you've ever heard of a DDoS cyber-attack, you'll know it's possible to flood a website with so many requests that it becomes impossible for the site to serve genuinely interested users. Hackers use this technique, sending extreme amounts of bot traffic to a site with the intention of crippling the company that owns it and removing its ability to serve customers.

 

Interestingly, something similar would happen if Google decided to visit your website with the full force of its technology! It would essentially render your site unusable to prospective customers by using up all its resources and crashing it as a result – a bit like clogging a road with too much traffic. And the larger your website (and the greater your volume of visitors), the worse the problem would be.

 

So, to avoid this, Google implements what it calls Crawl Rate Limiting, which caps the number of parallel connections made to your server and sets how long Googlebot waits between fetches before opening another connection.

 

In addition to this, Google's technology is smart enough to monitor the availability of your website's servers. Based on the level of available resources, it will limit the number of connections made to your servers – this is what Google calls being a good citizen of the web. It means Google's bots can crawl your website and list it in search results without affecting the quality of your website experience.

 

So, the goal of Crawl Capacity/Rate Limiting is to find the balance between visiting your website often enough to capture enough of your content to represent it accurately in search results, while avoiding becoming a burden on your website.

 

The crawl capacity limit that Google sets for visitations to your site can be affected by a few factors, namely:

  • Completely automatic influences: If your server responds quickly and reliably over a sustained period, your crawl rate limit will increase and Google will visit your site more frequently. If your server responds slowly or unreliably over that period, Google will visit it less often. (Google Search Console's Crawl Stats report gives you a rolling 90-day window on this behaviour.)
  • Website owner influence: Using the tools Google provides in Google Search Console, a website owner can reduce the rate at which Google visits the website. Limits can also be raised, but this invitation to visit more often will not necessarily result in more crawling or higher rankings in search results.
  • Search engine influenced limits: From a broad perspective, Google also makes economically viable use of its immense computing power. Its resources may seem unlimited at times – but believe it or not, Google also has to use finite resources wisely.

 

The crawl capacity/rate limit consideration doesn't occur in isolation, because at the same time, Google is drawing some pre-emptive conclusions about the content you've put on offer – frequently updated product pages, news articles, in-depth research papers, engaging evergreen blog posts, and so on. Each of these content types creates a different level of desire for re-visitation, and this, coupled with the overall quality, uniqueness, and usefulness of the content, determines crawl demand.

What about Crawl Demand?

Not all websites are created equal. That’s not just to say that some websites are good for SEO and others are not – although a great quality website is at the foundation of any successful SEO project. Different websites have different purposes, and therefore will have different requirements for how the content they present should and may be treated. 

 

The most important factor determining Crawl Demand, or how desirable a webpage is to Google, is the URL's popularity in Search. Google wants to keep the URLs that receive the most attention from searchers fresh in its index – and to avoid keeping stale content. For Google, stale content is anything outdated, no longer wanted by the site owner, ignored by readers, or no longer relevant.

 

There are additional layers of sophisticated analysis that determine a website's popularity, since this sort of behaviour can be driven by several different factors. In some cases, this analysis won't even consider whether the content is particularly entertaining or pleasant, because popularity is usually judged in the context of the site. For example, a gossip website may get lots of traffic compared to a site selling a highly specific engineering solution, but Google can view both as having high crawl demand if they meet the more in-depth, contextual requirements. Here's how these layers of analysis work.

 

Initially, there is the sector or industry level, where search engines like Google begin trying to learn what a website is about, what value it offers, and therefore who it may be useful to. This is where you'll hear terms like YMYL – Your Money or Your Life – websites. These tend to be websites that Google holds to a high accountability standard, especially from a content offering and rankings point of view, because they can have a big impact on someone's finances, health, or wellbeing. In contrast, websites that don't have much of an impact on your money or your life face a less strict expectation, even though they are taken just as seriously from a crawlability and indexation perspective.

 

A good example of a YMYL website would be a banking website, in contrast to say, a gardening guide website. Both need to rank well and be highly crawlable so that customers can find them, but the bank will need a higher level of accountability to rank.

 

At this level, Google is more concerned about the quality and credibility of content for the purposes of ranking rather than hard factors like quantity of content or the update frequency of the content.

 

The next level of analysis is the business model of the website. This is where crawlability and subsequent indexing will be most impactful, because different business models place different demands on the quantity and frequency of content that ends up on the website. For example, eCommerce websites tend to be very large and fast growing, with far more webpages than, say, a niche software user registration website that may only need a single page to present its entire value proposition.

 

Here is where the concept of Crawl Demand becomes most relevant, since search engines like Google need to weigh the popularity of the content against the likelihood of additional content of a similar nature being produced, the likelihood of that content being updated, its popularity with website visitors, and its topical relevance. These factors are all viewed together in relation to other websites, and from there Google determines how much time should be spent crawling the website.

 

Here are the all-important SEO factors that influence Crawl Demand:

  • Completely automatic influence: Here, Google evaluates various metrics including clicks from search results, time spent on URLs, number of visits over a period, and similar signals that imply a URL is popular among readers. Keeping your URLs useful and targeted to specific internet users is your best hedge for keeping this factor as positive as possible.
  • Website owner influence: As an SEO best practice, you should always publish an XML sitemap as well as a robots.txt file. If website owners do not offer explicit instructions about which URLs they want crawled (/sitemap.xml) and which they don't want crawled (/robots.txt), then Google will attempt to crawl every available URL on the site that returns a 200 status code. This can include an assortment of URLs – duplicates, internal-only content, resource URLs like APIs and so forth – all of which can use up valuable crawl capacity. If these use up capacity at the expense of valuable, business-oriented content, that needs to be properly addressed.
  • Search engine influence: Here, Google analyses content staleness to determine whether a URL should remain in the index. In other words, Google frequently revisits a webpage to check whether any updates have occurred. Depending on the nature of the content, Google will keep the URL or discard it to keep the index fresh. So, be careful with static page content that stays the same for years on end. Such content should be rewritten on occasion, added to, or enhanced with fresh content offerings and sales messaging to keep things current.
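To make the website owner's side concrete, here is a minimal sketch of what those two files might look like for a hypothetical shop at example.com. All domains, paths, and dates are purely illustrative – your own disallow rules depend entirely on which URLs carry no search value on your site.

```
# robots.txt, served at https://www.example.com/robots.txt
User-agent: *
Disallow: /cart/
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap.xml: list only the URLs you WANT crawled and indexed -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/products/blue-widget</loc>
    <lastmod>2022-03-01</lastmod>
  </url>
</urlset>
```

Together these files steer crawl capacity away from cart and admin pages and towards the product pages that actually earn revenue.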

 

Why does knowing about and managing crawl budget matter?

Crawl budget is essential for large websites that consist of thousands of URLs (including all resources, from HTML to images and third-party assets) as well as websites with frequently published or updated content.

 

However, it's not something smaller websites should ignore either. Since URL popularity can increase or decrease crawl demand, it's possible for small sites that don't reach their crawl rate limits and have low crawl demand to be crawled less often by Google. This hurts your website's indexation and, in turn, your rankings, since your content isn't indexed efficiently. It may also mean missing out on opportunities to rank in search results at all.

 

So, apart from the highly technical focus required when managing crawl budget for large sites, it is also important for smaller sites to keep their website's content well presented, up to date, and produced at a high level of quality so the website's popularity stays high. It's all about ensuring that you don't lose valuable crawl demand and that Google stays interested in coming back for more content over the long term, which is what every SEO strategy needs.

 

Best practices for crawl budget optimisation

Although there's a stronger incentive to implement the following best practices on larger, more frequently published websites, crawl budget optimisation is also in the best interest of smaller websites with long-term SEO strategies. It can be the factor that breaks the tie in highly competitive industries pushing for valuable search engine attention, and a healthy crawl rate and an efficient experience on your site can boost your website's authority.

 

General crawl efficiency best practices for all websites

  • Manage your URLs through crawl directive documents: Website owners can consult an experienced SEO specialist, or upskill themselves enough, to generate basic /sitemap.xml and standard /robots.txt files. These documents tell Google which URLs you want it to spend time crawling and which URLs you do not want crawled. Without them, Google will attempt to crawl every available URL on the site, and if too many of those offer no user value, Google may stop the crawl without exploring the whole site and is unlikely to increase the site's crawl budget allocation. Along with crawl control documents, apply disciplined maintenance of the website by ensuring that:

 

    • Duplicate content is consolidated or otherwise addressed, since each duplicate URL counts against your crawl capacity and Google prefers to crawl unique content.
    • URLs containing content that you do not want indexed in search results are blocked from being crawled. Be careful not to block the URLs of resources required to render the content you do want indexed. Everything else should be blocked, keeping the number of URLs available for Google to crawl as small as possible.
    • 410/404 errors are returned for content that is no longer available or has been permanently deleted. Google will remember that a URL is gone and will not include it in the crawl queue on its next visit to your site. Remember that content blocked using robots.txt directives remains in the crawl queue, waiting to be crawled as soon as it is removed from the blocking document, so blocking URLs will not reallocate crawl budget to other pages of the site unless the crawl capacity is reached on that visit.
    • Soft 404 errors on the site are avoided or limited. A soft 404 is your server returning the wrong status code (i.e., a 200 OK) when the content of the page is in fact no longer available. This tends to happen when a formal 404 status code (a hard 404) has not been implemented for a removed piece of content. It gives search engines the false impression that there is content on the page, and they will attempt to index these pages, which is not good for optimal crawl budget management.

 

  • Avoid excessively long redirect chains and loops. These cause a search engine to spend time visiting numerous URLs to access a single piece of content, and in some cases the search engine won't even reach the destination page. Ideally, a page should redirect once to the final target destination, skipping all the additional URLs in between.

 

  • Ensure that your server response time is as fast as possible. This means that when Google attempts to access URL content, the content is returned as soon as possible, increasing crawl capacity over time.


  • Keep an eye on your crawl statistics. These analytics help you see whether an appropriate amount of content is being viewed on each visit to your site, and whether your site's availability is conducive to a healthy crawl relationship with Google. You will also be able to check that all the necessary parts of your website are being crawled effectively, and catch any instances of your crawl budget being overdrawn.
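Several of the checks above (redirect chains, loops, and gone content) can be audited programmatically. The sketch below is not an official Google tool – it's a small, hypothetical helper that inspects a recorded redirect trail (a list of URL/status-code pairs, as you might extract from a crawler or server log) and flags the problems described in this list.

```python
# Hypothetical audit helper, not an official tool. It works on a recorded
# redirect trail: a list of (url, status_code) pairs, ending at the final
# response, so it can run over logs from any crawler without network access.

def classify_hops(hops):
    """Flag long redirect chains, loops, and gone endpoints.

    hops -- list of (url, status) tuples in the order they were fetched.
    Returns a list of human-readable warnings (empty list = looks healthy).
    """
    warnings = []
    urls = [url for url, _ in hops]
    redirects = sum(1 for _, status in hops if status in (301, 302, 307, 308))
    if redirects > 1:
        warnings.append(f"redirect chain of {redirects} hops - collapse to one")
    if len(urls) != len(set(urls)):
        warnings.append("redirect loop detected")
    if hops[-1][1] in (404, 410):
        warnings.append("final URL is gone - fine only if intentional (hard 404/410)")
    return warnings

# Example: /old-page -> /older-page -> /new-page is a two-hop chain that
# ideally collapses into a single 301 straight to /new-page.
trail = [("/old-page", 301), ("/older-page", 301), ("/new-page", 200)]
for warning in classify_hops(trail):
    print(warning)  # prints: redirect chain of 2 hops - collapse to one
```

Note that soft 404s can't be caught from status codes alone, since the server reports a misleading 200; detecting them needs content inspection (e.g. checking rendered pages for "not found" messaging) on top of a check like this.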

Maintaining a healthy crawl budget is an often-overlooked aspect of managing the technical health of a website. Though it may not be a deal breaker for most small websites, it is still essential to managing the authority of a website in the eyes of a search engine – and ensuring that there are no unnecessary hindrances to your website's authority can go a long way in achieving long-term SEO success.

 

We'd like to go deeper on this subject and explore some historic data to improve this article, including an experiment into what the average crawl budget values are for common types of websites, and perhaps some concrete evidence of how crawl budget is influenced by site size and content production frequency – but more on that later!

 

Yellowball is a leading London SEO agency with a team of in-house content creators, technical experts, and account managers. It’s our job to get to the heart of your brand and deliver on what your audience is looking for, offering a superb user experience, and offering data-driven insights that deliver real results. Take that next step towards success in organic search, and contact our team today.
