Web crawling made easy with Google Sheets

Google Sheets template with a custom function for quick and easy web crawling.

Using our custom web crawl function in Google Sheets

If you work in SEO, you’ll know just how arduous web crawling can be at times – it often involves a lot of copying and pasting from a tool to a spreadsheet and vice versa to get the data you're after.

There is, however, a canny solution. With a bit of tinkering in Apps Script –  a JavaScript console that allows you to talk to and enhance Google products and services –you can bring web crawling functionality to Google Sheets. It’s perfect for quick and simple checkups like checking status codes, robots directives, canonicals, titles and headings, the list goes on…

We use it regularly – and it saves us a ton of time, meaning we can prioritise other high-value SEO activities. Here’s what you need to know to get started.

Getting started

First things first, if you're going to crawl a site you don’t own, get permission from the domain owner.
Once you’ve been given the green light, you can get started by making a copy of our Google Sheet and using the custom formula “=crawl(url, element)”, inputting the desired URL and the element you would like to extract from the page.

The element is a HTML element. Currently you can pass these elements to the formula:

  • title
  • robots
  • canonical
  • meta keywords
  • href lang
  • status code
  • h1 
  • h2 
  • h3 
  • h4 
  • outlinks 
  • images

In addition to these elements you can also extract text using a class name or an id.

To extract text using a class name you must prepend a “.”. For example:

=crawl(“https://www.meltdigital.com/knowledge-hub/how-to-easily-automate-your-keyword-research-in-google-sheets/, “.author”)

To extract text using an ID you must prepend a “#”. For example:

=crawl(“https://www.meltdigital.com/knowledge-hub/how-to-easily-automate-your-keyword-research-in-google-sheets/, “#menu-header”)

For more advanced extractions using the class name or id, check out MDN’s documentation for the document.querySelector(), which is what we’re using behind the scenes in Apps Script.

And let me know if you have any suggestions for additional elements to extract.

Behind the scenes

Out of the box, Google Sheets does not have the function “=crawl()”.  We have written a simple function in Apps Script and imported it into our Google Sheets.

If you navigate to the menu in the Google Sheet and tap on Extensions and then Apps Script it will open a new window in Apps Script. Here you’ll be able to see the custom function we have created written in JavaScript:

function getContent(url) {  return UrlFetchApp.fetch(url).getContentText()}/*** Crawl page and extract an element.** @customfunction*/function crawl(url, element) {  const content = getContent(url);  const $ = Cheerio.load(content);  const elemObj = {      'title': $("title").text(),      'canonical': $("link[rel='canonical']").attr("href"),      'h1': String([...$('h1')].map(i => $(i).text())),      'h2': String([...$('h2')].map(i => $(i).text())),      'h3': String([...$('h3')].map(i => $(i).text())),      'h4': String([...$('h4')].map(i => $(i).text())),      'robots': $("meta[name='robots']").attr("content"),      'description': $("meta[name='description']").attr("content"),      'status code': UrlFetchApp.fetch(url).getResponseCode(),      'outlinks': String([...$('a')].map(i => i['attribs']['href'])),      'p': String([...$('p')].map(i => $(i).text())),      'images': String([...$('img')].map(i => i['attribs']['src'])),      'meta keywords': $("meta[name='keywords']").attr("content"),      'hreflang': String([...$("link[rel='alternate']")].map(i => i['attribs']['href']))    }  if(element in elemObj){    return elemObj[element];  }  let extraction = $(element).text()  return extraction}

The code above has two functions. The first function named “getContent”  is used to open the page and return the source code. The function uses the inbuilt App Script function UrlFetchApp to make a HTTP request which returns the source code as the response.

The second function named “crawl” uses the source code from the first function and extracts the element specified within the Google Sheet. We have used a library called “cheerio”, which allows us to query the source code as we would query the DOM (Document Object Module) with JavaScript – using the element name, class name or id. 

Within the second function we also carry out some simple processing to get the required attribute. For example, when looking at the title, we want to grab what's within the title tag. However, when looking at the canonical, we want to pull the href attribute.

Both these functions are imported into our Google Sheet to be used as formulas.

Wrap up

This custom function brings web crawling to Google Sheets. It's perfect for smaller scale SEO health check ups, such as checking for pages status code. As it's built in Apps Script and uses Google's infrastructure, it’s reasonably speedy – Google Sheets makes asynchronous requests (they are not crawling one URL at a time).

It's worth noting that for larger crawls this won't cut the mustard (and it is not supposed to). You’ll still need your Screaming Frog, your DeepCrawl, your Botify and so on.

It also doesn’t function in the same way a traditional web crawler does, which starts with one page and discovers the entire site (hopefully) using internal links. This Google Sheet function accepts a single URL and extracts items from the given page.In regards to Apps Script more broadly, we are just beginning to scratch the surface. There’s a world of opportunity out there. For example, take a look at our post on SEO automation with Apps Script - in this post we look at how we fetch data from third party API’s, schedule crawls with email alerts and much more.

We’d love to hear your thoughts and recommendations – we’ll be making updates with your feedback. And if you're requiring custom crawls for large domains, we’d love to hear from you. That’s right up our street.

Other Posts

All Posts