A web crawler, also known as a spider or bot, is a computer program that automatically browses the pages of a website and collects the data it needs. Googlebot and Bingbot are two popular web crawlers used by the Google and Bing search engines respectively. These crawlers scan a webpage, collect its content and index it. The crawler then follows any hyperlinks on that page to move on to the next page, and the process is repeated. There are different ways a website author can tell a crawler not to crawl a particular page. One such method is to add the rel="nofollow" attribute to the HTML anchor tag <a>.
Here is a basic web crawler program written in Python that crawls a website to find any broken links.
Program Logic
This program requires three modules: sys, requests and lxml. The sys module gives the program access to the command line arguments, the requests module provides the ability to send HTTP requests, and the lxml module is used for parsing HTML documents. sys is part of the Python standard library, while requests and lxml are third-party packages that must be installed separately (for example with pip).
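As a minimal sketch of how requests and lxml work together, fetching a single page and listing the hyperlinks it contains looks like this (the URL is just a placeholder):

from lxml import html
import requests

# Fetch one page and print every hyperlink found in it
response = requests.get("https://example.com")   # placeholder URL
print(response.status_code)                      # 200 means the page is reachable
pagehtml = html.fromstring(response.content)
for href in pagehtml.xpath("//a/@href"):
    print(href)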
The start page from which the crawling begins is passed as an argument to the program. For example, to crawl the website www.mallukitchen.com you run the command:
python WebCrawl.py https://www.mallukitchen.com
where WebCrawl.py is the name of your Python program. Note that the URL must include the scheme (http:// or https://), since the requests library cannot fetch a URL without one.
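The program reads this argument from sys.argv[1], as the source below shows. A small guard for a missing argument could look like the following sketch (the usage message is illustrative and not part of the original program):

import sys

# sys.argv[0] is the script name; sys.argv[1] is the first argument (the start URL)
if len(sys.argv) < 2:
    print("Usage: python WebCrawl.py <start-url>")
    sys.exit(1)

base = sys.argv[1]
print("Crawl will start from:", base)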
A list variable called site_links stores all the hyperlinks that are discovered. When execution begins, this list contains just the base URL.
Each URL is checked to determine whether it is an absolute link, a relative link (i.e., a location relative to the current page), or a root-relative link (i.e., a location relative to the base directory of the site). Relative and root-relative links are converted into absolute form and then passed to the getlinks(url) function. Inside the getlinks(url) function an HTTP request is made to the URL. The resulting response code is checked to see whether the link is accessible or broken, and the corresponding result is printed.
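The program distinguishes these cases with simple startswith string checks, as the source below shows. For comparison, the standard library's urllib.parse.urljoin resolves all three link styles in one call; a small sketch with placeholder URLs:

from urllib.parse import urljoin

base = "https://www.example.com/recipes/"    # placeholder current page

# urljoin resolves each link style against the current page
print(urljoin(base, "https://other.example.org/page"))  # absolute link -> unchanged
print(urljoin(base, "/about.html"))                     # root-relative -> https://www.example.com/about.html
print(urljoin(base, "curry.html"))                      # relative -> https://www.example.com/recipes/curry.html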
The response content is also parsed to obtain all the hyperlinks on that page. Links with the rel="nofollow" attribute and links that point to page sections (those whose href contains a #, as the XPath expression in the source below shows) are ignored. The discovered hyperlinks are appended to the site_links list variable and the process is repeated.
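The "repeat" step relies on a Python detail worth noting: the main loop iterates over site_links while new links are appended to it, and a for loop over a list does visit elements appended during iteration, so the list effectively acts as a queue of pages still to be crawled. A minimal sketch of that behaviour (the values are purely illustrative):

# Items appended to a list during a for loop are still visited,
# which is what lets site_links act as a work queue.
work = ["page1"]
for item in work:
    print("visiting", item)
    if item == "page1":
        work.append("page2")   # discovered while visiting page1; will also be visited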
Program Source
#!/usr/bin/python
from lxml import html
import requests
import sys

# Base url from where the crawl begins
base = sys.argv[1]

# Function to get the links and their status
def getlinks(url):
    response = requests.get(url)
    if response.status_code == 200:
        status = '[OK]'
    else:
        status = '[BROKEN]'
    print(status, url)
    pagehtml = html.fromstring(response.content)
    page_links = pagehtml.xpath("//a[not(@rel='nofollow') and not(contains(@href,'#'))]/@href")
    return page_links

site_links = [base]

# Repeat for each link discovered
for item in site_links:
    # absolute link
    if item.startswith('http://') or item.startswith('https://'):
        url = item
    # root-relative link
    elif item.startswith('/'):
        url = base + item
    else:
        url = base + "/" + item
    page_links = getlinks(url)
    for link in page_links:
        if link not in site_links:
            site_links.append(link)
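One limitation worth noting: requests.get raises an exception (for example requests.exceptions.ConnectionError) when a host cannot be reached at all, rather than returning a status code, so the program above would stop on such a link. A hedged variant of getlinks that treats any request error as a broken link, reusing the imports from the program above, might look like this sketch (it is not part of the original program):

def getlinks(url):
    try:
        response = requests.get(url)
    except requests.exceptions.RequestException:
        # The host could not be reached at all; report it as broken and return no links
        print('[BROKEN]', url)
        return []
    status = '[OK]' if response.status_code == 200 else '[BROKEN]'
    print(status, url)
    pagehtml = html.fromstring(response.content)
    return pagehtml.xpath("//a[not(@rel='nofollow') and not(contains(@href,'#'))]/@href")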