
Scraping using Python and BeautifulSoup4

This tutorial is a blast from the past. I wrote this script five years ago to learn web scraping, and to my surprise it still works.

I’ll walk you through a Python program that scrapes comic strips from an index website.

BeautifulSoup4 is a Python library for extracting information from web pages. It provides methods to access the contents of a page’s HTML parse tree. To install BeautifulSoup4:

pip install beautifulsoup4 
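Before scraping a live site, here is a minimal sketch of how BeautifulSoup turns markup into a navigable parse tree. The HTML snippet is made up for illustration; it mimics the directory listing we’ll scrape below.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a real page (hypothetical content).
html = '<html><body><a href="1985/">1985/</a><a href="1986/">1986/</a></body></html>'

soup = BeautifulSoup(html, 'html.parser')

# find_all walks the parse tree and returns every matching tag;
# tags behave like dicts for attribute access.
links = [a['href'] for a in soup.find_all('a')]
print(links)  # ['1985/', '1986/']
```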

Using Python’s requests library together with BeautifulSoup4, let’s extract all the links from a given URL.

import requests
from bs4 import BeautifulSoup

url = 'http://downloads.esbasura.com/comics/Calvin%20and%20Hobbes/'
with requests.Session() as s:
    p = s.get(url)
    soup = BeautifulSoup(p.text, 'html.parser')
    home_links = soup.find_all('a')

This returns a list of links corresponding to the years of publication of the comic “Calvin and Hobbes”. Printing home_links shows:

[<a href="?C=N;O=D">Name</a>, <a href="?C=M;O=A">Last modified</a>, <a href="?C=S;O=A">Size</a>, <a href="?C=D;O=A">Description</a>, <a href="/comics/">Parent Directory</a>, <a href="1985/">1985/</a>, <a href="1986/">1986/</a>, <a href="1987/">1987/</a>, <a href="1988/">1988/</a>, <a href="1989/">1989/</a>, <a href="1990/">1990/</a>, <a href="1991/">1991/</a>, <a href="1992/">1992/</a>, <a href="1993/">1993/</a>, <a href="1994/">1994/</a>, <a href="1995/">1995/</a>]

You can follow each of these year links to fetch all the comic strip URLs published in that year. Notice in the output above that the first five entries are the index page’s sorting and parent-directory links, not years, which is why the loop below starts at index 5.

for index in range(5, len(home_links)):
    new_url = url + home_links[index]['href']
    p1 = s.get(new_url)
    soup1 = BeautifulSoup(p1.text, 'html.parser')
    image_links = soup1.find_all('a')
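The hardcoded starting index of 5 works for this particular Apache directory listing but would break on a different layout. A more robust alternative (a sketch, not part of the original script; `directory_links` is a hypothetical helper) filters the hrefs themselves:

```python
def directory_links(links):
    """Keep only relative subdirectory hrefs like '1985/', dropping
    Apache's sort links ('?C=N;O=D') and the absolute parent-directory link."""
    return [a['href'] for a in links
            if a['href'].endswith('/') and not a['href'].startswith(('?', '/'))]
```

This works on any list of BeautifulSoup `<a>` tags, regardless of how many header links the listing happens to have.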

Now that you have the image links, you can download each image file using the urllib library.

import urllib.request
import os

file_path = new_url.split('/')[-2]
os.makedirs(file_path, exist_ok=True)
for index in range(5, len(image_links)):
    file_name = image_links[index]['href']
    print(file_name)
    urllib.request.urlretrieve(new_url + file_name, file_path + "/" + file_name)

That is all that’s needed to scrape a web page and download all its images. You can use multi-threading to download all the directories in parallel. Here is the full script:

import requests
from bs4 import BeautifulSoup
import urllib.request
import os
import threading

def image_fetch(base, image_links):
    # Download every comic strip linked from one year's index page.
    file_path = base.split('/')[-2]
    for index in range(5, len(image_links)):
        file_name = image_links[index]['href']
        print(file_name)
        urllib.request.urlretrieve(base + file_name, file_path + "/" + file_name)


def main():
    url = 'http://downloads.esbasura.com/comics/Calvin%20and%20Hobbes/'

    with requests.Session() as s:
        p = s.get(url)
        soup = BeautifulSoup(p.text, 'html.parser')

        home_links = soup.find_all('a')
        threads = []

        for index in range(5, len(home_links)):
            new_url = url + home_links[index]['href']
            p1 = s.get(new_url)

            soup1 = BeautifulSoup(p1.text, 'html.parser')
            image_links = soup1.find_all('a')

            file_path = new_url.split('/')[-2]

            if not os.path.exists(file_path):
                os.makedirs(file_path)

            # One download thread per year directory.
            t = threading.Thread(target=image_fetch, args=(new_url, image_links))
            threads.append(t)
            t.start()

        for t in threads:
            t.join()


if __name__ == '__main__':
    main()
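Spawning one thread per directory is fine for eleven years of comics, but on a larger site you’d want to cap concurrency. A sketch of the same idea with a thread pool (`fetch_all` is a hypothetical wrapper, not part of the original script; dropping it in would mean replacing the thread bookkeeping in main()):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(jobs, worker, max_workers=4):
    """Run worker(url, links) for each (url, links) pair,
    with at most max_workers downloads in flight at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, url, links) for url, links in jobs]
        for f in futures:
            f.result()  # re-raise any exception a worker hit
```

The executor also surfaces worker exceptions through `f.result()`, which the bare-thread version above silently swallows.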

Das Ende 🙂

