Scraping using Python and BeautifulSoup4
This tutorial is a blast from the past. I wrote this script five years ago to learn web scraping, and to my surprise it still works.
I’ll walk you through a Python program that scrapes comic strips from an index website.
BeautifulSoup4 is a Python library for extracting information from web pages. It provides methods to access the contents of a page’s HTML parse tree. To install BeautifulSoup4:
pip install beautifulsoup4
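If you haven’t used BeautifulSoup before, here’s a minimal sketch of what it does; the HTML string is made up purely for illustration:

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet to show what find_all('a') returns.
html = '<html><body><a href="1985/">1985/</a> <a href="1986/">1986/</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a'):          # every <a> tag in the parse tree
    print(link['href'], link.text)       # prints: 1985/ 1985/  then  1986/ 1986/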
Using Python’s requests library together with BeautifulSoup4, let’s extract all the links from the given URL.
import requests
from bs4 import BeautifulSoup

url = 'http://downloads.esbasura.com/comics/Calvin%20and%20Hobbes/'

with requests.Session() as s:
    p = s.get(url)
    soup = BeautifulSoup(p.text, 'html.parser')   # parse the index page
    home_links = soup.find_all('a')               # every <a> tag on the page
This returns a list of anchor tags, most of which correspond to the years of publication of the comic “Calvin and Hobbes”. When you print home_links, you see:
[<a href="?C=N;O=D">Name</a>, <a href="?C=M;O=A">Last modified</a>, <a href="?C=S;O=A">Size</a>, <a href="?C=D;O=A">Description</a>, <a href="/comics/">Parent Directory</a>, <a href="1985/">1985/</a>, <a href="1986/">1986/</a>, <a href="1987/">1987/</a>, <a href="1988/">1988/</a>, <a href="1989/">1989/</a>, <a href="1990/">1990/</a>, <a href="1991/">1991/</a>, <a href="1992/">1992/</a>, <a href="1993/">1993/</a>, <a href="1994/">1994/</a>, <a href="1995/">1995/</a>]
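The first five entries are the directory listing’s sort headers (“Name”, “Last modified”, “Size”, “Description”) and the “Parent Directory” link, which is why the loops below start at index 5. If you’d rather not depend on the link order, you could filter on the href instead; a small sketch, assuming the year directories are the only hrefs that start with a digit:

# Keep only the links whose href looks like a year directory, e.g. "1985/".
year_links = [a for a in home_links if a['href'][0].isdigit()]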
You can follow each of these year links to fetch all the comic strip URLs published in that year.
for index in range(5, len(home_links)):               # skip the five non-year links
    new_url = url + home_links[index]['href']          # e.g. .../Calvin%20and%20Hobbes/1985/
    p1 = s.get(new_url)
    soup1 = BeautifulSoup(p1.text, 'html.parser')
    image_links = soup1.find_all('a')                   # links to the individual strips
Now that you have the image links, you can download the image files they point to using the urllib library.
import urllib.request

file_path = new_url.split('/')[-2]                      # e.g. '1985'
for index in range(5, len(image_links)):                # again skip the listing's header links
    file_name = image_links[index]['href']
    print(file_name)
    urllib.request.urlretrieve(new_url + file_name, file_path + "/" + file_name)
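Note that urlretrieve won’t create the destination directory for you, so create it before downloading (the complete program below does this with os.makedirs):

import os

if not os.path.exists(file_path):
    os.makedirs(file_path)                # create the per-year directory, e.g. '1985/'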
That is all that’s needed to scrape the page and download all the images. You can use multithreading to download the year directories in parallel and speed things up.
import requests
from bs4 import BeautifulSoup
import urllib.request
import os
import threading


def image_fetch(base, image_links):
    # Download every strip listed under one year directory.
    file_path = base.split('/')[-2]
    for index in range(5, len(image_links)):
        file_name = image_links[index]['href']
        print(file_name)
        urllib.request.urlretrieve(base + file_name, file_path + "/" + file_name)


def main():
    url = 'http://downloads.esbasura.com/comics/Calvin%20and%20Hobbes/'
    with requests.Session() as s:
        p = s.get(url)
        soup = BeautifulSoup(p.text, 'html.parser')
        home_links = soup.find_all('a')
        threads = []
        for index in range(5, len(home_links)):
            new_url = url + home_links[index]['href']
            p1 = s.get(new_url)
            soup1 = BeautifulSoup(p1.text, 'html.parser')
            image_links = soup1.find_all('a')
            file_path = new_url.split('/')[-2]
            if not os.path.exists(file_path):
                os.makedirs(file_path)
            # Download each year's strips in its own thread.
            t = threading.Thread(target=image_fetch, args=(new_url, image_links))
            threads.append(t)
            t.start()
        for t in threads:
            t.join()


if __name__ == '__main__':
    main()
The End 🙂