How to easily web scrape any website with Python

3 May 2024

Web scraping is the art of automatically browsing websites through scripts to find relevant information, grab it and use it for your own purposes. Web scraping can be done in different ways and for different purposes. I'll run through some examples in the text below.

Here are the headlines for the content below:

Why I web scrape
Web scraping 101
Getting the website
Grabbing the relevant information from a webpage
General tips and tricks when web scraping
A deeper dive into Selenium
Wrapping up

Why I web scrape

I'm a small-time landlord with a few apartments that I rent out. I'd like to know what comparable rents are in my city, so I can price my apartments according to the market. Therefore I scrape data from the websites of a few local landlords, so I can easily see how others price their apartments.

Before we dive further into this, it is important to say that web scraping should be your last-ditch effort to obtain the data. Working through an API is the better solution for both parties involved (provider and receiver). However, in some scenarios web scraping is the only option available; the other landlords probably wouldn't willingly provide me with their pricing information. When web scraping you also have to be gentle, so you don't DDOS your target. That will only motivate a negative response, and hammering someone else's servers is just a dick move.

Web scraping 101

Web scraping in Python is extremely easy and can help you automate a lot of things. The scraping process boils down to two steps: first you visit the website and get its HTML code, and then you find the relevant information in that HTML. Let's start by grabbing the website.

Getting the website

To get started scraping a website in Python, all you need is a URL. There are a few different ways of doing this. Let's start with the simplest one.

Web scraping with urllib.request

urllib.request is a module for opening URLs. Below I'll use a totally made up URL, which you can simply replace with your own.
import urllib.request

# Open the URL and read the response body as a string.
html = urllib.request.urlopen('http://www.locallandlord.com/apartments').read().decode('utf-8')
The above saves the resulting HTML code to the variable html. As you can see, this method is extremely simple.

This method is, however, not very good. The web server can easily see that you are probably not a real human being accessing the website. I believe this has something to do with the headers being sent in the HTTP request. You may have some early success, but in my experience sites will eventually block this method.
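For example, urllib.request sends a default User-Agent header (something like Python-urllib/3.x) that gives the script away. You can make the request look a bit more like a browser by setting your own headers. A minimal sketch; the User-Agent string is just an example:
import urllib.request

# Mimic a browser by overriding the default User-Agent header.
req = urllib.request.Request(
	'http://www.locallandlord.com/apartments',
	headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0'},
)
html = urllib.request.urlopen(req).read().decode('utf-8')
This helps against simple checks, but it won't fool more thorough bot detection.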

To make matters even worse, JavaScript-based websites will not work with this method, because most of the page's content is loaded after the fact. JavaScript-based websites require JavaScript to run on the client machine in order to display properly, and urllib won't do that.

Let's take a look at a more advanced method.

Web scraping with Selenium

Selenium is a framework that can automate any website and web application. This means that you can control a web browser through Python, which is exactly what we want when web scraping. Because Selenium runs an actual browser, we are also able to render JavaScript-based websites and grab their content.

Another benefit of Selenium is the fact that it uses an actual browser. This means that the requests you send to a website look much more like the requests a human would make.

Let's take a look at how you can web scrape with Selenium. Making a simple request similar to the urllib.request script above will look like this:
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
import time

opts = FirefoxOptions()
opts.add_argument('--headless')
browser = webdriver.Firefox(options=opts)

# Navigate to the URL and give the page time to load before grabbing it.
browser.get('http://www.locallandlord.com/apartments')
time.sleep(4)
html = browser.page_source
time.sleep(1)
browser.quit()  # shuts down both the browser and the driver process
I add the headless argument because I am running the script on a headless server, but also to save resources. If you run it without the headless option (with the GUI) on your computer, you'll be able to see Selenium actually opening the web browser and navigating to the specified URL. You'll probably also notice that I have added wait times in the script. This is to ensure that all the content of the webpage has been loaded before I grab it, as some content may be lazy loaded.
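Fixed sleeps either wait too long or not long enough. Selenium also supports explicit waits, which block until a given element appears. A minimal sketch, assuming we wait for the div with the class 'tenancies' used later in this post:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the first apartment div to show up.
WebDriverWait(browser, 10).until(
	EC.presence_of_element_located((By.CLASS_NAME, 'tenancies'))
)
html = browser.page_source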

With the desired URL's HTML code saved, we can start to analyze it and find the desired data.

Grabbing the relevant information from a webpage

Extracting data from HTML doesn't get much easier than with BeautifulSoup, which is a great tool for working with HTML and XML.

You load the HTML into the BeautifulSoup parser and then you can “search” through the HTML. In my example each apartment on the website is within a div with the class “tenancies”, so I can do the following.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

for apartment in soup.findAll('div', {'class': 'tenancies'}):
	# Extract the data from each apartment here.
	...
Your case is guaranteed to be a bit different, but the approach is the same: you identify which element(s) on the website contain your desired data by inspecting them, and then you try your best to match that with the findAll method. The optional attribute filter, where I have included the class, can take all kinds of attributes (mostly class and/or id are used). Later in the script I even target some Angular-specific attributes, like so:
address = apartment.findAll('span', {'class': 'address'})[0].text
area = apartment.findAll('p', {'ng-bind': 'tenancy.Address.City'})[0].text
rooms = apartment.findAll('span', {'ng-bind': "::(tenancy.Rooms + ' room(s)')"})[0].text
rooms = int(rooms.replace('room(s)', ''))
As seen above, data sanitization is also a huge part of web scraping, because fields may include undesired information.
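Elements can also be missing entirely, so it pays to be defensive. A small sketch of the same room extraction with a guard (the None fallback is my own choice):
room_tags = apartment.findAll('span', {'ng-bind': "::(tenancy.Rooms + ' room(s)')"})
if room_tags:
	rooms = int(room_tags[0].text.replace('room(s)', '').strip())
else:
	rooms = None  # no room count listed for this apartment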

General tips and tricks when web scraping

Limit actual web scraping when developing scripts

As you will probably quickly realize, extracting the desired information from the retrieved HTML code takes some trial and error. Therefore I recommend scraping the website only once during development and saving the HTML. With the saved HTML you can test extracting the desired information as many times as you want until you get the desired result. If you actually retrieve the website every time you run your script, you run the risk of getting your IP address blocked from the website. To avoid this, access the website as few times as possible.
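In practice this just means writing the HTML to a file once and reading it back while developing. A minimal sketch; the filename is my own choice:
# Scrape once and save the HTML locally.
with open('apartments.html', 'w', encoding='utf-8') as f:
	f.write(html)

# While developing the extraction logic, load the saved copy instead.
with open('apartments.html', encoding='utf-8') as f:
	html = f.read()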

Downloading images with Python

You can also download images when web scraping. Using BeautifulSoup to find all the relevant images in the HTML, you can do something similar to the below (assuming you have already loaded the HTML into the soup).
import requests

images = soup.findAll('img')

for image in images:
	img_url = image['src']  # the image address lives in the src attribute
	img_data = requests.get(img_url).content
	# Use the last part of the URL as the local filename.
	filename = img_url.split('/')[-1]
	with open(filename, 'wb') as handler:
		handler.write(img_data)
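Note that src attributes are often relative (e.g. /images/foo.jpg), so you may need to resolve them against the page's URL first. urllib.parse.urljoin can handle that; a sketch, assuming the page URL from earlier:
from urllib.parse import urljoin

base_url = 'http://www.locallandlord.com/apartments'
img_url = urljoin(base_url, image['src'])  # works for both relative and absolute src values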

A deeper dive into Selenium

Selenium was mentioned above as a method for web scraping. Selenium can do much more than that, though, as it really is controlling an actual web browser (i.e. another program). This means that you can use Selenium to automate anything you can do with a web browser.

Below I'll show a few of the use cases I have found for it.

Generating screenshots of websites using Selenium

You can e.g. grab screenshots of websites, either to ensure that a website is up and running or simply to test the responsiveness of the website in question. The service Browserstack does something similar to this.

The script below uses Selenium to save an image of the given website. Simply replace the width and height values below with whatever you are testing.
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
import time

opts = FirefoxOptions()
opts.add_argument('--headless')
opts.add_argument('--width=768')
opts.add_argument('--height=1024')
browser = webdriver.Firefox(options=opts)

browser.get('http://www.locallandlord.com/apartments')
time.sleep(4)
# Save a screenshot of the current viewport to a file.
browser.get_screenshot_as_file('website.png')
time.sleep(1)
browser.quit()
The images generated can also be used for displaying website previews on social media through meta tags.
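If you want to test several breakpoints, you don't have to restart the browser with new width and height arguments; you can resize the running session instead. A sketch, with breakpoints of my own choosing:
# Reuse one browser session to capture several viewport sizes.
for width, height in [(375, 667), (768, 1024), (1920, 1080)]:
	browser.set_window_size(width, height)
	time.sleep(1)  # give the layout a moment to reflow
	browser.get_screenshot_as_file(f'website_{width}x{height}.png')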

Measuring page load times with Selenium

Lambdatest did a proper article on how to use Selenium to test page load times.
import datetime
from selenium import webdriver
from selenium.webdriver import FirefoxOptions

def toDatetime(epoch):
	# The timing values are epoch timestamps in milliseconds.
	return datetime.datetime.fromtimestamp(epoch / 1000)

opts = FirefoxOptions()
opts.add_argument('--headless')
browser = webdriver.Firefox(options=opts)

browser.get('https://www.philipsoerensen.com')

navigationStart = toDatetime(browser.execute_script('return window.performance.timing.navigationStart'))
responseStart = toDatetime(browser.execute_script('return window.performance.timing.responseStart'))
domComplete = toDatetime(browser.execute_script('return window.performance.timing.domComplete'))

# Calculate the performance
backendPerformance_calc = responseStart - navigationStart
frontendPerformance_calc = domComplete - responseStart

print(navigationStart)
print(responseStart)
print(domComplete)
print(backendPerformance_calc)
print(frontendPerformance_calc)

browser.quit()
The navigationStart attribute returns the moment the browser is ready to fetch the document using an HTTP request.
The responseStart attribute returns the time as soon as the user-agent receives the first byte from the server or from the local sources/application cache.
The domComplete attribute returns the time just before the current document/page readiness is set to “complete”. This means parsing of the page/document is complete and all the resources required for the page are downloaded.

With the above three values, we are able to calculate the backend and frontend performance of the website. This can be useful information for a web developer to determine the actual impact of different improvements made to the website.
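The subtractions above yield timedelta objects. If you'd rather have plain millisecond numbers, total_seconds() gets you there (a small sketch):
backend_ms = backendPerformance_calc.total_seconds() * 1000
frontend_ms = frontendPerformance_calc.total_seconds() * 1000
print(f'Backend: {backend_ms:.0f} ms, frontend: {frontend_ms:.0f} ms')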

Wrapping up

These were my notes about web scraping in Python using various techniques. I hope you put them to use and try to make your own web scraper.

Please feel free to add any thoughts you may have in the comment section below. I look forward to hearing from you.
