Easy Web Scraping HTML Tables with Pandas (Python)

Here’s an easy way to scrape HTML tables with Python. It only takes a few lines of code.

Create a Virtual Environment

Optionally, create a venv to isolate your Python environment:

$ python3 -m venv .venv

Activate the venv:

$ source .venv/bin/activate

Install Pandas

$ pip install pandas lxml

Pandas needs an HTML parser to read tables, and lxml is the one it tries first.

Run Your Scraper in a REPL

Start a Python REPL:

$ python

Find a web page with a table on it. For example, you could scrape the names of cat breeds from this page.

import pandas as pd

# extract all of the HTML tables
dfs = pd.read_html("https://en.wikipedia.org/wiki/List_of_cat_breeds")

# take a look at the first data frame
dfs[0].dtypes

At this point you should see the column types of the first HTML table on the page (dfs[0]):

Breed                  object
Location of origin     object
Type                   object
Body type              object
Coat length            object
Pattern                object
Image                 float64
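If you want to try read_html without hitting the network, here is a minimal sketch that parses a small HTML snippet instead of a URL. The snippet and its values are made up for illustration, and it assumes an HTML parser such as lxml is installed. It shows that read_html returns one data frame per table, that the match argument keeps only tables containing the given text, and that a column of empty cells (like Image above) comes back as missing values.

```python
from io import StringIO

import pandas as pd

# Made-up HTML for illustration only; not copied from the Wikipedia page
html = """
<table>
  <tr><th>Breed</th><th>Coat length</th><th>Image</th></tr>
  <tr><td>Abyssinian</td><td>Short</td><td></td></tr>
  <tr><td>Aegean</td><td>Semi-long</td><td></td></tr>
</table>
<table>
  <tr><th>Color</th><th>Count</th></tr>
  <tr><td>Black</td><td>3</td></tr>
</table>
"""

# read_html returns a list with one data frame per <table> element
tables = pd.read_html(StringIO(html))
print(len(tables))

# match keeps only the tables whose text matches the given string or regex
breed_tables = pd.read_html(StringIO(html), match="Breed")
print(list(breed_tables[0].columns))

# Empty cells are read as missing values (NaN)
print(breed_tables[0]["Image"].isna().all())
```

Using match can be handy on pages with many tables, since it avoids guessing which index in the list holds the table you want.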

To look at the breed column, type this:

dfs[0]["Breed"]

It should display something like this:

0            Abyssinian[5]
1                   Aegean
2      American Bobtail[7]
3         American Curl[8]
4        American Ringtail
              ...
96             Turkish Van
97       Turkish Vankedisi
98        Ukrainian Levkoy
99          Wila Krungthep
100         York Chocolate
Name: Breed, Length: 101, dtype: object

That data comes from the first column of the table:

List of cat breeds from Wikipedia

You can then use a list comprehension (or for loop) to extract the names of the cat breeds:

[b for b in dfs[0]["Breed"]]

The result will be something like this:

['Abyssinian[5]', 'Aegean', 'American Bobtail[7]', 'American Curl[8]', 'American Ringtail', 'American Shorthair', 'American Wirehair', 'Aphrodite Giant', 'Arabian Mau', 'Asian cat', 'Asian Semi-longhair', 'Australian Mist', 'Balinese', 'Bambino', 'Bengal', 'Birman', 'Bombay', 'Brazilian Shorthair', 'British Longhair', 'British Shorthair', 'Burmese', 'Burmilla', 'California Spangled', 'Chantilly-Tiffany', 'Chartreux', 'Chausie', 'Colorpoint Shorthair', 'Cornish Rex', 'Cymric, Manx Longhair or Long-haired Manx[b]', 'Cyprus', 'Devon Rex', 'Donskoy orDon Sphynx', 'Dragon Li orChinese Li Hua', 'Dwelf', 'Egyptian Mau', 'European Shorthair', 'Exotic Shorthair', 'Foldex[9]', 'German Rex', 'Havana Brown', 'Highlander', 'Himalayan orColorpoint Persian[d]', 'Japanese Bobtail', 'Javanese orColorpoint Longhair[f]', 'Kanaani', 'Khao Manee', 'Kinkalow', 'Korat', 'Korean Bobtail', 'Korn Ja', 'Kurilian Bobtail orKuril Islands Bobtail', 'Lambkin', 'LaPerm', 'Lykoi', 'Maine Coon', 'Manx', 'Mekong Bobtail', 'Minskin', 'Napoleon', 'Munchkin', 'Nebelung', 'Norwegian Forest Cat', 'Ocicat', 'Ojos Azules', 'Oregon Rex(extinct)', 'Oriental Bicolor', 'Oriental Longhair[g]', 'Oriental Shorthair[g]', 'Persian (modern)', 'Persian (traditional)', 'Peterbald', 'Pixie-bob', 'Ragamuffin orLiebling (obsolete)', 'Ragdoll', 'Raas', 'Russian Blue', 'Russian White, Russian Black and Russian Tabby', 'Sam Sawet', 'Savannah', 'Scottish Fold', 'Selkirk Rex', 'Serengeti', 'Serrade Petit', 'Siamese (modern)(for traditional, see Thai below)', 'Siberian orSiberian Forest Cat;Neva Masquerade (colorpoint variety)', 'Singapura', 'Snowshoe', 'Sokoke', 'Somali', 'Sphynx', 'Suphalak', 'Thai orTraditional, Classic, or Old-style Siamese;Wichien Maat[h]', 'Thai Lilac, Thai Blue Point and Thai Lilac Point', 'Tonkinese', 'Toyger', 'Turkish Angora', 'Turkish Van', 'Turkish Vankedisi', 'Ukrainian Levkoy', 'Wila Krungthep', 'York Chocolate']

Or use a loop to print the cat breeds like this:

for breed in dfs[0]["Breed"]:
    print(breed)
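Some of the names above still carry Wikipedia footnote markers such as [5] or [b]. One possible way to strip them is with the string methods on the column, sketched below. The small Series here is a hand-typed stand-in for dfs[0]["Breed"] so the example runs on its own, and the regular expression is an assumption about what the markers look like.

```python
import pandas as pd

# Hand-typed stand-in for dfs[0]["Breed"], using a few names from the output above
breeds = pd.Series(
    [
        "Abyssinian[5]",
        "Aegean",
        "American Bobtail[7]",
        "Cymric, Manx Longhair or Long-haired Manx[b]",
    ],
    name="Breed",
)

# Remove anything in square brackets, then trim leftover whitespace
clean = breeds.str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()

# tolist() gives a plain Python list, like the list comprehension above
breed_names = clean.tolist()
print(breed_names)
```

With the real data you would replace the stand-in Series with dfs[0]["Breed"].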

See the Pandas documentation for more information.

For another quick way to scrape web pages, this time using JavaScript, see this video.