Easy Python Web Scraping with Pandas

Here’s an easy way to scrape HTML tables from the Web with Python. It only takes a few lines of code.

Create a Virtual Environment

Optionally, create a venv to isolate your Python environment. Type the following commands in a terminal on a computer that has Python 3 installed.

python3 -m venv .venv

Activate the venv (this form of the command is for macOS and Linux shells):

source .venv/bin/activate

Install Pandas

This command installs Pandas in the virtual environment, along with lxml, which read_html() needs in order to parse HTML:

pip install pandas lxml

Run Your Scraper in a REPL

Start a Python REPL:

python

Find a webpage with a table on it. For example, you could scrape the names of cat breeds from Wikipedia’s list of cat breeds.

import pandas as pd

# extract all of the HTML tables
dfs = pd.read_html("https://en.wikipedia.org/wiki/List_of_cat_breeds")

# give the first table a better name
table = dfs[0]

# take a look at the data types
table.dtypes

At this point you should see the data types of the columns in the first HTML table on the page (which is now named table in the program):

Breed                  object
Location of origin     object
Type                   object
Body type              object
Coat length            object
Pattern                object
Image                 float64
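The Image column most likely shows up as float64 because its cells hold pictures rather than text, so Pandas has nothing to read and fills the column with missing values. Here is a minimal sketch of dropping such empty columns. The sample frame is a hand-made stand-in, not the scraped table, so you can try it without a network connection.

```python
import pandas as pd

# Hand-made stand-in for the scraped table (an assumption for illustration).
sample = pd.DataFrame(
    {
        "Breed": ["Abyssinian[5]", "Aegean"],
        "Image": [float("nan"), float("nan")],
    }
)

# Drop every column whose values are all missing.
cleaned = sample.dropna(axis="columns", how="all")
print(list(cleaned.columns))
```

On the real data, the same call would be table.dropna(axis="columns", how="all").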

To look at the breed column, type this:

table["Breed"]

It should display output something like this:

0            Abyssinian[5]
1                   Aegean
2      American Bobtail[7]
3         American Curl[8]
4        American Ringtail
              ...
96             Turkish Van
97       Turkish Vankedisi
98        Ukrainian Levkoy
99          Wila Krungthep
100         York Chocolate
Name: Breed, Length: 101, dtype: object

That data comes from the first column of the table:

List of cat breeds from Wikipedia
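Picking dfs[0] only works while the table you want happens to be the first one on the page. As a sketch of a sturdier approach, read_html() accepts a match argument that keeps only the tables containing some text. The HTML string below is a made-up stand-in for a downloaded page, and it assumes lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a downloaded page: two small tables.
html = """
<table>
  <tr><th>Note</th></tr>
  <tr><td>An unrelated table</td></tr>
</table>
<table>
  <tr><th>Breed</th></tr>
  <tr><td>Aegean</td></tr>
  <tr><td>Bengal</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string or regex.
matches = pd.read_html(StringIO(html), match="Breed")
breeds = matches[0]
print(len(matches))
print(breeds["Breed"].to_list())
```

With a real page you would pass the URL instead of the StringIO object, for example pd.read_html(url, match="Breed"), assuming the word Breed appears only in the table you want.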

You can then extract data in various ways. For example, extract the cat breed names as a list:

table["Breed"].to_list()

The result will be something like this:

['Abyssinian[5]', 'Aegean', 'American Bobtail[7]', 'American Curl[8]', 'American Ringtail', 'American Shorthair', 'American Wirehair', 'Aphrodite Giant', 'Arabian Mau', 'Asian cat', 'Asian Semi-longhair', 'Australian Mist', 'Balinese', 'Bambino', 'Bengal', 'Birman', 'Bombay', 'Brazilian Shorthair', 'British Longhair', 'British Shorthair', 'Burmese', 'Burmilla', 'California Spangled', 'Chantilly-Tiffany', 'Chartreux', 'Chausie', 'Colorpoint Shorthair', 'Cornish Rex', 'Cymric, Manx Longhair or Long-haired Manx[b]', 'Cyprus', 'Devon Rex', 'Donskoy orDon Sphynx', 'Dragon Li orChinese Li Hua', 'Dwelf', 'Egyptian Mau', 'European Shorthair', 'Exotic Shorthair', 'Foldex[9]', 'German Rex', 'Havana Brown', 'Highlander', 'Himalayan orColorpoint Persian[d]', 'Japanese Bobtail', 'Javanese orColorpoint Longhair[f]', 'Kanaani', 'Khao Manee', 'Kinkalow', 'Korat', 'Korean Bobtail', 'Korn Ja', 'Kurilian Bobtail orKuril Islands Bobtail', 'Lambkin', 'LaPerm', 'Lykoi', 'Maine Coon', 'Manx', 'Mekong Bobtail', 'Minskin', 'Napoleon', 'Munchkin', 'Nebelung', 'Norwegian Forest Cat', 'Ocicat', 'Ojos Azules', 'Oregon Rex(extinct)', 'Oriental Bicolor', 'Oriental Longhair[g]', 'Oriental Shorthair[g]', 'Persian (modern)', 'Persian (traditional)', 'Peterbald', 'Pixie-bob', 'Ragamuffin orLiebling (obsolete)', 'Ragdoll', 'Raas', 'Russian Blue', 'Russian White, Russian Black and Russian Tabby', 'Sam Sawet', 'Savannah', 'Scottish Fold', 'Selkirk Rex', 'Serengeti', 'Serrade Petit', 'Siamese (modern)(for traditional, see Thai below)', 'Siberian orSiberian Forest Cat;Neva Masquerade (colorpoint variety)', 'Singapura', 'Snowshoe', 'Sokoke', 'Somali', 'Sphynx', 'Suphalak', 'Thai orTraditional, Classic, or Old-style Siamese;Wichien Maat[h]', 'Thai Lilac, Thai Blue Point and Thai Lilac Point', 'Tonkinese', 'Toyger', 'Turkish Angora', 'Turkish Van', 'Turkish Vankedisi', 'Ukrainian Levkoy', 'Wila Krungthep', 'York Chocolate']
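Some of the names carry footnote markers such as [5] or [g] left over from the Wikipedia page. One possible way to tidy them up is a small regular expression. This sketch uses a few names copied from the output above, and the pattern is an assumption about how the markers look, so check it against your own data.

```python
import re

# A few names copied from the output above.
names = ["Abyssinian[5]", "Aegean", "American Bobtail[7]", "Oriental Longhair[g]"]

# Matches a trailing bracketed marker such as [5] or [g].
FOOTNOTE = re.compile(r"\[[0-9a-z]+\]$")

clean_names = [FOOTNOTE.sub("", name) for name in names]
print(clean_names)
# ['Abyssinian', 'Aegean', 'American Bobtail', 'Oriental Longhair']
```

The same pattern should also work directly on the column with table["Breed"].str.replace(r"\[[0-9a-z]+\]$", "", regex=True).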

Or save the entire table as a CSV file:

table.to_csv("cats.csv")

Methods for other formats are available too, like to_json(), to_string(), to_dict(), to_sql(), to_markdown(), to_html(), and more. Some of those methods require extra libraries. For example, to_markdown() raises an error about a missing tabulate package until you install it with pip install tabulate. See the Pandas documentation for more information.
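To give a feel for those export methods, here is a small sketch using to_dict() and to_json(), neither of which needs an extra library. The sample frame is a hand-made stand-in for the scraped table.

```python
import json

import pandas as pd

# Hand-made stand-in for the scraped table (an assumption for illustration).
sample = pd.DataFrame({"Breed": ["Aegean", "Bengal"]})

# One dictionary per row.
records = sample.to_dict(orient="records")

# The same rows as a JSON string.
as_json = sample.to_json(orient="records")

print(records)
print(as_json)
```

The orient="records" option produces one entry per row, which is often the handiest shape for passing data to other programs.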

Here’s a screenshot that shows an example of running similar code:

Web scraping with Pandas in Python

For another quick way to scrape webpages, but using JavaScript, see the JavaScript Web scraping tutorial.
