Scraping data with Scrapy and PostgreSQL

Introduction

In the following tutorial I will use Scrapy to retrieve The Legend of Zelda items from Gamepedia. I will focus on the items that have both a name and a cost and add them to the database. In an earlier post I installed PostgreSQL on my VPS and configured it to work with Python. These items will be used in the next project, where I will build a simple REST API with Flask on top of them.

Objective

  • Retrieve data from a website using Scrapy.
  • Store results in a PostgreSQL database.

Prerequisites

  • pipenv installed (or any other Python virtual environment tool).
  • Scrapy, sqlalchemy and psycopg2 installed in the environment.
  • PostgreSQL installed.
$ pipenv --python 3
$ pipenv shell
data-retrieval $ python --version
Python 3.9.1
data-retrieval $ pipenv install Scrapy
data-retrieval $ pip freeze | grep Scrapy
Scrapy==2.4.1
data-retrieval $ pip freeze | grep SQLAlchemy
SQLAlchemy==1.3.22
data-retrieval $ pip freeze | grep psycopg2
psycopg2==2.8.6

Initialize Scrapy

Start a new Scrapy project inside your new Python project. I picked the creative name crawl for the Scrapy project.

data-retrieval $ scrapy startproject crawl
New Scrapy project 'crawl', using template directory '/Users/jitsejan/.local/share/virtualenvs/testscrapy-jJKHMw2I/lib/python3.9/site-packages/scrapy/templates/project', created in:
    /Users/jitsejan/code/testscrapy/data-retrieval/crawl

You can start your first spider with:
    cd crawl
    scrapy genspider example example.com

Running the startproject command will create a folder with the structure outlined below. There is a top folder with the project name (crawl) that contains the Scrapy configuration and a subfolder with the same name containing the actual crawling code.

NB: I don't want to go into too much detail about Scrapy, because there are many tutorials for the tool online and because I normally use requests with lxml to write (very simple) data crawlers. Many people prefer BeautifulSoup or other higher-level scraping libraries, so feel free to go for those instead. I picked Scrapy in this particular case because it provides a nice scaffold for working with crawlers and databases, but all of this could be done from scratch as well.
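
For comparison, a requests plus lxml version of the link extraction done later in this post only takes a handful of lines. This is merely a sketch: it requires the requests, lxml and cssselect packages, the overview URL is a placeholder (the generated spider below uses the same one) and the CSS selector is the one introduced further down.

import requests
from lxml import html

base_url = "https://zelda.gamepedia.com"
# Placeholder; point this at the wiki page that lists the Zelda items.
overview_url = f"{base_url}/"

response = requests.get(overview_url)
tree = html.fromstring(response.content)
# Same CSS selector as used in the Scrapy spider further down.
for link in tree.cssselect("li.gallerybox .gallerytext p a"):
    print(f"{base_url}{link.get('href')}")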

data-retrieval $ tree crawl
crawl
├── crawl
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2 directories, 7 files

You could choose not to use the generator and write the Scrapy files yourself, but for simplicity I use the boilerplate that comes with Scrapy. Now navigate to the top-level project folder and create the spider (crawler) using genspider. In my case I will be crawling data from Gamepedia.com about items found in the Zelda games. A completely random choice of website, but it fits the theme I normally use when playing around with code.

data-retrieval $ cd crawl
data-retrieval/crawl $ scrapy genspider zelda_items zelda.gamepedia.com
Created spider 'zelda_items' using template 'basic' in module:
  crawl.spiders.zelda_items

If we look again at the tree structure, we see that a new file (zelda_items.py) has been created inside the spiders folder. This file contains the basic structure for a spider.

data-retrieval/crawl $ tree
.
├── crawl
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── zelda_items.py
└── scrapy.cfg

The zelda_items.py file contains the minimal setup needed to get started with crawling data.

import scrapy


class ZeldaItemsSpider(scrapy.Spider):
    name = 'zelda_items'
    allowed_domains = ['zelda.gamepedia.com']
    start_urls = ['http://zelda.gamepedia.com/']

    def parse(self, response):
        pass

Set up the scraper

The first thing we want to crawl is the list of links to all the items, so that we can fetch the detailed information for each of them. The code to retrieve these links is added to the parse function. Looking at the page source, it is easy to copy the selector that points to the right element.

Copying the selector from the browser's developer tools gives the following:

#mw-content-text > div > center:nth-child(1) > ul > li:nth-child(2) > div > div.gallerytext > p > a

Instead of that long path, I will use the simpler CSS selector li.gallerybox .gallerytext p a::attr(href) to get the hyperlinks to all the items.

def parse(self, response):
    """
    Retrieve the links to the items
    """
    selector = "li.gallerybox .gallerytext p a::attr(href)"
    for href in response.css(selector).extract():
        yield Request(f"{self.base_url}{href}", callback=self.parse_item)

For each link that is retrieved, the parse_item function is executed. The structure of the item page is a little ugly, but we only care about the information table. Since this is a wiki, the data can be quite unstructured and not all tables have the same fields, which should be taken into account when retrieving the data. In my particular case I want to retrieve the name and the price of each item (on the condition that the table contains cost information). The following shows a simplified version of the HTML from which I am extracting the data.

<table class="infobox wikitable">
    <tbody>
        <tr>
            <th class="infobox-name centered" colspan="2"><span class="term">Arrow</span></th>
        </tr>
        <tr>
            <td class="infobox-image centered" colspan="2"> ... </td>
        </tr>
        <tr class="infobox-field">
            <th>Main appearance(s)</th>
            <td> ... </td>
        </tr>
        <tr class="infobox-field">
            <th>Other appearance(s)</th>
            <td> ... </td>
        </tr>
        <tr class="infobox-field">
            <th>Cost(s)</th>
            <td>
                <div class="infobox-field-content">
                    80 <a href="/Rupee" title="Rupee">Rupees</a> <sup>(<b><span title="The Legend of Zelda">TLoZ</span></b>)</sup>
                </div>
            </td>
        </tr>
        <tr class="infobox-field">
            <th>Location(s)</th>
            <td> ... </td>
        </tr>
        <tr class="infobox-field">
            <th>Use(s)</th>
            <td> ... </td>
        </tr>
        <tr class="infobox-field">
            <th>Strength</th>
            <td> ... </td>
        </tr>
        <tr class="infobox-field">
            <th>Comparable item(s)</th>
            <td> ... </td>
        </tr>
    </tbody>
</table>

To fetch the data I will again use a CSS selector for the name, but an XPath selector to find the right element in the table for the cost. I explained in an earlier article how I figured out the right syntax to get this information. Only when the price is an integer will I return the item; if there is no price, the item is ignored. In the future this code will most probably change to retrieve more items and different fields.

def parse_item(self, response):
    """
    Retrieve the item details
    """
    name_sel = "meta[property='og:title']::attr(content)"
    price_sel = "//tr[th//text()[contains(., 'Cost(s)')]]/td/div/text()"
    name = response.css(name_sel).get()
    price = response.xpath(price_sel).get()
    if price and price.strip().isdigit():
        yield {"name": name, "price": int(price)}
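
For completeness, the sketch below shows how the pieces above fit together in zelda_items.py. Note that parse relies on Request (imported from scrapy) and on a base_url attribute that is not part of the generated skeleton; the start URL is kept as the generated placeholder and should point at the wiki page that lists the items.

"""crawl/crawl/spiders/zelda_items.py (sketch of the assembled spider)"""
from scrapy import Request, Spider


class ZeldaItemsSpider(Spider):
    name = "zelda_items"
    allowed_domains = ["zelda.gamepedia.com"]
    # Placeholder from genspider; point this at the items overview page.
    start_urls = ["http://zelda.gamepedia.com/"]
    # Used to turn the relative item links into absolute URLs.
    base_url = "https://zelda.gamepedia.com"

    def parse(self, response):
        """Retrieve the links to the items."""
        selector = "li.gallerybox .gallerytext p a::attr(href)"
        for href in response.css(selector).extract():
            yield Request(f"{self.base_url}{href}", callback=self.parse_item)

    def parse_item(self, response):
        """Retrieve the item details."""
        name_sel = "meta[property='og:title']::attr(content)"
        price_sel = "//tr[th//text()[contains(., 'Cost(s)')]]/td/div/text()"
        name = response.css(name_sel).get()
        price = response.xpath(price_sel).get()
        if price and price.strip().isdigit():
            yield {"name": name, "price": int(price)}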

Crawl the data

Now the scraper is ready to be executed and retrieve the items. Run the crawler and verify that it indeed returns the items you would expect. There is no output that stores the items yet, but the log tells me that 10 items actually had both a name and a cost defined ('item_scraped_count': 10,). Note that I set the log level to INFO to prevent an information overload in the console.

data-retrieval/crawl $ scrapy crawl zelda_items
2021-01-05 21:26:07 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: crawl)
2021-01-05 21:26:07 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.9.1 (default, Dec 17 2020, 03:41:37) - [Clang 12.0.0 (clang-1200.0.32.27)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020), cryptography 3.3.1, Platform macOS-10.15.7-x86_64-i386-64bit
2021-01-05 21:26:07 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'crawl',
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'crawl.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['crawl.spiders']}
2021-01-05 21:26:07 [scrapy.extensions.telnet] INFO: Telnet Password: 7313d2472beec312
2021-01-05 21:26:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-01-05 21:26:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-01-05 21:26:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-01-05 21:26:09 [scrapy.middleware] INFO: Enabled item pipelines:
['crawl.pipelines.CrawlPipeline']
2021-01-05 21:26:09 [scrapy.core.engine] INFO: Spider opened
2021-01-05 21:26:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-01-05 21:26:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-01-05 21:26:13 [scrapy.core.engine] INFO: Closing spider (finished)
2021-01-05 21:26:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 22501,
 'downloader/request_count': 75,
 'downloader/request_method_count/GET': 75,
 'downloader/response_bytes': 1941159,
 'downloader/response_count': 75,
 'downloader/response_status_count/200': 37,
 'downloader/response_status_count/301': 38,
 'dupefilter/filtered': 1,
 'elapsed_time_seconds': 4.146036,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 1, 5, 21, 26, 13, 283001),
 'item_scraped_count': 10,
 'log_count/INFO': 10,
 'memusage/max': 73228288,
 'memusage/startup': 73224192,
 'request_depth_max': 1,
 'response_received_count': 37,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 73,
 'scheduler/dequeued/memory': 73,
 'scheduler/enqueued': 73,
 'scheduler/enqueued/memory': 73,
 'start_time': datetime.datetime(2021, 1, 5, 21, 26, 9, 136965)}
2021-01-05 21:26:13 [scrapy.core.engine] INFO: Spider closed (finished)

Store the data

First of all, I define the schema of the element that I am crawling in items.py. There is no fancy schema yet, but this can obviously be improved in the future when more fields are retrieved and the actual data types start to make a difference.

"""crawl/crawl/items.py"""
from scrapy import Field, Item


class ZeldaItem(Item):
    """ Definition of the ZeldaItem """

    name = Field()
    price = Field()
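
A ZeldaItem behaves like a dictionary, so the Items(**item) unpacking used in the pipeline later on works whether the spider yields plain dicts (as it does above) or ZeldaItem instances. A quick illustration with made-up values:

from crawl.items import ZeldaItem

item = ZeldaItem(name="Arrow", price=80)
print(item["name"], item["price"])  # field access works like a dict
print(dict(item))                   # {'name': 'Arrow', 'price': 80}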

The middlewares.py file is left untouched for this project. The important bit for storing data in a database sits in a new file, models.py. As described before, I use SQLAlchemy to connect to the PostgreSQL database. The database details are stored in settings.py (see below) and are used to create the SQLAlchemy engine. I define the Items model with the two fields and use create_items_table to create the table.

"""crawl/crawl/models.py"""
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.engine.base import Engine
from sqlalchemy.engine.url import URL
from sqlalchemy.ext.declarative import declarative_base

from crawl import settings

DeclarativeBase = declarative_base()


def db_connect() -> Engine:
    """
    Creates database connection using database settings from settings.py.
    Returns sqlalchemy engine instance
    """
    return create_engine(URL(**settings.DATABASE))


def create_items_table(engine: Engine):
    """
    Create the Items table
    """
    DeclarativeBase.metadata.create_all(engine)


class Items(DeclarativeBase):
    """
    Defines the items model
    """

    __tablename__ = "items"

    name = Column("name", String, primary_key=True)
    price = Column("price", Integer)
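
To sanity-check the database settings outside of Scrapy, the same helpers can be used directly. A small sketch, assuming the POSTGRES_* environment variables from settings.py are set and the crawl package is importable:

from crawl.models import create_items_table, db_connect

engine = db_connect()        # builds the engine from settings.DATABASE
create_items_table(engine)   # creates the 'items' table if it does not exist
print(engine.table_names())  # should include 'items'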

Inside pipelines.py the spider is connected to the database. When the pipeline is started, it initializes the database connection, creates the engine, creates the table and sets up a SQLAlchemy sessionmaker. The process_item function is part of the default pipeline code and is executed for every item yielded by the scraper; in this case that means it is triggered every time an item with both a name and a cost is retrieved. For every item it is first checked whether it already exists in the database, and only if it does not exist yet is it added. Remember to always commit() after adding (or removing) items in the table.

"""crawl/crawl/pipelines.py"""
from sqlalchemy.orm import sessionmaker

from crawl.models import Items, create_items_table, db_connect


class CrawlPipeline:
    def __init__(self):
        """
        Initializes database connection and sessionmaker.
        Creates items table.
        """
        engine = db_connect()
        create_items_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """
        Process the item and store it in the database.
        """
        session = self.Session()
        # Skip items that already exist in the table.
        instance = session.query(Items).filter_by(**item).one_or_none()
        if instance:
            session.close()
            return item
        zelda_item = Items(**item)

        try:
            session.add(zelda_item)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()

        return item

Finally, settings.py is short and contains the configuration for the crawler. The only things I have added to the generated file are the ITEM_PIPELINES, DATABASE and LOG_LEVEL settings. You could put your credentials directly in this file, but I recommend keeping them secret and storing them elsewhere; here they are read from environment variables.

"""crawl/crawl/settings.py"""
import os

BOT_NAME = "crawl"
SPIDER_MODULES = ["crawl.spiders"]
NEWSPIDER_MODULE = "crawl.spiders"
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    "crawl.pipelines.CrawlPipeline": 300,
}
DATABASE = {
    "drivername": "postgres",
    "host": os.environ["POSTGRES_HOST"],
    "port": os.environ["POSTGRES_PORT"],
    "username": os.environ["POSTGRES_USER"],
    "password": os.environ["POSTGRES_PASS"],
    "database": os.environ["POSTGRES_DB"],
}
LOG_LEVEL = "INFO"
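
For reference, db_connect turns the DATABASE dictionary above into a regular SQLAlchemy connection URL. A small sketch with made-up values shows what that amounts to:

from sqlalchemy.engine.url import URL

# Example values only; in the project these come from the environment.
url = URL(
    drivername="postgres",
    username="zelda",
    password="secret",
    host="localhost",
    port=5432,
    database="games",
)
print(str(url))  # postgres://zelda:secret@localhost:5432/games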

Verify the data

As a last step in this tutorial I will double-check that there are indeed ten items in the database. I use pandas as an easy way to query the database and display the results as a table.

import os

import pandas as pd
from sqlalchemy import create_engine

USER = os.environ['POSTGRES_USER']
PASS = os.environ['POSTGRES_PASS']
HOST = os.environ['POSTGRES_HOST']
PORT = os.environ['POSTGRES_PORT']
DB = os.environ['POSTGRES_DB']
db_string = f"postgres://{USER}:{PASS}@{HOST}:{PORT}/{DB}"

engine = create_engine(db_string)

df = pd.read_sql_query('SELECT * FROM items',
                       con=engine)
print(df.to_string())
#                  name  price
# 0           Boomerang    300
# 1     Heart Container      4
# 2           Blue Ring    250
# 3   Red Water of Life     68
# 4                Food     60
# 5  Blue Water of Life     40
# 6         Blue Candle     60
# 7               Arrow     80
# 8                 Bow    980
# 9                Bomb     20
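
The same check can also be done with the Items model from the project instead of raw SQL (again a sketch, assuming the same environment variables are set and the crawl package is on the path):

from sqlalchemy.orm import sessionmaker

from crawl.models import Items, db_connect

session = sessionmaker(bind=db_connect())()
print(session.query(Items).count())  # expect 10
for row in session.query(Items).order_by(Items.price.desc()).limit(3):
    print(row.name, row.price)
session.close()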

See the GitHub repo for the final code. The code will most probably change as I (slowly) continue working on this side project, but I hope it helps anyone playing with data crawling and databases.