Using Scrapy in a Jupyter notebook
This notebook makes use of the Scrapy library to scrape data from a website. Following the basic example, we create a QuotesSpider and call the CrawlerProcess with this spider to retrieve quotes from http://quotes.toscrape.com.
In this notebook two pipelines are defined, both writing results to a JSON file. The first option is to create a separate class that defines the pipeline and explicitly implements the methods to write each found item to a file. This gives more flexibility when dealing with unusual data formats, or when you want to set up a custom way of writing items to file. The pipeline is registered via the ITEM_PIPELINES key in the custom_settings attribute of the QuotesSpider class. However, since I simply want to write the list of items found by the spider to a JSON file, it is easier to choose the second option, where only FEED_FORMAT has to be set to json and the output file has to be defined in FEED_URI inside the spider's custom settings. No additional classes or definitions need to be created, making the FEED_FORMAT/FEED_URI combination the more convenient option.
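Concretely, the two options only differ in which keys are set in the spider's custom_settings. The sketch below mirrors the settings used in the full spider further down; JsonWriterPipeline refers to the pipeline class defined in the next section.
# Sketch: the two output options side by side inside a spider's custom_settings
custom_settings = {
    # Option 1: route every item through a custom pipeline class
    'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},
    # Option 2: let Scrapy's feed export write the items directly
    'FEED_FORMAT': 'json',
    'FEED_URI': 'quoteresult.json',
}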
Once the quotes are retrieved, the JSON file is created on disk and can be loaded into a Pandas dataframe. This dataframe can then be analyzed, modified and used for further processing. This notebook simply loads the JSON file into a dataframe and writes it out again as a pickle.
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # display every expression result in a cell, not just the last one
# Show Python version
import platform
platform.python_version()
Import Scrapy¶
try:
    import scrapy
except ImportError:
    !pip install scrapy
    import scrapy
from scrapy.crawler import CrawlerProcess
Set up a pipeline¶
This class defines a simple pipeline that writes all found items to a JSON Lines (.jl) file, where each line contains one JSON element.
import json
class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.file = open('quoteresult.jl', 'w')

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Write each scraped item as a single JSON line
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
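Because this class is defined inside the notebook itself, it lives in the __main__ module, which is why it is referenced as __main__.JsonWriterPipeline in the spider's ITEM_PIPELINES setting below.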
Define the spider¶
The QuotesSpider class defines which URLs to start crawling from and which values to retrieve. I set the logging level of the crawler to WARNING, otherwise the notebook is flooded with DEBUG messages about the retrieved data.
import logging
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},  # Used for pipeline 1
        'FEED_FORMAT': 'json',                                 # Used for pipeline 2
        'FEED_URI': 'quoteresult.json'                         # Used for pipeline 2
    }

    def parse(self, response):
        # Each quote on the page sits in a div with class "quote"
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
Start the crawler¶
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(QuotesSpider)
process.start()
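Note that CrawlerProcess runs Scrapy on top of the Twisted reactor, which can only be started once per Python process; re-running this cell in the same kernel will typically raise a ReactorNotRestartable error, so restart the kernel before crawling again.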
Check the files¶
Verify that the files have been created on disk. As we can observe, both files exist and contain data. The .jl file has line-separated JSON elements, while the .json file contains one big JSON array with all the quotes.
ll quoteresult.*
!tail -n 2 quoteresult.jl
!tail -n 2 quoteresult.json
Create dataframes¶
Pandas can now be used to create dataframes and save the frames to pickles. The .json file can be loaded directly into a frame, whereas for the .jl file we need to specify that the JSON objects are separated per line (lines=True).
import pandas as pd
dfjson = pd.read_json('quoteresult.json')
dfjson
dfjl = pd.read_json('quoteresult.jl', lines=True)
dfjl
dfjson.to_pickle('quotejson.pickle')
dfjl.to_pickle('quotejl.pickle')
ll *pickle
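If the dataframes are needed again in a later session, the pickles can be loaded back with pd.read_pickle, for example:
# Reload the pickled dataframes in a later session
import pandas as pd

dfjson = pd.read_pickle('quotejson.pickle')
dfjl = pd.read_pickle('quotejl.pickle')
dfjl.head()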