Using Scrapy in a Jupyter notebook
This notebook makes use of the Scrapy library to scrape data from a website. Following the basic example, we create a QuotesSpider and call the CrawlerProcess with this spider to retrieve quotes from http://quotes.toscrape.com.
In this notebook two pipelines are defined, both writing results to a JSON file. The first option is to create a separate class that defines the pipeline and explicitly implements the methods to write each found item to a file. This gives more flexibility when dealing with unusual data formats, or when you want to set up a custom way of writing items to file. The pipeline is registered via the ITEM_PIPELINES key in the custom_settings attribute of the QuotesSpider class. However, since I simply want to write the list of items found by the spider to a JSON file, it is easier to choose the second option, where only FEED_FORMAT has to be set to json and the output file has to be defined in FEED_URI inside the spider's custom settings. No additional classes or definitions need to be created, making the FEED_FORMAT/FEED_URI combination the more convenient option.
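Concretely, the two options only differ in which keys are set in the spider's custom_settings. The sketch below mirrors the settings used in the full spider further down; JsonWriterPipeline refers to the pipeline class defined in the next section.
# Sketch: the two output options side by side inside a spider's custom_settings
custom_settings = {
    # Option 1: route every item through a custom pipeline class
    'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},
    # Option 2: let Scrapy's feed export write the items directly
    'FEED_FORMAT': 'json',
    'FEED_URI': 'quoteresult.json',
}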
Once the quotes are retrieved, the JSON file is created on disk and can be loaded into a Pandas dataframe. This dataframe can then be analyzed, modified and used for further processing. This notebook simply loads the JSON file into a dataframe and writes it out again as a pickle.
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # display every expression result in a cell, not just the last one
# Show Python version
import platform
platform.python_version()
Import Scrapy¶
try:
    import scrapy
except ImportError:
    !pip install scrapy
    import scrapy
from scrapy.crawler import CrawlerProcess
Set up a pipeline¶
This class defines a simple pipeline that writes all found items to a JSON Lines (.jl) file, where each line contains one JSON element.
import json
class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.file = open('quoteresult.jl', 'w')

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Write each scraped item as a single JSON line
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
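Because this class is defined inside the notebook itself, it lives in the __main__ module, which is why it is referenced as __main__.JsonWriterPipeline in the spider's ITEM_PIPELINES setting below.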
Define the spider¶
The QuotesSpider class defines which URLs to start crawling from and which values to retrieve. I set the logging level of the crawler to WARNING, otherwise the notebook is flooded with DEBUG messages about the retrieved data.
import logging
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},  # Used for pipeline 1
        'FEED_FORMAT': 'json',                                 # Used for pipeline 2
        'FEED_URI': 'quoteresult.json'                         # Used for pipeline 2
    }

    def parse(self, response):
        # Each quote on the page sits in a div with class "quote"
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
Start the crawler¶
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(QuotesSpider)
process.start()
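Note that CrawlerProcess runs Scrapy on top of the Twisted reactor, which can only be started once per Python process; re-running this cell in the same kernel will typically raise a ReactorNotRestartable error, so restart the kernel before crawling again.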
Check the files¶
Verify that the files have been created on disk. As we can observe, both files exist and contain data. The .jl file has line-separated JSON elements, while the .json file contains one big JSON array with all the quotes.
ll quoteresult.*
!tail -n 2 quoteresult.jl
!tail -n 2 quoteresult.json
Create dataframes¶
Pandas can now be used to create dataframes and save the frames to pickles. The .json file can be loaded directly into a frame, whereas for the .jl file we need to specify that the JSON objects are separated per line (lines=True).
import pandas as pd
dfjson = pd.read_json('quoteresult.json')
dfjson
dfjl = pd.read_json('quoteresult.jl', lines=True)
dfjl
dfjson.to_pickle('quotejson.pickle')
dfjl.to_pickle('quotejl.pickle')
ll *pickle
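If the dataframes are needed again in a later session, the pickles can be loaded back with pd.read_pickle, for example:
# Reload the pickled dataframes in a later session
import pandas as pd

dfjson = pd.read_pickle('quotejson.pickle')
dfjl = pd.read_pickle('quotejl.pickle')
dfjl.head()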