First experiments with NLTK
In [1]:
import nltk
# Run this the first time you use NLTK
# nltk.download()
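Instead of opening the interactive downloader, the two data packages used below can be fetched directly (assuming the standard NLTK resource names: 'punkt' for the tokenizers and 'stopwords' for the stop word lists):
nltk.download('punkt')
nltk.download('stopwords')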
In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."
print(sent_tokenize(EXAMPLE_TEXT))
print(word_tokenize(EXAMPLE_TEXT))
Stop words
Stop words are very common words such as 'the', 'is' and 'not' that carry little meaning on their own, so we remove them from the list of tokens before counting anything.
In [3]:
from nltk.corpus import stopwords
In [4]:
stop_words = set(stopwords.words('english'))
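A quick look at what will be filtered out (the slice size here is arbitrary):
print(len(stop_words))
print(sorted(stop_words)[:10])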
In [5]:
word_tokens = word_tokenize(EXAMPLE_TEXT)
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
print(word_tokens)
print(filtered_sentence)
Example one - the lyrics of Bob Dylan's "Hurricane"
Let's analyse the lyrics of Bob Dylan's "Hurricane". First we fetch the lyrics from his website:
In [6]:
import lxml.html
import requests
response = requests.get('https://bobdylan.com/songs/hurricane/')
etree = lxml.html.fromstring(response.text)
lyrics = etree.cssselect('div[class*="lyrics"]')[0].text_content()
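The selector assumes the page still wraps the lyrics in a div whose class contains 'lyrics', so a slightly more defensive sketch checks the response and the selector before indexing:
response.raise_for_status()  # fail early on HTTP errors
matches = etree.cssselect('div[class*="lyrics"]')
if not matches:
    raise RuntimeError("No 'lyrics' div found - the page layout may have changed")
lyrics = matches[0].text_content()
print(lyrics[:200])  # quick look at the extracted text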
Tokenize the lyrics into words:
In [7]:
word_tokens = word_tokenize(lyrics)
Remove the stop words from the tokens:
In [8]:
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
Create a word frequency graph
In [9]:
from nltk import FreqDist
fd = FreqDist(filtered_sentence)
fd.plot(30,cumulative=False)
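The same counts can also be inspected without a plot; most_common returns (token, count) pairs:
print(fd.most_common(15))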
What we see in the graph above is that punctuation tokens clutter the plot. We now use a regular-expression tokenizer to drop them and keep only word characters.
In [10]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
word_tokens = tokenizer.tokenize(lyrics)
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
Let's create the graph again:
In [11]:
fd = FreqDist(filtered_sentence)
fd.plot(30,cumulative=False)
Example two - the Wikipedia article about Nintendo
In [12]:
import wikipedia
nintendo = wikipedia.page("Nintendo")
word_tokens = word_tokenize(nintendo.content)
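The wikipedia package is a third-party wrapper around the Wikipedia API (pip install wikipedia). If a title is ambiguous it raises a DisambiguationError, so a guarded lookup might look roughly like this sketch:
try:
    nintendo = wikipedia.page("Nintendo")
except wikipedia.exceptions.DisambiguationError as e:
    # fall back to the first suggested title if "Nintendo" is ambiguous
    nintendo = wikipedia.page(e.options[0])
word_tokens = word_tokenize(nintendo.content)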
In [13]:
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
In [14]:
fd = FreqDist(filtered_sentence)
fd.plot(10,cumulative=False)
Punctuation again dominates the top of the distribution, so we redo the tokenization with the regular-expression tokenizer:
In [15]:
tokenizer = RegexpTokenizer(r'\w+')
word_tokens = tokenizer.tokenize(nintendo.content)
In [16]:
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
In [17]:
fd = FreqDist(filtered_sentence)
fd.plot(10,cumulative=False)
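Both examples repeat the same steps, so they could be collected into a small helper. This is just a sketch of the pipeline used above; the function name and the default number of plotted words are arbitrary choices:
def plot_word_frequencies(text, top_n=10):
    # Tokenize on word characters, drop English stop words, plot the top_n tokens
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)
    filtered = [w for w in tokens if w.lower() not in stop_words]
    fd = FreqDist(filtered)
    fd.plot(top_n, cumulative=False)
    return fd

# e.g. plot_word_frequencies(lyrics) or plot_word_frequencies(nintendo.content)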