Getting started with Elasticsearch
Install Elasticsearch
$ mkdir ~/es
$ cd ~/es
$ wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.3.5/elasticsearch-2.3.5.tar.gz
$ tar -xzvf elasticsearch-2.3.5.tar.gz
$ cd elasticsearch-2.3.5/
$ ./bin/elasticsearch -d
$ curl http://127.0.0.1:9200
At this point you should see something like:
{
  "name" : "Gailyn Bailey",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.3.5",
    "build_hash" : "90f439ff60a3c0f497f91663701e64ccd01edbb4",
    "build_timestamp" : "2016-07-27T10:36:52Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}
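If you want to double-check that the node is healthy before going on, you can ask the cluster health endpoint. A minimal sketch using the requests library (not part of the original setup, just a sanity check):

#!/usr/bin/env python
import requests

# Ask the node for its health; "green" means all shards are allocated.
# On a single node, an index with replicas will later turn this "yellow",
# which is still fine for local experiments.
response = requests.get('http://127.0.0.1:9200/_cluster/health')
print(response.json()['status'])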
Create the ES index for the posts
In the mappings we want to distinguish between a hit in the title and a hit in the body: a match in the title carries twice the weight of a match in the body.
#!/usr/bin/env python
import json
import requests

data = {
    "settings": {
        "number_of_shards": 4,
        "number_of_replicas": 1
    },
    "mappings": {
        "blog": {
            "properties": {
                "title": {"type": "string", "boost": 4},
                "body": {"type": "string", "boost": 2}
            }
        }
    }
}

response = requests.put('http://127.0.0.1:9200/blog_index/', data=json.dumps(data))
print(response.text)
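To make sure the boosts ended up in the index, you can read the mapping back. A quick sketch (the GET mapping endpoint is standard; blog_index and blog are the names used above):

#!/usr/bin/env python
import requests

# Fetch the stored mapping for the blog type and inspect the boost values.
response = requests.get('http://127.0.0.1:9200/blog_index/_mapping/blog')
print(response.json())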
Add the entries
#!/usr/bin/env python
# Run this inside a configured Django context (e.g. python manage.py shell),
# otherwise importing blog.models will fail.
import json
import requests

from blog.models import Entry

# Build the newline-delimited bulk body: an action line followed by the document.
data = ''
for p in Entry.objects.all():
    data += '{"index": {"_id": "%s"}}\n' % p.pk
    data += json.dumps({
        "title": p.title,
        "body": p.body
    }) + '\n'

response = requests.put('http://127.0.0.1:9200/blog_index/blog/_bulk', data=data)
print(response.text)
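Bulk-indexed documents only become searchable after a refresh. To confirm that all entries arrived, you can force a refresh and count them; a small sketch using the same index and type names:

#!/usr/bin/env python
import requests

# Force a refresh so the freshly bulk-indexed documents become visible.
requests.post('http://127.0.0.1:9200/blog_index/_refresh')

# Count the documents of the blog type; this should match Entry.objects.count().
response = requests.get('http://127.0.0.1:9200/blog_index/blog/_count')
print(response.json()['count'])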
Search the entries
#!/usr/bin/env python
import json
import requests

data = {
    "query": {
        "query_string": {"query": "python"}
    }
}

response = requests.post('http://127.0.0.1:9200/blog_index/blog/_search', data=json.dumps(data))
print(response.json())
This gives the following reply:
{
  "hits": {
    "hits": [
      {
        "_score": 0.63516665,
        "_type": "blog",
        "_id": "4",
        "_source": {
"body": "```python\r\n\"\"\" samples\/crawl_01.py \"\"\"\r\n################################################################################\r\n# Application: WebParser example 01\r\n# File: samples\/crawl_01.py\r\n# Goal:\r\n# Input:\r\n# Output:\r\n# Example:\r\n#\r\n# History: 2016-06-27 - JJ Creation of the file\r\n#
...
main\r\n################################################################################\r\nif __name__ == \"__main__\":\r\n main()\r\n```",
"title": "Simple webcrawling in Python "
},
"_index": "blog_index"
},
{
"_score": 0.4232868,
"_type": "blog",
"_id": "7",
"_source": {
"body": "This is a simple script to crawl information from a website when the content is dynamically loaded.\r\n```\r\n\"\"\" samples\/crawl_02.py \"\"\"\r\n################################################################################\r\n# Application: WebParser example 02\r\n# File: samples\/crawl_01.py\r\n# Goal: Retrieve content when JavaScript is used in page\r\n# Input:\r\n# Output:\r\n# Example:\r\n#\r\n# History: 2016-06-27 - JJ Creation of the file\r\n
...
main\r\n################################################################################\r\nif __name__ == \"__main__\":\r\n main()\r\n```",
"title": "Webcrawling in Python using Selenium"
},
"_index": "blog_index"
},
{
"_score": 0.35721725,
"_type": "blog",
"_id": "13",
"_source": {
"body": "#### Installation\r\nUse the [Anaconda](https:\/\/www.continuum.io\/downloads \"Anaconda\") package. It will make starting with Data Science way easier, since almost all necessary packages are included and you can start right away.\r\n
...
[Source](http:\/\/twiecki.github.io\/blog\/2014\/11\/18\/python-for-data-science\/ \"Twiecki@Github\")",
"title": "Get started with data science in Python"
},
"_index": "blog_index"
}
],
"total": 3,
"max_score": 0.63516665
},
"_shards": {
"successful": 4,
"failed": 0,
"total": 4
},
"took": 23,
"timed_out": false
}
Each result gets a score and the results are ordered accordingly. Of course, the better your search query, the more the score tells you about how likely a result is to actually match what you were looking for.
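Instead of eyeballing the raw JSON, you will usually walk the hits list in code. A sketch that repeats the search, this time with query-time boosts on the fields (the "title^4"/"body^2" notation is standard query_string syntax, not something from the script above), and prints the ranked titles:

#!/usr/bin/env python
import json
import requests

# Same search as above, but boosting fields at query time: a match in
# "title" counts four times, a match in "body" twice.
data = {
    "query": {
        "query_string": {
            "query": "python",
            "fields": ["title^4", "body^2"]
        }
    }
}

response = requests.post('http://127.0.0.1:9200/blog_index/blog/_search', data=json.dumps(data))
for hit in response.json()['hits']['hits']:
    print('%.4f  %s' % (hit['_score'], hit['_source']['title']))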