Introduction
In this tutorial, we'll walk through building a simple Python script that can search the contents of PDF files stored in an Amazon S3 bucket, using Apache Tika and OpenSearch. Apache Tika is a library for extracting text and metadata from many document types, including PDFs. OpenSearch is a search engine that can index the extracted text and make it searchable.
Prerequisites
Before we begin, make sure that you have the following prerequisites:
- An AWS account and credentials with access to an S3 bucket.
- Apache Tika and OpenSearch installed on your system. You can download the latest version of Tika from the Apache Tika website, and the latest version of OpenSearch from the OpenSearch GitHub page.
- The boto3 and requests libraries installed on your system. You can install them with `pip install boto3 requests`.
Your S3 Bucket is a Black Box
If your S3 bucket is anything like mine, it contains many different file types: text, pictures, audio clips, videos, the list goes on.
How are we supposed to find anything in this mess of files?
To solve this, we're going to build a pipeline that extracts text from one example binary file type, PDFs, using OCR. Then we'll store that text in a search engine (OpenSearch), and finally we'll build a REST API in Flask to "explore" our S3 bucket.
Here's what our pipeline will look like: S3 bucket → Apache Tika (text extraction/OCR) → OpenSearch index → Flask search API.
Bonus: Skip this walkthrough and just download the code here.
Retrieve the File from your S3 Bucket
First we need to download the file locally so we can run our text extraction logic.
import boto3

# connect to S3 with your AWS credentials
s3_client = boto3.client(
    's3',
    aws_access_key_id='aws_access_key_id',
    aws_secret_access_key='aws_secret_access_key',
    region_name='region_name'
)

# the bucket and key we want to index (example values)
bucket_name = 'your-bucket-name'
s3_file_name = 'example.pdf'

# download the object to a local file with the same name
with open(s3_file_name, 'wb') as file:
    s3_client.download_fileobj(bucket_name, s3_file_name, file)
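In a real run you'll likely want to do this for every PDF in the bucket rather than a single hard-coded key. Here's a minimal sketch using boto3's list_objects_v2 paginator (the .pdf suffix filter is just an assumption about how your files are named):

# collect the S3 keys of every PDF in the bucket so each one can go through the pipeline
paginator = s3_client.get_paginator('list_objects_v2')

pdf_keys = []
for page in paginator.paginate(Bucket=bucket_name):
    for obj in page.get('Contents', []):
        if obj['Key'].lower().endswith('.pdf'):
            pdf_keys.append(obj['Key'])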
Use OCR to Extract the Contents
We'll use the open source Apache Tika library, which contains a class, AutoDetectParser, that does OCR (optical character recognition):
from tika import parser

# from_file returns a dict; ['content'] holds the extracted text
parsed_pdf_content = parser.from_file(s3_file_name)['content']
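Depending on the PDF, 'content' can come back empty or None (for example a pure-image scan when no OCR backend is configured), so it's worth a small guard before indexing. A minimal sketch; skipping the file is just one possible choice:

# skip files Tika couldn't extract any text from
if not parsed_pdf_content or not parsed_pdf_content.strip():
    print(f"No text extracted from {s3_file_name}, skipping")
else:
    parsed_pdf_content = parsed_pdf_content.strip()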
Insert the Contents into AWS OpenSearch
We're using a self-managed OpenSearch node here, but you could also use Lucene, Solr, Elasticsearch, or Atlas Search.
Note: if you don't have OpenSearch locally, you'll need to install it first, then run it:
brew update
brew install opensearch
opensearch
OpenSearch will now be accessible here: http://localhost:9200
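Before moving on you can confirm the node is reachable. A quick sketch using the requests library from the prerequisites (if your OpenSearch install has the security plugin enabled, you may need https and credentials instead):

import requests

# a 200 response with cluster metadata means the node is up
health = requests.get("http://localhost:9200")
print(health.status_code)
print(health.json().get("version", {}).get("distribution"))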
Let’s build the index and insert the file contents:
from opensearchpy import OpenSearch

# client for the local OpenSearch node
client = OpenSearch("http://localhost:9200")

index_name = "pdf-search"

doc = {
    "filename": s3_file_name,
    "parsed_pdf_content": parsed_pdf_content
}

# index the document; refresh=True makes it searchable immediately
# (a fixed id=1 is fine for this single-file demo, but it will overwrite on each run)
response = client.index(
    index=index_name,
    body=doc,
    id=1,
    refresh=True
)
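Note that client.index auto-creates the index with dynamic mappings on first use. If you'd rather control how the text is analyzed, you can create the index explicitly beforehand; the mapping below is only an illustrative assumption, not something this walkthrough requires:

# create the index with an explicit mapping if it doesn't exist yet
if not client.indices.exists(index=index_name):
    client.indices.create(
        index=index_name,
        body={
            "mappings": {
                "properties": {
                    "filename": {"type": "keyword"},
                    "parsed_pdf_content": {"type": "text"}
                }
            }
        }
    )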
Building the Search API
from flask import Flask, jsonify, request
from opensearchpy import OpenSearch

app = Flask(__name__)

# client for the local OpenSearch node
client = OpenSearch("http://localhost:9200")

index_name = "pdf-search"

@app.route('/search', methods=['GET'])
def search_file():
    # search term from the ?q= query parameter
    query = request.args.get('q', default=None, type=str)

    # query payload in JSON for OpenSearch
    payload = {
        'query': {
            'match': {
                'parsed_pdf_content': query
            }
        }
    }

    # run the search query against our index
    response = client.search(
        body=payload,
        index=index_name
    )
    return jsonify(response)

if __name__ == '__main__':
    app.run(host="localhost", port=5011, debug=True)
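With the app running, you can exercise the endpoint from another script or the command line. A small usage sketch with requests (the search term "invoice" is just an example):

import requests

# ask the Flask API for documents matching "invoice"
results = requests.get("http://localhost:5011/search", params={"q": "invoice"}).json()

# each hit carries the indexed document under _source
for hit in results["hits"]["hits"]:
    print(hit["_source"]["filename"], hit["_score"])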
You can download the repo here: https://github.com/mixpeek/pdf-search-s3
The easy part is done; now you need to figure out:
- Queuing: Ensuring concurrent file uploads are not dropped
- Security: Adding end-to-end encryption to the data pipeline
- Enhancements: Including more features like fuzzy matching, highlighting, and autocomplete (see the query sketch after this list)
- Rate Limiting: Building thresholds so users don't abuse the system
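As a taste of the enhancements item, here's a sketch of how the search payload could be extended with fuzzy matching and highlighting using OpenSearch's query DSL (the exact settings are illustrative, not part of the original walkthrough):

payload = {
    "query": {
        "match": {
            "parsed_pdf_content": {
                "query": query,
                "fuzziness": "AUTO"  # tolerate small typos in the search term
            }
        }
    },
    "highlight": {
        "fields": {
            "parsed_pdf_content": {}  # return snippets with matches wrapped in <em> tags
        }
    }
}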
All This in Two Lines of Code
from mixpeek import Mixpeek

# init mixpeek class with S3 connection
mix = Mixpeek(
    api_key="mixpeek_api_key",
    access_key="aws_access_key",
    secret_key="aws_secret_key",
    region="region"
)

# index our entire S3 bucket's files
mix.index_bucket("mixpeek-public-demo")

# full text search across S3 bucket
mix.search("system")