2019-05-04 10:40 — By Erik van Eykelen
Recently I created a proof-of-concept for a client to show that a combination of Amazon S3, Amazon SQS, Apache Tika, a Ruby-based Heroku worker, and Algolia can be used to create a surprisingly powerful search engine index ingestion solution with little effort.
By “search engine index ingestion” I mean the process of adding data from documents (.doc, .pdf) and text-based content to a search engine index.
Here’s the bill of materials for the proof-of-concept:

- Amazon S3, acting as a file-based API for uploads
- Amazon SQS, queueing the S3 upload events
- Apache Tika, extracting text from documents
- A Ruby-based worker running on Heroku
- Algolia, hosting the search index
- A Ruby on Rails-based web endpoint to present rudimentary search results

This basic search service consists of fewer than 200 lines of Ruby code and two dozen lines of AWS policy configuration.
The proof-of-concept had to demonstrate several things: that both plain text and documents such as PDFs can be ingested, that tenants are kept apart in the index, and that search results come back quickly. For the file intake I chose S3 to act as a file-based API.
The ingestion process behind this file-based API uses the following components:

- An S3 bucket that receives JSON manifests and the documents they refer to.
- An S3 event notification that fires for each uploaded .json file. The trigger causes SQS to add the S3 upload event to its queue.
- A Ruby worker on Heroku that polls the queue, extracts text with Apache Tika where needed, and pushes the result to Algolia.

The following example shows how a plain text snippet can be added to the index:
{
  "tenant_urn": "urn:example-app:example-company-dc1294fceb07b30102a5",
  "source_urn": "urn:example-app:post:body:dd9dd35530e76603c875",
  "operation": "create",
  "locale": "en",
  "content_body": "Hello world, this is my first indexed post."
}
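To get a feel for the file-based API: a manifest like this can be dropped into the bucket with the aws-sdk-s3 gem. The sketch below is illustrative only; the bucket and region match the example manifests in this post, and the object key is a random UUID, since any unique name ending in .json will do.

require "aws-sdk-s3"
require "json"
require "securerandom"

manifest = {
  tenant_urn: "urn:example-app:example-company-dc1294fceb07b30102a5",
  source_urn: "urn:example-app:post:body:dd9dd35530e76603c875",
  operation: "create",
  locale: "en",
  content_body: "Hello world, this is my first indexed post."
}

s3 = Aws::S3::Client.new(region: "eu-west-1")

# The .json suffix is what the S3 event notification is configured to match.
s3.put_object(
  bucket: "search-proof-of-concept-test",
  key: "#{SecureRandom.uuid}.json",
  body: JSON.pretty_generate(manifest),
  content_type: "application/json"
)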
A few notes on the fields:

- tenant_urn contains a user-defined value which is used by the search engine to compartmentalize search results.
- source_urn contains a user-defined value which helps us correlate search results with features in our app. In this example urn:example-app:post:body:dd9dd35530e76603c875 might point to https://app.example.com/posts/dd9dd35530e76603c875. I prefer URNs over raw URLs because URLs may change over time, while well-considered URNs have a much longer shelf life.
- The create operation tells the index to add the value of content_body to the index. Two other supported operations are update and delete.
- The locale value is used to tag the entry in Algolia with the value en. This enables us to filter queries by language.

Adding the contents of a PDF file is just as easy:
{
  "tenant_urn": "urn:example-app:example-company-dc1294fceb07b30102a5",
  "source_urn": "urn:example-app:resume:attachment:711dc4b9b481a5749e67",
  "operation": "create",
  "locale": "en",
  "title": "Résumé Joe Doe 2019.pdf",
  "content_s3_region": "eu-west-1",
  "content_s3_bucket": "search-proof-of-concept-test",
  "content_s3_key": "834cb938-d0a3-404b-a15a-bf7a5a238ce6.pdf"
}
Indexing a file is a two-step process:

1. Upload the document (for example a PDF) to the S3 bucket.
2. Upload a JSON manifest with content_s3_region, content_s3_bucket, and content_s3_key set to the values of the document you uploaded in step 1.

Processing of the document and its associated JSON file is resilient against “out of order” upload completion because the worker simply ignores the manifest if the associated file is not yet present.
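In code, the two steps boil down to two put_object calls. The sketch below reuses the placeholder bucket and region from the manifest above; the local file name is made up:

require "aws-sdk-s3"
require "json"
require "securerandom"

s3 = Aws::S3::Client.new(region: "eu-west-1")
bucket = "search-proof-of-concept-test"
document_key = "#{SecureRandom.uuid}.pdf"

# Step 1: upload the document itself.
s3.put_object(
  bucket: bucket,
  key: document_key,
  body: File.binread("resume-joe-doe-2019.pdf"),
  content_type: "application/pdf"
)

# Step 2: upload the manifest that points at the document.
manifest = {
  tenant_urn: "urn:example-app:example-company-dc1294fceb07b30102a5",
  source_urn: "urn:example-app:resume:attachment:711dc4b9b481a5749e67",
  operation: "create",
  locale: "en",
  title: "Résumé Joe Doe 2019.pdf",
  content_s3_region: "eu-west-1",
  content_s3_bucket: bucket,
  content_s3_key: document_key
}

s3.put_object(
  bucket: bucket,
  key: "#{SecureRandom.uuid}.json",
  body: JSON.generate(manifest),
  content_type: "application/json"
)

Making the proper settings in Amazon is always challenging (at least for me, it’s not my daily work). I’ve summarized the settings below: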
First, add an SQS event notification to the S3 bucket: select the event types you want to receive, optionally restrict it to the .json suffix, choose “SQS Queue” as the destination, and select the SQS queue you have prepared.
Next, make sure your SQS queue allows S3 to deliver messages to it.
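A queue access policy along the following lines lets the bucket’s event notifications reach the queue; the account ID and queue name below are placeholders:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "s3.amazonaws.com" },
      "Action": "SQS:SendMessage",
      "Resource": "arn:aws:sqs:eu-west-1:123456789012:search-poc-queue",
      "Condition": {
        "ArnLike": { "aws:SourceArn": "arn:aws:s3:::search-proof-of-concept-test" }
      }
    }
  ]
}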
Finally, use “View/Delete Messages” in the SQS console to test whether your S3-to-SQS pipeline is working.
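With the pipeline in place, the Heroku worker can start consuming events. The following is a rough sketch of what its polling loop could look like, not the actual proof-of-concept code; the queue URL, the extract_text_with_tika helper, and the way the Page model is populated are assumptions to illustrate the flow:

require "aws-sdk-sqs"
require "aws-sdk-s3"
require "json"

queue_url = ENV.fetch("SQS_QUEUE_URL") # assumed to be set as a Heroku config var
sqs = Aws::SQS::Client.new(region: "eu-west-1")
s3  = Aws::S3::Client.new(region: "eu-west-1")

loop do
  # Long-poll SQS for S3 upload events.
  response = sqs.receive_message(
    queue_url: queue_url,
    max_number_of_messages: 10,
    wait_time_seconds: 20
  )

  response.messages.each do |message|
    event = JSON.parse(message.body)

    (event["Records"] || []).each do |record|
      bucket = record.dig("s3", "bucket", "name")
      key    = record.dig("s3", "object", "key")
      next unless key.end_with?(".json")

      manifest = JSON.parse(s3.get_object(bucket: bucket, key: key).body.read)

      body =
        if manifest["content_body"]
          manifest["content_body"]
        else
          begin
            document = s3.get_object(bucket: manifest["content_s3_bucket"], key: manifest["content_s3_key"])
          rescue Aws::S3::Errors::NoSuchKey
            next # document not uploaded yet; ignore the manifest for now
          end
          extract_text_with_tika(document.body.read) # hypothetical wrapper around Apache Tika
        end

      # Saving the record lets algoliasearch-rails push it to the index.
      Page.create!(
        tenant_urn: manifest["tenant_urn"],
        source_urn: manifest["source_urn"],
        locale: manifest["locale"],
        title: manifest["title"],
        body: body
      )
    end

    sqs.delete_message(queue_url: queue_url, receipt_handle: message.receipt_handle)
  end
end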
This proof-of-concept uses Algolia to index our content. There is not much to say about this part because it is ridiculously easy to set up, have your content indexed, and return search results. And it is blazingly fast to boot.
The proof-of-concept used just one model, called Page. The model code looks like this:
class Page < ApplicationRecord
  include AlgoliaSearch

  algoliasearch index_name: "pages" do
    attribute :tenant_urn
    attribute :source_urn
    attribute :locale
    attribute :tags
    attribute :title
    attribute :body
  end
end
The only other thing you need to do (in a Rails app) is add your credentials to ENV vars and include the algoliasearch-rails gem. See https://github.com/algolia/algoliasearch-rails for documentation.
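Querying the index from Rails is then a one-liner. The snippet below assumes tenant_urn and locale have been declared as attributes for faceting in Algolia so they can be used in filters:

results = Page.search("hello world", {
  filters: 'tenant_urn:"urn:example-app:example-company-dc1294fceb07b30102a5" AND locale:en'
})

results.each { |page| puts page.source_urn }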
Code snippets created for this proof-of-concept are listed below: