Another day, another task. Last year I created an Elasticsearch cluster, but it sat around without any data because other projects had higher priority. Last week I finally had the opportunity to use it. After some research I decided to use the Searchkick library for Ruby on Rails. This post describes how I used Searchkick and Sidekiq to import all our data into Elasticsearch. By the way, Searchkick and Sidekiq are awesome!
In my case we needed to store the data of three Rails models in Elasticsearch: Users, Projects and Assets. Combined, these models account for about 40 million records. All the models use custom ‘search data’ as described in the Searchkick documentation. In the end there will be about 35 GB of data stored in Elasticsearch, spread over three cluster nodes with five shards and one replica.
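To give an idea of what custom search data looks like, here is a minimal sketch of a model that overrides search_data. The attribute names and the association are hypothetical and only there for illustration; our real models index different fields.

```ruby
# app/models/project.rb
# Sketch only: name, description and user are assumed attributes/associations.
class Project < ApplicationRecord
  belongs_to :user

  # Enables Searchkick for this model (adds reindex, search, callbacks).
  searchkick

  # Searchkick calls this method to build the document that is sent to
  # Elasticsearch, instead of serializing every column of the record.
  def search_data
    {
      name: name,
      description: description,
      owner_name: user.name
    }
  end
end
```

Indexing only the fields you actually search on keeps the documents, and therefore the index, smaller.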
Prerequisites
I always try to use the latest version for every project. So let’s start with the tools and versions I’m going to use:
- Elasticsearch (5.4.3)
- Ruby (2.4.1)
- Rails (5.1.2)
- Redis (3.2.8)
Additional gems
- Searchkick (2.3.0)
- Sidekiq (5.0.3)
- OJ (3.2.0)
- Typhoeus (1.1.2)
Setting up
First we need to set up the connection to the Elasticsearch cluster. We’re using HAProxy as a load balancer, so we only need to specify one host. We’ll do that in an initializer with a few specific configuration settings. First we add retry_on_failure, which makes sure that a request that times out or fails is retried, so we don’t lose any data. We also raise the request timeout, because requests were timing out in our first attempts. The default is 10 seconds, but we set it to 250. Reindexing with high concurrency can cause traffic congestion and will definitely slow down the cluster and/or the relational database. This is what our initializer looks like:
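A minimal sketch of such an initializer, using the settings described above. The file name and the host URL are assumptions (placeholders for your own load balancer address); retry_on_failure and the timeout value come from the text.

```ruby
# config/initializers/elasticsearch.rb
# Searchkick reads the cluster address from ELASTICSEARCH_URL.
# The host below is a placeholder for the HAProxy address.
ENV["ELASTICSEARCH_URL"] ||= "http://elasticsearch.example.internal:9200"

# Options passed through to the Elasticsearch client:
# retry a request when it fails or times out.
Searchkick.client_options = {
  retry_on_failure: true
}

# Request timeout in seconds (the Searchkick default is 10).
Searchkick.timeout = 250
```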
I would like to add a side note on Searchkick background job processing with Sidekiq. Searchkick uses ActiveJob and not the native Sidekiq job runner. We use ActiveJob only in combination with Searchkick; our other workers are regular Sidekiq workers. They work together fine because they use different queues. To enable ActiveJob you will need to add the following code to application.rb:
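As a sketch, pointing ActiveJob at Sidekiq is a single line inside the application class. The module name MyApp is a placeholder for your own application name.

```ruby
# config/application.rb
module MyApp # placeholder application name
  class Application < Rails::Application
    # Run ActiveJob jobs (including the Searchkick reindex jobs)
    # through Sidekiq.
    config.active_job.queue_adapter = :sidekiq
  end
end
```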
It is wise to look into the documentation of Searchkick because it has a lot of important information on how to configure and optimise Searchkick.
Configure and start Redis. You will also need to run Sidekiq and let it process the ‘searchkick’ queue. This is the queue where all the reindex jobs are sent to. You can do that with a command such as: bundle exec sidekiq -q searchkick
Bulk Reindexing
The process for asynchronous reindexing is a little confusing. You want to send the reindexing tasks to the running Sidekiq process, so it can run on multiple processes without blocking. For this to work I had to read the documentation thoroughly and even look into the code to see what was actually happening. I finally came up with the following code:
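The sketch below follows the three steps walked through in the rest of this post: enqueue the jobs, wait for the queue to drain, then promote and clean up. The method name and its keyword arguments are hypothetical, and the queue is passed in as an argument so the flow is easy to follow; in a real app you would pass Sidekiq::Queue.new("searchkick") after requiring "sidekiq/api".

```ruby
# Hypothetical helper; the name and keyword arguments are illustrative.
# model: a Searchkick-enabled model class, for example User.
# queue: any object that responds to #size, for example
#        Sidekiq::Queue.new("searchkick") (needs: require "sidekiq/api").
def reindex_in_background(model, queue:, promote: true, poll_interval: 5)
  # Step 1: enqueue the batch jobs. Searchkick creates a new index and
  # returns its name.
  result = model.reindex(async: true, refresh_interval: "30s")
  index_name = result[:index_name]

  # Step 2: wait until the workers have processed every job.
  sleep poll_interval until queue.size.zero?

  return index_name unless promote

  # Step 3: make the new index the live one, restore the refresh
  # interval, and remove the old indices.
  index = model.searchkick_index
  index.promote(index_name, update_refresh_interval: true)
  index.clean_indices
  index_name
end
```

Called as reindex_in_background(User, queue: Sidekiq::Queue.new("searchkick")), it blocks until the reindex is done, so it is best run from a rake task or a console rather than from a web request.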
Create reindex background jobs
Let’s go through it. First we call ‘reindex’ on the model with async set to true and a refresh_interval of 30 seconds.
The refresh_interval is an Elasticsearch setting that controls how often the index is refreshed, which is when newly indexed data becomes searchable. The default value is one second, but that slows down reindexing, because the index is refreshed with new data every second. You can set the refresh_interval to -1 to disable refreshing during a complete reindex, but I decided to set it to 30 seconds so I could also track the progress. This is also the default in Searchkick.
When we call the reindex method it inserts a lot of jobs into Redis for the Sidekiq workers to process. It does this based on the first and last primary key (MIN/MAX) in the database. So if your primary keys start at 100 and end at 5000, then with a batch size of 1,000 it creates a job for records 100 to 1099, then one for 1100 to 2099, and so on, until all the jobs are in the queue. In our situation about 50k jobs were added to the queue for each Rails model.
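To make the batching concrete, here is a small plain-Ruby sketch of how a MIN/MAX primary key range can be cut into fixed-size id ranges, one per job. It illustrates the idea and is not Searchkick’s own code; the method name is hypothetical and the batch size of 1,000 is an assumption.

```ruby
# Split the id range min_id..max_id into consecutive ranges of
# batch_size ids. Each range would become one background job.
def batch_ranges(min_id, max_id, batch_size: 1000)
  batches = ((max_id - min_id + 1) / batch_size.to_f).ceil
  Array.new(batches) do |i|
    first = min_id + i * batch_size
    # Ranges are inclusive; the last one may extend past max_id,
    # which is harmless because no records exist there.
    first..(first + batch_size - 1)
  end
end

batch_ranges(100, 5000).each { |range| puts range }
```

Because the ranges are based on ids rather than on a record count, gaps in the primary key sequence lead to jobs that index fewer records than the batch size, or none at all.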
Checking the queue
Once all the jobs are in the queue, the code starts a loop that checks how many jobs are left. I first tried the ‘Searchkick.reindex_status(index_name)’ call from Searchkick, but it didn’t work for me: it always said the jobs were finished. So instead I used Sidekiq’s API to determine how many jobs were in the queue; when that number hits zero, the loop breaks and the code continues. In our case it does hit zero, because we don’t have that much traffic and our cluster is blazing fast. If you never reach zero, you could break when the count drops below ten, for example.
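The waiting step on its own, with the threshold made configurable, could look like the sketch below. The method name and its arguments are hypothetical; the queue can be any object that responds to size, such as Sidekiq::Queue.new("searchkick") from "sidekiq/api".

```ruby
# Block until the queue holds at most `threshold` jobs and return the
# number of jobs that were left at that moment.
def wait_for_queue(queue, threshold: 0, poll_interval: 5)
  loop do
    remaining = queue.size
    # `break` with a value makes that value the result of the loop.
    break remaining if remaining <= threshold
    sleep poll_interval
  end
end
```

With threshold: 9 the loop stops as soon as fewer than ten jobs are left, which helps when other traffic keeps the queue from ever reaching zero.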
Promoting the index
Once the queue is empty, the new index is filled with data, but it might not be the active one yet. You need to ‘promote’ it to become the main index. When promoting it, it will merge with the previous main index, so new data isn’t lost. You probably only want to promote it when the reindex was successful, which is why I made promotion optional, but enabled by default. With the promote call we also pass the update_refresh_interval setting, which resets the refresh_interval back to the 1 second default. After the promotion the code cleans up the old indices and everything should be good to go.
You can now start querying with Searchkick, have fun!