Common Pitfalls While Using Sidekiq

Sidekiq is one of the most commonly used libraries with Ruby on Rails, allowing code to be executed independently from requests made to an application. It handles much of the complexity of managing background jobs, with built-in logic for scheduling and retrying them, but it does not handle every possible issue for the user. Most of these issues are documented in the wiki on the Sidekiq GitHub repository. The best practices page there makes a number of recommendations, namely using simple parameters for new jobs, as well as making the content of these jobs idempotent and transactional to prevent issues arising from race conditions and retries.

Keyword Arguments

Sidekiq allows users to define the signature of the perform method that Sidekiq will call through its perform_async, perform_in, and perform_at methods; however, these methods do not support keyword arguments, symbols, or some complex objects. This is because the provided parameters are converted to JSON when the job is pushed to Redis to be run later, and JSON cannot represent those types faithfully.
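
As a minimal illustration, consider a hypothetical job that takes a keyword argument. Calling perform directly works, but the JSON round trip through Redis does not:

class GreetingJob
  include Sidekiq::Job

  def perform(name:)
    puts "Hello, #{name}!"
  end
end

GreetingJob.new.perform(name: 'Ada') # works: an ordinary Ruby method call
GreetingJob.perform_async(name: 'Ada') # broken: depending on the Sidekiq
# version, this either raises at enqueue time or serializes the hash to
# JSON, so perform later receives {"name" => "Ada"} as a positional
# argument and raises an ArgumentError for the missing keyword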

This issue can be missed if a new Sidekiq job is tested only by calling perform directly, and not through the asynchronous methods. Sidekiq provides methods for testing new jobs here, along with examples of how to use them. In particular, this example:

require 'sidekiq/testing' # enables the fake job queue used below

HardWorker.perform_async(1, 2)
HardWorker.perform_async(2, 3)
assert_equal 2, HardWorker.jobs.size # jobs are queued, not yet run
HardWorker.drain # runs every queued HardWorker job
assert_equal 0, HardWorker.jobs.size

is sufficient to test that HardWorker can be called successfully, without issues arising from Sidekiq failing to pass the arguments from perform_async (or its equivalents) through to the user-defined perform method.

Errors from invalid parameters are exacerbated by Sidekiq’s retry mechanisms, which will repeat invalid calls until the retry limit is reached or the affected code is fixed by a new deployment.

Retries

By default, Sidekiq will retry a failed job up to 25 times, with increasing delays between attempts. This protects jobs from being lost when they fail due to errors that are resolved before the retry limit is hit. Retries come with a number of tradeoffs, however, especially if the job is not idempotent.
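
The retry count can also be tuned per job. A minimal sketch, using a job class made up for this example:

class SyncJob
  include Sidekiq::Job
  # Cap retries at 5 instead of the default 25. retry: 0 sends a failed
  # job straight to the dead set, while retry: false discards it entirely.
  sidekiq_options retry: 5

  def perform(record_id)
    # ...
  end
end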

Ideally, each job would be idempotent and transactional, avoiding cases where only part of the job completes before failing and cases where retries lead to duplicate outcomes. In practice, some tasks, such as network requests to external services, require extra care to avoid complications arising from Sidekiq's default retry strategy.

Network requests are generally not transactional, as they rely on external services that might fail partway through their changes, and requests will eventually time out. In the event of a read timeout, where the external service does not respond quickly enough, it is uncertain whether the request succeeded on the other end.

By default, Sidekiq will catch exceptions caused by these timeouts and retry the job, potentially leading to a duplicate call. Polling the status of the external service before each call can prevent this issue, but is costly. Alternatively, catching the error and managing any retries yourself bypasses Sidekiq's default retries. Some APIs offer idempotency guards, such as Stripe's idempotency keys, which allow Stripe to identify duplicate requests and ignore them. If all else fails, setting the retry count for a job to 0 may seem like an option, but there are other cases where Sidekiq will retry jobs, mainly when Sidekiq is shut down, intentionally or due to a crash.
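
As a sketch of managing retries yourself, assuming a hypothetical ChargeClient wrapper around an external payments API:

class ChargeJob
  include Sidekiq::Job
  sidekiq_options retry: false # handle the ambiguous failure ourselves

  def perform(order_id, idempotency_key)
    ChargeClient.create_charge(order_id, idempotency_key: idempotency_key)
  rescue Net::ReadTimeout
    # The charge may or may not have succeeded remotely. Re-enqueue with
    # the same idempotency key so the provider can deduplicate the call.
    self.class.perform_in(30, order_id, idempotency_key)
  end
end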

Deployment Retries

During a deployment, an old Sidekiq process needs to be shut down so it can be replaced by a new one. As a result, any jobs still active on the old process will stop running when it is terminated, possibly killing them before they finish their work.

Sidekiq outlines its standard (unpaid) solution to this issue here. Sidekiq will push jobs back to Redis to be retried later, provided they do not finish between Sidekiq being notified of an incoming shutdown and the shutdown itself. If Sidekiq is shut down while it is pushing jobs back to Redis, the unexpected shutdown could trigger super_fetch retries (covered below) for jobs that have already been pushed back to Redis for deployment retries. This can lead to duplicate jobs, and is another reason to assume that any job can be retried or duplicated and to guard against that in each job's code.

The Enterprise version of Sidekiq comes with the option to do rolling restarts, where the old process runs until it clears its current jobs while taking on no new ones. New jobs go to the new process instead, preventing deployment retries, at the cost of needing to avoid large database changes while rolling restarts are active, and of delays in updating code on long-running processes.

super_fetch

Paid versions of Sidekiq come with a feature called super_fetch. Without it, when a job fails due to issues with Sidekiq itself (such as out-of-memory errors), Sidekiq might permanently lose the job. This can also occur if the Sidekiq process is killed before it restores all of its jobs to the queue, typically during deployments. These issues can be mitigated by using multiple queues, spreading the risk of a crash, but this requires additional resources, and Sidekiq's documentation recommends against using too many queues.

With super_fetch, when Sidekiq recovers from a crash or unexpected shutdown, it will retry the affected jobs. On the current version of Sidekiq Pro, super_fetch will recover a job this way up to 3 times. Older versions of Sidekiq Pro (before 5.2) place no limit on how many times a job can be recovered.

These retries come with the same tradeoffs as normal ones, plus one more. If a specific job is the cause of Sidekiq crashing, then retrying it can cause all the jobs currently being run by Sidekiq to crash and be retried themselves. This can loop until the retry limit is hit on the problematic job, or until the job is manually cancelled (the only option on versions without the retry limit).

Unlike normal retries, these must be cancelled manually: either by redeploying a version of the code that will not crash or, as Sidekiq recommends, by having the job check Redis for a flag indicating that it has been cancelled before running. In the wiki's FAQ, Sidekiq provides code that can be added to a job to implement this check, allowing a job to be cancelled from a console on the Sidekiq process using the job's id.
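
A sketch of that pattern, adapted from the FAQ's approach (the exact code in the wiki may differ, and the Redis client API varies between versions):

class LongJob
  include Sidekiq::Job

  def perform(*)
    return if cancelled?
    # ...the actual work...
  end

  def cancelled?
    Sidekiq.redis { |c| c.exists("cancelled-#{jid}") == 1 }
  end

  # From a console: LongJob.cancel!(some_jid)
  def self.cancel!(jid)
    Sidekiq.redis { |c| c.setex("cancelled-#{jid}", 86_400, 1) }
  end
end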

Sidekiq only performs super_fetch retries when it shuts down unexpectedly, but by default these retries are logged at the same level as normal job start and end logs. As a result, they can be missed, either buried within less important logs or silenced if Sidekiq's logging level is set high enough. As of Sidekiq Pro 5.2, Sidekiq will fire a callback (details here) when super_fetch retries a job or kills one that has reached its super_fetch retry limit. These callbacks can be used to log, alert, and track super_fetch retries as needed.

Race Conditions

Sidekiq jobs run concurrently, both with the rest of the app they are part of and with other instances of the same job. These asynchronous tasks must account for each other in order to avoid race conditions: between jobs doing similar actions, between multiple instances of the same job, and between jobs and the parts of the application outside of Sidekiq.

Problematic race conditions often follow Murphy's Law: if one can happen, it will, especially since Sidekiq's retries and job scheduling can lead to jobs running at unexpected times. Sidekiq does not guarantee that jobs will be processed sequentially if it is configured to run multiple jobs at the same time. With any concurrency, jobs created back to back can run simultaneously or out of order. Each job's code should account for the possibility of duplicate jobs, on top of other parts of the application affecting the same database entries and external services.
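
For instance, two copies of a hypothetical job racing on the same row can both read the old value before either writes, losing one of the updates:

class CreditJob
  include Sidekiq::Job

  def perform(account_id, amount)
    account = Account.find(account_id)
    # Read-modify-write without a lock: a duplicate or concurrent job can
    # read the same starting balance, and one credit will be lost.
    account.update(balance: account.balance + amount)
  end
end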

Job Scheduling

Sidekiq allows jobs to be scheduled for specific times (give or take a few seconds, based on how often Sidekiq checks Redis for scheduled jobs and retries, which is outlined here). These scheduled jobs are not guaranteed to run immediately once polled; instead they enter at the back of the queue. If there is a backlog of jobs, the scheduled jobs will be delayed until the jobs in front of them are cleared.
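
For reference, the scheduling methods look like this (SomeJob and record_id are placeholders):

SomeJob.perform_in(10.minutes, record_id) # roughly 10 minutes from now
SomeJob.perform_at(1.hour.from_now, record_id) # at a specific time
# Once the poller moves a scheduled job to its queue, the job still waits
# behind whatever backlog that queue has built up.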

These delays can cause jobs to run at unexpected times, and can cause race conditions between periodic jobs scheduled close enough together.

For example, if a Sidekiq job is scheduled to run every 10 minutes, and the queue is backed up enough to delay jobs by 10 minutes, then multiple instances of the job can exist in the queue at the same time. When the issue is resolved and the backlog clears, these jobs can run back to back, or concurrently, depending on how many processes are working the queue. If a job does not account for this possibility, it can produce unexpected or duplicate results.

Locking

Locking is a common solution to most potential race conditions. Only one active thread in an application can hold a properly implemented lock (such as the database row-level locking strategies Rails provides for ActiveRecord objects), removing the potential for race conditions on the locked resource. A common issue with locks is a race condition between checking the state of an object and updating it, often called a time-of-check to time-of-use (TOCTOU) race condition.

If a developer is unaware of the need to lock before reading important data, it can lead to code like:

def unsafe_change_state_from_not_started_to_started
  # BUG: the state is read before the lock is acquired, so it can change
  # between this check and the update below.
  if state == 'not_started'
    with_lock do
      update(state: 'started')
    end
  end
end

If another, nearly identical method called unsafe_change_state_from_not_started_to_cancelled exists, a race condition can happen if one method is still holding the lock while the other loads the object. This can lead to the object’s state being unintentionally changed to started after it was changed to cancelled, or vice versa.

Checks made after calling Rails's lock! or with_lock methods are safe from this issue, as these methods reload the locked object as part of acquiring the lock. Similarly, Rails's optimistic locking accounts for this by rejecting an update if the database object has been updated elsewhere since it was originally loaded from the database. Rails locks only protect the locked objects themselves, leaving any unlocked objects used within a locked block exposed to race conditions unless they have their own locks or protections. Non-Rails locks that do not reload important objects within the protected code are similarly exposed.
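
A minimal sketch of the safe ordering moves the check inside the lock:

def change_state_from_not_started_to_started
  # with_lock reloads the record after acquiring the row lock, so this
  # check sees the current database value rather than a stale one.
  with_lock do
    update(state: 'started') if state == 'not_started'
  end
end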

Conclusion

This is not a full account of every possible issue that can arise while using Sidekiq. Many not covered here are detailed in the very helpful Sidekiq wiki. Beyond that, more recent issues might only appear on the Sidekiq issues page, and paid versions of Sidekiq come with the option of emailing Sidekiq's creator for support.