Google Cloud PubSub background job queue with Cloud Function workers

--

At Hubware we needed to schedule and queue background jobs and service them in bulk at arbitrary times in the future. We chose the Google Cloud PubSub framework as a message queue and Google Cloud Functions as workers to service the queued jobs. This unique combination created a reliable, scalable, and distributed cloud service.

Requirements

The functional description of the feature at first seemed simple: queue background jobs and service the queue at regular intervals. However, there were some key requirements that quickly revealed the feature was going to have some unique technical hurdles:

  1. Jobs are queued to be serviced at a later time. This is straightforward: when a job is requested, it must be queued now and serviced asynchronously in the future, leaving the requester free for other work.
  2. Each job has a unique time range when it can be serviced (e.g. Monday to Friday, 9:00 am to 5:00 pm). This adds significant technical complexity. Functionally, the jobs need to be serviced during the “working” hours of our customers. However, each customer has unique working hours. Further complicating the requirements, given that most of our customers are in the e-commerce sector, they have unique holiday schedules (e.g. extended hours leading up to and following Christmas). A generic 9 to 5 schedule is not good enough. Instead, the schedule must be customizable for each customer.
  3. Jobs scheduled “later” should not block jobs scheduled “sooner”. This is a key reason not to use a traditional message queue and why order is not important. When workers service jobs, the entire queue should be searched for jobs that can be serviced now.
  4. Jobs should be serviced in bulk when possible. Further adding to the technical complexity, we work with rate-limited APIs. Fortunately, the APIs support “bulk” operations. Any opportunity to combine API requests into bulk operations saves us money and reduces the chance that we exceed the rate limit when the feature starts to scale. It also reduces computing cost: with a serverless architecture, each execution cycle has an associated cost, so servicing multiple jobs per execution is more efficient than servicing a single job with multiple executions.
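To make the schedule requirement concrete, here is a minimal sketch of how a service might enqueue a job with its service window attached as message attributes, assuming the official @google-cloud/pubsub client. The topic name and the attribute names are our own illustration, not part of any API:

```javascript
// Minimal sketch: enqueue a job whose service window travels as message
// attributes, so the worker can check serviceability without parsing the body.
// The attribute names (windowStart, windowEnd, windowDays) are hypothetical.
const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();

async function enqueueJob(job) {
  // The job payload itself travels as the message body.
  const data = Buffer.from(JSON.stringify(job.payload));

  // PubSub attribute values must be strings.
  const attributes = {
    customerId: job.customerId,
    windowStart: '09:00',              // start of the customer's working hours
    windowEnd: '17:00',                // end of the customer's working hours
    windowDays: 'Mon,Tue,Wed,Thu,Fri', // days the job may be serviced
  };

  await pubsub.topic('TopicEnqueue').publishMessage({ data, attributes });
}
```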

Architecture

Our architecture is inspired by Google’s article describing how to use PubSub for long running tasks. We added a couple of unique twists: let the messages be queued in an existing PubSub subscription and service the jobs in bulk with Cloud Function workers (triggered periodically by Kubernetes CronJobs).

Our background jobs are neither long running nor resource intensive (in contrast to the example presented in Google’s article). Instead, they access rate-limited APIs, so letting them queue up to be processed in bulk is desirable. In turn, by triggering the workers periodically at a reasonable interval (e.g. every 5 minutes), the jobs are serviced relatively quickly, making the system feel responsive.

There are several benefits to using PubSub with Cloud Functions:

  • Reliable, distributed, and secure (thanks Google Cloud!)
  • Zero server maintenance (go serverless!)
  • Scalable (see below)
  • Monitoring (see below)

The following diagram illustrates the architecture. The numbered arrows illustrate the flow of jobs through the system:

Clock icon by Situ Herrera from www.flaticon.com
  • A service publishes a message (job) to the PubSub topic TopicEnqueue (arrow 1) whenever a new job needs to be serviced.
  • The jobs are queued by the PubSub subscription SubDequeue (arrow 2).
  • A cronjob regularly publishes messages to the topic TopicCron. In our case, the cronjob is configured in Kubernetes. However, it could be scheduled by Google App Engine task scheduling or another cron service (such as cron-job.org).
  • A worker Cloud Function is triggered by the PubSub topic TopicCron. The message contents (and attributes) are ignored. The PubSub message is simply the trigger for the function. The worker is effectively triggered at regular periodic intervals.
  • The worker Cloud Function pulls messages from the PubSub subscription SubDequeue (arrow 3).
  • The worker services the jobs without immediately acknowledging the pulled messages. Upon successfully servicing a job (arrow 4), the worker acknowledges the corresponding message. Jobs not ready to be serviced are left unacknowledged, to be serviced during a future execution cycle. A sketch of this worker follows below.
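The following is a minimal sketch of such a worker, assuming a background Cloud Function triggered by TopicCron. The helpers pullAllMessages (sketched later in this article), isServiceableNow, serviceJobsInBulk, and acknowledge are hypothetical, standing in for the REST pull, the working-hours check, the bulk API call, and the REST acknowledge respectively:

```javascript
// Sketch of the worker Cloud Function. The triggering TopicCron message is
// ignored; it only serves as the periodic trigger.
exports.worker = async (event, context) => {
  // Pull everything currently waiting in the queue (hypothetical helper
  // wrapping the REST pull work-around discussed below).
  const received = await pullAllMessages('SubDequeue');

  // Keep only the jobs whose service window includes the current time;
  // the rest stay unacknowledged for a future execution cycle.
  const dueNow = received.filter((m) => isServiceableNow(m.message.attributes));
  if (dueNow.length === 0) return;

  // Message data from the REST API is base64-encoded.
  const jobs = dueNow.map((m) =>
    JSON.parse(Buffer.from(m.message.data, 'base64').toString())
  );

  // Service the due jobs in one bulk API call where possible (arrow 4)...
  await serviceJobsInBulk(jobs);

  // ...and acknowledge only the messages that were successfully serviced.
  await acknowledge('SubDequeue', dueNow.map((m) => m.ackId));
};
```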

Scaling

As more jobs are requested, it is possible that jobs are queued faster than they are serviced. There are multiple ways to increase the throughput of the service:

  • The simplest solution is to increase the frequency at which the CronJob is triggered, resulting in the worker executing more often. The trade-off is that this type of scaling is not dynamic with respect to load: the function will execute more often even when there is low load.
  • The Cloud Function can also be deployed several times, resulting in parallel executions every time the CronJob is triggered. This will also result in static scaling.
  • A dynamic scaling solution could be implemented by monitoring the SubDequeue subscription (via the monitoring API) from an external process (possibly another Cloud Function) and scaling the number of deployed instances based on the number of messages (jobs) in the queue, as in the sketch below.
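As a sketch of what such a monitor might look like, assuming the official @google-cloud/monitoring client; scaleWorkers and the backlog threshold are hypothetical placeholders:

```javascript
// Sketch: read the SubDequeue backlog from the monitoring API and trigger a
// (hypothetical) scaling action when the queue grows too deep.
const monitoring = require('@google-cloud/monitoring');

const client = new monitoring.MetricServiceClient();

async function checkBacklog(projectId) {
  const seconds = Math.floor(Date.now() / 1000);
  const [series] = await client.listTimeSeries({
    name: client.projectPath(projectId),
    filter:
      'metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" ' +
      'AND resource.labels.subscription_id="SubDequeue"',
    interval: {
      startTime: { seconds: seconds - 300 }, // look at the last 5 minutes
      endTime: { seconds },
    },
    view: 'FULL',
  });

  // The most recent data point is the current queue depth.
  const backlog = Number(series[0]?.points?.[0]?.value?.int64Value ?? 0);
  if (backlog > 1000) {
    await scaleWorkers(backlog); // hypothetical: deploy more worker instances
  }
}
```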

Monitoring

The PubSub framework and Cloud Functions can be monitored directly in Google’s Stackdriver tools; no new tooling (or thought) is required. With about an hour of work, a dashboard showing the rate of messages being enqueued and dequeued, as well as the number of messages waiting in the queue, can be created.

Stackdriver Dashboard monitoring incoming and waiting jobs

Furthermore, the Stackdriver Alerting Policies provide great ways to send email and/or Slack notifications when the system enters an unhealthy state. The Cloud Function has not executed for 10 minutes? The SubDequeue subscription has too many messages? The oldest message is approaching the hard deadline of 7 days? Stackdriver lets you easily send email and Slack notifications before these problems become catastrophic.

Caveats

There are a few caveats when working with PubSub:

  1. Jobs not serviced within 7 days of being published are deleted. Period. While this is configurable (to be shorter), 7 days is the maximum lifetime of messages in PubSub.
  2. The message acknowledgement timeout is 10 seconds by default and configurable up to 10 minutes. It turns out this limit can be extended at runtime, with some extra effort on the client side, via the modifyAckDeadline API call (not exported in the node.js pubsub client); a sketch follows this list.
  3. Message order is not guaranteed. According to Google, “messages are delivered as fast as possible, with preference given to delivering older messages first, but this is not guaranteed.” Google has a long article that explains that in most use cases, the order of message delivery is in fact not important and presents a solution to add order in the client.
  4. Messages are guaranteed to be delivered at least once. Google states that, under rare circumstances, messages can be duplicated. It is better to write idempotent functions on the client. This is not really a problem: idempotent functions are already a principle of functional programming, so we can apply the same principle at the architectural level.
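To illustrate the second caveat, extending the acknowledgement deadline through the REST API might look like the following sketch; getAccessToken is a hypothetical helper returning an OAuth2 token for the service account:

```javascript
// Sketch: extend the ack deadline of already-pulled messages through the
// pubsub.googleapis.com REST API, since the call was not exported by the
// node.js client. getAccessToken() is a hypothetical authentication helper.
const fetch = require('node-fetch');

async function extendAckDeadline(projectId, ackIds, ackDeadlineSeconds) {
  const url =
    `https://pubsub.googleapis.com/v1/projects/${projectId}` +
    '/subscriptions/SubDequeue:modifyAckDeadline';

  const res = await fetch(url, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${await getAccessToken()}`,
      'Content-Type': 'application/json',
    },
    // ackDeadlineSeconds is capped at 600 (10 minutes).
    body: JSON.stringify({ ackIds, ackDeadlineSeconds }),
  });
  if (!res.ok) throw new Error(`modifyAckDeadline failed: ${res.status}`);
}
```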

The node.js pubsub client…

The design requires the worker Cloud Function to “pull” queued messages from the SubDequeue subscription. We found, and later had it confirmed by Google’s developers, that the official node.js PubSub client from Google is not compatible with Cloud Functions. This seems mostly due to the fact that the client does a lot of its work in the background. Google’s own documentation explicitly states not to start background activities in Cloud Functions, but the official node.js PubSub client does just that!

The solution is to use Google’s REST API to pull messages from the subscription. This works but, as discussed here and here, the API is not intended to fetch all queued messages: a single request is not guaranteed to deliver all waiting messages. A “hacked”, undocumented algorithm is required to work around how PubSub delivers the messages.
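As an illustration, the work-around amounts to something like the following sketch, which keeps pulling until several consecutive pulls come back empty (getAccessToken is again a hypothetical authentication helper, and the retry budget is our own choice):

```javascript
// Sketch of the pull work-around: a single REST pull may return only a
// subset of the waiting messages, so pull repeatedly until several
// consecutive pulls come back empty.
const fetch = require('node-fetch');

async function pullAllMessages(subscription, maxEmptyPulls = 3) {
  const projectId = process.env.GCP_PROJECT; // assumed to be set in the runtime
  const url =
    `https://pubsub.googleapis.com/v1/projects/${projectId}` +
    `/subscriptions/${subscription}:pull`;

  const all = [];
  let emptyPulls = 0;

  while (emptyPulls < maxEmptyPulls) {
    const res = await fetch(url, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${await getAccessToken()}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ maxMessages: 100, returnImmediately: true }),
    });
    const { receivedMessages = [] } = await res.json();

    if (receivedMessages.length === 0) {
      emptyPulls += 1; // an empty pull does not prove the queue is empty
    } else {
      all.push(...receivedMessages);
      emptyPulls = 0;
    }
  }
  return all;
}
```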

Conclusion

Our experience revealed that PubSub is intended either for immediate (push) processing of messages (e.g. triggering a Cloud Function) or for long-running processes that pull messages asynchronously and process them immediately. There are some unique challenges involved in reading (pulling) the entire queue and determining which messages should be serviced in a short-lived process (like a Cloud Function). However, we were able to work around these challenges to create an innovative cloud service built on a combination of Google Cloud Platform services.

Using PubSub in combination with Cloud Functions allowed us to quickly add an MVP (minimum viable product) feature with some great advantages: reliability, scalability, and monitoring.

Some improvements to the architecture are already foreseen. On the technical side, our work-around for pulling PubSub messages in a Cloud Function appears stable, but it remains an unproven use of PubSub. Functionally, the 7 day maximum lifetime of messages and the lack of an easy way to modify or delete messages in the queue may force us to move to a more traditional storage method (like a database).

Leave your questions and comments below.

--


Cloud Architect and university instructor with a passion for developer relations. Currently consulting for Nearform.