Speeding ROBLOX.com with Message Queuing

June 19, 2012

Archive

In an April blog post, Supporting Millions of Players at ROBLOX, we mapped out ROBLOX’s back-end web technology. Today, Gigi Sayfan discusses message queuing, a single component nestled in our greater web infrastructure, and how it helps us accelerate updates to our huge catalog and the user-account search function. He also offers practical advice on setting up a message queuing application for a specific set of goals.

In a common, distributed web application architecture, multiple web servers talk directly to a database and/or to back-end services that relay messages to the database. ROBLOX has millions of users, who constantly bombard our many web servers with requests that eventually funnel into the database. If the database is busy, it will be slow to respond to requests – both from the web servers and back-end services – and the user experience will suffer. Worst case scenario, requests time out.

To avoid slow responses and time-outs, ROBLOX has implemented a message queue, which currently manages roughly 15 million requests a day.

RabbitMQ bears some of the brunt on behalf of the web servers.

A message queue is a generic component with the sole purpose of receiving messages from producers, storing the messages, and delivering messages to consumers that process them. Consider this situation: we make an update to the ROBLOX catalog. The web server (producer) sends an update to the message queue, which stores the message. At some point, a special service (consumer) will process the message from the queue and update the catalog’s search index on Solr (our search platform). This means the web server doesn’t have to wait for the update to complete; it drops off the message and moves on to the next request.

Another benefit of a message queue is that there is very loose coupling between producers and consumers, so one side can completely change without affecting the other.

Messaging with RabbitMQ

We at ROBLOX picked RabbitMQ as our messaging solution. RabbitMQ provides a lightning-fast server that can be clustered, client libraries in many languages, an administration API and a slick web dashboard. It’s also free (using open protocol AMQP), fast, cross-platform, stable (based on Erlang’s Open Telecom Platform), field-tested and supported by a large, active community.

There are many other messaging products out there, but none provide all these benefits with the same degree of robustness and support.

Gentle Introduction to RabbitMQ

Earlier, we established that messaging involves a producer sending a message and a consumer receiving it. This is a very simplified picture of messaging.

Let’s start with exchanges and queues. When a message is sent to RabbitMQ, it is sent to an exchange with a target route. When the message reaches the RabbitMQ server, the server checks what matching queues are bound to the exchange and delivers the message to all of them. There could be zero, one or many queues bound to the exchange.

There are many possible messaging workflows and RabbitMQ accommodates most all of them, including generic work queues, publish/subscribe and topic routing. You can read more about these workflows here.

Queues can also be durable or transient. Undelivered messages will survive a server going down in durable queues, but not in transient queues. Messages can get lost between producers and consumers and, if a consumer fails to acknowledge a message, it – or a different consumer – may receive the same message again. There are various mechanisms to report back to producers and consumers about the state of messages and many ways to control what happens in different situations. The art of using a message queue correctly is to understand your requirements in various areas: performance, integrity, fidelity, availability, and whether message loss is acceptable. Then, find the proper combination of features and a workflow that satisfies your requirements.

How ROBLOX Uses RabbitMQ

A single RabbitMQ node can often serve as a messaging broker for a very large distributed system. But, for ROBLOX, where we have roughly 15 million messages to queue on an average day, the real power of RabbitMQ comes from using it in a cluster. This allows the message queue to scale with larger volumes of traffic, and eliminates the RabbitMQ node as a single point of failure – always important in highly available, distributed systems.

Currently, we see the most substantial benefit from message queuing with our catalog of virtual goods. ROBLOX uses a single cluster of three dedicated Linux boxes running RabbitMQ 2.8.2. Our web servers function as producers and send catalog update messages to a dedicated exchange. The exchange is bound to a single dedicated queue. On the consumer side we have a battery of back-end services that consume queued messages. In a queue workflow, each message is consumed by one consumer (service instance), which updates the catalog’s search index.

We don’t verify that each message arrives (publisher confirms) or that it was processed successfully — this requires a custom response queue. We use non-durable queues because the information the messages are based on persists in the database and we plan to have a background process update any lost messages. This design affords us a relatively simple and nimble RabbitMQ workflow. Duplicate messages are not a problem because they will cause an update that is identical to the current value. (Updates in this case are “idempotent,” if you’d like to increase your vocabulary.)

Our back-end code uses C#. RabbitMQ provides client libraries for many languages, including C#. However, we don’t want our producers and consumers to use RabbitMQ client directly. There are several reasons for that:

The RabbitMQ client exposes the entire spectrum of RabbitMQ features. It can be overwhelming and requires a lot of knowledge to use properly.
The RabbitMQ client doesn’t handle connections to a cluster, recovery and other important aspects out of the box.
It allows us to codify certain cross-cutting concerns, such as error handling and logging, without requiring every producer and consumer to do it.

RabbitMQ allows ROBLOX to use many diverse tools and skills across a range of environments and operating systems. It also enables cool features in search domain. For example, we use C# to write the Roblox.RabbitMQ library, Python + IronPython to run the RabbitMQ admin interface, and PowerShell + Fabric to control the Linux RabbitMQ cluster remotely from a Windows laptop.

Future Plans

Messaging is a cornerstone of modern, distributed web applications and we plan to use our message queue for many other services in the near future. Today, some of our services use the database as a makeshift message queue. It works, to a degree, but it is sub-optimal. Databases are not designed to serve as message queues; the messaging layer must be coded on top of the database, which can become a bottle-neck because it is unable to shuttle messages as fast as a real message queue.

Message queuing and RabbitMQ are great technologies. If you develop a large-scale distributed system, you more than likely have some use-cases that would benefit from a message broker. If you plan to integrate RabbitMQ, take the time to understand its many features and capabilities, and how they match your application. For more information on deploying a message queuing system, check out this article.

Blog Archive

Speeding ROBLOX.com with Message Queuing

Messaging with RabbitMQ

Gentle Introduction to RabbitMQ

How ROBLOX Uses RabbitMQ

Future Plans