May 20, 2007

Scaling Twitter

There are preconceived notions about Twitter because of the “5 questions” explosion a month ago. He didn’t say that scaling on Rails is hard. He said that they run Rails and scaling is hard.

The story of Twitter. Twitter went from being a small side project at ODEO, the brainchild of Jack Dorsey, to the largest Rails app in the world. He was fascinated with AIM status messages. The problem was that you had to be in front of the computer. He wanted to bring it to the rest of the world.

It started last March with very little traffic until SxSW last year, where there were visualizers and it blew up. Huge traffic spike. Then there was a lull. Then the media picked up the story and there was a bigger spike.

So… how do you scope for that?

More boxes.
- Still just one DB with master / slave replication.
- 32 cores of sun machines at joyent
- 120 mongrels across 19 cores
- Message processing across 16 cores
- Jabber across 2 cores.
- MySQL on one big 8 core box.
- 16 GB + of memcache.

Why do they need it?

Average 200-300 connections per second
Spiking to 800 connections per second
They’ve done 11,000 connections
MySQL did 2,400 per second.
Alexa (take with a grain of salt) says they have a ton of traffic, and that doesn’t even count API traffic, which is a lot of their traffic. API traffic was 20x their web traffic last time they checked.

Memcache is their savior (cache_fu / acts_as_cached)

They use the extended cache_stats plugin to get the stats.

They wrote a custom API caching plugin that is implemented as before and after filters. They are planning on sharing it once they can extract it. The hard part is cache invalidation. They do both time-based expiration and event-driven expiration.
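A rough sketch of the two expiration styles, not their plugin. Everything here (TinyCache, TimelineApi, the key format, the 60 second TTL) is made up for illustration, and a plain Hash stands in for memcache. The lookup plays the role of the before filter, the store plays the role of the after filter, and posting an update is the event that expires the entry.

```ruby
# Hypothetical in-memory stand-in for memcache.
class TinyCache
  Entry = Struct.new(:value, :expires_at)

  # The clock is injectable so time-based expiration can be exercised without sleeping.
  def initialize(clock: -> { Time.now.to_f })
    @store = {}
    @clock = clock
  end

  # Return the cached value if it is still fresh; otherwise compute, store, and return it.
  def fetch(key, ttl:)
    entry = @store[key]
    return entry.value if entry && entry.expires_at > @clock.call

    value = yield
    @store[key] = Entry.new(value, @clock.call + ttl)
    value
  end

  # Event-driven expiration: drop the key as soon as the underlying data changes.
  def expire(key)
    @store.delete(key)
  end
end

# Hypothetical API endpoint: cache lookup before rendering, cache store after.
class TimelineApi
  attr_reader :renders

  def initialize(cache)
    @cache = cache
    @updates = Hash.new { |hash, user_id| hash[user_id] = [] }
    @renders = 0
  end

  def show(user_id)
    @cache.fetch("timeline/#{user_id}", ttl: 60) do
      @renders += 1 # counts how often the expensive render actually runs
      @updates[user_id].join("\n")
    end
  end

  def post(user_id, text)
    @updates[user_id] << text
    @cache.expire("timeline/#{user_id}") # the event that invalidates the cached response
  end
end
```

The time-based path bounds how stale a response can get even if an invalidation is missed. The event-driven path keeps the common case fresh.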

Denormalize. Keep lists of IDs and such, and cache those lists of IDs. This takes a ton of pressure off their DB by preventing them from doing joins. Select from where id IN (list).
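A minimal sketch of the idea, assuming a hypothetical updates table and a follower relationship. TimelineStore, its data layout, and the cache key are invented here, and Hashes stand in for the DB and memcache. The join runs once, the resulting list of IDs is cached, and later reads become a primary key lookup.

```ruby
# Hypothetical stand-ins: Hashes play the role of the database and of memcache.
class TimelineStore
  attr_reader :joins

  def initialize(updates, following)
    @updates = updates     # id => { id:, user_id:, text: }
    @following = following # follower_id => [followed user ids]
    @cache = {}
    @joins = 0
  end

  # Cheap path: read the cached list of IDs, falling back to the join only on a miss.
  def timeline_ids(follower_id)
    @cache["timeline_ids/#{follower_id}"] ||= timeline_ids_from_db(follower_id)
  end

  # Build the lookup by primary key. Integer() rejects anything that is not a number.
  def select_by_ids_sql(ids)
    return nil if ids.empty?

    "SELECT * FROM updates WHERE id IN (#{ids.map { |id| Integer(id) }.join(', ')})"
  end

  private

  # Expensive path: the equivalent of joining followings against updates.
  def timeline_ids_from_db(follower_id)
    @joins += 1
    followed = @following.fetch(follower_id, [])
    @updates.values.select { |u| followed.include?(u[:user_id]) }.map { |u| u[:id] }
  end
end
```

In a real app the cached list would also need to be expired or appended to when someone posts, which is the invalidation problem mentioned above.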

He showed their code that splits off everything into code. The actual method name is fugly_dist_idx. It was replaced with Starling, a distributed queue written in Ruby. They sat down with Paul (the creator of GMail) and tried to figure out how to make the distributed queue robust in the face of system caches. They were overarchitecting it. How many messages a second? 9. How many in 6 months? A factor of 10 more. Just pipe it to disk.

They hope to open source their queueing system. It speaks the memcache protocol, so any language with a memcache client can push to the queue.
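A sketch of what "just pipe it to disk" could look like. This is not Starling's code. DiskQueue and its journal format are invented for illustration, and the set/get naming assumes that a memcache style set enqueues and a get dequeues. Every operation is appended to a log file, and the log is replayed on startup.

```ruby
require "json"

# Hypothetical sketch: a FIFO queue that journals each operation to an append-only file.
class DiskQueue
  def initialize(path)
    @path = path
    @items = []
    replay if File.exist?(@path)
    @log = File.open(@path, "a")
    @log.sync = true # flush Ruby's buffer on every write (this is not an fsync)
  end

  # memcache-style "set": append an item to the queue.
  def set(value)
    @log.puts(JSON.generate(["set", value]))
    @items.push(value)
    true
  end

  # memcache-style "get": remove and return the oldest item, or nil when empty.
  def get
    return nil if @items.empty?

    @log.puts(JSON.generate(["get"]))
    @items.shift
  end

  def size
    @items.size
  end

  def close
    @log.close
  end

  private

  # Rebuild the in-memory queue by replaying the journal from the start.
  def replay
    File.foreach(@path) do |line|
      op, value = JSON.parse(line)
      op == "set" ? @items.push(value) : @items.shift
    end
  end
end
```

At around 9 messages a second, an append per operation is cheap. A longer-lived version would need to compact the journal so that it does not grow forever.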

Community

One of the most helpful things has been talking to the community. You can’t bunker down in your office and go from there. The community is helpful. Talk to them. Treat your scaling plan like a business plan: you need to get a board of advisors. There are so many brilliant people in the community that you need to use them.

More thorough slides from another presentation.

Questions

Where does the logic live to do event driven invalidation in the cache?

(In Mongrel)

Rails and XMPP integration?

There is a Jabber server, and a completely separate client that they wrote in Ruby that processes the messages.

When do they anticipate partitioning and how will they handle arbitrary relationships between users?

Users isn’t the key. The reason they haven’t partitioned yet is that they want to partition based on time, because most requests are very temporally local. They plan on doing this in the next 3 to 4 months. Oracle does this out of the box, but Alex says he’ll eat his hat before he buys an Oracle license. So when they figure it out they’ll push it out.
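A small sketch of partitioning on time, under assumptions the talk did not state: monthly partitions and an updates_YYYY_MM naming scheme, both invented here. The point is that a request for recent messages touches one or two partitions, whatever the relationships between users look like.

```ruby
# Hypothetical sketch: route rows to monthly partitions named updates_YYYY_MM.
module TimePartition
  module_function

  # Partition that holds a row created at the given time (normalized to UTC).
  def partition_for(time)
    time.getutc.strftime("updates_%Y_%m")
  end

  # Every partition a query over [from, to] has to touch, oldest first.
  def partitions_between(from, to)
    start = from.getutc
    year = start.year
    month = start.month
    last = [to.getutc.year, to.getutc.month]
    names = []
    while ([year, month] <=> last) <= 0
      names << format("updates_%04d_%02d", year, month)
      month += 1
      if month > 12
        month = 1
        year += 1
      end
    end
    names
  end
end
```

For example, `TimePartition.partition_for(Time.utc(2007, 5, 20))` gives `"updates_2007_05"`.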

Trying to figure out how to effectively open source things. Trying to be more

SMS gateway? Capacity? Do you have to pay per message?

There are a bunch of companies that do this for you. They use their APIs. Capacity isn’t a bottleneck. They used SimpleWire and are starting to use new people. It’s “hella expensive” and the carriers hate you.

What is your business model? How do they plan to make money?

Everyone asks this question. Short answer is Next Question (Laughter). They have some ideas, and brilliant people working on something. It won’t be traditional.

How many ejabberd nodes do they have?

A lot. They beat the hell out of it.

Any trouble synchronizing between ejabberd and Rails users?

No.

With all those mongrels, how do you do a deploy and keep everything running?

They want to know from the audience. They push a lot. They do a review and push. If you’re connected when they do a deploy it 500s the users. This is terrible. They don’t have a graceful way to handle this yet. Queue in the mongrels, they were blown away by Ezra’s talk yesterday.

So when you restart do you kill all your mongrels, or do a “rolling blackout”? Questioner’s company does a rolling blackout.

They drop all of theirs because the queue is in the mongrels, and if they do a rolling restart then all the queues get filled up. They’ve tried it, but they’ve had mixed results.

Comment on “Twitter is stupid because they don’t do ETags” (Tim Bray called them out in the keynote)?

They do use ETags, but they are not valid because the page has a stamp at the bottom that shows the message request time. Web traffic is such a small portion that it doesn’t really matter. The Yahoo High Performance group put out slides yesterday saying that it doesn’t matter.

Wouldn’t ETags help with API Caching?

Only if the client obeys them, and most don’t.
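A sketch of why a per-request stamp defeats ETags, assuming the ETag is a hash of the response body. The EtagSketch module, the MD5 choice, and the footer text are invented for illustration. A client that sends the tag back in If-None-Match gets a 304 only while the body is byte-for-byte the same.

```ruby
require "digest/md5"

# Hypothetical sketch of a conditional GET keyed on a hash of the response body.
module EtagSketch
  module_function

  # Quoted, as ETag header values are.
  def etag_for(body)
    %("#{Digest::MD5.hexdigest(body)}")
  end

  # Returns [status, body, etag]. A matching If-None-Match means nothing is re-sent.
  def respond(body, if_none_match)
    tag = etag_for(body)
    return [304, "", tag] if if_none_match == tag

    [200, body, tag]
  end

  # Render a page, optionally with a footer stamp that changes on every request.
  def page(updates, rendered_at: nil)
    footer = rendered_at ? "\nrendered at #{rendered_at}" : ""
    updates.join("\n") + footer
  end
end
```

With the stamp in the body, two requests a second apart produce different tags, so the 304 branch never fires even though the messages have not changed.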

What’s next?

Callback API has a lot of potential. They want to do it, the 3rd party developers want it. Look for it.

We love XMPP and are working with the people in the Jabber community to push it forward. Working with them on some cool new ways to use Jabber.

pragmatist
Patrick Joyce