“Scaling” isn’t just a matter of horizontally adding more servers

Written by Kamil Arli

Forbes has published an article on Snapchat.

Could Snapchat have saved money by building its own cloud infrastructure instead of paying Google? originally appeared on Quora – the place to gain and share knowledge, empowering people to learn from others and better understand the world.

Answer by Yishan Wong, former Reddit CEO, on Quora:

I think it is potentially the sign of a fatal flaw in their business. There are two main reasons:

The first is that it affects the cost structure of their business. At scale, it is always cheaper to build and run your own core infrastructure. Cloud services have extended the runway of how long it’s a good idea to rely on them (it’s a very good idea during the early period when your growth and product-market fit are uncertain) but once you have hit critical mass and know you will always need a certain amount of sustained computing resources, it is cheaper to establish and run your own servers. This is especially important when your business is reliant on advertising to large numbers of users, so the per-user cost of your business (thus per-server cost) needs to be as low as possible in order to keep your business viable.
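The cost argument above can be made concrete with a rough break-even calculation. The sketch below is only illustrative: the function name and every dollar figure are hypothetical assumptions, not numbers from Snapchat, Google, or the original answer, and a real comparison would also need to account for staffing, depreciation, and growth.

```python
import math

def breakeven_month(upfront_cost, own_monthly_cost, cloud_monthly_cost):
    """Return the first month in which cumulative spending on your own
    infrastructure is no greater than cumulative cloud spending,
    or None if running your own servers never pays off."""
    monthly_saving = cloud_monthly_cost - own_monthly_cost
    if monthly_saving <= 0:
        # Own infrastructure costs at least as much per month as the cloud,
        # so the upfront investment is never recovered.
        return None
    return math.ceil(upfront_cost / monthly_saving)

# Hypothetical figures, chosen only to show the shape of the trade-off.
months = breakeven_month(
    upfront_cost=1_200_000,      # assumed one-time hardware and build-out cost
    own_monthly_cost=50_000,     # assumed monthly cost of running own servers
    cloud_monthly_cost=150_000,  # assumed monthly cloud bill for the same load
)
print(months)  # 12
```

With these made-up inputs the investment is recovered after 12 months. The point of the sketch is that the answer depends on a sustained, predictable load: if the monthly saving is small or uncertain, the break-even point moves out or disappears.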

The second reason is more subtle: when you are running a world-class-sized user-facing consumer business, you are often operating at the limits of what current technology can do. “Scaling” isn’t just a matter of horizontally adding more servers; it’s a matter of identifying bottlenecks that are specific to the behavior of your particular application and then debugging or growing your infrastructure in order to handle those bottlenecks.

The nature of confronting this problem is non-obvious, so I’ll try to explain: depending on exactly what your application does, once millions of people start using it daily, it will place highly unique stresses on your computer infrastructure. One application may be very heavy on CPU usage. Another may require a lot of network traffic. Those are the simple things. More advanced issues could be specific patterns of cache accesses that happen to be adversarial to the default cache replacement algorithms of the OS or underlying hardware. Or network traffic of a particular rate and packet size that happens to jam up routers in a particular way. Or disk access patterns that hop around randomly and then occasionally concentrate on one area, thus thwarting both default caching strategies and time-based archiving strategies. These behaviors derive from the specifics of the product usage patterns and are different for every business.
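As an illustration of the cache example above, here is a minimal sketch of one well-known adversarial pattern. It assumes a least-recently-used (LRU) replacement policy, which is a common default but is not named in the original answer; the function names and sizes are hypothetical. A workload that repeatedly scans a set of keys just one larger than the cache makes every access miss, while a set that exactly fits hits almost every time.

```python
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    """Simulate an LRU cache and return the fraction of accesses that hit."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in accesses:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used key
            cache[key] = True
    return hits / total if total else 0.0

def cyclic_scan(num_keys, rounds):
    """Access keys 0..num_keys-1 in order, repeated `rounds` times."""
    return [key for _ in range(rounds) for key in range(num_keys)]

CAPACITY = 100  # assumed cache size, in entries

fits = lru_hit_rate(cyclic_scan(100, 50), CAPACITY)       # working set fits
thrashes = lru_hit_rate(cyclic_scan(101, 50), CAPACITY)   # one key too many

print(fits)      # 0.98
print(thrashes)  # 0.0
```

Adding a single key to the working set drops the hit rate from 98% to zero, because LRU always evicts exactly the key that will be needed next. Adding more identical servers does not help here; the fix has to come from understanding the access pattern and changing the caching strategy.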

All of these things only arise once your application is operating at a scale larger than any similar application has reached, i.e. the problem is unique to the handful of businesses that are category leaders and that have millions or billions of interactions a day.

About the author


Kamil Arli

Editor of Digital Media Consultant
