Fundamentals of Massive Linux Scaling
by Mark Rais, senior editor ReallyLinux.com



This introductory article encapsulates four of the top things I learned about major enterprise Linux scaling. There are many other essential ingredients, and I cover some of them in my Scaling Linux Servers article. However, for the time being I think these four convey the first rung, the foundational aspects, of a robust scaled server complex using Linux. They also address issues I see fairly consistently in some Linux enterprises.

My hope is that these fundamental principles will help some administrators overcome obstacles with their massive Linux scaling projects.


CACHING IS NOT THE SOLUTION
Note that caching can be part of the solution. I propose, however, that it cannot be the crux of the solution. I'm referring here to multi-tiered infrastructure. If you have a single-tier server complex and you place caching boxes into the fray of incoming server requests, then perhaps it will work, for simple socket requests for instance.

But when you start to see latency issues across the network routers, the database I/O, and the in-bound pipes, an emphasis on caching is just too one-sided in my book.

Instead, I propose that where some architects put all their eggs in the caching basket, others are wise enough to scale where known latencies already exist, rather than try to offset them with caching schemes that often fail.

How do I know this? When my team tried to use caching as a foundational component of scaling for the "AOL/CBS Big Brother" website, it simply failed at first. We saw over half a million in-bound requests per second come through the pipes, and the whole thing melted. It melted so badly that our miscalculation began choking out routers way down the pipe. In reality we had failed to understand two very important aspects of caching - besides learning that a prime-time TV show can really juice up web hits.



"We saw over half million in-bound requests per second come through the pipes"

First, caching for content-driven websites is often problematic, since certain content pieces are unique and dynamic up to the last minute. So where we should normally have had the final content delivered and cached, some of the editorial staff were still working on bits and pieces of content. Some of that content was essential, and caching was of absolutely no use for it. This is especially an issue on news-oriented sites. Obviously this is more a human factor than a hardware factor, but it is nevertheless important.
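To make that first point concrete, here is a minimal sketch of a front-side cache, written in Python purely for illustration (the get_page() helper, the 60-second TTL, and the volatile flag are all hypothetical, not our actual AOL/CBS setup). Anything flagged as late-breaking has to bypass the cache, so for that content no amount of cache tuning helps.

    import time

    # Minimal illustrative sketch: a TTL cache in front of a content fetch.
    # Pages still being edited right up to air time must bypass the cache,
    # so the cache buys nothing for them however well it is configured.

    CACHE_TTL = 60   # seconds; hypothetical value
    _cache = {}      # url -> (expires_at, body)

    def fetch_from_origin(url):
        # Placeholder for the real content fetch (database, file store, etc.)
        return "<html>... content for %s ...</html>" % url

    def get_page(url, volatile=False):
        """Return a page body; volatile marks late-breaking editorial content."""
        if volatile:
            # Caching is of no use here: the piece may change at any minute.
            return fetch_from_origin(url)
        now = time.time()
        hit = _cache.get(url)
        if hit and hit[0] > now:
            return hit[1]                       # cache hit
        body = fetch_from_origin(url)
        _cache[url] = (now + CACHE_TTL, body)   # cache miss: store with TTL
        return body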

Second, caching works when properly configured. However, configuring caching for major scaling endeavors also introduces complexities you ordinarily do not notice. Your caching schema for an existing infrastructure may simply not address new peak load characteristics.

In our case, the caching failed because major schema changes were needed, and those changes ultimately proved to be untestable without true peak load. Even the best testing tools cannot present true peak load evaluations, and thereby fail to uncover certain key nuances in the schema. The result was that the caching systems began passing the ball to one another at such a high rate that they actually took each other down. The pressure to introduce a new scheme, along with many new components, in a ridiculously short period of time contributed to the failure.

I have found that effective mass-scale caching requires time and thorough testing - two things that are sometimes hard to come by during a "scaling crisis."


KEEP DNS ROUND ROBIN FOR THE BIRDS
Well, we can all probably agree on one thing. Using the DNS server as a primary mechanism for load balancing sucks. Perhaps my conclusion isn't as eloquent as some may prefer, but it's based on some hard facts.

First, I've met so many people who use simple DNS round robin configurations to deal with peak load balancing and simple scaling that it scares me. If you're getting 1,000 hits per second, maybe this will work - for the time being. But when you move to serious peak loads, DNS round robin falls flat on its face because it cannot address the failure points.

Second, and following from the first point, DNS can't even identify when a server fails under load. Instead, it keeps passing the requests along. And unless you've written some fail-over code to pull a downed server off the DNS, it just keeps failing.
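By way of illustration, here is a minimal sketch of the kind of fail-over glue I mean, in Python, with hypothetical addresses and a stubbed-out update_dns_records(). Plain DNS round robin never performs a check like this, so a dead box stays in rotation until a human pulls it.

    import socket

    # Minimal sketch, not production fail-over code: the pool addresses and
    # the update_dns_records() stub are hypothetical.

    WEB_SERVERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

    def is_alive(ip, port=80, timeout=2.0):
        """Crude health check: can we open a TCP connection to the web port?"""
        try:
            sock = socket.create_connection((ip, port), timeout=timeout)
            sock.close()
            return True
        except OSError:
            return False

    def update_dns_records(healthy_ips):
        # Stub: push only the healthy A records back into the zone
        # (nsupdate, a provider API, or a regenerated zone file).
        print("keep in rotation:", healthy_ips)

    if __name__ == "__main__":
        update_dns_records([ip for ip in WEB_SERVERS if is_alive(ip)])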

I guess the one positive for using DNS is that if a server chokes on a request, the DNS simply returns a failure rather than passing the request on or stacking it in an exponentially growing queue.

The main point is that using DNS round robin isn't a great solution for you whether you hit peak or not. It simply cannot address re-balancing or cutoffs without manual intervention.

If you have the money, then solutions built on dedicated load-balancing switches, like the ones Foundry offers, make a whole lot more sense. I just prefer not to see people dealing with peak load failures on Friday nights because they have to manually adjust the DNS.

Someone asked if this includes general round robin approaches. It all depends on the situation. We once used a "stupid" round robin based on simple code that alternated which servers would get requests by time: every few seconds, requests were directed to a different server. It actually worked, because we were passing very little data per request. We used this to do a simple and rather inexpensive load balance for shopping requests during peak holiday shopping times. The infrastructure stayed relatively the same, with only the addition of a front caching server managing requests.
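Roughly, that scheme looked like the following sketch (Python here purely for illustration; the host names and the five-second window are hypothetical, not the original code):

    import time

    # Minimal sketch of the "stupid" time-based round robin described above.
    SERVERS = ["shop1.example.com", "shop2.example.com", "shop3.example.com"]
    WINDOW_SECONDS = 5

    def pick_server(now=None):
        """Every few seconds the whole request stream shifts to the next server."""
        now = time.time() if now is None else now
        slot = int(now // WINDOW_SECONDS)
        return SERVERS[slot % len(SERVERS)]

    # Example: decide where an incoming request goes.
    print("send this request to", pick_server())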

But let me say that this scheme fails when a request with high retrieval latency, such as one for large images, lands on an already loaded server and requests begin to slow down. Then every other system on the round robin takes up the load, right? WRONG. The next server in line receives the offloaded requests and it too goes down. Yet again you can see a nice domino effect.

In general, round robins work well in smaller-scale environments, but not when dealing with major peak loads - emphasis on peak. I'd just as soon keep my distance unless it's a mid-sized environment.


NFS AND BIG DISK DRAMA
For one reason or another, NFS mounts are everywhere in the Linux world. There's absolutely nothing wrong with NFS in most situations. But add up those mount points, spread them across a server farm, add cross-site connectivity or other complexity, and you get into a bit of a muddle.

The biggest issues I've personally seen with mass-scale NFS mount infrastructure are that either the mount points start failing or the disk I/O chokes when implemented on striped disks.

If you use NFS mount points, there's nothing bad about it.

If you use RAID 5 striping, there's nothing bad about it.

If you use large disks, which most people do to save costs on their storage solutions, there's nothing inherently bad about it either.

But combine large disks and complex striping with lots of NFS mount points, and you reach that wonderful realm of death by positional latency. It is one of the biggest pains to uncover and then resolve during a scaling crisis.

Instead, I've consistently shared with people that if you're looking to do some serious scaling with NFS mounts, please stay away from those huge cheap drives.

Getting several smaller-capacity drives costs more, but it reduces the chances that disk latency combined with striping algorithms will become a culprit for NFS mount point failures. This may be old news to some, but there are still many shops mixing these three ingredients in their daily Linux scaling work.



"stay away from those huge cheap drives"


HYPER DYNAMIC
I love MySQL. I love seeing PHP and other dynamic web design. They provide a great tool set and make most websites more valuable. But there is no question that some great websites have choked because the architecture placed everything into dynamic page loads.

I still believe, if you're serving serious web hits, that reducing the dynamic requests and increasing the static HTML content is a sure way to beat the peak load scaling problem. I know this from experience, because the solution the outstanding Operations staff used to bring the Big Brother website back to life was to flatten the highly dynamic site into basic HTML, rdist the pages to about 220 web servers, and run those servers at 105% of peak load.
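To give a feel for what "flattening" means in practice, here is a minimal sketch in Python (the URLs, paths, and page list are hypothetical, and the real work was done with the Operations team's own tooling): render each key dynamic page once, save it as plain HTML, and let rdist push the result out to the farm.

    import os
    from urllib.request import urlopen

    # Minimal illustrative sketch of flattening dynamic pages into static HTML.
    KEY_PAGES = {
        "/index.html":    "http://dynamic.example.com/index.php",
        "/episodes.html": "http://dynamic.example.com/episodes.php",
    }
    STATIC_ROOT = "/var/www/static"

    def flatten():
        for static_path, dynamic_url in KEY_PAGES.items():
            html = urlopen(dynamic_url).read()           # render the page once
            out_path = STATIC_ROOT + static_path
            os.makedirs(os.path.dirname(out_path), exist_ok=True)
            with open(out_path, "wb") as f:              # save it as plain HTML
                f.write(html)

    if __name__ == "__main__":
        flatten()
        # Next step, outside this script: rdist (or rsync) STATIC_ROOT to the farm.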

There's nothing wrong with designing web pages and implementing dynamic websites. But under spike loads, the key hit pages need far fewer dynamic elements that must be called out of caches, out of databases, out of other server file stores, and far more HTML. Yes, good old HTML. The good major websites use this nicely balanced approach. Unfortunately, with today's emphasis not only on dynamic content but also on dynamic ads, the slowdown is perceptible even to web surfers who aren't technically savvy.

One key ingredient is to isolate non-essential elements of a web page and use static HTML for them. Another is to ensure that the page design itself contributes to load balancing during peaks - for instance, by offering site navigation that does not channel all readers into single, highly dynamic pages. Instead, the navigation can create parallel paths, reducing the overall requests on any single page while still giving users the content they require. Finally, if you must keep certain elements dynamic, then ensure that you have normalized your database. Simplifying your DB, cleaning up tables and keys that are redundant or bottlenecks, and avoiding BLOBs (binary large objects) like the plague will further contribute to peak performance.
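On the BLOB point specifically, here is a minimal sketch (SQLite via Python, purely for illustration; the table and column names are hypothetical). Storing a path to an image kept on the file servers keeps the hot tables small, while stuffing the binary itself into a BLOB column drags megabytes through the database on every query.

    import sqlite3

    # Minimal illustrative sketch of the BLOB point.
    conn = sqlite3.connect(":memory:")

    # Discouraged: the image bytes live inside the database row.
    conn.execute("""CREATE TABLE article_heavy (
                        id    INTEGER PRIMARY KEY,
                        title TEXT,
                        photo BLOB)""")

    # Preferred: the row stays tiny; the web or file servers serve the image.
    conn.execute("""CREATE TABLE article_lean (
                        id         INTEGER PRIMARY KEY,
                        title      TEXT,
                        photo_path TEXT)""")

    conn.execute("INSERT INTO article_lean (title, photo_path) VALUES (?, ?)",
                 ("Season finale recap", "/images/finale.jpg"))
    conn.commit()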

Finding the right balance between dynamic and static is not easy, and it often leads to feuds between the operations department trying to scale for peak loads and the design department looking to create ever more compelling content. But I'm of the opinion that finding such a balance is essential for good scaling.


Ultimately, the decisions for scaling will fall on the shoulders of the operations personnel, who are likely to be called in at the last minute, during a crisis, and under pressure. By keeping these four foundational aspects in mind during the design phase, you can make even the most severe load spikes sustainable.


Mark Rais serves as senior editor for reallylinux.com, and has written a number of books and articles related to enterprise Linux and UNIX implementation. He also served as a senior technology manager for America Online, Inc. and Netscape, as well as task force leader for several international non-profit organizations. Today, Rais promotes Linux use worldwide as an integration consultant.