Version 1.4; Updated 2014-11-26
1.4 - updated to include fiber paths and more info regarding change controls; clarified email policies
1.3 - added routing info, control panel info
------------------------------------------------------------------
I. Servers are hosted in a cluster that contains these components:
------------------------------------------------------------------
A. Cluster Farm Components
B. Richweb operates 4 production clusters at 2 different datacenters:
- nginx load balancer
- nginx CDN (content delivery network node for static files)
- apache2 backend web servers
- php5 (3 different build options depending on php version needed)
- MySQL (Percona MySQL + XtraDB, with InnoDB enhancements)
- memcached for session sharing and data object caching
- pure-ftpd (FTP+SSL supported) for file management
- OpenBSD firewalls with CARP in an N+1 redundant failover pair
- Cisco 4948 layer 2 switched infrastructure
C. Richweb currently operates 4 clusters at 2 datacenters:
- PixelFactory DataCenter (Ashland, VA)
- Level3 Carolina Ave (near the RIR racetrack, Richmond, VA)
The clusters:
- A. High-end sports sites (PixelFactory)
- B. Mixed-use cluster (Level3)
- C. High-end sports sites (Level3)
- D. Business sites (PixelFactory)
These clusters are dedicated to hosting customer web applications. End user email is separate from these web clusters for multiple reasons.
Richweb operates several additional clusters for other operations:
- MailScanner Incoming Email filtering cluster (6 nodes)
- SMTP GW outbound Email filtering cluster (6 nodes)
- Authoritative DNS cluster (3 nodes)
- Recursive DNS Cache (3 nodes)
- NTP (Network time) servers (3 nodes)
- OpenVPN remote access (2 nodes)
- Customer Email server (Smartermail stand alone server)
We do not install control-panel software for customers. Running DNS, email, mail scanning, etc. all on one virtual server with a web-based control panel creates a lot of problems. Richweb's approach is to split services into silos, with hardened appliances or servers/virtual servers dedicated to each service.
For example, the Authoritative DNS servers run ONLY AUTH DNS - nothing else is enabled or installed on those systems. This prevents an issue in one service from affecting another service or platform.
If a customer has an existing VPS solution, we extract the website and load it onto our web platform; the email accounts are loaded onto Smartermail, and mail filtering is done on the MailScanner platform. This means that if we have a security update that affects a certain service, we know exactly where the update needs to go and what service(s) will be affected.
Note: Customers are not granted DNS management access; all DNS change requests are handled by the network engineering team, which receives DNS training as part of its job training. We do not want web developers or customers causing outages by breaking DNS.
Smartermail and MailScanner both have web management portals; customers are granted access to fully manage their email filtering and delivery/storage policies.
D. Technical Details:
1. In production use non-stop since fall 2011; we have averaged bringing on a new cluster about once every 10 to 12 months. Each cluster has between 60 and 120+ DEDICATED CPU cores to run applications. These are not virtual CPU cores but actual physical cores: on a dual quad-core server, for example, we count 8 physical cores towards our total cluster capacity. Most other virtual environments sell CPU cores as time slices of a core; a VPU (virtual processor) is typically 1/4, 1/8, 1/16, or sometimes 1/32 of a physical CPU core.
2. We centralize mysql db storage: We build 3 dedicated MySQL storage nodes for each cluster:
- Master Node
- Replicated Node
- Non-Replicated Node
Many customer applications written in PHP do NOT make use of the ACID-compliant properties of MySQL and in fact break when they are replicated, because they use SQL queries that are NOT guaranteed to return the same data on the master and on the slave (think SELECTs with LIMITs based on criteria that vary from second to second). These applications need to be installed on the Non-Replicated Node. All other apps are installed on the Master Node, which replicates to the Replicated Node. In the event of a hardware failure the Replicated Node can be promoted to master; this is a manual change-control event.
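The kind of replication-unsafe query described above can be sketched in SQL; table and column names here are hypothetical, not from any customer schema:

```sql
-- Replication-unsafe: the LIMIT is applied to rows ordered by a value
-- that changes from second to second, so a master and slave executing
-- the same statement may pick different rows.
SELECT id FROM sessions
ORDER BY last_seen DESC
LIMIT 10;

-- Writing based on that kind of ordering is worse: master and slave can
-- delete different rows and the datasets silently diverge.
DELETE FROM sessions
ORDER BY last_seen ASC
LIMIT 100;
```

Apps built around queries like these belong on the Non-Replicated Node, where there is no slave to diverge from.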
3. We use dual quad-core IBM 3550 servers to power the apache nodes and RAID-10 Dell storage servers to power the backend NFS arrays. Typically an IBM 3650 will power the database nodes, though we also use Dell storage appliances here as well. Our policy is to use RAID-10 on the SAN and the DB storage nodes; the remaining RAID-5 nodes are being upgraded to RAID-10.
4. We use gigabit Cisco 4948s (16 MByte deep-buffered top-of-rack switches) with redundant power supplies to power the cluster switching. Each server has a pair of bonded NICs for HA (failover) in a VLAN trunk group on the switch.
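A bonded failover pair of the sort described can be sketched on a Debian-style system as below; the interface names, VLAN tag, and addresses are illustrative, not the actual cluster values:

```conf
# /etc/network/interfaces sketch: active-backup NIC bond with a tagged VLAN.
auto bond0
iface bond0 inet manual
    bond-slaves eth0 eth1
    bond-mode active-backup    # pure failover, no LACP required
    bond-miimon 100            # check link state every 100 ms
    bond-primary eth0

auto bond0.100                 # tagged VLAN riding on top of the bond
iface bond0.100 inet static
    address 10.0.100.11
    netmask 255.255.255.0
```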
5. We virtualize all services on the cluster. Linux-VServer is our virtualization platform. This commits a minimum amount of CPU to hypervisor tasks; in fact there is no hypervisor. Linux vservers are simply chroot jails with kernel structures that implement BSD-style rlimits (resource limits) on the guests. This means that the vservers are highly stable (no VMware security patches, bugs, or resource contention). BSD-style rlimits give us strict controls on RAM, CPU, and process/file-descriptor usage, so the physical servers can protect themselves from runaway/rogue guests.
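In util-vserver, those per-guest rlimits live as small files in the guest's config tree. A rough sketch (guest name, values, and exact file names are illustrative and vary by util-vserver version):

```text
/etc/vservers/vsX11/rlimits/
    nproc      512        # max processes in the guest
    nofile     4096       # max open file descriptors
    rss.hard   2097152    # hard RAM cap, in pages
```

The limits take effect when the guest is (re)started, and the host kernel enforces them regardless of what runs inside the container.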
6. Scalability of the clusters typically depends on (1) the number of MySQL nodes and (2) the total amount of centralized storage. We typically add new storage to the clusters twice a year. Adding CPUs is easy and can be done on short notice.
------------------------------------------------------------------
II. Internet connectivity overview
A. We have multiple gigabit and 10G internet uplinks and 2 different datacenters connected via commercial and private (dark) fiber.
B. We have a total of 6 edge BGP routers that provide routing services and basic DoS attack protection, 2 core routers, and 2 route reflectors for route table scaling. The BGP edges can run wirespeed ACLs if/when needed. We use svn (Subversion) to keep the configs (firewall tables, access control lists) in sync on each edge. Change control for the internet edges and core routers is limited to the manager and lead engineer of the network team. Look (read-only) access is enabled via the looking-glass software and SNMP monitoring for authorized admins such as the colo operations NOC staff, the director of web development, and the Richweb operations manager.
C. We have dual firewalls that protect each cluster.
The dual firewalls run the very secure OpenBSD firewall OS and are set up in a CARP master/backup failover pair. All firewall links are gigabit.
Change control for the firewalls is also limited to the manager and lead engineer of the network team.
The firewalls filter all traffic inbound to your server farm except for http, ftp and ssh (which is monitored for excessive connects; offending IPs are blacklisted). The firewalls have a crossover cable between them that syncs the packet state, so if one box dies or is demoted for maintenance the other box can take over without dropping any sessions.
The firewalls have dual power supplies, each connected to a unique UPS power source, to minimize single points of failure. (After disk failures, power supply failure is the most common source of issues.) Cluster A is scheduled to get new firewalls this fall; Cluster B already has dual-PSU firewall nodes.
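In OpenBSD terms, the shared-IP failover and the state-sync crossover link are CARP plus pfsync. A minimal sketch of the pairing (interface names, VHID, password, and addresses are illustrative, not the production config):

```conf
# /etc/hostname.carp0 -- shared IP, held by whichever box is CARP MASTER
inet 203.0.113.1 255.255.255.0 NONE vhid 1 carpdev em0 pass examplepw

# /etc/hostname.pfsync0 -- state sync over the dedicated crossover link
up syncdev em3

# /etc/pf.conf -- never filter the failover traffic itself
pass quick on em3 proto pfsync keep state (no-sync)
pass on em0 proto carp keep state (no-sync)
```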
The firewalls (and load balancers) redirect ssh traffic to the pool of backend nodes. If a customer has 4 backend nodes, she will be able to ssh into 4 different ports to access any node as desired. Typically the customers will access just the first node via ssh. If changes need to be made to the configuration, Richweb will ensure that the changes are replicated to each node either via our automated build tools or a manual package install as needed.
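The per-node ssh mapping works out to one public port per backend. In rinetd's one-line-per-forward format it might look like this (public IP, ports, and backend addresses are all illustrative):

```conf
# rinetd.conf sketch: bindaddress bindport connectaddress connectport
203.0.113.10 2201 10.0.100.11 22
203.0.113.10 2202 10.0.100.12 22
203.0.113.10 2203 10.0.100.13 22
203.0.113.10 2204 10.0.100.14 22
```

A customer with 4 backend nodes would ssh to port 2201 normally, and fall back to 2202-2204 if the first node is down.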
We use the relayd OpenBSD load balancer on the firewalls for ssh port forwarding and rinetd on the load balancer nodes for ssh port forwarding. Both are transparent to the end users/customers. If a customer has 6 backend nodes with us, the customer will typically ssh/scp into the first node to perform any maintenance tasks, but if that node is down any other node can be chosen if needed.
D. Private network / SSL Acceleration
Web Browser -> Firewall -> Load Balancer -> Private VLAN -> Backend Node
Since the backend servers can talk via jumbo frames on the private network, we have a routed (firewall) interface and a load balancer on a public IP that handles all incoming requests. Requests are then reverse-proxied over HTTP to the backend nodes. The load balancer also performs SSL acceleration duties, so all communications between the load balancer and backend nodes are plain HTTP.
This has the added benefit that SSL certs and keys are kept on the load balancer in a restricted-access location for maximum security. An intruder would not be able to compromise your SSL cert even with root access on the backend virtual server (container), as the SSL material never lives on or touches the backend. This does mean that your application will need to be told it is behind an SSL accelerator.
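The load-balancer side of SSL termination can be sketched in nginx as below; server names, cert paths, and backend addresses are illustrative, and the forwarded-protocol header is one common convention, not necessarily the exact one in production:

```nginx
# Sketch: terminate SSL on the LB, speak plain HTTP to the backends, and
# tell the application (via a header) that the original request was HTTPS.
upstream backend_pool {
    server 10.0.100.11:80;
    server 10.0.100.12:80;
}

server {
    listen 443 ssl;
    server_name yrdomain.com;
    ssl_certificate     /etc/nginx/ssl/yrdomain.com.crt;
    ssl_certificate_key /etc/nginx/ssl/yrdomain.com.key;

    location / {
        proxy_pass http://backend_pool;              # plain HTTP inside
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;  # app checks this
    }
}
```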
The SAN controller communicates with the backend nodes on a dedicated, private, non-routed network. MySQL traffic is on the same private vlan. The backend nodes are NAT-ed out on a single public ip but the firewall restricts where they can go and how many TCP connections can be open at one time. This helps prevent an infected vserver from blasting out spam or malware attacks to other victims.
Walled garden: Another benefit of private LANs. Some of our clustered customers are configured in a server walled garden. This means the servers have no internet access; they cannot leave their private subnet. We have FTP and OS update proxies (as well as local DNS listeners) so that the boxes can run OS patches, but all other access is blocked. This helps if your application is attacked a lot; it makes it harder for intruders to damage the site, steal data, or install backdoors.
Change control on the load balancers is limited to the manager and lead engineer of the network team, with development team lead(s) as a backup.
Typically the majority of the system changes made for customer move/add/change requests are on the load balancers. Firewall changes are normally not needed.
E. NFS / Disk setup implementation
The SAN node exports a dedicated filesystem to each client (each Richweb customer has a dedicated mount point). Each physical backend node mounts the client filesystems needed to run its client workload.
(All filesystems are mounted in /data or /home).
The physical servers then export (share) the SAN filesystems with their guests. Each guest will mount the client filesystem as /data/www typically, and the websites would be under /data/www/website.net for example. This limits each guest to only see the authorized data.
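The export/mount layering above can be sketched roughly as follows; hostnames, the private subnet, and all paths except the /data convention are illustrative:

```conf
# On the SAN node: one dedicated export per customer (/etc/exports)
/export/customerA  10.0.200.0/24(rw,sync,no_subtree_check)

# On a physical backend node: mount the customer filesystems it serves
# (/etc/fstab)
san1:/export/customerA  /data/customerA  nfs  rw,hard,intr  0 0

# The guest (vserver) then sees only its own slice, e.g. bind-mounted
# into the container as /data/www:
#   mount --bind /data/customerA/www /vservers/vsX11/data/www
```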
Guests are not allowed to put network adapters into a mode where they can snoop on raw packets received by the host controller, run DHCP, change IP addresses, or otherwise attempt to spoof another IP address.
F. Controller details
The job of the central load balancing node is to:
1. run on a public facing ip for management and ftp services and provide a facility for terminating ALL of the public ip addresses for http/https traffic. Backend websites do NOT have public ip addresses in any use case.
2. run nginx cdn accelerator / load balancer software.
3. run the pure-ftpd ftp server that allows file access to the san. pure-ftpd controls access to chroot ftp dirs via a MySQL database-driven config. This is handy for setting up subfolders on a site that only a contributing editor, photo blogger, etc. should be able to upload to.
If the user application requires apache read/write access to the full site (bad) we can map the user account so that it matches the user account the website runs as using mod_itk.
If the customer can use a more secure configuration (good) we map each customer to a unique unix account (uid).
FTP change controls are managed via Richweb's EMS portal. Managers (users with EMS manager privileges) are allowed to make FTP changes.
4. run a memcached server to handle caching of data objects and PHP sessions. memcache is faster than a filesystem-based cache for PHP session management. Each backend server needs a coherent view of the web sessions; 40K session files living on an NFS file store will perform poorly, which is why we use memcached on the load balancer node for PHP sessions.
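On the backend side, pointing PHP's session handler at the shared memcached instance is typically just two php.ini settings; the address below is illustrative, and this sketch assumes the classic PECL "memcache" extension:

```ini
; Sketch: store PHP sessions in the LB's memcached rather than on disk/NFS
session.save_handler = memcache
session.save_path    = "tcp://10.0.100.1:11211"
```

Every backend node carries the same two lines, so any node can serve any request and still see the same session data.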
Memcached configuration is not changed.
5. smtp concentrator services: The load balancer runs a postfix relay instance that aggregates incoming website emails sent from the local boxes into a single feed that is pushed to the outbound SMTP gateways.
Change control on the smtp engines is limited to the manager and lead engineer of the network team.
6. MySQL replication: We use MySQL replication on each cluster to keep a live backup copy of the MySQL instance. Failover is CURRENTLY not automatic. For db node failover to happen, the slave instance needs to be re-addressed with the master node IP and restarted. MySQL clustering has several problems that we have experienced: active/active requires a move to the NDB storage engine (no MyISAM, which is what most sports sites are comfortable with), and the reliability is not as good as with the Percona MySQL instance we are running now. We are also investigating MariaDB active/active as a possible alternative. This seems like it would work, though one major caveat is that transactions of any size (large numbers of queries inside a TX) don't scale.
NOTE: Most customers are running LAMP applications which rarely if ever use true DB level transactions of any kind. We do have customers running custom web apps and we advise and support DB transaction support wherever possible.
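The master/replica pairing described above boils down to a few my.cnf lines on each box; server IDs, log names, and the database name here are illustrative, and the real nodes surely carry more tuning:

```ini
# On the Master Node (my.cnf sketch)
[mysqld]
server-id    = 1
log-bin      = mysql-bin      # binary log the replica reads from
binlog-do-db = customer_db

# On the Replicated Node (my.cnf sketch)
[mysqld]
server-id    = 2
relay-log    = mysql-relay-bin
read_only    = 1              # refuse writes until manually promoted
```

Promotion during the manual failover event amounts to removing read_only, re-addressing the node with the master's IP, and restarting.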
Database change control is managed by customers, who have their private username/password to access the database via the SSL-protected phpMyAdmin management portal. Developers will make DB changes on behalf of customers in most cases.
7. CDN mechanism
The load balancer will take an incoming http request and route it to one of the backend nodes in the cluster (as long as that node is reporting itself alive), and the backend node will process the request and send the data back to the load balancer.
Many clients connect from slow internet circuits and can take many seconds to read a TCP data stream for a large file or object (web page). A slow-reading client ties up a TCP socket and a server thread/process the WHOLE time it is reading. Apache can only fork between 20 and 80 PHP processes before it starts to destabilize, depending on the PHP code it is running.
The load balancers easily handle 50,000+ connections in their current configuration (and can be scaled up if needed). This greatly increases the number of web browsers a site can handle at one time.
The load balancers are configured to route all static assets (by a regular-expression match against the URL string) to the CDN. The CDN serves only static images, css, html, zips, xml, and flash files, and it completely offloads these files from apache. This feature alone has given some of our clients a 10- to 50-fold increase in their website scalability.
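The static/dynamic split is commonly expressed in nginx as a regex location block; the extension list, paths, and upstream addresses below are illustrative, not the production rules:

```nginx
# Sketch: serve static assets directly (the CDN role), proxy the rest
# to the apache pool.
upstream apache_backends {
    server 10.0.100.11:80;
    server 10.0.100.12:80;
}

server {
    listen 80;
    server_name yrdomain.com;

    # regex match on the URL string routes static files away from apache
    location ~* \.(png|jpe?g|gif|css|js|html?|zip|xml|swf)$ {
        root /data/www/yrdomain.com;
        expires 7d;              # let browsers cache static assets
    }

    # everything else is a dynamic request for the apache/PHP backends
    location / {
        proxy_pass http://apache_backends;
    }
}
```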
Change control on the CDN is the same as for the load balancer; it is configured along with the load balancer as part of the nginx configuration.
FREQUENTLY ASKED QUESTIONS:
FAQ1. How does this architecture compare to "CLOUD" - i.e. datacenters running elastic services with vmware on shared public infrastructure?
During periods of low utilization a single server could do all of this work. But during periods of heavy load a single quad-core box just can't do all of the work. Richweb's cluster configuration can bring many more cores to bear to handle the bursty spikes by simply adding additional backend nodes running the same configuration. In other words, horizontal scaling is built into the solution from the ground up. This is not the case with a standalone VM in a typical "cloud" setup, where horizontal scaling is not integrated; the customer must allocate VMs and cluster them on his/her own.
Even in a cloud config that could add 20 virtual VMware CPUs to a single VM to run a site, the problem would be apache itself and memory consumption: apache2 can't scale to managing the amount of RAM you would need to run all of those processes. So scaling onto multiple VMs (horizontal scaling) is a must sooner or later.
Remember, vmware is not (by itself) a magic bullet for scaling websites like this. Instead we need multiple parallel nodes running the exact same configuration, all serving exact copies of the site(s).
That's exactly how we run our clusters - as a set of parallel backend apache nodes talking to a MySQL instance, with a load balancer and SAN to centralize network and storage requests. So the Richweb clusters are the same as a well-designed "cloud" infrastructure, with the exception that since we operate all of our own hardware we know exactly where the instances live, what they are doing, and why. This would be much harder (really, impossible) if we were leasing hundreds of random virtual instances from a cloud provider.
FAQ2: How do the multiple backend nodes stay in sync?
Manually editing the configs of the nodes to keep them in perfect sync is not feasible (too error-prone), so we have a set of tools in /opt/tools/ that generates the config files.
These tools build /etc/hosts (so each node thinks it IS the website being hosted, on a local private IP) and they build the cache, apache, php and cron configs.
The templates used to build these sites are in /opt/config.
When a node starts up it reads /etc/profile, learns its NODE_ID from the FW_ROLE variable, and proceeds to auto-build its configs and restart web services (apache).
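A toy sketch of that boot-time self-configuration is below. The FW_ROLE format, the IP scheme, and the hosts-line rendering are invented for illustration; the real /opt/tools build does far more (cache, apache, php, and cron configs):

```shell
# Toy sketch: derive NODE_ID from FW_ROLE and render a hosts entry.
FW_ROLE="${FW_ROLE:-web-node-3}"   # normally set in /etc/profile

# Derive the node id from the role name, e.g. "web-node-3" -> "3"
NODE_ID="${FW_ROLE##*-}"

# Each node maps the hosted site to its own local private IP, so the
# node thinks it IS the website (illustrative addressing scheme).
SITE="yrdomain.com"
PRIVATE_IP="10.0.100.$((NODE_ID + 10))"
HOSTS_LINE="$PRIVATE_IP $SITE www.$SITE"
echo "$HOSTS_LINE"
# A real node would append this to /etc/hosts, rebuild its service
# configs from the /opt/config templates, and restart apache.
```

Because the node derives everything from one variable, cloning an existing node and setting its role is enough to bring a new backend online.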
This means that spinning up a brand new node to expand your capacity is very quick and easy - it takes about 10 minutes to allocate more capacity. The steps are simple:
- allocate node id, ip address
- clone/sync a copy of the node os from an existing node
- start the new node vserver
- edit the sites.cfg config file and /etc/profile to match the new node id
- edit load balancers to add new node to the rotation
- document new node
- setup mail relay
The last step involves the postfix MTA on the load balancer controller for outbound mail delivery.
We tell postfix to do custom smtp header inspection:
header_checks = regexp:/etc/postfix/header_checks
And we use a generics table to re-write the FROM address for any emails that go out as root@vsXXX.richweb.com.
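The postfix side of that rewrite can be sketched as a generic map; the main.cf parameter name is standard postfix, while the sample addresses are illustrative:

```conf
# main.cf additions (sketch)
header_checks     = regexp:/etc/postfix/header_checks
smtp_generic_maps = hash:/etc/postfix/generic

# /etc/postfix/generic -- rewrite machine-local senders to a real domain
root@vsX11.richweb.com    www@yrdomain.com

# After editing the map:  postmap /etc/postfix/generic && postfix reload
```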
We need to make sure that the SPF record of yrdomain.com lists the smtp relays that the load balancer node transmits the email via. That is done via a simple SPF include in DNS.
FAQ3: But I want to use AMAZON SES for outbound email delivery. Will that work?
Yes, we deliver many thousands of emails per day via SES. You just need to sign up for an SMTP relay account, and we configure our relays to use the SMTP AUTH settings for the SES relay account you purchased.
KNOWN Caveats with Richweb cluster setup:
1. AUTO-FAILOVER for the SAN and MYSQL nodes is hard to do correctly. We are working on implementing a cluster for the SAN node using network based disk (block level) replication. Currently we rsync the SAN filesystems (file/directory level replication) to a warm standby node, but this is not as good as having working HOT/AUTO failover for the SAN and MYSQL nodes.
2. Inside the vservers, /data/www is the shared portion of the filesystem that is shared between ALL of your yrdomain.com nodes. So the instant vserver vsX11 writes a file to /data/www/__anything__ that same change will immediately be available to the other nodes (vsX12, vsX13, etc).
This means that 3rd party plugins and modules need to write to /data/www/tmp NOT /tmp/
/tmp/ is not shared. It's generally not a good idea to share /tmp/ between servers, which is why a symlink to "share" /tmp has not been done.
Plugins that write to memcache or mysql instead would perhaps be more ideal, though file i/o is easy and fast in many cases.
Note that ANY plugin or module that uses memcache needs to use a unique hash key (a site prefix such as __yrdomain__) to keep its keys unique.
3. Maintenance to the web files can be done from any node as the changes will be made on the san. Each node OS has to be manually patched and updated though. This also means that when the next OS is released, we can take a node offline, upgrade it, bring it back online, test the site on the new node, and revert easily if problems are seen.
4. Richweb has a Wordpress plugin to implement Site Lockdown, a feature that applies read-only locks to a website's PHP files and folders. This protects a vulnerable website from hacks until the website can be patched - very helpful if a 0-day vulnerability happens and your site cannot be patched immediately. Instead of getting hacked, we can lock all of your files read-only so the exploit cannot write to any files. Your database may still be open to attack, but this technique prevents 95% of all website damage. You should leave your site LOCKED at all times; the only time you ever UNLOCK it is when you need to update a theme or plugin.
5. The load balancers will detect a dead node and remove it from the rotation. In theory a disk condition could occur that renders a node slow but still functioning. We have nagios plugins watching the servers across many performance metrics, so we should be able to quickly spot a troubled node and remove it via power-down, cable removal, or terminating the OS via the RSA supervisor card. To date this has not been an issue, but as with any failover situation, it's possible.
6. The load balancer is a reverse http proxy, which means that by default the PHP scripts will think REMOTE_ADDR is the load balancer's inside private IP (since that is what connects to the backend apache node). This breaks vb4 security, since it thinks all the logins are from the same IP. We use a module called mod_rpaf to rewrite REMOTE_ADDR to the actual remote client. This module interacts differently with nginx and pound, so we need to remember to test what PHP thinks REMOTE_ADDR is after doing a site upgrade.
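The apache-side mod_rpaf arrangement can be sketched as below; the directive names are mod_rpaf's own, while the proxy IP and header name are assumptions that depend on the LB config:

```apacheconf
# Sketch: restore the real client IP behind the reverse proxy.
LoadModule rpaf_module modules/mod_rpaf.so

RPAFenable     On
RPAFproxy_ips  10.0.100.1        # the load balancer's private IP
RPAFheader     X-Forwarded-For   # header the LB uses for the client IP
```

With this in place, PHP's REMOTE_ADDR again reflects the browser's address, and per-IP logic such as vb4 login tracking behaves normally.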
7. Installing new PHP modules via apt-get needs to be done in concert on each backend. When making changes to the site cfg file, svn commit the file, update the other nodes, and rebuild the configs. Otherwise the nodes can get out of sync with each other.
8. When debugging a problem with the live site, remember that the actual backend node you get routed to may be different from another user's. pound has tools to pin a client IP to one node, or we can disable all but one node while testing. This is useful when you make a tweak to a live server and want to see immediately whether it fixed an issue, without having to repeat the change on the other backends first.
9. What happens if the mysql instance gets too busy? Our db servers are currently 16 and 8 core machines. Most of our customers that have database issues with performance fall into one of these categories:
- serious nested query (queries inside loops) issues
- poor schema (inefficient query, no indexing) issues
- wordpress meta table usage (selecting the whole table into PHP and looping over records to find what you want, instead of asking MySQL directly and getting only the records needed). This becomes a huge problem once the meta table has more than 1,000 rows.
We try to point out what the customer is doing wrong. For example, selecting 5 million records from the database as part of a query to create a web page may take 15 seconds. That means that web page will take AT LEAST 15 seconds to render. And if that query locks tables, or forces other requests' reads of those tables to wait, then this single query can bring the website down at busy times with just 1 user clicking on the page in question. Large forum sites in particular have this problem with search tools.
10. What happens if the load balancer crashes? The sites would go down until the lb is restored. The lb typically uses less cpu (but more disk and certainly more network i/o) than the db and apache nodes. Adding another load balancer is an easy enough fix, of course.
Since all of the public IPs go onto the load balancer, it's a single place to add IPs for each site. It's our policy to use one (1) IP for each website. Sharing multiple websites on a single IP is possible, but moving a website later is much harder if the site does not have a dedicated IP address. Plus, for customers that want to use SSL certs, the sites should have a unique IP unless they use a multi-domain (like a family of domain names) cert.
11. Is the nfs san server a bottleneck? It can be if not configured correctly. We use 2.5in 15K SAS drives that are very fast. We like to move clients with huge photo blogs or image albums to the nas, which has larger, slower disks. That way the san is only used for running the dynamic pieces of the site.
Some of our sports sites that have large legacy media libraries will be set up to run via nginx on one of the NAS boxes, as archive.yrdomain.com for example. This makes backups of the production site much easier to manage and keeps production traffic faster, as the seldom-used (but still useful) archive data can be served off its own website.
We also have tuned the san environment to minimize caching so that reads and writes from different backends are consistent at all times. If you have code that writes very small files very quickly to the shared storage and then tries to delete or rename them and check for their existence, then that code may not work optimally (it should be using a database table or memcache instance for its locking/semaphores anyway, not the filesystem).