Implementation Considerations for Web Farms

Web farm environments differ from single-server web site installations in some important ways that require consideration. For our purposes we define a web farm as an installation where more than one machine may be handling web requests at the same time. By that definition, a failover cluster where only one node is active at a time is not a web farm, and that kind of environment is already supported by mojoPortal. As soon as more than one machine handles web requests at the same time, web farm considerations must be addressed. This article highlights issues, both general to ASP.NET and specific to how mojoPortal is implemented, and how they relate to web farm architecture and environments.

While quite a few community members have been running mojoPortal in various web farm configurations, it has not historically been officially supported, and to date we have never claimed that mojoPortal supports web farms. The purpose of this article is to record our own notes and ideas about the issues that need to be tackled. Once we identify and implement solutions for these issues, we will be able to officially support web farms.

File System Access

In a web farm environment you would have a copy of the mojoPortal application files on each node of the farm. For other files created on disk while the application is running, you need to consider whether and how those files should be synchronized to the other nodes.

For small web farms, one possible solution is a shared network drive for the writable portion of the mojoPortal file system, with each node configured with a virtual directory that maps to the network drive. Ideally a RAID environment would be used so the disk drive is not a single point of failure. This solution requires some special shared configuration. Other solutions include File Replication Service and Distributed File Systems, or storing and serving files from the database or from Windows Azure Blob storage. These last two choices would require architectural changes in mojoPortal to support plugging in different file system providers that abstract the storage behind a common API.
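To make the idea of a pluggable file system provider more concrete, here is a minimal sketch in Python rather than in mojoPortal's own code. All names in it (FileStore, DiskFileStore, InMemoryFileStore, the uploads/logo.png path) are hypothetical and invented for illustration; this is not an existing mojoPortal API. The in-memory class only stands in for a database or blob storage provider.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class FileStore(ABC):
    """Hypothetical common API that features would call instead of touching disk directly."""

    @abstractmethod
    def save(self, virtual_path: str, data: bytes) -> None: ...

    @abstractmethod
    def load(self, virtual_path: str) -> bytes: ...

    @abstractmethod
    def exists(self, virtual_path: str) -> bool: ...


class DiskFileStore(FileStore):
    """Stores files under a root folder, for example a shared network drive."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _full_path(self, virtual_path: str) -> str:
        # Map a site-relative path such as "uploads/logo.png" under the root folder.
        return os.path.join(self.root, *virtual_path.strip("/").split("/"))

    def save(self, virtual_path: str, data: bytes) -> None:
        full = self._full_path(virtual_path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as f:
            f.write(data)

    def load(self, virtual_path: str) -> bytes:
        with open(self._full_path(virtual_path), "rb") as f:
            return f.read()

    def exists(self, virtual_path: str) -> bool:
        return os.path.isfile(self._full_path(virtual_path))


class InMemoryFileStore(FileStore):
    """Stand-in for a database or blob storage provider (assumption, not a real backend)."""

    def __init__(self) -> None:
        self._blobs = {}

    def save(self, virtual_path: str, data: bytes) -> None:
        self._blobs[virtual_path] = data

    def load(self, virtual_path: str) -> bytes:
        return self._blobs[virtual_path]

    def exists(self, virtual_path: str) -> bool:
        return virtual_path in self._blobs


# Two "nodes" pointed at the same root behave like nodes sharing a network drive.
shared_root = tempfile.mkdtemp()
node_a = DiskFileStore(shared_root)
node_b = DiskFileStore(shared_root)
node_a.save("uploads/logo.png", b"image bytes")
```

With this shape, a file saved through node_a is visible through node_b because both resolve to the same storage, and swapping DiskFileStore for another provider would not change the calling code.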

In mojoPortal we need to consider how to handle files uploaded by users, such as images used in content, images in the Image Gallery feature, files in the Shared Files feature, and product files in the WebStore feature. If a user uploads an image and uses it in an article on one node, we have to consider what happens when another user requests that article from a different node where the image does not exist on disk.

Historically we have always logged errors and other informational items to a file on disk. In a web farm one of two things can happen. If the nodes share the same file on a shared network drive, multiple nodes contending to update that file can be problematic; the log4net documentation says the FileAppender does not support multiple processes. If each node has its own copy, we never see a holistic view of the log unless we harvest the files from each node and aggregate them somehow. The log viewer in mojoPortal would show only the log of the current node.

This problem is now solved in mojoPortal: we introduced a custom log appender that can be configured to log to the database, so all nodes can log to the same place.
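The general idea of logging to a database instead of a per-node file can be sketched as follows. This is an illustration in Python using the standard logging and sqlite3 modules, not the actual mojoPortal log4net appender; the class name DatabaseLogHandler, the system_log table, and its columns are assumptions. In a real farm each node would open its own connection to a shared database server; here a single in-memory connection stands in for that server.

```python
import logging
import sqlite3


class DatabaseLogHandler(logging.Handler):
    """Hypothetical handler that writes each log record as a row, tagged with the node name."""

    def __init__(self, connection: sqlite3.Connection, node_name: str) -> None:
        super().__init__()
        self.connection = connection
        self.node_name = node_name
        connection.execute(
            "CREATE TABLE IF NOT EXISTS system_log "
            "(created REAL, node TEXT, level TEXT, message TEXT)"
        )

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self.connection.execute(
                "INSERT INTO system_log VALUES (?, ?, ?, ?)",
                (record.created, self.node_name, record.levelname, record.getMessage()),
            )
            self.connection.commit()
        except Exception:
            # Follow the logging module's convention for handler failures.
            self.handleError(record)


def make_node_logger(connection: sqlite3.Connection, node_name: str) -> logging.Logger:
    """Build a logger for one node that writes only to the shared table."""
    logger = logging.getLogger("farm." + node_name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(DatabaseLogHandler(connection, node_name))
    return logger


shared_db = sqlite3.connect(":memory:")  # stands in for the shared database
make_node_logger(shared_db, "node-a").error("Upload failed")
make_node_logger(shared_db, "node-b").info("Site settings updated")

# A log viewer on any node can now read entries from every node.
rows = shared_db.execute(
    "SELECT node, level, message FROM system_log ORDER BY rowid"
).fetchall()
```

Tagging each row with the node name keeps the holistic view while still making it possible to filter the log by machine.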

Cache Dependency Files

Cache dependency files are used as a way to invalidate a cached item so that it is removed from the cache. The mechanism is basically a file system watcher: if you modify a file on disk, you can clear an item from the cache. For example, the SiteMap is a tree-shaped structure of nodes representing all the pages in the site in their hierarchical positions. This is what we bind the Menu and TreeView to in order to create the main menu of the site. When a new page is created, when a page is deleted, or when the view permissions of a page change, we always need to clear the site map from the cache.

Unfortunately this caching is done by the ASP.NET runtime, so we don't have control over it, and we don't even know the cache key, so we can't easily remove it from the cache ourselves. What we do have is a way to cache another object at the time the site map is first retrieved by the runtime (i.e. just before it gets cached). We can set up a callback method and a cache dependency for this object, so that when the object is removed from the cache it fires a callback event that clears the site map.

The problem is that the callback only runs on the current web farm node, so it does not clear the cache on the other machines. You need to look into using a distributed cache or some other strategy to clear the cache on the other machines. In a small web farm where a common disk drive is used by the nodes, they can all be cleared by the same dependency file, but a shared/distributed cache is needed for a more scalable web farm that could have many nodes.
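The shared dependency file approach for a small farm can be sketched like this. It is a simplified Python illustration, not ASP.NET code: ASP.NET's CacheDependency reacts to file change notifications, whereas this sketch (an assumption made to keep it short) simply compares the file's modification time on each read. The names FileDependentCache, touch, and sitemap.config are hypothetical.

```python
import os
import tempfile


class FileDependentCache:
    """One instance per node: an in-process cache whose entries are tied to a file."""

    def __init__(self) -> None:
        # key -> (value, dependency path, modification time seen when cached)
        self._items = {}

    def set(self, key, value, dependency_path: str) -> None:
        stamp = os.stat(dependency_path).st_mtime_ns
        self._items[key] = (value, dependency_path, stamp)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, dependency_path, stamp = entry
        if os.stat(dependency_path).st_mtime_ns != stamp:
            # The dependency file changed since we cached the item: drop it.
            del self._items[key]
            return None
        return value


def touch(path: str) -> None:
    """Move the file's modification time forward to signal invalidation."""
    st = os.stat(path)
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 2_000_000_000))


# The dependency file lives on storage that every node can see.
dependency_file = os.path.join(tempfile.mkdtemp(), "sitemap.config")
open(dependency_file, "w").close()

node_a = FileDependentCache()
node_b = FileDependentCache()
node_a.set("sitemap", "tree built on node A", dependency_file)
node_b.set("sitemap", "tree built on node B", dependency_file)

# A page is created on some node, which touches the shared file once.
touch(dependency_file)
```

After the single touch, both node_a.get("sitemap") and node_b.get("sitemap") return None, so each node rebuilds its own copy. This works only as long as every node can see the same file, which is why it suits small farms on a shared drive.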

Caching and Clearing the Cache Across Machines

The built-in caching techniques of ASP.NET are in-process, so clearing the cache only affects the current process on the current machine. In a web farm environment you usually need an out-of-process third-party caching solution such as AppFabric Caching or Memcached so that the cache can be shared across machines and processes. Even more importantly, a shared cache allows a cached item to be removed on all machines when the underlying object is modified. For example, when site settings are updated we need to make sure the cached copy is refreshed and up to date on all machines. Currently we do have support for AppFabric cache.

In a small web cluster that uses a shared network drive accessed by all the web nodes, you can enable cache dependency files by adding this to user.config:

<add key="UseCacheDependencyFiles" value="true" />

The Lucene.NET search engine writes its index files to disk and is specifically known not to support multiple processes, so it can be problematic if a common file system is shared across nodes and multiple processes try to update or access those files. One solution would be to figure out a way for each node to maintain its own separate copy of the search index. Another option is to disable the search index and use the option for Bing or Google site search. The downside is that they can only index the public-facing content, and the index won't be up to date until they start crawling your site consistently, which may take a while for a new site.

Multiple Processes vs Multiple Threads

In a standard non-web-farm installation, we already have to consider that a web site or application is multi-threaded. There is a thread pool, which by default typically has 100 threads. A single thread handles one request at a time, then goes back to the thread pool and is available to handle another request. Because multiple threads can execute at the same time, web applications must be implemented in a thread-safe way. In this scenario, though, all the threads typically run in a single process and a single AppDomain. In a web farm there are multiple processes and multiple AppDomains running on different machines, and each process has its own thread pool. When multiple processes may try to update the same resource at once, it is more difficult to do so safely, and we can't rely on the typical locking approaches used for thread safety within a single process.

In mojoPortal, one area where I think we will need some changes to better support web farms is the processing of the TaskQueue. Our implementation already works reasonably well in a multi-threaded environment, so in theory there may not be any problem, but in practice I think we would not want multiple nodes all trying to process the queue, so we may need to look for potential problems there. For example, we would not want multiple web nodes processing the task that sends the newsletter. Task progress is already updated in the database in order to keep multiple copies of a task from running even on a single web server. So theoretically it may not be a problem even with multiple nodes: only one node would process the task, and the others would not because they would see that the task status has been updated within the configured time threshold. However, more testing is needed to verify this, and ideally we may want to have an option where none of the web nodes processes the task queue and a Windows Service does that instead.
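The status threshold idea described above can be sketched as an atomic claim in the database. This is a Python and SQLite illustration under stated assumptions, not the actual mojoPortal TaskQueue schema or code: the tasks table, its columns, the try_claim function, and the 120 second threshold are all hypothetical.

```python
import sqlite3

STALE_AFTER_SECONDS = 120  # hypothetical threshold; the real value would be configurable


def create_schema(db: sqlite3.Connection) -> None:
    db.execute(
        "CREATE TABLE tasks ("
        "id INTEGER PRIMARY KEY, name TEXT, owner TEXT, last_status_update REAL)"
    )


def try_claim(db: sqlite3.Connection, task_id: int, node: str, now: float) -> bool:
    """Claim the task, or refresh our own claim, in a single UPDATE statement.

    The WHERE clause only matches when the task is unowned, already ours,
    or its last status update is older than the threshold, so two nodes
    racing on the same row cannot both succeed.
    """
    cursor = db.execute(
        "UPDATE tasks SET owner = ?, last_status_update = ? "
        "WHERE id = ? AND (owner IS NULL OR owner = ? OR last_status_update < ?)",
        (node, now, task_id, node, now - STALE_AFTER_SECONDS),
    )
    db.commit()
    return cursor.rowcount == 1


db = sqlite3.connect(":memory:")  # stands in for the shared database
create_schema(db)
db.execute("INSERT INTO tasks (id, name) VALUES (1, 'send newsletter')")

first = try_claim(db, 1, "node-a", now=1000.0)     # node A gets the task
second = try_claim(db, 1, "node-b", now=1010.0)    # node B sees a fresh status and backs off
takeover = try_claim(db, 1, "node-b", now=1200.0)  # node A went quiet, so node B takes over
```

Putting the check and the update in one statement is the important design choice: a separate read followed by a write would leave a window in which two nodes could both decide the task is free.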


Created 2011-08-15 by Joe Audette