Update April 2011: This post and part 2 are now irrelevant for Sitecore 6.x. See this post for details of important changes since this post was first published.
This has been a bugbear for some time now. I’ve had firsthand experience of all the issues I’m going to describe, and it was not an easy journey to figure out what was actually happening. I’m writing this post for two reasons: to put a searchable, indexed record of the problem online, because Sitecore’s own forums (where this issue has appeared many times) do not appear to be indexed by any search engine, and to help other developers who run into the same problem.
For the record, in case anyone reads this post that way: I’m not anti-Sitecore. I like the product (otherwise I wouldn’t bother doing this) and I like the level of support too (they must have some patience!). I just think the staging module is pretty flawed in some areas. Finally, some of these issues are already known, but don’t appear in any list of ‘known issues’. This is immensely frustrating!
The Sitecore staging module is an ‘official’ (i.e. supported) module that allows you to separate your live, public-facing servers from the CMS server on which editors directly create content. From now on I will refer to the internal CMS server as ‘master’ and all public-facing web servers as ‘slave’; this is the same naming convention used in the staging module documentation. A typical configuration will consist of one master server located locally within your office, and one or more slave servers located externally with a hosting company.
There are three issues that occur when using the staging module that I intend to cover eventually:
- Cache clearing
- Extranet users are logged-out after each publishing operation.
- Publishing delay
Your live website slows to a halt at regular intervals, depending on how often you publish. CPU utilization of the w3wp.exe process will often suddenly ramp up to 100% for some time (probably somewhere between 5 and 90 seconds), and the SQL server process will be doing the same or worse. After this time, your site returns to normal. Page requests to your site may hang for the entire duration; visitors who don’t close the browser will eventually get a response. This happens after each publishing operation, once the (scheduled) staging task’s publishing delay has elapsed.
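If you want numbers rather than a gut feeling, a rough probe like the one below can log how long a page takes to respond while you publish. It is only a sketch: the function names, the 5 second threshold and the polling interval are my own assumptions, not anything from Sitecore, and you would point it at a page on your own slave server.

```python
import time
import urllib.request

def time_request(url, timeout=120.0):
    """Return the seconds taken to fetch url, or None on a network-level failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except OSError:
        # Covers URLError, HTTPError and socket timeouts.
        return None
    return time.monotonic() - start

def slow_samples(samples, threshold=5.0):
    """Return the (timestamp, seconds) samples that failed or exceeded threshold."""
    # threshold is an assumed cut-off; tune it to what is normal for your site.
    return [(ts, secs) for ts, secs in samples if secs is None or secs > threshold]

def probe(url, interval=5.0, rounds=12):
    """Request url `rounds` times, `interval` seconds apart, and return the samples."""
    samples = []
    for _ in range(rounds):
        samples.append((time.strftime("%H:%M:%S"), time_request(url)))
        time.sleep(interval)
    return samples
```

Start `probe(...)` just before you publish, then pass the result to `slow_samples(...)` and compare the timestamps with the moment the staging task runs.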
After publishing content in a staging environment, the master server calls a webservice on each slave server to clear caches. The recommended setting for this is “Full” cache clearing rather than “partial”. You can verify if staging is the cause of all this by performing the following steps:
- On the master server, check the folder /sitecore modules/staging/workdir/cache for the presence of any .xml files. If you haven’t published recently, this should be empty. (If there are loads of .xml files you probably have another problem).
- Publish something from the master server to a slave server.
- Immediately check the /workdir/cache folder again (it should now contain one .xml file).
- Wait for the .xml file to disappear (it’s now being processed).
As soon as the .xml file disappears, the master server should hit the webservice on the slave server to clear caches. If, at this point, your site slows down and the CPU and SQL load jumps unacceptably, then this is very likely to be a problem caused by staging and the aggressive cache-clearing invoked by the ‘Full’ setting.
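Watching the folder by hand gets tedious, so the steps above can be roughly automated with a small polling script run on the master server. This is a minimal sketch under my own assumptions: the function names are made up for illustration, it knows nothing about Sitecore itself, and it simply reports when .xml files appear in or vanish from whatever folder you give it.

```python
import time
from pathlib import Path

def list_pending(cache_dir):
    """Return the sorted names of the .xml files currently in cache_dir."""
    return sorted(p.name for p in Path(cache_dir).glob("*.xml"))

def diff_snapshots(before, after):
    """Return (appeared, disappeared) file names between two snapshots."""
    before_set, after_set = set(before), set(after)
    return sorted(after_set - before_set), sorted(before_set - after_set)

def watch(cache_dir, interval=1.0, rounds=None):
    """Poll cache_dir and print a timestamped line when .xml files come or go.

    rounds=None polls forever; a number stops after that many polls.
    """
    previous = list_pending(cache_dir)
    count = 0
    while rounds is None or count < rounds:
        time.sleep(interval)
        current = list_pending(cache_dir)
        appeared, disappeared = diff_snapshots(previous, current)
        stamp = time.strftime("%H:%M:%S")
        for name in appeared:
            print(f"{stamp} queued: {name}")
        for name in disappeared:
            # Per the steps above, the cache-clear call to the slave should follow.
            print(f"{stamp} gone: {name}")
        previous = current
        count += 1
```

Call `watch(...)` with the full path to your own /sitecore modules/staging/workdir/cache folder, publish something, and note the time of the "gone" line. If the slowdown on the slave starts at that moment, staging is the likely culprit.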
The documentation does state that the ‘full’ method is slow, but I think it underplays how much impact this can have. This setting is configured on the master server and is used in the webservice call to the slave after a publishing operation to clear the caches on the slave. The main difference is that ‘partial’ clears the front-end HTML cache (i.e. your rendered outputs) and the Data cache. Using this mode, your site will probably hardly blink unless you have very intensive renderings. The ‘full’ mode, however, does what it says in the documentation and clears all caches, and Sitecore has a lot of them. You can see a list of these at /sitecore/admin/cache.aspx. It also performs some additional clearing operations and then does all of this again for the shell website (the Sitecore shell at /sitecore is just another website).
The overall effect is that web server and SQL server load can spike dramatically. In my particular case, the resulting server load and delay are comparable to what occurs after recycling the application pool. In other words, every time I publish, the site feels like it’s rebooting, and that seems to be pretty much what is happening. The ‘rebooting’ analogy also applies to point #2 about extranet users (post to follow). If you publish frequently, have a large number of items, or are under heavy load, this makes things much worse.
The simple solution is to try the “partial” caching mode instead. Sitecore say that in some scenarios items will not be published correctly, and therefore only the “full” setting is recommended. This is the crux of the problem. You’re damned if you do, and damned if you don’t. I really think that Sitecore needs to come up with a better way to handle staging, either by fixing this module or by taking a different approach altogether. So far we’re stuck between a method that isn’t recommended and doesn’t always work, and a method that is recommended but can easily cripple your performance whenever you publish.
More to follow in part two.