Update April 2011: This post and part 2 are now irrelevant for Sitecore 6.x. See this post for details of important changes since this post was first published.
This has been a bugbear for some time now. I’ve had firsthand experience of all the issues I’m going to describe, and it was not an easy journey to figure out what was actually happening. The reason for this post is twofold: to leave a searchable, indexed record of the problem online (Sitecore’s own forums, where this issue has appeared many times, don’t appear to be indexed by any search engine), and to help other developers who run into it.
For the record, and in case anyone reads this post that way: I’m not anti-Sitecore. I like the product (otherwise I wouldn’t bother doing this), and I like the level of support too (they must have some patience!); I just think the staging module is pretty flawed in some areas. Finally, some of these issues are already known, but don’t appear in any list of ‘known issues’. This is immensely frustrating!
The Sitecore staging module is an ‘official’ (i.e. supported) module that allows you to separate your live, public-facing servers from the CMS server that editors directly create content on. From now on I will refer to the internal CMS server as ‘master’ and all public-facing web servers as ‘slave’ – this is the same naming convention used in the staging module documentation. A typical configuration will consist of one master server located locally within your office, and one or more slave servers located externally with a hosting company.
There are three issues that occur when using the staging module that I intend to highlight over this series:
- Cache clearing
- Extranet users are logged out after each publishing operation.
- Publishing delay
Cache clearing
The symptom:
Your live website grinds to a halt at regular intervals, depending on how often you publish. CPU utilization of the w3wp.exe process will suddenly ramp up to 100% for some time (probably somewhere within 5-90 seconds), and the SQL Server process will do the same, or worse. After this, your site returns to normal. Page requests to your site may hang for the entire duration and, provided visitors don’t close the browser, will eventually get a response. This happens after each publishing operation, plus the delay before the scheduled staging task runs.
The problem:
After publishing content in a staging environment, the master server calls a webservice on each slave server to clear caches. The recommended setting for this is “Full” cache clearing rather than “partial”. You can verify whether staging is the cause of all this by performing the following steps:
- On the master server, check the folder /sitecore modules/staging/workdir/cache for the presence of any .xml files. If you haven’t published recently, this should be empty. (If there are loads of .xml files you probably have another problem).
- Publish something from the master server to a slave server
- Immediately check the /workdir/cache folder again (it should now contain one .xml file)
- Wait for the .xml file to disappear (it’s now being processed)
As soon as the .xml file disappears, the master server should hit the webservice on the slave server to clear caches. If, at this point, your site slows down and the CPU and SQL load jump unacceptably, then this is very likely a problem caused by staging and the aggressive cache clearing invoked by the ‘Full’ setting.
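Rather than refreshing the workdir folder by hand while you publish, a small script can watch it for you and log the moment a publish job appears and the moment it is processed (which is when the cache-clearing call to the slave should follow). This is a minimal sketch; the workdir path shown in the usage comment is an assumption you should adjust for your own install.

```python
import glob
import os
import time


def diff_jobs(previous, current):
    """Compare two snapshots of the workdir contents.
    Returns (newly queued job files, just-processed job files)."""
    return sorted(current - previous), sorted(previous - current)


def watch(workdir, interval=1.0, duration=120):
    """Poll workdir for .xml publish jobs and print when they appear/disappear."""
    seen = set()
    deadline = time.time() + duration
    while time.time() < deadline:
        current = {os.path.basename(p)
                   for p in glob.glob(os.path.join(workdir, "*.xml"))}
        queued, processed = diff_jobs(seen, current)
        for name in queued:
            print("job queued:", name)
        for name in processed:
            # At this point the master should call the slave's cache-clear webservice.
            print("job processed:", name)
        seen = current
        time.sleep(interval)

# Usage (path is an assumption -- point it at your master server's workdir):
# watch(r"C:\inetpub\wwwroot\sitecore modules\staging\workdir\cache")
```

Run it on the master server while performing step 2 above; a healthy setup prints “job queued” almost immediately after publishing and “job processed” shortly afterwards.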
The cause:
The documentation does state that the ‘full’ method is slow, but I think it underplays just how much impact this can have. The setting is configured on the master server and is used in the webservice call made to the slave after a publishing operation to clear the slave’s caches.

The main difference is that ‘partial’ clears only the front-end HTML cache (i.e. your rendered output) and the data cache. In this mode your site will probably hardly blink unless you have very intensive renderings. The ‘full’ mode, however, does what it says in the documentation and clears all caches, and Sitecore has a lot of them; you can see the list in /sitecore/admin/cache.aspx. It also performs some additional clearing operations and then does all of this again for the shell website (the Sitecore shell at /sitecore is just another website).

The overall effect is that web server and SQL Server load can spike in a big way. In my particular case, the resulting server load and delay are comparable to what happens after recycling the application pool; in other words, every time I publish, the site feels like it’s rebooting, and that seems to be pretty much what is happening. The ‘rebooting’ analogy also lends itself to point #2 about extranet users (post to follow). If you publish frequently, have a large number of items, or run under heavy load, all of this gets much worse.
The solution:
The simple solution is to try the “partial” caching mode instead. Sitecore say that in some scenarios items will not be published correctly, and therefore only the “full” setting is recommended. This is the crux of the problem: you’re damned if you do, and damned if you don’t. I really think Sitecore needs a better way to handle staging, either by fixing this module or by taking a different approach altogether. For now we’re stuck between a method that isn’t recommended and doesn’t always work, and a method that is recommended but can easily cripple your performance whenever you publish.
More to follow…. in part two.
Dude,
I haven’t yet worked with the Staging Module but I’m sure that if ever I get to work with it someday, I’ll get in touch with you. It’s quite clear that you must have really been frustrated whilst working with this module. Look at the bright side… No pain, No gain !!!
Hi Paul,
Thanks for your thoughtful analysis of the Sitecore Staging Module in this and in Part II. If you are using Sitecore 6, you might consider trying the Sitecore Stager, which can be found in Sitecore’s Shared Source Library. The Stager provides item-level cache clearing and, as a result, does not trigger the full cache clearing that you describe in your article.
Two more notes: 1) The Sitecore Stager does not replace the FTP/SOAP file transfer functionality that is part of the Staging Module. 2) The Sitecore Stager does clear the HTML cache, but not the item cache. So, if you use XSLT, the transformations will be processed again; but the database won’t be hit as with the standard Staging Module.
Best wishes,
Derek
Pingback: Publishing strategies « Molten Core
Hi Paul George,
Really nice blog!
We are also facing the same kind of issue; sometimes our front-end servers show a “504 – Gateway Timeout” error. But we have configured things a little differently, as below:
1. Staging Module – for uploading and downloading sublayout files.
2. Stager Module – for clearing the cache partially. http://trac.sitecore.net/SitecoreStager
Now our cache clearing is managed entirely by the SitecoreStager module. But sometimes our published changes don’t get reflected on the front-end servers, which forces us to run iisreset, something we would like to avoid.
Can you please help us out? How should this be managed?
Thanks a lot,
Kiran Patil
Hi Kiran,
To simplify your implementation, you may consider the latest version of the staging module. It incorporates the partial cache clearing and the movement of files.
Since Sitecore Stager is Shared Source (and hence unsupported), getting onto the latest version of the Staging Module may be the way to go.
Best wishes,
Derek
Thanks for this – you’re not wrong – we have seen the staging module’s full cache clearance cripple a site.
However Sitecore have now assured me that the latest version of staging module works correctly with partial cache clearance. The updated documentation is here:
http://sdn.sitecore.net/upload/sdn5/modules/staging/staging_module_installation_and_configuration_guide_sc6_a4.pdf
I’ve not tried it yet but will be trying it out soon, I’ll keep you informed.
Pingback: Publishing and Cache Clearing Basics « Sitecore basics!
Nice blog. You say that “If there are loads of .xml files you probably have another problem” – if this is the case, what is the problem?
Note: this is from memory – if there are loads of .xml files (each representing a job to be processed), it probably means there is no scheduled task reading and deleting them. Or perhaps they are being generated (by publishing actions) faster than they are being processed (by the scheduled task), i.e. the task interval is too long, or the publishing interval too short…
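One way to tell a stalled scheduled task apart from normal churn is to check whether any job files have been sitting in the workdir longer than the task interval. A hedged sketch (the path in the usage comment and the age threshold are assumptions to tune for your setup):

```python
import glob
import os
import time


def stale_jobs(workdir, max_age_seconds=300):
    """Return publish-job .xml files older than max_age_seconds.
    Any hits suggest the scheduled task isn't draining the queue;
    a steadily growing count of *fresh* files instead suggests
    publishes are outpacing the task interval."""
    now = time.time()
    return sorted(
        p for p in glob.glob(os.path.join(workdir, "*.xml"))
        if now - os.path.getmtime(p) > max_age_seconds
    )

# Usage (path is an assumption -- point it at your master server's workdir):
# print(stale_jobs(r"C:\inetpub\wwwroot\sitecore modules\staging\workdir\cache"))
```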
Much better: just upgrade to a point where you can use the in-built scaling solution.