Clustering your file server provides fantastic failover features, but it does not prevent the shared resources from becoming corrupt. It can happen that the NTFS file system of a shared network resource becomes corrupted. When the file share sits on a single file server, we would simply run Checkdisk to solve the problem. But what do we do when the share exists on a file cluster? The answer is easy: Checkdisk will still do the trick, but you need to follow a specific procedure to complete the operation.
A typical sign of disk corruption is event 1066 from the cluster service in the System log. The event is quite clear and even tells you to run ChkDsk /F from the command prompt against the shared resource. What it doesn't tell you is that you need to put the shared resource in maintenance mode first.
A typical symptom is that failover fails or takes a very long time (it can be several hours). If you open Task Manager on the node you are failing over to, you'll see a Checkdisk process running that was started by the cluster service (remember to tick "Show processes from all users"). Once this process is ended, the cluster fails over immediately.
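The same check can be made from the command line by looking through the output of the tasklist command on the target node. The snippet below is only a rough sketch: it assumes CSV output such as tasklist /FO CSV /NH produces (image name first, PID second), and the function name chkdsk_pids is made up for illustration.

```python
import csv
import io


def chkdsk_pids(tasklist_csv):
    """Return the PIDs of chkdsk.exe processes found in tasklist CSV output.

    Hypothetical helper: it parses text only and does not call tasklist itself.
    """
    pids = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Assumed columns: image name, PID, session name, session number, memory.
        if len(row) >= 2 and row[0].lower() == "chkdsk.exe":
            pids.append(int(row[1]))
    return pids
```

An empty list means no Checkdisk process was found in the text you passed in; a non-empty list suggests the cluster service is still checking the disk on that node.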
Prior to running Checkdisk on the shared resource, you need to place that resource in maintenance mode.
So how do you do it then?
- Stop the Cluster service on the passive node.
- Bring the cluster resource group offline on the active node.
- Bring the shared resource which has been reported corrupt back online.
- Open the command prompt and type: Cluster "servername" res "Diskresource" /maint:on. Example: Cluster Clusterserver1 res "Disk H:" /maint:on
- Now you can run Checkdisk on the shared cluster resource. Example: Chkdsk H: /F /R
- When Checkdisk completes, you'll need to put the shared cluster resource back in normal mode. At the command prompt, type: Cluster "servername" res "Diskresource" /maint:off. Example: Cluster Clusterserver1 res "Disk H:" /maint:off
- You can check the state of the resource by running Cluster "servername" res "Diskresource". Example: Cluster Clusterserver1 res "Disk H:"
- Bring the cluster resource group back online.
- Restart the cluster service on the passive node.
If everything went well, the cluster resources will now fail over smoothly, as expected.
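The command-prompt part of the procedure can be collected into a small helper so the resource name is typed only once. This is a minimal sketch, not a tested procedure: it only builds the command lines in the order given above, using the Cluster and Chkdsk syntax from the examples, and runs nothing. The function name maintenance_commands and its parameters are hypothetical, and the steps for stopping the cluster service and taking the group offline and online are deliberately left out.

```python
def maintenance_commands(cluster_name, disk_resource, drive_letter):
    """Return the command lines for a Checkdisk run in maintenance mode, in order.

    Hypothetical helper: it formats strings only and executes nothing.
    """
    # Accept "h", "H" or "h:" and normalise to "H:".
    drive = drive_letter.rstrip(":").upper() + ":"
    resource = f'Cluster {cluster_name} res "{disk_resource}"'
    return [
        f"{resource} /maint:on",   # place the disk resource in maintenance mode
        f"Chkdsk {drive} /F /R",   # check and repair the shared disk
        f"{resource} /maint:off",  # return the resource to normal mode
        resource,                  # show the state of the resource
    ]


if __name__ == "__main__":
    for line in maintenance_commands("Clusterserver1", "Disk H:", "H"):
        print(line)
```

Printing the commands rather than running them keeps the operator in control: each line can be reviewed and pasted into the command prompt once the previous step has finished.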