Have you ever experienced performance gaps with your DFSN running on powerful Windows Server 2008 R2 hardware? Then this might be the right post for you.
Let’s get back to where everything began. One cloudy day, we and our users experienced heavy gaps while accessing the list of all connected drives (also known as My Computer). Whenever someone opened “My Computer”, Explorer froze for a couple of seconds until everything went back to normal. These gaps happened throughout the day: sometimes they came and went all day long, sometimes they appeared only in the morning or in the evening, and everyone using software from a DFS share experienced the same problem at the same time. After checking the defaults within our AD and on the file servers, we tried to dive deeper to get a feeling for why this happens and what the cause is.
Our Environment
Let’s explain what our environment looks like without telling you too much:
- 8000 clients
- 8 file servers (4 x 2, clustered)
- 4 DCs (two of them host the domain-based DFSN namespaces)
-- Hardware: every server has 16 cores / 32 GB RAM
-- Network: strong enough
- Single site!
- The DFSN structure looks like this:
-- \\local.domain\homefolder (~8000 DFS links) # mapped home directories
-- \\local.domain\profilefolder (~8000 DFS links) # roaming profiles
-- \\local.domain\poolfolder (100 DFS links) # file graveyard
-- \\local.domain\datafolder (~3000 DFS links) # files accessed by groups of people
Every DFS root namespace uses ABE (Access-Based Enumeration) to hide folders from users without access rights.
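As a side note, ABE can also be toggled programmatically. Here is a minimal sketch that flips the share-level ABE flag through the Win32 NetShareSetInfo API; the share name is a hypothetical placeholder, and the namespace-level ABE setting itself is normally flipped in DFS Management or with dfsutil:

```c
/* Minimal sketch: enable Access-Based Enumeration on a plain SMB share.
 * The share name "homefolder" is a hypothetical placeholder.
 * Compile with: cl abe.c netapi32.lib
 */
#include <windows.h>
#include <lm.h>
#include <stdio.h>

int wmain(void)
{
    SHARE_INFO_1005 info;
    DWORD parmErr = 0;
    NET_API_STATUS rc;

    /* Caution: info level 1005 replaces all flags at once; a robust
     * version would read the current flags via NetShareGetInfo first. */
    info.shi1005_flags = SHI1005_FLAGS_ACCESS_BASED_DIRECTORY_ENUM;

    /* NULL server name = local machine; requires administrative rights. */
    rc = NetShareSetInfo(NULL, L"homefolder", 1005, (LPBYTE)&info, &parmErr);
    if (rc != NERR_Success)
        wprintf(L"NetShareSetInfo failed: %lu\n", rc);
    else
        wprintf(L"ABE enabled on share.\n");
    return 0;
}
```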
The first thing we checked was whether our DFSN is within Microsoft’s recommendations. We found some information about a maximum recommended number of DFS links, but it applies to the Windows 2000 DFSN, not to Windows Server 2008. Here is the knowledge base entry (http://support.microsoft.com/kb/232613/en-us) that recommends a soft limit of 5000 links per DFS namespace. The really neat information comes from the DFS Replication scalability guidelines:
You can successfully deploy configurations that exceed any or all of these tested limits. However, it is important to test large configurations and verify that there is adequate space in the staging folders before using them in production environments. In addition, you might experience increased latency during replication. See the following Web pages for additional information: Understanding DFS Replication "Limits"
Keep this in mind, because it turns out that you would have to deploy a really, really big test environment to be sure everything works fine. Also keep in mind that we don’t use replication (DFSR).
After reading all the DFS information we could find on the internet, in books, etc., we were sure that our environment is within every recommended limit.
The next thing that came to mind was the network. Our experience is that if it isn’t WSUS or AV, it has to be a network issue ( ;) ). That, indeed, wasn’t the problem here. Thanks to our network guys for tracing everything down!
The next problem was that we couldn’t reproduce the failure, and the event logs on the servers didn’t give us any hint that there was a problem with AD, DFS, or our file servers. So we started eliminating everything we could to narrow down the problem.
What we have done
- We patched the firmware of our servers
-- (First question at the hardware supplier’s hotline: “Have you updated to the latest firmware?” … Uhm, nope, who runs weekly firmware updates on every single component in a production environment?!)
- Disabled teaming to reduce complexity
- Disabled network offloading (maybe a hardware bottleneck)
- Checked DFS with DfsDiag and the other Microsoft diagnostic tools
- Read a lot about dfsdnsconfig (but didn’t change the setting, because it didn’t seem to address our problem)
- Sniffed traffic on our DCs while the problem occurred
- Ran procmon to check whether it was a problem with the local file system
- Applied every DFS and file server patch around
Nothing gave us a clue. Every time we lay in wait for the problem with our analysis tools, nothing happened. This was a hard time for us, because without the ability to reproduce the problem we couldn’t solve it, or even find it.
One evening we had gaps for several hours. They came in waves, with short periods where everything went back to normal. So I checked what I had been doing at that time and found out that a script of mine had created some DFS links. This isn’t unusual; the script has been running for nearly two and a half years, creating shares whenever someone requests one. Yet this script was killing our DFSN for some reason. We investigated what the script does and checked at which point the DFSN problem occurred.
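For context, the core of such a provisioning script boils down to a single call. Here is a minimal sketch, with hypothetical link and target names, of creating a link in a domain-based namespace via the Win32 NetDfsAdd API (our actual script naturally wraps more logic around this):

```c
/* Minimal sketch: create a new link in a domain-based DFS namespace.
 * The link path, server, and share names are hypothetical placeholders.
 * Compile with: cl newlink.c netapi32.lib
 */
#include <windows.h>
#include <lm.h>
#include <lmdfs.h>
#include <stdio.h>

int wmain(void)
{
    /* New link inside the namespace and the file server share it points to. */
    LPWSTR dfsPath    = L"\\\\local.domain\\datafolder\\newproject";
    LPWSTR serverName = L"fileserver01";
    LPWSTR shareName  = L"newproject$";
    NET_API_STATUS rc;

    rc = NetDfsAdd(dfsPath, serverName, shareName,
                   L"created by provisioning script", 0);
    if (rc != NERR_Success)
        wprintf(L"NetDfsAdd failed: %lu\n", rc);
    else
        wprintf(L"DFS link created: %s\n", dfsPath);
    return 0;
}
```

Every such call modifies the namespace, and, as we found out, that is exactly what set off the chain reaction described below.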
So we investigated our seemingly perfectly running DFS.
Everything was running: perfmon, procmon, wireshark and so on (on every DC). As we walked through our script step by step, the access problem occurred right after we created a new DFSN link. This was really surprising to us, because something so fundamental should work without any problems, as it had for years. With all the information from our log files we put together the whole picture of our problem.
Procmon’s last entry before the problem occurred was an NTFS information event about the USN journal (Update Sequence Number journal) at the c:\DFSRoot\xxx folder. Right after this record, wireshark showed a huge list of directory requests from thousands of clients for the particular DFSN root namespace where the folder was created. When all these requests hit our DFSN servers, their CPU usage jumped to 60%. Summary: the sheer number of clients takes down our DFSN servers.
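If you want to verify the USN part yourself, the NTFS change journal can be queried directly. A minimal sketch, assuming the DFS root lives on the C: volume of the namespace server (run it elevated):

```c
/* Minimal sketch: query the NTFS change journal (USN journal) of the
 * volume hosting the DFS root folder. Run with administrative rights.
 */
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int wmain(void)
{
    USN_JOURNAL_DATA jd;
    DWORD bytes = 0;
    HANDLE hVol;

    /* Open the raw volume that hosts c:\DFSRoot. */
    hVol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                       FILE_SHARE_READ | FILE_SHARE_WRITE,
                       NULL, OPEN_EXISTING, 0, NULL);
    if (hVol == INVALID_HANDLE_VALUE) {
        wprintf(L"CreateFileW failed: %lu\n", GetLastError());
        return 1;
    }

    if (DeviceIoControl(hVol, FSCTL_QUERY_USN_JOURNAL, NULL, 0,
                        &jd, sizeof(jd), &bytes, NULL)) {
        /* NextUsn advances with every change below the root,
         * e.g. when a new DFS link folder is created. */
        wprintf(L"Journal ID: 0x%I64x, next USN: %I64d\n",
                jd.UsnJournalID, jd.NextUsn);
    } else {
        wprintf(L"FSCTL_QUERY_USN_JOURNAL failed: %lu\n", GetLastError());
    }

    CloseHandle(hVol);
    return 0;
}
```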
Where does this come from? The reason is that every change to the DFSN namespace affects the c:\DFSRoot\xxx folder. This USN change is picked up by the SMB server service, which checks whether anyone has an active session watching the c:\DFSRoot\xxx folder. Every client that is using DFS (potentially every client), whether via an open document or an open Explorer window (the tree listing on the left side), holds such an open connection. The server notifies all of these clients to reload their information (a change-notify response) at once, and stops responding for a couple of seconds / has to queue requests until every client has its information.
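On the client side this is the classic directory change-notification mechanism. Here is a minimal sketch of what Explorer effectively registers for on an open folder; the path is a hypothetical DFS folder, and over SMB the same registration travels to the namespace server as a change-notify request:

```c
/* Minimal sketch: register for change notifications on a directory,
 * which is roughly what Explorer does for every open folder view.
 * The path is a hypothetical placeholder for a DFS folder.
 */
#include <windows.h>
#include <stdio.h>

int wmain(void)
{
    HANDLE hNotify = FindFirstChangeNotificationW(
        L"\\\\local.domain\\datafolder",  /* hypothetical DFS path     */
        FALSE,                            /* do not watch the subtree  */
        FILE_NOTIFY_CHANGE_DIR_NAME);     /* fire on created dirs      */

    if (hNotify == INVALID_HANDLE_VALUE) {
        wprintf(L"FindFirstChangeNotificationW failed: %lu\n", GetLastError());
        return 1;
    }

    /* When a new DFS link appears under the root, this wait returns
     * and the client re-enumerates the folder. */
    if (WaitForSingleObject(hNotify, INFINITE) == WAIT_OBJECT_0)
        wprintf(L"Directory changed, re-enumerating...\n");

    FindCloseChangeNotification(hNotify);
    return 0;
}
```

Multiply this by every open Explorer window and document on 8000 clients, and the burst after a single new DFS link is easy to picture.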
That’s it: this is our problem, and there is no way around it at the moment. It is possible to deactivate the response to these change notifications on the client side, but that has a huge impact, because users then have to refresh Explorer far too often. This is not an option for our production environment.
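For completeness, the client-side switch is a registry policy for Explorer. We believe the relevant value is NoRemoteChangeNotify (see MS KB 812669); treat the exact key as an assumption and test it before rolling anything out. A minimal sketch of setting it for the current user:

```c
/* Minimal sketch: stop Explorer from requesting remote change
 * notifications for the current user. NOTE: the value name
 * NoRemoteChangeNotify is our assumption of the relevant policy
 * (see MS KB 812669); verify before deployment.
 */
#include <windows.h>
#include <stdio.h>

int wmain(void)
{
    DWORD value = 1; /* 1 = do not request remote change notifications */
    LSTATUS rc = RegSetKeyValueW(
        HKEY_CURRENT_USER,
        L"Software\\Microsoft\\Windows\\CurrentVersion\\Policies\\Explorer",
        L"NoRemoteChangeNotify",
        REG_DWORD, &value, sizeof(value));

    if (rc != ERROR_SUCCESS)
        wprintf(L"RegSetKeyValueW failed: %ld\n", rc);
    else
        wprintf(L"Set; Explorer views must now be refreshed manually.\n");
    return 0;
}
```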
Back to the point where I wrote that you should keep the test environment in mind: I think it is nearly impossible to build such a big test environment, with thousands of active clients holding active sessions to your DCs. To test this, you would need a huge number of DFS links, 1000+ logged-on clients with active sessions to the server, and so on.
The only way to reduce this problem is to add more DFS namespace servers to spread the requests across more machines. Disabling ABE could also defuse the situation, but that has a big impact on the users as well.
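The first option is again a single Win32 call. A minimal sketch, with a hypothetical new target server, assuming the Server 2008-era NetDfsAddRootTarget API:

```c
/* Minimal sketch: add another root target (namespace server) to an
 * existing domain-based namespace to spread client referrals and load.
 * Server and share names are hypothetical placeholders.
 * Compile with: cl addroot.c netapi32.lib
 */
#include <windows.h>
#include <lm.h>
#include <lmdfs.h>
#include <stdio.h>

int wmain(void)
{
    NET_API_STATUS rc = NetDfsAddRootTarget(
        L"\\\\local.domain\\datafolder",  /* existing namespace         */
        L"\\\\dc05\\datafolder",          /* new root target share      */
        0,                                /* 0 = namespace exists       */
        NULL,                             /* no comment                 */
        0);                               /* no flags                   */

    if (rc != NERR_Success)
        wprintf(L"NetDfsAddRootTarget failed: %lu\n", rc);
    else
        wprintf(L"Root target added.\n");
    return 0;
}
```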
Since we have been attending TechEd Europe for 10 years now, we sent this to some contacts we made at several TechEds and got a response from Microsoft. They are creating a hotfix for this problem, because this should not happen. The release date will be mid-February 2014. We hope this will help us solve the problem for good.
Hopefully this posting helps someone else get closer to a solution.
[addition]
We opened a call with Microsoft Premier Support for this issue and hope to get some information on how it can be solved.
[Update] Meanwhile we have spent a lot of time on the open case and talked to some Microsoft product managers at TechEd in Barcelona. We have a workaround for this issue, but no solution.
Microsoft confirmed the issue. The DFS servers become so busy evaluating the ACLs on our ABE-enabled shares that a gap of unresponsiveness occurs, which causes client Explorers and applications to hang and lose connectivity to the affected shares for some seconds or minutes. We tried several fixes from Microsoft (hotfixes, registry values, etc.), but nothing worked; we ran into the problem over and over again. Another idea was to restructure our DFS namespace, but that isn’t anything we want to do to our users, because they are busy enough finding their files in the current environment. Restructuring everything would result in huge user frustration… nothing an administrator wants to be responsible for :).
After a couple of weeks and lots of discussion we decided to implement a workaround: we only add new shares at night, when we can ensure that the server load is low enough that the ABE ACL evaluation won’t kill our servers.
We have the possibility to forward this case into a Microsoft process (DCR, Design Change Request) that might lead to a real fix in Windows Server, but this is time-expensive (creating perfmon logs, etc.) and needs more than one company with this issue and an open case at Microsoft before someone will work on a solution.
Here are some ideas for fixing this issue. Maybe someone can solve it with this list (we couldn’t):
- KB2920591
- Enable RootScalability Mode (KB2937429)
- Enable RSS (TCP Receive Side Scaling, see regkey EnableRSS, KB951037)
- Raise MaxThreadsPerQueue: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\MaxThreadsPerQueue = 256 (default: 20); a sketch of setting this follows below
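For the last item, here is a minimal sketch of setting the value programmatically. Run it elevated; our assumption is that the Server service (LanmanServer) has to be restarted, or the box rebooted, before the new thread count takes effect:

```c
/* Minimal sketch: raise the SMB server worker-thread limit from the
 * list above. Run with administrative rights; restart the Server
 * service (or reboot) afterwards -- our assumption for when the
 * value is picked up.
 */
#include <windows.h>
#include <stdio.h>

int wmain(void)
{
    DWORD threads = 256; /* default is 20 */
    LSTATUS rc = RegSetKeyValueW(
        HKEY_LOCAL_MACHINE,
        L"SYSTEM\\CurrentControlSet\\Services\\LanmanServer\\Parameters",
        L"MaxThreadsPerQueue",
        REG_DWORD, &threads, sizeof(threads));

    if (rc != ERROR_SUCCESS)
        wprintf(L"RegSetKeyValueW failed: %ld\n", rc);
    else
        wprintf(L"MaxThreadsPerQueue set to %lu.\n", threads);
    return 0;
}
```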