Have you ever experienced performance gaps with your DFSN running on Windows Server 2008 R2 servers with powerful hardware? Then this might be the right post for you.
Let’s get back to where everything began.
One cloudy day we and our users experienced heavy gaps while accessing the list of all connected drives (also known as My Computer). When they opened "My Computer", Explorer froze for a couple of seconds until everything went back to normal. These gaps happened all day long. Sometimes they came and went throughout the day, sometimes they only appeared in the morning or in the evening, and everyone using software from a DFS share experienced the same problem at the same time.
After checking the defaults within our AD and the file servers, we tried to dive deeper to get a feeling for why this happens and what the cause is.
Let me explain what our environment looks like without telling you too much.
– 8000 clients
– 8 file servers (4 x 2 clustered)
– 4 DCs (two of them hold the Domain Based DFSN Namespace)
— Hardware: Every Server: 16 Cores / 32GB Ram
— Network – strong enough
– Single site!
– DFSN structure looks like this
— \\local.domain\homefolder (~8000 DFS Links) # mapped home directory
— \\local.domain\profilefolder (~8000 DFS Links) # for roaming profiles
— \\local.domain\poolfolder (100 DFS Links) # file graveyard
— \\local.domain\datafolder (~3000 DFS Links) # folder for files that are accessed by groups of people
Every DFS root namespace uses ABE (Access Based Enumeration) to hide folders from users without access rights.
The first thing we checked was whether our DFSN is within Microsoft's recommendations. We found some information about the maximum recommended number of DFS links, but it applies to Windows 2000 Server DFSN and not to Windows Server 2008. Here is the knowledge base entry (http://support.microsoft.com/kb/232613/en-us) that recommends a soft limit of 5000 links per DFS. The really neat information you can get from the DFS Replication scalability guidelines is the following:
You can successfully deploy configurations that exceed any or all of these tested limits. However, it is important to test large configurations and verify that there is adequate space in the staging folders before using them in production environments. In addition, you might experience increased latency during replication. See the following Web pages for additional information: Understanding DFS Replication "Limits"
Keep this in mind, because it turns out that you have to deploy a really, really big test environment to be sure everything works fine.
Also keep in mind that we don't use replication (DFSR).
After reading all the DFS information we could find on the internet, in books and so on, we were sure that our environment was within every recommended limit.
The next thing that came to our mind was the network. In our experience, if it isn't WSUS or AV, it has to be a network issue ( 😉 ). That, indeed, wasn't the problem here. Thanks to our network guys for tracing everything down!
The next problem was that we couldn't reproduce the failure, and the event logs on the servers didn't give us any hint that there was a problem with AD, DFS or our file servers. We started to eliminate everything we could to narrow down the problem.
What we have done
– We patched the firmware of our servers
— (First question at the hardware supplier's hotline: Have you updated to the current firmware? ... Uhm, nope, who runs firmware updates in a production environment every week for every single component?!)
— Disabled teaming to reduce complexity
— Disabled network offloading (maybe a hardware bottleneck)
— Checked DFS with the DfsDiag / and other Microsoft diag tools
— Read a lot about dfsdnsconfig (but didn’t change the setting, because it seemed to not solve our problem)
— Sniffed traffic on our DCs while the problem occurred
— Ran procmon to check if it is a problem with the local filesystem
— Applied every DFS and Fileserver patch around
Nothing gave us a clue. Every time we waited for the problem with our analysis tools running, nothing happened. This was a hard time for us, because without the ability to reproduce the problem we could neither find nor solve it.
One evening we had gaps for several hours. They came in waves, with short periods in between where everything went back to normal. So I checked what I had been doing at that time and found out that a script of mine had created some DFS links. This isn't unusual, because this script has been running for nearly 2 1/2 years now, creating shares whenever someone requests one. For some reason, this script was killing our DFSN. We investigated what the script does and checked at which point the DFSN problem occurred.
So we investigated our seemingly perfectly running DFS.
Everything was running: perfmon, procmon, Wireshark and so on (on every DC). When we walked through our script step by step, the access problem occurred right after we created a new DFSN link. This was really surprising to us, because something so fundamental should work without any problems, as it had for years.
With all the information from our log files we put together the whole picture of our problem.
Procmon's last entry before the problem occurred was an NTFS entry about the USN journal (Update Sequence Number Journal) at the c:\DFSRoot\xxx folder.
After this record, Wireshark showed a huge list of directory requests from thousands of clients for the particular DFSN root namespace where the folder was created. Once all these requests hit our DFSN servers, their CPU usage went up to 60%.
Summary: The sheer number of clients takes out our DFSN servers.
Where does this come from?
The reason is that every change to the DFS namespace affects the c:\DFSRoot\xxx folder. The SMB server picks up this USN change notification and checks whether anyone has an active session to the c:\DFSRoot\xxx folder. Every client that uses DFS (possibly every client), whether via an open document or an open Explorer window (the tree view on the left side), has such an open connection. The SMB server instantly notifies all of these clients to reload the information (called a change notify request), and the server stops responding for a couple of seconds, or rather has to queue requests until every client has its information.
That's it. This is our problem, and there is no way around it at the moment. It is possible to disable the response to this change notification on the client side, but this has a huge impact, because users then have to refresh Explorer manually far too often. This is not an option for our production environment.
Back to the point where I wrote that you should keep the test environment in mind: I think it is nearly impossible to build a test environment that big. To test this, you would need a huge number of DFS links, 1000+ logged-on clients with active sessions to your DCs, and so on.
The only way to reduce this problem is to add more DFS namespace servers, so the requests are spread across more servers. Disabling ABE could resolve the situation, but this has a big impact on the users as well.
Since we have been attending TechEd Europe for 10 years now, we sent this to some contacts we made at several TechEds and got a response from Microsoft. They are creating a hotfix for this problem, because this should not happen. The release date will be mid-February 2014. We hope this will help us solve this problem for good.
Hopefully this post helps someone else get closer to a solution.
We opened a call with Microsoft Premier Support for this issue and hope to get some information on how it can be solved.
Meanwhile we have spent a lot of time on the open case and talked to some Microsoft product managers at TechEd in Barcelona. We have a workaround for this issue, but no solution.
Microsoft confirmed this issue. The DFS servers are too busy evaluating the ACLs on our ABE-enabled shares, so a period of unresponsiveness occurs that causes client Explorers and applications to hang and lose connectivity to the affected shares for some seconds or minutes.
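To get a feeling for the numbers, here is a rough back-of-the-envelope sketch in Python. It assumes the worst case: every client has an open session and re-lists the namespace root after a change, ABE evaluates one ACL per link per client, and the clients are spread evenly across the namespace servers. These are simplifications for illustration, not something Microsoft stated, and the function name is made up.

```python
def abe_checks_per_server(clients: int, links: int, namespace_servers: int) -> int:
    """Estimate ACL evaluations each namespace server performs after one change.

    Assumption: every client re-lists the root once, ABE checks every link
    for every client, and clients are spread evenly across the servers.
    """
    if namespace_servers < 1:
        raise ValueError("need at least one namespace server")
    total_checks = clients * links
    # Ceiling division, so a remainder still counts as work for a server.
    return -(-total_checks // namespace_servers)


if __name__ == "__main__":
    # Figures from the environment described in this post: 8000 clients,
    # ~3000 links in the datafolder namespace, 2 namespace servers.
    print(abe_checks_per_server(8000, 3000, 2))  # 12000000
    # Doubling the namespace servers halves the estimate per server.
    print(abe_checks_per_server(8000, 3000, 4))  # 6000000
```

Even if only a fraction of the clients really re-enumerate the root, a single new link can trigger millions of ACL evaluations in this model, which fits the gaps we saw.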
We tried several fixes from Microsoft to solve this issue (hotfixes, registry values, etc.), but nothing worked. We had this problem over and over again.
Another idea was to restructure our DFS namespace, but this isn't something we want to do to our users, because they are busy enough finding their files in the new environment. Restructuring everything would result in huge user frustration... Nothing an administrator wants to be responsible for :).
After a couple of weeks and lots of discussion, we decided to implement a workaround and only add new shares at night, when we can ensure that the server load is low enough that the ABE ACL evaluation won't kill our servers.
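The idea behind the nightly workaround can be sketched like this. It is a minimal Python sketch, not our actual provisioning script: the window of 22:00 to 05:00 and the function names are assumptions for illustration, and the real link creation is left out.

```python
from datetime import datetime, time


def in_night_window(now: datetime,
                    start: time = time(22, 0),
                    end: time = time(5, 0)) -> bool:
    """Return True if `now` falls inside the maintenance window.

    The window may wrap around midnight (for example 22:00 to 05:00).
    """
    current = now.time()
    if start <= end:
        return start <= current < end
    return current >= start or current < end


def links_to_create(pending: list[str], now: datetime) -> list[str]:
    """Return the queued link names that may be created right now.

    During the day nothing is released, so requests simply stay queued.
    """
    return list(pending) if in_night_window(now) else []


if __name__ == "__main__":
    queue = ["projectA", "projectB"]  # hypothetical pending share requests
    print(links_to_create(queue, datetime(2014, 1, 15, 12, 0)))
    print(links_to_create(queue, datetime(2014, 1, 15, 23, 30)))
```

Requests that come in during the day just wait in the queue until the window opens, so the notification storm hits the servers when hardly any client has an open session.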
We have the possibility to forward this case to a Microsoft process (DCR, Design Change Request) that may lead to a real solution in Windows Server, but this is time-consuming (creating perfmon logs, etc.) and needs more than one company with this issue and an open case at Microsoft before someone will work on a solution.
Here are some ideas to fix this issue. Maybe someone can solve it with this list (we couldn’t).
- Enable RootScalability Mode (KB2937429)
- Enable RSS (Receive Side Scaling, see registry value EnableRSS, KB951037)
- Raise MaxThreadsPerQueue – HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\MaxThreadsPerQueue = 256 (default=20)
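If you want to try the MaxThreadsPerQueue value, the following Python sketch only builds the matching `reg add` command line and prints it, so you can review it before running it yourself on the server. It assumes the full key path under the SYSTEM hive; the helper name is made up for this example.

```python
# Assumed full key path of the LanmanServer parameters (under SYSTEM).
LANMAN_PARAMETERS = r"HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"


def reg_add_dword(key: str, name: str, value: int) -> list[str]:
    """Build (but do not run) a `reg add` command that sets a DWORD value."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit into a DWORD")
    # /f overwrites an existing value without prompting.
    return ["reg", "add", key, "/v", name, "/t", "REG_DWORD",
            "/d", str(value), "/f"]


if __name__ == "__main__":
    command = reg_add_dword(LANMAN_PARAMETERS, "MaxThreadsPerQueue", 256)
    print(" ".join(command))
```

As written above, this setting did not solve the problem for us, so treat it as something to test, not as a fix.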
Hi, thanks for this info, very helpful!! Any update on the hotfix?
Yes, I have an update and added it to my post.
Please check the point [Update].
Let me know if you have any further questions.
Thanks, we are seeing somewhat the same issue: lots of delays and pauses with our DFSN. We implemented dfsdnsconfig, which helped a bit, but I see we could do more. Which case did you open? I would like to reach out to Microsoft as well to give it a push. Thanks
I'll write you an e-mail with the Advisory Performance Request Number.
Thanks a lot for this very helpful information.
We are responsible for an environment with at most 10000 users. We are migrating to Win10 and want to use DFS. Every user should have a separate link target for their profile. The design is basically the same as yours:
– 4 DCs, one domain, one forest
– 3 Sites (Site1=5000 users; Site2=2000 users; Site3=1000 users)
– 3 Domain based Root Servers
– Example \\ourdomain.ch\USERS$\username (8000-10000 Users)
– 6-8 File-Servers
– ABE not necessary for \\ourdomain.ch\USERS$\
– all Servers 2012 R2
– Domain and Forest Functional Level: 2012 R2
What's your experience now? Do you have any updates or recommendations?
I am designing a DFS Namespace similar to yours (number of users, profile folders, home folders and group shares).
I can’t seem to find best practices for this.
Can you share your thoughts/reasons/best practices for designing the DFS-N structure like you did?
Thanks for your comment.
I think the main part of this problem is ABE operating at the file share level. We don't experience these problems on the home / profile / terminal server profile paths, because they are mapped directly at the user level, like h: -> \\ad\home\user1 and so on. This mapping didn't cause any problems at our site, because a change to the \\ad\home root DFS namespace didn't cause an SMB change notification to a huge list of hosts.
Only our mapped "common" directory on Q: caused this, because we map Q: to the DFSN root namespace \\ad\XXX\. Every change to \\ad\XXX\ results in an SMB notification storm and takes out our DFSN servers (usually DCs).
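If you want to check your own drive mappings for this pattern, a small Python sketch like the following could help. It only looks at the UNC path and assumes the form server or domain, then namespace, then optional link folders; the function name is made up for this example.

```python
def mapped_at_namespace_root(unc_path: str) -> bool:
    """Return True if a UNC path points at a DFS namespace root.

    Assumes the path consists of a server or domain name, a namespace
    name and optional link folders below it. Exactly two components
    mean the drive is mapped to the root itself.
    """
    parts = [p for p in unc_path.split("\\") if p]
    if len(parts) < 2:
        raise ValueError("expected at least server and namespace")
    return len(parts) == 2


if __name__ == "__main__":
    # Home drive mapped to a single link below the root.
    print(mapped_at_namespace_root(r"\\ad\home\user1"))
    # "Common" drive mapped to the namespace root itself.
    print(mapped_at_namespace_root("\\\\ad\\XXX\\"))
```

Mappings that return True are the ones where every client keeps the whole root open, so they are the candidates for the notification storm.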
I don't really have best practices for designing your infrastructure. I think you should keep in mind that this problem exists, create new DFSN links only at night, and design your infrastructure to best fit your administration organisation. We currently handle this problem by creating links at night and wouldn't change anything in our directory design.
If you have further questions, or need some details, please let me know!