Wednesday, September 02, 2015

Disturbed Cache

The Scenario
Sharepoint 2013 developed publishing site using adfs. AppFabric with CU5.

The problem
In a developed Sharepoint Site we were seeing a lot of strange issues.
Sometimes some requests weren’t finished, some things might not get loaded.

The troubleshooting
It was very hard to troubleshoot since there wasn’t any patterns to the errors initially.
In an attempt mitigate the issue we raised the cookie-length for the adfssessions from default 60min to 720.
This gave us even more issues, now the frequency of the errors were more often. While navigating for 10min an issue usually occurred twice.
When monitoring the connections with fiddler while browsing we noticed a lot of requests going to the adfs-server. When really it should only happens twice a day.
I ran a search for errors in the ULS log for same period and noticed some errors relating to tokens
Common errors were:
Token Cache: reverting to local cache to add the token….
Unexpected Exception in SPDistributedCachePointerWrapper::InitializeDataCacheFactory for usage ‘DistributedLogonTokenCache’…. Request time out
And some others in dump below

clip_image002
Now obviously I’ve dangled with the cache before so really not surprised the nasty distributed cache is involved in this mess.

Ok, so now we’ve learned that timeouts occur on Distributed Caches Logon Token Cache. Who do we fix it?
Luckily the error message actually contains a part of the solution. Blablabla … MaxBuffersize must be greater or equal….
To actually find these errors I used my LogSearcher
-----------Logsearcher.ps1-------------
Add-PSSnapin microsoft.sharepoint.powershell
$errorcategory = "DistributedCache"
$string = "*token*"
# calculates hours forward and backward.
$endtime = (get-date).addhours(1)
$starttime = (get-date).addhours(-2)
$events = Get-SPLogEvent -StartTime $starttime -EndTime $endtime |where-object {$_.message -like $string -and $_.category -like $errorcategory }
$events |select timestamp,level,area, message,category | format-table -wrap
write-output "Total: "($events |measure).count
----------------eof------------------

To check servers timeout-settings
------------------getTokenSettings.ps1--------------
$sts = Get-SPSecurityTokenServiceConfig
write-host "SecurityTokenServiceconfig settings: "
$sts |select cookielifetime, logontokencacheexpirationwindow,WindowsTokenLifetime,MaxServiceTokenCacheItems,MaxLogonTokenCacheItems,MaxServiceTokenOptimisticCacheItems |format-list
write-host "DistributedLogonTokenCache settings:"
$dltc = Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache
$dltc | select maxbufferpoolsize,maxbuffersize,requesttimeout,channelopentimeout,maxconnectionstoserver
------------------------eof-----------------------

The solution
1. Increase timeouts on DistributedLogonTokenCache and items on SecurityTokenServiceconfig
------------------------SetTokenTimeouts.ps1------------------
$sts = Get-SPSecurityTokenServiceConfig
$dltc = Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache
$dltc.MaxBufferPoolSize = "1073741824"
$dltc.MaxBufferSize = "33554432"
$dltc.RequestTimeout = "3000"
$dltc.ChannelOpenTimeOut = "3000"
Set-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache $dltc
$sts.MaxServiceTokenCacheItems = "1500"
$sts.MaxLogonTokenCacheItems = "1500"
$sts.update()
----------------------eof-----------------

2. Stop all distributed cache services gracefully
Stop-SPDistributedCacheServiceInstance -graceful

Make sure to wait 5-10 minutes between first cache restart to the last. This so the cache actually have time to reconnect to the cachecluster and make friends with other caches. Note that cache need to be restarted from Manage Services after above command.

3. Run IISreset on all hosts.

Conclusions
These errors started occurring pretty might day one in the environment. Though it was when users actually started using the site as the problem became apparent.
The SecurityTokenServiceConfig doesn’t seem to be used the same way when using Windows Authentication. This made it impossible for the developers to troubleshoot in their environment without having their own adfs-solution.
Now a part of the problem was solved only by increasing cacheitems,channelopentimeout and requesttimeout. I would say that these settings solved the exceptions and fails.
While maxbuffersize and maxbufferpoolsize solved the Failed to add/get token errors below
clip_image004

First part I assume was due to the many API-connections that happened on the starter pages, big page with ajax-loading a lot of the components with rest. These might have maxed out the ItemsCache or it was due to the fact that tokens didn’t get saved in the cache properly or couldn’t be acquired in time.
Second part too many claims at a time, or big or improperly formatted.
But Im just guessing here.

References:
https://social.msdn.microsoft.com/Forums/en-US/e13871ce-5b6f-4a3c-bc04-fbce47870856/sharepoint-web-app-randomly-gives-302-redirect-to-reauthenticate-even-with-valid-fedauth-cookie?forum=sharepointgeneralprevious
https://habaneroconsulting.com/insights/sharepoint-2013-distributed-cache-bug
https://technet.microsoft.com/en-us/library/jj219613.aspx#finetune
https://danielthenerd.wordpress.com/2015/05/08/comprehensive-distributed-cache-and-security-token-service-config-review-for-sharepoint-2013/

No comments: