
Interesting article today on troubleshooting replication errors.

A few weeks ago, on production, we received an alert on replication failures. Upon further inspection, the error looked like this:

Command attempted:
if @@trancount > 0 rollback tran
(Transaction sequence number: 0x001031C200001B06000700000000, Command ID: 1)

Error messages:
The row was not found at the Subscriber when applying the replicated command. (Source: MSSQLServer, Error number: 20598)
Get help: http://help/20598
The row was not found at the Subscriber when applying the replicated command. (Source: MSSQLServer, Error number: 20598)
Get help: http://help/20598

Identify Root Cause:

To decide on a course of action, we first need to understand the underlying issue. Go to Replication Monitor >> Open Details Window on the subscription that has errors, then go to the ‘Distributor to Subscriber’ tab for more details. See the image below:

replication_error_20598

Now we see that replication is attempting to perform an action on the Subscriber, but the row does not exist. So, let's find out more.

Find the exact command that is being replicated (i.e., executed on the Subscriber as part of replication) and throwing this error. Use the replication procedure sp_browsereplcmds for that.

Query the Distribution agent ID from dbo.MSdistribution_agents and use it in the query below.
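
As a sketch, the agent ID can be pulled with a quick lookup (this assumes the default distribution database name, ‘distribution’; the publisher_database_id comes from dbo.MSpublisher_databases):

```sql
--
-- Hypothetical lookup of the Distribution Agent ID and publisher database ID
-- (assumes the distribution database is named 'distribution')
--
SELECT a.id, a.name, a.publisher_db, a.subscriber_db
FROM distribution.dbo.MSdistribution_agents AS a

SELECT id AS publisher_database_id, publisher_db
FROM distribution.dbo.MSpublisher_databases
GO
```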

--
-- Uncover the replication command throwing error
--
EXEC sp_browsereplcmds @xact_seqno_start = '0x001031C200001A620004000000000000'
                     , @xact_seqno_end = '0x001031C200001A620004000000000000'
                     , @agent_id = 49
                     , @publisher_database_id = 3
GO

You’ll see something like this:

replication_error_20598_investigation

Under the command column, we’ll see the exact command that is running into this error.

--
--  Error 20598 occurring in
--
{CALL [mtx_rpldel_ReportCriterion] (908236,71357,250,-1)}

Now, let's go to that stored procedure, ‘mtx_rpldel_ReportCriterion’, and see which tables are being manipulated. In my scenario, the table ReportCriterion does not have the record with ID = 908236.

Resolution

Once the root cause is understood, we have a few options.

  1. Data Integrity: Looks like we have synchronization issues between Publisher and Subscriber. If it is a non-production environment, or an environment where reinitializing is an option, then we could take that route to sync up the data first.
    1. Once the data integrity issues are resolved, all subsequent replication commands would succeed.
  2. Manual fix: Manually insert the missing record at the Subscriber and then allow replication to perform its operations on the new record.
    1. With this option, the more records we uncover as missing, the more manual work is required. It's not ideal, but it is a workaround to get things going again.
  3. Ignore, for now: In some situations, until a further solution is identified, we may need to ignore this one record and move forward with the rest of the replication commands.
    1. Take necessary precautions to make sure there are no more such missing records. Or gather a list of all missing ones.
    2. Configure replication to ignore error 20598 using the -SkipErrors parameter. There are a couple of ways to achieve this; here we’ll look at one.
    3. Go to the Agent Profile for this particular Distribution Agent. One of the built-in profiles allows us to skip certain errors. See the image below.
    4. replication_error_20598_resolution
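
As a hedged alternative to the UI route, the available Distribution Agent profiles can also be inspected directly; the built-in ‘Continue on data consistency errors’ profile is the one that skips errors 2601, 2627 and 20598:

```sql
--
-- List Distribution Agent profiles (agent_type = 3 is the Distribution Agent)
-- 'Continue on data consistency errors' skips errors 2601, 2627 and 20598
--
SELECT profile_id, profile_name, description
FROM msdb.dbo.MSagent_profiles
WHERE agent_type = 3
GO
```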

For more information, please refer to the Microsoft Support article on a similar issue.

Hope this helps,
_Sqltimes

Interesting one today:

For the last few months, on and off, there have been opportunities to run some interesting tests in our lab environment. This resulted in some good posts in the last few weeks. Adding to that tradition is another interesting topic: the Uniquifier.

Context:

Imagine a table with multiple records, a clustered index, and a bunch of nonclustered indexes. In a nonclustered index, the b-tree is structured based on the index keys, and at the bottom of the tree, each leaf entry points back to the clustered index using the clustering key. Now imagine the same scenario with a non-unique clustered index, where there could be multiple records with the same clustering key values. The dependent nonclustered indexes now need a way to uniquely identify the identical-looking entries. Enter the Uniquifier column!

Solution

An extra 4-byte column called the uniquifier is added to all nonclustered indexes to distinguish between multiple index entries that point to the same clustering key.

Let’s take an example:

We’ll re-use some of the code from previous posts for this.

--
-- Create a dummy table to test DBCC IND
--
IF EXISTS (SELECT * FROM sys.objects WHERE name = 'Uniquifier_Test' AND type = 'U')
 DROP TABLE dbo.Uniquifier_Test
GO

CREATE TABLE dbo.Uniquifier_Test (
    ID INT NOT NULL DEFAULT (1)
  , Name VARCHAR(5) NOT NULL DEFAULT('aa')
)
GO

--
-- Create Clustered and NonClustered indexes
--
CREATE CLUSTERED INDEX CI_Uniquifier_Test_ID
    ON dbo.Uniquifier_Test(ID ASC)
GO

CREATE NONCLUSTERED INDEX nCI_Uniquifier_Test_Name
    ON dbo.Uniquifier_Test(Name ASC)
GO

--
-- Let's insert some dummy records
--
INSERT INTO dbo.Uniquifier_Test (ID, Name)
VALUES (1, 'aa')
     , (2, 'bb')
     , (3, 'cc')
     , (4, 'dd')
GO

SELECT *
FROM dbo.Uniquifier_Test
GO

Now, let's look at the contents of the nonclustered index pages. For more details on retrieving the PageID and querying page contents, please refer to the previous posts.

--
-- Retrieve PageID of nCI
--
DBCC IND (test, Uniquifier_Test, -1)
GO

--
-- Retrieve contents of nCI page
--
DBCC TRACEON (3604)
DBCC PAGE(test, 1, 34535, 3)
GO

As you can see, along with the nCI key column (Name), the clustering key (ID) is added to the nCI b-tree structure. In addition, there is a new column called UNIQUIFIER added to the nonclustered index pages. Since we did not add any duplicate values, the UNIQUIFIER column is set to zero.

Uniquifier (before duplicate entries)

Now, let's add some duplicate entries.

--
-- Let's insert some duplicate records
--
INSERT INTO dbo.Uniquifier_Test (ID, Name)
VALUES (1, 'aa')
     , (2, 'bb')
GO

As you can see, where there are duplicate entries, the UNIQUIFIER column adds a unique value to distinguish between them. This incremental number is scoped to each set of duplicate entries: the two duplicate rows for (1, aa) have UNIQUIFIER values 0 and 1 respectively, and for the next set of duplicates the counter starts over from 0. So it can accommodate a lot of duplicate entries.

Uniquifier after adding duplicate entries

So, the uniquifier column helps Sql Server distinguish between two otherwise identical entries in a nonclustered index.
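
One hedged way to observe the overhead (assuming index_id = 2 for the nonclustered index, as in this example): with duplicates present, the maximum leaf record size should exceed the minimum by the 4-byte uniquifier.

```sql
--
-- Compare record sizes in the nonclustered index;
-- duplicated keys carry the extra 4-byte uniquifier
--
SELECT index_level, min_record_size_in_bytes, max_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID('test'), OBJECT_ID('dbo.Uniquifier_Test'), 2, NULL, 'DETAILED')
GO
```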

Hope this helps,
_Sqltimes

Quick one today:

In a previous post, we covered one of the techniques used to generate a text version of cluster log files using the command prompt. Today, we’ll cover another technique, a more common one going forward: using PowerShell.

Context:

In Windows Server 2012, it looks like the cluster.exe command-prompt interface is not installed by default when you install Failover Clustering.

failovercluster_commandlineinterface

PowerShell:

So, we’ll use PowerShell cmdlets to generate these cluster logs.

#
#  Generate cluster log from SampleCluster and save in temp folder.
#
Get-ClusterLog -Cluster SampleCluster -Destination "c:\temp\"

When you run it in a PowerShell window, the response looks something like this:

powershell_clusterlog

Hope this helps,
_Sqltimes

Quick one today:

In a recent post, we saw the value and usage examples of another undocumented command, DBCC IND. Here, we’ll cover one more such valuable command: DBCC PAGE. The output of DBCC IND is often used as the input for the DBCC PAGE command.

This command allows us to read the contents of the given page.

Let’s look at an example:

--
-- Create a dummy table to test DBCC IND
--
IF EXISTS (SELECT * FROM sys.objects WHERE name = 'Uniquifier_Test' AND type = 'U')
     DROP TABLE dbo.Uniquifier_Test
GO

CREATE TABLE dbo.Uniquifier_Test (
      ID    INT         NOT NULL    DEFAULT (1)
    , Name  VARCHAR(5)  NOT NULL    DEFAULT('aa')
)
GO

--
-- Create Clustered and NonClustered indexes
--
CREATE CLUSTERED INDEX CI_Uniquifier_Test_ID
    ON dbo.Uniquifier_Test(ID ASC)
GO

CREATE NONCLUSTERED INDEX nCI_Uniquifier_Test_Name
    ON dbo.Uniquifier_Test(Name ASC)
GO

--
-- Let's insert some dummy records
--
INSERT INTO dbo.Uniquifier_Test
DEFAULT VALUES
GO 2

SELECT *
FROM dbo.Uniquifier_Test
GO

The results look something like this (nothing unusual):

DBCC IND

Using the technique mentioned in the previous post on DBCC IND, we get the PageID for the nonclustered index as 34538, where PageType is 2.

Now, let's look at the syntax for DBCC PAGE:

DBCC PAGE (Database Name, Page File ID, PageID, PrintOptions)

  • Database Name is the name of the database where the object exists
  • Page File ID is the ID of the file in the target database where the page exists
  • PageID is the ID of the page we want to retrieve
  • PrintOptions:
    • 0 displays only the page header information
    • 1 displays the page header with some row hex details
    • 2 displays the entire page hex
    • 3 displays the page header and details of each row
      • This is the option used to retrieve data in tabular fashion.

Note: By default, when you run this command, the output is sent to the Sql Server error log. If you want the output displayed in SSMS, enable trace flag 3604.

--
-- Enable trace flag
--
DBCC TRACEON (3604)
GO

Now, let's move on to retrieving the page contents:

--
-- Retrieve page contents in tabular fashion
--
DBCC TRACEON (3604)
DBCC PAGE(test, 1, 34538, 3)
GO

dbcc_page

As you can see, we get the contents of the page (the actual records stored, i.e. (1, aa) and (1, aa)) in page 34538, along with other details like PageID, File ID, etc. We’ll talk about the UNIQUIFIER column in a separate post.

Hope this helps,
_Sqltimes

Quick one today:

Though rarely used, there are some hidden DBCC commands that come in handy in unexpected situations. Today, we’ll look at one such undocumented command called DBCC IND. Output from this is used as input for DBCC PAGE command.

DBCC IND

It allows us to find the PageIDs of a given object. In Sql Server, data is stored in pages, and each page has an ID. This command uncovers the list of PageIDs used by an object to store its data, whether table data or index data.

Let’s look at an example:

--
-- Create a dummy table to test DBCC IND
--
IF EXISTS (SELECT * FROM sys.objects WHERE name = 'Uniquifier_Test' AND type = 'U')
     DROP TABLE dbo.Uniquifier_Test
GO

CREATE TABLE dbo.Uniquifier_Test (
      ID    INT         NOT NULL    DEFAULT (1)
    , Name  VARCHAR(5)  NOT NULL    DEFAULT('aa')
)
GO

--
-- Create Clustered and NonClustered indexes
--
CREATE CLUSTERED INDEX CI_Uniquifier_Test_ID
    ON dbo.Uniquifier_Test(ID ASC)
GO

CREATE NONCLUSTERED INDEX nCI_Uniquifier_Test_Name
    ON dbo.Uniquifier_Test(Name ASC)
GO

--
-- Let's insert some dummy records
--
INSERT INTO dbo.Uniquifier_Test
DEFAULT VALUES
GO 2

SELECT *
FROM dbo.Uniquifier_Test
GO

The results look something like this (nothing unusual):

DBCC IND

Now, let's figure out which pages are used to store the clustered index and the nonclustered index. This command takes three parameters:

  • Database Name
  • Object Name
  • Index ID

DBCC IND ('Database Name', 'Object Name', 'Index ID')

Clustered Index:

For the clustered index, use 1 for the IndexID.

--
-- List of pages used by Clustered Index
--
DBCC IND(test, Uniquifier_Test, 1)
GO

DBCC IND results for ClusterIndex

NonClustered Index:

For nonclustered indexes, specify the IndexID to return the pertinent results.

--
-- List of pages used by Non-Clustered Index
--
DBCC IND(test, Uniquifier_Test, 2)
GO

DBCC IND Nonclustered Index

All Indexes at Once:

With -1 as the value for the IndexID parameter, we can return the PageID list for all indexes on this table.

--
-- All in one
--
DBCC IND(test, Uniquifier_Test, -1)
GO

DBCC IND For All

Most of the column names are self-explanatory.

  • PageFID is the file ID of the page
  • PageID is the ID of the page
  • ….
  • ….
  • PageType displays the type of page
    • 1 – data page
    • 2 – index page
    • 3 and 4 – text pages
    • 8 – GAM page
    • 9 – SGAM page
    • 10 – IAM page
    • 11 – PFS page
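
As a side note (a sketch, not part of the original walkthrough): on Sql Server 2012 and later, the undocumented DMF sys.dm_db_database_page_allocations returns similar information in a queryable form:

```sql
--
-- Alternative to DBCC IND on Sql Server 2012+ (also undocumented)
--
SELECT allocated_page_file_id, allocated_page_page_id, page_type_desc
FROM sys.dm_db_database_page_allocations(DB_ID('test'), OBJECT_ID('dbo.Uniquifier_Test'), NULL, NULL, 'DETAILED')
GO
```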
Hope this helps,
_Sqltimes

Interesting post today:

Earlier this week, there was an interesting error in replication in our production environment:

select spid, blocked, lastwaittype from sysprocesses 
where '~68~~619~~1017~~1277~' like '%~' + cast (spid as nvarchar)+ '~%' 
and '~68~~619~~1017~~1277~' like '%~' + cast (blocked as nvarchar)+ '~%'

We have a dedicated Distributor that delivers data from the Publisher to a remote Subscriber. Before this error occurred, data was flowing well; but once the error occurred, the agent kept retrying and there was no movement in delivery.

Reasons:

On further investigation, it became apparent that this agent had multiple threads configured to deliver the payload to the Subscriber, a.k.a. SubscriptionStreams.

In a previous post on SubscriptionStreams, an important point was highlighted, which is worth repeating here:

  • If any one thread runs into problems and rolls back, all other threads also roll back.

With the above point in context, it looks like there is an issue in the thread-scheduling logic of the Distribution Agent. Whether the Distribution Agent runs as a single thread or multiple threads, ACID properties need to be maintained; so when one thread times out (or runs into a problem), the rest of the threads need to be rolled back.

My guess is that the Distribution Agent keeps retrying with multiple threads and runs into a scheduling conflict when one of the threads times out or is unable to divide the workload for parallel execution.

Let’s say the replication payload is too small for parallel operation; Sql Server then allows only one thread to do the work. In this scenario, the other threads need to be kept active, but in some odd situations they might time out, resulting in the whole batch being rolled back.

Solution:

  • Microsoft confirmed that this is a bug and released a patch (for Sql Server 2005).
  • In our situation, we are in Sql Server 2008 R2, so this patch does not apply; The trick that worked was to remove parallel operation : SubscriptionStreams.

Once parallelism is removed, the traffic started flowing through.
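
As a sketch (assuming the default replication agent jobs), the Distribution Agent job steps carrying this parameter can be located like so; removing -SubscriptionStreams from the step command (or setting it to 1) disables parallel delivery:

```sql
--
-- Find Distribution Agent job steps configured with -SubscriptionStreams
--
SELECT j.name AS job_name, s.step_id, s.command
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobsteps AS s
    ON s.job_id = j.job_id
WHERE s.subsystem = 'Distribution'
  AND s.command LIKE '%-SubscriptionStreams%'
GO
```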

Important notes:

  • Make sure you experiment with CommitBatchSize & CommitBatchThreshold before you advance to SubscriptionStreams.
  • SubscriptionStreams is a good option that improves replication performance. But it is an advanced option that is only suitable for certain situations.
  • So, when replication traffic patterns vary, the usage of this parameter needs to vary.
  • For heavy workloads, we set parallelism; for lower-than-normal workloads, we remove or reduce parallelism.

Hope this helps,
_Sqltimes

Interesting topic today:

In Sql Server, as we all know, concurrency is maintained through locks & latches. When a particular row is retrieved for any activity, a lock is requested on that row, a.k.a. a Shared or Exclusive lock, depending on the nature of the activity. Once the activity is completed, the lock is released. But when a large number of records in a table are retrieved, something interesting happens: Sql Server, rather than issuing individual locks on each row, performs lock escalation to lock the entire table.

This behavior is helpful in some scenarios and detrimental in others. Each lock takes some resources (memory, etc.), so a large number of smaller locks (ROWLOCKs) adds up to a lot of resources. Sql Server therefore escalates the lock to either PAGE level or TABLE level (depending on the scenario). Sometimes this results in longer waits for the PAGE (or the entire table) to be freed from other locks before the new lock request can be granted.

To avoid all of this, developers sometimes use the ROWLOCK hint to force Sql Server to take row locks even when operating on a larger number of records (to avoid waiting for an entire PAGE/TABLE to be free from other locks). There are several pros and cons to this approach; as always, the details of the situation guide the best approach in each scenario. One exception: even when we use ROWLOCK, sometimes Sql Server forces lock escalation anyway.
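
As a related knob (a sketch, not part of the tests below): since Sql Server 2008, escalation behavior can also be influenced per table with the LOCK_ESCALATION option:

```sql
--
-- Control lock escalation per table: AUTO | TABLE (default) | DISABLE
-- (DISABLE prevents escalation to the table level in most cases)
--
ALTER TABLE dbo.SampleTable SET (LOCK_ESCALATION = DISABLE)
GO
```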

Today, we’ll look at one aspect of this automatic lock escalation:

Question: At what point does Sql Server force lock escalation, even when ROWLOCK is used?

Let’s run a DELETE on a table and see how lock escalation happens, at three different levels:

  • 2000 ROWLOCK requests
  • 4000 ROWLOCK requests
  • 6000 ROWLOCK requests

Step 1: At 2000 ROWLOCKS

--
-- Let's run a large DELETE operation
--
DELETE TOP (2000)
FROM dbo.SampleTable WITH (ROWLOCK)
GO

Now, in a different SSMS window, let’s check the number of locks on the table.

--
-- Check locks on the table
--
EXEC sp_lock 56
WAITFOR DELAY '00:00:01'    -- keep running every second, to capture locks from other window
GO 10

 

Row Level Locks Granted

As you can see, ROWLOCKs on keys are GRANTed. Now, let's increase the batch size and see where forced lock escalation happens.
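
As a sketch of a more modern alternative to sp_lock, the same check can be done with sys.dm_tran_locks, aggregating granted locks by type:

```sql
--
-- Count granted locks by session, resource type and lock mode
--
SELECT request_session_id, resource_type, request_mode, COUNT(*) AS lock_count
FROM sys.dm_tran_locks
WHERE request_status = 'GRANT'
GROUP BY request_session_id, resource_type, request_mode
ORDER BY lock_count DESC
GO
```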

Step 2: At 4000 ROWLOCKS

--
-- Let's run a large DELETE operation
--
DELETE TOP (4000)
FROM dbo.SampleTable WITH (ROWLOCK)
GO

Let’s check the lock situation:

ROWLOCKs granted at 4000

So even at 4000, something interesting happens: some ROWLOCKs are issued and some PAGELOCKs are issued. It looks like, for some rows, the lock is escalated to PAGE level for efficiency. Let's continue.

Step 3: At 6000 ROWLOCKS

--
-- Let's run a large DELETE operation
--
DELETE TOP (6000)
FROM dbo.SampleTable WITH (ROWLOCK)
GO

Let’s review the lock situation:

ROWLOCKs at 6000

BINGO!! As you can see, lock escalation to TABLOCK occurs at 6000.

Initially, ROWLOCKs are issued; then just a second later, some locks are escalated to PAGE and subsequently to TABLE level. This is interesting.

Another Nuance: Percentage

Question: Does this escalation occur based on a strict number (or range) or is it based on percentage of records being accessed in a table?

  • From the tests, it seems like lock escalation occurs based on the absolute number of records, not the percentage of records being manipulated.
Table Size | Total Record Count | Records Lock Requested On | Type of Lock Granted
Small      | 2,100              | 2,000                     | ROWLOCK
Medium     | 35,000             | 2,000                     | ROWLOCK
Large      | 500,000            | 2,000                     | ROWLOCK

Conclusion:

  1. Sql Server makes intelligent estimations about what is better at each level and escalates locks as needed. Keep 2000, 4000 & 6000 in mind as a general rule, not a set-in-stone rule.
  2. At any given point, Sql Server makes the best judgement call on what is more efficient.
  3. Lock Escalation is based on the number of records on which lock is requested and not on the percentage of records relative to the total records in the table.

Important Note:

  1. This behavior applies only to Sql Server 2012, as it varies from version to version.
  2. In the past, for Sql Server 2008 and before, the number was around 1800 – 2200 before TABLE lock escalation occurred.
  3. If this trend continues, maybe for Sql Server 2016 the number will be slightly higher, as Microsoft improves lock efficiency by reducing the amount of resources required for each lock.
  4. This does not mean we should go ‘free range’ crazy and use ROWLOCK on every query & use 4000 every time. Keep this information handy when making data retrieval decisions. Each ROWLOCK takes resources (memory & CPU), so we need to use caution and minimize the overhead. This is a refined technique to be used on the infrequent occasions that seem most suitable for it. Perform tests before using in production.

 

Hope this helps,
_Sqltimes