Percy Reyes: Replication


Wednesday, 17 January 2024

Looking deeper into the physical & logical architecture - Transaction Log File

Without a doubt, a good understanding of the Transaction Log (T-Log) is essential for diagnosing unforeseen performance issues related to it, and I am sure almost everyone has faced at least one.

The T-Log is basically a record of every transaction applied to the database. Each change is first recorded in the physical T-Log file (write-ahead logging), and the modified pages are written to the Data File later, typically by the CHECKPOINT process (the Lazy Writer can also flush modified pages when memory is under pressure). The T-Log supports point-in-time recovery (full recovery model), and it records the start and end of each transaction, every data modification (insert, update, delete) including those made by system stored procedures, DDL statements against any table including system tables, every extent and page allocation and de-allocation, and the creation or drop of tables and indexes.
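As a starting point for that kind of diagnosis, the following is a minimal sketch that checks the recovery model, the reason the log cannot currently be truncated, and how full each log file is. The database name MyDB is only a placeholder.

```sql
-- Recovery model and the current reason (if any) preventing log truncation.
-- MyDB is a placeholder database name.
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases
WHERE name = N'MyDB';

-- Size and percentage used of the transaction log of every database.
DBCC SQLPERF(LOGSPACE);
```

A log_reuse_wait_desc value of REPLICATION, for example, indicates that the log cannot be truncated because replication (or CDC) has not yet processed some of the log records.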

MSSQL_ENG003165: An error was encountered while replication was being restored/removed. The database has been left offline

When a replicated database is restored without the KEEP_REPLICATION option, SQL Server removes the replication settings by executing sp_restoredbreplication at the end of the process. The 'sp_restoredbreplication' system stored procedure deletes all replication metadata: it drops the 'tr_MStran_alterschemaonly', 'tr_MStran_altertable', 'tr_MStran_altertrigger' and 'tr_MStran_alterview' triggers (which were created to validate alterations to replicated tables, triggers and views), disables user tables for replication, and deletes subscriptions, publications and articles. Nevertheless, there may be cases where 'sp_restoredbreplication' cannot be executed successfully and ends up leaving the database OFFLINE. I personally experienced that case, and the error was something like this:

Msg 3165, Level 16, State 1, Line 1
Database ‘MyDB’ was restored, however an error was encountered while replication was being restored/removed. The database has been left offline. See the topic MSSQL_ENG003165 in SQL Server Books Online.
Msg 3167, Level 16, State 1, Line 1
RESTORE could not start database ‘MyDB’.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.

Looking into this case, I could see that the cause was a DDL database trigger which existed inside the database. Let me expand on what I am saying. The database had that trigger to audit schema changes, which were supposed to be saved into an auditing table. Unfortunately, that auditing table did not exist on the server where the database was being restored. As a result, the trigger failed when the replication objects were being dropped, the deletion of the replication settings was not completed, and 'sp_restoredbreplication' did not finish correctly. Consequently, the restore was stopped and SQL Server left the database OFFLINE.

In order to restore a copy of this database, we need to disable all DDL database triggers before taking the backup. Only then will the database be restored successfully. The other way to deal with this issue is to set the database ONLINE manually after the restore fails, and then execute 'sp_restoredbreplication'.
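The first approach can be sketched as follows. This is only an illustration: MyDB is the database name from the error above, and trg_AuditSchemaChanges is a hypothetical name for the auditing DDL trigger, so replace it with the name returned by the first query.

```sql
-- On the source server, before taking the backup.
USE MyDB;
GO
-- List the database-level DDL triggers and whether they are disabled.
SELECT name, is_disabled
FROM sys.triggers
WHERE parent_class_desc = 'DATABASE';
GO
-- trg_AuditSchemaChanges is a hypothetical trigger name.
DISABLE TRIGGER trg_AuditSchemaChanges ON DATABASE;
GO
-- ... take the backup here, then re-enable the trigger on the source ...
ENABLE TRIGGER trg_AuditSchemaChanges ON DATABASE;
GO
```

For the second approach, the database can be brought online on the destination server with the statement below, after which 'sp_restoredbreplication' is executed as described above (its parameters are not shown in this post, so none are assumed here).

```sql
-- On the destination server, after the restore has failed with error 3165.
ALTER DATABASE MyDB SET ONLINE;
```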

To sum up, we need to proceed with extra caution when working with databases linked to replication. That is all for now. Let me know any remarks you may have. Thanks for reading. Stay tuned.

Thursday, 29 September 2016

Using ‘sp_browsereplcmds’ to Diagnose SQL Replication Issues

When diagnosing transactional replication issues in SQL Server, we may need to examine pending commands within the distribution database. In other words, we not only have to monitor these pending commands but also take the necessary actions to keep replication working. For instance, at times we may need to remove specific commands because their errors prevent other commands from being replicated to subscribers. Before doing so, we must first identify which commands have to be removed from the queue by using the sp_browsereplcmds system stored procedure, which is executed in the distribution database. This procedure accepts several input parameters, such as article_id.

EXEC sp_browsereplcmds @article_id = 1

This returns only the pending commands for the article in question. (Remember that an article in replication is directly related to a table. You can find the article ID by querying the ‘sysarticles’ system table inside the published database.)
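A minimal sketch of that lookup is shown below. MyPublishedDB is a placeholder for the name of the published database.

```sql
-- Run on the Publisher, inside the published database.
-- MyPublishedDB is a placeholder name.
USE MyPublishedDB;
GO
-- artid is the value to pass as @article_id to sp_browsereplcmds.
SELECT artid, name, dest_table
FROM dbo.sysarticles
ORDER BY artid;
```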

Another parameter we can use to get more specific information is the transaction sequence number, which is essentially the identifier of the transaction. Luckily, replication error messages usually include the sequence number and the command ID, which allow us to identify exactly which command is the root cause.

EXEC sp_browsereplcmds @xact_seqno_start = '0x00000027000000B50008', @xact_seqno_end = '0x00000027000000B50008'

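There are other parameters, such as the command ID to get only the command we need to look into, and the publisher database ID to get all commands for that database. Note that the publisher database ID is the identifier the Distributor assigns to the published database, not the database_id from sys.databases.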


EXEC sp_browsereplcmds @xact_seqno_start = '0x00000027000000B50008', @xact_seqno_end = '0x00000027000000B50008',
@publisher_database_id = 33, @article_id = 1, @command_id = 1
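To find the value to pass as @publisher_database_id, the Distributor's own list of published databases can be queried. This is a small sketch that assumes the distribution database uses the default name 'distribution'.

```sql
-- Run on the Distributor. Assumes the default distribution database name.
USE distribution;
GO
-- id is the value expected by @publisher_database_id.
SELECT id, publisher_db
FROM dbo.MSpublisher_databases;
```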

Be cautious: do not execute ‘sp_browsereplcmds’ without any parameters in production environments. The Distribution database can hold millions of commands, so we will not get what we need quickly and we will also hurt the performance of the database server. I hope you find this post useful when it comes to troubleshooting replication issues. Let me know any remarks you may have. Thanks for reading.

Thursday, 21 July 2016

The 'SkipErrors' parameter for the Replication Distribution Agent

Having encountered numerous errors in SQL Server Replication, I can confidently say that the majority occur at the replication agent level, such as with the Distribution Agent, Log Reader Agent, Snapshot Agent, Queue Reader Agent, and others. Unfortunately, many of these issues are related to Primary and Foreign Key conflicts, which can take significant time to resolve, either individually or by reconfiguring the replication. To avoid downtime and allow operations to continue while we work on fixing these errors, we can use a helpful option: the 'SkipErrors' parameter. I've successfully used this parameter many times to bypass such issues, and it can also be applied to skip other types of errors as needed.

Today's post will show how to use the 'SkipErrors' parameter, which allows us to skip specific errors so that the data synchronization process is not stopped. This parameter is configurable in the profile of the Distribution Agent and takes as input the error numbers we want to skip.

The following picture shows an error (code 547, related to a Foreign Key issue) in the Distribution Agent process, and we can see how transactions are being queued because of it. Consequently, the error needs to be fixed so that the rest of the pending transactions can move on. (The Distribution Agent reads the 'MSrepl_commands' table sequentially to get the commands to execute on the subscribers, which means that the first command in the queue is the first one to be delivered to subscribers.)

Another common error where you can use the 'SkipErrors' parameter occurs when the rows to be changed do not exist on the subscriber (The row was not found at the Subscriber when applying the replicated command). The error code for this case is 20598.

I mentioned before that the 'SkipErrors' parameter is configurable inside the Distribution Agent profile, and that is what we are going to do right now. Firstly, we need to create a customized profile based on the Default profile and write in the 'Value' column the numbers of the errors to be skipped, separated by colons, for example 20598:547, as we can see in the following picture.

Having done that, we may have to restart the Distribution Agent. The next time the Distribution Agent starts up, it will load the new customized profile with the error codes to be skipped (ignored). Finally, we can verify not only that the errors are being skipped, but also that the Distribution Agent is running with no problem.
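To double-check what the customized profile contains, the profile tables on the Distributor can be queried. This is only a sketch: it assumes the profiles are stored in the msdb tables MSagent_profiles and MSagent_parameters and that agent_type 3 identifies the Distribution Agent.

```sql
-- Run on the Distributor.
-- Assumption: agent_type = 3 corresponds to the Distribution Agent.
SELECT p.profile_name, a.parameter_name, a.value
FROM msdb.dbo.MSagent_profiles AS p
JOIN msdb.dbo.MSagent_parameters AS a
    ON a.profile_id = p.profile_id
WHERE p.agent_type = 3
ORDER BY p.profile_name, a.parameter_name;
```

The customized profile should list the SkipErrors parameter with the value entered above, for example 20598:547.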

To finish this post, keep in mind that the transactions that raised these errors were skipped, which means that you will not be able to recover them and your data consistency may be affected. In other words, we must use the 'SkipErrors' parameter with extreme caution. Let me know any remarks you may have. Thanks for reading.

Thursday, 30 June 2016

Error 20598: The row was not found at the Subscriber when applying the replicated command

In transactional replication environments with read-only subscribers (which means that changes are not propagated back to the publisher), it is very important to understand that rows on subscribers must NOT be modified directly. Otherwise, we will have a big problem. Let me expand on what I am saying. For instance, if a row on the subscriber (which was replicated from the publisher) is deleted and the same row is later modified on the publisher, the following error will be raised:

The row was not found at the Subscriber when applying the replicated command

Clearly, this issue occurs because the row to be updated no longer exists on the subscriber when the Distribution Agent tries to propagate the change. Therefore, we need to fix it as soon as possible to prevent the replication queue from growing too much. To solve this case, we must review the pending commands inside the 'distribution' database by using 'sp_browsereplcmds' in order to identify the affected transaction(s) and row(s), and then insert the missing row manually on the subscriber. (Alternatively, we can delete the command from the queue, but only if the row was deliberately removed and is no longer needed on the subscriber.)
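The review step can be sketched as follows. This assumes the default distribution database name, and the sequence number passed to 'sp_browsereplcmds' is only a placeholder to be replaced with the xact_seqno value returned by the first query.

```sql
-- Run on the Distributor. Assumes the default distribution database name.
USE distribution;
GO
-- Most recent occurrences of error 20598, with the transaction sequence
-- number and command ID of the command that failed.
SELECT TOP (20) time, error_code, error_text, xact_seqno, command_id
FROM dbo.MSrepl_errors
WHERE error_code = N'20598'
ORDER BY time DESC;
GO
-- Placeholder sequence number: use the xact_seqno returned above.
EXEC sp_browsereplcmds
    @xact_seqno_start = '0x00000027000000B50008',
    @xact_seqno_end   = '0x00000027000000B50008';
```

The command text returned by 'sp_browsereplcmds' shows the primary key values of the affected row, which is what we need in order to re-insert it on the subscriber.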

Another technique is to use the 'SkipErrors' parameter, which allows us to skip errors of a certain type (for this issue the error number is 20598), which means that the affected transaction is simply ignored and skipped. Keep in mind that these sorts of errors must be treated with extreme caution and a correct understanding of the situation.

That is all for now. Let me know any remarks you may have. Thanks for reading.

Friday, 3 June 2016

How to move the files of database which has Replication, Mirroring, Log Shipping or AlwaysOn Settings

One of the challenging tasks in the life of a DBA is moving all or some of the files of a database from one physical location to another because of performance issues, maintenance requirements, disk space issues, etc. We usually move database files by using Backup/Restore or Detach/Attach procedures. They are the right methods for most business cases, but not for all. For instance, they will not work for databases which have Replication, Mirroring, Log Shipping or AlwaysOn settings, because you would have to remove these settings before moving the files and then set everything up again, which wastes time and keeps the database service stopped longer than necessary. In this situation Backup/Restore or Detach/Attach is simply NOT an option because we need to make the database available as soon as possible. So, to move the files of this type of database, we modify the physical name of each database file we want to move. For instance, the following code moves 4 files (3 Data Files and 1 Log File):

ALTER DATABASE SalesDB MODIFY FILE (NAME=N'SalesDB_Data01', FILENAME= N'D:\SQLData\SalesDB\SalesDB_Data01.mdf')

ALTER DATABASE SalesDB MODIFY FILE (NAME=N'SalesDB_Data02', FILENAME= N'D:\SQLData\SalesDB\SalesDB_Data02.ndf')

ALTER DATABASE SalesDB MODIFY FILE (NAME=N'SalesDB_Data03', FILENAME= N'D:\SQLData\SalesDB\SalesDB_Data03.ndf')

ALTER DATABASE SalesDB MODIFY FILE (NAME=N'SalesDB_Log', FILENAME= N'E:\SQLLog\SalesDB\SalesDB_Log.ldf')

It is very important to verify that the new database file folders already exist. If so, the output should be:

The file “SalesDB_Data01” has been modified in the system catalog. The new path will be used the next time the database is started.
The file “SalesDB_Data02” has been modified in the system catalog. The new path will be used the next time the database is started.
The file “SalesDB_Data03” has been modified in the system catalog. The new path will be used the next time the database is started.
The file “SalesDB_Log” has been modified in the system catalog. The new path will be used the next time the database is started.

What’s next? We must stop the SQL Server Database Engine service and then manually move every database file to the new location indicated in the code above. Finally, we start the service again, and it will load the files from the new location. With this method you do not need to remove any of the settings mentioned before. Once the files are in the new location, the database should start without any problem. If it does not, make sure that the SQL Server service account has Full Control permission on the database files in the new location.
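A simple way to confirm the result is to compare the paths registered in the system catalog before stopping the service and again after it has started. A minimal sketch, using the SalesDB database from the example above:

```sql
-- Paths registered in the system catalog for each file of SalesDB,
-- together with the current state of each file.
SELECT name, physical_name, state_desc
FROM sys.master_files
WHERE database_id = DB_ID(N'SalesDB');
```

After the restart, every file should show the new path in physical_name and ONLINE in state_desc.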

I hope this tip helps you save time and get your database available again quickly. I will be pleased to answer any questions you may have. Thanks for reading!

Monday, 15 February 2016

Transactional Replication and Change Data Capture: The Log Reader Agent Conflict

Behind the closed doors of SQL Server, the following issue may arise when Transactional Replication and Change Data Capture (CDC) are deployed and running together for the same database and the CDC jobs have been managed incorrectly. When CDC is deployed, two SQL Server Agent jobs are created for the CDC process: 'cdc.MyDB_capture' and 'cdc.MyDB_cleanup'.
Looking into the first one, the 'cdc.MyDB_capture' job executes the 'sys.sp_MScdc_capture_job' system stored procedure, which invokes 'sp_cdc_scan' to read the Transaction Log internally and capture the changes made in the database, using the same log reader mechanism that was originally created for Transactional Replication. In other words, the 'cdc.MyDB_capture' job is the CDC agent that reads the Transaction Log through the Log Reader. Therefore, Transactional Replication and CDC running for the same database cannot each scan the log with their own reader at the same time. Otherwise, we will get this error:

The capture job cannot be used by Change Data Capture to extract changes from the log when transactional replication is also enabled on the same database. When Change Data Capture and transactional replication are both enabled on a database, use the logreader agent to extract the log changes.

The error message is really clear. Put differently, two log readers cannot run against your database at the same time. When transactional replication is configured, the cdc.MyDB_capture job is (or should have been) dropped automatically, and if you remove Transactional Replication, the cdc.MyDB_capture job is created again. This behaviour exists because Transactional Replication has the highest priority to use the Log Reader Agent. So, if you have transactional replication running for your database and the cdc.MyDB_capture job is still enabled (and running), you will have to disable or drop it manually, since it will keep failing and raising the error above. Thanks for reading.
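The check and the manual fix can be sketched as follows, using the MyDB database name from the job names above.

```sql
-- Is the database both published and enabled for CDC?
SELECT name, is_published, is_cdc_enabled
FROM sys.databases
WHERE name = N'MyDB';

-- Does the capture job still exist, and is it enabled?
SELECT name, enabled
FROM msdb.dbo.sysjobs
WHERE name = N'cdc.MyDB_capture';

-- Disable the capture job so that it stops failing.
EXEC msdb.dbo.sp_update_job
    @job_name = N'cdc.MyDB_capture',
    @enabled  = 0;

-- Alternatively, drop the capture job from inside the database:
-- USE MyDB;
-- EXEC sys.sp_cdc_drop_job @job_type = N'capture';
```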

Saturday, 16 August 2014

SQL Server Replication Error – The specified LSN for repldone log scan occurs before the current start of replication in the log

My latest tip, “SQL Server Replication Error – The specified LSN for repldone log scan occurs before the current start of replication in the log”, has been published today at mssqltips.com. You can read it at http://www.mssqltips.com/sqlservertip/3288/sql-server-replication-error–the-specified-lsn-for-repldone-log-scan-occurs-before-the-current-start-of-replication-in-the-log . Thanks for reading!

Friday, 25 July 2014

SQL Server Transactional Replication Error: Could not find stored procedure error and how to recover it by using sp_scriptpublicationcustomprocs

Today my tip about how to fix the “SQL Server Transactional Replication Error: Could not find stored procedure” has been published online at mssqltips.com. You can read it at http://www.mssqltips.com/sqlservertip/3287/sql-server-transactional-replication-error-could-not-find-stored-procedure-error-and-how-to-recover-it-by-using-spscriptpublicationcustomprocs/ . Thanks for reading!

Wednesday, 5 May 2010

Geo-Replication Performance Gains with Microsoft SQL Server 2008 Running on Windows Server 2008

The MSCOM Ops Team has done tremendous work testing the performance of SQL Server 2005 replication environments on Windows Server 2003 against SQL Server 2008 on Windows Server 2008, and good benefits were achieved as a result of the performance enhancements in the next generation TCP/IP stack. In the paper, the team concludes the following:

“… the team discovered that SQL Server 2008 running on Windows Server 2008 yielded up to 100 times faster performance without requiring any expensive wide area network (WAN) acceleration hardware”

Furthermore, scalability and performance have been improved in SQL Server 2008 Native Client, specifically in the way stored procedures are invoked with ODBC call syntax and OLE DB remote procedure call (RPC) syntax. To find out more details about it, check out Geo-Replication Performance Gains with Microsoft SQL Server 2008 Running on Windows Server 2008. The paper also includes best practices and baselines. Here are some of them:

"Testing showed that using transactional replication with SQL Server 2008 running on Windows Server 2008 dramatically outperformed SQL Server 2005 running on Windows Server 2003. As illustrated in Table 2, the most substantial performance gains occurred when the Publisher and Subscriber were both running SQL Server 2008 on Windows Server 2008…
…Testing also showed that the scope of the performance gains correlated with the type of replication and the type of data. Push subscription replication of character data with SQL Server 2008 running on Windows Server 2008 yielded a 104 percent increase over SQL Server 2005 running on Windows Server 2003, and pull subscription replication of the same data yielded a 1,298 percent gain.
Both lab and real-life testing by the MSCOM Ops team indicate that highly trafficked Web sites can gain the benefits of geo-replication most effectively when the site is built on SQL Server 2008 running on Windows Server 2008. Based on solid evidence of the feasibility of WAN–based geo-replication, MSCOM Ops plans to expand its implementation of this solution.
In addition, the MSCOM Ops team learned several valuable lessons because of its extensive performance testing of SQL Server 2008 running on Windows Server 2008, including:

Windows Server 2008 and SQL Server 2008 with the TCP/IP stack improvements and partnering with application development teams can bolster global user experiences, produce higher availability, higher scalability, and better resiliency for sites, services, and applications through WAN-based geo-replication.
Replication performance is significantly better for pull subscription scenarios than push subscriptions.
The solution identified in this paper will not work for all applications, particularly applications that cannot handle the inherit latency involved with replicating data between geographically dispersed data centers."

To sum up, there are significant gains in terms of performance, scalability and disaster recovery in SQL Server 2008 on Windows Server 2008. Not only is it much faster than SQL Server 2005 on Windows Server 2003, but it is also more secure and cheaper. Therefore, SQL Server Replication technology can be considered a stronger solution, including for high availability purposes.
I hope you enjoy reading both documents, as they are a good read. That is all for now. Let me know any remarks you may have. Thanks for reading.
