[SQL Server] A Bug That Causes Error 3624

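SQL Server administrators dread seeing severe errors such as 823, 824, 1788x, and 3624.
Once one occurs, it always takes considerable effort to deal with, and data loss is a real possibility.
Today, though, I stepped on a landmine: an error 3624 that was not really an error 3624...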


Colin, 2017-11-22

PS. The issue described in this post has been fixed in SQL Server 2016.

https://connect.microsoft.com/SQLServer/feedback/details/2936151/assertion-failure-when-using-option-recompile-with-an-invalid-offset-clause

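Anyone who manages SQL Server regularly will be familiar with its handful of severe error numbers, such as 823, 824, 605, 1788x, and 3624. They all mean one of three things: the SQL Server instance has a problem, a resource the database depends on (memory, physical disk) is misbehaving, or an object stored in the database is damaged. In most cases these errors have to be analyzed further with DBCC CHECKDB and SQL dumps, and in most cases the data has to be rescued by restoring a backup.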

Today I stepped on a big landmine. Here is what happened...

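Just as I was getting ready to leave for the day, a message popped up with a screenshot of an application error. It said that when the application executed a certain procedure, it intermittently hit SQL exception ERROR 3624. Seeing that error made me uneasy right away: surely this is not database corruption!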

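While everyone else was Googling the error, I had run into it several times before, so I went straight to the DBAs and asked them to run DBCC CHECKDB. Yes, most articles say the same thing: run DBCC CHECKDB first. However, DBCC CHECKDB reported nothing abnormal... which was odd.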

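After that, we ran DBCC CHECKTABLE on the tables one by one, and every one of them came back clean as well... Fortunately, error 3624 produces a dump, and I was fairly sure the error had nothing to do with the database itself (although I was still uneasy). Going back to the dump, the failure point was in the procedure execution phase, so I reviewed the procedure code, but however I looked at it, nothing seemed wrong.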
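For reference, the integrity checks mentioned above look roughly like the following. This is only a minimal sketch: the database name DEMO and the table name tbl_test are borrowed from the repro later in this post and stand in for the real production objects, which are not named here.

```sql
-- Check the whole database. NO_INFOMSGS suppresses informational output,
-- ALL_ERRORMSGS reports every error found per object.
dbcc checkdb (N'DEMO') with no_infomsgs, all_errormsgs;
go

-- Check a single table in the current database.
-- (DEMO and tbl_test are placeholder names taken from the repro.)
use DEMO;
go
dbcc checktable (N'tbl_test') with no_infomsgs, all_errormsgs;
go
```

If both commands finish without reporting errors, as they did here, the cause is more likely to be in the engine or in the query than in the stored data.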

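That is, until... I tried the parameters one by one and found that the error was triggered by the following combination:

* The procedure uses option (recompile)

* It also uses offset … fetch next … rows

* And the value passed to offset is negative.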

Let's go straight to reproducing the problem.

use DEMO;
go

-- Create the test table
create table tbl_test
(c1 int, c2 varchar(10));
go

-- Insert test data
insert into tbl_test values (1, 'a'),(2,'b'),(3,'c');
go

-- A common use case: paged results
-- Create the procedure
create procedure page_list
@page int,
@size int
as
begin
    select * from tbl_test
    order by c1 asc
    offset (@page - 1) * @size rows fetch next @size rows only;
end
go

-- Test: normal results
exec page_list 1,2;
/*
c1  c2
1   a
2   b
*/

exec page_list 2,2;
/*
c1  c2
3   c
*/

-- Test: a negative value in OFFSET
-- The error is 10742
exec page_list 0,2;
/*
Msg 10742, Level 15, State 1, Procedure page_list, Line 7 [Batch Start Line 34]
The offset specified in a OFFSET clause may not be negative.
*/

Up to this point the error is the expected one: specifying a negative value in OFFSET raises error 10742. Once option (recompile) is added, however, things change.

-- Add option (recompile) to the procedure
alter procedure page_list
@page int,
@size int
as
begin
    select * from tbl_test
    order by c1 asc
    offset (@page - 1) * @size rows fetch next @size rows only
    option (recompile);
end
go

-- Pass in a negative value again
-- This time there is a short pause => a dump is being written
-- Then the error is reported
exec page_list 0,2;
/*
Location:   op_ppqte.cpp:12267
Expression: llSkip >= 0
SPID:       51
Process ID: 652
Msg 3624, Level 20, State 1, Procedure page_list, Line 6 [Batch Start Line 52]
A system assertion check has failed. Check the SQL Server error log for details. Typically, an assertion failure is caused by a software bug or data corruption. To check for database corruption, consider running DBCC CHECKDB. If you agreed to send dumps to Microsoft during setup, a mini dump will be sent to Microsoft. An update might be available from Microsoft in the latest Service Pack or in a Hotfix from Technical Support.
Msg 596, Level 21, State 1, Line 52
Cannot continue the execution because the session is in the kill state.
Msg 0, Level 20, State 0, Line 52
A severe error occurred on the current command.  The results, if any, should be discarded.
*/

I could not figure out why combining OFFSET with option (recompile) would raise an error that points to corruption, and even write a dump.

(Partial dump)

Memory
MemoryLoad = 30%
Total Physical = 2047 MB
Available Physical = 1413 MB
Total Page File = 2431 MB
Available Page File = 1714 MB
Total Virtual = 134217727 MB
Available Virtual = 134212073 MB
Dump thread - spid = 0, EC = 0x00000000F7F3AC60
*Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\LOG\SQLDump0001.txt
* *
*
* BEGIN STACK DUMP:
*   11/23/17 00:04:28 spid 51
*
* Location: op_ppqte.cpp:12267
* Expression:   llSkip >= 0
* SPID:     51
* Process ID:   652
*
* Input Buffer 62 bytes -
*             exec page_list -4,2;

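Every source I could find said to run DBCC CHECKDB, but the test database had only just been created and contained nothing but this one table... No matter how I checked it, nothing was wrong.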

-- The ERRORLOG shows part of the dump content
-- along with the suggested handling for the error
sp_readerrorlog
/*
Error: 17066, Severity: 16, State: 1.
SQL Server Assertion: File: <op_ppqte.cpp>, line=12267 Failed Assertion = 'llSkip >= 0'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.
Error: 3624, Severity: 20, State: 1.
A system assertion check has failed. Check the SQL Server error log for details. Typically, an assertion failure is caused by a software bug or data corruption. To check for database corruption, consider running DBCC CHECKDB. If you agreed to send dumps to Microsoft during setup, a mini dump will be sent to Microsoft. An update might be available from Microsoft in the latest Service Pack or in a Hotfix from Technical Support.
*/
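When the error log is large, it can help to narrow the search instead of reading the whole log. The following is a rough sketch, not a verified procedure: sp_readerrorlog is an undocumented system procedure whose parameters (log number, log type, search string) are assumed here from common usage, and sys.dm_server_memory_dumps may not be available on every build.

```sql
-- Search the current SQL Server error log (log 0, type 1 = SQL Server log)
-- for entries containing the word "Assertion".
exec sp_readerrorlog 0, 1, N'Assertion';
go

-- List the memory dump files the instance has recorded, newest first.
select filename, creation_time, size_in_bytes
from sys.dm_server_memory_dumps
order by creation_time desc;
go

-- Record the exact build, which matters when checking whether a fix applies.
select @@version as version_info;
go
```

The assertion text ('llSkip >= 0') and the source location (op_ppqte.cpp:12267) are the most useful search keys when looking for a matching bug report.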

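I had intended to report this as a bug or a design issue, because this behavior does not look like a genuine ERROR 3624. In the end I found the bug-fix documentation and confirmed that the same procedure no longer fails this way on SQL Server 2016.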
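On versions that do not yet have the fix, one possible way to avoid the assertion is to reject invalid paging parameters before the OFFSET statement is reached. The following is a sketch of that idea, not something verified against an affected build: the guard condition, the error text, and the choice of raiserror are my own assumptions, and only page_list, tbl_test, @page, and @size come from the repro above.

```sql
alter procedure page_list
    @page int,
    @size int
as
begin
    -- Guard (assumption, not part of the original procedure):
    -- a @page below 1 would make the OFFSET expression negative, and a
    -- @size below 1 is not a valid FETCH row count. Stop here so the
    -- paging statement is never executed with such values.
    if @page is null or @page < 1 or @size is null or @size < 1
    begin
        raiserror(N'page_list: @page and @size must both be >= 1.', 16, 1);
        return;
    end;

    select c1, c2
    from tbl_test
    order by c1 asc
    offset (@page - 1) * @size rows fetch next @size rows only
    option (recompile);
end
go

-- Invalid input now stops at the guard with a severity 16 user error.
exec page_list 0, 2;

-- Valid input pages as before.
exec page_list 1, 2;
```

The point of the design is that the caller gets an ordinary, catchable error rather than a severity 20 assertion that kills the session and writes a dump. Validating the page number in the application before calling the procedure would serve the same purpose.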


Original post: 大专栏, [SQL Server] A Bug That Causes Error 3624



Reposted from www.cnblogs.com/petewell/p/11490090.html