vendredi 7 octobre 2011

MS SQL Monitoring / Performance



1] Real Time activity through Nagios

It’s essential to install an efficient monitoring of SQL Server instances. My client tool is Nagios.
Here are the areas we’re looking at :

-       Windows Instance + Agent Services : UP or Down
-       Disk Drives Free Space Instance & Backup Path
-       Error Log : trap key word « Gravité : nn » and send alert if nn >= 16
      2011-08-16 14:07:14.92             Erreur : 3041, Gravité : 16, État : 1.
-       Plugin SQL Server / Nagios : ( see following chapter ) http://exchange.nagios.org/directory/Plugins/Databases/SQLServer/check_mssql_health/details

1.1 Plugin SQL Server Health Check

1.1.1 Standard

  • Frequency 5 minutes
  • Connect to Instance using SQL Server user ( needs Windows & SQL Server Authentication ).

Type
Mode
Unit
GOOD
WARNING
CRITICAL
Comment
Status
connection-time
sec
< 0.5
0.5 < X < 1
> 1

CPU
cpu-busy
%
< 85
85 < X < 95
> 95

IO
io-busy
%
< 80
80 < X < 90
> 90

MEM
mem-pool-data-buffer-hit-ratio
%
> 95
90 < X < 95
< 90

MEM
page-life-expectancy
sec
> 300
150 < X < 300
< 150

MEM
lazy-writes
# / sec
< 15
15 < X < 20
> 20

Backup
database-backup-age
hours
< 50
50 < X < 74
> 74
Exclude tempdb / 50 heures = 2 days+ ( weekend )
Blocked process
sql
# process
= 0
= 1
> 1
select cntr_value from sys.dm_os_performance_counters where counter_name like '%Processes blocked%'
% log used
sql
%
< 90
90 < X < 95
> 95
select max(cntr_value) as "LogFullPct" from sys.dm_os_performance_counters where counter_name like 'Percent Log Used%' and instance_name not in ('_Total', 'mssqlsystemresource', 'model') and cntr_value > 0

% Log Used : is interesting to be aware of happening Log growth.

1.1.2 Add-on - monitoring MS SCCM 2012, we added :

Type
Mode
Unit
GOOD
WARNING
CRITICAL
Comment
CPU
batch-requests
# /sec
< 800
800 < X < 1K
> 1K
number of SQL Statements that are being executed per second
CPU
sql-initcompilations
# / sec
< 80
80 < X < 100
> 100
The first execution of a query on SQL Server generates a compilation and stores the execution plan in the procedure cache.

1.1.3 Add-on : monitoring Always-On configuration see AO post

1.2 Plugin uses SQL Server User ex : Nagios

CREATE LOGIN Nagios WITH PASSWORD = '<password>' , CHECK_POLICY = OFF ;
GRANT VIEW SERVER STATE to Nagios ;

Note :

When Performance counter measure is reaching the Limit -> bounce the instance
Ex :
select object_name,counter_name,instance_name,cntr_value
from sys.dm_os_performance_counters
 where object_name like '%Statistics%' and counter_name like '%batch req%' -- 1 074 540 515


2] Real Time activity through DMVs

2.1 Running SQL, Top Logical Reads

SELECT -- qt.TEXT,
r.session_id,
r.[status] ,
r.wait_type ,
r.command,
r.cpu_time,
r.total_elapsed_time,
r.logical_reads ,
--r.plan_handle,
r.start_time,
SUBSTRING(qt.[text], r.statement_start_offset / 2,
            ( CASE WHEN r.statement_end_offset = -1
                   THEN LEN(CONVERT(NVARCHAR(MAX), qt.[text])) * 2
                   ELSE r.statement_end_offset
              END - r.statement_start_offset ) / 2) AS [statement_executing]
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS qt
WHERE   r.session_id > 50
ORDER BY logical_reads desc ;

2.2 Memory

2.2.1 Instance level - Physical memory REALLY used

-- sql server 2008 physical memory really used dmv
select physical_memory_in_use_kb/1024 as phys_mem_in_use_MB from sys.dm_os_process_memory


Memory in use may be a bit more than the « max server » memory parameter ( CLRs are added ).

-- Real Time, Full report
DBCC MEMORYSTATUS


2.2.2 Instance level - Buffer Hit Ratio

-- Buffer Hit Ratio
-- The calculation is : total number of cache hits \ total number of cache lookups from the previous few thousand page requests.
SELECT ceiling(round(CAST(A.cntr_value1 AS NUMERIC) / CAST(B.cntr_value2 AS NUMERIC),3)*100) AS Buffer_Cache_Hit_Ratio
FROM ( SELECT cntr_value AS cntr_value1 FROM sys.dm_os_performance_counters WHERE object_name like '%Buffer Manager%' AND counter_name = 'Buffer cache hit ratio' ) AS A,
(SELECT cntr_value AS cntr_value2 FROM sys.dm_os_performance_counters WHERE object_name like '%Buffer Manager%' AND counter_name = 'Buffer cache hit ratio base' ) AS B;


2.2.3 Database level - Consumption


-- Note: querying sys.dm_os_buffer_descriptors
-- requires the VIEW_SERVER_STATE permission.

DECLARE @total_buffer INT;

SELECT @total_buffer = cntr_value
   FROM sys.dm_os_performance_counters
   WHERE RTRIM([object_name]) LIKE '%Buffer Manager'
   AND counter_name = 'Total Pages';

;WITH src AS
(
   SELECT
       database_id, db_buffer_pages = COUNT_BIG(*)
       FROM sys.dm_os_buffer_descriptors
       --WHERE database_id BETWEEN 5 AND 32766
       GROUP BY database_id
)
SELECT
   [db_name] = CASE [database_id] WHEN 32767
       THEN 'Resource DB'
       ELSE DB_NAME([database_id]) END,
   db_buffer_pages,
   db_buffer_MB = db_buffer_pages / 128,
   db_buffer_percent = CONVERT(DECIMAL(6,3),
       db_buffer_pages * 100.0 / @total_buffer)
FROM src
ORDER BY db_buffer_MB DESC;


2.2.4 Note : how to reset Used Memory – taken from Internet.

I add a one minute Delay between both commands.

EXEC sys.sp_configure N'max server memory (MB)', 1024
GO
RECONFIGURE WITH OVERRIDE
GO
WAITFOR DELAY '00:01';
GO
EXEC sys.sp_configure N'max server memory (MB)', 6144
GO
RECONFIGURE WITH OVERRIDE
GO

2.3 Sum-Up

-------------
-- 1] OVERALL
-------------

-- tlog size
dbcc sqlperf('logspace')

sp_who2 -- 200 sessions

-- Get Transactions/sec ( cumulative counter ) from http://www.databasejournal.com
-- Collect first sample, then wait for delay, collect again and diff
DECLARE @old_cntr_value INT;
DECLARE @first_sample_date DATETIME;
SELECT @old_cntr_value = cntr_value , @first_sample_date = getdate() FROM sys.dm_os_performance_counters
WHERE (counter_name = 'Transactions/sec' and instance_name = '_Total');
WAITFOR DELAY '00:00:10'
SELECT (cntr_value - @old_cntr_value) / DATEDIFF(ss,@first_sample_date, GETDATE()) as "Transactions/sec"
FROM sys.dm_os_performance_counters
WHERE (counter_name = 'Transactions/sec' and instance_name = '_Total') ;

-- Get Batch Requests/sec ( cumulative counter ) -- 150 , threshold < 800
DECLARE @old_cntr_value INT;
DECLARE @first_sample_date DATETIME;
SELECT @old_cntr_value = cntr_value , @first_sample_date = getdate() FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';
-- Time frame to wait before collecting second sample
WAITFOR DELAY '00:00:10'
-- Collect second sample and calculate per-second counter
SELECT (cntr_value - @old_cntr_value) / DATEDIFF(ss,@first_sample_date, GETDATE()) as "Batch Requests/sec"
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec' ;

-- Real Time activity through DMVs, Running SQL, Top Logical Reads
SELECT -- qt.TEXT,
r.session_id,
r.[status] ,
r.wait_type ,
r.command,
r.cpu_time,
r.total_elapsed_time,
r.logical_reads ,
--r.plan_handle,
r.start_time,
SUBSTRING(qt.[text], r.statement_start_offset / 2,
            ( CASE WHEN r.statement_end_offset = -1
                   THEN LEN(CONVERT(NVARCHAR(MAX), qt.[text])) * 2
                   ELSE r.statement_end_offset
              END - r.statement_start_offset ) / 2) AS [statement_executing]
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS qt
WHERE   r.session_id > 50
ORDER BY logical_reads desc ;

-------------
-- 2] CPU
-------------

-- CPU bottleneck, one row per CPU : look at the runnable count
SELECT scheduler_id,       cpu_id,       current_tasks_count,       runnable_tasks_count,
       current_workers_count,       active_workers_count,       work_queue_count
FROM   sys.dm_os_schedulers WHERE  scheduler_id < 255; -- Get Transactions/sec ( cumulative counter ) from http://www.databasejournal.com

-- Get CPU Utilization History for last 120 minutes (in one minute intervals)
-- This version works with SQL Server 2008 and SQL Server 2008 R2 only
-- from http://www.sqlserverspecialists.com/2013/05/script-to-get-cpu-utilization-history.html

-- % CPU Used
select MIN(100-[System Idle Process]) MIN_CPU_USED_PCT,MAX(100-[System Idle Process])MAX_CPU_USED_PCT,AVG(100-[System Idle Process])AVG_CPU_USED_PCT
 from (
SELECT TOP(120)
SQLProcessUtilization AS [SQL Server Process CPU Utilization],
               SystemIdle AS [System Idle Process],
               100 - SystemIdle - SQLProcessUtilization AS [Other Process CPU Utilization],
               DATEADD(ms, -1 * ((SELECT cpu_ticks/(cpu_ticks/ms_ticks)FROM sys.dm_os_sys_info) - [timestamp]), GETDATE()) AS [Event Time]
FROM (
                  SELECT record.value('(./Record/@id)[1]', 'int') AS record_id,
                                                record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')
                                                AS [SystemIdle],
                                                record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]',
                                                'int')
                                                AS [SQLProcessUtilization], [timestamp]
                  FROM (
                                                SELECT [timestamp], CONVERT(xml, record) AS [record]
                                                FROM sys.dm_os_ring_buffers
                                                WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
                                                AND record LIKE N'%<SystemHealth>%') AS x
                  ) AS y
-- ORDER BY record_id DESC OPTION (RECOMPILE)
) AS A

-------------
-- 3] MEMORY
-------------

-- sql server 2008+ physical memory really used by Instance dmv
select physical_memory_in_use_kb/1024 as phys_mem_in_use_MB from sys.dm_os_process_memory -- 2350

SELECT
(physical_memory_in_use_kb/1024) AS Memory_usedby_Sqlserver_MB,
(locked_page_allocations_kb/1024) AS Locked_pages_used_Sqlserver_MB,
(total_virtual_address_space_kb/1024) AS Total_VAS_in_MB,
process_physical_memory_low,
process_virtual_memory_low
FROM sys.dm_os_process_memory;


-- Buffer Hit Ratio
-- The calculation is : total number of cache hits \ total number of cache lookups from the previous few thousand page requests.
SELECT ceiling(round(CAST(A.cntr_value1 AS NUMERIC) / CAST(B.cntr_value2 AS NUMERIC),3)*100) AS Buffer_Cache_Hit_Ratio
FROM ( SELECT cntr_value AS cntr_value1 FROM sys.dm_os_performance_counters WHERE object_name like '%Buffer Manager%' AND counter_name = 'Buffer cache hit ratio' ) AS A,
(SELECT cntr_value AS cntr_value2 FROM sys.dm_os_performance_counters WHERE object_name like '%Buffer Manager%' AND counter_name = 'Buffer cache hit ratio base' ) AS B;

-- db buffer per database -- 850 MB
DECLARE @total_buffer INT;
SELECT @total_buffer = cntr_value
   FROM sys.dm_os_performance_counters
   WHERE RTRIM([object_name]) LIKE '%Buffer Manager'
   AND counter_name = 'Total Pages';
;WITH src AS
(
   SELECT
       database_id, db_buffer_pages = COUNT_BIG(*)
       FROM sys.dm_os_buffer_descriptors
       --WHERE database_id BETWEEN 5 AND 32766
       GROUP BY database_id
)
SELECT
   [db_name] = CASE [database_id] WHEN 32767
       THEN 'Resource DB'
       ELSE DB_NAME([database_id]) END,
   db_buffer_pages,
   db_buffer_MB = db_buffer_pages / 128,
   db_buffer_percent = CONVERT(DECIMAL(6,3),
       db_buffer_pages * 100.0 / @total_buffer)
FROM src
ORDER BY db_buffer_MB DESC;


3] Find Missing Index

unique_compiles—shows the number of compilations and recompilations that would have benefited from this missing-index group. These compilations don't all have to be the same query; many different queries can potentially make use of the suggested indexes.
user_seeks—shows the number of seek operations performed in user queries for which the recommended index in the group could have been used.
user_scans—shows the number of scan operations performed in user queries for which the recommended index in the group could have been used.
avg_user_impact—shows the average percentage improvement that user queries might have if the missingindex group were implemented. On average, all user queries (seeks and scans) that use an implemented missing-index group would show in a decrease in this percentage in their execution cost.

On selected tables as follow : btw it removed a big deadlock issue caused by slow triggers …

SELECT object_name(mid.[object_id],mid.database_id) as objectname,avg_user_impact,unique_compiles,user_seeks,user_scans,last_user_seek,
 CONVERT (decimal (28,1), migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans)) AS improv_measure, 'CREATE INDEX missing_index_' + CONVERT (varchar, mig.index_group_handle) + '_' + CONVERT (varchar, mid.index_handle) + ' ON ' + mid.statement + ' (' + ISNULL (mid.equality_columns,'') + CASE WHEN mid.equality_columns IS NOT NULL AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END + ISNULL (mid.inequality_columns, '') + ')' + ISNULL (' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement -- , migs.*
, mid.database_id
FROM sys.dm_db_missing_index_groups mig
INNER JOIN sys.dm_db_missing_index_group_stats migs ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle
WHERE CONVERT (decimal (28,1), migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans)) > 1000
/*
and (mid.statement like '%BASE_EVO04%' or mid.statement like '%BASE_CONF04%')
and (mid.statement not like '%SCHEMA05%')
*/
and (migs.user_seeks>50 or migs.user_scans>50)
and datediff(hh,last_user_seek,getdate())<=24
and avg_user_impact>30
and object_name(mid.[object_id],mid.database_id) IN ('TABLE_AM','TABLE_SD')
ORDER BY migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) DESC

Table
avg_user_impact
improv_measure
create_index_statement
TABLE_SD
94,22
16656713.9
CREATE INDEX missing_index_140009_140008 ON [BASE_EVO04].[SCHEMA04].[TABLE_SD] ([STATUS_ID]) INCLUDE ([TABLE_SD_ID], [RECIPIENT_ID], [REQUESTOR_ID], [URGENCY_ID])
TABLE_AM
99,99
4178276.0
CREATE INDEX missing_index_140269_140268 ON [BASE_EVO04].[SCHEMA04].[TABLE_AM] ([ORIGIN_ACTION_ID]) INCLUDE ([ACTION_ID])
TABLE_AM
99,18
1363032.1
CREATE INDEX missing_index_140151_140150 ON [BASE_EVO04].[SCHEMA04].[TABLE_AM] ([DONE_BY_ID]) INCLUDE ([ACTION_ID], [REQUEST_ID], [EXPECTED_END_DATE_UT], [START_DATE_UT], [END_DATE_UT], [EXPECTED_START_DATE_UT], [ACTION_LABEL_FR], [MAX_INTERVENTION_DATE_UT], [ACTION_TYPE_ID])
TABLE_AM
42,36
703360.6
CREATE INDEX missing_index_141131_141130 ON [BASE_EVO04].[SCHEMA04].[TABLE_AM] ([GROUP_ID]) INCLUDE ([ACTION_ID], [DONE_BY_ID], [REQUEST_ID], [ACTION_TYPE_ID])
TABLE_SD
99,15
396373.0
CREATE INDEX missing_index_140272_140271 ON [BASE_EVO04].[SCHEMA04].[TABLE_SD] ([REQUEST_PROJECT_ID]) INCLUDE ([REQUEST_ID])
TABLE_SD
80,49
381052.0
CREATE INDEX missing_index_140781_140780 ON [BASE_EVO04].[SCHEMA04].[TABLE_SD] ([STATUS_ID]) INCLUDE ([REQUEST_ID], [TABLE_SD_ID], [SUBMIT_DATE_UT], [RECIPIENT_ID], [REQUESTOR_ID])
TABLE_SD
69,77
11127.1
CREATE INDEX missing_index_162417_162416 ON [BASE_EVO04].[SCHEMA04].[TABLE_SD] ([PARENT_REQUEST_ID]) INCLUDE ([TABLE_SD_ID])
TABLE_AM
32,17
9754.5
CREATE INDEX missing_index_11_10 ON [EVO_DATA40000].[40000].[TABLE_AM] ([END_DATE_UT])

Table
avg_user_impact
compiles
seeks
improv_measure
TABLE_SD
98,06
3048
35457
6782659.9
TABLE_SD
97,97
3048
35457
6779946.9
TABLE_SD
96,92
3048
35457
6703807.8


4] Monitor SQL in the Plan Cache


-- Statistics on objects
use EVO_DATA50004
SELECT
OBJECT_NAME([sp].[object_id]) AS "Table",
[sp].[stats_id] AS "Statistic ID",[s].[name] AS "Statistic",
[sp].[last_updated] AS "Last Updated",[sp].[rows],[sp].[rows_sampled],
[sp].[unfiltered_rows],[sp].[modification_counter] AS "Modifications"
FROM [sys].[stats] AS [s]
OUTER APPLY sys.dm_db_stats_properties ([s].[object_id],[s].[stats_id]) AS [sp]
WHERE [s].[object_id] = OBJECT_ID(N'50004.AM_EMPLOYEE');

-- TOP 10 SQL in Plan Cache using DMV
-- order by LIOs
SELECT TOP 10 SUBSTRING(qt.TEXT, (qs.statement_start_offset/2)+1,
 ((CASE qs.statement_end_offset
 WHEN -1 THEN DATALENGTH(qt.TEXT)
 ELSE qs.statement_end_offset
 END - qs.statement_start_offset)/2)+1),
 qs.execution_count,
 qs.total_logical_reads, qs.last_logical_reads,
 qs.total_logical_writes, qs.last_logical_writes,
 qs.total_worker_time,
 qs.last_worker_time,
 qs.total_elapsed_time/1000000 total_elapsed_time_in_S,
 qs.last_elapsed_time/1000000 last_elapsed_time_in_S,
 qs.last_execution_time,
 qp.query_plan
 FROM sys.dm_exec_query_stats qs
 CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
 CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
 ORDER BY qs.total_logical_reads DESC -- logical reads
 -- ORDER BY qs.total_logical_writes DESC -- logical writes
 -- ORDER BY qs.total_worker_time DESC -- CPU time

 -- TOP SQL more detailed , per execution
 SELECT TOP 10
    GETDATE() AS "Collection Date",
    qs.execution_count AS "Execution Count",
    SUBSTRING(qt.text,qs.statement_start_offset/2 +1,
                 (CASE WHEN qs.statement_end_offset = -1
                       THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2
                       ELSE qs.statement_end_offset END -
                            qs.statement_start_offset
                 )/2
             ) AS "Query Text",
     DB_NAME(qt.dbid) AS "DB Name",
     qs.total_worker_time AS "Total CPU Time",
     qs.total_worker_time/qs.execution_count AS "Avg CPU Time (ms)",    
     qs.total_physical_reads AS "Total Physical Reads",
     qs.total_physical_reads/qs.execution_count AS "Avg Physical Reads",
     qs.total_logical_reads AS "Total Logical Reads",
     qs.total_logical_reads/qs.execution_count AS "Avg Logical Reads",
     qs.total_logical_writes AS "Total Logical Writes",
     qs.total_logical_writes/qs.execution_count AS "Avg Logical Writes",
     qs.total_elapsed_time AS "Total Duration",
     qs.total_elapsed_time/qs.execution_count AS "Avg Duration (ms)",
     qp.query_plan AS "Plan"
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE
     qs.execution_count > 50 OR
     qs.total_worker_time/qs.execution_count > 100 OR
     qs.total_physical_reads/qs.execution_count > 1000 OR
     qs.total_logical_reads/qs.execution_count > 1000 OR
     qs.total_logical_writes/qs.execution_count > 1000 OR
     qs.total_elapsed_time/qs.execution_count > 1000
ORDER BY
     --qs.execution_count DESC,
     --qs.total_elapsed_time/qs.execution_count DESC,
     --qs.total_worker_time/qs.execution_count DESC,
     --qs.total_physical_reads/qs.execution_count DESC,
     qs.total_logical_reads DESC -- /qs.execution_count DESC -- ,
     --qs.total_logical_writes/qs.execution_count DESC