1] Real Time activity through Nagios
It’s essential to set up efficient monitoring of SQL Server instances. My client’s monitoring tool is Nagios.
Here are the areas we look at:
- Windows Instance + Agent services: UP or DOWN
- Disk drives free space: instance & backup path
- Error log: trap the keyword « Gravité : nn » (severity) and send an alert if nn >= 16, e.g.:
  2011-08-16 14:07:14.92 Erreur : 3041, Gravité : 16, État : 1.
  (a manual T-SQL equivalent is sketched just after this list)
- Plugin SQL Server / Nagios (see the following chapter): http://exchange.nagios.org/directory/Plugins/Databases/SQLServer/check_mssql_health/details
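For a manual check outside Nagios, a minimal sketch reads the current error log for the trapped keyword (assumes a French-localized instance, so the log contains « Gravité »; the two search strings are simply ANDed per line, so this only approximates the nn >= 16 rule):
-- read the current SQL Server error log (0 = current log, 1 = SQL error log)
-- and keep only the lines containing both search strings
EXEC master.dbo.xp_readerrorlog 0, 1, N'Gravité', N'16';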
1.1 Plugin SQL Server Health Check
1.1.1 Standard
- Frequency: 5 minutes
- Connect to the instance using a SQL Server user (requires Windows & SQL Server Authentication mode).
Type | Mode | Unit | GOOD | WARNING | CRITICAL | Comment
--- | --- | --- | --- | --- | --- | ---
Status | connection-time | sec | < 0.5 | 0.5 < X < 1 | > 1 |
CPU | cpu-busy | % | < 85 | 85 < X < 95 | > 95 |
IO | io-busy | % | < 80 | 80 < X < 90 | > 90 |
MEM | mem-pool-data-buffer-hit-ratio | % | > 95 | 90 < X < 95 | < 90 |
MEM | page-life-expectancy | sec | > 300 | 150 < X < 300 | < 150 |
MEM | lazy-writes | # / sec | < 15 | 15 < X < 20 | > 20 |
Backup | database-backup-age | hours | < 50 | 50 < X < 74 | > 74 | Exclude tempdb / 50 hours = 2+ days (weekend)
Blocked process | sql | # process | = 0 | = 1 | > 1 | select cntr_value from sys.dm_os_performance_counters where counter_name like '%Processes blocked%'
% log used | sql | % | < 90 | 90 < X < 95 | > 95 | select max(cntr_value) as "LogFullPct" from sys.dm_os_performance_counters where counter_name like 'Percent Log Used%' and instance_name not in ('_Total', 'mssqlsystemresource', 'model') and cntr_value > 0

% Log Used is useful to be aware of log growth as it happens.
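Outside the plugin, the same counters can be queried directly; a minimal sketch for page-life-expectancy, one of the modes above:
-- current page life expectancy, in seconds (compare with the 300 / 150 thresholds above)
select cntr_value as page_life_expectancy_sec
from sys.dm_os_performance_counters
where object_name like '%Buffer Manager%'
  and counter_name = 'Page life expectancy';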
1.1.2 Add-on: monitoring MS SCCM
In 2012, we added:
Type | Mode | Unit | GOOD | WARNING | CRITICAL | Comment
--- | --- | --- | --- | --- | --- | ---
CPU | batch-requests | # / sec | < 800 | 800 < X < 1K | > 1K | Number of SQL statements that are being executed per second
CPU | sql-initcompilations | # / sec | < 80 | 80 < X < 100 | > 100 | The first execution of a query on SQL Server generates a compilation and stores the execution plan in the procedure cache.
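Both counters are exposed as cumulative values in the DMVs (the per-second rates require two samples and a diff, as shown in the Sum-Up section). A quick sketch to read them directly, assuming sql-initcompilations maps to the 'SQL Compilations/sec' counter:
-- cumulative values of the two counters above, under the SQL Statistics object
select counter_name, cntr_value
from sys.dm_os_performance_counters
where object_name like '%SQL Statistics%'
  and counter_name in ('Batch Requests/sec', 'SQL Compilations/sec');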
1.1.3 Add-on: monitoring the Always-On configuration (see the AO post)
1.2 Plugin uses a SQL Server user, e.g. Nagios
CREATE LOGIN Nagios WITH PASSWORD = '<password>', CHECK_POLICY = OFF;
GRANT VIEW SERVER STATE TO Nagios;
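A quick way to verify the grant, as a sketch (assumes you run it with enough rights to impersonate the login):
-- impersonate the Nagios login and confirm it can read the performance-counter DMV
EXECUTE AS LOGIN = 'Nagios';
SELECT TOP (1) counter_name FROM sys.dm_os_performance_counters;
REVERT;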
Note:
When a cumulative performance counter value is reaching its limit -> bounce the instance.
Example:
select object_name, counter_name, instance_name, cntr_value
from sys.dm_os_performance_counters
where object_name like '%Statistics%'
  and counter_name like '%batch req%' -- 1 074 540 515
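If the concern is the counter approaching the 32-bit integer maximum (an assumption on my part; the DMV column itself is a bigint), a quick sketch to see the remaining headroom:
-- remaining headroom before the batch-requests counter reaches the 32-bit maximum
select cntr_value,
       2147483647 - cntr_value as headroom_to_int_max
from sys.dm_os_performance_counters
where object_name like '%Statistics%'
  and counter_name like '%batch req%';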
2] Real Time activity through DMVs
2.1 Running SQL, Top Logical Reads
SELECT -- qt.TEXT,
    r.session_id,
    r.[status],
    r.wait_type,
    r.command,
    r.cpu_time,
    r.total_elapsed_time,
    r.logical_reads,
    --r.plan_handle,
    r.start_time,
    SUBSTRING(qt.[text], r.statement_start_offset / 2,
        (CASE WHEN r.statement_end_offset = -1
              THEN LEN(CONVERT(NVARCHAR(MAX), qt.[text])) * 2
              ELSE r.statement_end_offset
         END - r.statement_start_offset) / 2) AS [statement_executing]
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS qt
WHERE r.session_id > 50
ORDER BY logical_reads DESC;
2.2 Memory
2.2.1 Instance level - Physical memory REALLY used
-- SQL Server 2008: physical memory really used, via DMV
select physical_memory_in_use_kb/1024 as phys_mem_in_use_MB
from sys.dm_os_process_memory
Memory in use may be slightly higher than the 'max server memory' parameter (CLR allocations are added on top).
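A minimal sketch to put the two figures side by side (assumes VIEW SERVER STATE, as for the other DMV queries):
-- configured max server memory vs. physical memory actually in use by the instance
select c.value_in_use as max_server_memory_MB,
       p.physical_memory_in_use_kb/1024 as phys_mem_in_use_MB
from sys.configurations c
cross join sys.dm_os_process_memory p
where c.name = 'max server memory (MB)';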
-- Real time, full report
DBCC MEMORYSTATUS
2.2.2 Instance level - Buffer Hit Ratio
-- Buffer Hit Ratio
-- The calculation is: total number of cache hits / total number of cache lookups,
-- over the previous few thousand page requests.
SELECT CEILING(ROUND(CAST(A.cntr_value1 AS NUMERIC) / CAST(B.cntr_value2 AS NUMERIC), 3) * 100) AS Buffer_Cache_Hit_Ratio
FROM (SELECT cntr_value AS cntr_value1
      FROM sys.dm_os_performance_counters
      WHERE object_name like '%Buffer Manager%'
        AND counter_name = 'Buffer cache hit ratio') AS A,
     (SELECT cntr_value AS cntr_value2
      FROM sys.dm_os_performance_counters
      WHERE object_name like '%Buffer Manager%'
        AND counter_name = 'Buffer cache hit ratio base') AS B;
2.2.3 Database level - Consumption
-- Note: querying sys.dm_os_buffer_descriptors
-- requires the VIEW SERVER STATE permission.
DECLARE @total_buffer INT;
SELECT @total_buffer = cntr_value
FROM sys.dm_os_performance_counters
WHERE RTRIM([object_name]) LIKE '%Buffer Manager'
  AND counter_name = 'Total Pages';

;WITH src AS
(
    SELECT database_id, db_buffer_pages = COUNT_BIG(*)
    FROM sys.dm_os_buffer_descriptors
    --WHERE database_id BETWEEN 5 AND 32766
    GROUP BY database_id
)
SELECT [db_name] = CASE [database_id]
                       WHEN 32767 THEN 'Resource DB'
                       ELSE DB_NAME([database_id]) END,
       db_buffer_pages,
       db_buffer_MB = db_buffer_pages / 128,
       db_buffer_percent = CONVERT(DECIMAL(6,3), db_buffer_pages * 100.0 / @total_buffer)
FROM src
ORDER BY db_buffer_MB DESC;
2.2.4 Note: how to reset used memory (taken from the Internet)
I add a one-minute delay between the two commands.
EXEC sys.sp_configure N'max server memory (MB)', 1024
GO
RECONFIGURE WITH OVERRIDE
GO
WAITFOR DELAY '00:01';
GO
EXEC sys.sp_configure N'max server memory (MB)', 6144
GO
RECONFIGURE WITH OVERRIDE
GO
2.3 Sum-Up
-------------
-- 1] OVERALL
-------------
-- tlog size
dbcc sqlperf('logspace')
sp_who2 -- 200 sessions
-- Get Transactions/sec (cumulative counter), from http://www.databasejournal.com
-- Collect a first sample, wait for the delay, collect again and diff
DECLARE @old_cntr_value INT;
DECLARE @first_sample_date DATETIME;
SELECT @old_cntr_value = cntr_value, @first_sample_date = getdate()
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Transactions/sec' and instance_name = '_Total';
WAITFOR DELAY '00:00:10'
SELECT (cntr_value - @old_cntr_value) / DATEDIFF(ss, @first_sample_date, GETDATE()) as "Transactions/sec"
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Transactions/sec' and instance_name = '_Total';
-- Get Batch Requests/sec (cumulative counter) -- 150, threshold < 800
DECLARE @old_cntr_value INT;
DECLARE @first_sample_date DATETIME;
SELECT @old_cntr_value = cntr_value, @first_sample_date = getdate()
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';
-- Time frame to wait before collecting the second sample
WAITFOR DELAY '00:00:10'
-- Collect the second sample and calculate the per-second counter
SELECT (cntr_value - @old_cntr_value) / DATEDIFF(ss, @first_sample_date, GETDATE()) as "Batch Requests/sec"
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';
-- Real Time activity through DMVs, Running SQL, Top Logical Reads
SELECT -- qt.TEXT,
    r.session_id,
    r.[status],
    r.wait_type,
    r.command,
    r.cpu_time,
    r.total_elapsed_time,
    r.logical_reads,
    --r.plan_handle,
    r.start_time,
    SUBSTRING(qt.[text], r.statement_start_offset / 2,
        (CASE WHEN r.statement_end_offset = -1
              THEN LEN(CONVERT(NVARCHAR(MAX), qt.[text])) * 2
              ELSE r.statement_end_offset
         END - r.statement_start_offset) / 2) AS [statement_executing]
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS qt
WHERE r.session_id > 50
ORDER BY logical_reads DESC;
-------------
-- 2] CPU
-------------
-- CPU bottleneck, one row per CPU: look at the runnable_tasks_count
SELECT scheduler_id, cpu_id, current_tasks_count, runnable_tasks_count,
       current_workers_count, active_workers_count, work_queue_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255;
-- Get CPU Utilization History for the last 120 minutes (in one-minute intervals)
-- This version works with SQL Server 2008 and SQL Server 2008 R2 only
-- from http://www.sqlserverspecialists.com/2013/05/script-to-get-cpu-utilization-history.html
-- % CPU used
select MIN(100 - [System Idle Process]) MIN_CPU_USED_PCT,
       MAX(100 - [System Idle Process]) MAX_CPU_USED_PCT,
       AVG(100 - [System Idle Process]) AVG_CPU_USED_PCT
from
(
    SELECT TOP(120)
        SQLProcessUtilization AS [SQL Server Process CPU Utilization],
        SystemIdle AS [System Idle Process],
        100 - SystemIdle - SQLProcessUtilization AS [Other Process CPU Utilization],
        DATEADD(ms, -1 * ((SELECT cpu_ticks/(cpu_ticks/ms_ticks) FROM sys.dm_os_sys_info) - [timestamp]), GETDATE()) AS [Event Time]
    FROM (
        SELECT record.value('(./Record/@id)[1]', 'int') AS record_id,
               record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS [SystemIdle],
               record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS [SQLProcessUtilization],
               [timestamp]
        FROM (
            SELECT [timestamp], CONVERT(xml, record) AS [record]
            FROM sys.dm_os_ring_buffers
            WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
              AND record LIKE N'%<SystemHealth>%'
        ) AS x
    ) AS y
    -- ORDER BY record_id DESC OPTION (RECOMPILE)
) AS A
-------------
-- 3] MEMORY
-------------
-- SQL Server 2008+: physical memory really used by the instance, via DMV
select physical_memory_in_use_kb/1024 as phys_mem_in_use_MB
from sys.dm_os_process_memory -- 2350
SELECT (physical_memory_in_use_kb/1024) AS Memory_usedby_Sqlserver_MB,
       (locked_page_allocations_kb/1024) AS Locked_pages_used_Sqlserver_MB,
       (total_virtual_address_space_kb/1024) AS Total_VAS_in_MB,
       process_physical_memory_low,
       process_virtual_memory_low
FROM sys.dm_os_process_memory;
-- Buffer Hit Ratio
-- The calculation is: total number of cache hits / total number of cache lookups,
-- over the previous few thousand page requests.
SELECT CEILING(ROUND(CAST(A.cntr_value1 AS NUMERIC) / CAST(B.cntr_value2 AS NUMERIC), 3) * 100) AS Buffer_Cache_Hit_Ratio
FROM (SELECT cntr_value AS cntr_value1
      FROM sys.dm_os_performance_counters
      WHERE object_name like '%Buffer Manager%'
        AND counter_name = 'Buffer cache hit ratio') AS A,
     (SELECT cntr_value AS cntr_value2
      FROM sys.dm_os_performance_counters
      WHERE object_name like '%Buffer Manager%'
        AND counter_name = 'Buffer cache hit ratio base') AS B;
-- db buffer per database -- 850 MB
DECLARE @total_buffer INT;
SELECT @total_buffer = cntr_value
FROM sys.dm_os_performance_counters
WHERE RTRIM([object_name]) LIKE '%Buffer Manager'
  AND counter_name = 'Total Pages';

;WITH src AS
(
    SELECT database_id, db_buffer_pages = COUNT_BIG(*)
    FROM sys.dm_os_buffer_descriptors
    --WHERE database_id BETWEEN 5 AND 32766
    GROUP BY database_id
)
SELECT [db_name] = CASE [database_id]
                       WHEN 32767 THEN 'Resource DB'
                       ELSE DB_NAME([database_id]) END,
       db_buffer_pages,
       db_buffer_MB = db_buffer_pages / 128,
       db_buffer_percent = CONVERT(DECIMAL(6,3), db_buffer_pages * 100.0 / @total_buffer)
FROM src
ORDER BY db_buffer_MB DESC;
3] Find Missing Index
unique_compiles: shows the number of compilations and recompilations that would have benefited from this missing-index group. These compilations don't all have to be the same query; many different queries can potentially make use of the suggested indexes.
user_seeks: shows the number of seek operations performed in user queries for which the recommended index in the group could have been used.
user_scans: shows the number of scan operations performed in user queries for which the recommended index in the group could have been used.
avg_user_impact: shows the average percentage improvement that user queries might see if the missing-index group were implemented. On average, the execution cost of all user queries (seeks and scans) that could use the missing-index group would decrease by this percentage.
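A minimal sketch to look at these raw columns before applying any filters (same DMVs as the full query below; the ordering is my own choice):
-- raw missing-index statistics, most impactful groups first
select mig.index_group_handle, migs.unique_compiles, migs.user_seeks,
       migs.user_scans, migs.avg_user_impact, migs.last_user_seek
from sys.dm_db_missing_index_groups mig
join sys.dm_db_missing_index_group_stats migs
  on migs.group_handle = mig.index_group_handle
order by migs.avg_user_impact desc;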
On selected tables, as follows (by the way, this removed a big deadlock issue caused by slow triggers):
SELECT object_name(mid.[object_id], mid.database_id) as objectname,
       avg_user_impact, unique_compiles, user_seeks, user_scans, last_user_seek,
       CONVERT(decimal(28,1), migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans)) AS improv_measure,
       'CREATE INDEX missing_index_' + CONVERT(varchar, mig.index_group_handle) + '_'
           + CONVERT(varchar, mid.index_handle)
           + ' ON ' + mid.statement + ' ('
           + ISNULL(mid.equality_columns, '')
           + CASE WHEN mid.equality_columns IS NOT NULL
                   AND mid.inequality_columns IS NOT NULL
                  THEN ',' ELSE '' END
           + ISNULL(mid.inequality_columns, '') + ')'
           + ISNULL(' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement -- , migs.*
       , mid.database_id
FROM sys.dm_db_missing_index_groups mig
INNER JOIN sys.dm_db_missing_index_group_stats migs ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle
WHERE CONVERT(decimal(28,1), migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans)) > 1000
/*
and (mid.statement like '%BASE_EVO04%' or mid.statement like '%BASE_CONF04%')
and (mid.statement not like '%SCHEMA05%')
*/
and (migs.user_seeks > 50 or migs.user_scans > 50)
and datediff(hh, last_user_seek, getdate()) <= 24
and avg_user_impact > 30
and object_name(mid.[object_id], mid.database_id) IN ('TABLE_AM', 'TABLE_SD')
ORDER BY migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) DESC
Table | avg_user_impact | improv_measure | create_index_statement
--- | --- | --- | ---
TABLE_SD | 94,22 | 16656713.9 | CREATE INDEX missing_index_140009_140008 ON [BASE_EVO04].[SCHEMA04].[TABLE_SD] ([STATUS_ID]) INCLUDE ([TABLE_SD_ID], [RECIPIENT_ID], [REQUESTOR_ID], [URGENCY_ID])
TABLE_AM | 99,99 | 4178276.0 | CREATE INDEX missing_index_140269_140268 ON [BASE_EVO04].[SCHEMA04].[TABLE_AM] ([ORIGIN_ACTION_ID]) INCLUDE ([ACTION_ID])
TABLE_AM | 99,18 | 1363032.1 | CREATE INDEX missing_index_140151_140150 ON [BASE_EVO04].[SCHEMA04].[TABLE_AM] ([DONE_BY_ID]) INCLUDE ([ACTION_ID], [REQUEST_ID], [EXPECTED_END_DATE_UT], [START_DATE_UT], [END_DATE_UT], [EXPECTED_START_DATE_UT], [ACTION_LABEL_FR], [MAX_INTERVENTION_DATE_UT], [ACTION_TYPE_ID])
TABLE_AM | 42,36 | 703360.6 | CREATE INDEX missing_index_141131_141130 ON [BASE_EVO04].[SCHEMA04].[TABLE_AM] ([GROUP_ID]) INCLUDE ([ACTION_ID], [DONE_BY_ID], [REQUEST_ID], [ACTION_TYPE_ID])
TABLE_SD | 99,15 | 396373.0 | CREATE INDEX missing_index_140272_140271 ON [BASE_EVO04].[SCHEMA04].[TABLE_SD] ([REQUEST_PROJECT_ID]) INCLUDE ([REQUEST_ID])
TABLE_SD | 80,49 | 381052.0 | CREATE INDEX missing_index_140781_140780 ON [BASE_EVO04].[SCHEMA04].[TABLE_SD] ([STATUS_ID]) INCLUDE ([REQUEST_ID], [TABLE_SD_ID], [SUBMIT_DATE_UT], [RECIPIENT_ID], [REQUESTOR_ID])
TABLE_SD | 69,77 | 11127.1 | CREATE INDEX missing_index_162417_162416 ON [BASE_EVO04].[SCHEMA04].[TABLE_SD] ([PARENT_REQUEST_ID]) INCLUDE ([TABLE_SD_ID])
TABLE_AM | 32,17 | 9754.5 | CREATE INDEX missing_index_11_10 ON [EVO_DATA40000].[40000].[TABLE_AM] ([END_DATE_UT])
Table | avg_user_impact | compiles | seeks | improv_measure
--- | --- | --- | --- | ---
TABLE_SD | 98,06 | 3048 | 35457 | 6782659.9
TABLE_SD | 97,97 | 3048 | 35457 | 6779946.9
TABLE_SD | 96,92 | 3048 | 35457 | 6703807.8
4] Monitor SQL in the Plan Cache
-- Statistics on objects
use EVO_DATA50004
SELECT OBJECT_NAME([sp].[object_id]) AS "Table",
       [sp].[stats_id] AS "Statistic ID",
       [s].[name] AS "Statistic",
       [sp].[last_updated] AS "Last Updated",
       [sp].[rows],
       [sp].[rows_sampled],
       [sp].[unfiltered_rows],
       [sp].[modification_counter] AS "Modifications"
FROM [sys].[stats] AS [s]
OUTER APPLY sys.dm_db_stats_properties([s].[object_id], [s].[stats_id]) AS [sp]
WHERE [s].[object_id] = OBJECT_ID(N'50004.AM_EMPLOYEE');
-- TOP 10 SQL in the Plan Cache using DMVs
-- order by LIOs (logical reads)
SELECT TOP 10
    SUBSTRING(qt.TEXT, (qs.statement_start_offset/2)+1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(qt.TEXT)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset)/2)+1),
    qs.execution_count,
    qs.total_logical_reads, qs.last_logical_reads,
    qs.total_logical_writes, qs.last_logical_writes,
    qs.total_worker_time,
    qs.last_worker_time,
    qs.total_elapsed_time/1000000 total_elapsed_time_in_S,
    qs.last_elapsed_time/1000000 last_elapsed_time_in_S,
    qs.last_execution_time,
    qp.query_plan
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
ORDER BY qs.total_logical_reads DESC -- logical reads
-- ORDER BY qs.total_logical_writes DESC -- logical writes
-- ORDER BY qs.total_worker_time DESC -- CPU time
-- TOP SQL, more detailed, per execution
SELECT TOP 10
    GETDATE() AS "Collection Date",
    qs.execution_count AS "Execution Count",
    SUBSTRING(qt.text, qs.statement_start_offset/2 + 1,
        (CASE WHEN qs.statement_end_offset = -1
              THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2
              ELSE qs.statement_end_offset
         END - qs.statement_start_offset)/2
    ) AS "Query Text",
    DB_NAME(qt.dbid) AS "DB Name",
    qs.total_worker_time AS "Total CPU Time",
    qs.total_worker_time/qs.execution_count AS "Avg CPU Time (ms)",
    qs.total_physical_reads AS "Total Physical Reads",
    qs.total_physical_reads/qs.execution_count AS "Avg Physical Reads",
    qs.total_logical_reads AS "Total Logical Reads",
    qs.total_logical_reads/qs.execution_count AS "Avg Logical Reads",
    qs.total_logical_writes AS "Total Logical Writes",
    qs.total_logical_writes/qs.execution_count AS "Avg Logical Writes",
    qs.total_elapsed_time AS "Total Duration",
    qs.total_elapsed_time/qs.execution_count AS "Avg Duration (ms)",
    qp.query_plan AS "Plan"
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE qs.execution_count > 50
   OR qs.total_worker_time/qs.execution_count > 100
   OR qs.total_physical_reads/qs.execution_count > 1000
   OR qs.total_logical_reads/qs.execution_count > 1000
   OR qs.total_logical_writes/qs.execution_count > 1000
   OR qs.total_elapsed_time/qs.execution_count > 1000
ORDER BY
    --qs.execution_count DESC,
    --qs.total_elapsed_time/qs.execution_count DESC,
    --qs.total_worker_time/qs.execution_count DESC,
    --qs.total_physical_reads/qs.execution_count DESC,
    qs.total_logical_reads DESC -- /qs.execution_count DESC
    --qs.total_logical_writes/qs.execution_count DESC