This monday morning, I received an unpleasant message from a project manager :
« I had the biggest batch program failures in history : Have you done any changes on Production DB ? ».
Like almost every time, DBAs are blamed for wrong reasons !
Starting at 1/30/11 01:50 , errors were related to
ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDO'
ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDO'
Let’s find out what happened in the UNDO tablespace and give some clues about its behaviour :
Oracle 10.2.0.x - Automatic Undo Management
1/ Grid and AWR at time of incident
I first noticed in GRID console that UNDO tablespace grew -Autoextend- from 500 MB to the MAXSIZE 1000 MB.
AWR is more precise : equivalent of V$UNDOSTAT
With these findings, I went on searching AWR for Big SQL and found out that a big DELETE took place and overfilled UNDO Segments …
It is true that Tuned_undoretention can be less than undo_retention specified in the parameter file.
Undo_retention specified in the init.ora is a low threshold.
From Oracle 10g onward, undo retention is automatically tuned to reduce the chances of "snapshot too old" errors during long-running queries. In the event of any undo space constraints, the system will prioritize DML operations over undo retention. In such situations, the low threshold may not be achieved and tuned_undoretention can go below undo_retention.
If the undo retention threshold must be guaranteed, even at the expense of DML operations, the RETENTION GUARANTEE clause can be set against the undo tablespace.
-- Undo Used Size ( and other measures ) per 10 minutes period from AWR
-- UNDOBLKS -> the consumption rate of undo blocks
with blk_sz_MB as ( SELECT VALUE/1024/1024 blk_val FROM v$parameter WHERE NAME = 'db_block_size' )
select BEGIN_TIME , END_TIME ,
ceil(UNDOBLKS*blk_val) "UNDO Used size (MB)",
ceil(ACTIVEBLKS*blk_val) "UNDO Active size (MB)",
ceil(UNEXPIREDBLKS*blk_val) "UNDO Unexpired size (MB)",
ceil(EXPIREDBLKS*blk_val) "UNDO Expired size (MB)",
TUNED_UNDORETENTION
from dba_hist_undostat , blk_sz_MB
where to_char(BEGIN_TIME,'DD/MM/YYYY HH24:MI:SS') between '31/01/2011 00:30:00' and '31/01/2011 03:00:00'
order by 1;
BEGIN_TIME | END_TIME | UNDO Used size (MB) | UNDO Active size (MB) | UNDO Unexpired size (MB) | UNDO Expired size (MB) | TUNED_ UNDORETENTION |
31/01/2011 00:38 | 31/01/2011 00:48 | 1 | 11 | 144 | 1 | 60000 |
31/01/2011 00:48 | 31/01/2011 00:58 | 1 | 11 | 147 | 2 | 60000 |
31/01/2011 00:58 | 31/01/2011 01:08 | 152 | 158 | 141 | 1 | 60000 |
31/01/2011 01:08 | 31/01/2011 01:18 | 187 | 318 | 161 | 3 | 60000 |
31/01/2011 01:18 | 31/01/2011 01:28 | 293 | 566 | 193 | 8 | 30026 |
31/01/2011 01:28 | 31/01/2011 01:38 | 373 | 933 | 54 | 8 | 1516 |
31/01/2011 01:38 | 31/01/2011 01:48 | 60 | 992 | 8 | 0 | 1782 |
31/01/2011 01:48 | 31/01/2011 01:58 | 130 | 992 | 8 | 0 | 2004 |
31/01/2011 01:58 | 31/01/2011 02:08 | 39 | 994 | 7 | 0 | 2389 |
31/01/2011 02:08 | 31/01/2011 02:18 | 35 | 992 | 8 | 1 | 2752 |
31/01/2011 02:18 | 31/01/2011 02:28 | 42 | 992 | 8 | 0 | 3068 |
31/01/2011 02:28 | 31/01/2011 02:38 | 119 | 11 | 987 | 3 | 3739 |
31/01/2011 02:38 | 31/01/2011 02:48 | 154 | 11 | 986 | 2 | 3578 |
31/01/2011 02:48 | 31/01/2011 02:58 | 10 | 11 | 917 | 1 | 3979 |
31/01/2011 02:58 | 31/01/2011 03:08 | 2 | 11 | 917 | 1 | 4408 |
2/ UNDO Management
From this result, we need to understand how AUTO UNDO tablespace is managed :
Rollback Segments may be in 3 different states, because Oracle protects the Consistent Read with the 'undo_retention' parameter, beyond which extents expire.
* dba_undo_extents shows Extent column STATUS
ACTIVE active transactions
EXPIRED commited transactions no more under 'undo_retention' period
UNEXPIRED commited transactions still under 'undo_retention' period
This is the FULL POWER of AUTO Over Manual Rollback Segment : the ability to move Extents between RBS ; EXPIRED or UNEXPIRED Extents may be transferred to ACTIVE transactions if needed.
What is the algorithm used by Oracle when trying to allocate a new extent for SMU in UNDO tablespace ?
[ taken from Internet ]
1. A new extent will be allocated from the undo tablespace.
2. If failed, ( no free space exists in the tablespace )
acquire expired extent from another undo segment.
This involves shrinking from the other segment and adding to the current undo segment.
3. If failed, reclaim expired extents from the current transaction's commited changes.
4. If failed, try allocating more space by datafile extention ( if autoextend ON ).
5. If failed, try reuse an unexpired extent from the current undo segment.
6. If failed, try stealing an unexpired extent from another undo segment.
7. If all failed, report an "Out-Of-Space" error.
As we see, the algorithm prioritizes the consistent read ( unexpired extent ) over the datafile extension.
3/ Going back to the problem
With these findings, I went on searching AWR for Big SQL and found out that a big DELETE took place and overfilled UNDO Segments …
What I learned from this analysis is that the UNDO_RETENTION parameter was set too high ( 60000 sec ) and leads DBF to expand until the MAXSIZE ; then, it was automatically decreased by Oracle for the benefit of Active Transactions.
The 10g Tuned Undo_Retention always tries to be higher than the UNDO_RETENTION init parameter, but may go below this threshold if there are space constraints ; it happens, I believe, between step 5 and 6 of the above agorithm.
The 10g Tuned Undo_Retention always tries to be higher than the UNDO_RETENTION init parameter, but may go below this threshold if there are space constraints ; it happens, I believe, between step 5 and 6 of the above agorithm.
Mettalica : Tuned_UndoRetention Can be Less Than Undo_Retention in Init.ora [ID 1100313.1] :
It is true that Tuned_undoretention can be less than undo_retention specified in the parameter file.
Undo_retention specified in the init.ora is a low threshold.
From Oracle 10g onward, undo retention is automatically tuned to reduce the chances of "snapshot too old" errors during long-running queries. In the event of any undo space constraints, the system will prioritize DML operations over undo retention. In such situations, the low threshold may not be achieved and tuned_undoretention can go below undo_retention.
If the undo retention threshold must be guaranteed, even at the expense of DML operations, the RETENTION GUARANTEE clause can be set against the undo tablespace.
4/ Appendix Queries
-- AWR : Undo Used Size ( and other measures ) per 10 minutes period
-- UNDOBLKS -> the consumption rate of undo blocks
-- UNXPBLKREUCNT should be = 0 ( if no space pressure problem )
with blk_sz_MB as ( SELECT VALUE/1024/1024 blk_val FROM v$parameter WHERE NAME = 'db_block_size' )
select BEGIN_TIME , END_TIME ,
txncount "total number of transactions", MAXCONCURRENCY "tx. executed concurrently",
ceil(UNDOBLKS*blk_val) "UNDO Used size (MB)",
ceil(ACTIVEBLKS*blk_val) "UNDO Active size (MB)",
ceil(UNEXPIREDBLKS*blk_val) "UNDO Unexpired size (MB)",
ceil(EXPIREDBLKS*blk_val) "UNDO Expired size (MB)",
ceil(UNXPBLKREUCNT*blk_val) "Unexpired Reused size (MB)",
TUNED_UNDORETENTION
from dba_hist_undostat , blk_sz_MB
where to_char(BEGIN_TIME,'DD/MM/YYYY HH24:MI:SS') between '31/01/2011 00:30:00' and '31/01/2011 03:00:00'
order by 1;
-- Current Extents Status - Instance Level
select STATUS,count(*) from dba_undo_extents group by status ;
-- Current Extents Status & Size per tablespace ( if changing UNDO tbspce for example ) - Instance Level
select tablespace_name,STATUS,count(*),ceil(sum(bytes/1024/1024)) MB_SZ from dba_undo_extents group by tablespace_name,status ;
-- Current Active Extents Size per Rollback Segment - Instance Level
select OWNER,SEGMENT_NAME,BYTES/1024/1024,tablespace_name from dba_undo_extents where STATUS = 'ACTIVE' order by 3 desc ;
-- Current Undo Used Size - Instance Level
select ceil(sum(RSSIZE)/1024/1024) "UNDO Current Used Size (MB)" from v$rollstat a ;
-- Total Tablespace UNDO Available Size - Instance Level
select (sum(c.bytes)/1024/1024) "UNDO Tblspce Total Size (MB)" from dba_tablespaces b , dba_data_files c
where b.tablespace_name = c.tablespace_name and b.contents = 'UNDO' ;
-- UNDO Max Used Size since instance startup - Instance Level
-- ( attention -> sum the HWM of every undo segments, may be > UNDO tablespace size )
select ceil(sum(HWMSIZE)/1024/1024) "UNDO Max Used Size (MB)" from v$rollstat ;
-- Undo Used Size per 10 minutes period since instance startup - Instance Level
-- gives consumption rate of UNDO Blocks, helps in finding proper UNDO_RETENTION according to UNDO tbspce size
select to_char(BEGIN_TIME,'DD/MM/YYYY HH24:MI:SS') begin_time,to_char(END_TIME,'DD/MM/YYYY HH24:MI:SS') end_time,
ceil(UNDOBLKS*(SELECT VALUE FROM v$parameter WHERE NAME = 'db_block_size')/1024/1024) "10 min int. UNDO Used Sz (MB)"
from v$undostat order by BEGIN_TIME ;
-- Current Undo Consumption MB_Size - Session Level
SELECT e.taddr,e.sid,e.serial#,e.program,
ROUND(d.used_ublk * (SELECT VALUE FROM v$parameter WHERE NAME = 'db_block_size')/1024/1024,2) "UNDO Used Size (MB)"
FROM v$transaction d, v$session e WHERE d.addr = e.taddr
ORDER BY 5 DESC;
-- Current UNDO Consumption per session still with open transaction
select distinct s.username,s.sid,s.serial#,rn.name,rs.curext,rs.curblk,t.used_ublk,t.used_urec,due.tablespace_name,
t.start_time
from v$transaction t
,v$session s
,v$rollname rn
,v$rollstat rs
,dba_undo_extents due
where t.addr = s.taddr
and t.xidusn = rn.usn
and rn.usn = rs.usn
and rn.name = due.segment_name
and due.status = 'ACTIVE' -- uncommitted
order by 7 desc ;