Alter Session Kill on Steroids

Aug 6, 2014 / By Jared Still

Tags: , , , ,

Perhaps you have encountered something like this: A session that is consuming too many resources needs to be killed. You locate the session and use ALTER SYSTEM KILL SESSION ‘SID,SERIAL#’ to kill the session. As you continue to monitor the database you find that the status of the session in v$session is ‘KILLED’, but the session does not go away. You also notice that the SERIAL# is continually changing.

Now you find there is no OS process associated with the session, but the session continues as PMON is unable to finish cleanup for the session. Usually when this happens, the session will be holding a lock. When that happens, the only method to release the lock is to bounce the database. There are some bugs that may be responsible for this problem, such as this one described by Oracle Support:

Pmon Spins While Cleaning Dead Process (Doc ID 1130713.1)

This particular bug affects Oracle 10.2.0.1 – 11.1.0.7. I have personally seen this same behavior happen on many versions of the database from 7.0 on. To avoid these hanging sessions many DBA’s have adopted the habit of first killing the OS process with an OS utility, and if the session is still visible in v$session, issue the ALTER SYSTEM KILL command.

The OS command used on linux/unix is usually ‘kill -9′. On windows it is OraKill. This method usually avoids the problems encountered when killing a session that is holding a lock and processing DML.

I don’t know just what circumstances trigger this behavior, as I have never been able to reproduce it at will. When it does happen though, it is more than annoying as the only way to clear locks held by the recalcitrant session is to bounce the database.

Quite some time ago (at least as far back as Oracle 8i) Oracle introduced the new IMMEDIATE keyword to use with ALTER SYSTEM KILL SESSION. Using this keyword removes the need to use an OS command to kill a session – Oracle will do it for you! To test this I am using Oracle 10.2.0.4 on Oracle Linux 5.5. I have previously run these same tests in 11.2.0.3 with the same results. Had I access to an 8i or 9i database I would have run the tests there. To start with let’s see what happens when a session is killed without the immediate keyword.

Login to the session to be killed:

$ sqlplus scott/tiger@10gr2

Login as SYSDBA from another terminal and check for scott’s session:

SQL> l
  1  select
  2     s.username,
  3     s.sid,
  4     s.serial#,
  5     p.spid spid
  6  from v$session s, v$process p
  7  where s.username = 'SCOTT'
  8*    and p.addr = s.paddr
SQL> /

USERNAME                              SID    SERIAL# SPID
------------------------------ ---------- ---------- ------------
SCOTT                                 133         35 22870

1 row selected.

All that has happened at this point is that Oracle has made an internal call that has disconnected Scott’s session. (tracing that operation is a different topic.) The process on the server has not been terminated. This can be seen by the following experiment:

Logon again as Scott.

In a SYSDBA session check for Scott’s:

 SQL> @scott

USERNAME                              SID    SERIAL# SPID
------------------------------ ---------- ---------- ------------
SCOTT                                 146         81 23678

Now check for the shadow process associated with scott’s session on the server:


[root@ora10gR2 tmp]# ps -fp 23678
UID PID PPID C STIME TTY TIME CMD
oracle 23678 1 0 16:56 ? 00:00:00 oraclejs01 (LOCAL=NO)

Kill the session and check the status:

SQL> alter system kill session '146,81';

SQL> l
  1  select
  2     s.username,
  3     s.sid,
  4     s.serial#,
  5     p.spid spid
  6  from v$session s, v$process p
  7  where s.username = 'SCOTT'
  8*    and p.addr = s.paddr
SQL>/

no rows selected

Check again on the server for the process:

[root@ora10gR2 tmp]# ps -fp 23678
UID PID PPID C STIME TTY TIME CMD
oracle 23678 1 0 16:56 ? 00:00:00 oraclejs01 (LOCAL=NO)

Interesting, isn’t it? We know the process is still alive on the server, but the session information is no longer associated with the process. This happens because Oracle has disconnected the session, which allows the process to continue until the sqlplus session is terminated. The session information is still available in v$session, but is no longer associated with a server process:

select
  2     s.username,
  3     s.status,
  4     s.sid,
  5     s.serial#
  6  from v$session s
  7* where s.username = 'SCOTT'
SQL>/

USERNAME                       STATUS          SID    SERIAL#
------------------------------ -------- ---------- ----------
SCOTT                          KILLED          146         81

1 row selected.

 1* select pid,spid from v$process where pid = 146
SQL>/

no rows selected

When exiting the Scott session, I can see that the session was killed:

SQL> exit
ERROR:
ORA-00028: your session has been killed

Let’s perform the experiment again, but this time use the IMMEDIATE keyword.

Logon as scott:

> sqlplus scott/tiger@10gr2

SQL*Plus: Release 11.2.0.3.0 Production on Tue Aug 5 17:18:53 2014

Logon as SYSDBA and check for the scott session;

SQL> @scott

USERNAME                              SID    SERIAL# SPID
------------------------------ ---------- ---------- ------------
SCOTT                                 146         83 23939

1 row selected.

Before killing scott’s session:

  • get my OS PID
  • enable 10046 trace

The OS PID will be used for strace on the SYSDBA session shadow process on the server.
The 10046 trace is so we can see what is happening in the strace output.

SQL> l
1 select
2 s.username,
3 s.sid,
4 s.serial#,
5 p.spid spid
6 from v$session s, v$process p
7 where s.username is not null
8 and p.addr = s.paddr
9 and userenv('SESSIONID') = s.audsid
10* order by username, sid
SQL>/

USERNAME                              SID    SERIAL# SPID
------------------------------ ---------- ---------- ------------
SYS                                   145         65 23947

1 row selected.

SQL> alter session set events '10046 trace name context forever, level 12';

Session altered.

Now ssh to the db server , check for Scott session shadow process and start strace:

[root@ora10gR2 tmp]# strace -o 23947.strace -p 23947
^Z
[1]+ Stopped strace -o 23947.strace -p 23947
[root@ora10gR2 tmp]# bg
[1]+ strace -o 23947.strace -p 23947 &

[root@ora10gR2 tmp]# ps -p 23939
PID TTY TIME CMD
23939 ? 00:00:00 oracle

Now kill Scott’s session and exit the SYSDBA session:

SQL> alter system kill session '146,83' immediate;

System altered.

The strace command will now have exited on the server.

First check again for Scott’s session:

[root@ora10gR2 tmp]# ps -p 23939
PID TTY TIME CMD
[root@ora10gR2 tmp]#

So the Scott shadow process has terminated.

As the 10046 trace was enabled, the output to the oracle trace file will appear in the strace file, which allows searching for ‘alter system kill’ in the strace file.

From the strace file:

write(5, "alter system kill session '146,8"..., 44) = 44

Now searching for the PID of scott’s session 23939:

read(10, "23939 (oracle) S 1 23939 23939 0"..., 999) = 228
close(10) = 0
open("/proc/23939/stat", O_RDONLY) = 10
read(10, "23939 (oracle) S 1 23939 23939 0"..., 999) = 228
close(10) = 0
kill(23939, SIGKILL) = 0
kill(23939, SIGCONT) = 0
open("/proc/23939/stat", O_RDONLY) = 10
read(10, "23939 (oracle) Z 1 23939 23939 0"..., 999) = 178
close(10) = 0

From the previous text I can see that Oracle opened the status file for PID 23939.
Why it did so twice I am not sure.

What happens after that is the interesting part.

kill(23939, SIGKILL) = 0

That line means that the SIGKILL signal was successfully sent to Scott’s shadow process.

What does that mean? Run kill -l to get a list of signals:

kill -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
 6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX

Notice that signal 9 (kill -9) is SIGKILL.

So when killing a session with ALTER SYSTEM KILL SESSION ‘PID,SERIAL#’ IMMEDIATE Oracle is actually doing the kill -9 for you, and has been for many years now.

Though not shown here, this same test was run when the session was killed without using the IMMEDIATE keyword, and there were no attempts to kill the session. This was inferred as well by the fact the the process was still running on the server up until the time the Scott sqlplus session was exited.

10 Responses to “Alter Session Kill on Steroids”

  • Maris Elsins says:

    I had always used “alter system disconnect session …” when I needed to terminate it immediately without letting it wait for the ongoing call to complete.
    Now when you showed me the “alter system kill session .. immediate”, it looked exactly the same as disconnect session, so I checked the latest documentation here http://docs.oracle.com/database/121/SQLRF/statements_2017.htm#SQLRF53113 and confirmed:
    A) DISCONNECT SESSION = KILL SESSION IMMEDIATE
    B) KILL SESSION = DISCONNECT SESSION POST_TRANSACTION
    basically “KILL SESSION”, and “DISCONNECT SESSION” does the same thing, but the default action is different.

    • still says:

      Thanks for the comments Maris. I’ve never really made use of DISCONNECT SESSION, so I tried it.
      Unlike KILL, DISCONNECT SESSION requires either the POST_TRANSACTION or IMMEDIATE keyword:


      SQL> alter system disconnect session '142,1628';
      alter system disconnect session '142,1628'
      *
      ERROR at line 1:
      ORA-02000: missing POST_TRANSACTION or IMMEDIATE keyword

      Here’s the session info:

      CLIENT SRVR
      USERNAME SID SERIAL# SQL ID PID STATUS MACHINE OSUSER CLIENT PROGRAM PID SERVER PROGRAM PID LOGON TIME IDLE TIME
      ---------- ------ ------- -------------- ------ ---------- ---------- ------- -------------------- ------------ -------------------- ----- ----------------- -----------
      DBSNMP 128 783 20 INACTIVE ora10gR2.j oracle JDBC Thin Client 1234 oracle@ora10gR2.jks. 11610 08/08/14 08:57:19 00:00:03:24
      ks.com
      132 1103 22 INACTIVE ora10gR2.j oracle JDBC Thin Client 1234 oracle@ora10gR2.jks. 11616 08/08/14 08:57:19 00:00:03:24
      ks.com
      140 1314 15 INACTIVE ora10gR2.j oracle JDBC Thin Client 1234 oracle@ora10gR2.jks. 11164 08/08/14 08:42:19 00:00:03:21
      ks.com
      146 1843 23 INACTIVE ora10gR2.j oracle JDBC Thin Client 1234 oracle@ora10gR2.jks. 11622 08/08/14 08:57:19 00:00:03:24
      ks.com
      149 24 16 INACTIVE ora10gR2.j oracle JDBC Thin Client 1234 oracle@ora10gR2.jks. 21564 08/05/14 15:38:43 00:00:00:06
      ks.com

      SCOTT 142 1628 7h35uxf5uhmm1 24 INACTIVE poirot jkstill sqlplus@poirot (TNS 4325 oracle@ora10gR2.jks. 11844 08/08/14 09:00:26 00:00:00:15

      SYS 138 1608 chmy5m24rfhvp 26 ACTIVE poirot jkstill sqlplus@poirot (TNS 4327 oracle@ora10gR2.jks. 11852 08/08/14 09:00:41 00:00:00:00

      Now to disconnect the session:


      SQL> alter system disconnect session '142,1628' IMMEDIATE;

      System altered.

      Checking the strace file for the SYSDBA session I find that indeed SIGKILL is used

      open("/proc/11844/stat", O_RDONLY) = 10
      read(10, "11844 (oracle) S 1 11844 11844 0"..., 999) = 228
      close(10) = 0
      kill(11844, SIGKILL) = 0
      kill(11844, SIGCONT) = 0
      open("/proc/11844/stat", O_RDONLY) = 10
      read(10, "11844 (oracle) Z 1 11844 11844 0"..., 999) = 178
      close(10) = 0
      semtimedop(524288, 0x7fff711b2b70, 1, {0, 50000000}) = -1 EAGAIN (Resource temporarily unavailable)
      open("/proc/11844/stat", O_RDONLY) = -1 ENOENT (No such file or directory)

      What cannot be seen from strace though is what Oracle library calls are being made.
      That could be traced with gdb, maybe something to check out another time.

      The reason I mention this is due to the differences in the documentation for KILL and DISCONNECT SESSION.

      KILL

      IMMEDIATE Specify IMMEDIATE to instruct Oracle Database to roll back ongoing transactions, release all session locks, recover the entire session state, and return control to you immediately.

      DISCONNECT SESSION

      The IMMEDIATE setting disconnects the session and recovers the entire session state immediately, without waiting for ongoing transactions to complete.

      Whether or not there are actual differences or this is a documentation bug, I cannot say at this time.
      The documenation however is not consistent – further testing would be required to determine if these are really equivilent.

      The documentation also implies that “DISCONNECT SESSION ‘sid,serial#’ IMMEDIATE POST_TRANSACTION” can be used:


      "The IMMEDIATE setting disconnects the session and recovers the entire session state immediately, without waiting for ongoing transactions to complete.

      If you also specify POST_TRANSACTION and the session has ongoing transactions, then the IMMEDIATE keyword is ignored."

      That however does not work (tested on 11.2.0.4):


      SQL> alter system disconnect session '60,14681' immediate post_transaction;
      alter system disconnect session '60,14681' immediate post_transaction

      • still says:

        Using Frits Hoogland’s macros I ran gdb against two SYSDBA sessions, one of which I used to terminate a session with KILL IMMEDIATE and the other to terminate a session with DISCONNECT SESSION IMMEDIATE.

        Both appear to do the same thing, at least as far as I can tell at this time.

        So Maris, your assertions that these are identical seem to correct.

        The documentation is not quite so clear on this.

        KILL IMMEDIATE

        Continuing.
        opiclo — close cursor
        opiosq0 — parse
        kkshini — hard parse
        opiexe — execute

        opiclo — close cursor
        qerxtIsActive — external table fetch

        Program exited normally.

        DISCONNECT SESSION IMMEDIATE

        Continuing.
        opiclo — close cursor
        opiosq0 — parse
        kkshini — hard parse
        opiexe — execute

        opiclo — close cursor
        qerxtIsActive — external table fetch

        Program exited normally.

  • kill/disconnect immediate does not send SIGKILL at all cases. For example, SIGKILL will not be sent to the process that executes the next query:
    with D as
    (
    select 5 a, 7 b from dual
    union all select 7 a, 5 b from dual
    union all select 6 a, 6 b from dual
    union all select -5 a, -7 b from dual
    union all select -7 a, -5 b from dual
    union all select 7 a, -5 b from dual
    union all select -5 a, 7 b from dual
    union all select -7 a, 5 b from dual
    union all select 5 a, -7 b from dual
    union all select 1 a, 0 b from dual
    union all select 0 a, 1 b from dual
    union all select 0 a, -1 b from dual
    union all select -1 a, 0 b from dual
    union all select 0 a, 0 b from dual
    union all select 0 a, 1/3 b from dual
    union all select 1/3 a, 0 b from dual
    ),
    I0 as (select a, b, ((a – b) / 40) + 0.61 x from D),
    I1 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I0),
    I2 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I1),
    I3 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I2),
    I4 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I3),
    I5 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I4),
    I6 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I5),
    I7 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I6),
    I8 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I7),
    I9 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I8),
    I10 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I9),
    I11 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I10),
    I12 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I11)
    select a, b, a*x + b*(1-x) from I12

    • Jared says:

      Interesting. Explanation please?

    • Jared says:

      I ran this query, and as you said kill() is not used on the session running this query.

      gdb shows the same output for sessions that are killed with kill() as well as those are disconnected internally:

      opiclo — close cursor
      opiosq0 — parse
      kkshini — hard parse
      opiexe — execute

      strace on the SYSDBA session confirms that kill() is used when the session is static, and not used when the query is running.

      Next I tried running this query in the scott session:
      select * from all_tab_columns

      When killing the session the results were the same – kill() was not used.

      Before deciding what to do in each case oracle is reading the /proc/PID/stat file.
      Comparing the values for running and not running a query didn’t show any obvious differences that might trigger this behavior.

      It seems that if the session is doing any work then kill() is not used. The results were the same for trying to kill while disk was being read or if data was being fetched and displayed.

      So using kill -9 or OraKill may still be appropriate.

      Looks like this article will require some editing.

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>