Killing an Oracle Job Dead

Mar 30, 2009 / By Lukas Vysusil

Tags: ,

If you ever thought it was easy to kill an oracle job, let me show you something that gave me a headache today.

We’re testing an application that uses custom code to de-queue and propagate AQ messages between databases. There are nine processes in total, and all of them are submitted as jobs using DBMS_JOB. They are supposed to run all the time, waiting for messages to de-queue. Once in a while, I need to kill them in order to recompile the objects they are executing.

The steps are simple enough: remove/break all jobs in question, kill the jobs that are running, recompile the packages with new the version, and submit/un-break the jobs again. This time, however, things were not going as smoothly as I expected. Let’s have a look.

First I remove all the currently running jobs:

SYS@ORCL>select 'exec dbms_ijob.remove('||job||');' from dba_jobs_running;

'EXECDBMS_IJOB.REMOVE('||JOB||');'
----------------------------------------------------------------
exec dbms_ijob.remove(261);
exec dbms_ijob.remove(264);
exec dbms_ijob.remove(267);
exec dbms_ijob.remove(262);
exec dbms_ijob.remove(265);
exec dbms_ijob.remove(268);
exec dbms_ijob.remove(263);
exec dbms_ijob.remove(266);
exec dbms_ijob.remove(269);

9 rows selected.

SYS@ORCL>exec dbms_ijob.remove(261);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(264);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(267);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(262);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(265);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(268);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(263);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(266);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(269);
PL/SQL procedure successfully completed.

As a next step, I’ll just kill the sessions of the running jobs:

SYS@ORCL>select 'alter system kill session '''||sid||','||serial#||''';' from v$session where sid in (select sid from dba_jobs_running);

'ALTERSYSTEMKILLSESSION'''||SID||','||SERIAL#||''';'
--------------------------------------------------------------------------------
alter system kill session '120,73';
alter system kill session '122,111';
alter system kill session '123,788';
alter system kill session '136,20';
alter system kill session '137,28';
alter system kill session '138,143';
alter system kill session '142,132';
alter system kill session '144,129';
alter system kill session '159,68';

9 rows selected.

SYS@ORCL>alter system kill session '120,73';
System altered.

SYS@ORCL> alter system kill session '122,111';
System altered.

SYS@ORCL> alter system kill session '123,788';
System altered.

SYS@ORCL> alter system kill session '136,20';
System altered.

SYS@ORCL> alter system kill session '137,28';
System altered.

SYS@ORCL> alter system kill session '138,143';
System altered.

SYS@ORCL> alter system kill session '142,132';
System altered.

SYS@ORCL> alter system kill session '144,129';
System altered.

SYS@ORCL> alter system kill session '159,68';
System altered.

And the running jobs are gone:

SYS@ORCL>select count(*) from dba_jobs_running;

  COUNT(*)
----------
         0

I’m thinking this is fine, but after few seconds the jobs are back!

SYS@ORCL> /

  COUNT(*)
----------
         9

I found this behavior quite intriguing, so I searched on the Net and found a really nice blog explaining in great detail how the DBMS_JOB jobs behaves.

That gave a hint that the OS process may actually keep the database processes alive. So I think, that sounds reasonable—let’s kill the OS processes instead.

SYS@ORCL>!ps -ef | grep ora_j
oracle   16234     1  0 12:58 ?        00:00:00 ora_j000_ORCL
oracle   16236     1  9 12:58 ?        00:00:07 ora_j001_ORCL
oracle   16238     1  9 12:58 ?        00:00:07 ora_j002_ORCL
oracle   16240     1  0 12:58 ?        00:00:00 ora_j003_ORCL
oracle   16242     1  0 12:58 ?        00:00:00 ora_j004_ORCL
oracle   16244     1  8 12:58 ?        00:00:06 ora_j005_ORCL
oracle   16246     1  9 12:58 ?        00:00:07 ora_j006_ORCL
oracle   16248     1  0 12:58 ?        00:00:00 ora_j007_ORCL
oracle   16250     1  9 12:58 ?        00:00:07 ora_j008_ORCL
oracle   16280 15459  0 12:59 pts/0    00:00:00 /bin/bash -c ps -ef | grep ora_j
oracle   16282 16280  0 12:59 pts/0    00:00:00 grep ora_j

SYS@ORCL>!kill -9 16234 16236 16238 16240 16242 16244 16246 16248 16250 16252

Done!

Hmm, but not really—after a while, the jobs are back from the dead. Under different Oracle PIDs and different OS PIDs.

SYS@ORCL>select count(*) from dba_jobs_running;

  COUNT(*)
----------
         9

Now this left me really puzzled. How do I get rid of them?

Of course! Why hadn’t I thought of this before? I’ll set job_queue_processes to 0. Kill the processes and they will not respawn.

SYS@ORCL>alter system set job_queue_processes=0;

System altered.

Killed all processes again.

SYS@ORCL>select count(*) from dba_jobs_running;

  COUNT(*)
----------
         0

SYS@ORCL>alter system set job_queue_processes=10;

System altered.

SYS@ORCL>select count(*) from dba_jobs_running;

  COUNT(*)
----------
         9

Oops, they’re back again! Notice that the jobs were gone from the DBA_JOBs table, and the running jobs were referencing non-existent job ID’s. How can they be respawning?

SYS@ORCL>select j.what, j.job, r.sid from dba_jobs j, dba_jobs_running r where j.job in (select job from dba_jobs_running) and j.job=r.job order by j.what;

no rows selected

I decided to expand my knowledge about jobs, and read the whole of James Koopmann’s blog. And then I saw the light! User JB (thank you JB, I too learned the hard way) mentioned in comments under the article that one has to run COMMIT after manipulating jobs.

So after all of the above, I just typed . . .

SYS@ORCL> commit;

Commit complete.

. . . then killed the jobs again, and bingo! Those pesky jobs were finally gone.

I’ve done the same procedure in the past using DBMS_JOB and never faced the issue, but I realize that I always switched to the SYS user in order to kill sessions, so it was autocommited upon reconnect.

It is important to mention—the respawned jobs were not doing what they were supposed to do, that means they were not de-queing messages. The database version I’m using is 10.2.0.4, and it’s true for both DBMS_JOB/IJOB, as well as REMOVE and BROKEN procedures, and it might apply to other procedures of this package as well. After all, things are as simple as commit, aren’t they?

6 Responses to “Killing an Oracle Job Dead”

  • thanks for the mention. my Killing the Oracle DBMS_JOB is at http://it.toolbox.com/blogs/database-solutions/killing-the-oracle-dbms_job-6498
    cheers,
    james

  • Vlado says:

    Can you elaborate on what you meant by “…switched to the SYS user in order to kill sessions, so it was autocommited upon reconnect…”, especially the autocommit part?

  • Lukas Vysusil says:

    Hi Vlado, let me clarify, because that part is not too clear.
    In the past I was using DBMS_JOB, hence I had to be logged in as the owner of the job to be able to manipulate it using DBMS_JOB.REMOVE. After removing the job I reconnected using ‘connect / as sysdba’ in order to be able to issue ‘alter system kill …’. Once you reconnect or exit sqlplus session it will autocommit.

    If you use DBMS_IJOB then you do not need to switch between sessions because you can do everything as SYS from single session.

    Lukas

  • Reci says:

    Lukas,

    My Corporation recently had an issue with a job we could not kill via the OS or database process. I just happened to try running multiple commits and bingo! they were gone.

    I was looking for some documentation to help explain what happened and read your article.

    Thanks, for putting it out there. It was as simple as commits:)

    Reci

  • BEDE says:

    Not seen this so far.
    Yet I have another problem: how to programatically kill the job sessions on a cluster?
    Meaning: I have a procedure that receives the job_id as a param and does an execute immediate ‘alter system kill session …’.
    The problem is that, on a cluster, the proc may be running on a different node than that the hob session runs.
    What do you think about this?

  • Lukas says:

    Hi Bede,

    how about checking the dba_jobs_running for SID and INSTANCE columns which will give you the SID and INST_ID which you can use to lookup gv$session and kill the right session.

    Regards,
    Lukas

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>