The idea of using a database table as a queue or list of work to do is nothing new, and it all sounds so simple to begin with: use a status column to track what has and hasn't been processed. But it all falls apart when the need arises to process the list quickly and in parallel.
This article will recap the traditional problems associated with this pattern and demonstrate the Oracle 11g feature that makes it all just work. To demonstrate this, I'll be using a simple Java application to compare the old approach with the new. The application processes 200 rows from a simple table, with a 100ms delay in each transaction to simulate actual work.
I've written this tip as I had a bit of a hard time myself this week collating various snippets of useful and useless information. Hopefully, you won't have the same issues after reading this.
In its simplest form, a database table representing a list of work could look like this:
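A minimal sketch; the only columns actually referenced later in the article are id and status, so everything else here is assumed:

```sql
CREATE TABLE MYTABLE (
  id     NUMBER       PRIMARY KEY,
  status VARCHAR2(20) NOT NULL
);
```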
and include data like this:
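For example (the 'TODO' status comes from the cursor query used later; in the real test the application creates 200 such rows):

```sql
INSERT INTO MYTABLE (id, status) VALUES (1, 'TODO');
INSERT INTO MYTABLE (id, status) VALUES (2, 'TODO');
INSERT INTO MYTABLE (id, status) VALUES (3, 'TODO');
COMMIT;
```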
Traditionally, a processing routine would select an item to process from the list and later update its status (e.g. to 'DONE') to stop it being selected again. However, as soon as multiple threads, connections, or processes are added, you immediately run into contention and duplicate-processing issues. To mitigate these, you might add an intermediate status, e.g. 'Processing', but this doesn't really help matters: there's still a fight for the initial selection, and there's just no way this solution is going to scale.
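The traditional routine might be sketched like this (the ROWNUM limit and the bind variable are my own illustration):

```sql
-- pick a candidate row; nothing stops another process picking the same one
select id from MYTABLE where status = 'TODO' and ROWNUM = 1;
-- ... do the work ...
update MYTABLE set status = 'DONE' where id = :picked_id;
commit;
```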
The database offered a partial solution in the 'Select for Update' statement, which stops the initial fighting during selection. This approach locks the selected row so that no other process can select the same one, and duplicate processing won't occur. A nice fix for the fight and the duplicate processing, but it still doesn't scale. Why? Because, as long as the locked record still matches the criteria of the other processes' selects, those selects get stuck behind the locked record and cannot proceed. When the lock is eventually released, the data no longer satisfies those select statements, so each query has to run again, adding to the delay. The result: not only does it fail to scale, but the work is spread unevenly across the threads, which could lead to other issues depending on your application:
run.bat NOSKIP 10
Thread: 1, Processed: 20, Duration: 22230ms
Thread: 2, Processed: 19, Duration: 22230ms
Thread: 3, Processed: 17, Duration: 22230ms
Thread: 5, Processed: 26, Duration: 22245ms
Thread: 9, Processed: 27, Duration: 22261ms
Thread: 6, Processed: 19, Duration: 22292ms
Thread: 4, Processed: 17, Duration: 22292ms
Thread: 8, Processed: 14, Duration: 22292ms
Thread: 7, Processed: 22, Duration: 22355ms
Thread: 0, Processed: 19, Duration: 22355ms
You might think at this point that if the 100ms delay weren't in the demonstration application, the locks wouldn't be held for as long and the test would run faster. This is true, but the point of the test is to simulate real-world processing, where longer-running transactions are both a fact of life and one of the reasons this approach doesn't work in the first place.
One solution to the lock-related waiting is to update the record to a 'Processing' state so that other queries ignore it. Committing this update removes the lock and the other processes can continue. However, we've now introduced a new transaction boundary and an intermediate state that requires additional management if anything goes wrong, e.g. items could get stuck in the Processing state, so maybe you introduce an expiry time... The complexity and risk have now grown out of all proportion to what started off as a simple problem.
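That workaround might be sketched as follows (status values assumed; note the extra commit, which is exactly the new transaction boundary being described):

```sql
-- claim the row in its own short transaction, releasing the lock
update MYTABLE set status = 'Processing' where id = :picked_id;
commit;
-- ... do the work; if this fails, the row is now stuck in 'Processing' ...
update MYTABLE set status = 'DONE' where id = :picked_id;
commit;
```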
What was needed was a way to lock the row being worked on while allowing parallel selects to ignore, or skip over, that record. Enter Oracle's 'Select for Update Skip Locked' statement. This gives us everything we want without the crazy hacks. It does have its quirks, but I'll explain those too.
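Applied to the example table, the statement itself is just the familiar query with one extra clause:

```sql
select id from MYTABLE
 where status = 'TODO'
 order by id asc
   for update skip locked;
```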
If you try to implement the new Skip Locked SQL in code or plain SQL, you might think about adding something like ROWNUM = 1 to the query so that you only process one record at a time. Bad idea! The row limit is only applied after the query and lock have run, so a lock is inadvertently applied to all relevant records. The result: any other process running the same SQL gets nothing back, because every applicable row was skipped!
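The tempting-but-broken version might look like this (my own reconstruction of the query being warned against):

```sql
-- Don't do this: per the behaviour described above, the row limit is
-- applied after the candidate rows have already been locked
select id from MYTABLE
 where status = 'TODO' and ROWNUM = 1
   for update skip locked;
```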
You need to control which record the update lock is applied to, and this can be done with a cursor inside a stored procedure, e.g.:
CREATE OR REPLACE PROCEDURE getmynext(outparam OUT number) IS
  CURSOR c IS
    select id from MYTABLE where status = 'TODO' order by id asc for update skip locked;
  l_id MYTABLE.id%TYPE;
BEGIN
  OPEN c;
  FOR i IN 1..1 LOOP
    FETCH c INTO l_id;
    EXIT WHEN c%NOTFOUND;
    select id into outparam from MYTABLE where id = l_id for update;
  END LOOP;
  CLOSE c;
END;
/
Within the cursor, the select criteria simply asks to skip any locked rows; the actual locking occurs only as the cursor fetches each row. And since we only fetch once, we lock and process exactly one row.
Important: executing this procedure requires the calling code to first establish a transaction, i.e. disable auto-commit, so that the lock taken inside the procedure is held until the caller commits.
Once any additional processing is done, the caller updates the record to the 'DONE' status and commits, which releases the lock. The result of this approach? Scalable, parallel processing with even loading and no contention or added complexity:
run.bat SKIP 10
Thread: 8, Processed: 20, Duration: 2668ms
Thread: 7, Processed: 20, Duration: 2714ms
Thread: 3, Processed: 20, Duration: 2730ms
Thread: 4, Processed: 20, Duration: 2730ms
Thread: 2, Processed: 20, Duration: 2761ms
Thread: 6, Processed: 20, Duration: 2761ms
Thread: 1, Processed: 20, Duration: 2777ms
Thread: 0, Processed: 20, Duration: 2777ms
Thread: 9, Processed: 20, Duration: 2808ms
Thread: 5, Processed: 20, Duration: 2824ms
Incidentally, to invoke the procedure from a SQL editor:
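A minimal SQL*Plus-style sketch (the bind-variable name is my own):

```sql
variable nextid number
exec getmynext(:nextid)
print nextid
-- the session now holds the lock on that row; when the work is done:
update MYTABLE set status = 'DONE' where id = :nextid;
commit;
```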
Example code is attached, along with the DDL for creating the test table and the cursor procedure.
You also can use the application to create your initial dataset:
run CREATE 200