We are seeing the following error in postgresql log
Seems like maintenance query times out and data is accumulating -
This could probably be made in a smarter way - eg several smaller delete batches that looped until there was nothing further to clean up
With the current failure mode the data will accumulate, with a potential of other queries getting slower due to increased data size
Code: Select all
DELETE FROM job_session_messages AS j
WHERE j.job_session_id IN (
SELECT j0.id
FROM job_sessions AS j0
WHERE j0.end_time <= $1 AND j0.status = ANY ($2)
)
and a combination of the next 2 updates
Just did a small test to get an overview of the size of the pending deletes, query build with the logged parameters
Code: Select all
SELECT count(*)
FROM job_session_messages AS j
WHERE j.job_session_id IN (
SELECT j0.id
FROM job_sessions AS j0
WHERE j0.end_time <= 638596224007975371 AND j0.status = ANY ('{2,0,4,3}')
);
count
-----------
151234372
(1 row)
Any body with programming experience will probably agree that trying to do very large deletes in one big swoop is more or less doomed to go wrong
Instead use several minor deletes like this - a couple of these run manually got us through the delete backlog and a following VACCUUM got our DB size back to a more reasonable level
Code: Select all
WITH messages_to_delete AS (
SELECT j.id
FROM job_session_messages AS j
WHERE j.job_session_id IN (
SELECT j0.id
FROM job_sessions AS j0
WHERE j0.end_time <= 638596224007975371 AND j0.status = ANY ('{2,0,4,3}')
)
LIMIT 1000000
)
DELETE FROM job_session_messages
WHERE id IN (SELECT id FROM messages_to_delete);
While we're at it, maintenance on job_session_objects also seems to occasionally fail in a similar way although but looks like it's able to recover by it self after a few retries
Code: Select all
DELETE FROM job_session_objects AS j
WHERE j.job_session_id IN (
SELECT j0.id
FROM job_sessions AS j0
WHERE j0.end_time <= $1 AND j0.status = ANY ($2)
)