UPDATE on huge table without index

This is something that keeps coming back no matter where I work. I always do it the same way, but it takes a minute to remember how I did it last time. I guess it's time to share something super easy.

You have a huge table with constant activity, containing terabytes of data. You need to update/delete roughly a million random rows. Random selects from the table keep some indexes you could use hot. Running one big update and locking rows or the whole table is not an option, since you have approx. 40-50 inserts/sec to the table. So you need to run the update in smaller batches. LIMIT with an OFFSET would be nice, but UPDATE and DELETE only accept a plain LIMIT, not an OFFSET.

Updating with IN and a subquery that uses LIMIT is not supported in 5.0.x (unsure about 5.1.x and later), otherwise that would have been a viable solution.
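For reference, this is roughly the statement that would have been convenient. MySQL 5.0 rejects the LIMIT inside the IN() subquery (and updating the same table you select from in a subquery runs into its own restriction):

UPDATE huge_table
SET action_id = 100
WHERE request_id IN (
    -- Not allowed in 5.0.x: LIMIT inside an IN() subquery
    SELECT request_id FROM huge_table
    WHERE from_id = 'q7b4x5aa0303erer'
    LIMIT 10000
);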

Selects with specific WHERE clauses might also take longer than you'd like, since some of the indexes/keys are simply too big.

The approach: select all the unique/primary keys into a temporary table that has its own auto_increment primary key, then update the production table using this temporary table as a reference for which rows to update, with a BETWEEN clause on the temporary table's auto_increment column.
You can also wrap each batch in a transaction and delete the same BETWEEN range from the temporary reference table. That helps you keep track of what you actually did, in case you need to abort.

Production table:

CREATE TABLE `huge_table` (
`request_id` bigint(20) NOT NULL auto_increment,
`customer_id` int(11) default NULL,
`data_id` bigint(20) default NULL,
`group_id` int(11) default NULL,
`user_id` int(11) default NULL,
`action_id` int(11) default NULL,
`external_id` int(11) default NULL,
`user_data` varchar(2048) default NULL,
`entry` text,
`extra_term` varchar(255) default NULL,
`transaction` text,
`receive_time` datetime default NULL,
`from_id` varchar(32) default NULL,
`url` varchar(2048) default NULL,
`from_ip` varchar(2048) default NULL,
`useragent` varchar(2048) default NULL,
`tz` int(11) default NULL,
`cvalue` varchar(255) default NULL,
`rvalue` varchar(255) default NULL,
`uid` varchar(40) default NULL,
PRIMARY KEY (`request_id`),
KEY `k1` (`receive_time`,`action_id`,`url`,`from_id`)
);

You need to set action_id = 100 on approx. 1.5M rows that have receive_time between '2010-08-01 00:00:00' and '2010-08-31 23:59:59' and from_id = 'q7b4x5aa0303erer'.

Temporary table:

CREATE TABLE `tmp_ids` (
`id` int(11) NOT NULL auto_increment,
`prod_id` bigint(20) default NULL,
PRIMARY KEY (`id`)
);

Fill the temporary table with the IDs you need to update in the production table:

INSERT INTO tmp_ids (prod_id) SELECT request_id FROM huge_table USE INDEX (k1) WHERE from_id='q7b4x5aa0303erer' AND receive_time BETWEEN '2010-08-01 00:00:00' AND '2010-08-31 23:59:59';

Then just iterate this in a bash script or your favorite scripting language (increasing the BETWEEN values each round, of course):

UPDATE huge_table ht JOIN tmp_ids ti ON ht.request_id=ti.prod_id SET ht.action_id=100 WHERE ti.id BETWEEN x AND y;
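If you want each batch to be atomic, as suggested above, the update can be paired with the delete on the reference table inside a transaction. A sketch of one batch, assuming InnoDB tables and a window of ids 1-10000:

START TRANSACTION;

-- Update the production rows referenced by this window
UPDATE huge_table ht
JOIN tmp_ids ti ON ht.request_id = ti.prod_id
SET ht.action_id = 100
WHERE ti.id BETWEEN 1 AND 10000;

-- Drop the same window from the reference table, so whatever
-- remains in tmp_ids is always what is still left to do
DELETE FROM tmp_ids
WHERE id BETWEEN 1 AND 10000;

COMMIT;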

Example bash script (the config values at the top are placeholders, adjust them to your environment):

#!/bin/bash

# Config params -- placeholders, change these to match your setup
user="dbuser"
pass="dbpass"
host="localhost"
db="mydb"
tmptbl="tmp_ids"          # temporary reference table
tmpkey="id"               # auto_increment key in the temp table
tmpcol="prod_id"          # temp table column holding the production PK
livetbl="huge_table"      # production table
livecol="request_id"      # production table PK
setval="td.action_id=100" # SET clause for the update
waittime=5                # seconds to sleep between batches

# Print usage function
function usage() {
    echo "$(basename $0) <rows>"
    echo "rows - number of rows in one go."
    exit 0
}

# Since we need at least one argument to continue, check that it is supplied
[ -z "$1" ] && usage

# Get that variable
rows=$1

# Check which is the first key we will use in the temp table
s=`mysql -N -u$user -p$pass -h$host $db -e"select min(id) from $db.$tmptbl;"`
# Set the latter BETWEEN value (inclusive, hence the -1)
let b=$s+$rows-1
# Count how many rows we have to update so we know how long to loop
count=`mysql -N -u$user -p$pass -h$host $db -e"select count(*) from $db.$tmptbl;"`

# Print starting line
echo "Starting up with updates from id: $s to $b, performing $rows at one go.."

# Loop till we are done
while [ 1 ]; do
    # Are we done? If so, leave the loop
    if (($count <= 0)); then
        break
    fi

    # Run actual db updates
    echo -n "updating $livetbl between $s and $b.."
    mysql -N -u$user -p$pass -h$host $db -e"update $livetbl td join $tmptbl i on td.$livecol=i.$tmpcol set $setval where i.$tmpkey between $s and $b"
    echo -n " deleting in $tmptbl.."
    mysql -N -u$user -p$pass -h$host $db -e"delete from $db.$tmptbl where $tmpkey between $s and $b"
    echo " Done!"

    # Advance the BETWEEN window
    let s=$s+$rows
    let b=$b+$rows
    let count=$count-$rows

    # Wait if there are more records to update
    if (($count > 0)); then
        echo " ..waiting.. still $count records to update"
        sleep $waittime
    fi
done

# All done
echo "Script done. No more records to update"
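To sanity-check the window arithmetic without touching the database, the loop logic can be dry-run on its own. Here min_id, total and rows are made-up stand-ins for the results of the MIN(id) and COUNT(*) queries:

```shell
#!/bin/bash
# Dry run of the batching arithmetic only -- no MySQL involved.
# min_id/total stand in for "select min(id)" and "select count(*)".
min_id=1
total=2500   # pretend tmp_ids holds 2500 rows
rows=1000    # batch size

s=$min_id
b=$((s + rows - 1))   # inclusive upper bound for BETWEEN
count=$total
while ((count > 0)); do
  echo "would run: ... where id between $s and $b"
  s=$((s + rows))
  b=$((b + rows))
  count=$((count - rows))
done
```

With these numbers it prints three windows (1-1000, 1001-2000, 2001-3000); the last one is only partially filled, which is fine since BETWEEN simply matches fewer rows.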
