UPDATE on huge table without index

This is a task that keeps coming back no matter where I work. I always solve it the same way, but it takes me a minute to remember how I did it last time. I guess it’s time to share this super easy approach.

Scenario:
You have a huge table with constant activity, containing terabytes of data. You need to update/delete roughly a million random rows. Random selects against the table keep some of the indexes you could use hot. Running one big update that locks rows or the whole table is not an option, since the table receives approx. 40-50 inserts/sec. So you need to update the table in smaller batches. LIMIT would be nice, but it is not supported together with OFFSET when using UPDATE or DELETE (see the sketch right after this paragraph).
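
A minimal sketch of that limitation, using the example table and values defined further down: MySQL accepts a plain LIMIT on UPDATE/DELETE, but refuses an OFFSET, so you cannot page through the rows this way:

-- Accepted: updates at most 10000 matching rows
UPDATE huge_table SET action_id = 100 WHERE from_id = 'q7b4x5aa0303erer' LIMIT 10000;

-- Rejected with a syntax error: OFFSET is not allowed in UPDATE/DELETE
UPDATE huge_table SET action_id = 100 WHERE from_id = 'q7b4x5aa0303erer' LIMIT 10000 OFFSET 10000;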

Updating with IN and a subquery that uses LIMIT is not supported in 5.0.x (unsure about 5.1.x and forward), otherwise that would have been a viable solution; see the example right below.
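
For reference, this is roughly the kind of statement that gets rejected on 5.0.x (again using the example table and values from further down); it fails with an error along the lines of ERROR 1235: This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery':

UPDATE huge_table SET action_id = 100
WHERE request_id IN (SELECT request_id FROM huge_table WHERE from_id = 'q7b4x5aa0303erer' LIMIT 10000);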

Selects with very specific WHERE clauses might also take longer than you would expect, since some of the indexes/keys are simply too big.

Solution:
Select all the unique/primary keys of the rows you want to change into a temp table that has an auto_increment primary key of its own, then update the production table using this temporary table as a reference for which rows to update, batching with a BETWEEN clause on the temporary table’s auto_increment column.
You may also enclose each batch in a transaction and delete the same BETWEEN range from the temporary reference table; that helps you keep track of what you actually did in case you need to abort. A sketch of that variant follows.
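
A minimal sketch of one such batch, assuming the example tables defined below and a batch covering temp-table ids 1 to 10000 (note that the reference table in the example is MyISAM, so the DELETE on it will not roll back if you abort):

START TRANSACTION;
UPDATE huge_table ht JOIN tmp_ids ti ON ht.request_id = ti.prod_id SET ht.action_id = 100 WHERE ti.id BETWEEN 1 AND 10000;
DELETE FROM tmp_ids WHERE id BETWEEN 1 AND 10000;
COMMIT;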

Example
Production table:

CREATE TABLE `huge_table` (
`request_id` bigint(20) NOT NULL auto_increment,
`customer_id` int(11) default NULL,
`data_id` bigint(20) default NULL,
`group_id` int(11) default NULL,
`user_id` int(11) default NULL,
`action_id` int(11) default NULL,
`external_id` int(11) default NULL,
`user_data` varchar(2048) default NULL,
`entry` text,
`extra_term` varchar(255) default NULL,
`transaction` text,
`receive_time` datetime default NULL,
`from_id` varchar(32) default NULL,
`url` varchar(2048) default NULL,
`from_ip` varchar(2048) default NULL,
`useragent` varchar(2048) default NULL,
`tz` int(11) default NULL,
`cvalue` varchar(255) default NULL,
`rvalue` varchar(255) default NULL,
`uid` varchar(40) default NULL,
PRIMARY KEY (`request_id`),
KEY `k1` (`receive_time`,`action_id`,`url`,`from_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1047142423 DEFAULT CHARSET=utf8

You need to set action_id = 100 on approx. 1.5M rows which have receive_time between '2010-08-01 00:00:00' and '2010-08-31 23:59:59' and from_id equal to 'q7b4x5aa0303erer'.

Temporary table:

CREATE TABLE `tmp_ids` (
`id` int(11) NOT NULL auto_increment,
`prod_id` bigint(20) default NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=15424160 DEFAULT CHARSET=utf8

Fill the temporary table with the IDs of the rows you need to update in the production table:

INSERT INTO tmp_ids (prod_id) SELECT request_id FROM huge_table USE INDEX (k1) WHERE from_id='q7b4x5aa0303erer' AND receive_time BETWEEN '2010-08-01 00:00:00' AND '2010-08-31 23:59:59';
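
A quick sanity check before starting the batches, so you know the id range you will be iterating over (assuming the tables above):

SELECT MIN(id), MAX(id), COUNT(*) FROM tmp_ids;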

Then just iterate this in a bash script or your favorite scripting language (increasing the BETWEEN values for each batch, of course):

UPDATE huge_table ht JOIN tmp_ids ti ON ht.request_id=ti.prod_id SET ht.action_id=100 WHERE ti.id BETWEEN x AND y;
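
For example, with batches of 10000 the first two iterations would be (each followed by the optional DELETE from tmp_ids if you use the transactional variant):

UPDATE huge_table ht JOIN tmp_ids ti ON ht.request_id=ti.prod_id SET ht.action_id=100 WHERE ti.id BETWEEN 1 AND 10000;
UPDATE huge_table ht JOIN tmp_ids ti ON ht.request_id=ti.prod_id SET ht.action_id=100 WHERE ti.id BETWEEN 10001 AND 20000;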

Example bash script:

#!/bin/bash

# Config params
user="username"
pass="password"
host="hostname"
db="schema_name"
tmptbl="tmp_table"
livetbl="live_table"
tmpcol="live_id"
livecol="request_id"
tmpkey="id"
upcol="from_id=1"
waittime="0.5"

# Print usage function
function usage() {
    echo "Usage: $(basename $0) rows"
    echo "rows - number of rows to update in one go."
    exit 0
}

# Since we need at least one argument to continue, check that it is supplied
[ -z "$1" ] && usage

# get that argument
rows=$1
# Check which is the first key we will use in the temp table
s=`mysql -N -u$user -p$pass -h$host $db -e"select min($tmpkey) from $db.$tmptbl;"`
# Set the latter between value
let b=$s+$rows
# Count how many rows we have to update so we know how long to loop
count=`mysql -N -u$user -p$pass -h$host $db -e"select count(*) from $db.$tmptbl;"`

# Print starting line
echo "Starting up with updates from id: $s to $b, performing $rows at one go.."

# loop till we are done
while [ 1 ];
do

    # are we done? If so exit
    if (($count <= 0)); then
        exit
    fi

    # Run actual db updates
    echo -n "updating $livetbl between $s and $b.."
    mysql -N -u$user -p$pass -h$host $db -e"update $livetbl td join $tmptbl i on td.$livecol=i.$tmpcol set $upcol where i.$tmpkey between $s and $b"
    echo -n " deleting in $tmptbl.."
    mysql -N -u$user -p$pass -h$host $db -e"delete from $db.$tmptbl where $tmpkey between $s and $b"
    echo -n " Done!"

    # set between values for the next batch
    let s=$s+$rows
    let b=$b+$rows
    let count=$count-$rows

    # wait if there are more records to update
    if (($count > 0)); then
        echo " ..waiting.. still $count records to update"
        sleep $waittime
    fi
done

# All done
echo "Script done. No more records to update"
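
When the script finishes, the reference table should be empty, which is a handy sanity check before you drop it (assuming the example tables above):

SELECT COUNT(*) FROM tmp_ids; -- should return 0
DROP TABLE tmp_ids;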
