rsync

I needed to move 7 TB of files from one host to another using a 4 TB external HDD. I wanted to preserve the file modification times and the directory structure so that I could reverse the procedure at the other end once the HDD had shuttled the files across. At the source end, source is split into destination/1, destination/2 and destination/3. At the other end I want to be able to run…

$ rsync -ra destination/1/ reconstituted/
$ rsync -ra destination/2/ reconstituted/
$ rsync -ra destination/3/ reconstituted/

…so that the tree rebuilt under reconstituted matches source. The trailing slash on each batch directory makes rsync merge that directory's contents rather than copy the numbered directory itself. Incidentally, because my method only handles files, reconstituted will differ from source in that it will omit empty directories.

I started with the 7 TB of files in a directory called source. When I'm finished I want to see those files split into sub-directories within the destination directory. Those sub-directories cannot exceed the limit that I set. The actual size of each sub-directory depends on the mix of files that get put into it.

./
./source/
./destination/

First I removed all of the special characters using the script I published here.

I then used a little shell script to generate an index of the files, the sizes and when they were last modified…

#!/bin/bash

# Copyright (c) 2013, James Downie 
# 
# Permission to use, copy, modify, and/or distribute this
# software for any purpose with or without fee is hereby
# granted, provided that the above copyright notice and this
# permission notice appear in all copies.
# 
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL 
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL
# THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING
# FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF
# CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
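# Index every regular file under the directory given as $1.
# Note: "stat -f" is the BSD/macOS form; GNU stat uses different options.
if [ -d "$1" ]; then
  /usr/bin/find "$1" -type f -exec /usr/bin/stat -f "%z%t%m%t%N" "{}" \;
fi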

With index.sh in the same directory as source and destination, I launched it like so…

$ ./index.sh source/ > index.txt

My working directory now looks like this…

./
./source/
./destination/
./index.sh
./index.txt

…and index.txt looks something like this…

24580 1366159671  fbhepr/ovtinhyg$/Pheerag/.QF_Fgber
21508 1339654491  fbhepr/ovtinhyg$/Pheerag/NOP Cebwrpg/.QF_Fgber
15364 1335935650  fbhepr/ovtinhyg$/Pheerag/NOP Cebwrpg/NOP Sbezngf/.QF_Fgber
47104 1275531397  fbhepr/ovtinhyg$/Pheerag/NOP Cebwrpg/NOP Sbezngf/Oevrs/Nqqvgvbaf gb oevrs.qbp
1082834 1275628987 fbhepr/ovtinhyg$/Pheerag/NOP Cebwrpg/NOP Sbezngf/Oevrs/NOP Oevrs.nv
1090494 1275540447 fbhepr/ovtinhyg$/Pheerag/NOP Cebwrpg/NOP Sbezngf/Oevrs/Arj NOP nqqerff.nv
9971 1272846247 fbhepr/ovtinhyg$/Pheerag/NOP Cebwrpg/NOP Sbezngf/Oevrs/Bhgre Wbo Thvqryvarf.kyfk
6148 1274414392 fbhepr/ovtinhyg$/Pheerag/NOP Cebwrpg/NOP Sbezngf/THVQR SBE VZNTR CYNPRZRAG/.QF_Fgber
3176033 1273124458 fbhepr/ovtinhyg$/Pheerag/NOP Cebwrpg/NOP Sbezngf/THVQR SBE VZNTR CYNPRZRAG/30883 NOP EVPUYNAQF Pnaq17RN7O.NV
1752558 1259881816 fbhepr/ovtinhyg$/Pheerag/NOP Cebwrpg/NOP Sbezngf/THVQR SBE VZNTR CYNPRZRAG/NOP genl DYQ nccebirq 4 Qrp.cqs
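
For readers without BSD stat, the same tab-separated index (size, mtime, path) could be produced in Python. This is only a sketch, not the script I ran: the function name index_lines is made up for illustration, and it assumes Python 3 and that symlinks should be skipped, as find -type f does.

```python
import os
import stat


def index_lines(root):
    """Yield 'size<TAB>mtime<TAB>path' for every regular file under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Skip symlinks, pipes and other non-regular entries.
            if not stat.S_ISREG(st.st_mode):
                continue
            # Whole-second mtime, matching the integer timestamps in index.txt.
            yield "%d\t%d\t%s" % (st.st_size, int(st.st_mtime), path)
```

Printing each yielded line for index_lines("source") and redirecting the output to index.txt would stand in for index.sh. Empty directories produce no lines, just as with the shell version.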

The next step is to work through that list, marking each file as part of a destination volume set. When adding a file would push the current volume set past its size limit, a new, empty volume set begins. For this I used awk. This is split.awk…

# Copyright (c) 2013, James Downie 
# 
# Permission to use, copy, modify, and/or distribute this
# software for any purpose with or without fee is hereby
# granted, provided that the above copyright notice and this
# permission notice appear in all copies.
# 
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL 
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL
# THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING
# FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF
# CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
BEGIN {
  FS="\t"
  OFS="\t"
  batch = 1;
  batch_size = 0;
  limit_m = 3500000
  limit = limit_m * 1024 * 1024
}
{
  size = $1;
  mtime = $2;
  path = $3;
  if (batch_size + size > limit) {
    batch_size = 0;
    batch++;
  }
  batch_size += size;
  printf "# %d %s\n", batch, path
  printf "./migrate_one.sh %d %d \"%s\"\n", batch, mtime, path
}
END {
}
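
The batching rule in split.awk can be sketched in a few lines of Python to make the arithmetic easier to follow. The function name assign_batches and the tiny sizes in the example are invented for illustration; only the rule itself and the limit value come from the awk script.

```python
def assign_batches(sizes, limit):
    """Return a batch number (starting at 1) for each size, in input order."""
    batch = 1
    batch_size = 0
    batches = []
    for size in sizes:
        # Start a new batch when this file would push the current one past the limit.
        if batch_size + size > limit:
            batch_size = 0
            batch += 1
        batch_size += size
        batches.append(batch)
    return batches


# The limit used in split.awk, in bytes.
limit = 3500000 * 1024 * 1024
print(limit)  # 3670016000000

# Toy example with a limit of 8 units.
print(assign_batches([4, 3, 5, 2, 6], 8))  # [1, 1, 2, 2, 3]
```

Files are taken strictly in index order and earlier batches are never revisited, so this is a simple sequential fill rather than an optimal packing. A single file larger than the limit would still get a batch of its own that exceeds the limit.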

Although my HDD is 4 TB, I left a bit of extra space and set the limit to about 3.5 TB (limit_m is 3,500,000 MiB). For each line in index.txt, split.awk outputs two lines: a comment indicating the batch number and the source file's path, and a command that runs a script called migrate_one.sh on the file. We'll run awk over index.txt and direct the output into migrate_all.sh

$ awk -f split.awk index.txt > migrate_all.sh

That runs pretty quickly, considering that index.txt contains 6.7 million lines. Before I run migrate_all.sh I should show you two other scripts. The first is migrate_one.sh

#!/bin/bash

# Copyright (c) 2013, James Downie 
# 
# Permission to use, copy, modify, and/or distribute this
# software for any purpose with or without fee is hereby
# granted, provided that the above copyright notice and this
# permission notice appear in all copies.
# 
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL 
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL
# THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING
# FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF
# CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

BATCH="$1"
MTIME="$2"
SOURCE="$3"

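# Directory part of the source path, mirrored under destination/<batch>/.
D="$(dirname "$SOURCE")"
TD="destination/$BATCH/$D"
T="destination/$BATCH/$SOURCE"
mkdir -v -p "$TD"
mv -v "$SOURCE" "$T"
# Re-apply the mtime recorded in index.txt.
./setmtime.py "$MTIME" "$T"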

That script works out the deep pathname that we expect to move the file into, and then relies on mkdir's -p switch to make all of the parent directories leading to the deepest directory. Then a simple mv relocates the file from the source tree into the destination tree.
The setmtime.py script is a little hack that sets a file's mtime to the provided Unix timestamp. We have all of the files' Unix timestamps in index.txt because stat determined them for us back when index.sh ran find across source. migrate_one.sh runs setmtime.py against the file once it has been moved to ensure that the file survived the move without losing its original “last modified time”.

#!/usr/bin/python

# Copyright (c) 2013, James Downie 
# 
# Permission to use, copy, modify, and/or distribute this
# software for any purpose with or without fee is hereby
# granted, provided that the above copyright notice and this
# permission notice appear in all copies.
# 
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL 
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL
# THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING
# FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF
# CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

import os, sys

mtime = int(sys.argv[1])
path = sys.argv[2]

if os.path.exists(path):
  os.utime(path, (mtime, mtime))
else:
  # print() with a single argument works on both Python 2 and Python 3.
  print("Where is '%s'" % path)
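
For comparison, the work of migrate_one.sh and setmtime.py could be folded into one Python function. This is a sketch under assumptions, not what I ran: migrate_one and its dest_root parameter are hypothetical names, it assumes Python 3, and it assumes source is a relative path such as the ones in index.txt.

```python
import os
import shutil


def migrate_one(batch, mtime, source, dest_root="destination"):
    """Move source to dest_root/<batch>/<source> and restore its mtime."""
    # source must be relative; an absolute path would discard dest_root here.
    target = os.path.join(dest_root, str(batch), source)
    # Equivalent of mkdir -p on the target's parent directory.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.move(source, target)
    # Set both atime and mtime to the timestamp recorded in the index.
    os.utime(target, (mtime, mtime))
    return target
```

Calling migrate_one(1, 1275531397, "source/some/file.doc") from the working directory would leave the file at destination/1/source/some/file.doc with its recorded modification time.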

My directory looked like this…

./
./source/
./destination/
./index.sh
./index.txt
./split.awk
./migrate_all.sh
./migrate_one.sh
./setmtime.py

…when I then ran…

$ sh migrate_all.sh

That took two and a half days to run. Well, actually it took weeks to get right because my script kept breaking as I discovered special characters in source. I made filenameCleanse.py aggressive enough to remove any characters that jeopardised any of these steps and finally it ran without error. The end result looks like this…

$ du -d 1 -h destination/
3.3T    destination/1
3.3T    destination/2
798G    destination/3
7.5T    destination
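
Much of the special-character trouble probably comes from embedding each path between double quotes in the generated migrate_all.sh, where characters such as ", ` and $ can still be interpreted by the shell. One alternative to cleansing the filenames, which I did not use here, is to let Python's shlex.quote do the quoting when generating the commands. The helper name migrate_command is made up for this sketch.

```python
import shlex


def migrate_command(batch, mtime, path):
    """Build one shell-safe command line that calls migrate_one.sh."""
    # shlex.quote wraps the path so the shell passes it through as one literal argument.
    return "./migrate_one.sh %d %d %s" % (batch, mtime, shlex.quote(path))


# Hypothetical awkward filename containing quotes and a dollar sign.
print(migrate_command(1, 1275531397, 'source/it\'s "odd" $HOME.doc'))
```

This only protects the generated command lines. A filename containing a tab or a newline would still confuse the line-oriented, tab-separated index.txt, so some cleansing may remain necessary.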