Hashbang.ca

Introducing mvr: like mv, but clever

I wanted to move a large number of files from one directory to another, but many of the filenames were already taken in the target directory. This is a common enough problem: digital cameras name files DSC followed by a sequence number, video downloaders often append numbers to get a unique filename, and so on. In both of those examples, the sequence restarts when you empty the program’s work directory, so you’ll end up with DSC0001.jpg every time you empty your camera’s memory card. If you’re trying to move such files into a single directory, you’ll get conflicts every time.

Instead of manually renaming the files before transferring them, I wrote a simple script to give each file a unique name in the destination directory.

Mimicking `mv`

First, we get the arguments. Like mv, this script accepts either SOURCE DEST (to rename a single file or directory to the given name) or SOURCE(S) DIRECTORY (to move several sources into a single directory). Here’s how we do that:

use Path::Tiny;

my @source;
push @source, path(shift)
    while @ARGV > 1;             # Accept multiple SOURCEs,
my $dest = path(shift);          # but only one DEST/DIRECTORY.
my $dest_is_dir = $dest->is_dir; # Was DEST a directory?

Now that we have our inputs, and we disambiguated the final argument by checking if it is a directory or not, we can start to actually move the files. Note that all the paths are Path::Tiny objects.

First, we’ll have to handle the case where we’re moving files into a target directory (rather than to a specified name). Then, we simply call the move method to move each file appropriately.

foreach my $from (@source) {
    # Moving into a directory keeps the source's own filename.
    my $to = path( $dest, ($dest_is_dir ? $from->basename : ()) );

    $from->move($to);
}
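
If you want to poke at the SOURCE/DEST disambiguation on its own, here is a minimal sketch that uses only core modules (File::Basename and File::Spec) instead of Path::Tiny. The helper name target_for and the example paths are made up for illustration; they are not part of mvr.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: work out where $from should end up.
# If $dest is an existing directory, keep the source's filename inside it;
# otherwise treat $dest as the new name.
sub target_for {
    my ($from, $dest) = @_;
    return -d $dest
        ? File::Spec->catfile($dest, basename($from))
        : $dest;
}

# A scratch directory stands in for the real destination.
my $dir = tempdir(CLEANUP => 1);

my $into_dir = target_for('photos/DSC0001.jpg', $dir);
my $renamed  = target_for('photos/DSC0001.jpg', "$dir/holiday.jpg");

print "$into_dir\n";   # ends in DSC0001.jpg, inside the scratch directory
print "$renamed\n";    # the name we asked for, unchanged
```

The only decision being made is whether the last argument is an existing directory, which is why the script checks that once up front.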

At this point, the basic framework of what we want is there, but there are several complications. The first is in the spec: we want to pick some random name if we can’t do the rename because the target filename is already taken. The second is handling what the rename(2) system call won’t: moving files across filesystem boundaries. The last one is doing a bit of de-duplication.

Handling name collisions

I used the tempfile routine to get a unique filename. Normally, this creates a temporary file (which is removed when the object goes out of scope) – but we want to simply use this to pick a unique filename. To do that, we’ll need UNLINK => 0. These files are usually created in the temporary directory, but we want to create them in the target directory, so we use the DIR option.

I also chose to stick the random part of the filename right before the file extension, instead of at the end. This makes it easier to continue to use globbing like *.jpg in the directory once we’re done.

    if ($to->exists) {
        my ($prefix, $suffix) = $to->basename =~ m{^(.*)\.(\w+)$};

        $to = Path::Tiny->tempfile(
            UNLINK => 0,
            TEMPLATE => ($prefix // $to->basename) . '-XXXXXX',
            DIR => $dest_is_dir ? $dest : $dest->dirname,
            ( defined $suffix ? (SUFFIX => ".$suffix") : () ), # keep extensions like ".0" too
        );
        warn "File already exists, renaming to $to\n";
    }
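
To see what kind of names this produces without touching real files, here is a small sketch built directly on the core File::Temp module, which Path::Tiny's tempfile wraps. The helper unique_name is hypothetical, and the scratch directory stands in for the real destination.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Temp qw(tempfile tempdir);

# Hypothetical helper: reserve an unused name next to $wanted, with the
# random part placed in front of the extension.
sub unique_name {
    my ($wanted) = @_;
    my ($name, $dir) = fileparse($wanted);
    my ($prefix, $suffix) = $name =~ m{^(.*)\.(\w+)$};

    my ($fh, $path) = tempfile(
        ($prefix // $name) . '-XXXXXX',   # the X's are replaced with random characters
        DIR    => $dir,                   # create it beside the wanted name
        UNLINK => 0,                      # keep the placeholder file around
        ( defined $suffix ? (SUFFIX => ".$suffix") : () ),
    );
    close $fh;
    return $path;
}

my $dir = tempdir(CLEANUP => 1);
my $new = unique_name("$dir/DSC0001.jpg");
print "$new\n";   # something like .../DSC0001-Ab3dEf.jpg
```

Because tempfile actually creates the file, the name is reserved from the moment it is picked; the later move simply replaces the empty placeholder.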

Moving across filesystem boundaries

The second complication is that this uses the system’s rename call, which (usually) doesn’t move files across filesystem boundaries. My media directories are on another filesystem, so this is a deal breaker for me. Let’s catch that error condition – we can handle it by copying the file, and deleting the original.

use Try::Tiny;
use POSIX qw(:errno_h);

...

    try {
        $from->move($to);
    }
    catch {
        die $_ unless $_->isa('autodie::exception');

        if ($_->errno == EXDEV) { # Invalid cross-device link
            $from->copy($to);
            $from->remove;
        }
        else {
            die $_;
        }
    };
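
The same fallback can be sketched without Path::Tiny or Try::Tiny, using the built-in rename and the core Errno and File::Copy modules. This is an illustration under those assumptions rather than what mvr itself does, and move_anywhere is a made-up name.

```perl
use strict;
use warnings;
use Errno qw(EXDEV);
use File::Copy qw(copy);
use File::Temp qw(tempdir);

# Hypothetical helper: rename when possible, otherwise copy and delete.
sub move_anywhere {
    my ($from, $to) = @_;
    return 1 if rename $from, $to;        # same filesystem: nothing more to do

    # Anything other than a cross-device error is a real failure.
    die "rename $from -> $to: $!\n" unless $! == EXDEV;

    copy($from, $to) or die "copy $from -> $to: $!\n";
    unlink $from     or die "unlink $from: $!\n";
    return 1;
}

# Exercise the helper inside a scratch directory.
my $dir = tempdir(CLEANUP => 1);
open my $out, '>', "$dir/a.txt" or die "open: $!\n";
print {$out} "hello\n";
close $out or die "close: $!\n";

move_anywhere("$dir/a.txt", "$dir/b.txt");
print( (-e "$dir/b.txt") ? "moved\n" : "not moved\n" );
```

Both paths here live on the same filesystem, so the demo only exercises the plain rename; the copy-and-delete branch runs when rename fails with EXDEV. The core File::Copy module also provides a move function that renames when it can and otherwise copies and deletes, which may be enough if you don't need to tell the two cases apart.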

Some de-duplication

In the case of a name collision, there is some probability that the two files are actually the same. In the case when they’re not, the behaviour we have so far is correct. But in the case where you’re moving an identical file, it might be nice to not dump another copy into the destination. We can detect duplicate files, and just nuke the source file if there is already a copy at the destination.

To detect duplicates, we can use a hashing scheme like MD5 or SHA-1, but that can be expensive. We might be moving large media files around, and hashing file content requires reading both files off the disk. We can often avoid that by comparing file sizes first: if the sizes differ, the files cannot be identical and we can skip the hash entirely. Only if they’re the same size do we need to hash the contents. Since we’re not scouring the directory for duplicates (we’re only going to notice them when there is a name collision), this will probably be okay.

my $duplicates = sub {
    my $A = shift;
    my $B = shift;
    return if $A->stat->size != $B->stat->size; # avoid reading file off disk

    # Pull out the big guns
    require Digest::MD5;
    # Compare raw digests; ':raw' avoids any newline or encoding translation.
    return
        Digest::MD5->new->addfile( $A->filehandle('<', ':raw') )->digest
        eq
        Digest::MD5->new->addfile( $B->filehandle('<', ':raw') )->digest
    ;
};
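
For anyone who wants to try the size-then-hash idea in isolation, here is a self-contained sketch using only core modules (Digest::MD5 and File::Temp). The helpers md5_of and is_duplicate and the scratch file names are hypothetical; mvr's own check is the closure above.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempdir);

# Hypothetical helper: raw MD5 digest of a file's contents.
sub md5_of {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "open $file: $!\n";
    return Digest::MD5->new->addfile($fh)->digest;
}

# Hypothetical helper: 1 only if both files have the same size and digest.
sub is_duplicate {
    my ($x, $y) = @_;
    return 0 if (stat $x)[7] != (stat $y)[7];   # cheap check, no file reads
    return md5_of($x) eq md5_of($y) ? 1 : 0;
}

# Scratch files: a and b match, c has the same size but different bytes,
# and d has a different size.
my $dir = tempdir(CLEANUP => 1);
my %content = (
    a => "same bytes",
    b => "same bytes",
    c => "SAME BYTES",
    d => "longer content",
);
for my $name (keys %content) {
    open my $out, '>:raw', "$dir/$name" or die "open $dir/$name: $!\n";
    print {$out} $content{$name};
    close $out or die "close: $!\n";
}

printf "a vs b: %d\n", is_duplicate("$dir/a", "$dir/b");
printf "a vs c: %d\n", is_duplicate("$dir/a", "$dir/c");
printf "a vs d: %d\n", is_duplicate("$dir/a", "$dir/d");
```

Only the first two comparisons read any file content; the last one is settled by the size check alone.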

...

        if ($to->exists) {
            if ($args{deduplicate}) {
                STDERR->autoflush(1);
                print STDERR "File already exists; checking for duplication..." if $VERBOSE;
                if ($duplicates->($from, $to)) {
                    print STDERR " `$from' and `$to' are duplicates; removing the source file.\n" if $VERBOSE;
                    $from->remove;
                    next;
                }
                else {
                    print STDERR " `$from' and `$to' are not duplicates.\n" if $VERBOSE;
                }
            }
            # pick a unique name and continue on as before
            ...

TL;DR

And there you have it: a simple script to help you avoid some tedium. I hope it works for you. If not, the code is on GitHub.

App::mvr is available on CPAN if you want to give it a try. The use case is fairly narrow, but it was fun to play around with as a break from studying for finals. Path::Tiny in particular is a nice addition to my toolbox.

Permanent link for post: /post/introducing-mvr-like-mv-but-clever/
Posted: Apr 19, 2013
Tags: linux path-tiny perl release
