Merging Google Photos from Google Takeout

I've been working on my life archive and my latest project has been to pull in my Google Photos. (I've already pulled in photos from other sources such as iPhoto.) To this end, I used Google Takeout to request an export of just my Google Photos. Here are the main steps I took to download my export and incorporate it into my collection of photos.

The goal

My goal is to have all my pictures, from Google and from other sources, in a Dropbox folder called mnemosyne/pictures. Underneath this folder I want to have pictures structured roughly into sub-folders for year, month, and day, e.g. mnemosyne/pictures/2010/01/01/picture.jpg.

One thing I realized is that Google Photos has non-photos in it–it's really a photo and video service. Here are the types of files in my export:[1]

Extension   What I'm going to do with it
---------   ----------------------------
avi         Goes in mnemosyne/videos
cr2         Goes in mnemosyne/pictures
gif         Goes in mnemosyne/pictures
jpg         Goes in mnemosyne/pictures
json        Refer to but do not import
m4v         Goes in mnemosyne/videos
mov         Goes in mnemosyne/videos
mp4         Goes in mnemosyne/videos
pdf         Goes in mnemosyne/projects
png         Goes in mnemosyne/pictures

Key tools used

I relied most on the following command-line tools:

  • bash: I used a heckuva lot of short bash scripts and one-liners.
  • date: I used date to convert dates a little bit; on OS X/FreeBSD this is done with date -j -f %SOURCEFORMAT $FROM +$TOFORMAT (see the example just after this list).
  • exiftool: reads EXIF data from image files.
  • fdupes: excellent tool to find duplicate files.
  • jq: JSON file parser to read the exiftool output more easily.
  • touch: lets you set the last-modified time of a file. I use the last-modified time for when the picture was taken.
  • The indomitable pairing of find . -type f -iname '*.jpg' -print0 and xargs -0
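
For example, converting epoch seconds into the YYYYMMDDHHMM string that touch expects looks like this (the timestamp is just an example value):

date -j -f %s 1262304000 +%Y%m%d%H%M
# prints something like 201001010000, depending on your time zone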

Downloading and preparing Google Photos for import

The first (large) step was to get my Google Photos data. I requested the export via takeout.google.com, and after about five days I got an email saying the takeout had failed. I tried again, and the second attempt was successful. Google structured my export into 91 zip files of 2 GB each.

My computer has (barely) enough disk space to download these files and still have all my other pictures on the disk, so I had to be careful not to keep duplicate copies of files in my processing.

I needed to set my computer to never sleep. OS X Catalina has a sleep setting, "Prevent computer from sleeping automatically when the display is off," that I needed to check.
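
(Another option here would have been macOS's caffeinate command, which keeps the machine awake while a command runs; the command name below is just a stand-in:)

caffeinate -i ./some-long-running-command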

I created a manual checklist for the 91 files so I could record each file when I downloaded and processed it.

I logged into Google Takeout via Firefox and started out by downloading 10 files. I found that only two files actually downloaded at a time. Once I got the process working for those ten files, I queued up more files to download at a time. Notably, Firefox creates a second file called $FILENAME.part while a file is being downloaded. I relied on this .part file for the next step:

I wrote this shell script to "process" the downloaded files into my ingest folder:

#!/bin/sh
set -e

DESTDIR=~/Dropbox/mnemosyne/ingest
for i in *.zip
do if [ -e "$i.part" ]
then
  : # echo "$i still downloading"
else
  echo "processing $i"
  # ditto rather than unzip because ditto handles unicode filenames
  ditto -k -x "$i" "$DESTDIR"
  rm "$i"
  echo "processed $i"
fi
done

This unpacks the files using "ditto". I found that unzip doesn't handle unicode filenames, and some of the Google Hangouts folders[2] had "•" characters in them. The set -e tells the shell to stop running if there's an error, so the files aren't removed if ditto fails.

I ran this processing script from time to time and I checked off each file in my manual checklist when the script said the zip file was processed.
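
(I could also have looped it instead of running it by hand; a sketch, assuming the script above is saved as process_zips.sh:)

while true
do
  sh process_zips.sh
  sleep 300   # check for finished downloads every five minutes
done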

At the end of this step, I have a bunch of content in ~/Dropbox/mnemosyne/ingest/Takeout/Google Photos. (The export zip files have an internal hierarchy of Takeout/$GOOGLE_SERVICE, and since I just requested Google Photos I only have that folder in this takeout.)

Recording the files I downloaded

I may need to do this process again, and I don't want to redo any of the below steps for files I've already processed. To that end, I ran these commands:

cd ~/Dropbox/mnemosyne/ingest/Takeout
ls -lR > ~/Dropbox/mnemosyne/pictures/Google-Takeout-ls-lR.txt
find . -type f > ~/Dropbox/mnemosyne/pictures/Google-Takeout-find-files.txt

I ran the ls -lR because I've run into problems in the past where I needed file metadata after the fact, notably the last-modified time.
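
With those lists saved, a future takeout run can check whether a file was already handled; a sketch (the path is made up):

grep -Fx './Google Photos/2015-06-01/IMG_1234.jpg' \
  ~/Dropbox/mnemosyne/pictures/Google-Takeout-find-files.txt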

Removing duplicates

Duplicates within Google Photos takeout

First I checked for duplicates within Google Photos–and there are a lot!

fdupes -r ~/Dropbox/mnemosyne/ingest/Takeout > ~/tmp/google-photos-dupes.txt
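
fdupes prints each set of duplicates as a group of paths separated by a blank line, something like this (made-up paths):

/Users/me/Dropbox/mnemosyne/ingest/Takeout/Google Photos/2015-06-01/IMG_1234.jpg
/Users/me/Dropbox/mnemosyne/ingest/Takeout/Google Photos/Instant Upload/IMG_1234.jpg

/Users/me/Dropbox/mnemosyne/ingest/Takeout/Google Photos/2016-01-01/VID_0001.mp4
/Users/me/Dropbox/mnemosyne/ingest/Takeout/Google Photos/Hangout With Alice/VID_0001.mp4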

I then processed this list of duplicates so that I kept only the first ASCIIbetical result in each set, using this Perl script:

#!perl

use strict;
use warnings;

# fdupes separates each duplicate set with a blank line,
# so read one whole set per record.
$/ = "\n\n";

while (<>) {
  print "To process: $_";

  # Sort the set, keep the first path, and delete the rest.
  my @delete_set = split /\n/;
  my @sorted_set = sort @delete_set;
  shift @sorted_set;
  for my $file_to_delete (@sorted_set) {
    unlink($file_to_delete) or warn "Couldn't delete $file_to_delete: $!";
  }
}
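
I ran it over the saved fdupes output, something like this (dedupe.pl being whatever the script is saved as):

perl dedupe.pl ~/tmp/google-photos-dupes.txt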

Duplicates vs. my existing photos

Next, I checked for duplicates between Google Photos and my existing photos:

# Make sure this returns nothing first:
fdupes -r ~/Dropbox/mnemosyne/pictures ~/Dropbox/mnemosyne/videos
fdupes -r ~/Dropbox/mnemosyne/pictures ~/Dropbox/mnemosyne/videos ~/Dropbox/mnemosyne/ingest/Takeout > ~/tmp/dupes-with-existing.txt

I then removed the ingest copy of any file that also existed elsewhere, using this Perl script:

#!perl

use strict;
use warnings;

$/ = "\n\n";

while (<>) {
  print "To process: $_";

  my @delete_set = split /\n/;
  # Only the copies under the ingest folder are candidates for deletion.
  my @ingest_set = grep m{/ingest/}, @delete_set;
  # Delete the ingest copies only if a copy survives elsewhere.
  if (@ingest_set < @delete_set) {
      for my $file_to_delete (@ingest_set) {
          unlink($file_to_delete) or warn "Couldn't delete $file_to_delete: $!";
      }
  }
}

OK–after all this, my Takeout directory should contain only my unique files.
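
A quick sanity check is to re-run fdupes across everything; it should now print nothing:

fdupes -r ~/Dropbox/mnemosyne/pictures ~/Dropbox/mnemosyne/videos ~/Dropbox/mnemosyne/ingest/Takeout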

Setting last-modified times

Now I need to set my last-modified times (mtime) on the Takeout photos. I originally did this kind of ham-handedly using this script:

cd "~/Dropbox/mnemosyne/ingest/Takeout/Google Photos"
for i in 19*-*-* 20*-*-*  # the files starting with 20 are in the form YYYY-MM-DD
do
  YYYYMMDD=$(echo $i | tr -d '-' | cut -c1-8)
  TOUCH_DT="${YYYYMMDD}0001"
  echo touch -m -t $TOUCH_DT $i/*
done

This relies on the fact that Google Photos puts files in folders that mostly start with YYYY-MM-DD. However, I later started using exiftool; that script looks something like this:

for i in *.jpg
do
  # CreateDate comes back like "2010:01:01 12:34:56"
  i_date=$(exiftool -j -EXIF:CreateDate "$i" | jq -r '.[] | .CreateDate')
  echo -n "$i_date "; echo touch -m -t "${i_date:0:4}${i_date:5:2}${i_date:8:2}${i_date:11:2}${i_date:14:2}" "$i"
done

(I had to use this script for a folder called Instant Uploads.)
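
For reference, the exiftool -j output that jq is parsing looks roughly like this (made-up filename and date):

exiftool -j -EXIF:CreateDate IMG_1234.jpg
[{
  "SourceFile": "IMG_1234.jpg",
  "CreateDate": "2010:01:01 12:34:56"
}]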

Reading from JSON files

I did end up using the JSON files to process a few files, using this script:

#!/bin/sh
set -e

JSON_DIR="$HOME/Dropbox/mnemosyne/ingest/Takeout/Google Photos/Instant Upload"
for filename in "$@"
do
    echo "processing $filename"
    pictures_file="$(basename "$filename")"
    json_file="$JSON_DIR/$pictures_file.json"
    if [ -e "$json_file" ]
    then
        picture_ts=$(jq -r '.photoTakenTime|.timestamp' < "$json_file")
        touch_string=$(date -j -f %s "$picture_ts" +%Y%m%d%H%M)
        touch -m -t "$touch_string" "$filename"
    else
        echo "Can't find a JSON file for $filename"
    fi
done

This pulls the photoTakenTime.timestamp value and then uses OS X's date command to convert the date from epoch seconds into the string that touch needs (%Y%m%d%H%M).
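
The sidecar JSON files contain a pile of metadata; the part this script cares about looks roughly like this (trimmed, with made-up values):

{
  "title": "IMG_1234.jpg",
  "photoTakenTime": {
    "timestamp": "1433116800",
    "formatted": "Jun 1, 2015, 12:00:00 AM UTC"
  }
}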

Moving files

After all that work, I used two scripts to move the files:

#!/bin/sh
set -e

PICTURES_ROOT=~/Dropbox/mnemosyne/pictures

for filename in "$@"
do
    # Build a YYYY/MM/DD stem from the file's last-modified time
    pictures_stem=$(stat -f '%Sm' -t '%Y/%m/%d' "$filename")
    pictures_file="$(basename "$filename")"
    newdir="$PICTURES_ROOT/$pictures_stem"
    mkdir -p "$newdir"
    newpath="$newdir/$pictures_file"
    # On a name collision, append -0, -1, ... before the extension
    collisioncount=0
    while [ -e "$newpath" ]
    do
      pictures_file_no_ext="${pictures_file%.*}"
      pictures_file_ext="${pictures_file##*.}"
      newpath="$newdir/$pictures_file_no_ext-$collisioncount.$pictures_file_ext"
      collisioncount=$((collisioncount+1))
    done
    mv -i "$filename" "$newpath"
done

and another script, process_videos.sh, that is exactly the same except that the string pictures is replaced by videos everywhere.
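
Since the two differ only by that one string, the second script can be generated rather than maintained by hand (hypothetical file names):

sed 's/pictures/videos/g' process_pictures.sh > process_videos.sh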

I could then chain these scripts with find statements like this:

cd "~/Dropbox/mnemosyne/ingest/Takeout/Google Photos"
find . -iname '*.jpg' -print0 | xargs -0 ~/path/to/process_pictures.sh

This moved all the matching files to the correct YYYY/MM/DD folder within the pictures folder. I then did the same for each file type.
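
Repeating this for each file type can itself be a small loop; a sketch based on my extension table above (pdf would need its own projects variant):

for ext in cr2 gif png
do
  find . -type f -iname "*.$ext" -print0 | xargs -0 ~/path/to/process_pictures.sh
done
for ext in avi m4v mov mp4
do
  find . -type f -iname "*.$ext" -print0 | xargs -0 ~/path/to/process_videos.sh
done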

Double-checking

At this point I made sure I didn't have any files remaining besides the .json files. Then I did a little review of my mnemosyne/pictures and mnemosyne/videos folders to make sure they looked reasonable.
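
That leftover check is a one-liner; anything this prints still needs handling:

find ~/Dropbox/mnemosyne/ingest/Takeout -type f ! -iname '*.json'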

And then I was done!

Other lessons learned

  • Takeout includes shared libraries: One surprise for me was that Lauren's photos were included in my Google Takeout, because we share our Google Photos libraries. I didn't realize this until after I'd processed everything!

Footnotes:

[1] I figured this out by going to the root of my Google Photos consolidated takeout folder and running

find . -type f | perl -ne '/(\.[^.]+)$/ and print lc($1)' | sort | uniq -c

[2] Yep, apparently my Google Hangouts media was in the Google Photos takeout.
