5  Photo Cleaning For Sharing

5.1

5.2 Background

When sharing images, it is useful to be able to…

5.2.0.1 Remove EXIF data, which can include GPS locations where the photos were taken

ImageMagick is used for this.

5.2.0.2 Rename files to have nonsensical names

This uses either date on Linux or gdate on OSX. gdate is part of the coreutils package.

This tutorial is aimed at OSX users and assumes you have Homebrew (brew) installed. Using it on a PC through the Windows Subsystem for Linux should be easy: replace brew with apt-get or similar.

5.3 Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mixtools)
mixtools package, version 2.0.0, Released 2022-12-04
This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772 and the Chan Zuckerberg Initiative: Essential Open Source Software for Science (Grant No. 2020-255193).
set.seed(23423346)

5.4 Copy some photos into the working directory

system("rm data/photos_for_cleaning/*.JPG")
system("unzip data/photos_for_cleaning/photos_for_cleaning.zip -d data/photos_for_cleaning/")

5.5 Remove EXIF data

Install ImageMagick (keep this line commented out with a hash if you already have ImageMagick, unless you want a long wait)

#system("brew install imagemagick")

Use ImageMagick’s mogrify to rewrite each file in place with the EXIF data stripped out. This has the added benefit of giving each file a new creation datestamp and MD5 hash.

Before running this on thousands of photos, we will test that it works by reading the EXIF metadata from a single file.
Let’s see what files are available in the folder data/photos_for_cleaning.

pictures = list.files(path = "data/photos_for_cleaning/", pattern = "\\.JPG$", full.names = TRUE)

Now use the identify command to look at the content of the EXIF for one file

system(str_c(r"(identify -format '%[EXIF:*]' )", pictures[1]))

You’ll see a load of data from the EXIF. If you want to capture anything, like the aperture settings etc, now is the time to do it. It should be simple to design a loop and function that will capture this info.
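The loop and function mentioned above could look something like the following. This is only a sketch: the helper names parse_exif_lines and read_exif are made up for illustration, and it assumes identify prints one exif:Key=Value pair per line, which is worth checking against the output you saw on your own files.

```r
# Hypothetical helper: turn "exif:Key=Value" lines into a named character vector.
parse_exif_lines <- function(lines) {
  lines <- lines[grepl("=", lines, fixed = TRUE)]      # keep only key=value lines
  keys <- sub("=.*$", "", lines)                       # text before the first "="
  values <- sub("^[^=]*=", "", lines)                  # text after the first "="
  keys <- sub("^exif:", "", keys, ignore.case = TRUE)  # drop the "exif:" prefix
  stats::setNames(values, keys)
}

# Hypothetical wrapper: run identify on one file and parse what it prints.
# Assumes ImageMagick's identify is on the PATH.
read_exif <- function(file) {
  out <- system2(
    "identify",
    c("-format", shQuote("%[EXIF:*]"), shQuote(file)),
    stdout = TRUE
  )
  parse_exif_lines(out)
}
```

Calling lapply(pictures, read_exif) before stripping would then give one named vector per photo, from which fields such as the aperture can be picked out by name.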

Next, strip the metadata from that file with ImageMagick’s mogrify

system(str_c("mogrify -strip ", pictures[1]))

Then rerun the identify command

system(str_c(r"(identify -format '%[EXIF:*]' )", pictures[1]))

You should now see an empty EXIF, i.e. nothing will be shown in the console. If you still see EXIF data, something went wrong.

To apply the command to all files in the folder, we won’t use the built-in mogrify -strip data/photos_for_cleaning/*.JPG because it halts when it hits a problem. Instead we will call it once per file from an R loop, wrapping each call in ‘try’.

pictures = list.files(path = "data/photos_for_cleaning/", pattern = "\\.JPG$", full.names = TRUE)

for (i in pictures) {
  # try() lets the loop carry on if one call fails
  try(system(str_c("mogrify -strip ", i)))
}
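Because a failing shell command usually reports a non-zero exit status rather than raising an R error, it can help to record the status of each call. The sketch below is one way to do that, not part of the original workflow: strip_all and failed_files are hypothetical names, and it assumes mogrify is on the PATH.

```r
# Hypothetical helper: run one command per file and keep each exit status.
strip_all <- function(files, command = "mogrify", args = "-strip") {
  vapply(
    files,
    function(f) {
      # try() guards against R-level errors; such a call is recorded as NA
      res <- try(system2(command, c(args, shQuote(f))), silent = TRUE)
      if (inherits(res, "try-error")) NA_integer_ else as.integer(res)
    },
    integer(1)
  )
}

# Hypothetical helper: names of the files whose call did not return status 0.
failed_files <- function(status) {
  names(status)[is.na(status) | status != 0]
}
```

Running status <- strip_all(pictures) followed by failed_files(status) would list any photos that need a second look before sharing.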

5.6 Rename all files

5.6.1 Define a function to create a unique ID

# Rename one file to a random ID: five letters, four digits, one letter
randomid <- function(n = 1, path, file) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  id <- paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
  file.rename(from = file, to = str_c(path, id, ".JPG"))
}

5.6.2 Apply the function, renaming all photos

path = "data/photos_for_cleaning/"
pictures = list.files(path = path, pattern = "\\.JPG$", full.names = TRUE)

for (i in pictures) {
  randomid(path = path, file = i)
}
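Each ID is drawn independently, so two photos could in principle receive the same name, and the second rename would then overwrite the first. With thousands of photos this is unlikely but not impossible. A minimal sketch of one way to guard against it is to draw all IDs up front and redraw any duplicates; make_unique_ids is a hypothetical helper, not part of the original code.

```r
# Hypothetical helper: draw n distinct IDs of the form AAAAA0000A.
make_unique_ids <- function(n) {
  ids <- character(0)
  while (length(ids) < n) {
    k <- n - length(ids)  # how many IDs are still missing
    letters5 <- do.call(paste0, replicate(5, sample(LETTERS, k, TRUE), FALSE))
    new_ids <- paste0(
      letters5,
      sprintf("%04d", sample(9999, k, TRUE)),
      sample(LETTERS, k, TRUE)
    )
    ids <- unique(c(ids, new_ids))  # drop duplicates, then redraw the shortfall
  }
  ids
}
```

The resulting vector could be passed to file.rename together with pictures, since file.rename accepts vectors of equal length for from and to.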

5.6.3 Flag any files with high luminance

df = tibble(
  pictures = list.files(path = "data/photos_for_cleaning/", pattern = "\\.JPG$", full.names = TRUE),
  luminance = NA_real_
)

for (i in seq_len(nrow(df))) {
  # Mean of the first channel after conversion to LAB, as reported by ImageMagick
  cmd <- str_c(
    r"(convert )", df$pictures[i],
    r"( -colorspace LAB -channel r -separate +channel -format "%[mean]\n" info: )"
  )
  df$luminance[i] <- as.numeric(system(cmd, intern = TRUE))
}

5.6.4 Run an Expectation Maximisation to find clusters

set.seed(12341)
em<-normalmixEM(df$luminance, k=2,maxit = 3000)
number of iterations= 6 

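Plot the results, with a dashed red line 3.29 standard deviations above the mean of the first component (the upper limit of a two-sided 99.9% interval)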

ggplot(df, aes(x = luminance)) +
  geom_histogram(binwidth = 0.007,color="white",fill="grey") +
  mapply(
    function(mean, sd, lambda, n, binwidth) {
      stat_function(
        fun = function(x) {
          (dnorm(x, mean = mean, sd = sd)) * n * binwidth * lambda
        }
      )
    },
    mean = em[["mu"]], #mean
    sd = em[["sigma"]], #standard deviation
    lambda = em[["lambda"]], #amplitude
    n = length(df$luminance), #sample size
    binwidth = 0.007 #binwidth used for histogram
  )+
  geom_vline(xintercept = em$mu[1]+(3.29*em$sigma[1]),lty=2,lwd=1,col="red")
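The dashed line sits at em$mu[1] + 3.29 * em$sigma[1]. The multiplier 3.29 is the z-score that bounds a two-sided 99.9% interval of a normal distribution. The sketch below shows where that number comes from and how the cutoff could be computed for other levels; z_for_level and upper_cutoff are hypothetical helper names, and using component 1 assumes it is the lower-luminance cluster, which is worth confirming by looking at em$mu.

```r
# z-score for the upper bound of a two-sided interval at the given level
z_for_level <- function(level) {
  stats::qnorm(1 - (1 - level) / 2)
}

# Hypothetical helper: upper cutoff for one normal component
upper_cutoff <- function(mu, sigma, level = 0.999) {
  mu + z_for_level(level) * sigma
}
```

With the fitted model, upper_cutoff(em$mu[1], em$sigma[1]) would give a data-driven threshold at roughly the position of the dashed line.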

5.6.5 Rename files with high luminance

for (i in seq_len(nrow(df))) {
  # Tag bright photos by inserting ".light" before the extension
  if (df$luminance[i] > 40000) {
    file.rename(from = df$pictures[i], to = gsub(x = df$pictures[i], pattern = ".JPG", replacement = ".light.JPG", fixed = TRUE))
  }
  message(str_c(df$pictures[i], " luminance = ", df$luminance[i]))
}
data/photos_for_cleaning//ERCVC3682S.JPG luminance = 21065.1
data/photos_for_cleaning//NPHTK8426K.JPG luminance = 20564.3
data/photos_for_cleaning//OYMPF5938B.JPG luminance = 10660.8
data/photos_for_cleaning//TGLWU3648Q.JPG luminance = 50894.2
data/photos_for_cleaning//VATLY3953X.JPG luminance = 52021
data/photos_for_cleaning//VRAXW6073Z.JPG luminance = 55293.9
data/photos_for_cleaning//VTEEG4048F.JPG luminance = 16634.8
data/photos_for_cleaning//WNQZB2575C.JPG luminance = 17054.4
data/photos_for_cleaning//YPZJU4391B.JPG luminance = 53881.3
data/photos_for_cleaning//YQFYO8330G.JPG luminance = 18454.5
data/photos_for_cleaning//ZIIYN8173Q.JPG luminance = 14055.4

5.7 Results

A file flagged as having high luminance is displayed here

knitr::include_graphics(gsub(x = df$pictures[which(df$luminance == max(df$luminance))], pattern = ".JPG", replacement = ".light.JPG", fixed = TRUE), error = TRUE)

A file with low luminance, which was not flagged, is displayed here

knitr::include_graphics(df$pictures[which(df$luminance == min(df$luminance))])