TITLE: Extracting pages with colour from a PDF
DATE: 2021-10-26
AUTHOR: John L. Godlee
====================================================================


I wanted to print my PhD thesis so I could have a version to 
annotate before my viva. The cost at my local copy shop to print a 
full colour version of the thesis would have been somewhere around 
£60, while a black and white copy only cost about £15. It wasn't 
necessary to print the whole document in colour as only pages with 
figures contained any colour, so I wanted to find a way to 
automatically extract the pages which did contain colour and create 
a new document containing only those pages, so I could print those 
in colour separately.

I created a shell script that uses ghostscript (gs) to find the 
colour pages, and pdfjam to extract those pages and create a new 
document:

  [ghostscript (gs)]: https://ghostscript.com/
  [pdfjam]: https://github.com/rrthomas/pdfjam

    #!/usr/bin/env sh

    # Extract colour pages from a PDF, then create a new PDF
    # containing only those pages. Useful for saving on printing costs.

    if [ "$#" -ne 2 ]; then
        echo "Usage: $0 <input.pdf> <output.pdf>"
        exit 2
    fi

    if [ ! -f "$1" ]; then
        echo "Input file not found"
        exit 2
    fi

    pages=$(gs -o - -sDEVICE=inkcov "${1}" | tail -n +6 |
        sed '/^Page*/N;s/\n//' |
        sed -E '/Page [0-9]+ 0.00000  0.00000  0.00000  / d' |
        grep -Eo '^Page\s[0-9]+' | awk '{print $2}' |
        tr '\n' ',' | sed 's/,$//g')

    if [ -z "${pages}" ]; then
        echo "File has no colour pages"
        exit 2
    fi

    pdfjam "${1}" "${pages}" -o "${2}" > /dev/null 2>&1

The first part of the script with the if statements simply checks 
whether the parameters passed to the script are valid. The script 
needs to be fed an existing input file, and an output file name.

The pages variable is built using the inkcov device, which is 
available in gs from version 9.05. The inkcov device reports the 
cyan, magenta, yellow and black ink coverage separately for each 
page, so all that needs to be done is to exclude pages with zero 
cyan, magenta and yellow coverage (i.e. pages which contain only 
black), and then format the remaining page numbers as the 
comma-separated list that pdfjam expects. If no colour pages are 
found then the script exits without creating a new PDF. Otherwise, 
pdfjam takes the input filename, the page list, and the output 
filename and creates a new PDF document.
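
To make the filtering step easier to follow, here is a minimal 
sketch of the same idea written as a single awk program. It is not 
the script above, just an illustration. It assumes that inkcov 
prints a "Page N" line followed by a line holding the cyan, magenta, 
yellow and black coverage and the text "CMYK OK", which is the 
layout the sed commands above rely on. The function name 
colour_pages is made up for this example.

```shell
#!/usr/bin/env sh

# colour_pages: hypothetical helper, not part of the original script.
# Reads inkcov-style text on stdin and prints a comma-separated list
# of the pages whose cyan, magenta or yellow coverage is non-zero.
# Assumed input layout (two lines per page):
#   Page 1
#    0.00000  0.00000  0.00000  0.02230 CMYK OK
colour_pages() {
    awk '
        # A "Page N" line: remember the current page number.
        $1 == "Page" { page = $2; next }

        # A coverage line: fields 1-4 are C, M, Y, K and field 5 is
        # the literal "CMYK". Keep the page if C + M + Y is above zero.
        $5 == "CMYK" && ($1 + $2 + $3) > 0 {
            printf "%s%s", sep, page
            sep = ","
        }

        # Finish the list with a newline if any page was printed.
        END { if (sep != "") print "" }
    '
}
```

Under those assumptions it could be used in place of the longer 
pipeline, for example:

    pages=$(gs -o - -sDEVICE=inkcov "${1}" | colour_pages)

Lines that are neither a "Page N" line nor a coverage line are 
ignored, so there is no need to count how many lines to skip at the 
top of the gs output.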