Extracting Embedded Images from PDF: A Step-by-Step Guide

Sometimes you need to extract the images embedded in a PDF file. The methods below show how to do it from the command line.

Install Poppler on Linux

Poppler is a PDF rendering library based on the xpdf-3.0 code base.

The package built around this library ships the command-line tools we will use to work with PDF files.

The simplest way to install it is from your distribution's official repositories, although you can also compile it from source or download prebuilt binaries.

On Debian, Ubuntu, and derivatives such as Linux Mint, run:

sudo apt update
sudo apt install poppler-utils
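To confirm the installation worked, a small check like the following can help. It is only a sketch: have_tool is a hypothetical helper name, and the check simply asks the shell whether pdfimages can be found on the PATH.

```shell
# have_tool is a hypothetical helper name, not part of Poppler.
have_tool() {
    # command -v succeeds only if the name resolves to something runnable
    command -v "$1" >/dev/null 2>&1
}

if have_tool pdfimages; then
    echo "pdfimages is available"
else
    echo "pdfimages not found; install poppler-utils first"
fi
```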

Once the package is installed, we can use one of its tools, pdfimages, to accomplish the task.

Extract embedded images from a PDF file

The procedure is simple. The general syntax is:

pdfimages -all filename.pdf images/prefix

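The above command extracts every image in filename.pdf and writes it into the images directory, using prefix as the start of each file name. The images directory must already exist, because pdfimages does not create it. If you give a bare prefix with no directory, the files are written to the current working directory. You can also use absolute paths for both the PDF file and the output.

As for the prefix, choose one that identifies the images well. The output format is not set by the prefix but by the options: with -all, JPEG, JPEG 2000, JBIG2 and CCITT images are written in their native format, CMYK images as TIFF, and everything else as PNG, which is lossless.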
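Because the output directory has to exist before pdfimages runs, the two steps can be wrapped together. The following is a minimal sketch under that assumption; extract_to_dir is a hypothetical wrapper name and its three arguments are placeholders.

```shell
# extract_to_dir is a hypothetical wrapper; all arguments are placeholders.
# Usage: extract_to_dir <pdf-file> <output-dir> <prefix>
extract_to_dir() (
    pdf=$1
    outdir=$2
    prefix=$3
    # create the output directory first, since pdfimages will not
    mkdir -p "$outdir" || exit 1
    pdfimages -all "$pdf" "$outdir/$prefix"
)

# Example call, left commented out because filename.pdf is a placeholder:
# extract_to_dir filename.pdf images prefix
```

The function body runs in a subshell, so its variables do not leak into the calling shell.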

For example, with sample as the prefix, the command looks like this:

pdfimages -all filename.pdf sample

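This produces files named sample-nnn.ext in the current directory, where nnn is the image number and ext depends on how each image is stored in the PDF (for example sample-000.png or sample-001.jpg).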
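To see how many files to expect before writing anything, pdfimages also has a -list option that prints a table of the images instead of extracting them. The sketch below counts the rows of that table; count_images is a hypothetical helper name, and it assumes the first two lines of the listing are a column header and a dashed separator.

```shell
# count_images is a hypothetical helper name.
# Usage: count_images <pdf-file>
count_images() {
    # -list prints a table instead of writing files; skip the two
    # header lines (assumed) and count the remaining rows
    pdfimages -list "$1" | tail -n +3 | wc -l | tr -d ' '
}

# Example calls, commented out because filename.pdf is a placeholder:
# pdfimages -list filename.pdf
# count_images filename.pdf
```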

If you only want the JPEG-encoded images written as .jpg files, use the -j option on its own (-all already includes it):

pdfimages -j filename.pdf sample

Note that -j might not give the results you expect. Here is what the man page says about it:

"Normally, all images are written as PBM (for monochrome images) or PPM (for non-monochrome images) files. With this option, images in DCT format are saved as JPEG files. All non-DCT images are saved in PBM/PPM format as usual."
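Since -j can leave a mix of .jpg and .pbm/.ppm files, a quick tally by extension shows what was actually written. This is a sketch only; count_by_ext is a hypothetical helper name, and it assumes the output files start with the prefix followed by a dash.

```shell
# count_by_ext is a hypothetical helper name.
# Usage: count_by_ext <prefix>   (for example: count_by_ext sample)
count_by_ext() {
    for f in "$1"-*; do
        # skip the unexpanded pattern when nothing matches
        [ -e "$f" ] || continue
        # keep only the text after the last dot, i.e. the extension
        printf '%s\n' "${f##*.}"
    done | sort | uniq -c
}
```

Each output line is a count followed by an extension, so a large number next to ppm means most images in the PDF are not DCT-encoded.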

More options available for extracting images

The commands above extract the images from every page, but often we only want a range of pages, which matters when the file is very long.

For this, the -f and -l options set the first and last pages to extract images from:

pdfimages -f 1 -l 5 -png filename.pdf images

This is perhaps the most useful option because it allows us to limit the output files.
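If the page range comes from user input or another script, it can be worth checking it before calling pdfimages. The following is a minimal sketch; extract_pages is a hypothetical wrapper name, and the check only rejects a range whose first page comes after its last.

```shell
# extract_pages is a hypothetical wrapper name.
# Usage: extract_pages <pdf-file> <first-page> <last-page> <prefix>
extract_pages() (
    pdf=$1
    first=$2
    last=$3
    prefix=$4
    # refuse an inverted range before calling pdfimages
    if [ "$first" -gt "$last" ]; then
        echo "first page ($first) is after last page ($last)" >&2
        exit 1
    fi
    pdfimages -f "$first" -l "$last" -png "$pdf" "$prefix"
)

# Example call, commented out because filename.pdf is a placeholder:
# extract_pages filename.pdf 1 5 images
```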

Another very interesting option is -p, which includes page numbers in the output file names:

pdfimages -f 1 -l 5 -png -p filename.pdf images
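With page numbers in the file names, it becomes easy to see which pages contributed images. The sketch below counts files per page; images_per_page is a hypothetical helper name, and it assumes names of the form prefix-PPP-NNN.ext, where PPP is the page number and NNN the image number.

```shell
# images_per_page is a hypothetical helper name; it assumes file names
# of the form <prefix>-PPP-NNN.<ext>, where PPP is the page number.
# Usage: images_per_page <prefix>   (for example: images_per_page images)
images_per_page() {
    for f in "$1"-*-*.*; do
        # skip the unexpanded pattern when nothing matches
        [ -e "$f" ] || continue
        # drop the prefix and its dash, leaving PPP-NNN.ext
        rest=${f#"$1"-}
        # keep only the page number before the next dash
        printf '%s\n' "${rest%%-*}"
    done | sort | uniq -c
}
```

Each output line is a count followed by a page number.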

Thank you for reading this blog post.