April 24, 2019 From rOpenSci (https://deploy-preview-334--ropensci.netlify.app/blog/2019/04/24/pdftools-22/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
Last month we released a new version of pdftools and a new companion package, qpdf, for working with pdf files in R. This release introduces the ability to perform pdf transformations, such as splitting and combining pages from multiple files. Moreover, the pdf_data() function, which was introduced in pdftools 2.0, is now available on all major systems.
It is now possible to split, join, and compress pdf files with pdftools. For example, the pdf_subset() function creates a new pdf file with a selection of the pages from the input file:
# Load pdftools
library(pdftools)
# extract some pages
pdf_subset('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf',
  pages = 1:3, output = "subset.pdf")
# Should say 3
pdf_length("subset.pdf")
Similarly, pdf_combine() is used to join several pdf files into one:
# Generate another pdf
pdf("test.pdf")
plot(mtcars)
dev.off()
# Combine them with the other one
pdf_combine(c("test.pdf", "subset.pdf"), output = "joined.pdf")
# Should say 4
pdf_length("joined.pdf")
The split and join features are provided via a new package, qpdf, which wraps the qpdf C++ library. The main benefit of qpdf is that no external software (such as pdftk) is needed. The qpdf package is entirely self-contained and works reliably on all operating systems with zero system dependencies.
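The examples above cover subsetting and combining, but not splitting a file into single pages or compressing it. Here is a minimal sketch of those two operations, calling the qpdf package directly and assuming it is installed. The file names are made up for illustration, and the input is generated locally so nothing is downloaded.

```r
library(qpdf)

# Create a small three-page pdf to work with (hypothetical file name)
input <- file.path(tempdir(), "plots.pdf")
pdf(input)
plot(mtcars$mpg)
plot(mtcars$hp)
plot(mtcars$wt)
dev.off()

# Split into one file per page; the return value holds the new file paths
pages <- pdf_split(input, output = file.path(tempdir(), "page"))
length(pages)

# Write a compressed copy; the return value is the path of the new file
small <- pdf_compress(input, output = file.path(tempdir(), "plots-small.pdf"))
pdf_length(small)
```

How much smaller the compressed file is depends on the input, so it is worth comparing file.size() of both files before relying on it.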
The pdftools 2.0 announcement post from December introduced the new pdf_data() function for extracting individual text boxes from pdf files. However, it was noted that this function was not yet available on most Linux distributions because it requires a recent fix from poppler 0.73.
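As a rough illustration of what pdf_data() gives you, the sketch below builds a one-page pdf locally and reads the text boxes back. It assumes that poppler_config() reports a has_pdf_data field, which is used here to skip the call on systems whose poppler is too old. The file name and the sample text are hypothetical.

```r
library(pdftools)

# Assumption: poppler_config() exposes a has_pdf_data flag
if (isTRUE(poppler_config()$has_pdf_data)) {
  # Generate a one-page pdf containing a few words (hypothetical file name)
  path <- file.path(tempdir(), "words.pdf")
  pdf(path)
  plot.new()
  text(0.5, 0.5, "hello pdf_data")
  dev.off()

  # pdf_data() returns a list with one data frame per page;
  # each row describes one text box and its position on the page
  boxes <- pdf_data(path)
  str(boxes[[1]])
} else {
  message("This build of poppler does not support pdf_data()")
}
```

Each row corresponds to one word-like box, so reassembling lines or table cells from this output is left to the caller.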
I am happy to say that this should soon work on all major Linux distributions. Ubuntu has upgraded to poppler 0.74 on Ubuntu Disco, which was released this week. I also created a PPA for Ubuntu 16.04 (Xenial) and 18.04 (Bionic) with backports of poppler 0.74. This makes it possible to use pdf_data() on Ubuntu LTS servers, including Travis:
sudo add-apt-repository ppa:cran/poppler
sudo apt-get update
sudo apt-get install libpoppler-cpp-dev
Moreover, the upcoming Fedora 30 will ship with poppler-devel 0.73.
Finally, the upcoming Debian “Buster” release will ship with poppler 0.71, but the Debian maintainers were nice enough to let me backport the required patch from poppler 0.73, so pdf_data() will work on Debian (and hence CRAN) as well!