4/21/2020
PDFPlumberPDFPython - Python | CTOLib
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
v0.5.15
PDFPlumber v0.5.13
Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual
debugging.
Works best on machine-generated, rather than scanned, PDFs. Built on pdfminer and pdfminer.six.
Currently tested on Python 2.7, 3.1, 3.4, 3.5, and 3.6.
Table of Contents
Installation
Command line interface
Python library
Visual debugging
Extracting tables
Extracting form values
Demonstrations
Acknowledgments / Contributors
Contributing
Installation
pip install pdfplumber
To use pdfplumber's visual-debugging tools, you'll also need to have ImageMagick installed on your computer. Installation instructions here.
Command line interface
Basic example
curl "https://cdn.rawgit.com/jsvine/pdfplumber/master/examples/pdfs/background-checks.pdf" > background-checks.pdf
pdfplumber < background-checks.pdf > background-checks.csv
The output will be a CSV containing info about every character, line, and rectangle in the PDF.
Options
Argument | Description
--format [format] | csv or json. The json format returns slightly more information; it includes PDF-level metadata and height/width information about each page.
--pages [list of pages] | A space-delimited, 1-indexed list of pages or hyphenated page ranges. E.g., 1, 11-15, which would return data for pages 1, 11, 12, 13, 14, and 15.
--types [list of object types to extract] | Choices are char, anno, line, curve, rect, rect_edge. Defaults to char, anno, line, curve, rect.
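As a rough illustration of the --pages format, the sketch below expands a list of page tokens into explicit page numbers. The function name parse_page_spec is hypothetical, and this is not pdfplumber's own argument parser; it only mirrors the documented behavior for an input such as 1, 11-15.

```python
def parse_page_spec(tokens):
    """Expand tokens such as ["1", "11-15"] into a list of 1-indexed page numbers."""
    pages = []
    for token in tokens:
        # Tolerate a trailing comma, as in the "1, 11-15" example.
        token = token.strip().rstrip(",")
        if "-" in token:
            start, end = (int(part) for part in token.split("-", 1))
            pages.extend(range(start, end + 1))  # hyphenated ranges are inclusive
        else:
            pages.append(int(token))
    return pages


print(parse_page_spec("1, 11-15".split()))  # [1, 11, 12, 13, 14, 15]
```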
Python library
Basic example
import pdfplumber
with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])
Loading a PDF
pdfplumber provides two main ways to load a PDF:
pdfplumber.open("path/to/file.pdf")
pdfplumber.load(file_like_object)
Both methods return an instance of the pdfplumber.PDF class.
To load a password-protected PDF, pass the password keyword argument, e.g., pdfplumber.open("file.pdf", password="test").
The pdfplumber.PDF class
The top-level pdfplumber.PDF class represents a single PDF and has two main properties:
Property | Description
.metadata | A dictionary of metadata key/value pairs, drawn from the PDF's Info trailers. Typically includes "CreationDate," "ModDate," "Producer," et cetera.
.pages | A list containing one pdfplumber.Page instance per page loaded.
The pdfplumber.Page class
The pdfplumber.Page class is at the core of pdfplumber . Most things you'll do with pdfplumber will revolve around this class.
It has these main properties:
Property | Description
.page_number | The sequential page number, starting with 1 for the first page, 2 for the second, and so on.
.width | The page's width.
.height | The page's height.
.objects / .chars / .lines / .rects | Each of these properties is a list, and each list contains one dictionary for each such object embedded on the page. For more detail, see "Objects" below.
... and these main methods:
Method | Description
.crop(bounding_box) | Returns a version of the page cropped to the bounding box, which should be expressed as a 4-tuple with the values (x0, top, x1, bottom). Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box.
.within_bbox(bounding_box) | Similar to .crop, but only retains objects that fall entirely within the bounding box.
.filter(test_function) | Returns a version of the page with only the .objects for which test_function(obj) returns True.
.extract_text(x_tolerance=0, y_tolerance=0) | Collates all of the page's character objects into a single string. Adds spaces where the difference between the x1 of one character and the x0 of the next is greater than x_tolerance. Adds newline characters where the difference between the doctop of one character and the doctop of the next is greater than y_tolerance.
.extract_words(x_tolerance=0, y_tolerance=0) | Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where the difference between the x1 of one character and the x0 of the next is less than or equal to x_tolerance and where the difference between the doctop of one character and the doctop of the next is less than or equal to y_tolerance.
.extract_tables(table_settings) | Extracts tabular data from the page. For more details see "Extracting tables" below.
.to_image(**conversion_kwargs) | Returns an instance of the PageImage class. For more details, see "Visual debugging" below. For conversion_kwargs, see here.
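The tolerance rules described for .extract_text can be sketched in plain Python. This is a minimal illustration, not pdfplumber's implementation: collate_chars is a hypothetical helper, the chars are assumed to arrive already in reading order, and each char is a dict carrying only the text, x0, x1, and doctop keys.

```python
def collate_chars(chars, x_tolerance=0, y_tolerance=0):
    """Join char dicts into a string, adding spaces and newlines by tolerance."""
    pieces = []
    prev = None
    for char in chars:
        if prev is not None:
            if abs(char["doctop"] - prev["doctop"]) > y_tolerance:
                pieces.append("\n")  # vertical jump larger than y_tolerance: new line
            elif char["x0"] - prev["x1"] > x_tolerance:
                pieces.append(" ")   # horizontal gap larger than x_tolerance: new word
        pieces.append(char["text"])
        prev = char
    return "".join(pieces)


sample_chars = [
    {"text": "H", "x0": 0, "x1": 5, "doctop": 10},
    {"text": "i", "x0": 5, "x1": 8, "doctop": 10},
    {"text": "y", "x0": 12, "x1": 17, "doctop": 10},
    {"text": "o", "x0": 0, "x1": 5, "doctop": 30},
]
print(repr(collate_chars(sample_chars, x_tolerance=1)))  # 'Hi y\no'
```

Raising x_tolerance above the 4-unit gap between "i" and "y" would merge them into one word, which is the same knob that .extract_words exposes.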
Objects
Each instance of pdfplumber.PDF and pdfplumber.Page provides access to five types of PDF objects. The following
properties each return a Python list of the matching objects:
.chars , each representing a single text character.
.annos , each representing a single annotation-text character.
.lines , each representing a single 1-dimensional line.
.rects , each representing a single 2-dimensional rectangle.
.curves , each representing a series of connected points.
Each object is represented as a simple Python dict , with the following properties:
char / anno properties
Property | Description
page_number | Page number on which this character was found.
text | E.g., "z", or "Z" or " ".
fontname | Name of the character's font face.
size | Font size.
adv | Equal to text width * the font size * scaling factor.
upright | Whether the character is upright.
height | Height of the character.
width | Width of the character.
x0 | Distance of left side of character from left side of page.
x1 | Distance of right side of character from left side of page.
y0 | Distance of bottom of character from bottom of page.
y1 | Distance of top of character from bottom of page.
top | Distance of top of character from top of page.
bottom | Distance of bottom of the character from top of page.
doctop | Distance of top of character from top of document.
object_type | "char" / "anno"
line properties
Property | Description
page_number | Page number on which this line was found.
height | Height of line.
width | Width of line.
x0 | Distance of left-side extremity from left side of page.
x1 | Distance of right-side extremity from left side of page.
y0 | Distance of bottom extremity from bottom of page.
y1 | Distance of top extremity from bottom of page.
top | Distance of top of line from top of page.
bottom | Distance of bottom of the line from top of page.
doctop | Distance of top of line from top of document.
linewidth | Thickness of line.
object_type | "line"
rect properties
Property | Description
page_number | Page number on which this rectangle was found.
height | Height of rectangle.
width | Width of rectangle.
x0 | Distance of left side of rectangle from left side of page.
x1 | Distance of right side of rectangle from left side of page.
y0 | Distance of bottom of rectangle from bottom of page.
y1 | Distance of top of rectangle from bottom of page.
top | Distance of top of rectangle from top of page.
bottom | Distance of bottom of the rectangle from top of page.
doctop | Distance of top of rectangle from top of document.
linewidth | Thickness of line.
object_type | "rect"
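The property tables give each object's vertical position twice: y0 and y1 are measured from the bottom of the page, while top, bottom, and doctop are measured from the top. The sketch below shows how those values relate. The helper name is hypothetical, and it assumes pages are stacked top to bottom with no gaps, so that doctop is top plus the combined height of all earlier pages; that stacking rule is an assumption for illustration, not something stated in the tables.

```python
def top_origin_coords(y0, y1, page_height, earlier_pages_height=0):
    """Convert bottom-origin y0/y1 into top-origin top/bottom/doctop."""
    top = page_height - y1      # y1 is the upper edge, measured from the page bottom
    bottom = page_height - y0   # y0 is the lower edge, measured from the page bottom
    # Assumption: pages are stacked with no gaps between them.
    doctop = earlier_pages_height + top
    return {"top": top, "bottom": bottom, "doctop": doctop}


# A rectangle on page 2 of a document whose pages are all 792 units tall.
rect_on_page_2 = top_origin_coords(y0=700, y1=710, page_height=792,
                                   earlier_pages_height=792)
print(rect_on_page_2)  # {'top': 82, 'bottom': 92, 'doctop': 874}
```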
curve properties
Property | Description
page_number | Page number on which this curve was found.
points | Points, as a list of (x, top) tuples, describing the curve.
height | Height of curve's bounding box.
width | Width of curve's bounding box.
x0 | Distance of curve's left-most point from left side of page.
x1 | Distance of curve's right-most point from left side of the page.
y0 | Distance of curve's lowest point from bottom of page.
y1 | Distance of curve's highest point from bottom of page.
top | Distance of curve's highest point from top of page.
bottom | Distance of curve's lowest point from top of page.
doctop | Distance of curve's highest point from top of document.
linewidth | Thickness of line.
object_type | "curve"
Additionally, both pdfplumber.PDF and pdfplumber.Page provide access to two derived lists of objects: .rect_edges (which
decomposes each rectangle into its four lines) and .edges (which combines .rect_edges with .lines ).
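To make the two derived lists concrete, here is a minimal sketch of decomposing a rect dict into four edge dicts and combining them with lines. The helper names and the "orientation" key are illustrative assumptions; pdfplumber's real edge dicts may carry different or additional keys.

```python
def rect_to_edges(rect):
    """Decompose a rect dict into its four edges (top, bottom, left, right)."""
    x0, x1, top, bottom = rect["x0"], rect["x1"], rect["top"], rect["bottom"]

    def edge(ex0, etop, ex1, ebottom, orientation):
        return {
            "object_type": "rect_edge",
            "x0": ex0, "x1": ex1, "top": etop, "bottom": ebottom,
            "width": ex1 - ex0, "height": ebottom - etop,
            "orientation": orientation,  # hypothetical key: "h" or "v"
        }

    return [
        edge(x0, top, x1, top, "h"),        # top edge
        edge(x0, bottom, x1, bottom, "h"),  # bottom edge
        edge(x0, top, x0, bottom, "v"),     # left edge
        edge(x1, top, x1, bottom, "v"),     # right edge
    ]


def all_edges(rects, lines):
    """Combine the edges of every rect with the explicit lines."""
    rect_edges = [e for rect in rects for e in rect_to_edges(rect)]
    return rect_edges + list(lines)


sample_rect = {"x0": 10, "x1": 110, "top": 20, "bottom": 70}
sample_line = {"object_type": "line", "x0": 0, "x1": 200, "top": 5, "bottom": 5}
print(len(all_edges([sample_rect], [sample_line])))  # 5
```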
Visual debugging
Creating a PageImage with .to_image()
To turn any page (including cropped pages) into a PageImage object, call my_page.to_image(). You can optionally pass a
resolution={integer} keyword argument, which defaults to 72. E.g.:
im = my_pdf.pages[0].to_image(resolution=150)
PageImage objects play nicely with IPython/Jupyter notebooks; they automatically render as cell outputs.
Basic PageImage methods
Method | Description
im.reset() | Clears anything you've drawn so far.
im.copy() | Copies the image to a new PageImage object.
im.save(path_or_fileobject, format="PNG") | Saves the annotated image.
Drawing methods
You can pass explicit coordinates or any pdfplumber PDF object (e.g., char, line, rect) to these methods.
Single-object method | Bulk method | Description
im.draw_line(line, stroke={color}, stroke_width=1) | im.draw_lines(list_of_lines, **kwargs) | Draws a line from a line, curve, or a 2-tuple of 2-tuples (e.g., ((x, y), (x, y))).
im.draw_vline(location, stroke={color}, stroke_width=1) | im.draw_vlines(list_of_locations, **kwargs) | Draws a vertical line at the x-coordinate indicated by location.
im.draw_hline(location, stroke={color}, stroke_width=1) | im.draw_hlines(list_of_locations, **kwargs) | Draws a horizontal line at the y-coordinate indicated by location.
im.draw_rect(bbox_or_obj, fill={color}, stroke={color}, stroke_width=1) | im.draw_rects(list_of_rects, **kwargs) | Draws a rectangle from a rect, char, etc., or 4-tuple bounding box.
im.draw_circle(center_or_obj, radius=5, fill={color}, stroke={color}) | im.draw_circles(list_of_circles, **kwargs) | Draws a circle at (x, y) coordinate or at the center of a char, rect, etc.
Note: The methods above are built on Pillow's ImageDraw methods, but the parameters have been tweaked for
consistency with SVG's fill / stroke / stroke_width nomenclature.
Troubleshooting ImageMagick on Debian-based systems
If you're using pdfplumber on a Debian-based system and encounter a PolicyError , you may be able to fix it by changing
the following line in /etc/ImageMagick-6/policy.xml from this:
... to this:
(More details about policy.xml available here.)
Extracting tables
pdfplumber's approach to table detection borrows heavily from Anssi Nurminen's master's thesis, and is inspired by
Tabula. It works like this:
1. For any given PDF page, find the lines that are (a) explicitly defined and/or (b) implied by the alignment of words on
the page.
2. Merge overlapping, or nearly-overlapping, lines.
3. Find the intersections of all those lines.
4. Find the most granular set of rectangles (i.e., cells) that use these intersections as their vertices.
5. Group contiguous cells into tables.
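Steps 2 through 4 above can be sketched in a heavily simplified form. The code below is an illustration under stated assumptions, not pdfplumber's algorithm: every line is assumed to span the full page (so each one is described by a single coordinate), snap and find_cells are hypothetical helpers, and the grouping of cells into tables (step 5) is left out.

```python
from itertools import product


def snap(values, tolerance=3):
    """Merge coordinates that lie within `tolerance` of their neighbor (step 2)."""
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)   # close to the previous value: same line
        else:
            clusters.append([value])     # far enough away: start a new line
    return [sum(cluster) / len(cluster) for cluster in clusters]


def find_cells(xs, ys):
    """Find the smallest rectangles whose corners are all intersections (steps 3-4)."""
    xs, ys = sorted(xs), sorted(ys)
    intersections = set(product(xs, ys))  # full-span lines cross at every (x, y)
    cells = []
    for top, bottom in zip(ys, ys[1:]):
        for x0, x1 in zip(xs, xs[1:]):
            corners = {(x0, top), (x1, top), (x0, bottom), (x1, bottom)}
            if corners <= intersections:
                cells.append((x0, top, x1, bottom))
    return cells


vertical_xs = snap([10, 11, 60, 110])      # [10.5, 60.0, 110.0]
horizontal_ys = snap([20, 40, 41.5, 60])   # [20.0, 40.75, 60.0]
cells = find_cells(vertical_xs, horizontal_ys)
print(len(cells), cells[0])  # 4 (10.5, 20.0, 60.0, 40.75)
```

With real line segments that do not span the page, the corner check in find_cells is what filters out rectangles that are not fully bounded.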
Table-extraction methods
pdfplumber.Page objects can call the following table methods:
Method | Description
.find_tables(table_settings={}) | Returns a list of Table objects. The Table object provides access to the .cells, .rows, and .bbox properties, as well as the .extract(x_tolerance=3, y_tolerance=3) method.
.extract_tables(table_settings={}) | Returns the text extracted from all tables found on the page, represented as a list of lists of lists, with the structure table -> row -> cell.
.extract_table(table_settings={}) | Returns the text extracted from the largest table on the page, represented as a list of lists, with the structure row -> cell. (If multiple tables have the same size, as measured by the number of cells, this method returns the table closest to the top of the page.)
.debug_tablefinder(table_settings={}) | Returns an instance of the TableFinder class, with access to the .edges, .intersections, .cells, and .tables properties.
For example:
import pdfplumber

with pdfplumber.open("path/to/my.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()  # list of rows, each a list of cell strings
Click here for a more detailed example.
Table-extraction settings
By default, extract_tables uses the page's vertical and horizontal lines (or rectangle edges) as cell-separators. But the
method is highly customizable via the table_settings argument. The possible settings, and their defaults:
{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}
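One way to read the defaults above is as a base dictionary that a caller's table_settings overrides key by key. The sketch below illustrates that reading. resolve_table_settings is a hypothetical helper here, and two of its behaviors are assumptions rather than documented facts: rejecting unknown keys, and letting a None axis-specific tolerance fall back to the general text_tolerance or intersection_tolerance value.

```python
DEFAULT_TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}


def resolve_table_settings(user_settings=None):
    """Overlay user settings on the defaults and fill in None tolerances."""
    settings = dict(DEFAULT_TABLE_SETTINGS)
    for key, value in (user_settings or {}).items():
        if key not in DEFAULT_TABLE_SETTINGS:
            # Assumption: a misspelled key is better reported than ignored.
            raise ValueError(f"Unrecognized table setting: {key!r}")
        settings[key] = value
    # Assumption: a None per-axis tolerance inherits the general tolerance.
    for prefix in ("text", "intersection"):
        for axis in ("x", "y"):
            key = f"{prefix}_{axis}_tolerance"
            if settings[key] is None:
                settings[key] = settings[f"{prefix}_tolerance"]
    return settings


resolved = resolve_table_settings({"vertical_strategy": "text", "text_tolerance": 5})
print(resolved["vertical_strategy"], resolved["text_x_tolerance"])  # text 5
```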
Setting | Description
"vertical_strategy" | Either "lines", "lines_strict", "text", or "explicit". See explanation below.
"horizontal_strategy" | Either "lines", "lines_strict", "text", or "explicit". See explanation below.
"explicit_vertical_lines" | A list of vertical lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers, indicating the x coordinate of a line the full height of the page, or a dictionary describing the line, with at least the following keys: x, top, bottom.
"explicit_horizontal_lines" | A list of horizontal lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers, indicating the y coordinate of a line the full width of the page, or a dictionary describing the line, with at least the following keys: top, x0, x1.
"snap_tolerance" | Parallel lines within snap_tolerance pixels will be "snapped" to the same horizontal or vertical position.
"join_tolerance" | Line segments on the same infinite line, and whose ends are within join_tolerance of one another, will be "joined" into a single line segment.
"edge_min_length" | Edges shorter than edge_min_length will be discarded before attempting to reconstruct the table.
"min_words_vertical" | When using "vertical_strategy": "text", at least min_words_vertical words must share the same alignment.
"min_words_horizontal" | When using "horizontal_strategy": "text", at least min_words_horizontal words must share the same alignment.
"keep_blank_chars" | When using the text strategy, consider " " chars to be parts of words and not word-separators.
"text_tolerance", "text_x_tolerance", "text_y_tolerance" | When the text strategy searches for words, it will expect the individual letters in each word to be no more than text_tolerance pixels apart.