PDFPlumber: A Python library for extracting text and tables from PDF files

4/21/2020 PDFPlumberPDFPython - Python | CTOLib

# PDFPlumber v0.5.13

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

Works best on machine-generated, rather than scanned, PDFs. Built on `pdfminer` and `pdfminer.six`.

Currently tested on Python 2.7, 3.1, 3.4, 3.5, and 3.6.

## Table of Contents

- Installation
- Command line interface
- Python library
- Visual debugging
- Extracting tables
- Extracting form values
- Demonstrations
- Acknowledgments / Contributors
- Contributing

## Installation

    pip install pdfplumber

To use `pdfplumber`'s visual-debugging tools, you'll also need to have ImageMagick installed on your computer. Installation instructions here.

## Command line interface

### Basic example

    curl "https://cdn.rawgit.com/jsvine/pdfplumber/master/examples/pdfs/background-checks.pdf" > background-checks.pdf
    pdfplumber < background-checks.pdf > background-checks.csv

The output will be a CSV containing info about every character, line, and rectangle in the PDF.

### Options
| Argument | Description |
|---|---|
| `--format [format]` | `csv` or `json`. The `json` format returns slightly more information; it includes PDF-level metadata and height/width information about each page. |
| `--pages [list of pages]` | A space-delimited, `1`-indexed list of pages or hyphenated page ranges. E.g., `1, 11-15`, which would return data for pages 1, 11, 12, 13, 14, and 15. |
| `--types [list of object types to extract]` | Choices are `char`, `anno`, `line`, `curve`, `rect`, `rect_edge`. Defaults to `char`, `anno`, `line`, `curve`, `rect`. |

## Python library

### Basic example

    import pdfplumber

    with pdfplumber.open("path/to/file.pdf") as pdf:
        first_page = pdf.pages[0]
        print(first_page.chars[0])

### Loading a PDF

`pdfplumber` provides two main ways to load a PDF:

- `pdfplumber.open("path/to/file.pdf")`
- `pdfplumber.load(file_like_object)`

Both methods return an instance of the `pdfplumber.PDF` class.

To load a password-protected PDF, pass the `password` keyword argument, e.g., `pdfplumber.open("file.pdf", password = "test")`.

### The `pdfplumber.PDF` class

The top-level `pdfplumber.PDF` class represents a single PDF and has two main properties:

| Property | Description |
|---|---|
| `.metadata` | A dictionary of metadata key/value pairs, drawn from the PDF's `Info` trailers. Typically includes "CreationDate," "ModDate," "Producer," et cetera. |
| `.pages` | A list containing one `pdfplumber.Page` instance per page loaded. |

### The `pdfplumber.Page` class

The `pdfplumber.Page` class is at the core of `pdfplumber`. Most things you'll do with `pdfplumber` will revolve around this class. It has these main properties:

| Property | Description |
|---|---|
| `.page_number` | The sequential page number, starting with `1` for the first page, `2` for the second, and so on. |
| `.width` | The page's width. |
| `.height` | The page's height. |
| `.objects` / `.chars` / `.lines` / `.rects` | Each of these properties is a list, and each list contains one dictionary for each such object embedded on the page. For more detail, see "Objects" below. |

...
and these main methods:

| Method | Description |
|---|---|
| `.crop(bounding_box)` | Returns a version of the page cropped to the bounding box, which should be expressed as a 4-tuple with the values `(x0, top, x1, bottom)`. Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box. |
| `.within_bbox(bounding_box)` | Similar to `.crop`, but only retains objects that fall entirely within the bounding box. |
| `.filter(test_function)` | Returns a version of the page with only the `.objects` for which `test_function(obj)` returns `True`. |
| `.extract_text(x_tolerance=0, y_tolerance=0)` | Collates all of the page's character objects into a single string. Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`. |
| `.extract_words(x_tolerance=0, y_tolerance=0)` | Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` and where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. |
| `.extract_tables(table_settings)` | Extracts tabular data from the page. For more details see "Extracting tables" below. |
| `.to_image(**conversion_kwargs)` | Returns an instance of the `PageImage` class. For more details, see "Visual debugging" below. For `conversion_kwargs`, see here. |

### Objects

Each instance of `pdfplumber.PDF` and `pdfplumber.Page` provides access to five types of PDF objects. The following properties each return a Python list of the matching objects:

- `.chars`, each representing a single text character.
- `.annos`, each representing a single annotation-text character.
- `.lines`, each representing a single 1-dimensional line.
- `.rects`, each representing a single 2-dimensional rectangle.
- `.curves`, each representing a series of connected points.

Each object is represented as a simple Python `dict`, with the following properties:

#### `char` / `anno` properties

| Property | Description |
|---|---|
| `page_number` | Page number on which this character was found. |
| `text` | E.g., "z", or "Z" or " ". |
| `fontname` | Name of the character's font face. |
| `size` | Font size. |
| `adv` | Equal to text width * the font size * scaling factor. |
| `upright` | Whether the character is upright. |
| `height` | Height of the character. |
| `width` | Width of the character. |
| `x0` | Distance of left side of character from left side of page. |
| `x1` | Distance of right side of character from left side of page. |
| `y0` | Distance of bottom of character from bottom of page. |
| `y1` | Distance of top of character from bottom of page. |
| `top` | Distance of top of character from top of page. |
| `bottom` | Distance of bottom of the character from top of page. |
| `doctop` | Distance of top of character from top of document. |
| `object_type` | "char" / "anno" |

#### `line` properties

| Property | Description |
|---|---|
| `page_number` | Page number on which this line was found. |
| `height` | Height of line. |
| `width` | Width of line. |
| `x0` | Distance of left-side extremity from left side of page. |
| `x1` | Distance of right-side extremity from left side of page. |
| `y0` | Distance of bottom extremity from bottom of page. |
| `y1` | Distance of top extremity from bottom of page. |
| `top` | Distance of top of line from top of page. |
| `bottom` | Distance of bottom of the line from top of page. |
| `doctop` | Distance of top of line from top of document. |
| `linewidth` | Thickness of line. |
| `object_type` | "line" |

#### `rect` properties

| Property | Description |
|---|---|
| `page_number` | Page number on which this rectangle was found. |
| `height` | Height of rectangle. |
| `width` | Width of rectangle. |
| `x0` | Distance of left side of rectangle from left side of page. |
| `x1` | Distance of right side of rectangle from left side of page. |
| `y0` | Distance of bottom of rectangle from bottom of page. |
| `y1` | Distance of top of rectangle from bottom of page. |
| `top` | Distance of top of rectangle from top of page. |
| `bottom` | Distance of bottom of the rectangle from top of page. |
| `doctop` | Distance of top of rectangle from top of document. |
| `linewidth` | Thickness of line. |
| `object_type` | "rect" |
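The property tables above measure vertical positions in two ways: `y0` / `y1` from the bottom of the page, and `top` / `bottom` from the top of the page. The sketch below shows how the two relate according to those definitions. It works on a plain dictionary rather than a real PDF; the `to_top_based` helper, the sample values, and the assumption that the page is not cropped are all illustrative and not part of `pdfplumber`.

```python
def to_top_based(obj, page_height):
    """Derive top/bottom (measured from the page top) from y0/y1 (measured from the page bottom)."""
    return {
        # y1 is the object's top edge measured from the bottom of the page
        "top": page_height - obj["y1"],
        # y0 is the object's bottom edge measured from the bottom of the page
        "bottom": page_height - obj["y0"],
    }


# Hypothetical character object with only the keys needed here
char = {"text": "z", "x0": 72.0, "x1": 78.0, "y0": 700.0, "y1": 712.0}
converted = to_top_based(char, page_height=792.0)
print(converted)
```

Under these assumptions, `bottom - top` equals `y1 - y0`, i.e. the object's `height`.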
#### `curve` properties

| Property | Description |
|---|---|
| `page_number` | Page number on which this curve was found. |
| `points` | Points, as a list of `(x, top)` tuples, describing the curve. |
| `height` | Height of curve's bounding box. |
| `width` | Width of curve's bounding box. |
| `x0` | Distance of curve's left-most point from left side of page. |
| `x1` | Distance of curve's right-most point from left side of the page. |
| `y0` | Distance of curve's lowest point from bottom of page. |
| `y1` | Distance of curve's highest point from bottom of page. |
| `top` | Distance of curve's highest point from top of page. |
| `bottom` | Distance of curve's lowest point from top of page. |
| `doctop` | Distance of curve's highest point from top of document. |
| `linewidth` | Thickness of line. |
| `object_type` | "curve" |

Additionally, both `pdfplumber.PDF` and `pdfplumber.Page` provide access to two derived lists of objects: `.rect_edges` (which decomposes each rectangle into its four lines) and `.edges` (which combines `.rect_edges` with `.lines`).

## Visual debugging

### Creating a `PageImage` with `.to_image()`

To turn any page (including cropped pages) into a `PageImage` object, call `my_page.to_image()`. You can optionally pass a `resolution={integer}` keyword argument, which defaults to 72. E.g.:

    im = my_pdf.pages[0].to_image(resolution=150)

`PageImage` objects play nicely with IPython/Jupyter notebooks; they automatically render as cell outputs.
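PDF coordinates are expressed in points (1/72 inch), so the default `resolution=72` corresponds to roughly one pixel per point, and other resolutions scale linearly. The sketch below only illustrates that arithmetic; `points_to_pixels` is a hypothetical helper, not a `pdfplumber` function, and actual rendered image sizes may differ slightly because of rounding.

```python
def points_to_pixels(value_in_points, resolution=72):
    """Convert a length in PDF points to pixels at the given resolution (pixels per inch)."""
    return value_in_points * resolution / 72


# A US Letter page is 612 points wide (8.5 inches * 72 points per inch)
page_width_pts = 612
width_at_default = points_to_pixels(page_width_pts)
width_at_150 = points_to_pixels(page_width_pts, resolution=150)
print(width_at_default, width_at_150)
```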
### Basic `PageImage` methods

| Method | Description |
|---|---|
| `im.reset()` | Clears anything you've drawn so far. |
| `im.copy()` | Copies the image to a new `PageImage` object. |
| `im.save(path_or_fileobject, format="PNG")` | Saves the annotated image. |

### Drawing methods

You can pass explicit coordinates or any `pdfplumber` PDF object (e.g., char, line, rect) to these methods.

| Single-object method | Bulk method | Description |
|---|---|---|
| `im.draw_line(line, stroke={color}, stroke_width=1)` | `im.draw_lines(list_of_lines, **kwargs)` | Draws a line from a `line`, `curve`, or a 2-tuple of 2-tuples (e.g., `((x, y), (x, y))`). |
| `im.draw_vline(location, stroke={color}, stroke_width=1)` | `im.draw_vlines(list_of_locations, **kwargs)` | Draws a vertical line at the x-coordinate indicated by `location`. |
| `im.draw_hline(location, stroke={color}, stroke_width=1)` | `im.draw_hlines(list_of_locations, **kwargs)` | Draws a horizontal line at the y-coordinate indicated by `location`. |
| `im.draw_rect(bbox_or_obj, fill={color}, stroke={color}, stroke_width=1)` | `im.draw_rects(list_of_rects, **kwargs)` | Draws a rectangle from a `rect`, `char`, etc., or 4-tuple bounding box. |
| `im.draw_circle(center_or_obj, radius=5, fill={color}, stroke={color})` | `im.draw_circles(list_of_circles, **kwargs)` | Draws a circle at `(x, y)` coordinate or at the center of a `char`, `rect`, etc. |

Note: The methods above are built on Pillow's `ImageDraw` methods, but the parameters have been tweaked for consistency with SVG's `fill` / `stroke` / `stroke_width` nomenclature.

### Troubleshooting ImageMagick on Debian-based systems

If you're using `pdfplumber` on a Debian-based system and encounter a `PolicyError`, you may be able to fix it by changing the following line in `/etc/ImageMagick-6/policy.xml` from this: ... to this: (More details about `policy.xml` available here.)

## Extracting tables

`pdfplumber`'s approach to table detection borrows heavily from Anssi Nurminen's master's thesis, and is inspired by Tabula. It works like this:

1. For any given PDF page, find the lines that are (a) explicitly defined and/or (b) implied by the alignment of words on the page.
2. Merge overlapping, or nearly-overlapping, lines.
3. Find the intersections of all those lines.
4. Find the most granular set of rectangles (i.e., cells) that use these intersections as their vertices.
5. Group contiguous cells into tables.

### Table-extraction methods

`pdfplumber.Page` objects can call the following table methods:

| Method | Description |
|---|---|
| `.find_tables(table_settings={})` | Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method. |
| `.extract_tables(table_settings={})` | Returns the text extracted from all tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`. |
| `.extract_table(table_settings={})` | Returns the text extracted from the largest table on the page, represented as a list of lists, with the structure `row -> cell`. (If multiple tables have the same size, as measured by the number of cells, this method returns the table closest to the top of the page.) |
| `.debug_tablefinder(table_settings={})` | Returns an instance of the `TableFinder` class, with access to the `.edges`, `.intersections`, `.cells`, and `.tables` properties. |

For example:

    import pdfplumber

    pdf = pdfplumber.open("path/to/my.pdf")
    page = pdf.pages[0]
    page.extract_table()

Click here for a more detailed example.
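The selection rule described for `.extract_table` (largest table by number of cells, with ties going to the table closest to the top of the page) can be sketched with plain dictionaries. The `cells` and `bbox` keys mirror the `Table` properties named above, and the bounding box is assumed to follow the `(x0, top, x1, bottom)` order used elsewhere in this document. The `pick_largest_table` helper and the sample data are hypothetical, not `pdfplumber`'s actual implementation.

```python
def pick_largest_table(tables):
    """Return the table with the most cells; ties go to the smallest `top` coordinate."""
    if not tables:
        return None
    # Sort key: more cells first (negated count), then higher on the page (smaller top)
    return min(tables, key=lambda t: (-len(t["cells"]), t["bbox"][1]))


# Hypothetical tables; each bbox is (x0, top, x1, bottom)
tables = [
    {"name": "lower", "cells": [None] * 6, "bbox": (50, 400, 500, 600)},
    {"name": "upper", "cells": [None] * 6, "bbox": (50, 100, 500, 300)},
    {"name": "small", "cells": [None] * 2, "bbox": (50, 20, 500, 60)},
]
chosen = pick_largest_table(tables)
print(chosen["name"])
```

Here "lower" and "upper" tie on cell count, so the one nearer the top of the page is chosen, even though "small" sits higher still.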
### Table-extraction settings

By default, `extract_tables` uses the page's vertical and horizontal lines (or rectangle edges) as cell-separators. But the method is highly customizable via the `table_settings` argument. The possible settings, and their defaults:

    {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "explicit_vertical_lines": [],
        "explicit_horizontal_lines": [],
        "snap_tolerance": 3,
        "join_tolerance": 3,
        "edge_min_length": 3,
        "min_words_vertical": 3,
        "min_words_horizontal": 1,
        "keep_blank_chars": False,
        "text_tolerance": 3,
        "text_x_tolerance": None,
        "text_y_tolerance": None,
        "intersection_tolerance": 3,
        "intersection_x_tolerance": None,
        "intersection_y_tolerance": None,
    }

| Setting | Description |
|---|---|
| `"vertical_strategy"` | Either `"lines"`, `"lines_strict"`, `"text"`, or `"explicit"`. See explanation below. |
| `"horizontal_strategy"` | Either `"lines"`, `"lines_strict"`, `"text"`, or `"explicit"`. See explanation below. |
| `"explicit_vertical_lines"` | A list of vertical lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers, indicating the `x` coordinate of a line the full height of the page, or a dictionary describing the line, with at least the following keys: `x`, `top`, `bottom`. |
| `"explicit_horizontal_lines"` | A list of horizontal lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers, indicating the `y` coordinate of a line the full width of the page, or a dictionary describing the line, with at least the following keys: `top`, `x0`, `x1`. |
| `"snap_tolerance"` | Parallel lines within `snap_tolerance` pixels will be "snapped" to the same horizontal or vertical position. |
| `"join_tolerance"` | Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be "joined" into a single line segment. |
| `"edge_min_length"` | Edges shorter than `edge_min_length` will be discarded before attempting to reconstruct the table. |
| `"min_words_vertical"` | When using `"vertical_strategy": "text"`, at least `min_words_vertical` words must share the same alignment. |
| `"min_words_horizontal"` | When using `"horizontal_strategy": "text"`, at least `min_words_horizontal` words must share the same alignment. |
| `"keep_blank_chars"` | When using the `text` strategy, consider `" "` chars to be parts of words and not word-separators. |
| `"text_tolerance"`, `"text_x_tolerance"`, `"text_y_tolerance"` | When the `text` strategy searches for words, it will expect the individual letters in each word to be no more than `text_tolerance` pixels apart. |
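Since `table_settings` is a plain dictionary, one convenient pattern is to override only the keys you need on top of the defaults listed above. The sketch below is a hypothetical helper (`build_table_settings` is not part of `pdfplumber`); it copies the defaults from this document and checks the strategy names against the four documented values. Whether `pdfplumber` itself requires every key to be present is not assumed here.

```python
import copy

# Defaults as listed in the settings table of this document
DEFAULT_TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}
VALID_STRATEGIES = {"lines", "lines_strict", "text", "explicit"}


def build_table_settings(**overrides):
    """Return a fresh settings dict: the documented defaults plus the given overrides."""
    unknown = set(overrides) - set(DEFAULT_TABLE_SETTINGS)
    if unknown:
        raise KeyError(f"Unknown table settings: {sorted(unknown)}")
    # Deep copy so the list-valued defaults are never shared between calls
    settings = copy.deepcopy(DEFAULT_TABLE_SETTINGS)
    settings.update(overrides)
    for key in ("vertical_strategy", "horizontal_strategy"):
        if settings[key] not in VALID_STRATEGIES:
            raise ValueError(f"{key} must be one of {sorted(VALID_STRATEGIES)}")
    return settings


settings = build_table_settings(vertical_strategy="text", min_words_vertical=2)
print(settings["vertical_strategy"], settings["horizontal_strategy"])
```

With `pdfplumber` installed, a dictionary built this way could be passed as the `table_settings` argument, e.g. `page.extract_table(settings)`.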