langchain.document_loaders.parsers.pdf.PDFMinerParser
- class langchain.document_loaders.parsers.pdf.PDFMinerParser(extract_images: bool = False, *, concatenate_pages: bool = True)
Parse PDF using PDFMiner.
Initialize a parser based on PDFMiner.
- Parameters
extract_images – Whether to extract images from PDF.
concatenate_pages – If True, concatenate all PDF pages into a single document. Otherwise, return one document per page.
Methods
__init__([extract_images, concatenate_pages]) – Initialize a parser based on PDFMiner.
lazy_parse(blob) – Lazily parse the blob.
parse(blob) – Eagerly parse the blob into a document or documents.
- __init__(extract_images: bool = False, *, concatenate_pages: bool = True)
Initialize a parser based on PDFMiner.
- Parameters
extract_images – Whether to extract images from PDF.
concatenate_pages – If True, concatenate all PDF pages into a single document. Otherwise, return one document per page.
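The two concatenate_pages modes can be sketched as follows. This is a minimal usage sketch, not part of the reference itself: it assumes langchain and pdfminer.six are installed and that Blob is importable from langchain.document_loaders.blob_loaders. The helper name load_pdf and its one_document_per_page flag are hypothetical.

```python
from typing import Any, List


def load_pdf(path: str, one_document_per_page: bool = False) -> List[Any]:
    """Parse a PDF file into a list of LangChain documents.

    Hypothetical helper. Imports are deferred so that defining the function
    does not require langchain or pdfminer.six to be installed.
    """
    # Assumed import locations for this version of langchain.
    from langchain.document_loaders.blob_loaders import Blob
    from langchain.document_loaders.parsers.pdf import PDFMinerParser

    # concatenate_pages is keyword-only (note the `*` in the signature).
    # True  -> all pages merged into a single document.
    # False -> one document per page.
    parser = PDFMinerParser(concatenate_pages=not one_document_per_page)

    # Wrap the file in a Blob, which is what the parser methods accept.
    blob = Blob.from_path(path)
    return parser.parse(blob)
```

With the default concatenate_pages=True the returned list should hold a single document covering the whole file; with one_document_per_page=True it should hold one entry per page.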
- parse(blob: Blob) → List[Document]
Eagerly parse the blob into a document or documents.
This is a convenience method for interactive development environments.
Production applications should favor the lazy_parse method instead.
Subclasses should generally not override this parse method.
- Parameters
blob – Blob instance
- Returns
List of documents
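To illustrate the recommendation to prefer lazy_parse in production, here is a hedged sketch that streams documents instead of building the full list. It assumes the same import locations as the class path above plus langchain.document_loaders.blob_loaders.Blob; the function name iter_pdf_pages is hypothetical, and how incrementally page text is extracted may depend on the installed version.

```python
from typing import Any, Iterator


def iter_pdf_pages(path: str) -> Iterator[Any]:
    """Yield one document per page without materializing a list first.

    Hypothetical helper. Because this is a generator function, nothing below
    runs (including the imports) until the first item is requested.
    """
    # Assumed import locations for this version of langchain.
    from langchain.document_loaders.blob_loaders import Blob
    from langchain.document_loaders.parsers.pdf import PDFMinerParser

    parser = PDFMinerParser(concatenate_pages=False)

    # lazy_parse returns an iterator, so documents are produced on demand.
    yield from parser.lazy_parse(Blob.from_path(path))
```

A caller can loop over iter_pdf_pages("report.pdf") and stop early; parse, by contrast, returns the complete list before the caller sees any document.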