langchain.document_loaders.parsers.pdf
.AmazonTextractPDFParser¶
- class langchain.document_loaders.parsers.pdf.AmazonTextractPDFParser(textract_features: Optional[Sequence[int]] = None, client: Optional[Any] = None)[source]¶
Send PDF files to Amazon Textract and parse them.
For parsing multi-page PDFs, they have to reside on S3.
The AmazonTextractPDFLoader calls the [Amazon Textract Service](https://aws.amazon.com/textract/) to convert PDFs into a Document structure. Single and multi-page documents are supported with up to 3000 pages and 512 MB of size.
For the call to be successful an AWS account is required, similar to the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) requirements.
Besides the AWS configuration, it is very similar to the other PDF loaders, while also supporting JPEG, PNG and TIFF and non-native PDF formats.
`python from langchain.document_loaders import AmazonTextractPDFLoader loader=AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg") documents = loader.load() `
One feature is the linearization of the output. When using the features LAYOUT, FORMS or TABLES together with Textract
```python from langchain.document_loaders import AmazonTextractPDFLoader # you can mix and match each of the features loader=AmazonTextractPDFLoader(
“example_data/alejandro_rosalez_sample-small.jpeg”, textract_features=[“TABLES”, “LAYOUT”])
it will generate output that formats the text in reading order and try to output the information in a tabular structure or output the key/value pairs with a colon (key: value). This helps most LLMs to achieve better accuracy when processing these texts.
Initializes the parser.
- Parameters
textract_features – Features to be used for extraction, each feature should be passed as an int that conforms to the enum Textract_Features, see amazon-textract-caller pkg
client – boto3 textract client
Methods
__init__
([textract_features, client])Initializes the parser.
lazy_parse
(blob)Iterates over the Blob pages and returns an Iterator with a Document for each page, like the other parsers If multi-page document, blob.path has to be set to the S3 URI and for single page docs the blob.data is taken
parse
(blob)Eagerly parse the blob into a document or documents.
- __init__(textract_features: Optional[Sequence[int]] = None, client: Optional[Any] = None) None [source]¶
Initializes the parser.
- Parameters
textract_features – Features to be used for extraction, each feature should be passed as an int that conforms to the enum Textract_Features, see amazon-textract-caller pkg
client – boto3 textract client
- lazy_parse(blob: Blob) Iterator[Document] [source]¶
Iterates over the Blob pages and returns an Iterator with a Document for each page, like the other parsers If multi-page document, blob.path has to be set to the S3 URI and for single page docs the blob.data is taken
- parse(blob: Blob) List[Document] ¶
Eagerly parse the blob into a document or documents.
This is a convenience method for interactive development environment.
Production applications should favor the lazy_parse method instead.
Subclasses should generally not over-ride this parse method.
- Parameters
blob – Blob instance
- Returns
List of documents