langchain.document_loaders.pdf.AmazonTextractPDFLoader

class langchain.document_loaders.pdf.AmazonTextractPDFLoader(file_path: str, textract_features: Optional[Sequence[str]] = None, client: Optional[Any] = None, credentials_profile_name: Optional[str] = None, region_name: Optional[str] = None, endpoint_url: Optional[str] = None, headers: Optional[Dict] = None)

Load PDF files from a local file system, an HTTP URL, or S3, using the Amazon Textract service.

To authenticate, the AWS client uses the following methods to automatically load credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html

To use a specific credential profile, pass its name from the ~/.aws/credentials file as credentials_profile_name.

Make sure the credentials or roles in use have the policies required to access the Amazon Textract service.

Example
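A minimal usage sketch, assuming the langchain package is installed and AWS credentials are configured. The helper name load_pdf is illustrative and not part of the library; any path passed to it (for example an S3 path such as s3://pdfs/myfile.pdf) is a placeholder for a real file.

```python
from typing import Any, List


def load_pdf(file_path: str) -> List[Any]:
    """Load a PDF through Amazon Textract and return its pages as Documents."""
    # Imported inside the function so this module can be imported
    # even where langchain is not installed.
    from langchain.document_loaders import AmazonTextractPDFLoader

    loader = AmazonTextractPDFLoader(file_path=file_path)
    return loader.load()
```

Calling load_pdf sends the file to Textract, so it needs network access and valid credentials.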

Initialize the loader.

Parameters
  • file_path – Path to the input file: a local file path, a URL, or an S3 path

  • textract_features – Features to use for extraction. Pass each feature as a str matching a member name of the Textract_Features enum (see the amazon-textract-caller package) (Optional)

  • client – boto3 Textract client (Optional)

  • credentials_profile_name – AWS profile name, if not the default profile (Optional)

  • region_name – AWS region, e.g. us-east-1 (Optional)

  • endpoint_url – Endpoint URL for the Textract service (Optional)

  • headers – HTTP headers to send when downloading the file from a URL (Optional)
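The optional parameters above can be combined as in the following sketch. It is an illustration under assumptions: build_loader_kwargs and make_loader are hypothetical helpers, the path and profile name are placeholders, and "TABLES" and "FORMS" are assumed to be valid Textract_Features member names.

```python
from typing import Any, Dict, Optional, Sequence


def build_loader_kwargs(
    file_path: str,
    textract_features: Optional[Sequence[str]] = None,
    credentials_profile_name: Optional[str] = None,
    region_name: Optional[str] = None,
    endpoint_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Return constructor keyword arguments, leaving out anything unset."""
    candidates = {
        "textract_features": textract_features,
        "credentials_profile_name": credentials_profile_name,
        "region_name": region_name,
        "endpoint_url": endpoint_url,
    }
    kwargs: Dict[str, Any] = {"file_path": file_path}
    kwargs.update({k: v for k, v in candidates.items() if v is not None})
    return kwargs


def make_loader(**kwargs: Any) -> Any:
    """Construct the loader; langchain is imported only when this is called."""
    from langchain.document_loaders import AmazonTextractPDFLoader

    return AmazonTextractPDFLoader(**kwargs)


# Placeholder values (assumptions): path, profile name, and feature names.
kwargs = build_loader_kwargs(
    "s3://pdfs/myfile.pdf",
    textract_features=["TABLES", "FORMS"],
    credentials_profile_name="my-profile",
    region_name="us-east-1",
)
# loader = make_loader(**kwargs)  # requires langchain and AWS credentials
```

Omitting unset arguments lets the loader fall back to its own defaults, including the default boto3 credential lookup.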

Attributes

source

Methods

__init__(file_path[, textract_features, ...])

Initialize the loader.

lazy_load()

Lazily load documents.

load()

Load given path as pages.

load_and_split([text_splitter])

Load Documents and split into chunks.

__init__(file_path: str, textract_features: Optional[Sequence[str]] = None, client: Optional[Any] = None, credentials_profile_name: Optional[str] = None, region_name: Optional[str] = None, endpoint_url: Optional[str] = None, headers: Optional[Dict] = None) → None

Initialize the loader.

Parameters
  • file_path – Path to the input file: a local file path, a URL, or an S3 path

  • textract_features – Features to use for extraction. Pass each feature as a str matching a member name of the Textract_Features enum (see the amazon-textract-caller package) (Optional)

  • client – boto3 Textract client (Optional)

  • credentials_profile_name – AWS profile name, if not the default profile (Optional)

  • region_name – AWS region, e.g. us-east-1 (Optional)

  • endpoint_url – Endpoint URL for the Textract service (Optional)

  • headers – HTTP headers to send when downloading the file from a URL (Optional)

lazy_load() → Iterator[Document]

Lazily load documents.
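Because lazy_load() returns an iterator, its results can be consumed incrementally. The sketch below shows one way to take only the first few Documents; first_n and lazy_pages are hypothetical helper names. Whether the underlying Textract request is itself deferred per page is not stated here, so this only illustrates incremental consumption of the iterator.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


def first_n(documents: Iterable[Any], n: int) -> List[Any]:
    """Consume at most n items from a (possibly lazy) iterable."""
    return list(islice(documents, n))


def lazy_pages(file_path: str) -> Iterator[Any]:
    """Return the loader's lazy iterator; requires langchain and AWS access."""
    from langchain.document_loaders import AmazonTextractPDFLoader

    return AmazonTextractPDFLoader(file_path).lazy_load()


# Example call (not executed here, since it needs AWS credentials):
# preview = first_n(lazy_pages("s3://pdfs/myfile.pdf"), 2)
```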

load() → List[Document]

Load given path as pages.

load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Parameters

text_splitter – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns

List of Documents.
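A sketch of passing an explicit splitter instead of relying on the default. The helper name load_chunks and the chunk sizes are assumptions chosen for illustration, not recommended values.

```python
from typing import Any, List


def load_chunks(file_path: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[Any]:
    """Load a PDF and split its text into chunks returned as Documents."""
    # Lazy imports: both require the langchain package.
    from langchain.document_loaders import AmazonTextractPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # chunk_size and chunk_overlap are measured in characters by default.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return AmazonTextractPDFLoader(file_path).load_and_split(text_splitter=splitter)
```

Calling load_and_split() with no argument uses a RecursiveCharacterTextSplitter with its default settings, as described above.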
