langchain_community.document_loaders.pdf.MathpixPDFLoader

class langchain_community.document_loaders.pdf.MathpixPDFLoader(file_path: str, processed_file_format: str = 'md', max_wait_time_seconds: int = 500, should_clean_pdf: bool = False, extra_request_data: Optional[Dict[str, Any]] = None, **kwargs: Any)

Load PDF files using the Mathpix service.

Initialize with a file path.

Parameters
  • file_path (str) – Path to the PDF file to load.

  • processed_file_format (str) – Format of the processed file. Default is “md”.

  • max_wait_time_seconds (int) – Maximum time, in seconds, to wait for a response from the server. Default is 500.

  • should_clean_pdf (bool) – Whether to clean the PDF file. Default is False.

  • extra_request_data (Optional[Dict[str, Any]]) – Additional request data. Default is None.

  • **kwargs (Any) – Additional keyword arguments.
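
A minimal construction sketch based on the signature above. The helper names `mathpix_loader_options` and `make_loader` and the path `paper.pdf` are illustrative and not part of the library. The import is deferred so the option-building step runs without `langchain_community` installed; actually loading a file is assumed to need Mathpix credentials and network access, which this page does not describe.

```python
from typing import Any, Dict, Optional


def mathpix_loader_options(
    processed_file_format: str = "md",
    max_wait_time_seconds: int = 500,
    should_clean_pdf: bool = False,
    extra_request_data: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Collect constructor options, mirroring the documented defaults."""
    return {
        "processed_file_format": processed_file_format,
        "max_wait_time_seconds": max_wait_time_seconds,
        "should_clean_pdf": should_clean_pdf,
        "extra_request_data": extra_request_data,
    }


def make_loader(file_path: str, **options: Any):
    """Build a MathpixPDFLoader (hypothetical helper, lazy import)."""
    from langchain_community.document_loaders.pdf import MathpixPDFLoader

    return MathpixPDFLoader(file_path, **mathpix_loader_options(**options))


# Shorter wait and cleaned output; the same keywords could be passed to
# make_loader("paper.pdf", ...) once the service is reachable.
options = mathpix_loader_options(max_wait_time_seconds=120, should_clean_pdf=True)
```

Keeping the options in a plain dictionary makes it easy to log or reuse them across several files before any request is sent.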

Attributes

data

source

url

Methods

__init__(file_path[, processed_file_format, ...])

Initialize with a file path.

alazy_load()

A lazy loader for Documents.

aload()

Load data into Document objects.

clean_pdf(contents)

Clean the PDF file.

get_processed_pdf(pdf_id)

lazy_load()

A lazy loader for Documents.

load()

Load data into Document objects.

load_and_split([text_splitter])

Load Documents and split into chunks.

send_pdf()

wait_for_processing(pdf_id)

Wait for processing to complete.

__init__(file_path: str, processed_file_format: str = 'md', max_wait_time_seconds: int = 500, should_clean_pdf: bool = False, extra_request_data: Optional[Dict[str, Any]] = None, **kwargs: Any) → None

Initialize with a file path.

Parameters
  • file_path (str) – Path to the PDF file to load.

  • processed_file_format (str) – Format of the processed file. Default is “md”.

  • max_wait_time_seconds (int) – Maximum time, in seconds, to wait for a response from the server. Default is 500.

  • should_clean_pdf (bool) – Whether to clean the PDF file. Default is False.

  • extra_request_data (Optional[Dict[str, Any]]) – Additional request data. Default is None.

  • **kwargs (Any) – Additional keyword arguments.

Return type

None
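
This page does not say how the loader authenticates with Mathpix. Assuming the credentials are read from environment variables named `MATHPIX_API_KEY` and `MATHPIX_API_ID` (an assumption not confirmed here; check the library source), a pre-flight check before constructing the loader might look like the sketch below. The helper `missing_credentials` is hypothetical.

```python
import os
from typing import List, Mapping

# Assumed variable names; they are not stated on this page.
ASSUMED_CREDENTIAL_VARS = ("MATHPIX_API_KEY", "MATHPIX_API_ID")


def missing_credentials(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the assumed credential variables that are unset or empty."""
    return [name for name in ASSUMED_CREDENTIAL_VARS if not env.get(name)]


# Example with explicit mappings instead of the real environment.
nothing_set = missing_credentials({})
all_set = missing_credentials({"MATHPIX_API_KEY": "key", "MATHPIX_API_ID": "id"})
```

Failing early on an empty result list avoids uploading a file only to have the request rejected.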

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type

AsyncIterator[Document]
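
Because `alazy_load()` returns an async iterator, it is consumed with `async for`. The sketch below shows one way to do that; `collect_documents` is a hypothetical helper and `StubLoader` is a stand-in that yields strings rather than Documents so the example runs offline.

```python
import asyncio
from typing import Any, AsyncIterator, List


async def collect_documents(loader: Any, limit: int) -> List[Any]:
    """Consume at most `limit` items from loader.alazy_load()."""
    documents: List[Any] = []
    if limit <= 0:
        return documents
    async for document in loader.alazy_load():
        documents.append(document)
        if len(documents) >= limit:
            break
    return documents


class StubLoader:
    """Stand-in for MathpixPDFLoader; yields strings instead of Documents."""

    async def alazy_load(self) -> AsyncIterator[str]:
        for text in ("first", "second", "third"):
            yield text


result = asyncio.run(collect_documents(StubLoader(), limit=2))
```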

async aload() → List[Document]

Load data into Document objects.

Return type

List[Document]

clean_pdf(contents: str) → str

Clean the PDF file.

Parameters

contents (str) – the contents of a PDF file.

Return type

str
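
This page does not say what cleaning involves. As an illustration only, assuming it means dropping image-only lines and undoing LaTeX-style escapes in the converted text, a string-to-string cleanup could be sketched as follows. The function `strip_markup` is hypothetical and is not the library's `clean_pdf`.

```python
def strip_markup(contents: str) -> str:
    """Illustrative cleanup of converted text (not the library's method)."""
    # Drop lines that consist of an image reference with empty alt text.
    kept = [line for line in contents.split("\n") if not line.startswith("![]")]
    cleaned = "\n".join(kept)
    # Turn \section{Title} into a Markdown heading; this also removes
    # every other closing brace, which is a deliberate simplification.
    cleaned = cleaned.replace("\\section{", "# ").replace("}", "")
    # Undo backslash escapes for a few common characters.
    for escaped, plain in ((r"\$", "$"), (r"\%", "%"), (r"\(", "("), (r"\)", ")")):
        cleaned = cleaned.replace(escaped, plain)
    return cleaned


sample = "![](figure.png)\n\\section{Intro}\nCost is 5\\% of \\$100"
cleaned_sample = strip_markup(sample)
```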

get_processed_pdf(pdf_id: str) → str

Parameters

pdf_id (str) – a PDF id.

Return type

str

lazy_load() → Iterator[Document]

A lazy loader for Documents.

Return type

Iterator[Document]
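
Since `lazy_load()` returns an iterator, a caller can stop early instead of materializing a full list. A small sketch using `itertools.islice`; `first_documents` is a hypothetical helper and `StubLoader` stands in for the real loader, yielding strings so the example runs offline.

```python
from itertools import islice
from typing import Any, Iterator, List


def first_documents(loader: Any, count: int) -> List[Any]:
    """Take at most `count` items from loader.lazy_load()."""
    return list(islice(loader.lazy_load(), count))


class StubLoader:
    """Stand-in for MathpixPDFLoader; yields strings instead of Documents."""

    def lazy_load(self) -> Iterator[str]:
        yield "page one"
        yield "page two"


preview = first_documents(StubLoader(), 1)
```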

load() → List[Document]

Load data into Document objects.

Return type

List[Document]

load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method; it should be considered deprecated.

Parameters

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns

List of Documents.

Return type

List[Document]
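
Given the deprecation note, one alternative is to call `load()` and apply a splitter explicitly. The sketch below is duck-typed: it assumes the splitter exposes `split_documents`, as LangChain's `TextSplitter` does. `load_then_split`, `StubLoader`, and `StubSplitter` are hypothetical names, and the stubs work on plain strings rather than Documents so the example runs offline.

```python
from typing import Any, List


def load_then_split(loader: Any, splitter: Any) -> List[Any]:
    """Load everything, then hand the result to an explicit splitter."""
    documents = loader.load()
    return splitter.split_documents(documents)


class StubLoader:
    """Stand-in loader returning strings instead of Documents."""

    def load(self) -> List[str]:
        return ["abcdef"]


class StubSplitter:
    """Stand-in splitter that cuts text into fixed-size pieces."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size

    def split_documents(self, documents: List[str]) -> List[str]:
        return [
            text[start : start + self.chunk_size]
            for text in documents
            for start in range(0, len(text), self.chunk_size)
        ]


chunks = load_then_split(StubLoader(), StubSplitter(chunk_size=4))
```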

send_pdf() → str

Return type

str
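
The signatures suggest how the lower-level methods could be chained by hand: `send_pdf()` returns a string, `get_processed_pdf(pdf_id)` takes one, and `clean_pdf(contents)` post-processes the result. The sketch below assumes that the string returned by `send_pdf()` is the PDF id and that this is the intended order; neither is stated on this page. Whether `get_processed_pdf` waits for processing on its own is also not documented here. `fetch_processed_text` and `StubLoader` are hypothetical.

```python
from typing import Any


def fetch_processed_text(loader: Any, clean: bool = False) -> str:
    """Chain the low-level calls by hand (assumed order, see lead-in)."""
    pdf_id = loader.send_pdf()  # assumed to return the PDF id
    contents = loader.get_processed_pdf(pdf_id)
    if clean:
        contents = loader.clean_pdf(contents)
    return contents


class StubLoader:
    """Offline stand-in exposing the same three method names."""

    def send_pdf(self) -> str:
        return "abc123"

    def get_processed_pdf(self, pdf_id: str) -> str:
        return f"  text for {pdf_id}  "

    def clean_pdf(self, contents: str) -> str:
        return contents.strip()


raw_text = fetch_processed_text(StubLoader())
clean_text = fetch_processed_text(StubLoader(), clean=True)
```

In ordinary use, `load()` is the documented entry point; a hand-rolled chain like this is mainly useful for debugging a single step.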

wait_for_processing(pdf_id: str) → None

Wait for processing to complete.

Parameters

pdf_id (str) – a PDF id.

Return type

None
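
Waiting with an upper bound such as `max_wait_time_seconds` is typically done by polling at a fixed interval. The sketch below illustrates that general pattern only; it is not the library's implementation. The status strings `"completed"` and `"error"`, the 5-second interval, and the names `wait_until_completed` and `check_status` are all assumptions.

```python
import time
from typing import Callable


def wait_until_completed(
    check_status: Callable[[], str],
    max_wait_time_seconds: int = 500,
    poll_interval_seconds: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the status is "completed"; return the number of polls."""
    polls = 0
    for _ in range(0, max_wait_time_seconds, poll_interval_seconds):
        polls += 1
        status = check_status()
        if status == "completed":
            return polls
        if status == "error":
            raise ValueError("processing failed")
        sleep(poll_interval_seconds)
    raise TimeoutError(f"not completed after {max_wait_time_seconds} seconds")


# Simulated status sequence; sleep is replaced so the demo returns at once.
statuses = iter(["processing", "processing", "completed"])
polls = wait_until_completed(lambda: next(statuses), sleep=lambda _: None)
```

Injecting `sleep` keeps the loop testable without real delays.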

Examples using MathpixPDFLoader