langchain_upstage.layout_analysis.UpstageLayoutAnalysisLoader

class langchain_upstage.layout_analysis.UpstageLayoutAnalysisLoader(file_path: Union[str, Path, List[str], List[Path]], output_type: Union[Literal['text', 'html'], dict] = 'html', split: Literal['none', 'element', 'page'] = 'none', api_key: Optional[str] = None, use_ocr: bool = False, exclude: list = ['header', 'footer'])[source]

Upstage Layout Analysis.

To use, you should have the environment variable UPSTAGE_API_KEY set with your API key or pass it as a named parameter to the constructor.

Example

from langchain_upstage.layout_analysis import UpstageLayoutAnalysisLoader

file_path = "/PATH/TO/YOUR/FILE.pdf"
loader = UpstageLayoutAnalysisLoader(
    file_path, split="page", output_type="text"
)
docs = loader.load()  # list of Document objects

Initializes an instance of the Upstage document loader.

Parameters
  • file_path (Union[str, Path, List[str], List[Path]]) – The path, or list of paths, to the document(s) to be loaded.

  • output_type (Union[OutputType, dict], optional) – The type of output to be generated by the parser. Defaults to “html”.

  • split (SplitType, optional) – The type of splitting to be applied. Defaults to “none” (no splitting).

  • api_key (str, optional) – The API key for accessing the Upstage API. Defaults to None, in which case it will be fetched from the environment variable UPSTAGE_API_KEY.

  • use_ocr (bool, optional) – Whether to use OCR to extract text from images in the document. Defaults to False, in which case the text information embedded in the PDF file is used.

  • exclude (list, optional) – Exclude specific elements from the output. Defaults to [“header”, “footer”].
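The api_key fallback described above can be sketched as follows. This is only an illustration: resolve_api_key is a hypothetical helper, not part of langchain_upstage, and raising ValueError when no key is found is an assumption, since the page does not say how the loader reacts to a missing key.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer the explicit argument, else the environment."""
    # An explicitly passed key wins; otherwise fall back to UPSTAGE_API_KEY.
    key = api_key or os.environ.get("UPSTAGE_API_KEY")
    if not key:
        # Assumption: the real loader's error type and message are not documented here.
        raise ValueError("No API key found: pass api_key or set UPSTAGE_API_KEY.")
    return key
```

Passing api_key explicitly is handy in tests or multi-tenant code, while the environment variable keeps keys out of source files.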

Methods

__init__(file_path[, output_type, split, ...])

Initializes an instance of the Upstage document loader.

alazy_load()

A lazy loader for Documents.

aload()

Load data into Document objects.

lazy_load()

Lazily loads and parses the document using the UpstageLayoutAnalysisParser.

load()

Loads and parses the document using the UpstageLayoutAnalysisParser.

load_and_split([text_splitter])

Load Documents and split into chunks.

merge_and_split(documents[, splitter])

Merges the page content and metadata of multiple documents into a single document, or splits the documents using a custom splitter.

__init__(file_path: Union[str, Path, List[str], List[Path]], output_type: Union[Literal['text', 'html'], dict] = 'html', split: Literal['none', 'element', 'page'] = 'none', api_key: Optional[str] = None, use_ocr: bool = False, exclude: list = ['header', 'footer'])[source]

Initializes an instance of the Upstage document loader.

Parameters
  • file_path (Union[str, Path, List[str], List[Path]]) – The path, or list of paths, to the document(s) to be loaded.

  • output_type (Union[OutputType, dict], optional) – The type of output to be generated by the parser. Defaults to “html”.

  • split (SplitType, optional) – The type of splitting to be applied. Defaults to “none” (no splitting).

  • api_key (str, optional) – The API key for accessing the Upstage API. Defaults to None, in which case it will be fetched from the environment variable UPSTAGE_API_KEY.

  • use_ocr (bool, optional) – Whether to use OCR to extract text from images in the document. Defaults to False, in which case the text information embedded in the PDF file is used.

  • exclude (list, optional) – Exclude specific elements from the output. Defaults to [“header”, “footer”].
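To make the three split values more concrete, here is a rough sketch of how parsed layout elements might be grouped. It rests on assumptions: that "none" yields one combined result, "element" one result per layout element, and "page" one result per page. The element dictionaries and the split_elements function are hypothetical and do not reflect the actual API response or the parser's implementation.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical parsed elements; the real API response shape may differ.
ELEMENTS = [
    {"page": 1, "category": "heading1", "text": "Title"},
    {"page": 1, "category": "paragraph", "text": "First paragraph."},
    {"page": 2, "category": "paragraph", "text": "Second page text."},
]


def split_elements(elements: List[dict], split: str = "none") -> List[str]:
    """Group element texts according to the assumed meaning of each split mode."""
    if split == "none":
        # One combined result for the whole document.
        return [" ".join(e["text"] for e in elements)]
    if split == "element":
        # One result per layout element.
        return [e["text"] for e in elements]
    if split == "page":
        # One result per page, in page order.
        pages: Dict[int, List[str]] = defaultdict(list)
        for e in elements:
            pages[e["page"]].append(e["text"])
        return [" ".join(pages[p]) for p in sorted(pages)]
    raise ValueError(f"Unsupported split value: {split!r}")


for mode in ("none", "element", "page"):
    print(mode, len(split_elements(ELEMENTS, mode)))
```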

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type

AsyncIterator[Document]

async aload() → List[Document]

Load data into Document objects.

Return type

List[Document]

lazy_load() → Iterator[Document][source]

Lazily loads and parses the document using the UpstageLayoutAnalysisParser.

Returns

An iterator of Document objects representing the parsed layout analysis.

Return type

Iterator[Document]

load() → List[Document][source]

Loads and parses the document using the UpstageLayoutAnalysisParser.

Returns

A list of Document objects representing the parsed layout analysis.

Return type

List[Document]
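The difference between lazy_load() and load() is the usual iterator-versus-list trade-off. The sketch below uses stand-ins rather than the real classes: Document is a minimal substitute for the LangChain Document type, and FakeLoader is a hypothetical loader that assumes load() simply materializes lazy_load(), which is how the LangChain base loader defines it. UpstageLayoutAnalysisLoader overrides both methods, so its internals may differ.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Document:
    """Minimal stand-in for the LangChain Document type."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class FakeLoader:
    """Hypothetical loader that mimics the lazy_load()/load() pairing."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages

    def lazy_load(self) -> Iterator[Document]:
        # Yield one Document at a time; nothing is built until it is requested.
        for number, text in enumerate(self.pages, start=1):
            yield Document(page_content=text, metadata={"page": number})

    def load(self) -> List[Document]:
        # Assumption: eager loading just collects the lazy iterator into a list.
        return list(self.lazy_load())


loader = FakeLoader(["page one", "page two", "page three"])
first = next(loader.lazy_load())  # only the first Document is produced
all_docs = loader.load()          # every Document is produced up front
```

Iterating lazily keeps memory use flat for large inputs, while load() is more convenient when the full list is needed anyway.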

load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered deprecated.

Parameters

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns

List of Documents.

Return type

List[Document]

merge_and_split(documents: List[Document], splitter: Optional[object] = None) → List[Document][source]

Merges the page content and metadata of multiple documents into a single document, or splits the documents using a custom splitter.

Parameters
  • documents (list) – A list of Document objects to be merged or split.

  • splitter (object, optional) – An optional splitter object that implements the split_documents method. If provided, the documents will be split using this splitter. Defaults to None, in which case the documents are merged.

Returns

A list of Document objects. If no splitter is provided, a single Document object is returned with the merged content and combined metadata. If a splitter is provided, the documents are split and a list of Document objects is returned.

Return type

list

Raises
  • AssertionError – If a splitter is provided but it does not implement the split_documents method.
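The merge-or-split behavior described for merge_and_split can be sketched roughly as below. This is not the library's implementation: Document is a minimal stand-in, LineSplitter is a hypothetical splitter, and both the space separator used when joining text and the dict-update strategy for combining metadata are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Document:
    """Minimal stand-in for the LangChain Document type."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineSplitter:
    """Hypothetical splitter: one Document per line of text."""

    def split_documents(self, documents: List[Document]) -> List[Document]:
        return [
            Document(page_content=line, metadata=dict(doc.metadata))
            for doc in documents
            for line in doc.page_content.splitlines()
        ]


def merge_and_split(
    documents: List[Document], splitter: Optional[object] = None
) -> List[Document]:
    if splitter is None:
        # No splitter: merge everything into a single Document.
        merged_metadata: dict = {}
        for doc in documents:
            merged_metadata.update(doc.metadata)  # assumed merge strategy
        merged_text = " ".join(doc.page_content for doc in documents)
        return [Document(page_content=merged_text, metadata=merged_metadata)]
    # A splitter must expose split_documents, otherwise an AssertionError is raised.
    assert hasattr(splitter, "split_documents"), "splitter must implement split_documents"
    return splitter.split_documents(documents)


docs = [Document("alpha\nbeta", {"page": 1}), Document("gamma", {"page": 2})]
merged = merge_and_split(docs)                  # a list holding one Document
pieces = merge_and_split(docs, LineSplitter())  # one Document per line
```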