`langchain_upstage.layout_analysis_parsers`.UpstageLayoutAnalysisParser¶

class langchain_upstage.layout_analysis_parsers.UpstageLayoutAnalysisParser(api_key: Optional[str] = None, output_type: Union[Literal['text', 'html'], dict] = 'html', split: Literal['none', 'element', 'page'] = 'none', use_ocr: bool = False, exclude: list = [])[source]¶

Upstage Layout Analysis Parser.

To use, you should have the environment variable UPSTAGE_API_KEY set with your API key or pass it as a named parameter to the constructor.

Example

from langchain_upstage import UpstageLayoutAnalysisParser

parser = UpstageLayoutAnalysisParser(split="page", output_type="text")

Initializes an instance of the UpstageLayoutAnalysisParser class.

Parameters

api_key (str, optional) – The API key for accessing the Upstage API. Defaults to None, in which case it will be fetched from the environment variable UPSTAGE_API_KEY.
output_type (Union[OutputType, dict], optional) – The type of output to be generated by the parser. Defaults to “html”.
split (SplitType, optional) – The type of splitting to be applied. Defaults to “none” (no splitting).
use_ocr (bool, optional) – Whether to extract text from images in the document using OCR. Defaults to False, in which case the text information embedded in the PDF file is used.
exclude (list, optional) – Exclude specific elements from the output. Defaults to [] (all included).
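The API key lookup described above (explicit argument first, then the UPSTAGE_API_KEY environment variable) can be sketched with the standard library alone. The helper name resolve_api_key and the ValueError raised when no key is found are assumptions made for illustration; they are not part of langchain_upstage, and the library's actual error handling may differ.

```python
import os
from typing import Optional

ENV_VAR = "UPSTAGE_API_KEY"  # environment variable named in the documentation


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: an explicit argument wins, else fall back to the environment."""
    if api_key is not None:
        return api_key
    value = os.environ.get(ENV_VAR)
    if not value:
        # Assumption: fail loudly when no key is available from either source.
        raise ValueError(f"Pass api_key or set the {ENV_VAR} environment variable.")
    return value
```

Passing api_key explicitly is convenient in tests, while the environment variable keeps the key out of source code.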

Methods

`__init__`([api_key, output_type, split, ...])	Initializes an instance of the UpstageLayoutAnalysisParser class.
`lazy_parse`(blob[, is_batch])	Lazily parses a document and yields Document objects based on the specified split type.
`parse`(blob)	Eagerly parse the blob into a document or documents.

__init__(api_key: Optional[str] = None, output_type: Union[Literal['text', 'html'], dict] = 'html', split: Literal['none', 'element', 'page'] = 'none', use_ocr: bool = False, exclude: list = [])[source]¶

Initializes an instance of the UpstageLayoutAnalysisParser class.

Parameters

api_key (str, optional) – The API key for accessing the Upstage API. Defaults to None, in which case it will be fetched from the environment variable UPSTAGE_API_KEY.
output_type (Union[OutputType, dict], optional) – The type of output to be generated by the parser. Defaults to “html”.
split (SplitType, optional) – The type of splitting to be applied. Defaults to “none” (no splitting).
use_ocr (bool, optional) – Whether to extract text from images in the document using OCR. Defaults to False, in which case the text information embedded in the PDF file is used.
exclude (list, optional) – Exclude specific elements from the output. Defaults to [] (all included).

lazy_parse(blob: Blob, is_batch: bool = False) → Iterator[Document][source]¶

Lazily parses a document and yields Document objects based on the specified split type.

Parameters

blob (Blob) – The input document blob to parse.
is_batch (bool, optional) – Whether to parse the document in batches. Defaults to False (single-page parsing).

Yields

Document – The parsed document object.

Raises

ValueError – If an invalid split type is provided.

Return type

Iterator[Document]
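To illustrate how a split type can shape what a lazy parser yields, here is a minimal standalone sketch. It is not the library's implementation: the function lazy_split, the element shape {"page": ..., "text": ...}, and the space-joined output are all hypothetical, and plain strings stand in for Document objects.

```python
from typing import Dict, Iterator, List

Element = Dict[str, object]  # hypothetical shape: {"page": int, "text": str}


def lazy_split(elements: List[Element], split: str = "none") -> Iterator[str]:
    """Yield text chunks one at a time according to the split type."""
    if split == "none":
        # One chunk for the whole document.
        yield " ".join(str(e["text"]) for e in elements)
    elif split == "element":
        # One chunk per layout element.
        for e in elements:
            yield str(e["text"])
    elif split == "page":
        # Group elements by page number, then yield one chunk per page.
        pages: Dict[int, List[str]] = {}
        for e in elements:
            pages.setdefault(int(e["page"]), []).append(str(e["text"]))
        for page in sorted(pages):
            yield " ".join(pages[page])
    else:
        raise ValueError(f"Invalid split type: {split}")


elements = [
    {"page": 1, "text": "Title"},
    {"page": 1, "text": "Intro"},
    {"page": 2, "text": "Body"},
]
```

Because lazy_split is a generator function, the ValueError for an invalid split type surfaces only when iteration starts, not when the function is called. The same is true of any parser method written as a generator.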

parse(blob: Blob) → List[Document]¶

Eagerly parse the blob into a document or documents.

This is a convenience method for interactive development environments.

Production applications should favor the lazy_parse method instead.

Subclasses should generally not override this parse method.

Parameters: blob (Blob) – Blob instance
Returns: List of documents
Return type: List[Document]
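The relationship between the eager and lazy methods can be sketched as follows, assuming parse simply materializes the output of lazy_parse into a list, as its description suggests. SketchParser is a hypothetical stand-in that uses plain strings in place of Blob and Document objects.

```python
from typing import Iterator, List


class SketchParser:
    """Hypothetical stand-in showing the eager/lazy relationship."""

    def lazy_parse(self, blob: str) -> Iterator[str]:
        # Yield one item at a time; nothing is processed until iteration starts.
        for line in blob.splitlines():
            yield line

    def parse(self, blob: str) -> List[str]:
        # Eager variant: consume the whole iterator and hold every item in memory.
        return list(self.lazy_parse(blob))
```

With the lazy form, a caller can stop after the first few items or stream them into downstream processing, which is why it is the better fit for large inputs.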
