langchain_upstage.layout_analysis_parsers.UpstageLayoutAnalysisParser
- class langchain_upstage.layout_analysis_parsers.UpstageLayoutAnalysisParser(api_key: Optional[str] = None, output_type: Union[Literal['text', 'html'], dict] = 'html', split: Literal['none', 'element', 'page'] = 'none', use_ocr: bool = False, exclude: list = [])
Upstage Layout Analysis Parser.
To use, you should have the environment variable UPSTAGE_API_KEY set with your API key or pass it as a named parameter to the constructor.
Example
from langchain_upstage import UpstageLayoutAnalysisParser
loader = UpstageLayoutAnalysisParser(split="page", output_type="text")
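The key lookup described above (explicit argument first, then the UPSTAGE_API_KEY environment variable) can be sketched in isolation. This is a minimal illustration, not the library's implementation: resolve_api_key is a hypothetical helper, the key values are placeholders, and raising ValueError for a missing key is an assumption.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: return the explicit key, else UPSTAGE_API_KEY."""
    # An explicitly passed key takes priority over the environment.
    key = api_key or os.environ.get("UPSTAGE_API_KEY")
    if not key:
        # Assumption: a missing key is reported as a ValueError.
        raise ValueError("Set UPSTAGE_API_KEY or pass api_key explicitly.")
    return key


# Demonstration with placeholder values (not real keys).
os.environ["UPSTAGE_API_KEY"] = "env-key-placeholder"
from_env = resolve_api_key()
from_arg = resolve_api_key("explicit-key-placeholder")
```

Passing the key explicitly is convenient in tests; relying on the environment variable keeps the key out of source code.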
Initializes an instance of the UpstageLayoutAnalysisParser class.
- Parameters
api_key (str, optional) – The API key for accessing the Upstage API. Defaults to None, in which case it will be fetched from the environment variable UPSTAGE_API_KEY.
output_type (Union[OutputType, dict], optional) – The type of output to be generated by the parser. Defaults to “html”.
split (SplitType, optional) – The type of splitting to be applied. Defaults to “none” (no splitting).
use_ocr (bool, optional) – Whether to extract text from images in the document. Defaults to False, in which case the text information embedded in the PDF file is used.
exclude (list, optional) – Exclude specific elements from the output. Defaults to [] (all included).
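The accepted values for output_type and split come from the Literal types in the signature. The sketch below checks arguments against those values before constructing a parser. check_options is a hypothetical helper written for this illustration and is not part of langchain_upstage. The expected shape of a dict-valued output_type is not described here, so the sketch accepts any dict unchanged.

```python
from typing import Literal, Union, get_args

# Values copied from the documented constructor signature.
OutputType = Literal["text", "html"]
SplitType = Literal["none", "element", "page"]


def check_options(output_type: Union[str, dict] = "html", split: str = "none") -> None:
    """Hypothetical pre-flight check mirroring the documented signature."""
    # A dict is accepted as-is because its expected keys are not documented here.
    if not isinstance(output_type, dict) and output_type not in get_args(OutputType):
        raise ValueError(f"Invalid output_type: {output_type!r}")
    if split not in get_args(SplitType):
        raise ValueError(f"Invalid split: {split!r}")


check_options(output_type="text", split="page")  # accepted silently

try:
    check_options(split="chapter")  # not one of the documented values
    rejected = False
except ValueError:
    rejected = True
```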
Methods
__init__([api_key, output_type, split, ...]) – Initializes an instance of the UpstageLayoutAnalysisParser class.
lazy_parse(blob[, is_batch]) – Lazily parses a document and yields Document objects based on the specified split type.
parse(blob) – Eagerly parses the blob into a document or documents.
- __init__(api_key: Optional[str] = None, output_type: Union[Literal['text', 'html'], dict] = 'html', split: Literal['none', 'element', 'page'] = 'none', use_ocr: bool = False, exclude: list = [])
Initializes an instance of the UpstageLayoutAnalysisParser class.
- Parameters
api_key (str, optional) – The API key for accessing the Upstage API. Defaults to None, in which case it will be fetched from the environment variable UPSTAGE_API_KEY.
output_type (Union[OutputType, dict], optional) – The type of output to be generated by the parser. Defaults to “html”.
split (SplitType, optional) – The type of splitting to be applied. Defaults to “none” (no splitting).
use_ocr (bool, optional) – Whether to extract text from images in the document. Defaults to False, in which case the text information embedded in the PDF file is used.
exclude (list, optional) – Exclude specific elements from the output. Defaults to [] (all included).
- lazy_parse(blob: Blob, is_batch: bool = False) → Iterator[Document]
Lazily parses a document and yields Document objects based on the specified split type.
- Parameters
blob (Blob) – The input document blob to parse.
is_batch (bool, optional) – Whether to parse the document in batches. Defaults to False (single-page parsing).
- Yields
Document – The parsed document object.
- Raises
ValueError – If an invalid split type is provided.
- Return type
Iterator[Document]
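The difference between lazy and eager parsing can be shown with a stand-in that makes no API calls. Everything in this sketch is hypothetical: StubParser treats each input string as one page, Document is a minimal replacement for the LangChain class, and the metadata keys are invented for the illustration. It only demonstrates the generator pattern, where parse drains lazy_parse into a list.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Document:
    """Minimal stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class StubParser:
    """Hypothetical parser: each input string is one page, no API calls."""

    def __init__(self, split: str = "none") -> None:
        self.split = split

    def lazy_parse(self, pages: List[str]) -> Iterator[Document]:
        if self.split == "none":
            # One Document for the whole input.
            yield Document(" ".join(pages))
        elif self.split == "page":
            # One Document per page, produced only when requested.
            for number, text in enumerate(pages, start=1):
                yield Document(text, {"page": number})
        elif self.split == "element":
            # The stub has no layout elements to split on.
            raise NotImplementedError("element split is not modelled here")
        else:
            raise ValueError(f"Invalid split type: {self.split}")

    def parse(self, pages: List[str]) -> List[Document]:
        # Eager variant: drain the generator into a list.
        return list(self.lazy_parse(pages))


pages = ["first page", "second page"]
lazy_docs = StubParser(split="page").lazy_parse(pages)
first = next(lazy_docs)  # only the first page has been produced so far
all_docs = StubParser(split="none").parse(pages)
```

Because lazy_parse is a generator, an invalid split value in this stub surfaces on the first next() call rather than when lazy_parse is called.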