langchain_upstage.layout_analysis.UpstageLayoutAnalysisLoader
- class langchain_upstage.layout_analysis.UpstageLayoutAnalysisLoader(file_path: Union[str, Path, List[str], List[Path]], output_type: Union[Literal['text', 'html'], dict] = 'html', split: Literal['none', 'element', 'page'] = 'none', api_key: Optional[str] = None, use_ocr: bool = False, exclude: list = ['header', 'footer'])
Upstage Layout Analysis.
To use, you should have the environment variable UPSTAGE_API_KEY set with your API key or pass it as a named parameter to the constructor.
Example
    from langchain_upstage import UpstageLayoutAnalysisLoader

    file_path = "/PATH/TO/YOUR/FILE.pdf"

    # Return one Document per page, with plain-text content.
    loader = UpstageLayoutAnalysisLoader(
        file_path, split="page", output_type="text"
    )
Initializes an instance of the Upstage document loader.
- Parameters
file_path (Union[str, Path, List[str], List[Path]]) – The path to the document to be loaded, or a list of such paths.
output_type (Union[OutputType, dict], optional) – The type of output to be generated by the parser. Defaults to “html”.
split (SplitType, optional) – The type of splitting to be applied. Defaults to “none” (no splitting).
api_key (str, optional) – The API key for accessing the Upstage API. Defaults to None, in which case it will be fetched from the environment variable UPSTAGE_API_KEY.
use_ocr (bool, optional) – Whether to extract text from images in the document. Defaults to False, in which case the text information in the PDF file is used.
exclude (list, optional) – Exclude specific elements from the output. Defaults to [“header”, “footer”].
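The api_key fallback described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: resolve_api_key is a hypothetical helper, and the only detail taken from this page is the environment variable name UPSTAGE_API_KEY.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer the explicit argument, else the environment."""
    if api_key:
        return api_key
    # Fall back to the environment variable named in the documentation.
    env_key = os.environ.get("UPSTAGE_API_KEY")
    if not env_key:
        raise ValueError(
            "Pass api_key or set the UPSTAGE_API_KEY environment variable."
        )
    return env_key
```

Failing early with a clear message, as this sketch does, tends to be easier to debug than letting a missing key surface later as a failed API request.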
Methods
__init__(file_path[, output_type, split, ...]) – Initializes an instance of the Upstage document loader.
alazy_load() – A lazy loader for Documents.
aload() – Load data into Document objects.
lazy_load() – Lazily loads and parses the document using the UpstageLayoutAnalysisParser.
load() – Loads and parses the document using the UpstageLayoutAnalysisParser.
load_and_split([text_splitter]) – Load Documents and split into chunks.
- __init__(file_path: Union[str, Path, List[str], List[Path]], output_type: Union[Literal['text', 'html'], dict] = 'html', split: Literal['none', 'element', 'page'] = 'none', api_key: Optional[str] = None, use_ocr: bool = False, exclude: list = ['header', 'footer'])
Initializes an instance of the Upstage document loader.
- Parameters
file_path (Union[str, Path, List[str], List[Path]]) – The path to the document to be loaded, or a list of such paths.
output_type (Union[OutputType, dict], optional) – The type of output to be generated by the parser. Defaults to “html”.
split (SplitType, optional) – The type of splitting to be applied. Defaults to “none” (no splitting).
api_key (str, optional) – The API key for accessing the Upstage API. Defaults to None, in which case it will be fetched from the environment variable UPSTAGE_API_KEY.
use_ocr (bool, optional) – Whether to extract text from images in the document. Defaults to False, in which case the text information in the PDF file is used.
exclude (list, optional) – Exclude specific elements from the output. Defaults to [“header”, “footer”].
- async alazy_load() -> AsyncIterator[Document]
A lazy loader for Documents.
- Return type
AsyncIterator[Document]
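An AsyncIterator is consumed with async for inside a coroutine. The sketch below shows that pattern with a stand-in: FakeAsyncLoader is hypothetical and yields plain strings instead of Document objects, so it runs without an API key or network access.

```python
import asyncio
from typing import AsyncIterator, List


class FakeAsyncLoader:
    """Hypothetical stand-in for a loader; yields strings, not Documents."""

    async def alazy_load(self) -> AsyncIterator[str]:
        for page in ("page 1", "page 2", "page 3"):
            await asyncio.sleep(0)  # hand control back to the event loop
            yield page


async def collect(loader: FakeAsyncLoader) -> List[str]:
    results = []
    # Items arrive one at a time; other tasks can run between them.
    async for doc in loader.alazy_load():
        results.append(doc)
    return results


pages = asyncio.run(collect(FakeAsyncLoader()))
```

With a real loader the loop body would receive Document objects; the iteration pattern is the same.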
- lazy_load() -> Iterator[Document]
Lazily loads and parses the document using the UpstageLayoutAnalysisParser.
- Returns
An iterator of Document objects representing the parsed layout analysis.
- Return type
Iterator[Document]
- load() -> List[Document]
Loads and parses the document using the UpstageLayoutAnalysisParser.
- Returns
A list of Document objects representing the parsed layout analysis.
- Return type
List[Document]
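The practical difference between lazy_load() and load() is when the work happens. The sketch below illustrates this with a hypothetical stand-in, CountingLoader, which counts how many items it has produced. Building load() on top of lazy_load() is an assumption of the sketch, not a statement about how UpstageLayoutAnalysisLoader is implemented.

```python
from typing import Iterator, List


class CountingLoader:
    """Hypothetical stand-in that records how many items it has produced."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages
        self.parsed = 0

    def lazy_load(self) -> Iterator[str]:
        for page in self.pages:
            self.parsed += 1  # work happens only when an item is requested
            yield page

    def load(self) -> List[str]:
        # Eager variant: exhaust the iterator and keep everything in memory.
        return list(self.lazy_load())


loader = CountingLoader(["a", "b", "c"])
first = next(loader.lazy_load())     # only the first item is produced
parsed_after_first = loader.parsed
everything = loader.load()           # all items are produced at once
```

Lazy iteration suits large inputs or early exit; the eager list is simpler when every Document is needed anyway.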
- load_and_split(text_splitter: Optional[TextSplitter] = None) -> List[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method. It should be considered deprecated.
- Parameters
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns
List of Documents.
- Return type
List[Document]
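Since load_and_split is flagged as deprecated, one alternative is to load first and then split explicitly. The sketch below assumes only that the loader exposes load() and the splitter exposes split_documents(), as LangChain text splitters do. load_then_split, StubLoader, and StubSplitter are hypothetical names, and the stubs work on plain strings so the example runs on its own.

```python
from typing import Any, List


def load_then_split(loader: Any, text_splitter: Any) -> List[Any]:
    """Hypothetical helper: load everything, then split in a separate step."""
    docs = loader.load()
    return text_splitter.split_documents(docs)


class StubLoader:
    """Stand-in loader returning strings instead of Document objects."""

    def load(self) -> List[str]:
        return ["alpha beta", "gamma"]


class StubSplitter:
    """Stand-in splitter that breaks each string on whitespace."""

    def split_documents(self, docs: List[str]) -> List[str]:
        return [word for doc in docs for word in doc.split()]


chunks = load_then_split(StubLoader(), StubSplitter())
```

Keeping the two steps separate makes the choice of splitter explicit rather than relying on the default.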