langchain_community.document_loaders.docugami.DocugamiLoader
- class langchain_community.document_loaders.docugami.DocugamiLoader
Bases: BaseLoader, BaseModel
Deprecated since version 0.0.24: Use docugami_langchain.DocugamiLoader instead.
Load from Docugami.
To use, you should have the dgml-utils Python package installed.
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be parsed to form a valid model.
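As a rough illustration of how the constructor arguments fit together, the sketch below assembles keyword arguments for either local files or a remote docset. The helper names `docugami_loader_kwargs` and `make_loader` are hypothetical, and the rules that the two sources are mutually exclusive and that a remote docset needs an explicit token are assumptions made for this example rather than documented behavior. The import is deferred so the helper can be tried without the package installed.

```python
from pathlib import Path
from typing import Any, Dict, Optional, Sequence, Union


def docugami_loader_kwargs(
    docset_id: Optional[str] = None,
    access_token: Optional[str] = None,
    file_paths: Optional[Sequence[Union[Path, str]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for DocugamiLoader."""
    # Assumption for this sketch: use local files OR a remote docset, not both.
    if bool(file_paths) == bool(docset_id):
        raise ValueError("Provide exactly one of file_paths or docset_id")
    if file_paths:
        return {"file_paths": [Path(p) for p in file_paths]}
    # Assumption for this sketch: remote access needs an explicit token.
    if not access_token:
        raise ValueError("A remote docset needs an access token")
    return {"docset_id": docset_id, "access_token": access_token}


def make_loader(**kwargs: Any):
    """Build the loader; requires langchain_community and dgml-utils."""
    # Deferred import so the helper above works without the package.
    from langchain_community.document_loaders.docugami import DocugamiLoader

    return DocugamiLoader(**kwargs)
```

With the packages installed and real credentials in place of the placeholders, `make_loader(**docugami_loader_kwargs(docset_id="<docset-id>", access_token="<token>"))` would then return a loader instance.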
- param access_token: Optional[str] = None
The Docugami API access token to use.
- param api: str = 'https://api.docugami.com/v1preview1'
The Docugami API endpoint to use.
- param docset_id: Optional[str] = None
The Docugami API docset ID to use.
- param document_ids: Optional[Sequence[str]] = None
The Docugami API document IDs to use.
- param file_paths: Optional[Sequence[Union[Path, str]]] = None
The local file paths to use.
- param include_project_metadata_in_doc_metadata: bool = True
Set to True if you want to include the project metadata in the doc metadata.
- param include_xml_tags: bool = False
Set to True to include XML tags in the chunk output text.
- param max_metadata_length: int = 512
Max length of metadata text returned.
- param max_text_length: int = 4096
Max length of chunk text returned.
- param min_text_length: int = 32
Threshold below which a chunk is appended to the next one, to avoid over-chunking.
- param parent_hierarchy_levels: int = 0
Set appropriately to get parent chunks using the chunk hierarchy.
- param parent_id_key: str = 'doc_id'
Metadata key for parent doc ID.
- param sub_chunk_tables: bool = False
Set to True to return sub-chunks within tables.
- param whitespace_normalize_text: bool = True
Set to False if you want to keep the full whitespace formatting of the original XML doc, including indentation.
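To make the intent of `min_text_length` and `whitespace_normalize_text` more concrete, here is a minimal standalone sketch of the two ideas. The function names `merge_short_chunks` and `normalize_whitespace` are hypothetical, and the logic is only an illustration of the descriptions above; the loader's actual implementation may differ.

```python
from typing import List


def merge_short_chunks(chunks: List[str], min_text_length: int = 32) -> List[str]:
    """Illustrative only: join chunks shorter than the threshold to the next chunk."""
    merged: List[str] = []
    pending = ""  # short text waiting to be joined to the next chunk
    for chunk in chunks:
        text = f"{pending} {chunk}".strip() if pending else chunk
        if len(text) < min_text_length:
            pending = text
        else:
            merged.append(text)
            pending = ""
    if pending:
        # Nothing follows the last short chunk, so keep it as is.
        merged.append(pending)
    return merged


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (newlines, tabs, indentation) into single spaces."""
    return " ".join(text.split())
```

In this sketch a short heading such as `"Title"` ends up prefixed to the paragraph that follows it instead of becoming a tiny chunk of its own.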
- async alazy_load() → AsyncIterator[Document]
A lazy loader for Documents.
- Return type
AsyncIterator[Document]
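An async iterator like the one returned here is typically consumed with `async for`. The sketch below shows that pattern against a hypothetical stand-in class, `FakeLoader`, which yields plain strings instead of Document objects so that it runs without credentials or the package installed; `collect` is likewise an illustrative helper, not part of the library.

```python
import asyncio
from typing import AsyncIterator, List


class FakeLoader:
    """Hypothetical stand-in exposing the same alazy_load() shape."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = texts

    async def alazy_load(self) -> AsyncIterator[str]:
        for text in self.texts:
            await asyncio.sleep(0)  # yield control, as real I/O would
            yield text


async def collect(loader: FakeLoader, limit: int) -> List[str]:
    """Consume at most `limit` items from the async iterator."""
    items: List[str] = []
    async for item in loader.alazy_load():
        items.append(item)
        if len(items) >= limit:
            break  # stop early; remaining items are never produced
    return items


first_two = asyncio.run(collect(FakeLoader(["a", "b", "c"]), limit=2))
```

Because items are produced lazily, stopping after `limit` items avoids loading the rest.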
- load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method. It should be considered to be deprecated!
- Parameters
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns
List of Documents.
- Return type
List[Document]
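The load-then-split flow described for `load_and_split` can be sketched in plain Python as follows. `Doc` is a hypothetical stand-in for Document (it borrows the `page_content` and `metadata` field names), and `fixed_size_split` is a deliberately simplistic substitute for RecursiveCharacterTextSplitter, so this shows only the shape of the operation, not the library's splitting logic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Doc:
    """Hypothetical stand-in for Document: text plus metadata."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def fixed_size_split(text: str, chunk_size: int = 10) -> List[str]:
    """Simplistic splitter: cut the text every `chunk_size` characters."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def load_and_split(
    docs: Iterable[Doc],
    split: Callable[[str], List[str]] = fixed_size_split,
) -> List[Doc]:
    """Split each document and return every chunk as its own Doc."""
    chunks: List[Doc] = []
    for doc in docs:
        for piece in split(doc.page_content):
            # Each chunk carries a copy of its source document's metadata.
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks
```

Writing the two steps out separately like this is one way to avoid depending on a method that the documentation marks as effectively deprecated.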