langchain_core.documents.transformers.BaseDocumentTransformer

class langchain_core.documents.transformers.BaseDocumentTransformer

Abstract base class for document transformation systems.

A document transformation system takes a sequence of Documents and returns a sequence of transformed Documents.

Example

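from typing import Any, Callable, Sequence

from langchain_core.documents import BaseDocumentTransformer, Document
from langchain_core.embeddings import Embeddings
from pydantic import BaseModel  # assumed; the original does not show its imports

# cosine_similarity, get_stateful_documents, _get_embeddings_from_stateful_docs
# and _filter_similar_embeddings are helper functions defined outside this
# snippet; import them from wherever they live in your installation.

class EmbeddingsRedundantFilter(BaseDocumentTransformer, BaseModel):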
    embeddings: Embeddings
    similarity_fn: Callable = cosine_similarity
    similarity_threshold: float = 0.95

    class Config:
        arbitrary_types_allowed = True

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        stateful_documents = get_stateful_documents(documents)
        embedded_documents = _get_embeddings_from_stateful_docs(
            self.embeddings, stateful_documents
        )
        included_idxs = _filter_similar_embeddings(
            embedded_documents, self.similarity_fn, self.similarity_threshold
        )
        return [stateful_documents[i] for i in sorted(included_idxs)]

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        raise NotImplementedError

Methods

__init__()

atransform_documents(documents, **kwargs)

Asynchronously transform a sequence of documents.

transform_documents(documents, **kwargs)

Transform a sequence of documents.

__init__()

async atransform_documents(documents: Sequence[Document], **kwargs: Any) → Sequence[Document]

Asynchronously transform a sequence of documents.

Parameters
  • documents (Sequence[Document]) – A sequence of Documents to be transformed.

  • kwargs (Any) – Additional keyword arguments for the transformation.

Returns

A sequence of transformed Documents.

Return type

Sequence[Document]
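
As a rough illustration of the asynchronous entry point, the sketch below defines a hypothetical WhitespaceNormalizer and awaits its atransform_documents. The signature above is not marked abstract, so a subclass may simply inherit it; this sketch overrides it explicitly so that its behaviour does not depend on the inherited implementation. The class name and the fallback stand-ins are assumptions made for the example and are not part of langchain_core.

```python
import asyncio
from typing import Any, Sequence

try:
    from langchain_core.documents import BaseDocumentTransformer, Document
except ImportError:
    # Stand-ins so the sketch runs without langchain_core installed.
    # They mimic only the parts used here and are NOT the real classes.
    from abc import ABC, abstractmethod
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

    class BaseDocumentTransformer(ABC):
        @abstractmethod
        def transform_documents(self, documents, **kwargs): ...


class WhitespaceNormalizer(BaseDocumentTransformer):
    """Hypothetical transformer: collapse whitespace runs in each document."""

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        # Build new Documents instead of mutating the inputs.
        return [
            Document(
                page_content=" ".join(d.page_content.split()),
                metadata=dict(d.metadata),
            )
            for d in documents
        ]

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        # Nothing here awaits I/O, so delegate to the synchronous version.
        return self.transform_documents(documents, **kwargs)


async def main() -> Sequence[Document]:
    docs = [Document(page_content="hello   \n world")]
    return await WhitespaceNormalizer().atransform_documents(docs)


result = asyncio.run(main())
```

A transformer that calls a remote service would more plausibly await that call inside atransform_documents rather than delegate to the synchronous method.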

abstract transform_documents(documents: Sequence[Document], **kwargs: Any) → Sequence[Document]

Transform a sequence of documents.

Parameters
  • documents (Sequence[Document]) – A sequence of Documents to be transformed.

  • kwargs (Any) – Additional keyword arguments for the transformation.

Returns

A sequence of transformed Documents.

Return type

Sequence[Document]
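
Because transform_documents is the only abstract method, a concrete transformer needs to implement at least this one. The sketch below is a minimal example under stated assumptions: MinLengthFilter, its min_chars parameter, and the use of kwargs to override that value per call are all hypothetical choices made for illustration, and the fallback stand-ins exist only so the snippet runs without langchain_core installed.

```python
from typing import Any, Sequence

try:
    from langchain_core.documents import BaseDocumentTransformer, Document
except ImportError:
    # Stand-ins so the sketch runs without langchain_core installed.
    # They mimic only the parts used here and are NOT the real classes.
    from abc import ABC, abstractmethod
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

    class BaseDocumentTransformer(ABC):
        @abstractmethod
        def transform_documents(self, documents, **kwargs): ...


class MinLengthFilter(BaseDocumentTransformer):
    """Hypothetical transformer: drop documents shorter than min_chars."""

    def __init__(self, min_chars: int = 10) -> None:
        self.min_chars = min_chars

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        # A per-call keyword argument may override the instance default.
        min_chars = kwargs.get("min_chars", self.min_chars)
        return [d for d in documents if len(d.page_content) >= min_chars]


docs = [
    Document(page_content="short"),
    Document(page_content="a considerably longer document"),
]
kept = MinLengthFilter(min_chars=10).transform_documents(docs)
kept_all = MinLengthFilter().transform_documents(docs, min_chars=1)
```

Here kept holds only the longer document, while kept_all retains both because the per-call min_chars lowers the threshold.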