`langchain_community.document_transformers.beautiful_soup_transformer`.BeautifulSoupTransformer

class langchain_community.document_transformers.beautiful_soup_transformer.BeautifulSoupTransformer

Transform HTML content by extracting specific tags and removing unwanted ones.

Example

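from langchain_community.document_transformers import BeautifulSoupTransformer
from langchain_core.documents import Document

# Each Document carries raw HTML in page_content (sample content is illustrative).
docs = [Document(page_content="<p>Hello</p><script>track()</script>")]

bs4_transformer = BeautifulSoupTransformer()
docs_transformed = bs4_transformer.transform_documents(docs)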

Initialize the transformer.

This checks if the BeautifulSoup4 package is installed. If not, it raises an ImportError.

Methods

`__init__`()	Initialize the transformer.
`atransform_documents`(documents, **kwargs)	Asynchronously transform a list of documents.
`extract_tags`(html_content, tags, *[, ...])	Extract specific tags from the given HTML content.
`remove_unnecessary_lines`(content)	Clean up the content by removing unnecessary lines.
`remove_unwanted_classnames`(html_content, ...)	Remove unwanted class names from the given HTML content.
`remove_unwanted_tags`(html_content, unwanted_tags)	Remove unwanted tags from the given HTML content.
`transform_documents`(documents[, ...])	Transform a list of Document objects by cleaning their HTML content.

__init__() → None

Initialize the transformer.

This checks if the BeautifulSoup4 package is installed. If not, it raises an ImportError.

Return type: None
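
Because the constructor raises ImportError when BeautifulSoup4 is missing, code that treats HTML cleaning as optional may want to guard construction. The following is a minimal sketch; the helper name `make_transformer` and the policy of falling back to None are illustrative assumptions, not part of the library.

```python
def make_transformer():
    """Return a BeautifulSoupTransformer, or None if a dependency is missing.

    Hypothetical helper: the name and the None fallback are assumptions.
    """
    try:
        # Imported lazily so this module still loads without the optional packages.
        from langchain_community.document_transformers import BeautifulSoupTransformer

        # The constructor itself raises ImportError if BeautifulSoup4 is absent.
        return BeautifulSoupTransformer()
    except ImportError as exc:
        print(f"HTML cleaning unavailable: {exc}")
        return None


transformer = make_transformer()
```

Callers can then check `transformer is None` and skip the cleaning step instead of failing at import time.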

async atransform_documents(documents: Sequence[Document], **kwargs: Any) → Sequence[Document]

Asynchronously transform a list of documents.

Parameters

documents (Sequence[Document]) – A sequence of Documents to be transformed.
kwargs (Any) – Additional keyword arguments.

Returns

A sequence of transformed Documents.

Return type

Sequence[Document]
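
Since `atransform_documents` is a coroutine, it has to be awaited, and several calls can be scheduled together. The sketch below shows one way to do that with `asyncio.gather`; the helper name `transform_batches` and the idea of splitting documents into batches are assumptions made for illustration. The helper is duck-typed, so any object exposing `atransform_documents` will work.

```python
import asyncio
from typing import Any, List, Sequence


async def transform_batches(
    transformer: Any, batches: Sequence[Sequence[Any]]
) -> List[Any]:
    """Transform several batches concurrently and flatten the results.

    `transformer` is expected to expose the coroutine method
    `atransform_documents(documents, **kwargs)` described above.
    """
    # Schedule one coroutine per batch; gather returns results in input order.
    results = await asyncio.gather(
        *(transformer.atransform_documents(batch) for batch in batches)
    )
    # Flatten the per-batch sequences into a single list.
    return [doc for batch in results for doc in batch]
```

With a real transformer this might be invoked as `asyncio.run(transform_batches(BeautifulSoupTransformer(), [docs_a, docs_b]))`, where `docs_a` and `docs_b` are hypothetical lists of Document objects.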

static extract_tags(html_content: str, tags: Union[List[str], Tuple[str, ...]], *, remove_comments: bool = False) → str

Extract specific tags from the given HTML content.

Parameters

html_content (str) – The original HTML content string.
tags (Union[List[str], Tuple[str, ...]]) – A list or tuple of tag names to be extracted from the HTML.
remove_comments (bool) – If set to True, comments will be removed. Defaults to False.

Returns

A string combining the content of the extracted tags.

Return type

str

static remove_unnecessary_lines(content: str) → str

Clean up the content by removing unnecessary lines.

Parameters: content (str) – A string, which may contain unnecessary lines or spaces.
Returns: A cleaned string with unnecessary lines removed.
Return type: str
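
The description does not spell out what counts as an unnecessary line. One plausible reading, offered here as an assumption rather than as the library's implementation, is to strip whitespace from each line, drop lines that end up empty, and join the rest with single spaces. The function name `collapse_lines` is hypothetical and the sketch uses only the standard library.

```python
def collapse_lines(content: str) -> str:
    """Approximate line cleanup: strip each line, drop blanks, join with spaces."""
    stripped = (line.strip() for line in content.split("\n"))
    return " ".join(line for line in stripped if line)


messy = "  First line  \n\n\n   Second line\n   \n"
print(collapse_lines(messy))  # First line Second line
```

To get the library's exact behavior, call `BeautifulSoupTransformer.remove_unnecessary_lines(content)` directly.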

static remove_unwanted_classnames(html_content: str, unwanted_classnames: Union[List[str], Tuple[str, ...]]) → str

Remove unwanted class names from the given HTML content.

Parameters

html_content (str) – The original HTML content string.
unwanted_classnames (Union[List[str], Tuple[str, ...]]) – A list or tuple of class names to be removed from the HTML.

Returns

A cleaned HTML string with unwanted classnames removed.

Return type

str

static remove_unwanted_tags(html_content: str, unwanted_tags: Union[List[str], Tuple[str, ...]]) → str

Remove unwanted tags from the given HTML content.

Parameters

html_content (str) – The original HTML content string.
unwanted_tags (Union[List[str], Tuple[str, ...]]) – A list or tuple of tag names to be removed from the HTML.

Returns

A cleaned HTML string with unwanted tags removed.

Return type

str
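
The static helpers documented above take and return plain strings, so they can be chained by hand without creating Document objects. The sketch below shows one possible ordering; the helper name `strip_then_extract`, the chosen tag lists, and the class name `ad` are illustrative assumptions, and the order is not claimed to match what `transform_documents` does internally.

```python
def strip_then_extract(html: str) -> str:
    """Drop unwanted markup, then pull out paragraph and list-item content."""
    # Imported lazily so the module still loads without the optional dependency.
    from langchain_community.document_transformers import BeautifulSoupTransformer

    # Remove whole tags that never carry readable content.
    html = BeautifulSoupTransformer.remove_unwanted_tags(html, ["script", "style"])
    # Remove markup matching an unwanted class name ("ad" is a made-up example).
    html = BeautifulSoupTransformer.remove_unwanted_classnames(html, ["ad"])
    # Keep only the content of the tags of interest.
    return BeautifulSoupTransformer.extract_tags(html, ["p", "li"])
```

A call such as `strip_then_extract('<p>keep</p><script>x()</script>')` returns a single string combining the content of the extracted tags.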

transform_documents(documents: Sequence[Document], unwanted_tags: Union[List[str], Tuple[str, ...]] = ('script', 'style'), tags_to_extract: Union[List[str], Tuple[str, ...]] = ('p', 'li', 'div', 'a'), remove_lines: bool = True, *, unwanted_classnames: Union[Tuple[str, ...], List[str]] = (), remove_comments: bool = False, **kwargs: Any) → Sequence[Document]

Transform a list of Document objects by cleaning their HTML content.

Parameters

documents (Sequence[Document]) – A sequence of Document objects containing HTML content.
unwanted_tags (Union[List[str], Tuple[str, ...]]) – A list or tuple of tag names to be removed from the HTML. Defaults to ('script', 'style').
tags_to_extract (Union[List[str], Tuple[str, ...]]) – A list or tuple of tag names whose content will be extracted. Defaults to ('p', 'li', 'div', 'a').
remove_lines (bool) – If set to True, unnecessary lines will be removed. Defaults to True.
unwanted_classnames (Union[Tuple[str, ...], List[str]]) – A list or tuple of class names to be removed from the HTML. Defaults to ().
remove_comments (bool) – If set to True, comments will be removed. Defaults to False.
kwargs (Any) – Additional keyword arguments.

Returns

A sequence of Document objects with transformed content.

Return type

Sequence[Document]
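
To make the options above concrete, the sketch below calls `transform_documents` with each documented parameter spelled out. The wrapper name `clean_article_docs`, the extra `nav` tag, and the `sidebar` class name are hypothetical choices for illustration; only the parameter names come from the signature above. Note that `unwanted_classnames` and `remove_comments` are keyword-only.

```python
from typing import Any, Sequence


def clean_article_docs(transformer: Any, docs: Sequence[Any]) -> Sequence[Any]:
    """Call transform_documents with every documented option made explicit."""
    return transformer.transform_documents(
        docs,
        unwanted_tags=("script", "style", "nav"),  # tags to remove from the HTML
        tags_to_extract=("p", "li"),               # tags whose content is extracted
        remove_lines=True,                         # remove unnecessary lines
        unwanted_classnames=("sidebar",),          # class names to remove (made up)
        remove_comments=True,                      # remove comments
    )
```

With the real class this would be used as `clean_article_docs(BeautifulSoupTransformer(), docs)`, where `docs` is a sequence of Document objects containing HTML.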
