langchain.document_transformers.beautiful_soup_transformer.BeautifulSoupTransformer
- class langchain.document_transformers.beautiful_soup_transformer.BeautifulSoupTransformer
Transform HTML content by extracting specific tags and removing unwanted ones.
Initialize the transformer.
This checks if the BeautifulSoup4 package is installed. If not, it raises an ImportError.
Methods
__init__(): Initialize the transformer.
atransform_documents(documents, **kwargs): Asynchronously transform a list of documents.
extract_tags(html_content, tags): Extract specific tags from the given HTML content.
remove_unnecessary_lines(content): Clean up the content by removing unnecessary lines.
remove_unwanted_tags(html_content, unwanted_tags): Remove unwanted tags from the given HTML content.
transform_documents(documents[, ...]): Transform a list of Document objects by cleaning their HTML content.
- __init__() -> None
Initialize the transformer.
This checks if the BeautifulSoup4 package is installed. If not, it raises an ImportError.
- async atransform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]
Asynchronously transform a list of documents.
- Parameters
documents – A sequence of Documents to be transformed.
- Returns
A list of transformed Documents.
- static extract_tags(html_content: str, tags: List[str]) -> str
Extract specific tags from the given HTML content.
- Parameters
html_content – The original HTML content string.
tags – A list of tags to be extracted from the HTML.
- Returns
A string combining the content of the extracted tags.
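To illustrate what extracting tags means here, the following is a minimal sketch that uses the standard library's html.parser in place of BeautifulSoup. It is not the library's implementation: the name extract_tags_sketch and the choice to join text fragments with single spaces are assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import List


class _TagTextCollector(HTMLParser):
    """Collect text that appears inside any of the requested tags."""

    def __init__(self, tags: List[str]) -> None:
        super().__init__()
        self.tags = set(tags)
        self.depth = 0  # number of requested tags currently open
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found inside a requested tag.
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


def extract_tags_sketch(html_content: str, tags: List[str]) -> str:
    """Hypothetical stand-in: return the text of the given tags, space-joined."""
    parser = _TagTextCollector(tags)
    parser.feed(html_content)
    parser.close()
    return " ".join(parser.parts)


html = "<h1>Title</h1><p>First</p><ul><li>Item</li></ul><span>skip</span>"
print(extract_tags_sketch(html, ["p", "li"]))  # prints "First Item"
```

This sketch assumes well-formed markup. An unclosed requested tag would leave the depth counter open, whereas a tolerant parser such as BeautifulSoup repairs that kind of input.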
- static remove_unnecessary_lines(content: str) -> str
Clean up the content by removing unnecessary lines.
- Parameters
content – A string, which may contain unnecessary lines or spaces.
- Returns
A cleaned string with unnecessary lines removed.
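The description does not say which lines count as unnecessary. One plausible reading, sketched below under that assumption, is to strip surrounding whitespace from every line, drop the lines that end up empty, and join the rest with single spaces. The name collapse_lines_sketch is hypothetical and the actual method may behave differently.

```python
def collapse_lines_sketch(content: str) -> str:
    """Hypothetical cleanup: strip each line, drop empty ones, join with spaces."""
    stripped = (line.strip() for line in content.split("\n"))
    return " ".join(line for line in stripped if line)


print(collapse_lines_sketch("  Hello  \n\n\n  World \n"))  # prints "Hello World"
```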
- static remove_unwanted_tags(html_content: str, unwanted_tags: List[str]) -> str
Remove unwanted tags from the given HTML content.
- Parameters
html_content – The original HTML content string.
unwanted_tags – A list of tags to be removed from the HTML.
- Returns
A cleaned HTML string with unwanted tags removed.
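As an illustration of the idea, the sketch below removes unwanted elements together with everything inside them and re-emits the rest of the markup. It uses the standard library's html.parser rather than BeautifulSoup, so it is a stand-in and not the library's code; remove_unwanted_tags_sketch is a hypothetical name, and the sketch also drops comments and doctype declarations, which the real method may keep.

```python
from html.parser import HTMLParser
from typing import List


class _TagStripper(HTMLParser):
    """Re-emit HTML while skipping unwanted elements and their contents."""

    def __init__(self, unwanted: List[str]) -> None:
        # convert_charrefs=False keeps entities such as &amp; intact on output.
        super().__init__(convert_charrefs=False)
        self.unwanted = set(unwanted)
        self.skip_depth = 0  # > 0 while inside an unwanted element
        self.out: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.unwanted:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, with no end tag.
        if tag not in self.unwanted and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.unwanted:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def remove_unwanted_tags_sketch(html_content: str, unwanted_tags: List[str]) -> str:
    """Hypothetical stand-in: return the HTML without the unwanted elements."""
    parser = _TagStripper(unwanted_tags)
    parser.feed(html_content)
    parser.close()
    return "".join(parser.out)


html = "<div><script>alert(1)</script><p>Keep &amp; this</p><style>p{}</style></div>"
print(remove_unwanted_tags_sketch(html, ["script", "style"]))
# prints "<div><p>Keep &amp; this</p></div>"
```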
- transform_documents(documents: Sequence[Document], unwanted_tags: List[str] = ['script', 'style'], tags_to_extract: List[str] = ['p', 'li', 'div', 'a'], remove_lines: bool = True, **kwargs: Any) -> Sequence[Document]
Transform a list of Document objects by cleaning their HTML content.
- Parameters
documents – A sequence of Document objects containing HTML content.
unwanted_tags – A list of tags to be removed from the HTML.
tags_to_extract – A list of tags whose content will be extracted.
remove_lines – If set to True, unnecessary lines will be removed from the HTML content.
- Returns
A sequence of Document objects with transformed content.
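The parameters suggest a three-step pipeline per document: remove unwanted tags, extract the wanted tags, then optionally clean up lines. The sketch below shows that flow under the assumption that the steps run in that order. Doc is a hypothetical stand-in for Document, the two regex helpers are toy stand-ins for the HTML steps (not how the library parses HTML), and returning new objects rather than modifying the inputs is likewise an assumption.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for Document: text content plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def collapse_lines(text: str) -> str:
    """Assumed cleanup rule: strip lines, drop empty ones, join with spaces."""
    return " ".join(line.strip() for line in text.split("\n") if line.strip())


def transform_sketch(
    docs: Sequence[Doc],
    remove_unwanted: Callable[[str], str],
    extract: Callable[[str], str],
    remove_lines: bool = True,
) -> List[Doc]:
    """Apply the three steps to each document, keeping its metadata."""
    result = []
    for doc in docs:
        content = remove_unwanted(doc.page_content)
        content = extract(content)
        if remove_lines:
            content = collapse_lines(content)
        result.append(Doc(page_content=content, metadata=doc.metadata))
    return result


# Toy stand-ins for the HTML steps, good enough for this tiny input only.
def drop_scripts(html: str) -> str:
    return re.sub(r"<script>.*?</script>", "", html, flags=re.DOTALL)


def strip_markup(html: str) -> str:
    return re.sub(r"<[^>]+>", "", html)


docs = [Doc("<p>Hello</p>\n<script>x = 1</script>\n<p>World</p>", {"source": "a"})]
cleaned = transform_sketch(docs, drop_scripts, strip_markup)
print(cleaned[0].page_content)  # prints "Hello World"
```

Passing remove_lines=False in this sketch skips the last step, so the blank lines left behind by the removed script element stay in the output.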