langchain_community.document_loaders.chromium.AsyncChromiumLoader

class langchain_community.document_loaders.chromium.AsyncChromiumLoader(urls: List[str], *, headless: bool = True)

Scrape HTML pages from URLs using a headless instance of Chromium.

Initialize the loader with a list of URL paths.

Parameters
  • urls (List[str]) – A list of URLs to scrape content from.

  • headless (bool) – Whether to run the browser in headless mode.

Raises

ImportError – If the required ‘playwright’ package is not installed.
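A minimal construction sketch, assuming `langchain_community` is installed and the class can be imported from `langchain_community.document_loaders`. The helper name `build_loader` and the example URL are illustrative and not part of the library. Because the constructor raises ImportError when `playwright` is missing, the helper catches that case and returns None instead.

```python
from typing import List, Optional


def build_loader(urls: List[str], headless: bool = True) -> Optional[object]:
    """Return an AsyncChromiumLoader, or None if a dependency is missing.

    `build_loader` is a hypothetical helper used only for this example.
    """
    try:
        # Import inside the function so a missing package is reported here.
        from langchain_community.document_loaders import AsyncChromiumLoader

        # `headless` is keyword-only in the documented signature.
        return AsyncChromiumLoader(urls, headless=headless)
    except ImportError as exc:
        # Raised when langchain_community or playwright is not installed.
        print(f"Loader unavailable: {exc}")
        return None


loader = build_loader(["https://example.com"])
```

If a loader is returned, calling `loader.load()` should give the scraped pages as Documents. Playwright also needs a Chromium binary, which is typically installed with `playwright install chromium`.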

Methods

__init__(urls, *[, headless])

Initialize the loader with a list of URL paths.

alazy_load()

Asynchronously load text content from the provided URLs.

aload()

Load data into Document objects.

ascrape_playwright(url)

Asynchronously scrape the content of a given URL using Playwright's async API.

lazy_load()

Lazily load text content from the provided URLs.

load()

Load data into Document objects.

load_and_split([text_splitter])

Load Documents and split into chunks.

__init__(urls: List[str], *, headless: bool = True)

Initialize the loader with a list of URL paths.

Parameters
  • urls (List[str]) – A list of URLs to scrape content from.

  • headless (bool) – Whether to run the browser in headless mode.

Raises

ImportError – If the required ‘playwright’ package is not installed.

async alazy_load() → AsyncIterator[Document]

Asynchronously load text content from the provided URLs.

This method uses asyncio to start scraping all provided URLs concurrently, which improves throughput compared with sequential requests. Each Document is yielded as soon as its scraped content is available.

Yields

Document – A Document object containing the scraped content, along with its source URL as metadata.

Return type

AsyncIterator[Document]
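The "yield each result as soon as it is available" behaviour described above can be illustrated with the standard library alone. This is a sketch of one way to express the pattern with `asyncio.as_completed`, not the library's implementation. `fake_scrape`, `scrape_as_completed`, the URLs, and the delays are all hypothetical stand-ins for real browser work.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Tuple


async def fake_scrape(url: str, delay: float) -> Tuple[str, str]:
    """Stand-in for a real scraper: wait `delay` seconds, return (url, html)."""
    await asyncio.sleep(delay)
    return url, f"<html>{url}</html>"


async def scrape_as_completed(
    delays: Dict[str, float]
) -> AsyncIterator[Tuple[str, str]]:
    """Start every scrape at once and yield each result when it finishes."""
    tasks = [asyncio.create_task(fake_scrape(u, d)) for u, d in delays.items()]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def collect() -> List[str]:
    # The slow URL is listed first but should be yielded last.
    delays = {"https://slow.example": 0.2, "https://fast.example": 0.01}
    return [url async for url, _html in scrape_as_completed(delays)]


order = asyncio.run(collect())
print(order)
```

Results arrive in completion order rather than input order, so a consumer that cares about the original order needs to track it separately, for example through the source URL in each Document's metadata.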

async aload() → List[Document]

Load data into Document objects.

Return type

List[Document]

async ascrape_playwright(url: str) → str

Asynchronously scrape the content of a given URL using Playwright’s async API.

Parameters

url (str) – The URL to scrape.

Returns

The scraped HTML content or an error message if an exception occurs.

Return type

str
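Since this method returns an error message rather than raising, callers have to inspect the returned string. The sketch below shows that "error as return value" pattern in isolation. `scrape_or_error`, `ok_fetch`, and `failing_fetch` are hypothetical names, and the `Error:` prefix is an assumption made for the example, not a documented format.

```python
import asyncio
from typing import Awaitable, Callable


async def scrape_or_error(
    url: str, fetch: Callable[[str], Awaitable[str]]
) -> str:
    """Return the fetched HTML, or an error string if fetching fails."""
    try:
        return await fetch(url)
    except Exception as exc:
        # Report the failure in-band instead of propagating the exception.
        return f"Error: {exc}"  # prefix is an assumption for this sketch


async def ok_fetch(url: str) -> str:
    """Hypothetical fetcher that always succeeds."""
    return "<html>ok</html>"


async def failing_fetch(url: str) -> str:
    """Hypothetical fetcher that always fails."""
    raise RuntimeError("navigation timed out")


good = asyncio.run(scrape_or_error("https://example.com", ok_fetch))
bad = asyncio.run(scrape_or_error("https://example.com", failing_fetch))
print(good)
print(bad)
```

A consequence of this design is that a failed page still produces a string, so downstream code that expects HTML may want to check the content before parsing it.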

lazy_load() → Iterator[Document]

Lazily load text content from the provided URLs.

This method yields Documents one at a time as they’re scraped, instead of waiting to scrape all URLs before returning.

Yields

Document – The scraped content encapsulated within a Document object.

Return type

Iterator[Document]
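The difference between lazy and eager loading can be seen with a plain generator. This is a stand-in sketch using only the standard library: `lazy_pages` and `fake_scrape` are hypothetical, and the `scraped` list exists only to record when each URL is actually processed.

```python
from typing import Callable, Iterator, List, Tuple

scraped: List[str] = []  # records the order in which URLs are processed


def fake_scrape(url: str) -> str:
    """Stand-in for a real scraper; logs the call and returns fake HTML."""
    scraped.append(url)
    return f"<html>{url}</html>"


def lazy_pages(
    urls: List[str], scrape: Callable[[str], str]
) -> Iterator[Tuple[str, str]]:
    """Yield (url, html) pairs one at a time, scraping only on demand."""
    for url in urls:
        yield url, scrape(url)


pages = lazy_pages(["https://a.example", "https://b.example"], fake_scrape)

first = next(pages)                # only the first URL is scraped here
scraped_after_first = list(scraped)

rest = list(pages)                 # consuming the iterator scrapes the rest
```

With an iterator like this, a caller can stop early (for instance after the first page that matches some condition) without paying for the remaining URLs.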

load() → List[Document]

Load data into Document objects.

Return type

List[Document]

load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method; it should be considered deprecated.

Parameters

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns

List of Documents.

Return type

List[Document]
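To make "chunks are returned as Documents" concrete, here is a deliberately simplified sketch. It uses a hypothetical `Doc` dataclass and a fixed-width `split_docs` helper in place of the real Document and TextSplitter classes. RecursiveCharacterTextSplitter chooses split points more carefully than this, so treat the code as an illustration of the input and output shape only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_docs(docs: List[Doc], chunk_size: int) -> List[Doc]:
    """Cut each document into fixed-width chunks, copying its metadata."""
    chunks: List[Doc] = []
    for doc in docs:
        text = doc.page_content
        for start in range(0, len(text), chunk_size):
            piece = text[start:start + chunk_size]
            # Each chunk is itself a Doc that remembers its source.
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks


page = Doc("abcdefghij", {"source": "https://example.com"})
chunks = split_docs([page], chunk_size=4)
print([c.page_content for c in chunks])
```

Each chunk carries a copy of the original metadata, so the source URL stays attached after splitting.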

Examples using AsyncChromiumLoader