langchain_community.document_loaders.url_playwright.PlaywrightURLLoader

class langchain_community.document_loaders.url_playwright.PlaywrightURLLoader(urls: List[str], continue_on_failure: bool = True, headless: bool = True, remove_selectors: Optional[List[str]] = None, evaluator: Optional[PlaywrightEvaluator] = None, proxy: Optional[Dict[str, str]] = None)

Load HTML pages with Playwright and parse with Unstructured.

This is useful for loading pages that require JavaScript to render.

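Parameters
  • urls (List[str]) – List of URLs to load.
  • continue_on_failure (bool) – If True, continue loading other URLs on failure.
  • headless (bool) – If True, the browser will run in headless mode.
  • remove_selectors (Optional[List[str]])
  • evaluator (Optional[PlaywrightEvaluator])
  • proxy (Optional[Dict[str, str]]) – If set, the browser will access URLs through the specified proxy.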

urls

List of URLs to load.

Type

List[str]

continue_on_failure

If True, continue loading other URLs on failure.

Type

bool

headless

If True, the browser will run in headless mode.

Type

bool

proxy

If set, the browser will access URLs through the specified proxy.

Type

Optional[Dict[str, str]]

Example

from langchain_community.document_loaders import PlaywrightURLLoader

urls = ["https://api.ipify.org/?format=json"]

# Proxy settings; "server" has the form https://<host>:<port>.
proxy = {
    "server": "https://xx.xx.xx:15818",
    "username": "username",
    "password": "password",
}
loader = PlaywrightURLLoader(urls, proxy=proxy)
data = loader.load()  # list of Document objects
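
The example above hard-codes the proxy credentials. As an alternative, here is a minimal sketch that builds the same `proxy` dictionary from environment variables. The helper name `proxy_from_env` and the variable names `PROXY_SERVER`, `PROXY_USERNAME`, and `PROXY_PASSWORD` are assumptions made for illustration; they are not part of the library.

```python
import os
from typing import Dict, Mapping, Optional


def proxy_from_env(env: Optional[Mapping[str, str]] = None) -> Optional[Dict[str, str]]:
    """Build a proxy dict shaped like the one in the example above.

    Hypothetical helper: the environment variable names are assumptions.
    Returns None when no server is configured, which corresponds to the
    loader's default of proxy=None.
    """
    env = os.environ if env is None else env
    server = env.get("PROXY_SERVER")
    if not server:
        return None
    proxy = {"server": server}
    # Only include credentials that are actually set.
    for key, var in (("username", "PROXY_USERNAME"), ("password", "PROXY_PASSWORD")):
        value = env.get(var)
        if value:
            proxy[key] = value
    return proxy
```

The result could then be passed along as `PlaywrightURLLoader(urls, proxy=proxy_from_env())`.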

Methods

__init__(urls[, continue_on_failure, ...])

Load a list of URLs using Playwright.

alazy_load()

Load the specified URLs with Playwright and create Documents asynchronously.

aload()

Load the specified URLs with Playwright and create Documents asynchronously.

lazy_load()

Load the specified URLs using Playwright and create Document instances.

load()

Load data into Document objects.

load_and_split([text_splitter])

Load Documents and split into chunks.

__init__(urls: List[str], continue_on_failure: bool = True, headless: bool = True, remove_selectors: Optional[List[str]] = None, evaluator: Optional[PlaywrightEvaluator] = None, proxy: Optional[Dict[str, str]] = None)

Load a list of URLs using Playwright.

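Parameters
  • urls (List[str]) – List of URLs to load.
  • continue_on_failure (bool) – If True, continue loading other URLs on failure.
  • headless (bool) – If True, the browser will run in headless mode.
  • remove_selectors (Optional[List[str]])
  • evaluator (Optional[PlaywrightEvaluator])
  • proxy (Optional[Dict[str, str]]) – If set, the browser will access URLs through the specified proxy.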

async alazy_load() → AsyncIterator[Document]

Load the specified URLs with Playwright and create Documents asynchronously. Use this function when in a Jupyter notebook environment.

Returns

An async iterator of Document instances with loaded content.

Return type

AsyncIterator[Document]
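
A minimal sketch of consuming `alazy_load()` with `async for`, so that each Document can be handled as soon as it is produced. The helper `collect_first` and its `limit` parameter are hypothetical; `loader` is assumed to be any object exposing `alazy_load()` as documented here.

```python
from typing import Any, List


async def collect_first(loader: Any, limit: int) -> List[Any]:
    """Gather at most `limit` Documents from loader.alazy_load()."""
    docs: List[Any] = []
    if limit <= 0:
        return docs
    async for doc in loader.alazy_load():
        docs.append(doc)
        if len(docs) >= limit:
            break  # stop iterating once enough Documents were collected
    return docs
```

In a notebook cell this could be called as `docs = await collect_first(loader, 3)`.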

async aload() → List[Document]

Load the specified URLs with Playwright and create Documents asynchronously. Use this function when in a Jupyter notebook environment.

Returns

A list of Document instances with loaded content.

Return type

List[Document]
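
In a Jupyter notebook an event loop is already running, so `await loader.aload()` can be written directly in a cell. For a plain script, the sketch below wraps the call in `asyncio.run`. The helper name `load_all_sync` is hypothetical, and `loader` is assumed to be any object exposing `aload()` as documented here.

```python
import asyncio
from typing import Any, List


def load_all_sync(loader: Any) -> List[Any]:
    """Run loader.aload() to completion from synchronous code.

    Intended for plain scripts: asyncio.run() raises RuntimeError when an
    event loop is already running in the current thread, as in a notebook.
    """
    return asyncio.run(loader.aload())
```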

lazy_load() → Iterator[Document]

Load the specified URLs using Playwright and create Document instances.

Returns

An iterator of Document instances with loaded content.

Return type

Iterator[Document]
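
Because `lazy_load()` returns an iterator, Documents can be processed one at a time rather than collected into a list first. The sketch below illustrates this. The helper `iter_page_sizes` is hypothetical, and reading the URL from `metadata["source"]` is an assumption about how the loader populates metadata.

```python
from typing import Any, Iterator, Optional, Tuple


def iter_page_sizes(loader: Any) -> Iterator[Tuple[Optional[str], int]]:
    """Yield (source, character count) for each Document as it is produced."""
    for doc in loader.lazy_load():
        # Assumption: the loader records the URL under metadata["source"];
        # .get() keeps the sketch working if that key is absent.
        yield doc.metadata.get("source"), len(doc.page_content)
```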

load() → List[Document]

Load data into Document objects.

Return type

List[Document]

load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method; it should be considered deprecated.

Parameters

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns

List of Documents.

Return type

List[Document]
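
Since `load_and_split` should be considered deprecated, one possible alternative is to load first and split in a separate step, as in the minimal sketch below. `load_then_split` is a hypothetical helper; `splitter` is assumed to be a TextSplitter instance, whose `split_documents` method takes a list of Documents.

```python
from typing import Any, List


def load_then_split(loader: Any, splitter: Any) -> List[Any]:
    """Load all Documents, then split them into chunks in a separate step."""
    docs = loader.load()
    # TextSplitter.split_documents returns the chunks as new Documents.
    return splitter.split_documents(docs)
```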

Examples using PlaywrightURLLoader