HtmlLinkExtractor

Beta

This feature is in beta. It is actively being worked on, so the API may change.

Extract hyperlinks from HTML content.

Expects the input to be an HTML string or a BeautifulSoup object.

Parameters:
  • kind (str) – The kind of edge to extract. Defaults to “hyperlink”.

  • drop_fragments (bool) – Whether fragments in URLs and links should be dropped. Defaults to True.
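To make the effect of drop_fragments concrete, here is a minimal standard-library sketch of hyperlink extraction. It is not the HtmlLinkExtractor implementation: the helper name extract_hrefs, the _HrefCollector class, and the example URLs are hypothetical, and the sketch assumes that "dropping fragments" means removing the "#..." suffix from each resolved URL.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class _HrefCollector(HTMLParser):
    """Collect the raw href value of every <a> tag (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html, base_url, drop_fragments=True):
    """Hypothetical helper: return the set of absolute URLs linked from html."""
    collector = _HrefCollector()
    collector.feed(html)
    urls = set()
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links against the page URL
        if drop_fragments:
            url = urldefrag(url).url  # strip the "#fragment" suffix, if any
        urls.add(url)
    return urls


page = '<a href="/docs#intro">Docs</a> <a href="/docs#api">API</a>'
print(extract_hrefs(page, "https://example.com/"))
print(extract_hrefs(page, "https://example.com/", drop_fragments=False))
```

With fragments dropped, both anchors collapse into the single URL https://example.com/docs. With drop_fragments=False they remain two distinct links.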

Methods

__init__(*[, kind, drop_fragments])

Extract hyperlinks from HTML content.

as_document_extractor([url_metadata_key])

Return a LinkExtractor that applies to documents.

extract_many(inputs)

Add edges from each input to the corresponding documents.

extract_one(input)

Add edges from each input to the corresponding documents.

__init__(*[, kind, drop_fragments])

Extract hyperlinks from HTML content.

Expects the input to be an HTML string or a BeautifulSoup object.

Parameters:
  • kind (str) – The kind of edge to extract. Defaults to “hyperlink”.

  • drop_fragments (bool) – Whether fragments in URLs and links should be dropped. Defaults to True.

as_document_extractor([url_metadata_key])

Return a LinkExtractor that applies to documents.

NOTE: Because HtmlLinkExtractor parses HTML, if you use it together with other similar link extractors, it may be more efficient to parse the content once and call the link extractors directly on the parsed BeautifulSoup object.

Parameters:

url_metadata_key (str) – The name of the field in the document metadata that holds the URL of the document.

Return type:

LinkExtractor[Document]
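The role of url_metadata_key can be illustrated with a small standard-library sketch. Everything here is a hypothetical stand-in rather than the library's API: Doc mimics a document with content and metadata, html_links is a deliberately naive HTML-level extractor, and for_documents shows one way an adapter might look up the document URL in metadata and pass it along as the base URL. The key name "source" is only an example value.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urljoin


@dataclass
class Doc:
    """Hypothetical stand-in for a document with content and metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def html_links(html, base_url):
    """Naive href scan, for illustration only (a real extractor parses the HTML)."""
    return {urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)}


def for_documents(extract, url_metadata_key):
    """Wrap an HTML-level extractor so that it accepts documents instead."""

    def extract_from_doc(doc):
        # The document URL is read from metadata under the configured key.
        return extract(doc.page_content, doc.metadata[url_metadata_key])

    return extract_from_doc


doc = Doc('<a href="guide.html">Guide</a>', {"source": "https://example.com/docs/"})
doc_links = for_documents(html_links, url_metadata_key="source")(doc)
print(doc_links)
```

In this sketch the relative link guide.html is resolved against the URL stored under metadata["source"], which is why the adapter needs to know the metadata key.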

extract_many(inputs)

Add edges from each input to the corresponding documents.

Parameters:

inputs (Iterable[InputT]) – The input content to extract edges from.

Returns:

Iterable over the set of links extracted from the input.

Return type:

Iterable[Set[Link]]
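The return shape, one set per input, can be sketched as follows. This assumes extract_many behaves like applying extract_one to each input in order, which the page does not state explicitly. The function names are hypothetical and plain strings stand in for Link objects.

```python
from typing import Callable, Iterable, Iterator, Set, TypeVar

InputT = TypeVar("InputT")


def extract_many(
    extract_one: Callable[[InputT], Set[str]], inputs: Iterable[InputT]
) -> Iterator[Set[str]]:
    """Yield one set of links per input, preserving input order."""
    for item in inputs:
        yield extract_one(item)


def urls_in_text(text: str) -> Set[str]:
    """Toy single-input extractor: whitespace-separated tokens that look like URLs."""
    return {word for word in text.split() if word.startswith("http")}


results = list(
    extract_many(
        urls_in_text,
        [
            "see https://a.example",
            "no links here",
            "https://b.example https://a.example",
        ],
    )
)
print(len(results))
```

An input without links yields an empty set rather than being skipped, so the results stay aligned with the inputs.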

extract_one(input)

Add edges from each input to the corresponding documents.

Parameters:

input (HtmlInput) – The input content to extract edges from.

Returns:

Set of links extracted from the input.

Return type:

Set[Link]