langchain.document_loaders.parsers.language.language_parser.LanguageParser

class langchain.document_loaders.parsers.language.language_parser.LanguageParser(language: Optional[Language] = None, parser_threshold: int = 0)

Parse using the respective programming language syntax.

Each top-level function and class in the code is loaded into separate documents. Furthermore, an extra document is generated, containing the remaining top-level code that excludes the already segmented functions and classes.
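The splitting described above can be illustrated with a minimal, standalone sketch for Python sources. It uses only the standard-library `ast` module and is not LanguageParser's actual implementation; the helper name `split_top_level` is hypothetical, and how the real parser renders the leftover code may differ.

```python
import ast


def split_top_level(source: str):
    """Return (segments, remainder) for a Python source string.

    segments: one string per top-level function or class.
    remainder: the source lines that belong to no segment.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    segments = []
    covered = set()  # 0-based indices of lines that belong to a segment

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno  # available on Python 3.8+
            segments.append("\n".join(lines[start - 1:end]))
            covered.update(range(start - 1, end))

    remainder = "\n".join(
        line for index, line in enumerate(lines) if index not in covered
    )
    return segments, remainder


source = '''import os

def greet(name):
    return f"hello {name}"

class Counter:
    def __init__(self):
        self.n = 0

print(greet(os.getcwd()))
'''

segments, remainder = split_top_level(source)
for segment in segments:
    print(segment, end="\n---\n")
print(remainder)
```

Here `segments` holds the `greet` function and the `Counter` class as separate strings, while `remainder` keeps the import and the final `print` call, mirroring the "one document per definition plus one extra document" layout.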

This approach can potentially improve the accuracy of QA models over source code.

Currently, the supported languages for code parsing are Python and JavaScript.

The language used for parsing can be configured, along with the minimum number of lines required to activate the splitting based on syntax.

Examples

.. code-block:: python

    from langchain.text_splitter import Language
    from langchain.document_loaders.generic import GenericLoader
    from langchain.document_loaders.parsers import LanguageParser

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py", ".js"],
        parser=LanguageParser()
    )
    docs = loader.load()

Example instantiation that manually selects the language:

.. code-block:: python

    from langchain.text_splitter import Language

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(language=Language.PYTHON)
    )

Example instantiation that sets the line-count threshold:

.. code-block:: python

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(parser_threshold=200)
    )

Language parser that splits code using the respective language syntax.

Parameters
  • language – If None (default), it will try to infer language from source.

  • parser_threshold – Minimum lines needed to activate parsing (0 by default).
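How the two parameters might interact can be sketched as a small decision function. This is an illustration under stated assumptions, not the library's code: the name `plan_parsing`, the `EXTENSION_TO_LANGUAGE` map, and the returned strings are all hypothetical; inferring the language from the file extension is an assumption about what "infer language from source" means; and whether a file with exactly `parser_threshold` lines is split is also an assumption.

```python
from typing import Optional

# Hypothetical extension-to-language map, for illustration only.
EXTENSION_TO_LANGUAGE = {"py": "python", "js": "js"}


def plan_parsing(
    path: str,
    code: str,
    language: Optional[str] = None,
    parser_threshold: int = 0,
) -> str:
    """Describe how a file would be handled under the assumptions above."""
    # An explicitly configured language wins; otherwise guess from the extension.
    if language is None:
        extension = path.rsplit(".", 1)[-1] if "." in path else ""
        language = EXTENSION_TO_LANGUAGE.get(extension)

    if language is None:
        return "whole file (language unknown)"

    # Assumed boundary: files shorter than the threshold are kept whole.
    if len(code.splitlines()) < parser_threshold:
        return "whole file (below threshold)"

    return f"split by {language} syntax"


short_code = "x = 1\ny = 2\n"
print(plan_parsing("notes.txt", short_code))
print(plan_parsing("app.py", short_code))
print(plan_parsing("app.py", short_code, parser_threshold=200))
print(plan_parsing("script", short_code, language="python"))
```

With the default threshold of 0 every file in a recognized language is split; raising the threshold leaves short files as single documents.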

Methods

__init__([language, parser_threshold])

Language parser that splits code using the respective language syntax.

lazy_parse(blob)

Lazy parsing interface.

parse(blob)

Eagerly parse the blob into a document or documents.

__init__(language: Optional[Language] = None, parser_threshold: int = 0)

Language parser that splits code using the respective language syntax.

Parameters
  • language – If None (default), it will try to infer language from source.

  • parser_threshold – Minimum lines needed to activate parsing (0 by default).

lazy_parse(blob: Blob) → Iterator[Document]

Lazy parsing interface.

Subclasses are required to implement this method.

Parameters

blob – Blob instance

Returns

Generator of documents

parse(blob: Blob) → List[Document]

Eagerly parse the blob into a document or documents.

This is a convenience method for interactive development environments.

Production applications should favor the lazy_parse method instead.

Subclasses should generally not override this parse method.

Parameters

blob – Blob instance

Returns

List of documents
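The relationship between the lazy and eager interfaces can be sketched with a toy parser. This is a simplified stand-in, not the library's classes: `SketchParser` is hypothetical, it works on plain strings rather than Blob and Document objects, and splitting on blank lines is only a placeholder for real parsing.

```python
from typing import Iterator, List


class SketchParser:
    """Toy parser: one 'document' per blank-line-separated chunk of text."""

    def lazy_parse(self, text: str) -> Iterator[str]:
        # Yield chunks one at a time, so a consumer can stop early
        # without materializing the rest.
        for chunk in text.split("\n\n"):
            if chunk.strip():
                yield chunk

    def parse(self, text: str) -> List[str]:
        # Eager convenience wrapper: collect everything the lazy
        # interface produces into a list.
        return list(self.lazy_parse(text))


parser = SketchParser()
text = "def a():\n    pass\n\ndef b():\n    pass\n\nprint('done')"

first = next(parser.lazy_parse(text))  # only the first chunk is produced
everything = parser.parse(text)        # all chunks, held in memory at once
```

The lazy form suits pipelines that stream many large files, while the eager form is handy in a notebook or REPL where a list is easier to inspect.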
