langchain_community.utilities.spark_sql.SparkSQL¶

class langchain_community.utilities.spark_sql.SparkSQL(spark_session: Optional[SparkSession] = None, catalog: Optional[str] = None, schema: Optional[str] = None, ignore_tables: Optional[List[str]] = None, include_tables: Optional[List[str]] = None, sample_rows_in_table_info: int = 3)[source]¶

SparkSQL is a utility class for interacting with Spark SQL.

Initialize a SparkSQL object.

Parameters
  • spark_session (Optional[SparkSession]) – A SparkSession object. If not provided, one will be created.

  • catalog (Optional[str]) – The catalog to use. If not provided, the default catalog will be used.

  • schema (Optional[str]) – The schema to use. If not provided, the default schema will be used.

  • ignore_tables (Optional[List[str]]) – A list of tables to ignore. If not provided, all tables will be used.

  • include_tables (Optional[List[str]]) – A list of tables to include. If not provided, all tables will be used.

  • sample_rows_in_table_info (int) – The number of rows to include in the table info. Defaults to 3.
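
A minimal construction sketch, assuming pyspark and langchain_community are installed and a local Spark session can be started; the table name below is illustrative, not part of the API:

```python
from pyspark.sql import SparkSession
from langchain_community.utilities.spark_sql import SparkSQL

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.getOrCreate()

# Wrap it in the SparkSQL utility; restrict table info to one table
# ("titanic" is an illustrative name) and keep the default of 3 sample
# rows per table description.
spark_sql = SparkSQL(
    spark_session=spark,
    include_tables=["titanic"],
    sample_rows_in_table_info=3,
)

print(spark_sql.get_usable_table_names())
```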

Methods

__init__([spark_session, catalog, schema, ...])

Initialize a SparkSQL object.

from_uri(database_uri[, engine_args])

Create a remote Spark session via Spark Connect.

get_table_info([table_names])

Get information about specified tables.

get_table_info_no_throw([table_names])

Get information about specified tables.

get_usable_table_names()

Get the names of the available tables.

run(command[, fetch])

Execute a SQL command and return a string representing the results.

run_no_throw(command[, fetch])

Execute a SQL command and return a string representing the results.

__init__(spark_session: Optional[SparkSession] = None, catalog: Optional[str] = None, schema: Optional[str] = None, ignore_tables: Optional[List[str]] = None, include_tables: Optional[List[str]] = None, sample_rows_in_table_info: int = 3)[source]¶

Initialize a SparkSQL object.

Parameters
  • spark_session (Optional[SparkSession]) – A SparkSession object. If not provided, one will be created.

  • catalog (Optional[str]) – The catalog to use. If not provided, the default catalog will be used.

  • schema (Optional[str]) – The schema to use. If not provided, the default schema will be used.

  • ignore_tables (Optional[List[str]]) – A list of tables to ignore. If not provided, all tables will be used.

  • include_tables (Optional[List[str]]) – A list of tables to include. If not provided, all tables will be used.

  • sample_rows_in_table_info (int) – The number of rows to include in the table info. Defaults to 3.

classmethod from_uri(database_uri: str, engine_args: Optional[dict] = None, **kwargs: Any) → SparkSQL[source]¶

Create a remote Spark session via Spark Connect. For example: SparkSQL.from_uri("sc://localhost:15002")

Parameters
  • database_uri (str) –

  • engine_args (Optional[dict]) –

  • kwargs (Any) –

Return type

SparkSQL
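
A sketch of the remote case, assuming a Spark Connect server is listening on localhost:15002 and a pyspark build with Spark Connect support is installed:

```python
from langchain_community.utilities.spark_sql import SparkSQL

# Connect to a remote Spark Connect endpoint instead of a local session.
# Requires pyspark with Spark Connect support and a running server.
spark_sql = SparkSQL.from_uri("sc://localhost:15002")
```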

get_table_info(table_names: Optional[List[str]] = None) → str[source]¶

Get information about specified tables.

Parameters
  • table_names (Optional[List[str]]) –

Return type

str

get_table_info_no_throw(table_names: Optional[List[str]] = None) → str[source]¶

Get information about specified tables.

Follows best practices as specified in: Rajkumar et al, 2022 (https://arxiv.org/abs/2204.00498)

If sample_rows_in_table_info is greater than zero, the specified number of sample rows will be appended to each table description. This can improve model performance, as demonstrated in the paper.

Parameters

table_names (Optional[List[str]]) –

Return type

str
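
To illustrate the shape described above, here is a pure-Python sketch of appending sample rows to a table description. The function name and formatting are illustrative stand-ins, not the actual SparkSQL internals:

```python
def table_info_with_samples(create_stmt, rows, sample_rows_in_table_info=3):
    """Append up to N sample rows to a table description (illustrative)."""
    sampled = rows[:sample_rows_in_table_info]
    sample_block = "\n".join("\t".join(str(v) for v in row) for row in sampled)
    return f"{create_stmt}\n/*\n{len(sampled)} rows from table:\n{sample_block}\n*/"

info = table_info_with_samples(
    "CREATE TABLE titanic (PassengerId INT, Name STRING)",
    [(1, "Braund"), (2, "Cumings"), (3, "Heikkinen"), (4, "Futrelle")],
)
print(info)
```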

get_usable_table_names() → Iterable[str][source]¶

Get the names of the available tables.

Return type

Iterable[str]

run(command: str, fetch: str = 'all') → str[source]¶

Execute a SQL command and return a string representing the results.

Parameters
  • command (str) –

  • fetch (str) –

Return type

str

run_no_throw(command: str, fetch: str = 'all') → str[source]¶

Execute a SQL command and return a string representing the results.

If the statement returns rows, a string of the results is returned. If the statement returns no rows, an empty string is returned.

If the statement throws an error, the error message is returned.

Parameters
  • command (str) –

  • fetch (str) –

Return type

str
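
The error-swallowing behavior described above can be sketched in plain Python. SparkSQLLike is an illustrative stand-in, not the actual implementation:

```python
class SparkSQLLike:
    """Illustrative stand-in for the run / run_no_throw pattern."""

    def run(self, command: str, fetch: str = "all") -> str:
        # Pretend to execute SQL; raise on an obviously bad statement.
        if command.strip().upper().startswith("SELECT"):
            return "[('1',)]"
        raise ValueError("PARSE_SYNTAX_ERROR")

    def run_no_throw(self, command: str, fetch: str = "all") -> str:
        # Same as run(), but return the error message instead of raising,
        # so a caller (e.g. an LLM agent) can observe the failure and retry.
        try:
            return self.run(command, fetch)
        except Exception as e:
            return f"Error: {e}"

db = SparkSQLLike()
ok = db.run_no_throw("SELECT 1")
err = db.run_no_throw("SELEKT 1")
```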
