2. Spark Connect

Spark Connect separates the client application from the Spark driver with a gRPC-based client-server protocol. The Python application creates a remote Spark session and sends unresolved logical plans, encoded as protocol buffers, to a Spark Connect server. The server owns plan resolution and execution.

This is useful when notebooks, applications, or services should talk to Spark without running inside the same JVM driver process. The official Spark Connect quick start launches a Connect server, then creates a Python session with SparkSession.builder.remote("sc://localhost:15002").
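
The remote address is a plain URI, so its pieces can be inspected with the standard library; the client also accepts extra parameters appended after the port (for example a token or TLS flag). A minimal sketch, assuming nothing beyond the standard library, with an illustrative endpoint:

```python
from urllib.parse import urlparse

# Illustrative Spark Connect endpoint; 15002 is the server's default port.
endpoint = "sc://localhost:15002"

parsed = urlparse(endpoint)
print(parsed.scheme, parsed.hostname, parsed.port)  # sc localhost 15002
```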

[1]:
import importlib.util

requirements = ["grpc", "google.protobuf", "pyspark.sql.connect.session"]
for module in requirements:
    print(module, "available:", importlib.util.find_spec(module) is not None)

grpc available: True
google.protobuf available: True
pyspark.sql.connect.session available: True

2.1. Server And Client Shape

A Spark Connect setup has two processes. The server is a Spark application with the Connect endpoint enabled. The client is the Python code that calls remote(...). The Docker notebook runner does not start a long-running Connect server, so this chapter validates the client dependencies and demonstrates the application code shape with a local session.

[2]:
server_command = '''
start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:<spark-version>
'''.strip()

client_code = '''
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
'''.strip()

print("Server command pattern:")
print(server_command)
print("\nClient session pattern:")
print(client_code)

Server command pattern:
start-connect-server.sh   --packages org.apache.spark:spark-connect_2.12:<spark-version>

Client session pattern:
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

2.2. Write Code Against A Session

The most portable habit is to write transformation functions that take a Spark session as a parameter. The same function can then receive a classic local Spark session in tests or a Spark Connect session in an application.

[3]:
from pathlib import Path
import shutil

from pyspark.sql import SparkSession, functions as F

DATA_DIR = Path.cwd()
OUTPUT_DIR = DATA_DIR / "_spark_output" / "spark-connect"

if OUTPUT_DIR.exists():
    shutil.rmtree(OUTPUT_DIR)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark-intro-spark-connect")
    .config("spark.driver.host", "127.0.0.1")
    .config("spark.driver.bindAddress", "127.0.0.1")
    .config("spark.sql.shuffle.partitions", "4")
    .config("spark.default.parallelism", "4")
    .config("spark.sql.adaptive.enabled", "false")
    .config("spark.ui.showConsoleProgress", "false")
    .getOrCreate()
)

sc = spark.sparkContext
sc.setLogLevel("ERROR")

print("Spark version:", spark.version)
print("Spark master:", sc.master)

Spark version: 3.5.8
Spark master: local[*]
[4]:
def revenue_by_region(spark_session):
    sales = spark_session.createDataFrame(
        [("north", "book", 10.0), ("north", "course", 20.0), ("south", "book", 7.5)],
        ["region", "category", "amount"],
    )
    return sales.groupBy("region").agg(F.round(F.sum("amount"), 2).alias("revenue"))

revenue_by_region(spark).orderBy("region").show()

+------+-------+
|region|revenue|
+------+-------+
| north|   30.0|
| south|    7.5|
+------+-------+

2.3. Practical Differences

Spark Connect supports the DataFrame and Spark SQL APIs, but it changes where code runs: client-side Python runs in the client process, while Spark execution happens on the server. Code that reaches into driver internals (for example spark.sparkContext or the RDD API), relies on direct JVM access, or assumes local files exist on the driver may need to change.

[5]:
spark.stop()

2.4. What To Remember

Spark Connect is an application architecture choice. It lets client code submit Spark plans to a remote server. Keep transformation code session-oriented, avoid driver internals, and test the same logic locally before pointing it at a Connect endpoint.