6. Local Runtime Notes

The examples in this book run on a local Spark master with local[*]. That means Spark uses the local machine's CPU cores instead of connecting to a cluster manager such as Spark standalone, YARN, or Kubernetes. The goal is to make the examples easy to run end-to-end while preserving Spark's programming model.
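
The master URL controls how many worker threads Spark starts: local[*] means one thread per logical core, local[N] means exactly N. The convention can be sketched with a small pure-Python helper (hypothetical; Spark parses this string internally):

```python
import os
import re

def local_thread_count(master: str) -> int:
    """Interpret a local[...] master URL the way Spark does.

    Hypothetical helper for illustration only; Spark performs this
    parsing itself when the session starts.
    """
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local master URL: {master}")
    spec = match.group(1)
    # local[*] asks for one worker thread per logical CPU core.
    return os.cpu_count() if spec == "*" else int(spec)

print(local_thread_count("local[4]"))  # 4 worker threads
```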

Local mode is not a toy interpreter. Spark still creates jobs, stages, tasks, partitions, SQL plans, and shuffle files. What local mode does not simulate is a real distributed operating environment with remote executors, worker failures, HDFS permissions, network shuffle costs, and cluster scheduling delays.

6.1. Docker Runner

From the repository root, the local verification path is:

docker build -t book-spark-intro-check:local docker/spark-intro-check
docker/spark-intro-check/execute-notebooks.sh

The runner copies the source notebooks into a temporary working directory inside the container, executes them with jupyter-nbconvert, and writes executed notebooks to sphinx/spark-intro/build/executed-notebooks. Source notebooks should only depend on local files from the book and on packages installed in the check image.

[1]:
from pathlib import Path
import os
import platform
import shutil
import sys

from pyspark.sql import SparkSession, functions as F

DATA_DIR = Path.cwd()
OUTPUT_DIR = DATA_DIR / "_spark_output" / "local-runtime"

if OUTPUT_DIR.exists():
    shutil.rmtree(OUTPUT_DIR)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark-intro-local-runtime")
    .config("spark.driver.host", "127.0.0.1")
    .config("spark.driver.bindAddress", "127.0.0.1")
    .config("spark.sql.shuffle.partitions", "4")
    .config("spark.default.parallelism", "4")
    .config("spark.sql.adaptive.enabled", "false")
    .config("spark.ui.showConsoleProgress", "false")
    .getOrCreate()
)

sc = spark.sparkContext
sc.setLogLevel("ERROR")

print("Spark version:", spark.version)
print("Spark master:", sc.master)

Spark version: 3.5.8
Spark master: local[*]

6.2. Session Settings

Most notebooks in this book use the same session pattern. The important pieces are local[*], a small shuffle partition count, localhost driver binding, and console progress disabled so notebook output stays readable.

[2]:
settings = {
    "spark.version": spark.version,
    "python.version": sys.version.split()[0],
    "platform": platform.platform(),
    "master": sc.master,
    "defaultParallelism": sc.defaultParallelism,
    "spark.driver.host": spark.conf.get("spark.driver.host"),
    "spark.driver.bindAddress": spark.conf.get("spark.driver.bindAddress"),
    "spark.sql.shuffle.partitions": spark.conf.get("spark.sql.shuffle.partitions"),
    "spark.sql.adaptive.enabled": spark.conf.get("spark.sql.adaptive.enabled"),
    "SPARK_LOCAL_IP": os.environ.get("SPARK_LOCAL_IP", ""),
    "SPARK_LOCAL_HOSTNAME": os.environ.get("SPARK_LOCAL_HOSTNAME", ""),
    "Spark UI": sc.uiWebUrl or "not available",
}

spark.createDataFrame(
    [(key, str(value)) for key, value in settings.items()],
    ["setting", "value"],
).show(truncate=False)

+----------------------------+---------------------------------------------+
|setting                     |value                                        |
+----------------------------+---------------------------------------------+
|spark.version               |3.5.8                                        |
|python.version              |3.11.15                                      |
|platform                    |Linux-6.8.0-106-generic-x86_64-with-glibc2.36|
|master                      |local[*]                                     |
|defaultParallelism          |4                                            |
|spark.driver.host           |127.0.0.1                                    |
|spark.driver.bindAddress    |127.0.0.1                                    |
|spark.sql.shuffle.partitions|4                                            |
|spark.sql.adaptive.enabled  |false                                        |
|SPARK_LOCAL_IP              |127.0.0.1                                    |
|SPARK_LOCAL_HOSTNAME        |localhost                                    |
|Spark UI                    |http://127.0.0.1:4040                        |
+----------------------------+---------------------------------------------+

6.3. Local Output Directories

Examples that write files should write beneath _spark_output. The directory is ignored by git and can be deleted at any time. Spark writes directories, not single files, because each partition can produce its own part file.

[3]:
example_path = OUTPUT_DIR / "range_by_bucket"

(
    spark.range(0, 8)
    .withColumn("bucket", (F.col("id") % 2).cast("int"))
    .write
    .mode("overwrite")
    .partitionBy("bucket")
    .parquet(str(example_path))
)

for path in sorted(example_path.glob("bucket=*")):
    print(path.relative_to(DATA_DIR))

spark.read.parquet(str(example_path)).groupBy("bucket").count().orderBy("bucket").show()

_spark_output/local-runtime/range_by_bucket/bucket=0
_spark_output/local-runtime/range_by_bucket/bucket=1
+------+-----+
|bucket|count|
+------+-----+
|     0|    4|
|     1|    4|
+------+-----+
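
The directory-per-partition layout above can be mocked without Spark. This sketch reproduces the shape of what Spark wrote (the part-file name and _SUCCESS marker mimic Spark's defaults; the exact file names Spark generates include run-specific identifiers):

```python
import tempfile
from pathlib import Path

# Mock of the layout Spark produced above: one directory per partition
# value, each holding part files, plus a _SUCCESS marker at the root.
root = Path(tempfile.mkdtemp()) / "range_by_bucket"
for bucket in (0, 1):
    part_dir = root / f"bucket={bucket}"
    part_dir.mkdir(parents=True)
    (part_dir / "part-00000-example.snappy.parquet").touch()
(root / "_SUCCESS").touch()

# Readers treat the whole directory as one dataset; the bucket value
# is recovered from the directory name, not stored in the part files.
buckets = sorted(p.name for p in root.glob("bucket=*"))
print(buckets)  # ['bucket=0', 'bucket=1']
```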

6.4. Practical Local Knobs

Use small defaults for teaching. A laptop does not need 200 shuffle partitions for tiny examples. The book generally uses these settings.

  • local[*] uses all available local cores.

  • spark.sql.shuffle.partitions = 4 keeps small group-by and join examples from creating many tiny tasks.

  • spark.default.parallelism = 4 keeps basic RDD examples predictable.

  • spark.driver.host and spark.driver.bindAddress are pinned to 127.0.0.1 so Docker and local hostname resolution do not surprise Spark.

  • spark.sql.adaptive.enabled = false keeps physical plans more stable for teaching.

For larger local experiments, raise the shuffle partition count and give Docker more memory. For production work, tune these settings against the real cluster and data shape.
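
One way to think about raising the shuffle partition count is to aim for a rough target of data per shuffle task. The helper and the 128 MB target below are an illustrative rule of thumb, not a Spark default; real tuning should measure task times and spill on the actual cluster:

```python
import math

def suggest_shuffle_partitions(input_mb: float, target_mb: float = 128.0,
                               floor: int = 4) -> int:
    """Hypothetical heuristic: size partitions toward target_mb of
    shuffle data each, never dropping below a small floor."""
    return max(floor, math.ceil(input_mb / target_mb))

# Tiny teaching data keeps the book's default of 4 partitions;
# a 10 GB local experiment would suggest 80.
print(suggest_shuffle_partitions(1))       # 4
print(suggest_shuffle_partitions(10_240))  # 80
```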

6.5. Common Local Problems

Java mismatch. PySpark needs a compatible JVM. The Docker check image installs OpenJDK 17 so the examples do not depend on the host's Java installation.

Hostname binding errors. Spark drivers and executors need to talk to each other, even in local mode. Binding the driver to 127.0.0.1 avoids many container hostname issues.
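
The same pinning can also be done through environment variables before Spark starts; these are the SPARK_LOCAL_IP and SPARK_LOCAL_HOSTNAME values visible in the settings table above:

```shell
# Pin Spark's network identity before the JVM starts, so hostname
# resolution inside a container cannot pick a non-routable address.
export SPARK_LOCAL_IP=127.0.0.1
export SPARK_LOCAL_HOSTNAME=localhost
```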

Leftover output directories. Spark refuses to write to an existing output directory unless .mode("overwrite") is set or the directory is removed first. The notebooks usually remove their own _spark_output/... directory at the start.
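
The remove-then-recreate pattern from the first cell can be factored into a small helper (a sketch; the notebooks inline this logic rather than sharing a function):

```python
import shutil
import tempfile
from pathlib import Path

def reset_output_dir(path: Path) -> Path:
    """Delete and recreate an output directory so a run never depends
    on what a previous run left behind."""
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True, exist_ok=True)
    return path

out = reset_output_dir(Path(tempfile.mkdtemp()) / "_spark_output" / "demo")
print(out.exists())  # True
```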

Too many tiny tasks. Spark defaults are designed for larger jobs. For local teaching examples, lower partition counts keep execution and notebook output manageable.

Missing GraphFrames jar. The graph chapter needs both the Python graphframes package and the matching Spark jar. The Docker check image installs both.

6.6. What To Remember

The local runtime is a teaching and verification environment. It keeps infrastructure out of the way so readers can focus on Spark APIs. When moving from this book to a real cluster, keep the code structure but revisit storage paths, checkpoints, memory, shuffle partitions, broadcast thresholds, and operational monitoring.

[4]:
spark.stop()