Spark, No Tears
Preface
This book teaches students how to program with Apache Spark. The notebooks run against a local Spark master, local[*]; no Hadoop installation, HDFS, or standalone Spark cluster is required for the examples.
To execute the notebooks end-to-end, build the local Docker check image and run the execution script from the repository root.
docker build -t book-spark-intro-check:local docker/spark-intro-check
docker/spark-intro-check/execute-notebooks.sh
The runner writes executed notebooks to sphinx/spark-intro/build/executed-notebooks.
The diagram below shows the major groups in the book. The first chapters build the API foundation. The middle chapters focus on data movement, data quality, and specialized workloads. The final chapters cover applied projects, local testing, packaging, table formats, and runtime choices.
That left-to-right path matches the mental model most readers need: first understand distributed execution, then learn where Spark moves data, then use higher-level workloads, and finally shape the work into local tests, submitted jobs, and table-format-aware applications.
[Diagram: chapter groups, left to right — Data Movement and Quality; Specialized Workloads; Projects and Production Shape; Runtime and Ecosystem]
About
One-Off Coder is an education, services, and products company. Please visit us online to discover how we can help you achieve lifelong success in your personal coding career or with your company's business goals.
Copyright
Cite this book as follows:
@misc{oneoffcoder_spark_intro_2019,
  title  = {Spark, No Tears},
  url    = {https://learn-spark.oneoffcoder.com},
  author = {One-Off Coder},
  year   = {2019},
  month  = {Oct}
}