9. Date and Time

Working with date or datetime types can be tricky. Spark has two data types that can represent dates and times (a small sketch follows the list below).

  • TimestampType represents values with a date part (year, month, day) and a time part (hour, minute, second, fraction of a second); values are interpreted with respect to the session time zone.

  • DateType represents values with date information only.
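
Both types live in pyspark.sql.types. As a quick orientation (a minimal sketch, not one of the original notebook cells), you can inspect the Spark SQL names they map to.

from pyspark.sql.types import DateType, TimestampType

# Spark SQL names of the two types: 'timestamp' and 'date'
print(TimestampType().simpleString())
print(DateType().simpleString())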

9.1. Pandas datetime

Date and datetime information can be represented in Pandas dataframes in a variety of ways. Typically, one would use functions available from Pandas (e.g. to_datetime()) to convert values to datetime values. Be careful, however, as the reported type depends on how you inspect the data.

[1]:
import pandas as pd
import numpy as np

pdf = pd.DataFrame({
    'x': [1, 2, 3],
    'y': pd.to_datetime(['2021-12-01', '2021-12-02', '2021-12-03'])
})

pdf
[1]:
x y
0 1 2021-12-01
1 2 2021-12-02
2 3 2021-12-03

You can see that .dtypes says that the type is datetime64[ns].

[2]:
pdf.dtypes
[2]:
x             int64
y    datetime64[ns]
dtype: object

If you navigate to a specific value, however, the reported type changes!

[3]:
type(pdf.iloc[0].y)
[3]:
pandas._libs.tslibs.timestamps.Timestamp
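
A pandas Timestamp behaves like a Python datetime and can be converted to one explicitly. The snippet below is a small sketch assuming the pdf dataframe above.

ts = pdf.iloc[0].y

# a pandas Timestamp exposes datetime-like fields ...
print(ts.year, ts.month, ts.day)

# ... and converts to a plain Python datetime
print(type(ts.to_pydatetime()))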

Still, through the awe and wonder of Spark and Pandas interoperability, we can successfully create a Spark dataframe from a Pandas one. Take note that the resulting Spark data type is timestamp.

[4]:
sdf = sqlContext.createDataFrame(pdf)
sdf.show()
+---+-------------------+
|  x|                  y|
+---+-------------------+
|  1|2021-12-01 00:00:00|
|  2|2021-12-02 00:00:00|
|  3|2021-12-03 00:00:00|
+---+-------------------+

[5]:
sdf.printSchema()
root
 |-- x: long (nullable = true)
 |-- y: timestamp (nullable = true)
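
If you only want the date portion, you can convert the timestamp column after the fact. The sketch below assumes the sdf dataframe above and uses functions.to_date.

from pyspark.sql import functions as F

# drop the time-of-day component; y becomes a date column
sdf.withColumn('y', F.to_date('y')).printSchema()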

9.2. Python datetime

If you are not converting data from Pandas, then just use Python’s datetime module. Whatever datetime.strptime() returns will most likely be convertible to Spark’s datetime types. Notice that, by default, the inferred type is again timestamp.

[6]:
from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import *

rdd = sc.parallelize([(a, b) for a, b in zip([1, 2, 3], ['2021-12-01', '2021-12-02', '2021-12-03'])])\
    .map(lambda tup: Row(n=tup[0], d=datetime.strptime(tup[1], '%Y-%m-%d')))

sdf = sqlContext.createDataFrame(rdd)
[7]:
sdf.printSchema()
root
 |-- d: timestamp (nullable = true)
 |-- n: long (nullable = true)

[8]:
sdf.show()
+-------------------+---+
|                  d|  n|
+-------------------+---+
|2021-12-01 00:00:00|  1|
|2021-12-02 00:00:00|  2|
|2021-12-03 00:00:00|  3|
+-------------------+---+
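
If you want a date column from Row objects instead, one option (a sketch, not part of the original notebook) is to call .date() on the parsed datetime so each row carries a datetime.date, which Spark infers as DateType.

rdd = sc.parallelize([(a, b) for a, b in zip([1, 2, 3], ['2021-12-01', '2021-12-02', '2021-12-03'])])\
    .map(lambda tup: Row(n=tup[0], d=datetime.strptime(tup[1], '%Y-%m-%d').date()))

# the d column is now inferred as date rather than timestamp
sqlContext.createDataFrame(rdd).printSchema()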

If we use plain tuples instead of Row, then we have to specify a schema explicitly. Here we set the datetime field to TimestampType.

[14]:
rdd = sc.parallelize([(a, b) for a, b in zip([1, 2, 3], ['2021-12-01', '2021-12-02', '2021-12-03'])])\
    .map(lambda tup: (tup[0], datetime.strptime(tup[1], '%Y-%m-%d')))

schema = StructType([
    StructField('n', IntegerType(), True),
    StructField('d', TimestampType(), True)
])

sdf = spark.createDataFrame(rdd, schema)
[15]:
sdf.printSchema()
root
 |-- n: integer (nullable = true)
 |-- d: timestamp (nullable = true)

[16]:
sdf.show()
+---+-------------------+
|  n|                  d|
+---+-------------------+
|  1|2021-12-01 00:00:00|
|  2|2021-12-02 00:00:00|
|  3|2021-12-03 00:00:00|
+---+-------------------+

But we can also specify DateType to keep only the date portion.

[17]:
rdd = sc.parallelize([(a, b) for a, b in zip([1, 2, 3], ['2021-12-01', '2021-12-02', '2021-12-03'])])\
    .map(lambda tup: (tup[0], datetime.strptime(tup[1], '%Y-%m-%d')))

schema = StructType([
    StructField('n', IntegerType(), True),
    StructField('d', DateType(), True)
])

sdf = spark.createDataFrame(rdd, schema)
[18]:
sdf.printSchema()
root
 |-- n: integer (nullable = true)
 |-- d: date (nullable = true)

[19]:
sdf.show()
+---+----------+
|  n|         d|
+---+----------+
|  1|2021-12-01|
|  2|2021-12-02|
|  3|2021-12-03|
+---+----------+
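
Finally, the parsing itself can be pushed into Spark. Starting from plain string values, functions.to_timestamp and functions.to_date produce timestamp and date columns directly; the sketch below is an illustration under those assumptions rather than part of the original notebook.

from pyspark.sql import functions as F

sdf = spark.createDataFrame(
    [(1, '2021-12-01'), (2, '2021-12-02'), (3, '2021-12-03')],
    ['n', 's'])

# parse the string column into timestamp and date columns
sdf = sdf\
    .withColumn('ts', F.to_timestamp('s', 'yyyy-MM-dd'))\
    .withColumn('d', F.to_date('s', 'yyyy-MM-dd'))

sdf.printSchema()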