9. Date and Time
Working with date or datetime types can be tricky. Spark has two data types that can represent dates. TimestampType represents values with a date (year, month, day), a time (hour, minute, second) and a timezone offset. DateType represents values with date information only.
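As a quick illustration (a minimal sketch, assuming an existing SparkSession named spark), Spark infers TimestampType from Python datetime.datetime values and DateType from datetime.date values when no explicit schema is given.

from datetime import date, datetime

# sketch: schema inference maps datetime.datetime -> timestamp, datetime.date -> date
sdf = spark.createDataFrame(
    [(datetime(2021, 12, 1, 8, 30, 0), date(2021, 12, 1))],
    ['ts', 'd'])
sdf.printSchema()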
9.1. Pandas datetime
Date and datetime information can be represented in Pandas dataframes in a variety of ways. Typically, one would use functions available from Pandas (e.g. to_datetime()) to convert values to datetimes. Be careful, however, as the reported type depends on how you inspect the data.
[1]:
import pandas as pd
import numpy as np
pdf = pd.DataFrame({
    'x': [1, 2, 3],
    'y': pd.to_datetime(['2021-12-01', '2021-12-02', '2021-12-03'])
})
pdf
[1]:
| | x | y |
|---|---|---|
| 0 | 1 | 2021-12-01 |
| 1 | 2 | 2021-12-02 |
| 2 | 3 | 2021-12-03 |
You can see that .dtypes says that the type is datetime64[ns].
[2]:
pdf.dtypes
[2]:
x int64
y datetime64[ns]
dtype: object
If you access a specific value, however, the type information changes!
[3]:
type(pdf.iloc[0].y)
[3]:
pandas._libs.tslibs.timestamps.Timestamp
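This is less surprising than it looks: a Pandas Timestamp is a subclass of Python's datetime.datetime, which is one reason Spark can convert these values. A quick check (a sketch, not part of the original output):

from datetime import datetime

# pd.Timestamp subclasses datetime.datetime, so this prints True
print(isinstance(pdf.iloc[0].y, datetime))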
Still, wonderfully, Spark can successfully create a Spark dataframe from a Pandas one. Take note that the resulting Spark data type is timestamp.
[4]:
sdf = sqlContext.createDataFrame(pdf)
sdf.show()
+---+-------------------+
| x| y|
+---+-------------------+
| 1|2021-12-01 00:00:00|
| 2|2021-12-02 00:00:00|
| 3|2021-12-03 00:00:00|
+---+-------------------+
[5]:
sdf.printSchema()
root
|-- x: long (nullable = true)
|-- y: timestamp (nullable = true)
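If you would rather end up with DateType than TimestampType after converting from Pandas, one option (a minimal sketch, not shown in the original) is to cast the timestamp column to a date afterwards.

from pyspark.sql import functions as F

# cast the timestamp column to date; the schema will then report `date`
sdf_date = sdf.withColumn('y', F.col('y').cast('date'))
sdf_date.printSchema()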
9.2. Python datetime
If you are not converting data from Pandas, then you should just use Python's datetime module. Whatever datetime.strptime() returns will most likely be convertible to Spark's datetime types. Again, notice that the inferred type is timestamp by default.
[6]:
from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import *
rdd = sc.parallelize([(a, b) for a, b in zip([1, 2, 3], ['2021-12-01', '2021-12-02', '2021-12-03'])])\
    .map(lambda tup: Row(n=tup[0], d=datetime.strptime(tup[1], '%Y-%m-%d')))
sdf = sqlContext.createDataFrame(rdd)
[7]:
sdf.printSchema()
root
|-- d: timestamp (nullable = true)
|-- n: long (nullable = true)
[8]:
sdf.show()
+-------------------+---+
| d| n|
+-------------------+---+
|2021-12-01 00:00:00| 1|
|2021-12-02 00:00:00| 2|
|2021-12-03 00:00:00| 3|
+-------------------+---+
If we use plain tuples instead of Row, then we have to specify a schema. Here, we set the datetime field to TimestampType.
[14]:
rdd = sc.parallelize([(a, b) for a, b in zip([1, 2, 3], ['2021-12-01', '2021-12-02', '2021-12-03'])])\
    .map(lambda tup: (tup[0], datetime.strptime(tup[1], '%Y-%m-%d')))
schema = StructType([
    StructField('n', IntegerType(), True),
    StructField('d', TimestampType(), True)
])
sdf = spark.createDataFrame(rdd, schema)
[15]:
sdf.printSchema()
root
|-- n: integer (nullable = true)
|-- d: timestamp (nullable = true)
[16]:
sdf.show()
+---+-------------------+
| n| d|
+---+-------------------+
| 1|2021-12-01 00:00:00|
| 2|2021-12-02 00:00:00|
| 3|2021-12-03 00:00:00|
+---+-------------------+
But we can also specify DateType.
[17]:
rdd = sc.parallelize([(a, b) for a, b in zip([1, 2, 3], ['2021-12-01', '2021-12-02', '2021-12-03'])])\
    .map(lambda tup: (tup[0], datetime.strptime(tup[1], '%Y-%m-%d')))
schema = StructType([
    StructField('n', IntegerType(), True),
    StructField('d', DateType(), True)
])
sdf = spark.createDataFrame(rdd, schema)
[18]:
sdf.printSchema()
root
|-- n: integer (nullable = true)
|-- d: date (nullable = true)
[19]:
sdf.show()
+---+----------+
| n| d|
+---+----------+
| 1|2021-12-01|
| 2|2021-12-02|
| 3|2021-12-03|
+---+----------+
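Instead of handing Spark full datetime objects and letting the DateType schema truncate them, you could also parse straight to date objects with .date(). This is a sketch under the same setup (sc, spark, and the DateType schema from the previous cell are assumed to exist).

from datetime import datetime

# parse directly to datetime.date objects instead of datetime.datetime
rdd = sc.parallelize(list(zip([1, 2, 3], ['2021-12-01', '2021-12-02', '2021-12-03'])))\
    .map(lambda tup: (tup[0], datetime.strptime(tup[1], '%Y-%m-%d').date()))

# reusing the DateType schema defined above
sdf = spark.createDataFrame(rdd, schema)
sdf.printSchema()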