Integrating PySpark with SQL Server using JDBC
First of all, I need the Microsoft JDBC driver so that Spark can connect to a SQL Server database.
$ wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre8/mssql-jdbc-6.4.0.jre8.jar -P /opt/notebooks/
The configuration is saved in config.ini with the following fields:
[mydb]
database = mydb
host = mydb-server.database.windows.net
username = readonly
password = mypassword
port = 1433
Loading the configuration is simple with configparser:
from configparser import ConfigParser
config = ConfigParser()
config.read('config.ini')
Set the values for the connection:
jdbc_url = f"jdbc:sqlserver://{config.get('mydb', 'host')}:{config.get('mydb', 'port')};database={config.get('mydb', 'database')}"
connection_properties = {
    "user": config.get('mydb', 'username'),
    "password": config.get('mydb', 'password')
}
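With the example values from config.ini above, the assembled URL should look like this:
print(jdbc_url)
# jdbc:sqlserver://mydb-server.database.windows.net:1433;database=mydb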
When creating the SparkSession, make sure the path to the JAR is set correctly:
from pyspark.sql import SparkSession
jars = [
    # full path to the driver downloaded to /opt/notebooks/ above
    "/opt/notebooks/mssql-jdbc-6.4.0.jre8.jar",
]
spark = (SparkSession
         .builder
         .appName("PySpark with SQL server")
         .config("spark.driver.extraClassPath", ":".join(jars))
         .getOrCreate())
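If the driver is not picked up this way, an alternative (a sketch, not part of the original setup) is to let Spark distribute the JAR itself via the spark.jars option instead of extraClassPath:
spark = (SparkSession
         .builder
         .appName("PySpark with SQL server")
         # spark.jars takes a comma-separated list and ships the JARs to the driver and executors
         .config("spark.jars", ",".join(jars))
         .getOrCreate())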
With the session created, we can query the database:
schema = 'dbo'
table = 'users'
The read is done with the jdbc method of the DataFrameReader, passing the connection details:
df = spark \
    .read \
    .jdbc(jdbc_url,
          f"{schema}.{table}",
          properties=connection_properties)
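To confirm that the connection works (a quick check, not part of the original snippet), inspect the schema and fetch a few rows:
df.printSchema()  # column names and types read from the SQL Server table metadata
df.show(5)        # triggers an actual JDBC query for the first rows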
An alternative approach is to use the same syntax as in the Redshift article, omitting the connection_properties and using a more explicit notation:
df = spark \
    .read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", f"{schema}.{table}") \
    .option("user", config.get('mydb', 'username')) \
    .option("password", config.get('mydb', 'password')) \
    .load()
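For larger tables, the JDBC source can also split the read into parallel queries. A minimal sketch, assuming a hypothetical numeric id column in dbo.users and made-up bounds:
df_parallel = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", f"{schema}.{table}")
    .option("user", config.get('mydb', 'username'))
    .option("password", config.get('mydb', 'password'))
    # the four options below partition the read into parallel JDBC queries;
    # "id" and the bounds are hypothetical and must match your table
    .option("partitionColumn", "id")
    .option("lowerBound", 1)
    .option("upperBound", 1000000)
    .option("numPartitions", 8)
    .load())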