Integrating PySpark with SQL Server using JDBC

First of all, I need the Microsoft SQL Server JDBC driver so that Spark can make the connection to the database.

$ wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre8/mssql-jdbc-6.4.0.jre8.jar -P /opt/notebooks/

The configuration is saved in config.ini with the following fields:

[mydb]
database = mydb
host = mydb-server.database.windows.net
username = readonly
password = mypassword
port = 1433

Loading the configuration is simple with configparser:

from configparser import ConfigParser

config = ConfigParser()
config.read('config.ini')

Set the values for the connection:

jdbc_url = f"jdbc:sqlserver://{config.get('mydb', 'host')}:{config.get('mydb', 'port')};database={config.get('mydb', 'database')}"
connection_properties = {
    "user": config.get('mydb', 'username'),
    "password": config.get('mydb', 'password')
}
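
Optionally, the JDBC driver class can be pinned explicitly in the connection properties; this is not strictly required, but it can help if Spark does not infer the driver from the URL. The class name below is the standard one shipped in the mssql-jdbc jar.

# Optionally pin the JDBC driver class shipped in the mssql-jdbc jar.
connection_properties["driver"] = "com.microsoft.sqlserver.jdbc.SQLServerDriver"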

When creating the SparkSession, make sure the path to the JAR is set correctly:

from pyspark.sql import SparkSession

jars = [
    "mssql-jdbc-6.4.0.jre8.jar",
]
spark = (SparkSession
  .builder
  .appName("PySpark with SQL server")
  .config("spark.driver.extraClassPath", ":".join(jars))
  .getOrCreate())  
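
Note that spark.driver.extraClassPath only puts the jar on the driver's classpath. When the job runs on a cluster the executors need it too; a minimal alternative sketch, assuming the jar sits at /opt/notebooks/ as downloaded above, is to pass it through spark.jars, which distributes it to the driver and the executors alike.

# Alternative sketch: spark.jars ships the JDBC driver to both the driver
# and the executors (assumes the jar path from the wget step above).
spark = (SparkSession
  .builder
  .appName("PySpark with SQL server")
  .config("spark.jars", "/opt/notebooks/mssql-jdbc-6.4.0.jre8.jar")
  .getOrCreate())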

The session is created and we can now query the database. Set the schema and table to read from:

schema = 'dbo'
table = 'users'

The read is done with the jdbc method, passing in the URL, the table, and the connection properties:

df = spark \
    .read \
    .jdbc(jdbc_url,
          f"{schema}.{table}",
          properties=connection_properties)
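
A quick sanity check on the resulting DataFrame confirms that the connection works; the columns and rows shown are of course whatever the table actually holds.

# Inspect the inferred schema and preview a few rows.
df.printSchema()
df.show(5)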

An alternative approach is to use the same syntax as in the Redshift article: omit connection_properties and use the more explicit option notation.

df = spark \
    .read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", f"{schema}.{table}") \
    .option("user", config.get('mydb', 'username')) \
    .option("password", config.get('mydb', 'password')) \
    .load()
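
With the explicit notation it is also easy to push a query down to SQL Server instead of pulling the whole table. A minimal sketch follows; the active column is only an illustration and would have to match the real table definition.

# Sketch: push the filter down to SQL Server by passing a subquery with an
# alias as dbtable (the "active" column is a placeholder for a real column).
query = f"(SELECT * FROM {schema}.{table} WHERE active = 1) AS src"

df_active = spark \
    .read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", query) \
    .option("user", config.get('mydb', 'username')) \
    .option("password", config.get('mydb', 'password')) \
    .load()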