Spark Schema - Explained with Examples - Spark By {Examples} (2023)

A Spark schema defines the structure of a DataFrame, which you can inspect by calling the printSchema() method on the DataFrame object. Spark SQL also lets you specify the schema programmatically.

By default, Spark infers the schema from the data. However, sometimes we need to define our own schema (column names and data types), especially when working with unstructured and semi-structured data. This article explains how to define simple, nested, and complex schemas with examples.

1. Schema – Defines the Structure of the DataFrame

What is Spark Schema

A Spark schema is the structure of a DataFrame or Dataset. We can define it using the StructType class, which is a collection of StructField objects that define the column name (String), column type (DataType), nullable flag (Boolean), and metadata (Metadata).

The rest of this article uses Scala examples; a similar approach works with PySpark, and if time permits I will cover it in the future. If you are looking for PySpark, I would still recommend reading through this article, as it will give you an idea of the usage.

2. Create Schema using StructType & StructField

While creating a Spark DataFrame, we can specify the schema using the StructType and StructField classes. We can also nest a StructType, use ArrayType for arrays, and use MapType for key-value pairs, which we discuss in detail in later sections.

Spark defines the StructType and StructField case classes as follows.

case class StructType(fields: Array[StructField])

case class StructField(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    metadata: Metadata = Metadata.empty)

The example below demonstrates a simple use of StructType and StructField on a DataFrame, with sample data to support it.

import org.apache.spark.sql.types.{IntegerType, StringType, StructType, StructField}
import org.apache.spark.sql.{Row, SparkSession}

val simpleData = Seq(
  Row("James", "", "Smith", "36636", "M", 3000),
  Row("Michael", "Rose", "", "40288", "M", 4000),
  Row("Robert", "", "Williams", "42114", "M", 4000),
  Row("Maria", "Anne", "Jones", "39192", "F", 4000),
  Row("Jen", "Mary", "Brown", "", "F", -1)
)

val simpleSchema = StructType(Array(
  StructField("firstname", StringType, true),
  StructField("middlename", StringType, true),
  StructField("lastname", StringType, true),
  StructField("id", StringType, true),
  StructField("gender", StringType, true),
  StructField("salary", IntegerType, true)
))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(simpleData), simpleSchema)

3. Spark DataFrame printSchema()

To get the schema of the Spark DataFrame, use printSchema() on the DataFrame object.

df.printSchema()
df.show()

In the above example, printSchema() prints the schema to the console (stdout) and show() displays the contents of the Spark DataFrame.

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|    James|          |   Smith|36636|     M|  3000|
|  Michael|      Rose|        |40288|     M|  4000|
|   Robert|          |Williams|42114|     M|  4000|
|    Maria|      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+

4. Create Nested struct Schema

While working with Spark DataFrames, we often need to work with nested struct columns. In the example below I use a different approach: instantiating StructType and calling its add method (instead of constructing StructField directly) to add column names and data types.

val structureData = Seq(
  Row(Row("James", "", "Smith"), "36636", "M", 3100),
  Row(Row("Michael", "Rose", ""), "40288", "M", 4300),
  Row(Row("Robert", "", "Williams"), "42114", "M", 1400),
  Row(Row("Maria", "Anne", "Jones"), "39192", "F", 5500),
  Row(Row("Jen", "Mary", "Brown"), "", "F", -1)
)

val structureSchema = new StructType()
  .add("name", new StructType()
    .add("firstname", StringType)
    .add("middlename", StringType)
    .add("lastname", StringType))
  .add("id", StringType)
  .add("gender", StringType)
  .add("salary", IntegerType)

val df2 = spark.createDataFrame(
  spark.sparkContext.parallelize(structureData), structureSchema)
df2.printSchema()

This prints the schema and DataFrame below. Note that printSchema() displays struct for nested schema fields.

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+--------------------+-----+------+------+
|                name|   id|gender|salary|
+--------------------+-----+------+------+
|    [James, , Smith]|36636|     M|  3100|
|   [Michael, Rose, ]|40288|     M|  4300|
| [Robert, , Willi...|42114|     M|  1400|
| [Maria, Anne, Jo...|39192|     F|  5500|
|  [Jen, Mary, Brown]|     |     F|    -1|
+--------------------+-----+------+------+

5. Loading SQL Schema from JSON

If you have many fields and the structure of the DataFrame changes now and then, it's good practice to load the schema from a JSON file. Note that the JSON definition uses a different layout; you can generate it from an existing schema by calling schema.prettyJson.

{
  "type" : "struct",
  "fields" : [ {
    "name" : "name",
    "type" : {
      "type" : "struct",
      "fields" : [ {
        "name" : "firstname",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "middlename",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "lastname",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      } ]
    },
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "dob",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "gender",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "salary",
    "type" : "integer",
    "nullable" : true,
    "metadata" : { }
  } ]
}
import scala.io.Source
import org.apache.spark.sql.types.{DataType, StructType}

val url = ClassLoader.getSystemResource("schema.json")
val schemaSource = Source.fromFile(url.getFile).getLines.mkString
val schemaFromJson = DataType.fromJson(schemaSource).asInstanceOf[StructType]
val df3 = spark.createDataFrame(
  spark.sparkContext.parallelize(structureData), schemaFromJson)
df3.printSchema()

This prints the same output as the previous section. You can also keep the name, type, and nullable flag for each column in a comma-separated file and use those values to build a StructType programmatically; I will leave this for you to explore.
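As a sketch of that idea, assuming each line of the file holds `name,type,nullable` (both this layout and the schemaFromCsv helper are hypothetical; extend the type mapping to suit your data):

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: build a StructType from "name,type,nullable" lines.
// The supported type names are an assumption; add more cases as needed.
def schemaFromCsv(lines: Seq[String]): StructType = {
  val fields = lines.map { line =>
    val Array(name, typeName, nullable) = line.split(",").map(_.trim)
    val dataType = typeName.toLowerCase match {
      case "string"          => StringType
      case "int" | "integer" => IntegerType
      case "double"          => DoubleType
      case other => throw new IllegalArgumentException(s"Unsupported type: $other")
    }
    StructField(name, dataType, nullable.toBoolean)
  }
  StructType(fields.toArray)
}

val csvSchema = schemaFromCsv(Seq(
  "firstname,string,true",
  "id,string,true",
  "salary,integer,true"
))
csvSchema.printTreeString()
```

In practice you would read the lines with scala.io.Source, just as in the JSON example above.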

6. Using Arrays & Map Columns

Spark SQL also supports ArrayType and MapType to define schemas with array and map collections, respectively. In the example below, column "hobbies" is defined as ArrayType(StringType) and "properties" as MapType(StringType, StringType), meaning both key and value are String.

import org.apache.spark.sql.types.{ArrayType, MapType}

val arrayStructureData = Seq(
  Row(Row("James", "", "Smith"), List("Cricket", "Movies"), Map("hair" -> "black", "eye" -> "brown")),
  Row(Row("Michael", "Rose", ""), List("Tennis"), Map("hair" -> "brown", "eye" -> "black")),
  Row(Row("Robert", "", "Williams"), List("Cooking", "Football"), Map("hair" -> "red", "eye" -> "gray")),
  Row(Row("Maria", "Anne", "Jones"), null, Map("hair" -> "blond", "eye" -> "red")),
  Row(Row("Jen", "Mary", "Brown"), List("Blogging"), Map("white" -> "black", "eye" -> "black"))
)

val arrayStructureSchema = new StructType()
  .add("name", new StructType()
    .add("firstname", StringType)
    .add("middlename", StringType)
    .add("lastname", StringType))
  .add("hobbies", ArrayType(StringType))
  .add("properties", MapType(StringType, StringType))

val df5 = spark.createDataFrame(
  spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
df5.printSchema()

This outputs the schema and DataFrame data below. Note that field hobbies is an array type and properties is a map type.

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- hobbies: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+--------------------+-------------------+------------------------------+
|name                |hobbies            |properties                    |
+--------------------+-------------------+------------------------------+
|[James, , Smith]    |[Cricket, Movies]  |[hair -> black, eye -> brown] |
|[Michael, Rose, ]   |[Tennis]           |[hair -> brown, eye -> black] |
|[Robert, , Williams]|[Cooking, Football]|[hair -> red, eye -> gray]    |
|[Maria, Anne, Jones]|null               |[hair -> blond, eye -> red]   |
|[Jen, Mary, Brown]  |[Blogging]         |[white -> black, eye -> black]|
+--------------------+-------------------+------------------------------+

7. Convert Scala Case Class to Spark Schema

Spark SQL also provides Encoders to convert a case class to a struct schema. If you are using an older version of Spark, you can instead derive the schema from the case class via ScalaReflection. Both approaches are shown here.

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

case class Name(first: String, last: String, middle: String)
case class Employee(fullName: Name, age: Integer, gender: String)

val scalaSchema = ScalaReflection.schemaFor[Employee].dataType.asInstanceOf[StructType]
val encoderSchema = Encoders.product[Employee].schema
encoderSchema.printTreeString()

StructType's printTreeString() outputs the schema below, similar to printSchema() on a DataFrame.

root
 |-- fullName: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- last: string (nullable = true)
 |    |-- middle: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)

8. Creating schema from DDL String

Like loading a struct from a JSON string, we can also create one from a DDL string using StructType.fromDDL(); conversely, you can generate DDL from a schema using toDDL(). Calling printTreeString() on the struct object prints the schema, similar to what printSchema() returns.

val ddlSchemaStr = "`fullName` STRUCT<`first`: STRING, `last`: STRING, `middle`: STRING>,`age` INT,`gender` STRING"
val ddlSchema = StructType.fromDDL(ddlSchemaStr)
ddlSchema.printTreeString()
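To illustrate the reverse direction mentioned above, toDDL() serializes a schema back to a DDL string, which is handy for persisting schemas as plain text. A small round-trip sketch (the field names here are illustrative):

```scala
import org.apache.spark.sql.types.StructType

val ddl = "`age` INT,`gender` STRING"
val schema = StructType.fromDDL(ddl)

// Serialize the parsed schema back to a DDL string
val regenerated = schema.toDDL
println(regenerated)

// Parsing the regenerated DDL should yield an equal schema
assert(StructType.fromDDL(regenerated) == schema)
```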

9. Checking if a Field Exists in a Schema

We often need to check whether a column is present in a DataFrame schema; this is easy to do with functions on StructType and StructField.

println(df.schema.fieldNames.contains("firstname"))
println(df.schema.contains(StructField("firstname", StringType, true)))

This example returns true for both checks. For the second one, if you pass IntegerType instead of StringType, it returns false, because the data type of the firstname column is String and contains() checks every property of the field. Similarly, you can also check whether two schemas are equal, and more.
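The schema-equality check mentioned above can be done directly: StructType overrides equals, so == compares schemas field by field. A minimal sketch:

```scala
import org.apache.spark.sql.types._

val schemaA = new StructType().add("firstname", StringType).add("salary", IntegerType)
val schemaB = new StructType().add("firstname", StringType).add("salary", IntegerType)
val schemaC = new StructType().add("firstname", StringType).add("salary", StringType)

// Field-by-field comparison: names, data types, and nullability must all match
println(schemaA == schemaB) // true
println(schemaA == schemaC) // false: the salary types differ
```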

The complete example explained here is available at the GitHub project.


In this article, you learned about Spark SQL schemas: how to create them programmatically using StructType and StructField, convert a case class to a schema, use ArrayType and MapType, and display the DataFrame schema using printSchema() and printTreeString().

Related Articles

  • Spark printSchema() Example
  • Spark Merge Two DataFrames with Different Columns or Schema
  • Spark read JSON with or without schema
  • Spark Convert case class to Schema
  • Spark SQL Explained with Examples
  • Spark spark.table() vs
  • Broadcast Join in Spark

Article information

Author: Kerri Lueilwitz

Last Updated: 03/28/2023
