pyspark.sql.DataFrame

class pyspark.sql.DataFrame(jdf, sql_ctx)

A distributed collection of data grouped into named columns.

A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession:

people = spark.read.parquet("...")

Once created, it can be manipulated using the domain-specific-language (DSL) functions defined in DataFrame and Column.

To select a column from the DataFrame, use attribute access (indexing by column name, as in people["age"], also works):

ageCol = people.age

A more concrete example:

# Create DataFrames using SparkSession
people = spark.read.parquet("...")
department = spark.read.parquet("...")

people.filter(people.age > 30).join(department, people.deptId == department.id) \
  .groupBy(department.name, "gender").agg({"salary": "avg", "age": "max"})

New in version 1.3.

__init__(jdf, sql_ctx)

Initialize self. See help(type(self)) for accurate signature.

Methods

Attributes