pyspark.sql.SparkSession.createDataFrame
SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)[source]

Creates a DataFrame from an RDD, a list or a pandas.DataFrame.

When schema is a list of column names, the type of each column will be inferred from data.

When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of either Row, namedtuple, or dict.

When schema is pyspark.sql.types.DataType or a datatype string, it must match the real data, or an exception will be thrown at runtime. If the given schema is not pyspark.sql.types.StructType, it will be wrapped into a pyspark.sql.types.StructType as its only field, and the field name will be “value”. Each record will also be wrapped into a tuple, which can be converted to a row later.

If schema inference is needed, samplingRatio is used to determine the ratio of rows used for schema inference. The first row will be used if samplingRatio is None.
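For example, schema inference with a sampling ratio might look like the following; this is a minimal sketch that assumes the same spark session and sc SparkContext used in the examples below:

>>> from pyspark.sql import Row
>>> rows = sc.parallelize([Row(x=i, y=str(i)) for i in range(100)])
>>> spark.createDataFrame(rows, samplingRatio=0.5).dtypes
[('x', 'bigint'), ('y', 'string')]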
- Parameters
data – an RDD of any kind of SQL data representation (e.g. row, tuple, int, boolean, etc.), list, or pandas.DataFrame.

schema – a pyspark.sql.types.DataType or a datatype string or a list of column names, default is None. The data type string format is the same as pyspark.sql.types.DataType.simpleString, except that the top-level struct type can omit the struct<> and atomic types use typeName() as their format, e.g. use byte instead of tinyint for pyspark.sql.types.ByteType. We can also use int as a short name for IntegerType (a short sketch of this shorthand follows the parameter list).

samplingRatio – the sample ratio of rows used for inferring the schema.

verifySchema – verify data types of every row against schema.
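As a short sketch of the datatype-string shorthand described for schema above (assuming the same spark session as the examples below; the column names are only illustrative):

>>> spark.createDataFrame([(1, 'a')], "id: byte, tag: string").dtypes
[('id', 'tinyint'), ('tag', 'string')]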
- Returns
DataFrame
Changed in version 2.1: Added verifySchema.
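A minimal sketch of the check that verifySchema controls, assuming the same spark session as the examples below (the exact error message is elided, and the behaviour when verification is skipped is not asserted here):

>>> bad = [('Alice', 'ten')]
>>> spark.createDataFrame(bad, "name: string, age: int").collect()
Traceback (most recent call last):
    ...
TypeError: ...
>>> df = spark.createDataFrame(bad, "name: string, age: int", verifySchema=False)  # no upfront check; problems may only surface later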
Note
Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.
Note
When Arrow optimization is enabled, strings inside a pandas DataFrame in Python 2 are converted into bytes, as they are bytes in Python 2, whereas regular strings are left as strings. When using strings in Python 2, use unicode literals (u"") as is standard Python practice.
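A minimal sketch of the Arrow-accelerated path mentioned in the notes above (assumes pandas and pyarrow are installed and uses the configuration key named in the first note):

>>> import pandas as pd
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
>>> pdf = pd.DataFrame({'name': ['Alice'], 'age': [1]})
>>> spark.createDataFrame(pdf).collect()
[Row(name='Alice', age=1)]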
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]
>>> spark.createDataFrame(l, ['name', 'age']).collect()
[Row(name='Alice', age=1)]

>>> d = [{'name': 'Alice', 'age': 1}]
>>> spark.createDataFrame(d).collect()
[Row(age=1, name='Alice')]

>>> rdd = sc.parallelize(l)
>>> spark.createDataFrame(rdd).collect()
[Row(_1='Alice', _2=1)]
>>> df = spark.createDataFrame(rdd, ['name', 'age'])
>>> df.collect()
[Row(name='Alice', age=1)]

>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> person = rdd.map(lambda r: Person(*r))
>>> df2 = spark.createDataFrame(person)
>>> df2.collect()
[Row(name='Alice', age=1)]

>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> df3 = spark.createDataFrame(rdd, schema)
>>> df3.collect()
[Row(name='Alice', age=1)]

>>> spark.createDataFrame(df.toPandas()).collect()
[Row(name='Alice', age=1)]
>>> spark.createDataFrame(pandas.DataFrame([[1, 2]])).collect()
[Row(0=1, 1=2)]

>>> spark.createDataFrame(rdd, "a: string, b: int").collect()
[Row(a='Alice', b=1)]
>>> rdd = rdd.map(lambda row: row[1])
>>> spark.createDataFrame(rdd, "int").collect()
[Row(value=1)]
>>> spark.createDataFrame(rdd, "boolean").collect()
Traceback (most recent call last):
    ...
Py4JJavaError: ...
New in version 2.0.