pyspark.SparkContext

class pyspark.SparkContext(master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=PickleSerializer(), conf=None, gateway=None, jsc=None, profiler_cls=<class 'pyspark.profiler.BasicProfiler'>)[source]

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.

Note

Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.

Note

Sharing a SparkContext instance across multiple processes is not supported out of the box, and PySpark does not guarantee multi-processing execution. Use threads instead for concurrent processing.
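
For example, the active context must be stopped before a second one can be created in the same JVM (a minimal sketch; the local master and app names are illustrative):

>>> from pyspark import SparkContext
>>> sc = SparkContext('local', 'first-app')
>>> sc.stop()                                 # release the active SparkContext
>>> sc = SparkContext('local', 'second-app')  # now a new context can be created
>>> sc.stop()

Alternatively, SparkContext.getOrCreate() returns the already-active context (or creates one) instead of failing.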

__init__(master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=PickleSerializer(), conf=None, gateway=None, jsc=None, profiler_cls=<class 'pyspark.profiler.BasicProfiler'>)[source]

Create a new SparkContext. At least the master and app name should be set, either through the named parameters here or through conf.

Parameters
  • master – Cluster URL to connect to (e.g. mesos://host:port, spark://host:port, local[4]).

  • appName – A name for your job, to display on the cluster web UI.

  • sparkHome – Location where Spark is installed on cluster nodes.

  • pyFiles – Collection of .zip or .py files to send to the cluster and add to PYTHONPATH. These can be paths on the local file system or HDFS, HTTP, HTTPS, or FTP URLs.

  • environment – A dictionary of environment variables to set on worker nodes.

  • batchSize – The number of Python objects represented as a single Java object. Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size.

  • serializer – The serializer for RDDs (see the usage sketch after this parameter list).

  • conf – A SparkConf object setting Spark properties.

  • gateway – Use an existing gateway and JVM, otherwise a new JVM will be instantiated.

  • jsc – The JavaSparkContext instance (optional).

  • profiler_cls – A class of custom Profiler used to do profiling (default is pyspark.profiler.BasicProfiler).
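
As an illustration of the serializer and batchSize parameters, the default PickleSerializer can be swapped for MarshalSerializer from pyspark.serializers; this is a minimal sketch, and the app name and batch size are arbitrary:

>>> from pyspark import SparkContext
>>> from pyspark.serializers import MarshalSerializer
>>> sc = SparkContext('local', 'serializer-example',
...                   serializer=MarshalSerializer(), batchSize=2)
>>> sc.parallelize(range(10)).reduce(lambda a, b: a + b)
45
>>> sc.stop()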

>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> sc2 = SparkContext('local', 'test2') 
Traceback (most recent call last):
    ...
ValueError:...
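
The master and app name can also be supplied through a SparkConf object instead of the named parameters, roughly as follows (a minimal sketch; the property values are illustrative):

>>> from pyspark import SparkConf, SparkContext
>>> conf = SparkConf().setMaster('local[2]').setAppName('conf-example')
>>> sc = SparkContext(conf=conf)
>>> sc.stop()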

Methods

Attributes