dummy_spark package

Contents

class dummy_spark.RDD(jrdd, ctx, jrdd_deserializer=None)[source]

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. This is a dummy version of that abstraction: under the hood it is just a Python list. It is intended for testing, and perhaps for development if you play fast and loose.

Important note: the dummy RDD is NOT lazily evaluated; every transformation is applied immediately to the underlying list.
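
A minimal usage sketch, assuming dummy_spark is importable and that the methods below without a NotImplemented note (parallelize, map, filter, collect, count, first) behave like their pyspark counterparts over a plain list; the SparkContext call follows the signature documented further down:

    from dummy_spark import SparkConf, SparkContext

    # No cluster is involved in the dummy version; master can be an empty string.
    sc = SparkContext(master='', conf=SparkConf())

    rdd = sc.parallelize(range(10))               # backed by a plain Python list
    squares = rdd.map(lambda x: x * x)            # applied immediately; nothing is lazy
    evens = squares.filter(lambda x: x % 2 == 0)

    print(evens.collect())   # [0, 4, 16, 36, 64]
    print(evens.count())     # 5
    print(evens.first())     # 0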

aggregate(zeroValue, seqOp, combOp)[source]

NotImplemented

Parameters:
  • zeroValue
  • seqOp
  • combOp
Returns:

aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions=None)[source]

NotImplemented

Parameters:
  • zeroValue
  • seqFunc
  • combFunc
  • numPartitions
Returns:

cache()[source]
Returns:
cartesian(other)[source]
Parameters:other
Returns:
checkpoint()[source]
Returns:
coalesce(numPartitions, shuffle=False)[source]

NotImplemented

Parameters:
  • numPartitions
  • shuffle
Returns:

cogroup(other, numPartitions=None)[source]
Parameters:
  • other
  • numPartitions
Returns:

collect()[source]
Returns:
collectAsMap()[source]

NotImplemented

Returns:
combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None)[source]

NotImplemented

Parameters:
  • createCombiner
  • mergeValue
  • mergeCombiners
  • numPartitions
Returns:

context
Returns:
count()[source]
Returns:
countApprox(timeout, confidence=0.95)[source]
Parameters:
  • timeout
  • confidence
Returns:

countApproxDistinct(relativeSD=0.05)[source]
Parameters:relativeSD
Returns:
countByKey()[source]

NotImplemented

Returns:
countByValue()[source]

NotImplemented

Returns:
distinct(numPartitions=None)[source]
Parameters:numPartitions
Returns:
filter(f)[source]
Parameters:f
Returns:
first()[source]
Returns:
flatMap(f, preservesPartitioning=False)[source]
Parameters:
  • f
  • preservesPartitioning
Returns:

flatMapValues(f)[source]
Parameters:f
Returns:
fold(zeroValue, op)[source]

NotImplemented

Parameters:
  • zeroValue
  • op
Returns:

foldByKey(zeroValue, func, numPartitions=None)[source]

NotImplemented

Parameters:
  • zeroValue
  • func
  • numPartitions
Returns:

foreach(f)[source]
Parameters:f
Returns:
foreachPartition(f)[source]
Parameters:f
Returns:
fullOuterJoin(other, numPartitions=None)[source]

NotImplemented

Parameters:
  • other
  • numPartitions
Returns:

getCheckpointFile()[source]
Returns:
getNumPartitions()[source]
Returns:
getStorageLevel()[source]

NotImplemented

Returns:
glom()[source]
Returns:
groupBy(f, numPartitions=None)[source]
Parameters:
  • f
  • numPartitions
Returns:

groupByKey(numPartitions=None)[source]
Parameters:numPartitions
Returns:
groupWith(other, *others)[source]

NotImplemented

Parameters:
  • other
  • others
Returns:

histogram(buckets)[source]

NotImplemented

Parameters:buckets
Returns:
id()[source]
Returns:
intersection(other)[source]
Parameters:other
Returns:
isCheckpointed()[source]
Returns:
isEmpty()[source]
Returns:
join(other, numPartitions=None)[source]

NotImplemented

Parameters:
  • other
  • numPartitions
Returns:

keyBy(f)[source]

NotImplemented

Parameters:f
Returns:
keys()[source]

NotImplemented

Returns:
leftOuterJoin(other, numPartitions=None)[source]

NotImplemented

Parameters:
  • other
  • numPartitions
Returns:

lookup(key)[source]
Parameters:key
Returns:
map(f, preservesPartitioning=False)[source]
Parameters:
  • f
  • preservesPartitioning
Returns:

mapPartitions(f, preservesPartitioning=False)[source]
Parameters:
  • f
  • preservesPartitioning
Returns:

mapPartitionsWithIndex(f, preservesPartitioning=False)[source]

NotImplemented

Parameters:
  • f
  • preservesPartitioning
Returns:

mapValues(f)[source]
Parameters:f
Returns:
max(key=None)[source]
Parameters:key
Returns:
mean()[source]
Returns:
meanApprox(timeout, confidence=0.95)[source]
Parameters:
  • timeout
  • confidence
Returns:

min(key=None)[source]
Parameters:key
Returns:
name()[source]
Returns:
partitionBy(numPartitions, partitionFunc=None)[source]

NotImplemented

Parameters:
  • numPartitions
  • partitionFunc
Returns:

persist(storageLevel=None)[source]
Parameters:storageLevel
Returns:
pipe(command, env=None)[source]

NotImplemented

Parameters:
  • command
  • env
Returns:

randomSplit(weights, seed=None)[source]
Parameters:
  • weights
  • seed
Returns:

reduce(f)[source]

NotImplemented

Parameters:f
Returns:
reduceByKey(func, numPartitions=None)[source]
Parameters:
  • func
  • numPartitions
Returns:

reduceByKeyLocally(func)[source]

NotImplemented

Parameters:func
Returns:
repartition(numPartitions)[source]
Parameters:numPartitions
Returns:
repartitionAndSortWithinPartitions(numPartitions=None, partitionFunc=None, ascending=True, keyfunc=<function <lambda>>)[source]
Parameters:
  • numPartitions
  • partitionFunc
  • ascending
  • keyfunc
Returns:

rightOuterJoin(other, numPartitions=None)[source]

NotImplemented

Parameters:
  • other
  • numPartitions
Returns:

sample(withReplacement, fraction, seed=None)[source]
Parameters:
  • withReplacement
  • fraction
  • seed
Returns:

sampleByKey(withReplacement, fractions, seed=None)[source]

NotImplemented

Parameters:
  • withReplacement
  • fractions
  • seed
Returns:

sampleStdev()[source]

NotImplemented

Returns:
sampleVariance()[source]

NotImplemented

Returns:
saveAsHadoopDataset(conf, keyConverter=None, valueConverter=None)[source]

NotImplemented

Parameters:
  • conf
  • keyConverter
  • valueConverter
Returns:

saveAsHadoopFile(path, outputFormatClass, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, conf=None, compressionCodecClass=None)[source]

NotImplemented

Parameters:
  • path
  • outputFormatClass
  • keyClass
  • valueClass
  • keyConverter
  • valueConverter
  • conf
  • compressionCodecClass
Returns:

saveAsNewAPIHadoopDataset(conf, keyConverter=None, valueConverter=None)[source]

NotImplemented

Parameters:
  • conf
  • keyConverter
  • valueConverter
Returns:

saveAsNewAPIHadoopFile(path, outputFormatClass, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, conf=None)[source]

NotImplemented

Parameters:
  • path
  • outputFormatClass
  • keyClass
  • valueClass
  • keyConverter
  • valueConverter
  • conf
Returns:

saveAsPickleFile(path, batchSize=10)[source]

NotImplemented

Parameters:
  • path
  • batchSize
Returns:

saveAsSequenceFile(path, compressionCodecClass=None)[source]

NotImplemented

Parameters:
  • path
  • compressionCodecClass
Returns:

saveAsTextFile(path, compressionCodecClass=None)[source]

NotImplemented

Parameters:
  • path
  • compressionCodecClass
Returns:

setName(name)[source]
Parameters:name
Returns:
sortBy(keyfunc, ascending=True, numPartitions=None)[source]
Parameters:
  • keyfunc
  • ascending
  • numPartitions
Returns:

sortByKey(ascending=True, numPartitions=None, keyfunc=<function <lambda>>)[source]
Parameters:
  • ascending
  • numPartitions
  • keyfunc
Returns:

stats()[source]

NotImplemented

Returns:
stdev()[source]

NotImplemented

Returns:
subtract(other, numPartitions=None)[source]

NotImplemented

Parameters:
  • other
  • numPartitions
Returns:

subtractByKey(other, numPartitions=None)[source]

NotImplemented

Parameters:
  • other
  • numPartitions
Returns:

sum()[source]
Returns:
sumApprox(timeout, confidence=0.95)[source]
Parameters:
  • timeout
  • confidence
Returns:

take(num)[source]
Parameters:num
Returns:
takeOrdered(num, key=None)[source]

NotImplemented

Parameters:
  • num
  • key
Returns:

takeSample(withReplacement, num, seed=None)[source]
Parameters:
  • withReplacement
  • num
  • seed
Returns:

toDebugString()[source]

NotImplemented

Returns:
toLocalIterator()[source]
Returns:
top(num, key=None)[source]

NotImplemented

Parameters:
  • num
  • key
Returns:

treeAggregate(zeroValue, seqOp, combOp, depth=2)[source]

NotImplemented

Parameters:
  • zeroValue
  • seqOp
  • combOp
  • depth
Returns:

treeReduce(f, depth=2)[source]

NotImplemented

Parameters:
  • f
  • depth
Returns:

union(other)[source]
Parameters:other
Returns:
unpersist()[source]
Returns:
values()[source]

NotImplemented

Returns:
variance()[source]

NotImplemented

Returns:
zip(other)[source]
Parameters:other
Returns:
zipWithIndex()[source]
Returns:
zipWithUniqueId()[source]

NotImplemented

Returns:
class dummy_spark.SparkConf(loadDefaults=True, _jvm=None, _jconf=None)[source]
DEBUG_STRING = 'no string for dummy version'
contains(key)[source]
Parameters:key
Returns:
get(key, defaultValue=None)[source]
Parameters:
  • key
  • defaultValue
Returns:

getAll()[source]
Returns:
set(key, value)[source]
Parameters:
  • key
  • value
Returns:

setAll(pairs)[source]
Parameters:pairs
Returns:
setAppName(value)[source]
Parameters:value
Returns:
setExecutorEnv(key=None, value=None, pairs=None)[source]
Parameters:
  • key
  • value
  • pairs
Returns:

setIfMissing(key, value)[source]
Parameters:
  • key
  • value
Returns:

setMaster(value)[source]
Parameters:value
Returns:
setSparkHome(value)[source]
Parameters:value
Returns:
toDebugString()[source]
Returns:
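
A short sketch of the SparkConf calls listed above, assuming they mirror the pyspark API; the spark.some.option key is purely illustrative, and the toDebugString output is assumed to be the DEBUG_STRING constant:

    from dummy_spark import SparkConf

    conf = SparkConf()
    conf.setAppName('dummy-test')
    conf.setMaster('local[2]')
    conf.set('spark.some.option', 'value')        # hypothetical key, for illustration only

    print(conf.get('spark.some.option'))          # 'value'
    print(conf.get('missing.key', 'fallback'))    # defaultValue is returned when the key is absent
    print(conf.contains('spark.some.option'))     # expected True after set()
    print(conf.toDebugString())                   # presumably DEBUG_STRING in the dummy version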
class dummy_spark.SparkContext(master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=None, conf=None, gateway=None, jsc=None, profiler_cls=None)[source]

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster. In this dummy version there is no cluster: everything runs in the local Python process. A short usage sketch follows the class constants below.

DUMMY_VERSION = 'dummy version'
PACKAGE_EXTENSIONS = ('.zip', '.egg', '.jar')
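
A minimal sketch of driving the dummy context, assuming the constructor accepts the conf keyword shown in the signature above and that the methods without a NotImplemented note (parallelize, range, emptyRDD, stop) behave like pyspark's, only over local lists:

    from dummy_spark import SparkConf, SparkContext

    sc = SparkContext(master='', appName='dummy-demo', conf=SparkConf())

    pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])
    totals = pairs.reduceByKey(lambda x, y: x + y)   # reduceByKey is implemented on the dummy RDD
    print(sorted(totals.collect()))                  # [('a', 4), ('b', 2)]

    print(sc.range(0, 10, 2).collect())              # [0, 2, 4, 6, 8], assuming a pyspark-style end-exclusive range
    print(sc.emptyRDD().isEmpty())                   # True
    print(sc.version)                                # presumably the DUMMY_VERSION string, not a real Spark version

    sc.stop()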
accumulator(value, accum_param=None)[source]

NotImplemented

Parameters:
  • value
  • accum_param
Returns:

addFile(path)[source]

NotImplemented

Parameters:path
Returns:
addPyFile(path)[source]
Parameters:path
Returns:
binaryFiles(path, minPartitions=None)[source]

NotImplemented

Parameters:
  • path
  • minPartitions
Returns:

binaryRecords(path, recordLength)[source]

NotImplemented

Parameters:
  • path
  • recordLength
Returns:

broadcast(value)[source]

NotImplemented

Parameters:value
Returns:
cancelAllJobs()[source]

NotImplemented

Returns:
cancelJobGroup(groupId)[source]

NotImplemented

Parameters:groupId
Returns:
clearFiles()[source]

NotImplemented

Returns:
defaultMinPartitions
Returns:
defaultParallelism
Returns:
dump_profiles(path)[source]

NotImplemented

Parameters:path
Returns:
emptyRDD()[source]
Returns:
getLocalProperty(key)[source]

NotImplemented

Parameters:key
Returns:
hadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]

NotImplemented

Parameters:
  • path
  • inputFormatClass
  • keyClass
  • valueClass
  • keyConverter
  • valueConverter
  • conf
  • batchSize
Returns:

hadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]

NotImplemented

Parameters:
  • inputFormatClass
  • keyClass
  • valueClass
  • keyConverter
  • valueConverter
  • conf
  • batchSize
Returns:

newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]

NotImplemented

Parameters:
  • path
  • inputFormatClass
  • keyClass
  • valueClass
  • keyConverter
  • valueConverter
  • conf
  • batchSize
Returns:

newAPIHadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]

Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. This will be converted into a Configuration in Java. The mechanism is the same as for sc.sequenceFile.

Parameters:
  • inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
  • keyConverter – (None by default)
  • valueConverter – (None by default)
  • conf – Hadoop configuration, passed in as a dict (None by default)
  • batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
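
To make the parameter list above concrete, here is a hedged sketch of the call shape using standard Hadoop class names (for TextInputFormat the key is the byte offset as LongWritable and the value is the line as Text); the input-path conf key and any actual reading behaviour of the dummy implementation are assumptions, so treat this purely as an illustration of how the dict is passed:

    from dummy_spark import SparkConf, SparkContext

    sc = SparkContext(master='', conf=SparkConf())

    # Call shape only: the conf dict stands in for the Hadoop Configuration described above.
    rdd = sc.newAPIHadoopRDD(
        inputFormatClass='org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
        keyClass='org.apache.hadoop.io.LongWritable',    # byte offset of each line
        valueClass='org.apache.hadoop.io.Text',          # the line itself
        conf={'mapreduce.input.fileinputformat.inputdir': '/path/to/input'},  # assumed config key
    )
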
parallelize(c, numSlices=None)[source]
Parameters:
  • c
  • numSlices
Returns:

pickleFile(name, minPartitions=None)[source]

NotImplemented

Parameters:
  • name
  • minPartitions
Returns:

range(start, end=None, step=1, numSlices=None)[source]
Parameters:
  • start
  • end
  • step
  • numSlices
Returns:

runJob(rdd, partitionFunc, partitions=None, allowLocal=False)[source]

NotImplemented

Parameters:
  • rdd
  • partitionFunc
  • partitions
  • allowLocal
Returns:

sequenceFile(path, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)[source]

NotImplemented

Parameters:
  • path
  • keyClass
  • valueClass
  • keyConverter
  • valueConverter
  • minSplits
  • batchSize
Returns:

setCheckpointDir(dirName)[source]

NotImplemented

Parameters:dirName
Returns:
setJobGroup(groupId, description, interruptOnCancel=False)[source]

NotImplemented

Parameters:
  • groupId
  • description
  • interruptOnCancel
Returns:

setLocalProperty(key, value)[source]

NotImplemented

Parameters:
  • key
  • value
Returns:

setLogLevel(logLevel)[source]
Parameters:logLevel
Returns:
classmethod setSystemProperty(key, value)[source]
Parameters:
  • key
  • value
Returns:

show_profiles()[source]

NotImplemented

Returns:
sparkUser()[source]

NotImplemented

Returns:
startTime
Returns:
statusTracker()[source]

NotImplemented

Returns:
stop()[source]
Returns:
textFile(name, minPartitions=None, use_unicode=True)[source]
Parameters:
  • name
  • minPartitions
  • use_unicode
Returns:

union(rdds)[source]

NotImplemented

Parameters:rdds
Returns:
version
Returns:
wholeTextFiles(path, minPartitions=None, use_unicode=True)[source]

NotImplemented

Parameters:
  • path
  • minPartitions
  • use_unicode
Returns: