pyspark package

Subpackages

Contents

class dummy_spark.RDD(jrdd, ctx, jrdd_deserializer=None)

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. This is a dummy version of that abstraction: under the hood it is just a Python list. It is meant for testing, and perhaps for development if you play fast and loose.

Important note: the dummy RDD is NOT lazily evaluated.
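
A minimal usage sketch. Assumptions, since none of this is stated above: jrdd can be an ordinary Python list (the dummy is just a list under the hood), a no-argument SparkContext works, and collect() behaves as in pyspark even though it does not appear in this excerpt:

    >>> from dummy_spark import RDD, SparkContext
    >>> sc = SparkContext()
    >>> rdd = RDD([1, 2, 3, 4], sc)  # jrdd is a plain Python list here
    >>> rdd.collect()                # assumed pyspark-style collect()
    [1, 2, 3, 4]
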
- aggregate(zeroValue, seqOp, combOp): NotImplemented
- aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions=None): NotImplemented
- coalesce(numPartitions, shuffle=False): NotImplemented
- combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None): NotImplemented
- context (property)
- foldByKey(zeroValue, func, numPartitions=None): NotImplemented
- fullOuterJoin(other, numPartitions=None): NotImplemented
- join(other, numPartitions=None): NotImplemented
- leftOuterJoin(other, numPartitions=None): NotImplemented
- mapPartitions(f, preservesPartitioning=False)
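
  A sketch assuming the pyspark contract, in which f receives an iterator over each partition and returns an iterable; collect() is assumed from the pyspark API, and since the dummy RDD is a single list the example assumes one partition:

      >>> rdd = RDD([1, 2, 3, 4], sc)
      >>> rdd.mapPartitions(lambda part: [sum(part)]).collect()
      [10]
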
- mapPartitionsWithIndex(f, preservesPartitioning=False): NotImplemented
- partitionBy(numPartitions, partitionFunc=None): NotImplemented
- repartitionAndSortWithinPartitions(numPartitions=None, partitionFunc=None, ascending=True, keyfunc=<function <lambda>>)
- rightOuterJoin(other, numPartitions=None): NotImplemented
- sample(withReplacement, fraction, seed=None)
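
  A sketch assuming pyspark semantics, where fraction is the expected proportion of elements kept and seed makes the draw repeatable (the parameters are undocumented above):

      >>> rdd = RDD(list(range(100)), sc)
      >>> sampled = rdd.sample(False, 0.1, seed=42)  # no replacement, expect roughly 10 elements
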
- sampleByKey(withReplacement, fractions, seed=None): NotImplemented
- saveAsHadoopDataset(conf, keyConverter=None, valueConverter=None): NotImplemented
- saveAsHadoopFile(path, outputFormatClass, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, conf=None, compressionCodecClass=None): NotImplemented
- saveAsNewAPIHadoopDataset(conf, keyConverter=None, valueConverter=None): NotImplemented
- saveAsNewAPIHadoopFile(path, outputFormatClass, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, conf=None): NotImplemented
- saveAsPickleFile(path, batchSize=10): NotImplemented
- saveAsSequenceFile(path, compressionCodecClass=None): NotImplemented
- saveAsTextFile(path, compressionCodecClass=None): NotImplemented
- sortBy(keyfunc, ascending=True, numPartitions=None)
- sortByKey(ascending=True, numPartitions=None, keyfunc=<function <lambda>>)
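
  A sketch of both sort methods, assuming pyspark semantics (sortBy orders by an arbitrary key function, sortByKey orders (key, value) pairs by key; collect() is assumed):

      >>> pairs = RDD([('b', 2), ('c', 3), ('a', 1)], sc)
      >>> pairs.sortByKey().collect()
      [('a', 1), ('b', 2), ('c', 3)]
      >>> pairs.sortBy(lambda kv: kv[1], ascending=False).collect()
      [('c', 3), ('b', 2), ('a', 1)]
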
- subtract(other, numPartitions=None): NotImplemented
- subtractByKey(other, numPartitions=None): NotImplemented
- takeSample(withReplacement, num, seed=None)
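
  A sketch assuming that, as in pyspark, takeSample eagerly returns a plain Python list of num sampled elements rather than an RDD (the return value is undocumented above):

      >>> rdd = RDD(list(range(10)), sc)
      >>> picked = rdd.takeSample(False, 3, seed=7)
      >>> len(picked)
      3
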

class dummy_spark.SparkConf(loadDefaults=True, _jvm=None, _jconf=None)

- DEBUG_STRING = 'no string for dummy version'

class dummy_spark.SparkContext(master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=None, conf=None, gateway=None, jsc=None, profiler_cls=None)

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDD and broadcast variables on that cluster.
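
Every constructor argument is optional, so a throwaway context for tests can be built directly. A minimal sketch; the argument values are illustrative, and whether the dummy validates them is not stated:

    >>> from dummy_spark import SparkConf, SparkContext
    >>> sc = SparkContext(master='local[*]', appName='unit-tests', conf=SparkConf())
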
- DUMMY_VERSION = 'dummy version'
- PACKAGE_EXTENSIONS = ('.zip', '.egg', '.jar')
- accumulator(value, accum_param=None): NotImplemented
- binaryFiles(path, minPartitions=None): NotImplemented
- binaryRecords(path, recordLength): NotImplemented
- defaultMinPartitions (property)
- defaultParallelism (property)
- hadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0): NotImplemented
- hadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0): NotImplemented
- newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0): NotImplemented
- newAPIHadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)
  Read a 'new API' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. This will be converted into a Configuration in Java. The mechanism is the same as for sc.sequenceFile.
  Parameters:
  - inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. "org.apache.hadoop.mapreduce.lib.input.TextInputFormat")
  - keyClass – fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")
  - valueClass – fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")
  - keyConverter – (None by default)
  - valueConverter – (None by default)
  - conf – Hadoop configuration, passed in as a dict (None by default)
  - batchSize – the number of Python objects represented as a single Java object (default 0, choose batchSize automatically)
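
  A call-shape sketch mirroring the pyspark signature this dummy follows; whether the dummy context actually reads anything is not stated above, and the Hadoop configuration key and path are illustrative only:

      >>> rdd = sc.newAPIHadoopRDD(
      ...     'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
      ...     'org.apache.hadoop.io.Text',
      ...     'org.apache.hadoop.io.LongWritable',
      ...     conf={'mapreduce.input.fileinputformat.inputdir': '/tmp/input'})
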
- pickleFile(name, minPartitions=None): NotImplemented
- range(start, end=None, step=1, numSlices=None)
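
  A sketch assuming range mirrors pyspark's sc.range, which follows Python's range semantics (inclusive start, exclusive end, step) and returns the numbers as an RDD (collect() is assumed):

      >>> sc.range(5).collect()
      [0, 1, 2, 3, 4]
      >>> sc.range(2, 10, 2).collect()
      [2, 4, 6, 8]
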
- runJob(rdd, partitionFunc, partitions=None, allowLocal=False): NotImplemented
- sequenceFile(path, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0): NotImplemented
- setJobGroup(groupId, description, interruptOnCancel=False): NotImplemented
- startTime (property)
- textFile(name, minPartitions=None, use_unicode=True)
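
  A sketch assuming pyspark semantics, where the file at name is read (eagerly, since this dummy is not lazy) into an RDD with one element per line; the path is hypothetical and collect() is assumed:

      >>> lines = sc.textFile('/tmp/example.txt')
      >>> lines.collect()  # one string per line of the file
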
- version (property)