dummy_spark package¶
Subpackages¶
Contents¶
class dummy_spark.RDD(jrdd, ctx, jrdd_deserializer=None)[source]¶
A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. This is a dummy version of that abstraction: under the hood it is simply a Python list. It is intended for testing, and perhaps for development if you play fast and loose.
Important note: the dummy RDD is NOT lazily evaluated.
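As a quick orientation before the member listing, here is a minimal usage sketch. It assumes an existing dummy SparkContext named sc (documented further down this page) and that list-backed equivalents of common PySpark RDD methods such as filter and collect behave as they do in PySpark; treat it as an illustration, not as documented behaviour.

    # Minimal sketch only: sc is assumed to be a dummy_spark.SparkContext, and
    # filter/collect are assumed to mirror the PySpark RDD API.
    rdd = sc.range(0, 10)                     # sc.range is documented below
    evens = rdd.filter(lambda x: x % 2 == 0)  # runs immediately; the dummy RDD is not lazy
    print(evens.collect())                    # [0, 2, 4, 6, 8]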
aggregate(zeroValue, seqOp, combOp)[source]¶ NotImplemented
Parameters:
- zeroValue –
- seqOp –
- combOp –
Returns:

aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions=None)[source]¶ NotImplemented
Parameters:
- zeroValue –
- seqFunc –
- combFunc –
- numPartitions –
Returns:

coalesce(numPartitions, shuffle=False)[source]¶ NotImplemented
Parameters:
- numPartitions –
- shuffle –
Returns:

combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None)[source]¶ NotImplemented
Parameters:
- createCombiner –
- mergeValue –
- mergeCombiners –
- numPartitions –
Returns:

context¶
Returns:

foldByKey(zeroValue, func, numPartitions=None)[source]¶ NotImplemented
Parameters:
- zeroValue –
- func –
- numPartitions –
Returns:

fullOuterJoin(other, numPartitions=None)[source]¶ NotImplemented
Parameters:
- other –
- numPartitions –
Returns:

join(other, numPartitions=None)[source]¶ NotImplemented
Parameters:
- other –
- numPartitions –
Returns:

leftOuterJoin(other, numPartitions=None)[source]¶ NotImplemented
Parameters:
- other –
- numPartitions –
Returns:
mapPartitions(f, preservesPartitioning=False)[source]¶
Parameters:
- f –
- preservesPartitioning –
Returns:

mapPartitionsWithIndex(f, preservesPartitioning=False)[source]¶ NotImplemented
Parameters:
- f –
- preservesPartitioning –
Returns:

partitionBy(numPartitions, partitionFunc=None)[source]¶ NotImplemented
Parameters:
- numPartitions –
- partitionFunc –
Returns:

repartitionAndSortWithinPartitions(numPartitions=None, partitionFunc=None, ascending=True, keyfunc=<function <lambda>>)[source]¶
Parameters:
- numPartitions –
- partitionFunc –
- ascending –
- keyfunc –
Returns:

rightOuterJoin(other, numPartitions=None)[source]¶ NotImplemented
Parameters:
- other –
- numPartitions –
Returns:

sample(withReplacement, fraction, seed=None)[source]¶
Parameters:
- withReplacement –
- fraction –
- seed –
Returns:

sampleByKey(withReplacement, fractions, seed=None)[source]¶ NotImplemented
Parameters:
- withReplacement –
- fractions –
- seed –
Returns:

saveAsHadoopDataset(conf, keyConverter=None, valueConverter=None)[source]¶ NotImplemented
Parameters:
- conf –
- keyConverter –
- valueConverter –
Returns:
saveAsHadoopFile(path, outputFormatClass, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, conf=None, compressionCodecClass=None)[source]¶ NotImplemented
Parameters:
- path –
- outputFormatClass –
- keyClass –
- valueClass –
- keyConverter –
- valueConverter –
- conf –
- compressionCodecClass –
Returns:

saveAsNewAPIHadoopDataset(conf, keyConverter=None, valueConverter=None)[source]¶ NotImplemented
Parameters:
- conf –
- keyConverter –
- valueConverter –
Returns:

saveAsNewAPIHadoopFile(path, outputFormatClass, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, conf=None)[source]¶ NotImplemented
Parameters:
- path –
- outputFormatClass –
- keyClass –
- valueClass –
- keyConverter –
- valueConverter –
- conf –
Returns:

saveAsPickleFile(path, batchSize=10)[source]¶ NotImplemented
Parameters:
- path –
- batchSize –
Returns:

saveAsSequenceFile(path, compressionCodecClass=None)[source]¶ NotImplemented
Parameters:
- path –
- compressionCodecClass –
Returns:

saveAsTextFile(path, compressionCodecClass=None)[source]¶ NotImplemented
Parameters:
- path –
- compressionCodecClass –
Returns:

sortBy(keyfunc, ascending=True, numPartitions=None)[source]¶
Parameters:
- keyfunc –
- ascending –
- numPartitions –
Returns:

sortByKey(ascending=True, numPartitions=None, keyfunc=<function <lambda>>)[source]¶
Parameters:
- ascending –
- numPartitions –
- keyfunc –
Returns:
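sortBy and sortByKey above are implemented in the dummy RDD. A short hedged sketch of typical calls, assuming PySpark-compatible semantics for them and for map and collect:

    # Hedged sketch: assumes sortBy/sortByKey, map and collect follow the PySpark contract.
    nums = sc.range(0, 5)
    print(nums.sortBy(lambda x: x, ascending=False).collect())   # [4, 3, 2, 1, 0]
    pairs = nums.map(lambda x: (x % 2, x))                       # key-value pairs
    print(pairs.sortByKey(ascending=True).collect())             # sorted by the 0/1 key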
subtract(other, numPartitions=None)[source]¶ NotImplemented
Parameters:
- other –
- numPartitions –
Returns:

subtractByKey(other, numPartitions=None)[source]¶ NotImplemented
Parameters:
- other –
- numPartitions –
Returns:

takeSample(withReplacement, num, seed=None)[source]¶
Parameters:
- withReplacement –
- num –
- seed –
Returns:
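sample and takeSample are also implemented. A brief sketch, assuming they follow the PySpark contract (sample returns a new RDD holding roughly the requested fraction of elements, takeSample returns a plain Python list):

    # Hedged sketch: assumes PySpark-compatible sample/takeSample behaviour.
    rdd = sc.range(0, 100)
    subset = rdd.sample(withReplacement=False, fraction=0.1, seed=42)  # new dummy RDD
    picks = rdd.takeSample(withReplacement=False, num=5, seed=42)      # plain list of 5 elements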
class dummy_spark.SparkConf(loadDefaults=True, _jvm=None, _jconf=None)[source]¶
DEBUG_STRING = 'no string for dummy version'¶
class dummy_spark.SparkContext(master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=None, conf=None, gateway=None, jsc=None, profiler_cls=None)[source]¶
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.
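A minimal construction sketch; the master and appName values are placeholders, and passing a dummy SparkConf via conf is assumed to work as it does in PySpark:

    # Hedged sketch: create a dummy SparkContext the way one would in PySpark.
    from dummy_spark import SparkConf, SparkContext

    conf = SparkConf()                                   # dummy configuration; see SparkConf above
    sc = SparkContext(master='local', appName='example', conf=conf)
    print(sc.version)                                    # the version property is documented below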
DUMMY_VERSION = 'dummy version'¶
PACKAGE_EXTENSIONS = ('.zip', '.egg', '.jar')¶
accumulator(value, accum_param=None)[source]¶ NotImplemented
Parameters:
- value –
- accum_param –
Returns:

binaryFiles(path, minPartitions=None)[source]¶ NotImplemented
Parameters:
- path –
- minPartitions –
Returns:

binaryRecords(path, recordLength)[source]¶ NotImplemented
Parameters:
- path –
- recordLength –
Returns:

defaultMinPartitions¶
Returns:

defaultParallelism¶
Returns:

hadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]¶ NotImplemented
Parameters:
- path –
- inputFormatClass –
- keyClass –
- valueClass –
- keyConverter –
- valueConverter –
- conf –
- batchSize –
Returns:

hadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]¶ NotImplemented
Parameters:
- inputFormatClass –
- keyClass –
- valueClass –
- keyConverter –
- valueConverter –
- conf –
- batchSize –
Returns:

newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]¶ NotImplemented
Parameters:
- path –
- inputFormatClass –
- keyClass –
- valueClass –
- keyConverter –
- valueConverter –
- conf –
- batchSize –
Returns:
newAPIHadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]¶
Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. This will be converted into a Configuration in Java. The mechanism is the same as for sc.sequenceFile.
Parameters:
- inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)
- keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
- valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
- keyConverter – (None by default)
- valueConverter – (None by default)
- conf – Hadoop configuration, passed in as a dict (None by default)
- batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
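For reference, the call shape implied by the signature and parameter descriptions above; the class names, input path and conf keys are illustrative placeholders, and in the dummy implementation the call may not do anything useful:

    # Illustrative call shape only; values are placeholders, not tested behaviour.
    rdd = sc.newAPIHadoopRDD(
        inputFormatClass='org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
        keyClass='org.apache.hadoop.io.LongWritable',
        valueClass='org.apache.hadoop.io.Text',
        conf={'mapreduce.input.fileinputformat.inputdir': '/tmp/example'},
    )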
pickleFile(name, minPartitions=None)[source]¶ NotImplemented
Parameters:
- name –
- minPartitions –
Returns:

range(start, end=None, step=1, numSlices=None)[source]¶
Parameters:
- start –
- end –
- step –
- numSlices –
Returns:
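In PySpark, range follows Python's built-in range semantics (start inclusive, end exclusive; a single argument is treated as end). A short sketch, assuming the dummy version behaves the same way and that collect is available:

    # Hedged sketch: assumes PySpark-style range semantics.
    print(sc.range(5).collect())          # [0, 1, 2, 3, 4]
    print(sc.range(2, 10, 2).collect())   # [2, 4, 6, 8]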
runJob(rdd, partitionFunc, partitions=None, allowLocal=False)[source]¶ NotImplemented
Parameters:
- rdd –
- partitionFunc –
- partitions –
- allowLocal –
Returns:

sequenceFile(path, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)[source]¶ NotImplemented
Parameters:
- path –
- keyClass –
- valueClass –
- keyConverter –
- valueConverter –
- minSplits –
- batchSize –
Returns:

setJobGroup(groupId, description, interruptOnCancel=False)[source]¶ NotImplemented
Parameters:
- groupId –
- description –
- interruptOnCancel –
Returns:

startTime¶
Returns:

textFile(name, minPartitions=None, use_unicode=True)[source]¶
Parameters:
- name –
- minPartitions –
- use_unicode –
Returns:
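textFile is implemented in the dummy context. A hedged sketch (the path is a placeholder, and collect is assumed to mirror the PySpark RDD API, returning one element per line of the file):

    # Hedged sketch: the path is illustrative only.
    lines = sc.textFile('/tmp/example.txt')
    print(lines.collect())   # list of lines from the file, assuming PySpark-like behaviour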
version¶
Returns: