Package pyspark :: Module rdd :: Class RDD

Class RDD

Spark 1.0.2 Python API Docs

object --+
         |
        RDD

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Instance Methods

__init__(self, jrdd, ctx, jrdd_deserializer)
    x.__init__(...) initializes x; see help(type(x)) for signature
id(self)
    A unique ID for this RDD (within its SparkContext).
__repr__(self)
    repr(x)
context(self)
    The SparkContext that this RDD was created on.
cache(self)
    Persist this RDD with the default storage level (MEMORY_ONLY).
persist(self, storageLevel)
    Set this RDD's storage level to persist its values across operations after the first time it is computed.
unpersist(self)
    Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
checkpoint(self)
    Mark this RDD for checkpointing.
isCheckpointed(self)
    Return whether this RDD has been checkpointed or not.
getCheckpointFile(self)
    Gets the name of the file to which this RDD was checkpointed.
map(self, f, preservesPartitioning=False)
    Return a new RDD by applying a function to each element of this RDD.
flatMap(self, f, preservesPartitioning=False)
    Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
mapPartitions(self, f, preservesPartitioning=False)
    Return a new RDD by applying a function to each partition of this RDD.
mapPartitionsWithIndex(self, f, preservesPartitioning=False)
    Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
mapPartitionsWithSplit(self, f, preservesPartitioning=False)
    Deprecated: use mapPartitionsWithIndex instead.
filter(self, f)
    Return a new RDD containing only the elements that satisfy a predicate.
distinct(self)
    Return a new RDD containing the distinct elements in this RDD.
sample(self, withReplacement, fraction, seed=None)
    Return a sampled subset of this RDD (relies on numpy and falls back on the default random generator if numpy is unavailable).
takeSample(self, withReplacement, num, seed=None)
    Return a fixed-size sampled subset of this RDD (currently requires numpy).
union(self, other)
    Return the union of this RDD and another one.
intersection(self, other)
    Return the intersection of this RDD and another one.
__add__(self, other)
    Return the union of this RDD and another one.
sortByKey(self, ascending=True, numPartitions=None, keyfunc=lambda x: x)
    Sorts this RDD, which is assumed to consist of (key, value) pairs.
glom(self)
    Return an RDD created by coalescing all elements within each partition into a list.
cartesian(self, other)
    Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.
groupBy(self, f, numPartitions=None)
    Return an RDD of grouped items.
pipe(self, command, env={})
    Return an RDD created by piping elements to a forked external process.
foreach(self, f)
    Applies a function to all elements of this RDD.
foreachPartition(self, f)
    Applies a function to each partition of this RDD.
collect(self)
    Return a list that contains all of the elements in this RDD.
reduce(self, f)
    Reduces the elements of this RDD using the specified commutative and associative binary operator.
fold(self, zeroValue, op)
    Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value."
aggregate(self, zeroValue, seqOp, combOp)
    Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value."
max(self)
    Find the maximum item in this RDD.
min(self)
    Find the minimum item in this RDD.
sum(self)
    Add up the elements in this RDD.
count(self)
    Return the number of elements in this RDD.
stats(self)
    Return a StatCounter object that captures the mean, variance and count of the RDD's elements in one operation.
mean(self)
    Compute the mean of this RDD's elements.
variance(self)
    Compute the variance of this RDD's elements.
stdev(self)
    Compute the standard deviation of this RDD's elements.
sampleStdev(self)
    Compute the sample standard deviation of this RDD's elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N).
sampleVariance(self)
    Compute the sample variance of this RDD's elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N).
countByValue(self)
    Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.
top(self, num)
    Get the top N elements from an RDD.
takeOrdered(self, num, key=None)
    Get the N elements from an RDD ordered in ascending order or as specified by the optional key function.
take(self, num)
    Take the first num elements of the RDD.
first(self)
    Return the first element in this RDD.
saveAsTextFile(self, path)
    Save this RDD as a text file, using string representations of elements.
collectAsMap(self)
    Return the key-value pairs in this RDD to the master as a dictionary.
keys(self)
    Return an RDD with the keys of each tuple.
values(self)
    Return an RDD with the values of each tuple.
reduceByKey(self, func, numPartitions=None)
    Merge the values for each key using an associative reduce function.
reduceByKeyLocally(self, func)
    Merge the values for each key using an associative reduce function, but return the results immediately to the master as a dictionary.
countByKey(self)
    Count the number of elements for each key, and return the result to the master as a dictionary.
join(self, other, numPartitions=None)
    Return an RDD containing all pairs of elements with matching keys in self and other.
leftOuterJoin(self, other, numPartitions=None)
    Perform a left outer join of self and other.
rightOuterJoin(self, other, numPartitions=None)
    Perform a right outer join of self and other.
partitionBy(self, numPartitions, partitionFunc=portable_hash)
    Return a copy of the RDD partitioned using the specified partitioner.
combineByKey(self, createCombiner, mergeValue, mergeCombiners, numPartitions=None)
    Generic function to combine the elements for each key using a custom set of aggregation functions.
foldByKey(self, zeroValue, func, numPartitions=None)
    Merge the values for each key using an associative function "func" and a neutral "zeroValue" which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication).
groupByKey(self, numPartitions=None)
    Group the values for each key in the RDD into a single sequence.
flatMapValues(self, f)
    Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning.
mapValues(self, f)
    Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.
groupWith(self, other)
    Alias for cogroup.
cogroup(self, other, numPartitions=None)
    For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other.
subtractByKey(self, other, numPartitions=None)
    Return each (key, value) pair in self that has no pair with matching key in other.
subtract(self, other, numPartitions=None)
    Return each value in self that is not contained in other.
keyBy(self, f)
    Creates tuples of the elements in this RDD by applying f.
repartition(self, numPartitions)
    Return a new RDD that has exactly numPartitions partitions.
coalesce(self, numPartitions, shuffle=False)
    Return a new RDD that is reduced into `numPartitions` partitions.
zip(self, other)
    Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc.
name(self)
    Return the name of this RDD.
setName(self, name)
    Assign a name to this RDD.
toDebugString(self)
    A description of this RDD and its recursive dependencies for debugging.
getStorageLevel(self)
    Get the RDD's current storage level.

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __sizeof__, __str__, __subclasshook__

Properties

Inherited from object: __class__

Method Details

__init__(self, jrdd, ctx, jrdd_deserializer) (Constructor)

    x.__init__(...) initializes x; see help(type(x)) for signature

    Overrides: object.__init__ (inherited documentation)

__repr__(self) (Representation operator)

    repr(x)

    Overrides: object.__repr__ (inherited documentation)

context(self)

    The SparkContext that this RDD was created on.

    Decorators:
        @property

persist(self, storageLevel)

    Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet.
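    A minimal sketch of the cache/persist/unpersist lifecycle, assuming a live SparkContext bound to sc as in the doctests elsewhere on this page:

    >>> from pyspark import StorageLevel
    >>> rdd = sc.parallelize(range(100))
    >>> rdd = rdd.persist(StorageLevel.MEMORY_ONLY)   # same effect as rdd.cache()
    >>> rdd.count()   # first action computes the RDD and caches its blocks
    100
    >>> rdd.count()   # later actions reuse the cached blocks
    100
    >>> rdd = rdd.unpersist()   # drop the cached blocks from memory and disk

    persist is lazy: nothing is cached until the first action runs, which is why the storage level must be set before that first computation.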
checkpoint(self)

    Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir(), and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory; otherwise saving it to a file will require recomputation. (See the sketch after these method details.)

map(self, f, preservesPartitioning=False)

    Return a new RDD by applying a function to each element of this RDD.

    >>> rdd = sc.parallelize(["b", "a", "c"])
    >>> sorted(rdd.map(lambda x: (x, 1)).collect())
    [('a', 1), ('b', 1), ('c', 1)]

flatMap(self, f, preservesPartitioning=False)

    Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

    >>> rdd = sc.parallelize([2, 3, 4])
    >>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
    [1, 1, 1, 2, 2, 3]
    >>> sorted(rdd.flatMap(lambda x: [(x, x), (x, x)]).collect())
    [(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]

mapPartitions(self, f, preservesPartitioning=False)

    Return a new RDD by applying a function to each partition of this RDD.

    >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
    >>> def f(iterator): yield sum(iterator)
    >>> rdd.mapPartitions(f).collect()
    [3, 7]

mapPartitionsWithIndex(self, f, preservesPartitioning=False)

    Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.

    >>> rdd = sc.parallelize([1, 2, 3, 4], 4)
    >>> def f(splitIndex, iterator): yield splitIndex
    >>> rdd.mapPartitionsWithIndex(f).sum()
    6

mapPartitionsWithSplit(self, f, preservesPartitioning=False)

    Deprecated: use mapPartitionsWithIndex instead.
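    A minimal sketch of the checkpoint workflow described above, assuming a live SparkContext bound to sc and a writable checkpoint directory (the path /tmp/spark-checkpoints below is hypothetical):

    >>> sc.setCheckpointDir("/tmp/spark-checkpoints")   # must be set before checkpointing
    >>> rdd = sc.parallelize(range(10)).map(lambda x: x * x).cache()   # cache to avoid recomputation
    >>> rdd.checkpoint()   # must be called before any job runs on this RDD
    >>> rdd.count()        # first action computes the RDD and writes the checkpoint
    10
    >>> rdd.isCheckpointed()
    True

    Once checkpointed, the RDD's lineage is truncated: toDebugString() no longer shows its parent RDDs, and recovery reads from the checkpoint file instead of recomputing.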