Java Code Examples for org.apache.spark.api.java.JavaRDD#coalesce()
The following examples show how to use org.apache.spark.api.java.JavaRDD#coalesce().
You can go to the original project or source file by following the links above each example.
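JavaRDD#coalesce(int numPartitions) returns a new RDD with at most the given number of partitions and, by default, does so without a shuffle; the two-argument overload coalesce(int numPartitions, boolean shuffle) performs a full shuffle when the flag is true, which also allows the partition count to grow. As a quick orientation before the project examples, here is a minimal self-contained sketch (the app name, local master URL, and sample data are placeholders, not taken from any project below):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CoalesceSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("coalesce-sketch").setMaster("local[4]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Start with 4 partitions
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);
            // Shrink without a shuffle: existing partitions are simply merged
            JavaRDD<Integer> narrowed = numbers.coalesce(2);
            // With shuffle = true the data is redistributed, so the count can also grow
            JavaRDD<Integer> widened = numbers.coalesce(8, true);
            System.out.println(narrowed.getNumPartitions()); // 2
            System.out.println(widened.getNumPartitions());  // 8
        }
    }
}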
Example 1
Source File: Coalesce.java From SparkDemo with MIT License | 5 votes |
private static void coalesce(JavaSparkContext sc) {
    List<String> datas = Arrays.asList("hi", "hello", "how", "are", "you");
    JavaRDD<String> datasRDD = sc.parallelize(datas, 4);
    System.out.println("Number of RDD partitions: " + datasRDD.partitions().size());
    JavaRDD<String> datasRDD2 = datasRDD.coalesce(2, false);
    System.out.println("Number of RDD partitions: " + datasRDD2.partitions().size());
}
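Because the second argument here is false, Spark merges the four original partitions into two without moving data between executors; passing true instead would trigger a full shuffle, which is effectively what repartition(2) does.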
Example 2
Source File: SparkExport.java From DataVec with Apache License 2.0 | 5 votes |
public static void exportCSVSpark(String directory, String delimiter, String quote, int outputSplits,
                JavaRDD<List<Writable>> data) {
    //NOTE: Order is probably not random here...
    JavaRDD<String> lines = data.map(new WritablesToStringFunction(delimiter, quote));
    //coalesce returns a new RDD: reassign it, otherwise outputSplits has no effect on the saved files
    lines = lines.coalesce(outputSplits);
    lines.saveAsTextFile(directory);
}
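Note that coalesce is a transformation: it returns a new JavaRDD and leaves the receiver untouched, so the returned value has to be captured (as above) for outputSplits to have any effect on the number of files written by saveAsTextFile.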
Example 3
Source File: SparkStorageUtils.java From DataVec with Apache License 2.0 | 5 votes |
/**
 * Save a {@code JavaRDD<List<List<Writable>>>} to a Hadoop {@link org.apache.hadoop.io.SequenceFile}. Each record
 * is given a unique (but noncontiguous) {@link LongWritable} key, and values are stored as
 * {@link SequenceRecordWritable} instances.
 * <p>
 * Use {@link #restoreSequenceFileSequences(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the sequence file
 * @param rdd            RDD to save
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output sequence files
 * @see #saveSequenceFile(String, JavaRDD)
 * @see #saveMapFileSequences(String, JavaRDD)
 */
public static void saveSequenceFileSequences(String path, JavaRDD<List<List<Writable>>> rdd, Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<List<Writable>>, Long> dataIndexPairs = rdd.zipWithUniqueId(); //Note: Long values are unique + NOT contiguous; more efficient than zipWithIndex
    JavaPairRDD<LongWritable, SequenceRecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new SequenceRecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, SequenceRecordWritable.class,
            SequenceFileOutputFormat.class);
}
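Passing null for maxOutputFiles skips the coalesce entirely, so the number of output sequence files simply follows the RDD's existing partition count; passing a small value such as 8 caps the job at that many part files, at the cost of larger individual files and less write parallelism.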
Example 4
Source File: PileupSpark.java From gatk with BSD 3-Clause "New" or "Revised" License | 5 votes |
@Override
protected void processAlignments(JavaRDD<LocusWalkerContext> rdd, JavaSparkContext ctx) {
    JavaRDD<String> lines = rdd.map(pileupFunction(metadata, outputInsertLength, showVerbose));
    if (numReducers != 0) {
        lines = lines.coalesce(numReducers);
    }
    lines.saveAsTextFile(outputFile);
}
Example 5
Source File: SparkExport.java From deeplearning4j with Apache License 2.0 | 5 votes |
public static void exportCSVSpark(String directory, String delimiter, String quote, int outputSplits,
                JavaRDD<List<Writable>> data) {
    //NOTE: Order is probably not random here...
    JavaRDD<String> lines = data.map(new WritablesToStringFunction(delimiter, quote));
    //coalesce returns a new RDD: reassign it, otherwise outputSplits has no effect on the saved files
    lines = lines.coalesce(outputSplits);
    lines.saveAsTextFile(directory);
}
Example 6
Source File: SparkStorageUtils.java From deeplearning4j with Apache License 2.0 | 5 votes |
/**
 * Save a {@code JavaRDD<List<List<Writable>>>} to a Hadoop {@link org.apache.hadoop.io.SequenceFile}. Each record
 * is given a unique (but noncontiguous) {@link LongWritable} key, and values are stored as
 * {@link SequenceRecordWritable} instances.
 * <p>
 * Use {@link #restoreSequenceFileSequences(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the sequence file
 * @param rdd            RDD to save
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output sequence files
 * @see #saveSequenceFile(String, JavaRDD)
 * @see #saveMapFileSequences(String, JavaRDD)
 */
public static void saveSequenceFileSequences(String path, JavaRDD<List<List<Writable>>> rdd, Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<List<Writable>>, Long> dataIndexPairs = rdd.zipWithUniqueId(); //Note: Long values are unique + NOT contiguous; more efficient than zipWithIndex
    JavaPairRDD<LongWritable, SequenceRecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new SequenceRecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, SequenceRecordWritable.class,
            SequenceFileOutputFormat.class);
}
Example 7
Source File: SparkOperatorProfiler.java From rheem with Apache License 2.0 | 4 votes |
/**
 * If a desired number of partitions for the input {@link JavaRDD}s is requested, enforce this.
 */
protected <T> JavaRDD<T> partition(JavaRDD<T> rdd) {
    return this.numPartitions == -1 ? rdd : rdd.coalesce(this.numPartitions, true);
}
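Unlike most of the examples on this page, this profiler passes shuffle = true, so coalesce can move the RDD to exactly this.numPartitions whether that means shrinking or growing the partition count; with the default shuffle = false it could only merge existing partitions.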
Example 8
Source File: SparkReader.java From GeoTriples with Apache License 2.0 | 4 votes |
/**
 * Call the corresponding reader regarding the source of the input file
 *
 * @return a Spark JavaRDD<Row> containing the data
 */
public JavaRDD<Row> read(String repartition) {
    long startTime = System.currentTimeMillis();
    JavaRDD<Row> rowRDD = null;
    Dataset<Row> dt;
    try {
        switch (source) {
            case SHP:
                int p = StringUtils.isNumeric(repartition) ? Integer.parseInt(repartition) : 0;
                rowRDD = readSHP(p);
                break;
            case CSV:
                dt = readCSV();
                // insert a column with ID
                dt = dt.withColumn(Config.GEOTRIPLES_AUTO_ID, functions.monotonicallyIncreasingId());
                headers = dt.columns();
                rowRDD = dt.javaRDD();
                break;
            case TSV:
                dt = readTSV();
                // insert a column with ID
                dt = dt.withColumn(Config.GEOTRIPLES_AUTO_ID, functions.monotonicallyIncreasingId());
                headers = dt.columns();
                rowRDD = dt.javaRDD();
                break;
            case GEOJSON:
                dt = readGeoJSON();
                // insert a column with ID
                dt = dt.withColumn(Config.GEOTRIPLES_AUTO_ID, functions.monotonicallyIncreasingId());
                headers = dt.columns();
                rowRDD = dt.javaRDD();
                break;
            case KML:
                log.error("KML files are not Supported yet");
                break;
        }

        /*
         * Repartition the loaded dataset if requested by the user.
         * If "repartition" is set to "default", the number of partitions is calculated from the input's size;
         * otherwise the number must be defined by the user.
         */
        int partitions = rowRDD == null ? 0 : rowRDD.getNumPartitions();
        log.info("The input data was read into " + partitions + " partitions");
        if (repartition != null && source != Source.SHP) {
            int new_partitions = 0;
            if (repartition.equals("default")) {
                try {
                    Configuration conf = new Configuration();
                    FileSystem fs = FileSystem.get(conf);
                    for (String filename : filenames) {
                        Path input_path = new Path(filename);
                        double file_size = fs.getContentSummary(input_path).getLength();
                        new_partitions += Math.ceil(file_size / 120000000) + 1;
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                    System.exit(1);
                }
            } else if (StringUtils.isNumeric(repartition)) {
                new_partitions = Integer.parseInt(repartition);
            }
            if (new_partitions > 0) {
                if (partitions > new_partitions)
                    rowRDD = rowRDD.coalesce(new_partitions);
                else
                    rowRDD = rowRDD.repartition(new_partitions);
                log.info("Dataset was repartitioned into: " + new_partitions + " partitions");
            }
        }
    } catch (NullPointerException ex) {
        log.error("Not Supported file format");
        ex.printStackTrace();
        System.exit(1);
    }
    log.info("Input dataset(s) was loaded in " + (System.currentTimeMillis() - startTime) + " msec");
    return rowRDD;
}
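The interesting part of this reader is the rule at the end: shrink with coalesce (no shuffle) when the requested partition count is below the current one, and fall back to repartition (full shuffle) when the count has to grow. A minimal standalone sketch of just that rule, with illustrative class and method names that are not taken from GeoTriples:

import org.apache.spark.api.java.JavaRDD;

public final class PartitionResizer {

    private PartitionResizer() {}

    /**
     * Resize an RDD to the requested number of partitions:
     * coalesce when shrinking (avoids a shuffle), repartition when growing (requires one).
     */
    public static <T> JavaRDD<T> resize(JavaRDD<T> rdd, int targetPartitions) {
        if (targetPartitions <= 0) {
            return rdd; // nothing requested; keep the input partitioning
        }
        int current = rdd.getNumPartitions();
        if (current > targetPartitions) {
            return rdd.coalesce(targetPartitions);    // narrow dependency, no shuffle
        }
        if (current < targetPartitions) {
            return rdd.repartition(targetPartitions); // needs a shuffle to add partitions
        }
        return rdd; // already the requested size
    }
}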
Example 9
Source File: PathSeqPipelineSpark.java From gatk with BSD 3-Clause "New" or "Revised" License | 4 votes |
@Override
protected void runTool(final JavaSparkContext ctx) {
    filterArgs.doReadFilterArgumentWarnings(getCommandLineParser().getPluginDescriptor(GATKReadFilterPluginDescriptor.class), logger);
    SAMFileHeader header = PSUtils.checkAndClearHeaderSequences(getHeaderForReads(), filterArgs, logger);

    //Do not allow use of numReducers
    if (numReducers > 0) {
        throw new UserException.BadInput("Use --readsPerPartitionOutput instead of --num-reducers.");
    }

    //Filter
    final Tuple2<JavaRDD<GATKRead>, JavaRDD<GATKRead>> filterResult;
    final PSFilter filter = new PSFilter(ctx, filterArgs, header);
    try (final PSFilterLogger filterLogger = filterArgs.filterMetricsFileUri != null
            ? new PSFilterFileLogger(getMetricsFile(), filterArgs.filterMetricsFileUri)
            : new PSFilterEmptyLogger()) {
        final JavaRDD<GATKRead> inputReads = getReads();
        filterResult = filter.doFilter(inputReads, filterLogger);
    }
    JavaRDD<GATKRead> pairedReads = filterResult._1;
    JavaRDD<GATKRead> unpairedReads = filterResult._2;

    //Counting forces an action on the RDDs to guarantee we're done with the Bwa image and kmer filter
    final long numPairedReads = pairedReads.count();
    final long numUnpairedReads = unpairedReads.count();
    final long numTotalReads = numPairedReads + numUnpairedReads;

    //Closes Bwa image, kmer filter, and metrics file if used
    //Note the host Bwa image must be unloaded before trying to load the pathogen image
    filter.close();

    //Rebalance partitions using the counts
    final int numPairedPartitions = 1 + (int) (numPairedReads / readsPerPartition);
    final int numUnpairedPartitions = 1 + (int) (numUnpairedReads / readsPerPartition);
    pairedReads = repartitionPairedReads(pairedReads, numPairedPartitions, numPairedReads);
    unpairedReads = unpairedReads.repartition(numUnpairedPartitions);

    //Bwa pathogen alignment
    final PSBwaAlignerSpark aligner = new PSBwaAlignerSpark(ctx, bwaArgs);
    PSBwaUtils.addReferenceSequencesToHeader(header, bwaArgs.microbeDictionary);
    final Broadcast<SAMFileHeader> headerBroadcast = ctx.broadcast(header);
    JavaRDD<GATKRead> alignedPairedReads = aligner.doBwaAlignment(pairedReads, true, headerBroadcast);
    JavaRDD<GATKRead> alignedUnpairedReads = aligner.doBwaAlignment(unpairedReads, false, headerBroadcast);

    //Cache this expensive result. Note serialization significantly reduces memory consumption.
    alignedPairedReads.persist(StorageLevel.MEMORY_AND_DISK_SER());
    alignedUnpairedReads.persist(StorageLevel.MEMORY_AND_DISK_SER());

    //Score pathogens
    final PSScorer scorer = new PSScorer(scoreArgs);
    final JavaRDD<GATKRead> readsFinal = scorer.scoreReads(ctx, alignedPairedReads, alignedUnpairedReads, header);

    //Clean up header
    header = PSBwaUtils.removeUnmappedHeaderSequences(header, readsFinal, logger);

    //Log read counts
    if (scoreArgs.scoreMetricsFileUri != null) {
        try (final PSScoreLogger scoreLogger = new PSScoreFileLogger(getMetricsFile(), scoreArgs.scoreMetricsFileUri)) {
            scoreLogger.logReadCounts(readsFinal);
        }
    }

    //Write reads to BAM, if specified
    if (outputPath != null) {
        try {
            //Reduce number of partitions since we previously went to ~5K reads per partition, which
            // is far too small for sharded output.
            final int numPartitions = Math.max(1, (int) (numTotalReads / readsPerPartitionOutput));
            final JavaRDD<GATKRead> readsFinalRepartitioned = readsFinal.coalesce(numPartitions, false);
            ReadsSparkSink.writeReads(ctx, outputPath, null, readsFinalRepartitioned, header,
                    shardedOutput ? ReadsWriteFormat.SHARDED : ReadsWriteFormat.SINGLE,
                    numPartitions, shardedPartsDir, true, splittingIndexGranularity);
        } catch (final IOException e) {
            throw new UserException.CouldNotCreateOutputFile(outputPath, "writing failed", e);
        }
    }
    aligner.close();
}
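The final coalesce in this pipeline keeps shuffle = false: at that point the goal is only to merge the many small post-alignment partitions into numPartitions larger ones before writing the BAM, so a full shuffle of the read set would be unnecessary work.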
Example 10
Source File: SparkStorageUtils.java From DataVec with Apache License 2.0 | 3 votes |
/**
 * Save a {@code JavaRDD<List<Writable>>} to a Hadoop {@link org.apache.hadoop.io.SequenceFile}. Each record is given
 * a unique (but noncontiguous) {@link LongWritable} key, and values are stored as {@link RecordWritable} instances.
 * <p>
 * Use {@link #restoreSequenceFile(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the sequence file
 * @param rdd            RDD to save
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output sequence files
 * @see #saveSequenceFileSequences(String, JavaRDD)
 * @see #saveMapFile(String, JavaRDD)
 */
public static void saveSequenceFile(String path, JavaRDD<List<Writable>> rdd, Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<Writable>, Long> dataIndexPairs = rdd.zipWithUniqueId(); //Note: Long values are unique + NOT contiguous; more efficient than zipWithIndex
    JavaPairRDD<LongWritable, RecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new RecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, RecordWritable.class, SequenceFileOutputFormat.class);
}
Example 11
Source File: SparkStorageUtils.java From DataVec with Apache License 2.0 | 3 votes |
/**
 * Save a {@code JavaRDD<List<Writable>>} to a Hadoop {@link org.apache.hadoop.io.MapFile}. Each record is
 * given a <i>unique and contiguous</i> {@link LongWritable} key, and values are stored as
 * {@link RecordWritable} instances.<br>
 * <b>Note</b>: If contiguous keys are not required, using a sequence file instead is preferable from a performance
 * point of view. Contiguous keys are often only required for non-Spark use cases, such as with
 * {@link org.datavec.hadoop.records.reader.mapfile.MapFileRecordReader}
 * <p>
 * Use {@link #restoreMapFileSequences(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the MapFile
 * @param rdd            RDD to save
 * @param c              Configuration object, used to customise options for the map file
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output map files
 * @see #saveMapFileSequences(String, JavaRDD)
 * @see #saveSequenceFile(String, JavaRDD)
 */
public static void saveMapFile(String path, JavaRDD<List<Writable>> rdd, Configuration c, Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<Writable>, Long> dataIndexPairs = rdd.zipWithIndex(); //Note: Long values are unique + contiguous, but requires a count
    JavaPairRDD<LongWritable, RecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new RecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, RecordWritable.class, MapFileOutputFormat.class, c);
}
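The map-file variants use zipWithIndex rather than zipWithUniqueId because MapFile keys must be contiguous and ordered; as the inline comment notes, zipWithIndex pays for that with an extra Spark job to count the records in each partition, which is why the javadoc steers you toward the sequence-file methods when contiguous keys are not needed.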
Example 12
Source File: SparkStorageUtils.java From DataVec with Apache License 2.0 | 3 votes |
/**
 * Save a {@code JavaRDD<List<List<Writable>>>} to a Hadoop {@link org.apache.hadoop.io.MapFile}. Each record is
 * given a <i>unique and contiguous</i> {@link LongWritable} key, and values are stored as
 * {@link SequenceRecordWritable} instances.<br>
 * <b>Note</b>: If contiguous keys are not required, using a sequence file instead is preferable from a performance
 * point of view. Contiguous keys are often only required for non-Spark use cases, such as with
 * {@link org.datavec.hadoop.records.reader.mapfile.MapFileSequenceRecordReader}<br>
 * <p>
 * Use {@link #restoreMapFileSequences(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the MapFile
 * @param rdd            RDD to save
 * @param c              Configuration object, used to customise options for the map file
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output map files
 * @see #saveMapFileSequences(String, JavaRDD)
 * @see #saveSequenceFile(String, JavaRDD)
 */
public static void saveMapFileSequences(String path, JavaRDD<List<List<Writable>>> rdd, Configuration c,
                Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<List<Writable>>, Long> dataIndexPairs = rdd.zipWithIndex();
    JavaPairRDD<LongWritable, SequenceRecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new SequenceRecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, SequenceRecordWritable.class,
            MapFileOutputFormat.class, c);
}
Example 13
Source File: SparkStorageUtils.java From deeplearning4j with Apache License 2.0 | 3 votes |
/**
 * Save a {@code JavaRDD<List<Writable>>} to a Hadoop {@link org.apache.hadoop.io.SequenceFile}. Each record is given
 * a unique (but noncontiguous) {@link LongWritable} key, and values are stored as {@link RecordWritable} instances.
 * <p>
 * Use {@link #restoreSequenceFile(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the sequence file
 * @param rdd            RDD to save
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output sequence files
 * @see #saveSequenceFileSequences(String, JavaRDD)
 * @see #saveMapFile(String, JavaRDD)
 */
public static void saveSequenceFile(String path, JavaRDD<List<Writable>> rdd, Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<Writable>, Long> dataIndexPairs = rdd.zipWithUniqueId(); //Note: Long values are unique + NOT contiguous; more efficient than zipWithIndex
    JavaPairRDD<LongWritable, RecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new RecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, RecordWritable.class, SequenceFileOutputFormat.class);
}
Example 14
Source File: SparkStorageUtils.java From deeplearning4j with Apache License 2.0 | 3 votes |
/**
 * Save a {@code JavaRDD<List<Writable>>} to a Hadoop {@link org.apache.hadoop.io.MapFile}. Each record is
 * given a <i>unique and contiguous</i> {@link LongWritable} key, and values are stored as
 * {@link RecordWritable} instances.<br>
 * <b>Note</b>: If contiguous keys are not required, using a sequence file instead is preferable from a performance
 * point of view. Contiguous keys are often only required for non-Spark use cases, such as with
 * {@link org.datavec.hadoop.records.reader.mapfile.MapFileRecordReader}
 * <p>
 * Use {@link #restoreMapFileSequences(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the MapFile
 * @param rdd            RDD to save
 * @param c              Configuration object, used to customise options for the map file
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output map files
 * @see #saveMapFileSequences(String, JavaRDD)
 * @see #saveSequenceFile(String, JavaRDD)
 */
public static void saveMapFile(String path, JavaRDD<List<Writable>> rdd, Configuration c, Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<Writable>, Long> dataIndexPairs = rdd.zipWithIndex(); //Note: Long values are unique + contiguous, but requires a count
    JavaPairRDD<LongWritable, RecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new RecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, RecordWritable.class, MapFileOutputFormat.class, c);
}
Example 15
Source File: SparkStorageUtils.java From deeplearning4j with Apache License 2.0 | 3 votes |
/**
 * Save a {@code JavaRDD<List<List<Writable>>>} to a Hadoop {@link org.apache.hadoop.io.MapFile}. Each record is
 * given a <i>unique and contiguous</i> {@link LongWritable} key, and values are stored as
 * {@link SequenceRecordWritable} instances.<br>
 * <b>Note</b>: If contiguous keys are not required, using a sequence file instead is preferable from a performance
 * point of view. Contiguous keys are often only required for non-Spark use cases, such as with
 * {@link org.datavec.hadoop.records.reader.mapfile.MapFileSequenceRecordReader}<br>
 * <p>
 * Use {@link #restoreMapFileSequences(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the MapFile
 * @param rdd            RDD to save
 * @param c              Configuration object, used to customise options for the map file
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output map files
 * @see #saveMapFileSequences(String, JavaRDD)
 * @see #saveSequenceFile(String, JavaRDD)
 */
public static void saveMapFileSequences(String path, JavaRDD<List<List<Writable>>> rdd, Configuration c,
                Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<List<Writable>>, Long> dataIndexPairs = rdd.zipWithIndex();
    JavaPairRDD<LongWritable, SequenceRecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new SequenceRecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, SequenceRecordWritable.class,
            MapFileOutputFormat.class, c);
}