Java Code Examples for org.apache.spark.api.java.JavaPairRDD#saveAsNewAPIHadoopFile()
The following examples show how to use org.apache.spark.api.java.JavaPairRDD#saveAsNewAPIHadoopFile(). Each example is taken from an open-source project; the source file, project, and license are noted above the code.
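Before the project-specific examples, here is a minimal sketch of the call itself. It is not taken from any of the projects below; the class name, the local master, the /tmp output paths, and the toy word pairs are illustrative assumptions. It shows the two overloads used throughout the examples: one that relies on the context's Hadoop configuration and one that takes an explicit Configuration.

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SaveAsNewApiHadoopFileSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("saveAsNewAPIHadoopFile-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // A small pair RDD of Hadoop Writable key/value types.
            JavaPairRDD<Text, IntWritable> pairs = sc.parallelize(Arrays.asList("a", "b", "a"))
                    .mapToPair(s -> new Tuple2<>(new Text(s), new IntWritable(1)));

            // Overload without a Configuration: key class, value class and a new-API OutputFormat.
            // The output directory must not already exist.
            pairs.saveAsNewAPIHadoopFile("/tmp/sketch-text", Text.class, IntWritable.class, TextOutputFormat.class);

            // Overload with an explicit Configuration, used by several examples below to pass
            // OutputFormat-specific settings along with the write.
            pairs.saveAsNewAPIHadoopFile("/tmp/sketch-seq", Text.class, IntWritable.class,
                    SequenceFileOutputFormat.class, sc.hadoopConfiguration());
        }
    }
}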
Example 1
Source File: ReadsSparkSink.java From gatk with BSD 3-Clause "New" or "Revised" License
private static void writeReadsADAM(
        final JavaSparkContext ctx, final String outputFile, final JavaRDD<SAMRecord> reads,
        final SAMFileHeader header) throws IOException {
    final SequenceDictionary seqDict = SequenceDictionary.fromSAMSequenceDictionary(header.getSequenceDictionary());
    final ReadGroupDictionary readGroups = ReadGroupDictionary.fromSAMHeader(header);
    final JavaPairRDD<Void, AlignmentRecord> rddAlignmentRecords =
            reads.map(read -> {
                read.setHeaderStrict(header);
                AlignmentRecord alignmentRecord = GATKReadToBDGAlignmentRecordConverter.convert(read, seqDict, readGroups);
                read.setHeaderStrict(null); // Restore the header to its previous state so as not to surprise the caller
                return alignmentRecord;
            }).mapToPair(alignmentRecord -> new Tuple2<>(null, alignmentRecord));
    // instantiating a Job is necessary here in order to set the Hadoop Configuration...
    final Job job = Job.getInstance(ctx.hadoopConfiguration());
    // ...here, which sets a config property that the AvroParquetOutputFormat needs when writing data. Specifically,
    // we are writing the Avro schema to the Configuration as a JSON string. The AvroParquetOutputFormat class knows
    // how to translate objects in the Avro data model to the Parquet primitives that get written.
    AvroParquetOutputFormat.setSchema(job, AlignmentRecord.getClassSchema());
    deleteHadoopFile(outputFile, ctx.hadoopConfiguration());
    rddAlignmentRecords.saveAsNewAPIHadoopFile(
            outputFile, Void.class, AlignmentRecord.class, AvroParquetOutputFormat.class, job.getConfiguration());
}
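The GATK method above only writes the Parquet output. Purely as a sketch of the round trip (not part of the GATK source), records written this way can typically be read back through the new Hadoop API with parquet-avro's AvroParquetInputFormat; the fragment below reuses ctx, outputFile, and AlignmentRecord from the method above and assumes parquet-avro is on the classpath.

    // Hypothetical read-back of the Parquet output written above.
    final Job readJob = Job.getInstance(ctx.hadoopConfiguration());
    final JavaRDD<AlignmentRecord> readBack =
            ctx.newAPIHadoopFile(outputFile,
                    AvroParquetInputFormat.class,   // org.apache.parquet.avro.AvroParquetInputFormat
                    Void.class, AlignmentRecord.class,
                    readJob.getConfiguration())
               .values();                           // drop the Void keys, keep the Avro records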
Example 2
Source File: SparkStorageUtils.java From deeplearning4j with Apache License 2.0
/**
 * Save a {@code JavaRDD<List<List<Writable>>>} to a Hadoop {@link org.apache.hadoop.io.SequenceFile}. Each record
 * is given a unique (but noncontiguous) {@link LongWritable} key, and values are stored as
 * {@link SequenceRecordWritable} instances.
 * <p>
 * Use {@link #restoreSequenceFileSequences(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the sequence file
 * @param rdd            RDD to save
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output sequence files
 * @see #saveSequenceFile(String, JavaRDD)
 * @see #saveMapFileSequences(String, JavaRDD)
 */
public static void saveSequenceFileSequences(String path, JavaRDD<List<List<Writable>>> rdd, Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<List<Writable>>, Long> dataIndexPairs =
            rdd.zipWithUniqueId(); //Note: Long values are unique + NOT contiguous; more efficient than zipWithIndex
    JavaPairRDD<LongWritable, SequenceRecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new SequenceRecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, SequenceRecordWritable.class,
            SequenceFileOutputFormat.class);
}
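As the Javadoc notes, restoreSequenceFileSequences(path, sc) is the library's own way to load the data back. Purely as an illustration of the matching read-side call (an assumption, not the DataVec implementation), the same files can also be opened directly with the new-API SequenceFileInputFormat, given a JavaSparkContext sc:

    // Illustrative direct read of the SequenceFile written above.
    JavaPairRDD<LongWritable, SequenceRecordWritable> restored =
            sc.newAPIHadoopFile(path,
                    SequenceFileInputFormat.class,  // org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
                    LongWritable.class, SequenceRecordWritable.class,
                    sc.hadoopConfiguration());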
Example 3
Source File: SparkStorageUtils.java From DataVec with Apache License 2.0
/**
 * Save a {@code JavaRDD<List<List<Writable>>>} to a Hadoop {@link org.apache.hadoop.io.SequenceFile}. Each record
 * is given a unique (but noncontiguous) {@link LongWritable} key, and values are stored as
 * {@link SequenceRecordWritable} instances.
 * <p>
 * Use {@link #restoreSequenceFileSequences(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the sequence file
 * @param rdd            RDD to save
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output sequence files
 * @see #saveSequenceFile(String, JavaRDD)
 * @see #saveMapFileSequences(String, JavaRDD)
 */
public static void saveSequenceFileSequences(String path, JavaRDD<List<List<Writable>>> rdd, Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<List<Writable>>, Long> dataIndexPairs =
            rdd.zipWithUniqueId(); //Note: Long values are unique + NOT contiguous; more efficient than zipWithIndex
    JavaPairRDD<LongWritable, SequenceRecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new SequenceRecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, SequenceRecordWritable.class,
            SequenceFileOutputFormat.class);
}
Example 4
Source File: TestPairSequenceRecordReaderBytesFunction.java From DataVec with Apache License 2.0
@Test
public void test() throws Exception {
    //Goal: combine separate files together into a hadoop sequence file, for later parsing by a SequenceRecordReader
    //For example: use to combine input and labels data from separate files for training a RNN
    JavaSparkContext sc = getContext();

    ClassPathResource cpr = new ClassPathResource("/video/shapes_0.mp4");
    String path = cpr.getFile().getAbsolutePath();
    String folder = path.substring(0, path.length() - 12);
    path = folder + "*";

    PathToKeyConverter pathConverter = new PathToKeyConverterFilename();
    JavaPairRDD<Text, BytesPairWritable> toWrite =
            DataVecSparkUtil.combineFilesForSequenceFile(sc, path, path, pathConverter);

    Path p = Files.createTempDirectory("dl4j_rrbytesPairOut");
    p.toFile().deleteOnExit();
    String outPath = p.toString() + "/out";
    new File(outPath).deleteOnExit();
    toWrite.saveAsNewAPIHadoopFile(outPath, Text.class, BytesPairWritable.class, SequenceFileOutputFormat.class);

    //Load back into memory:
    JavaPairRDD<Text, BytesPairWritable> fromSeq = sc.sequenceFile(outPath, Text.class, BytesPairWritable.class);

    SequenceRecordReader srr1 = getReader();
    SequenceRecordReader srr2 = getReader();
    PairSequenceRecordReaderBytesFunction psrbf = new PairSequenceRecordReaderBytesFunction(srr1, srr2);

    JavaRDD<Tuple2<List<List<Writable>>, List<List<Writable>>>> writables = fromSeq.map(psrbf);
    List<Tuple2<List<List<Writable>>, List<List<Writable>>>> fromSequenceFile = writables.collect();

    //Load manually (single copy) and compare:
    InputSplit is = new FileSplit(new File(folder), new String[] {"mp4"}, true);
    SequenceRecordReader srr = getReader();
    srr.initialize(is);

    List<List<List<Writable>>> list = new ArrayList<>(4);
    while (srr.hasNext()) {
        list.add(srr.sequenceRecord());
    }
    assertEquals(4, list.size());
    assertEquals(4, fromSequenceFile.size());

    boolean[] found = new boolean[4];
    for (int i = 0; i < 4; i++) {
        int foundIndex = -1;
        Tuple2<List<List<Writable>>, List<List<Writable>>> tuple2 = fromSequenceFile.get(i);
        List<List<Writable>> seq1 = tuple2._1();
        List<List<Writable>> seq2 = tuple2._2();
        assertEquals(seq1, seq2);
        for (int j = 0; j < 4; j++) {
            if (seq1.equals(list.get(j))) {
                if (foundIndex != -1)
                    fail(); //Already found this value -> suggests this spark value equals two or more of local version? (Shouldn't happen)
                foundIndex = j;
                if (found[foundIndex])
                    fail(); //One of the other spark values was equal to this one -> suggests duplicates in Spark list
                found[foundIndex] = true; //mark this one as seen before
            }
        }
    }
    int count = 0;
    for (boolean b : found)
        if (b)
            count++;
    assertEquals(4, count); //Expect all 4 and exactly 4 pairwise matches between spark and local versions
}
Example 5
Source File: TestSequenceRecordReaderBytesFunction.java From deeplearning4j with Apache License 2.0
@Test
public void testRecordReaderBytesFunction() throws Exception {
    //Local file path
    File f = testDir.newFolder();
    new ClassPathResource("datavec-spark/video/").copyDirectory(f);
    String path = f.getAbsolutePath() + "/*";

    //Load binary data from local file system, convert to a sequence file:
    //Load and convert
    JavaPairRDD<String, PortableDataStream> origData = sc.binaryFiles(path);
    JavaPairRDD<Text, BytesWritable> filesAsBytes = origData.mapToPair(new FilesAsBytesFunction());
    //Write the sequence file:
    Path p = Files.createTempDirectory("dl4j_rrbytesTest");
    p.toFile().deleteOnExit();
    String outPath = p.toString() + "/out";
    filesAsBytes.saveAsNewAPIHadoopFile(outPath, Text.class, BytesWritable.class, SequenceFileOutputFormat.class);

    //Load data from sequence file, parse via SequenceRecordReader:
    JavaPairRDD<Text, BytesWritable> fromSeqFile = sc.sequenceFile(outPath, Text.class, BytesWritable.class);
    SequenceRecordReader seqRR = new CodecRecordReader();
    Configuration conf = new Configuration();
    conf.set(CodecRecordReader.RAVEL, "true");
    conf.set(CodecRecordReader.START_FRAME, "0");
    conf.set(CodecRecordReader.TOTAL_FRAMES, "25");
    conf.set(CodecRecordReader.ROWS, "64");
    conf.set(CodecRecordReader.COLUMNS, "64");
    Configuration confCopy = new Configuration(conf);
    seqRR.setConf(conf);
    JavaRDD<List<List<Writable>>> dataVecData = fromSeqFile.map(new SequenceRecordReaderBytesFunction(seqRR));

    //Next: do the same thing locally, and compare the results
    InputSplit is = new FileSplit(f, new String[] {"mp4"}, true);
    SequenceRecordReader srr = new CodecRecordReader();
    srr.initialize(is);
    srr.setConf(confCopy);

    List<List<List<Writable>>> list = new ArrayList<>(4);
    while (srr.hasNext()) {
        list.add(srr.sequenceRecord());
    }
    assertEquals(4, list.size());

    List<List<List<Writable>>> fromSequenceFile = dataVecData.collect();

    assertEquals(4, list.size());
    assertEquals(4, fromSequenceFile.size());

    boolean[] found = new boolean[4];
    for (int i = 0; i < 4; i++) {
        int foundIndex = -1;
        List<List<Writable>> collection = fromSequenceFile.get(i);
        for (int j = 0; j < 4; j++) {
            if (collection.equals(list.get(j))) {
                if (foundIndex != -1)
                    fail(); //Already found this value -> suggests this spark value equals two or more of local version? (Shouldn't happen)
                foundIndex = j;
                if (found[foundIndex])
                    fail(); //One of the other spark values was equal to this one -> suggests duplicates in Spark list
                found[foundIndex] = true; //mark this one as seen before
            }
        }
    }
    int count = 0;
    for (boolean b : found)
        if (b)
            count++;
    assertEquals(4, count); //Expect all 4 and exactly 4 pairwise matches between spark and local versions
}
Example 6
Source File: TestPairSequenceRecordReaderBytesFunction.java From deeplearning4j with Apache License 2.0
@Test
public void test() throws Exception {
    //Goal: combine separate files together into a hadoop sequence file, for later parsing by a SequenceRecordReader
    //For example: use to combine input and labels data from separate files for training a RNN
    JavaSparkContext sc = getContext();

    File f = testDir.newFolder();
    new ClassPathResource("datavec-spark/video/").copyDirectory(f);
    String path = f.getAbsolutePath() + "/*";

    PathToKeyConverter pathConverter = new PathToKeyConverterFilename();
    JavaPairRDD<Text, BytesPairWritable> toWrite =
            DataVecSparkUtil.combineFilesForSequenceFile(sc, path, path, pathConverter);

    Path p = Files.createTempDirectory("dl4j_rrbytesPairOut");
    p.toFile().deleteOnExit();
    String outPath = p.toString() + "/out";
    new File(outPath).deleteOnExit();
    toWrite.saveAsNewAPIHadoopFile(outPath, Text.class, BytesPairWritable.class, SequenceFileOutputFormat.class);

    //Load back into memory:
    JavaPairRDD<Text, BytesPairWritable> fromSeq = sc.sequenceFile(outPath, Text.class, BytesPairWritable.class);

    SequenceRecordReader srr1 = getReader();
    SequenceRecordReader srr2 = getReader();
    PairSequenceRecordReaderBytesFunction psrbf = new PairSequenceRecordReaderBytesFunction(srr1, srr2);

    JavaRDD<Tuple2<List<List<Writable>>, List<List<Writable>>>> writables = fromSeq.map(psrbf);
    List<Tuple2<List<List<Writable>>, List<List<Writable>>>> fromSequenceFile = writables.collect();

    //Load manually (single copy) and compare:
    InputSplit is = new FileSplit(f, new String[] {"mp4"}, true);
    SequenceRecordReader srr = getReader();
    srr.initialize(is);

    List<List<List<Writable>>> list = new ArrayList<>(4);
    while (srr.hasNext()) {
        list.add(srr.sequenceRecord());
    }
    assertEquals(4, list.size());
    assertEquals(4, fromSequenceFile.size());

    boolean[] found = new boolean[4];
    for (int i = 0; i < 4; i++) {
        int foundIndex = -1;
        Tuple2<List<List<Writable>>, List<List<Writable>>> tuple2 = fromSequenceFile.get(i);
        List<List<Writable>> seq1 = tuple2._1();
        List<List<Writable>> seq2 = tuple2._2();
        assertEquals(seq1, seq2);
        for (int j = 0; j < 4; j++) {
            if (seq1.equals(list.get(j))) {
                if (foundIndex != -1)
                    fail(); //Already found this value -> suggests this spark value equals two or more of local version? (Shouldn't happen)
                foundIndex = j;
                if (found[foundIndex])
                    fail(); //One of the other spark values was equal to this one -> suggests duplicates in Spark list
                found[foundIndex] = true; //mark this one as seen before
            }
        }
    }
    int count = 0;
    for (boolean b : found)
        if (b)
            count++;
    assertEquals(4, count); //Expect all 4 and exactly 4 pairwise matches between spark and local versions
}
Example 7
Source File: TestDataVecDataSetFunctions.java From deeplearning4j with Apache License 2.0
@Test
public void testDataVecSequencePairDataSetFunction() throws Exception {
    JavaSparkContext sc = getContext();

    File f = testDir.newFolder();
    ClassPathResource cpr = new ClassPathResource("dl4j-spark/csvsequence/");
    cpr.copyDirectory(f);
    String path = f.getAbsolutePath() + "/*";

    PathToKeyConverter pathConverter = new PathToKeyConverterFilename();
    JavaPairRDD<Text, BytesPairWritable> toWrite =
            DataVecSparkUtil.combineFilesForSequenceFile(sc, path, path, pathConverter);

    Path p = testDir.newFolder("dl4j_testSeqPairFn").toPath();
    p.toFile().deleteOnExit();
    String outPath = p.toString() + "/out";
    new File(outPath).deleteOnExit();
    toWrite.saveAsNewAPIHadoopFile(outPath, Text.class, BytesPairWritable.class, SequenceFileOutputFormat.class);

    //Load from sequence file:
    JavaPairRDD<Text, BytesPairWritable> fromSeq = sc.sequenceFile(outPath, Text.class, BytesPairWritable.class);

    SequenceRecordReader srr1 = new CSVSequenceRecordReader(1, ",");
    SequenceRecordReader srr2 = new CSVSequenceRecordReader(1, ",");
    PairSequenceRecordReaderBytesFunction psrbf = new PairSequenceRecordReaderBytesFunction(srr1, srr2);

    JavaRDD<Tuple2<List<List<Writable>>, List<List<Writable>>>> writables = fromSeq.map(psrbf);

    //Map to DataSet:
    DataVecSequencePairDataSetFunction pairFn = new DataVecSequencePairDataSetFunction();
    JavaRDD<DataSet> data = writables.map(pairFn);
    List<DataSet> sparkData = data.collect();

    //Now: do the same thing locally (SequenceRecordReaderDataSetIterator) and compare
    String featuresPath = FilenameUtils.concat(f.getAbsolutePath(), "csvsequence_%d.txt");

    SequenceRecordReader featureReader = new CSVSequenceRecordReader(1, ",");
    SequenceRecordReader labelReader = new CSVSequenceRecordReader(1, ",");
    featureReader.initialize(new NumberedFileInputSplit(featuresPath, 0, 2));
    labelReader.initialize(new NumberedFileInputSplit(featuresPath, 0, 2));

    SequenceRecordReaderDataSetIterator iter =
            new SequenceRecordReaderDataSetIterator(featureReader, labelReader, 1, -1, true);

    List<DataSet> localData = new ArrayList<>(3);
    while (iter.hasNext())
        localData.add(iter.next());

    assertEquals(3, sparkData.size());
    assertEquals(3, localData.size());

    for (int i = 0; i < 3; i++) {
        //Check shapes etc.; data set order may differ for spark vs. local
        DataSet dsSpark = sparkData.get(i);
        DataSet dsLocal = localData.get(i);

        assertNull(dsSpark.getFeaturesMaskArray());
        assertNull(dsSpark.getLabelsMaskArray());

        INDArray fSpark = dsSpark.getFeatures();
        INDArray fLocal = dsLocal.getFeatures();
        INDArray lSpark = dsSpark.getLabels();
        INDArray lLocal = dsLocal.getLabels();

        val s = new long[] {1, 3, 4}; //1 example, 3 values, 4 time steps
        assertArrayEquals(s, fSpark.shape());
        assertArrayEquals(s, fLocal.shape());
        assertArrayEquals(s, lSpark.shape());
        assertArrayEquals(s, lLocal.shape());
    }

    //Check that results are the same (order notwithstanding)
    boolean[] found = new boolean[3];
    for (int i = 0; i < 3; i++) {
        int foundIndex = -1;
        DataSet ds = sparkData.get(i);
        for (int j = 0; j < 3; j++) {
            if (ds.equals(localData.get(j))) {
                if (foundIndex != -1)
                    fail(); //Already found this value -> suggests this spark value equals two or more of local version? (Shouldn't happen)
                foundIndex = j;
                if (found[foundIndex])
                    fail(); //One of the other spark values was equal to this one -> suggests duplicates in Spark list
                found[foundIndex] = true; //mark this one as seen before
            }
        }
    }
    int count = 0;
    for (boolean b : found)
        if (b)
            count++;
    assertEquals(3, count); //Expect all 3 and exactly 3 pairwise matches between spark and local versions
}
Example 8
Source File: SparkColumnCardinality.java From kylin with Apache License 2.0
@Override
protected void execute(OptionsHelper optionsHelper) throws Exception {
    String tableName = optionsHelper.getOptionValue(OPTION_TABLE_NAME);
    String output = optionsHelper.getOptionValue(OPTION_OUTPUT);
    int columnCnt = Integer.valueOf(optionsHelper.getOptionValue(OPTION_COLUMN_COUNT));

    Class[] kryoClassArray = new Class[]{Class.forName("scala.reflect.ClassTag$$anon$1"),
            Class.forName("org.apache.kylin.engine.mr.steps.SelfDefineSortableKey")};

    SparkConf conf = new SparkConf().setAppName("Calculate table:" + tableName);
    //set spark.sql.catalogImplementation=hive; if it is not set, SparkSession cannot read hive metadata and throws "org.apache.spark.sql.AnalysisException"
    conf.set("spark.sql.catalogImplementation", "hive");
    //serialization conf
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    conf.set("spark.kryo.registrator", "org.apache.kylin.engine.spark.KylinKryoRegistrator");
    conf.set("spark.kryo.registrationRequired", "true").registerKryoClasses(kryoClassArray);

    KylinSparkJobListener jobListener = new KylinSparkJobListener();
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
        sc.sc().addSparkListener(jobListener);
        HadoopUtil.deletePath(sc.hadoopConfiguration(), new Path(output));
        // table will be loaded by spark sql, so isSequenceFile set false
        final JavaRDD<String[]> recordRDD = SparkUtil.hiveRecordInputRDD(false, sc, null, tableName);

        JavaPairRDD<Integer, Long> resultRdd = recordRDD.mapPartitionsToPair(new BuildHllCounter())
                .reduceByKey((x, y) -> {
                    x.merge(y);
                    return x;
                })
                .mapToPair(record -> {
                    return new Tuple2<>(record._1, record._2.getCountEstimate());
                })
                .sortByKey(true, 1)
                .cache();

        if (resultRdd.count() == 0) {
            ArrayList<Tuple2<Integer, Long>> list = new ArrayList<>();
            for (int i = 0; i < columnCnt; ++i) {
                list.add(new Tuple2<>(i, 0L));
            }
            JavaPairRDD<Integer, Long> nullRdd = sc.parallelizePairs(list).repartition(1);
            nullRdd.saveAsNewAPIHadoopFile(output, IntWritable.class, LongWritable.class, TextOutputFormat.class);
        } else {
            resultRdd.saveAsNewAPIHadoopFile(output, IntWritable.class, LongWritable.class, TextOutputFormat.class);
        }
    }
}
Example 9
Source File: TestSequenceRecordReaderBytesFunction.java From DataVec with Apache License 2.0
@Test
public void testRecordReaderBytesFunction() throws Exception {
    //Local file path
    ClassPathResource cpr = new ClassPathResource("/video/shapes_0.mp4");
    String path = cpr.getFile().getAbsolutePath();
    String folder = path.substring(0, path.length() - 12);
    path = folder + "*";

    //Load binary data from local file system, convert to a sequence file:
    //Load and convert
    JavaPairRDD<String, PortableDataStream> origData = sc.binaryFiles(path);
    JavaPairRDD<Text, BytesWritable> filesAsBytes = origData.mapToPair(new FilesAsBytesFunction());
    //Write the sequence file:
    Path p = Files.createTempDirectory("dl4j_rrbytesTest");
    p.toFile().deleteOnExit();
    String outPath = p.toString() + "/out";
    filesAsBytes.saveAsNewAPIHadoopFile(outPath, Text.class, BytesWritable.class, SequenceFileOutputFormat.class);

    //Load data from sequence file, parse via SequenceRecordReader:
    JavaPairRDD<Text, BytesWritable> fromSeqFile = sc.sequenceFile(outPath, Text.class, BytesWritable.class);
    SequenceRecordReader seqRR = new CodecRecordReader();
    Configuration conf = new Configuration();
    conf.set(CodecRecordReader.RAVEL, "true");
    conf.set(CodecRecordReader.START_FRAME, "0");
    conf.set(CodecRecordReader.TOTAL_FRAMES, "25");
    conf.set(CodecRecordReader.ROWS, "64");
    conf.set(CodecRecordReader.COLUMNS, "64");
    Configuration confCopy = new Configuration(conf);
    seqRR.setConf(conf);
    JavaRDD<List<List<Writable>>> dataVecData = fromSeqFile.map(new SequenceRecordReaderBytesFunction(seqRR));

    //Next: do the same thing locally, and compare the results
    InputSplit is = new FileSplit(new File(folder), new String[] {"mp4"}, true);
    SequenceRecordReader srr = new CodecRecordReader();
    srr.initialize(is);
    srr.setConf(confCopy);

    List<List<List<Writable>>> list = new ArrayList<>(4);
    while (srr.hasNext()) {
        list.add(srr.sequenceRecord());
    }
    assertEquals(4, list.size());

    List<List<List<Writable>>> fromSequenceFile = dataVecData.collect();

    assertEquals(4, list.size());
    assertEquals(4, fromSequenceFile.size());

    boolean[] found = new boolean[4];
    for (int i = 0; i < 4; i++) {
        int foundIndex = -1;
        List<List<Writable>> collection = fromSequenceFile.get(i);
        for (int j = 0; j < 4; j++) {
            if (collection.equals(list.get(j))) {
                if (foundIndex != -1)
                    fail(); //Already found this value -> suggests this spark value equals two or more of local version? (Shouldn't happen)
                foundIndex = j;
                if (found[foundIndex])
                    fail(); //One of the other spark values was equal to this one -> suggests duplicates in Spark list
                found[foundIndex] = true; //mark this one as seen before
            }
        }
    }
    int count = 0;
    for (boolean b : found)
        if (b)
            count++;
    assertEquals(4, count); //Expect all 4 and exactly 4 pairwise matches between spark and local versions
}
Example 10
Source File: SparkColumnCardinality.java From kylin-on-parquet-v2 with Apache License 2.0
@Override
protected void execute(OptionsHelper optionsHelper) throws Exception {
    String tableName = optionsHelper.getOptionValue(OPTION_TABLE_NAME);
    String output = optionsHelper.getOptionValue(OPTION_OUTPUT);
    int columnCnt = Integer.valueOf(optionsHelper.getOptionValue(OPTION_COLUMN_COUNT));

    Class[] kryoClassArray = new Class[]{Class.forName("scala.reflect.ClassTag$$anon$1"),
            Class.forName("org.apache.kylin.engine.mr.steps.SelfDefineSortableKey")};

    SparkConf conf = new SparkConf().setAppName("Calculate table:" + tableName);
    //set spark.sql.catalogImplementation=hive; if it is not set, SparkSession cannot read hive metadata and throws "org.apache.spark.sql.AnalysisException"
    conf.set("spark.sql.catalogImplementation", "hive");
    //serialization conf
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    conf.set("spark.kryo.registrator", "org.apache.kylin.engine.spark.KylinKryoRegistrator");
    conf.set("spark.kryo.registrationRequired", "true").registerKryoClasses(kryoClassArray);

    KylinSparkJobListener jobListener = new KylinSparkJobListener();
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
        sc.sc().addSparkListener(jobListener);
        HadoopUtil.deletePath(sc.hadoopConfiguration(), new Path(output));
        // table will be loaded by spark sql, so isSequenceFile set false
        final JavaRDD<String[]> recordRDD = SparkUtil.hiveRecordInputRDD(false, sc, null, tableName);

        JavaPairRDD<Integer, Long> resultRdd = recordRDD.mapPartitionsToPair(new BuildHllCounter())
                .reduceByKey((x, y) -> {
                    x.merge(y);
                    return x;
                })
                .mapToPair(record -> {
                    return new Tuple2<>(record._1, record._2.getCountEstimate());
                })
                .sortByKey(true, 1)
                .cache();

        if (resultRdd.count() == 0) {
            ArrayList<Tuple2<Integer, Long>> list = new ArrayList<>();
            for (int i = 0; i < columnCnt; ++i) {
                list.add(new Tuple2<>(i, 0L));
            }
            JavaPairRDD<Integer, Long> nullRdd = sc.parallelizePairs(list).repartition(1);
            nullRdd.saveAsNewAPIHadoopFile(output, IntWritable.class, LongWritable.class, TextOutputFormat.class);
        } else {
            resultRdd.saveAsNewAPIHadoopFile(output, IntWritable.class, LongWritable.class, TextOutputFormat.class);
        }
    }
}
Example 11
Source File: SQLQueryFastq.java From ViraPipe with MIT License
public static void main(String[] args) throws IOException {
    SparkConf conf = new SparkConf().setAppName("SQLQueryFastq");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    Options options = new Options();
    Option opOpt = new Option("out", true, "HDFS path for output files. If not present, the output files are not moved to HDFS.");
    Option queryOpt = new Option("query", true, "SQL query string.");
    Option samOpt = new Option("format", true, "parquet or fastq");
    Option baminOpt = new Option("in", true, "");
    options.addOption(new Option("tablename", true, "Default sql table name is 'records'"));
    options.addOption(opOpt);
    options.addOption(queryOpt);
    options.addOption(samOpt);
    options.addOption(baminOpt);

    CommandLineParser parser = new BasicParser();
    CommandLine cmd = null;
    try {
        // parse the command line arguments
        cmd = parser.parse(options, args);
    } catch (ParseException exp) {
        // oops, something went wrong
        System.err.println("Parsing failed. Reason: " + exp.getMessage());
    }

    String outDir = cmd.hasOption("out") ? cmd.getOptionValue("out") : null;
    String query = cmd.hasOption("query") ? cmd.getOptionValue("query") : null;
    String format = cmd.hasOption("format") ? cmd.getOptionValue("format") : "fastq";
    String in = cmd.hasOption("in") ? cmd.getOptionValue("in") : null;
    tablename = cmd.hasOption("tablename") ? cmd.getOptionValue("tablename") : "records";

    sc.hadoopConfiguration().setBoolean(BAMInputFormat.KEEP_PAIRED_READS_TOGETHER_PROPERTY, true);

    JavaPairRDD<Text, SequencedFragment> fastqRDD = sc.newAPIHadoopFile(in, FastqInputFormat.class,
            Text.class, SequencedFragment.class, sc.hadoopConfiguration());

    JavaRDD<MyRead> rdd = fastqRDD.map(record -> {
        MyRead read = new MyRead();
        read.setKey(record._1.toString());
        read.setSequence(record._2.getSequence().toString());
        read.setRead(record._2.getRead());
        read.setQuality(record._2.getQuality().toString());
        read.setTile(record._2.getTile());
        read.setXpos(record._2.getXpos());
        read.setYpos(record._2.getYpos());
        read.setRunNumber(record._2.getRunNumber());
        read.setInstrument(record._2.getInstrument());
        read.setFlowcellId(record._2.getFlowcellId());
        read.setLane(record._2.getLane());
        read.setControlNumber(record._2.getControlNumber());
        read.setFilterPassed(record._2.getFilterPassed());
        return read;
    });

    Dataset df = sqlContext.createDataFrame(rdd, MyRead.class);
    df.registerTempTable(tablename);
    //e.g. count duplicates: "SELECT count(DISTINCT(sequence)) FROM records"
    //"SELECT key,LEN(sequence) as l FROM records where l<100;"
    if (query != null) {
        //JavaRDD<MyAlignment> rdd = samRDD.map(bam -> new MyAlignment(bam.getReadName(), bam.getStart(), bam.getReferenceName(), bam.getReadLength(), new String(bam.getReadBases(), StandardCharsets.UTF_8), bam.getCigarString(), bam.getReadUnmappedFlag(), bam.getDuplicateReadFlag(), bam));
        //Save as parquet file
        Dataset<Row> resultDF = sqlContext.sql(query);
        resultDF.show(100, false);

        if (outDir != null) {
            if (format.equals("fastq")) {
                JavaPairRDD<Text, SequencedFragment> resultRDD = dfToFastqRDD(resultDF);
                resultRDD.saveAsNewAPIHadoopFile(outDir, Text.class, SequencedFragment.class,
                        FastqOutputFormat.class, sc.hadoopConfiguration());
            } else {
                resultDF.write().parquet(outDir);
            }
        }
    }
    sc.stop();
}
Example 12
Source File: MergeFastq.java From ViraPipe with MIT License
public static void main(String[] args) throws IOException {
    if (args.length < 3) { // input path, output path and partition count are all required
        System.err.println("Usage: MergeFastq <input path> <output path> <number of partitions>");
        System.exit(1);
    }

    SparkConf conf = new SparkConf().setAppName("MergeFastq");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaPairRDD<Text, SequencedFragment> fastqRDD = sc.newAPIHadoopFile(args[0], FastqInputFormat.class,
            Text.class, SequencedFragment.class, sc.hadoopConfiguration());

    JavaPairRDD<Text, SequencedFragment> coalesced = fastqRDD.coalesce(Integer.valueOf(args[2]));

    coalesced.saveAsNewAPIHadoopFile(args[1], Text.class, SequencedFragment.class,
            FastqOutputFormat.class, sc.hadoopConfiguration());

    sc.stop();
}
Example 13
Source File: SamToFastq.java From ViraPipe with MIT License
public static void main(String[] args) throws IOException {
    SparkConf conf = new SparkConf().setAppName("SamToFastq");
    sc = new JavaSparkContext(conf);

    String in = args[0];
    String out = args[1];

    JavaPairRDD<LongWritable, SAMRecordWritable> bamPairRDD = sc.newAPIHadoopFile(in, AnySAMInputFormat.class,
            LongWritable.class, SAMRecordWritable.class, sc.hadoopConfiguration());
    //Map to SAMRecord RDD
    JavaRDD<SAMRecord> samRDD = bamPairRDD.map(v1 -> v1._2().get());

    JavaPairRDD<Text, SequencedFragment> fastqrdd = mapSAMRecordsToFastq(samRDD);

    fastqrdd.saveAsNewAPIHadoopFile(out, Text.class, SequencedFragment.class,
            FastqOutputFormat.class, sc.hadoopConfiguration());

    sc.stop();
}
Example 14
Source File: RepartitionFastq.java From ViraPipe with MIT License
public static void main(String[] args) throws IOException {
    if (args.length < 3) { // input path, output path and partition count are all required
        System.err.println("Usage: RepartitionFastq <input path> <output path> <number of partitions>");
        System.exit(1);
    }

    SparkConf conf = new SparkConf().setAppName("RepartitionFastq");
    //conf.set("spark.default.parallelism", String.valueOf(args[2]));
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaPairRDD<Text, SequencedFragment> fastqRDD = sc.newAPIHadoopFile(args[0], FastqInputFormat.class,
            Text.class, SequencedFragment.class, sc.hadoopConfiguration());

    JavaPairRDD<Text, SequencedFragment> repartitioned = fastqRDD.repartition(Integer.valueOf(args[2]));

    repartitioned.saveAsNewAPIHadoopFile(args[1], Text.class, SequencedFragment.class,
            FastqOutputFormat.class, sc.hadoopConfiguration());

    sc.stop();
}
Example 15
Source File: SparkStorageUtils.java From DataVec with Apache License 2.0
/**
 * Save a {@code JavaRDD<List<List<Writable>>>} to a Hadoop {@link org.apache.hadoop.io.MapFile}. Each record is
 * given a <i>unique and contiguous</i> {@link LongWritable} key, and values are stored as
 * {@link SequenceRecordWritable} instances.<br>
 * <b>Note</b>: If contiguous keys are not required, using a sequence file instead is preferable from a performance
 * point of view. Contiguous keys are often only required for non-Spark use cases, such as with
 * {@link org.datavec.hadoop.records.reader.mapfile.MapFileSequenceRecordReader}<br>
 * <p>
 * Use {@link #restoreMapFileSequences(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the MapFile
 * @param rdd            RDD to save
 * @param c              Configuration object, used to customise options for the map file
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output map files
 * @see #saveMapFileSequences(String, JavaRDD)
 * @see #saveSequenceFile(String, JavaRDD)
 */
public static void saveMapFileSequences(String path, JavaRDD<List<List<Writable>>> rdd, Configuration c,
                Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<List<Writable>>, Long> dataIndexPairs = rdd.zipWithIndex();
    JavaPairRDD<LongWritable, SequenceRecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new SequenceRecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, SequenceRecordWritable.class,
            MapFileOutputFormat.class, c);
}
Example 16
Source File: SparkStorageUtils.java From DataVec with Apache License 2.0
/**
 * Save a {@code JavaRDD<List<Writable>>} to a Hadoop {@link org.apache.hadoop.io.MapFile}. Each record is
 * given a <i>unique and contiguous</i> {@link LongWritable} key, and values are stored as
 * {@link RecordWritable} instances.<br>
 * <b>Note</b>: If contiguous keys are not required, using a sequence file instead is preferable from a performance
 * point of view. Contiguous keys are often only required for non-Spark use cases, such as with
 * {@link org.datavec.hadoop.records.reader.mapfile.MapFileRecordReader}
 * <p>
 * Use {@link #restoreMapFile(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the MapFile
 * @param rdd            RDD to save
 * @param c              Configuration object, used to customise options for the map file
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output map files
 * @see #saveMapFileSequences(String, JavaRDD)
 * @see #saveSequenceFile(String, JavaRDD)
 */
public static void saveMapFile(String path, JavaRDD<List<Writable>> rdd, Configuration c, Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<Writable>, Long> dataIndexPairs =
            rdd.zipWithIndex(); //Note: Long values are unique + contiguous, but requires a count
    JavaPairRDD<LongWritable, RecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new RecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, RecordWritable.class, MapFileOutputFormat.class, c);
}
Example 17
Source File: SparkStorageUtils.java From DataVec with Apache License 2.0
/**
 * Save a {@code JavaRDD<List<Writable>>} to a Hadoop {@link org.apache.hadoop.io.SequenceFile}. Each record is given
 * a unique (but noncontiguous) {@link LongWritable} key, and values are stored as {@link RecordWritable} instances.
 * <p>
 * Use {@link #restoreSequenceFile(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the sequence file
 * @param rdd            RDD to save
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output sequence files
 * @see #saveSequenceFileSequences(String, JavaRDD)
 * @see #saveMapFile(String, JavaRDD)
 */
public static void saveSequenceFile(String path, JavaRDD<List<Writable>> rdd, Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<Writable>, Long> dataIndexPairs =
            rdd.zipWithUniqueId(); //Note: Long values are unique + NOT contiguous; more efficient than zipWithIndex
    JavaPairRDD<LongWritable, RecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new RecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, RecordWritable.class, SequenceFileOutputFormat.class);
}
Example 18
Source File: HDFSWriter.java From ViraPipe with MIT License
public void writeRecords(JavaRDD<SAMRecord> records, Broadcast<SAMFileHeader> header, String outpath, SparkContext sc) {
    JavaPairRDD<SAMRecord, SAMRecordWritable> bamWritableRDD = readsToWritable(records, header);

    //Distribute records to HDFS as BAM
    bamWritableRDD.saveAsNewAPIHadoopFile(outpath, SAMRecord.class, SAMRecordWritable.class,
            BAMHeaderOutputFormat.class, sc.hadoopConfiguration());
}
Example 19
Source File: SparkStorageUtils.java From deeplearning4j with Apache License 2.0
/**
 * Save a {@code JavaRDD<List<Writable>>} to a Hadoop {@link org.apache.hadoop.io.SequenceFile}. Each record is given
 * a unique (but noncontiguous) {@link LongWritable} key, and values are stored as {@link RecordWritable} instances.
 * <p>
 * Use {@link #restoreSequenceFile(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the sequence file
 * @param rdd            RDD to save
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output sequence files
 * @see #saveSequenceFileSequences(String, JavaRDD)
 * @see #saveMapFile(String, JavaRDD)
 */
public static void saveSequenceFile(String path, JavaRDD<List<Writable>> rdd, Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<Writable>, Long> dataIndexPairs =
            rdd.zipWithUniqueId(); //Note: Long values are unique + NOT contiguous; more efficient than zipWithIndex
    JavaPairRDD<LongWritable, RecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new RecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, RecordWritable.class, SequenceFileOutputFormat.class);
}
Example 20
Source File: SparkStorageUtils.java From deeplearning4j with Apache License 2.0
/**
 * Save a {@code JavaRDD<List<Writable>>} to a Hadoop {@link org.apache.hadoop.io.MapFile}. Each record is
 * given a <i>unique and contiguous</i> {@link LongWritable} key, and values are stored as
 * {@link RecordWritable} instances.<br>
 * <b>Note</b>: If contiguous keys are not required, using a sequence file instead is preferable from a performance
 * point of view. Contiguous keys are often only required for non-Spark use cases, such as with
 * {@link org.datavec.hadoop.records.reader.mapfile.MapFileRecordReader}
 * <p>
 * Use {@link #restoreMapFile(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the MapFile
 * @param rdd            RDD to save
 * @param c              Configuration object, used to customise options for the map file
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output map files
 * @see #saveMapFileSequences(String, JavaRDD)
 * @see #saveSequenceFile(String, JavaRDD)
 */
public static void saveMapFile(String path, JavaRDD<List<Writable>> rdd, Configuration c, Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    JavaPairRDD<List<Writable>, Long> dataIndexPairs =
            rdd.zipWithIndex(); //Note: Long values are unique + contiguous, but requires a count
    JavaPairRDD<LongWritable, RecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new RecordSavePrepPairFunction());

    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, RecordWritable.class, MapFileOutputFormat.class, c);
}