Java Code Examples for org.apache.spark.api.java.JavaRDD#takeSample()
The following examples show how to use org.apache.spark.api.java.JavaRDD#takeSample().
Example 1
Source File: AnalyzeSpark.java From DataVec with Apache License 2.0
/**
 * Randomly sample a set of invalid values from a specified column.
 * Values are considered invalid according to the Schema / ColumnMetaData
 *
 * @param numToSample   Maximum number of invalid values to sample
 * @param columnName    Name of the column from which to sample invalid values
 * @param schema        Data schema
 * @param data          Data
 * @param ignoreMissing If true: ignore missing values (NullWritable or empty/null string) when sampling.
 *                      If false: include missing values in sampling
 * @return List of invalid examples
 */
public static List<Writable> sampleInvalidFromColumn(int numToSample, String columnName, Schema schema,
        JavaRDD<List<Writable>> data, boolean ignoreMissing) {
    // First: filter out all valid entries, to leave only invalid entries
    int colIdx = schema.getIndexOfColumn(columnName);
    JavaRDD<Writable> ithColumn = data.map(new SelectColumnFunction(colIdx));

    ColumnMetaData meta = schema.getMetaData(columnName);
    JavaRDD<Writable> invalid = ithColumn.filter(new FilterWritablesBySchemaFunction(meta, false, ignoreMissing));

    // Sample without replacement; at most numToSample values are returned
    return invalid.takeSample(false, numToSample);
}
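Example 1 documents numToSample as a maximum. That fits sampling without replacement, which cannot return more elements than the data set holds. The sketch below is a plain-Java analogue of that behavior and is not Spark's implementation: the class TakeSampleSketch and the method sampleWithoutReplacement are hypothetical names introduced here for illustration, and the fixed seed stands in for the optional seed argument of takeSample(withReplacement, num, seed).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TakeSampleSketch {

    /**
     * Hypothetical helper (not part of Spark): draws up to num distinct
     * elements from data by shuffling a copy and keeping a prefix.
     */
    public static <T> List<T> sampleWithoutReplacement(List<T> data, int num, long seed) {
        if (num < 0) {
            throw new IllegalArgumentException("num must be non-negative");
        }
        // Copy first so the caller's list is left untouched
        List<T> copy = new ArrayList<>(data);
        Collections.shuffle(copy, new Random(seed));
        // Cap the sample size at the number of available elements
        int size = Math.min(num, copy.size());
        return new ArrayList<>(copy.subList(0, size));
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5);
        // Asking for 3 of 5 elements yields exactly 3
        System.out.println(sampleWithoutReplacement(values, 3, 42L).size());
        // Asking for 10 of 5 elements yields all 5, in shuffled order
        System.out.println(sampleWithoutReplacement(values, 10, 42L).size());
    }
}
```

Callers of the sampling methods on this page should therefore be prepared for a result shorter than the requested count when the (possibly filtered) data is small.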
Example 2
Source File: AnalyzeSpark.java From deeplearning4j with Apache License 2.0
/**
 * Randomly sample a set of invalid values from a specified column.
 * Values are considered invalid according to the Schema / ColumnMetaData
 *
 * @param numToSample   Maximum number of invalid values to sample
 * @param columnName    Name of the column from which to sample invalid values
 * @param schema        Data schema
 * @param data          Data
 * @param ignoreMissing If true: ignore missing values (NullWritable or empty/null string) when sampling.
 *                      If false: include missing values in sampling
 * @return List of invalid examples
 */
public static List<Writable> sampleInvalidFromColumn(int numToSample, String columnName, Schema schema,
        JavaRDD<List<Writable>> data, boolean ignoreMissing) {
    // First: filter out all valid entries, to leave only invalid entries
    int colIdx = schema.getIndexOfColumn(columnName);
    JavaRDD<Writable> ithColumn = data.map(new SelectColumnFunction(colIdx));

    ColumnMetaData meta = schema.getMetaData(columnName);
    JavaRDD<Writable> invalid = ithColumn.filter(new FilterWritablesBySchemaFunction(meta, false, ignoreMissing));

    // Sample without replacement; at most numToSample values are returned
    return invalid.takeSample(false, numToSample);
}
Example 3
Source File: AnalyzeSpark.java From DataVec with Apache License 2.0
/**
 * Randomly sample values from a single column
 *
 * @param count      Number of values to sample
 * @param columnName Name of the column to sample from
 * @param schema     Schema
 * @param data       Data to sample from
 * @return A list of random samples
 */
public static List<Writable> sampleFromColumn(int count, String columnName, Schema schema,
        JavaRDD<List<Writable>> data) {
    int colIdx = schema.getIndexOfColumn(columnName);
    JavaRDD<Writable> ithColumn = data.map(new SelectColumnFunction(colIdx));

    return ithColumn.takeSample(false, count);
}
Example 4
Source File: AnalyzeSpark.java From deeplearning4j with Apache License 2.0
/**
 * Randomly sample values from a single column
 *
 * @param count      Number of values to sample
 * @param columnName Name of the column to sample from
 * @param schema     Schema
 * @param data       Data to sample from
 * @return A list of random samples
 */
public static List<Writable> sampleFromColumn(int count, String columnName, Schema schema,
        JavaRDD<List<Writable>> data) {
    int colIdx = schema.getIndexOfColumn(columnName);
    JavaRDD<Writable> ithColumn = data.map(new SelectColumnFunction(colIdx));

    return ithColumn.takeSample(false, count);
}
Example 5
Source File: AnalyzeSpark.java From DataVec with Apache License 2.0
/**
 * Randomly sample a set of examples
 *
 * @param count Number of samples to generate
 * @param data  Data to sample from
 * @return Samples
 */
public static List<List<Writable>> sample(int count, JavaRDD<List<Writable>> data) {
    return data.takeSample(false, count);
}
Example 6
Source File: AnalyzeSpark.java From DataVec with Apache License 2.0
/**
 * Randomly sample a number of sequences from the data
 *
 * @param count Number of sequences to sample
 * @param data  Data to sample from
 * @return Sequence samples
 */
public static List<List<List<Writable>>> sampleSequence(int count, JavaRDD<List<List<Writable>>> data) {
    return data.takeSample(false, count);
}
Example 7
Source File: AnalyzeSpark.java From deeplearning4j with Apache License 2.0
/**
 * Randomly sample a set of examples
 *
 * @param count Number of samples to generate
 * @param data  Data to sample from
 * @return Samples
 */
public static List<List<Writable>> sample(int count, JavaRDD<List<Writable>> data) {
    return data.takeSample(false, count);
}
Example 8
Source File: AnalyzeSpark.java From deeplearning4j with Apache License 2.0
/**
 * Randomly sample a number of sequences from the data
 *
 * @param count Number of sequences to sample
 * @param data  Data to sample from
 * @return Sequence samples
 */
public static List<List<List<Writable>>> sampleSequence(int count, JavaRDD<List<List<Writable>>> data) {
    return data.takeSample(false, count);
}