Java Code Examples for org.apache.tika.io.TikaInputStream#getFile()

The following examples show how to use org.apache.tika.io.TikaInputStream#getFile() . You can vote up the ones you like or vote down the ones you don't like, and go to the original project or source file by following the links above each example. You may check out the related API usage on the sidebar.

Example 1

Source File: TikaPoweredMetadataExtracter.java From alfresco-repository with GNU Lesser General Public License v3.0

5 votes

/**
 * There seems to be some sort of issue with some downstream
 *  3rd party libraries, and input streams that come from
 *  a {@link ContentReader}. This happens most often with
 *  JPEG and Tiff files.
 * For these cases, buffer out to a local file if not
 *  already there
 */
protected InputStream getInputStream(ContentReader reader) throws IOException
{
   // Prefer the File if available, it's generally quicker
   if(reader instanceof FileContentReader) 
   {
      return TikaInputStream.get( ((FileContentReader)reader).getFile() );
   }
   
   // Grab the InputStream for the Content
   InputStream input = reader.getContentInputStream();
   
   // Images currently always require a file
   if(MimetypeMap.MIMETYPE_IMAGE_JPEG.equals(reader.getMimetype()) ||
      MimetypeMap.MIMETYPE_IMAGE_TIFF.equals(reader.getMimetype())) 
   {
      TemporaryResources tmp = new TemporaryResources();
      TikaInputStream stream = TikaInputStream.get(input, tmp);
      stream.getFile(); // Have it turned into File backed
      return stream;
   }
   else
   {
      // The regular Content InputStream should be fine
      return input; 
   }
}

Example 2

Source File: TesseractOCRParser.java From CogStack-Pipeline with Apache License 2.0

4 votes

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
    TesseractOCRConfig config = context.get(TesseractOCRConfig.class, DEFAULT_CONFIG);

    // If Tesseract is not on the path with the current config, do not try to run OCR
    // getSupportedTypes shouldn't have listed us as handling it, so this should only
    //  occur if someone directly calls this parser, not via DefaultParser or similar
    if (! hasTesseract(config))
        return;

    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);

    TemporaryResources tmp = new TemporaryResources();
    File output = null;
    try {
        TikaInputStream tikaStream = TikaInputStream.get(stream, tmp);
        File input = tikaStream.getFile();
        long size = tikaStream.getLength();

        if (size >= config.getMinFileSizeToOcr() && size <= config.getMaxFileSizeToOcr()) {

            output = tmp.createTemporaryFile();
            doOCR(input, output, config);

            // Tesseract appends .txt to output file name
            output = new File(output.getAbsolutePath() + ".txt");

            if (output.exists())
                extractOutput(new FileInputStream(output), xhtml);

        }

        // Temporary workaround for TIKA-1445 - until we can specify
        //  composite parsers with strategies (eg Composite, Try In Turn),
        //  always send the image onwards to the regular parser to have
        //  the metadata for them extracted as well
        _TMP_IMAGE_METADATA_PARSER.parse(tikaStream, handler, metadata, context);
    } finally {
        tmp.dispose();
        if (output != null) {
            output.delete();
        }
    }
}