Aperture Framework

Aperture is a framework that deals with multiple types of documents and formats. It is written in Java and is under heavy development. It handles all of those formats through a single interface, which makes it quite useful for crawling resources in a standard way. From its website:

“Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, web sites, mail boxes) and the file formats (e.g. documents, images) occurring in these systems.”

I found it quite useful and simple at the same time. The following class lets you extract the full text from your files, plus some additional metadata (such as the creator, the title and the language).

Here is an example of how it works (the code is a modification of the extractor example):

import info.aduna.io.IOUtil;
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Set;
import org.semanticdesktop.aperture.extractor.Extractor;
import org.semanticdesktop.aperture.extractor.ExtractorException;
import org.semanticdesktop.aperture.extractor.ExtractorFactory;
import org.semanticdesktop.aperture.extractor.ExtractorRegistry;
import org.semanticdesktop.aperture.extractor.impl.DefaultExtractorRegistry;
import org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier;
import org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier;
import org.semanticdesktop.aperture.rdf.RDFContainer;
import org.semanticdesktop.aperture.rdf.impl.RDFContainerFactoryImpl;
import org.semanticdesktop.aperture.vocabulary.DATA;

public class Normalizer {

	private RDFContainer container;

	public void normalize(String filename) throws IOException, ExtractorException  {
		// create a MimeTypeIdentifier
		MimeTypeIdentifier identifier = new MagicMimeTypeIdentifier();

		// create an ExtractorRegistry containing all Extractors
		ExtractorRegistry extractorRegistry = new DefaultExtractorRegistry();

		// create a stream of the specified file
		File file = new File(filename);
		FileInputStream stream = new FileInputStream(file);

		// read as many bytes of the file as desired by the MIME type identifier
		int minimumArrayLength = identifier.getMinArrayLength();
		int bufferSize = Math.max(minimumArrayLength, 8192);
		BufferedInputStream buffer = new BufferedInputStream(stream, bufferSize);
		buffer.mark(minimumArrayLength + 10); // add some for safety
		byte[] bytes = IOUtil.readBytes(buffer, minimumArrayLength);

		// let the MimeTypeIdentifier determine the MIME type of this file
		String mimeType = identifier.identify(bytes, file.getPath(), null);

		// skip the extraction phase when the MIME type could not be determined
		if (mimeType == null) {
			System.err.println("WARNING: MIME type could not be established.");
		} else {
			// create the RDFContainer that will hold the RDF model
			RDFContainerFactoryImpl containerFactory = new RDFContainerFactoryImpl();
			container = containerFactory.newInstance(file.toURI().toString());

			// determine and apply an Extractor that can handle this MIME type
			Set factories = extractorRegistry.get(mimeType);
			if (factories != null && !factories.isEmpty()) {
				// just fetch the first available Extractor
				ExtractorFactory factory = (ExtractorFactory) factories.iterator().next();
				Extractor extractor = factory.get();

				// apply the extractor on the specified file
				buffer.reset();
				extractor.extract(container.getDescribedUri(), buffer, null, mimeType, container);
			}

			// add the MIME type as an additional statement to the RDF model
			container.add(DATA.mimeType, mimeType);
		}
		buffer.close();
	}

	public String getCreator(){
		return container.getString(DATA.creator);
	}

	public String getTitle(){
		return container.getString(DATA.title);
	}

	public String getLanguage(){
		return container.getString(DATA.language);
	}

	public String getFullText(){
		return container.getString(DATA.fullText);
	}
}

vmware-player in ubuntu gutsy

If you are looking for vmware-player in the latest version of Ubuntu, you won't find it: it was removed.

But you can install it from a PPA repository.

Add these lines to your /etc/apt/sources.list:

deb http://ppa.launchpad.net/cschieli/ubuntu gutsy main restricted universe multiverse
deb-src http://ppa.launchpad.net/cschieli/ubuntu gutsy main restricted universe multiverse

Then update:

sudo aptitude update

Finally, install:

sudo aptitude install vmware-player

After installation you will probably want to comment out that repository in /etc/apt/sources.list.

I love jython!

So if you thought that Python was good, well, Jython is even better! You get all the great stuff that Java has without needing to program in Java :D For example, the other day I needed to read some XML files from the web. In Python you have httplib for doing HTTP GETs and retrieving the documents, and you have nice libraries like Beautiful Soup or PyXML. But in this case I needed something even simpler: since I could use the XSD for these XML files, something like XMLBeans would be great.

I already had the code for downloading the XML files in Python; it was something like this:

import httplib, os

def get_listing(conn):
	conn.request("GET", "/oai2?verb=ListIdentifiers&metadataPrefix=oai_dc")
	response = conn.getresponse()

def get_papers(out):
	conn = httplib.HTTPConnection("export.arxiv.org")
	listids = get_listing(conn)
	for id in listids:
		pass  # per-paper handling is not shown in the original post

That URL (export.arxiv.org/oai2?verb=ListIdentifiers&metadataPrefix=oai_dc) returns an XML document defined by an XSD, so with the help of XMLBeans we can generate a jar to handle that XML. The code looks just like a Python script and will probably run fine... Now we can get those identifiers with just the following modifications:

import httplib, os
import org.openarchives.oai.x20
import org.apache.xmlbeans

def get_listing(conn):
	conn.request("GET", "/oai2?verb=ListIdentifiers&metadataPrefix=oai_dc")
	response = conn.getresponse()
	# parse the response body with the xmlbeans-generated OAI-PMH classes
	listidsdoc = org.openarchives.oai.x20.OAIPMHDocument.Factory.parse(response.read())
	# one header per record; each header carries the record identifier
	listids = listidsdoc.getOAIPMH().getListIdentifiers().getHeaderArray()
	return listids

def get_papers(out):
	conn = httplib.HTTPConnection("export.arxiv.org")
	listids = get_listing(conn)
	for id in listids:
		pass  # per-paper handling is not shown in the original post

So, the only thing left was exporting the classpath (something like export CLASSPATH="/path/to/xmlbean/jar") and adding some additional logic to the script to actually do what I was looking for :P and my Jython hello world was ready :D

That was a great time with Jython and you can have it too: go and download it from here. They have a really painless “next, next, next” installation wizard. You can read something else here, here and here.
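
If you don't have the xmlbeans-generated jar at hand, here is a minimal sketch of the same identifier extraction in plain Python, using the standard library's xml.etree.ElementTree instead of xmlbeans. It assumes the response follows the OAI-PMH 2.0 layout (header elements, each with an identifier child, all in the http://www.openarchives.org/OAI/2.0/ namespace). The function name parse_identifiers and the sample document are made up for illustration, and the snippet expects Python 2.7 or 3.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses, in ElementTree's {uri}tag form
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def parse_identifiers(xml_text):
    """Return the identifier text of every header in a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    # walk every <header> element, wherever it sits in the tree
    return [h.findtext(OAI_NS + "identifier") for h in root.iter(OAI_NS + "header")]


# Made-up sample shaped like a ListIdentifiers response (not real arXiv data)
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:0001</identifier></header>
    <header><identifier>oai:example.org:0002</identifier></header>
  </ListIdentifiers>
</OAI-PMH>"""
```

In a script like the one above you would pass response.read() to parse_identifiers instead of SAMPLE. You lose the typed classes that xmlbeans generates from the XSD, but there is no jar to build and no classpath to export.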

how-to delete 0 bytes files

ls -lahS | awk '{if( $5==0 ) print $8}' | xargs rm

gotta love it
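
The one-liner depends on which column ls prints the file name in and it trips over names with spaces, so here is a hedged alternative sketch in Python that does the same job for a single directory. The function name remove_empty_files is made up for illustration. It only touches regular files directly inside the given directory and skips symlinks and subdirectories.

```python
import os


def remove_empty_files(directory="."):
    """Delete zero-byte regular files directly inside directory; return their names."""
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # only plain files: skip directories and symlinks
        if os.path.isfile(path) and not os.path.islink(path) and os.path.getsize(path) == 0:
            os.remove(path)
            removed.append(name)
    return removed
```

Calling remove_empty_files("/some/dir") returns the names it deleted, so you can print them to see what went away.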

Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space