

Enter the World of Semantics: using Jena to convert your data to RDF

Posted on July 21st, 2016 in RDF, Metadata, Semantic Data, Open Web by Ayman Elserafi

The semantic web of data and the realm of Linked Open Data (LOD) is growing every day … and at its core is the Resource Description Framework (or RDF for short). RDF is a W3C-recommended standard for representing data and metadata (i.e. descriptions of data such as its schema and business meaning) which can support semantics-aware Business Intelligence. The idea of RDF is to represent everything in terms of triples. But what are triples in layman’s terms? Well, they are simply the data model of RDF, which has three components: 1. Subject (S)  2. Predicate (P)  3. Object (O) … or SPO for short. An example would be something like “Java” rdf:type “programming_language”. This means that the subject “Java” can be semantically understood to be an instantiation (which is the definition of the rdf:type predicate) of the “programming_language” class (the object of the triple). Another valid triple giving another semantic meaning of Java would be: “Java” rdf:type “Island”, as Java is also the name of a nice big island in Indonesia! Note that the components of a triple are usually expressed as URIs (more about that here).
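To make this concrete, here is roughly how those two triples could look once the names are spelled out as URIs (the example.org identifiers below are just placeholders for illustration, not a real vocabulary):

<http://example.org/Java> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/ProgrammingLanguage> .
<http://example.org/Java> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Island> .

Each line is one complete subject-predicate-object statement terminated by a full stop; the predicate is the full URI behind the rdf:type shorthand.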

Expressing such triples helps in defining the semantic meaning of data items and how they should be understood … which makes the semantic world a natural and expected evolution of Business Intelligence, supporting data scientists in understanding the data at hand. Also, RDF and semantically represented data are really useful for sharing your data with the community and the world … it is one of the recommended approaches for having your data officially recognized as shareable open data, or what the inventors of the web commonly call 5-star data (think of it like a 5-star hotel which provides more and better services … see: 5-star-data).

Several serialization options exist for storing RDF triples, e.g. an XML representation (RDF/XML). Two very common and human-readable formats are N-Triples and Turtle. They represent the triples in an easy-to-use (and commonly compressed) format for persisting the RDF data. So, if the above sounds convincing enough to enter the world of semantics, we need to serialize our data into an RDF format like N-Triples. But how can we simply do that?
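For a feel of the difference: in N-Triples every triple is one full line, as shown earlier, while Turtle lets you declare prefixes and group triples that share a subject. The same two placeholder triples from above would look like this in Turtle (again, the example.org namespace is only an illustration):

@prefix ex:  <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

ex:Java rdf:type ex:ProgrammingLanguage ;
        rdf:type ex:Island .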

The answer is easy: use one of the available and popular RDF libraries and APIs for handling your data and converting it into RDF. A very commonly used open-source library for this task is Jena for Java (here we mean “Java” the “programming_language” and not the island, if you recall our earlier discussion … our RDF triples above would clarify this for you, and could also help a computer using our data to distinguish the two meanings!). Jena is very simple to use and can also be easily integrated into your Java applications using Maven.

Getting started with Jena in a Maven-built Java project is simple: add the following to the Maven pom.xml configuration file between the <dependencies> tags:

<dependency>
    <groupId>org.apache.jena</groupId>
    <artifactId>apache-jena-libs</artifactId>
    <version>3.1.0</version>
    <type>pom</type>
</dependency>

Now that the Jena library is available within your Java application, you can use the methods provided by Jena to convert most of the common data formats into RDF-enabled formats like N-Triples. Let’s say you want to convert some CSV data files into RDF triples. We can do this using code similar to the snippet below. To give some extra flavour of what Jena-provided methods can support, we add some extra tasks such as querying the generated triples, manually deleting some triples you have queried, and manually adding triples to the final RDF output file (in this case we ask Jena to serialize the file as N-Triples). Note that we are also using the Apache FileUtils class for loading and saving files on the file system (it is part of the Apache Commons IO library, which also supports Maven integration, as shown below).
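If you want to run the snippet yourself, the Commons IO dependency can sit next to the Jena one in the same <dependencies> section; the version below is simply one that was current around Jena 3.1.0, so use whichever recent release you prefer:

<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.5</version>
</dependency>

With both dependencies in place, the conversion code looks like this: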

//Some imports of the Java classes we will use
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import org.apache.commons.io.FileUtils;
import org.apache.jena.propertytable.graph.GraphCSV;
import org.apache.jena.propertytable.lang.CSV2RDF;
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.ResourceFactory;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.vocabulary.RDF;

 ...

public static void convertCSVToRDF(String file, String inputFilename, String outputFilename,
        String outputType) {

    //Just a few lines below to convert the data from CSV to an RDF graph, see how easy?!
    CSV2RDF.init();                                  //initialise the CSV conversion engine in Jena
    GraphCSV newGraph = new GraphCSV(inputFilename); //load the CSV file as an RDF graph
    Model model = ModelFactory.createModelForGraph(newGraph);

    //Manually insert class triples for each instance (row subject) in the CSV file
    String sparqlQueryString = "select distinct ?s where {?s ?p ?o}";
    Query query = QueryFactory.create(sparqlQueryString);
    QueryExecution qexec = QueryExecutionFactory.create(query, model);
    ResultSet results = qexec.execSelect();
    Model m2 = ModelFactory.createDefaultModel();
    while (results.hasNext()) {
        QuerySolution so = results.nextSolution();
        //State that each subject is an instance (rdf:type) of the class named by "file"
        Statement stmt = ResourceFactory.createStatement(so.getResource("s"), RDF.type,
                ResourceFactory.createResource(file));
        m2.add(stmt);
    }
    qexec.close();

    //Create a new RDF graph which "unions" the old graph with the new graph
    //containing the new class triples
    Model m3 = ModelFactory.createUnion(model, m2);

    //Now serialize the RDF graph to an output file using the outputType variable
    //you specify. It should be "N-TRIPLE" in our case.
    try {
        FileWriter out = new FileWriter(outputFilename);
        m3.write(out, outputType);
        out.close();
    } catch (Exception e) {
        System.out.println("Error in the file output process!");
        e.printStackTrace();
    }

    //Delete the triples that use the auto-generated "row" predicate
    File output = new File(outputFilename);
    File tempFile = new File("C:/Users/user1/SampleFile/temp.nt");
    BufferedReader reader = null;
    BufferedWriter writer = null;
    try {
        reader = new BufferedReader(new FileReader(output));
        writer = new BufferedWriter(new FileWriter(tempFile));
        String currentLine;
        //Delete triples from the old file by skipping them while reading the N-Triples
        //file from the last step; otherwise write the triple to a new temp file!
        while ((currentLine = reader.readLine()) != null) {
            if (currentLine.contains("http://w3c/future-csv-vocab/row")) {
                continue;
            } else {
                writer.write(currentLine);
                writer.newLine();
            }
        }
        writer.close();
        reader.close();

        //Empty the original output file
        PrintWriter printer = new PrintWriter(output);
        printer.print("");
        printer.close();

        //Copy the content from the temp file to the final output file, overwriting it
        FileUtils.copyFile(tempFile, output);
    } catch (FileNotFoundException e1) {
        e1.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
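As a quick sanity check, a call like the following (with made-up file paths and a made-up class URI, so adjust them to your own setup) turns a CSV file into an N-Triples file and tags every row subject with the given class:

public static void main(String[] args) {
    //Hypothetical class URI and file paths -- replace with your own
    convertCSVToRDF("http://example.org/Employee",               //class assigned to each CSV row
                    "C:/Users/user1/SampleFile/employees.csv",   //input CSV file
                    "C:/Users/user1/SampleFile/employees.nt",    //output N-Triples file
                    "N-TRIPLE");                                 //Jena's name for the N-Triples syntax
}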

Hope you have enjoyed your first (maybe?) ride into the semantic world … welcome aboard!