Monday, September 6, 2010

Convert DBLP(XML format) to a relational DB (MySQL)

I need a data set on which my new algorithm can be tested. I am posting here the small work of converting DBLP (in XML format) into MySQL.

Data source, http://www.informatik.uni-trier.de/~ley/db/ . I downloaded its XML format file, which contain information about thousands of articles.

Target data storage, A simple database designed for MySQL. Below is the SQLs for creating table.

CREATE TABLE `DBLP`.`articles` (
`id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT  PRIMARY KEY ,
`title` VARCHAR( 250 ) NOT NULL ,
`bookTitle` VARCHAR( 150 )  NOT NULL ,
`pages` CHAR( 13 ) ,
`year` CHAR( 4 ) NOT NULL ,
`mdate`  CHAR( 10 ) NOT NULL ,
UNIQUE (
`title`
)
) ENGINE = MYISAM  ;



CREATE TABLE `DBLP`.`authors` (
`id` BIGINT UNSIGNED NOT NULL  AUTO_INCREMENT PRIMARY KEY ,
`name` VARCHAR( 50 ) NOT NULL
UNIQUE (
`name`
)
) ENGINE =  MYISAM ;

CREATE TABLE `DBLP`.`linkTbl` (
`article_id` BIGINT  UNSIGNED NOT NULL ,
`author_id` BIGINT UNSIGNED NOT NULL ,
`position`  TINYINT UNSIGNED NOT NULL ,
INDEX ( `article_id` , `author_id` )
) ENGINE  = MYISAM ;
Then, I made a java program to access the XML file and insert each article info into new created RDBMS tables.

Java code is using xerces SAX parser to read and parse the XML file. We use SAX parser not DOM parser because the XML has very large file size. It is over 700M on the disc.

Java code is small. But it is too long to be post here. So, I am only put important SQL query in Java code here.

String articleInsertSql = "INSERT INTO `articles` (`id` ,`title` ,`bookTitle` ,`pages` ,`year` ,`mdate`)"
+ " VALUES (NULL , ?, ?, ?, ?, ?)"
+ " ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)";

String articles_lastId = "set @articleId=LAST_INSERT_ID()";

String  authorReplaceSql = "INSERT INTO `authors` (`id` ,`name`)"
+ " VALUES (NULL , ?)"
+ " ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)";

String authors_lastId = "set @authorId=LAST_INSERT_ID()";

String lookTableInsertSql = "INSERT INTO `linkTbl` (`article_id` ,`author_id` ,`position`)"
+ " VALUES (@articleId, @authorId, ?)";


Here, string articleInsertSql , articles_lastId, authorReplaceSql, authors_lastId, lookTableInsertSql are executed in PreparedStatement sequentially. Here, " ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)" can be highlighted because it let following "set @articleId=LAST_INSERT_ID()" or "set @authorId=LAST_INSERT_ID()" to assign right ID values to @articleId or @authorId .

3 comments:

  1. Could you please repost the netbeans project because there is a problem with the download

    ReplyDelete
  2. Hi! Can you fix the download link of the Netbeans project?? i'm working with the xml of dblp and this can be of a lot of help to me... Thanks!

    ReplyDelete
  3. sorry. my laptop's hard driver physically damaged. So, I lost all my work. But, I you can easily parse the XML by using SAX parser which has been part of JDK.

    ReplyDelete