Sunday, April 15, 2007

XML Databases - SQL Server 2005 v. DB2

I used to laugh at the term "XML Database," because usually that meant a bunch of semi-organized XML documents on a filesystem with no index.

Today, however, there are RDBMSes that have incorporated XML functionality that make this a more viable method of storing data. If there's anything I've learned about tech, it's that a not-so-great solution that people adopt will be optimized and a great solution no one knows about will eventually die. In this case, people adopted XML, so they'll keep hammering away on the storage tech until it's viable. Likewise, this is how we're eventually going to see JITted Javascript -- something I predicted around the time that Web 2.0 was taking hold.

We've been using SQL 2005's XML functionality on a project because the people entering data wanted the luxury of freeform data in XML. This has turned out to be pretty good, although we could have taken it even further. These databases allow you to insert and replace Xml nodes right in the document. Of course, this is nowhere near as fast as using SQL, but the flexibility is pretty great. Anyone who has had to deal with changing schemas on a DB should be familiar with this. If your data is never locked down in the first place, it turns into a nightmare.

I've been reading for a while that DB2 supports Native XML better than SQL 2005, so I decided to give both a try this weekend and see how they compared.


DB2 Express-C v9.1
SQL Server Express 2005 SP 2

  • Single row with 6mb of XML, queried via XQuery
  • Same XML file shredded into ~6000 rows in a table with an XML column, queried via XQuery
  • Traditional primary key test.
  • No additional tuning except creating XML indices.
  • All queries done through ADO.NET for consistency.

DB2 Pros:

  • Deployable on Linux, AIX, Solaris, Windows, etc.
  • Consistently the best performance at doing Xquery across shredded XML.
  • XQuery with index across shredded XML better than or equal to typical sql primary key!
  • Consistent query speed even without an XML index.
  • Inserts are very fast.
  • XML indices can be set up for individual XPaths, targeting exactly what you need.
  • XQUERY is a direct command that can be used, rather than querying via SQL functions.
  • Index on 6,000 rows of XML did speed up XQuery across them.
  • Still works well with ADO.NET out of the box.

DB2 Cons:
  • I was able to crash the entire database... yes, I got a C++ runtime error dialog for db2syscs.exe, when I tried to delete a couple columns from an existing table. Nothing fancy, and the DB was completely hosed at that point and I couldn't even stop or restart the thing using their tools. I had to go into Task Mgr and kill all DB2 processes by hand and restart. Obviously, this is inexcusable for a shipping database product.
  • DB2 never ever shuts down gracefully for me. It always complains there are still connections even when there aren't. Back to task manager to kill it!
  • XQuery on 6mb XML blob was never very impressive, speed-wise.
  • Setting up XML index this seemed to have no effect on my 6mb XML file in a single row speed.
  • Documentation is a near-unusuable collection of one page descriptions with no samples.
  • Java-based database management tools suck hard. It's barely usable and has a huge memory footprint to boot. Oracle's Raptor tool is better, though, that too being in Java, it is only slightly better.

SQL Server 2005 Pros:

  • Better command cache and query cache. Doing the same query twice results in queries taking microseconds rather than milliseconds.
  • Fastest at XQuery within one 6 megabyte row of XML data.
  • Far better documentation.
  • Far better tool for database administration (SQL Server Management Studio Express)

SQL Server 2005 Cons:

  • Poor performance at xquery, using exists() across shredded XML.
  • Windows only.

XML/XQuery Pros:

  • Flexibility.
  • Ability to push XML straight from DB to XSLT, if you're into that sort of thing.
  • Did I mention flexibility?

XML/XQuery Cons:

  • Typically slower than traditional primary keys (Except on DB2).
  • Never got SqlParameters feeding into Xqueries for neither Sql Server nor DB2.

So after all that messing around, I decided to do a little control study with MySQL, which has no XQuery or native XML support. In order to weed out the hype of "Native XML", I thought this necessary.

Obviously it's tough to navigate the large 6MB blob of XML on MySql, so I decided to put it up against the shredded XML using "WHERE xmldata LIKE 'lots of percent sighs'. And actually...

  • MySQL outperforms SQL Server when working on shredded XML. (by an order of magnitude)
  • MySQL underperforms DB2 when working with shredded XML. (by an order of magnitude)

Sad, but true. Microsoft's XQuery hype gets killed by using "WHERE ... LIKE" in MySQL on shredded data.

Based on this, here's a guide for while DB to use:

  • Non-shredded, huge XML blobs: SQL Server 2005.
  • Shredded, XML in rows: DB 2.
  • Pure read performance: MySQL.
  • Best in show: DB 2.

I wish I didn't have to write that DB 2 was the best in show here given the crashes and trouble I had with it, but: in general, if you need to use XML in your database, DB2 is probably the best choice.


Brandon said...

How much data in your 6k rows? Still only 6MB, or 6*6M? Either sounds like a pretty small dataset for a test... unless thats your typical volume.

Trimbo said...

1K per row in 6K rows. Yeah, it's not a lot of data compared to what people usually sic DB2 on.

I didn't really set out to benchmark as much as I wanted to take a look at the ease of use. It turned into a benchmarking thing though when I started getting some numbers back from the query tools that looked very odd. DB2 uses this measurement called "Timerons", which don't relate to anything. My queries were taking 7.52 "timerons" or something like that. So I wrote a .NET app to do my queries, wherein I used the StopWatch class to measure what the queries really took. That's how this ended up being a benchmark thing a little bit.

The dataset isn't big but it is one that's really practical for our work. Mostly I use it to do manual queries of different sorts. The data doesn't change much, so I could probably use Excel to query against it, but feeding it into SQL Server has just been pretty easy. SQL Server Management Studio is a pretty decent tool for working through queries.

BTW, the dataset mentioned is not the production data we're mainly putting in SQL Server. That data has revision and user tracking, relationships, etc. Most of the rows end up being in the history tables, though the DB is a hyrbid and uses traditional primary/foreign keys as well as an XML column.