Friday, January 26, 2007

Following up on the ANTs Data Server issues, their Tech Support guys promptly sent me an email saying that the Bugs have been logged and that they are working on them:

  1. Bug number [1415]: JDBC getParameterTypeName() returns unknown for Date Datatype
  2. Bug number [1416]: JDBC getParameterClassName() method returns hard-coded string as "java.sql.ParameterMetaData"
  3. Bug number [401]: In ANTs, BIGINT is not of the standard size as compared to other Databases

StreamCruncher 1.06 Beta is now available!

This release contains 2 new features, both of which improve the Query language. The first addition is the case..when..then..else..end clause. Most Databases support this in the select.. clause. It is an extremely useful feature for handling null column values, rewriting column values etc.

The second addition is the first n or top n or the limit m offset n clauses. SC does not validate which clause you should use for the Database being used underneath. If the Database you are using supports the first n clause like Oracle TimesTen, then use it. If you are using H2 Database and you want to truncate the ResultSet, then you should use the limit m offset n clause. Check your DB's manual to find out which clause to use.
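To make the difference between the three clauses concrete, here is a minimal sketch that builds the plain SQL form of each one. The RowLimitSketch class and its Style enum are made up for this illustration and are not part of StreamCruncher, and the strings are ordinary Database SQL rather than SC's own Query syntax (TopOrLimitClauseTest shows the real thing).

```java
public class RowLimitSketch {
    /** Hypothetical names for the three clause styles; not StreamCruncher settings. */
    public enum Style { FIRST_N, TOP_N, LIMIT_OFFSET }

    /** Builds a plain SQL select on one table that returns at most n rows. */
    public static String select(Style style, String table, int n) {
        switch (style) {
            case FIRST_N:
                // "first n" style, e.g. TimesTen
                return "select first " + n + " * from " + table;
            case TOP_N:
                // "top n" style
                return "select top " + n + " * from " + table;
            case LIMIT_OFFSET:
                // "limit m offset n" style, e.g. H2; offset 0 starts at the first row
                return "select * from " + table + " limit " + n + " offset 0";
            default:
                throw new IllegalArgumentException("Unknown style: " + style);
        }
    }
}
```

Whichever style you pick, it is still up to you to match it to the Database underneath, since nothing here checks that for you.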

Both features in this release work only if the Database being used supports those clauses. Have a look at CaseWhenClauseTest and TopOrLimitClauseTest to see how it works.

The first m clause or its equivalent will prove to be very useful if you want to sample just a few Rows/Events without having to retrieve all the Events/Rows, which is always time consuming.

Tuesday, January 23, 2007

StreamCruncher 1.05 Beta is out!

Ah, finally..I got the time to re-do the Partitioning and Pre-Filtering logic. Until this 1.05 release, Partitions had to pull Events from the source Stream/Table just before Query Processing - one step before the final Query could be executed. So, the overhead of fetching the Events and Pre-Filtering them ("where.. " clause in the Partition definition) would add some latency to the overall Query processing. Even though each Partition used to run in its own Thread, the whole process would have had to wait for all the Partitions to draw the new Events into their respective Partitions. From the 1.05 release, the Events are pre-fetched for each Partition by a separate group of Threads. This should improve the speed and CPU utilization on Multi-Core/Multi-Processor Hardware. Overall Latency should reduce noticeably on such Hardware.
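The idea of pre-fetching can be sketched roughly as below. This is not the Kernel's actual code: the PreFilterSketch and Partition classes, the use of Integers as Events and the lambda syntax (which needs a newer Java than the one SC targets) are all assumptions made for the illustration. Each Partition gets its Events filtered by a pool Thread into its own buffer, so the Query side only ever sees Events that have already passed the Pre-Filter.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

public class PreFilterSketch {
    /** Hypothetical Partition: a Pre-Filter plus a buffer of Events that passed it. */
    public static final class Partition {
        public final Predicate<Integer> preFilter;
        public final BlockingQueue<Integer> ready = new LinkedBlockingQueue<>();

        public Partition(Predicate<Integer> preFilter) {
            this.preFilter = preFilter;
        }
    }

    /** Filters the Events for every Partition on a separate Thread pool. */
    public static void preFetch(List<Integer> events, List<Partition> partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        for (Partition p : partitions) {
            // One task per Partition; Partitions do not wait for each other.
            pool.submit(() -> {
                for (Integer e : events) {
                    if (p.preFilter.test(e)) {
                        p.ready.add(e);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            // Only for the demo: wait so the caller can inspect the buffers.
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In a real setup the pool would be long-lived and the Query would drain the "ready" buffers whenever it runs, instead of waiting for the pool to finish.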

As a consequence of this change in architecture, Partitions with the Pre-Filter clause do not trigger the Query unless the Events have passed through the Pre-Filter. In previous releases, unfiltered Events would trigger the Query (if the total Event Weight reached 1.0 or higher) and then would get filtered before reaching the Partition. This was very awkward for Queries with "New Events Windows", because Events that triggered the Query spuriously would cause the "New Events Windows" to discard their contents.

Load distribution across multiple Queries is possible now without any untoward consequences (like the "New Events Windows" problem) because the Pre-Filtering works in a Thread pool of its own.

Saturday, January 13, 2007

Following up on what I wrote on Marco's blog, StreamCruncher now supports the Solid BoostEngine, which is a dual-engine Database. Dual-engine means that it supports both In-memory Tables and Disk-based Tables. All the Streams (Input and Output) created via StreamCruncher are created on the Memory Engine and the Queries can combine data from both Disk-based and Memory Tables.

The ReStockAlertTest in the examples is a perfect example of this: it combines the "stock_level" Disk-based table (default in Solid) and the "test_str" Stream defined on the In-memory "test" Table. StreamCruncher creates its artifacts using the "STORE MEMORY" clause.

Something similar is done for MySQL Databases, where SC creates artifacts using the "engine = MEMORY" clause (MySQL is a multi-engine DB).

For both Solid and MySQL, SC transparently adds this special clause to the Input and Output Table/Stream definitions.

If you've tried H2 Database, the pure-Java, embedded, in-memory, in-process Database that StreamCruncher ships with, you'd be amazed at how much work has gone into its development. Well, it supports other modes as well, but this mode is the default/recommended setting in SC.

Thomas Mueller, the guy who owns/develops H2 was also the same chap who developed HSQL DB before someone else took over the responsibility from him. And now, HSQL DB is part of OpenOffice's Base application - Sun's answer to MS Access. That's really something.

H2 is turning into a full-fledged DB, what with its Spatial indexing, support for large (several GB) Database sizes, support for all sorts of twisted but very handy SQL syntax.. From StreamCruncher's point of view, the embedded mode in H2 is highly suitable. The less latency at the DB, the better. Last time I spoke to Thomas (over email), H2 did not have any support for concurrency. H2 was thread-safe, but didn't allow concurrent operations - just like HSQL DB. I hope he spends some more time removing that major bottleneck. At least Table-level locking/concurrency would be very good. HSQL DB on the other hand locks the whole Database, which is ughh..

In any case, I admire the effort that he has put into H2.

PS: It's "relocation time" for me. I'm moving out of Singapore after having spent 1 year and almost 10 months doing Consulting work. Nice little city, Singapore.

Monday, January 08, 2007

I thought I should share my Sci-Fi reading list. I'll start with the Stephen Baxter books I've enjoyed reading. Now, Baxter is considered to be a Hard Sci-Fi writer, like Clarke. But Baxter's writings are even more futuristic, and absolutely mind bending. Naturally, because Baxter is more like Clarke's successor, carrying the baton into the 21st Century. Certainly not for the faint hearted and semi-luddites. His works are based on extrapolations of our current understanding of Quantum Mechanics. His novels stretch across timescales one would never have imagined. From 500,000 years away, all the way up to several billion years into the future. How mankind will've evolved, what kind of entities we might encounter - not the usual sort of man-eating super roaches you see in B-grade movies, but civilizations that have evolved from Dark matter..and ideas like that, which really stretch your imagination and force you to re-think your philosophy, if you have any that is.

But I found his prose to be a little juddering, with haiku-like short sentences and abrupt context switches from chapter to chapter, especially when I read Manifold: Origin, which was the first Baxter novel I read. Subsequent novels were better, probably because I must've got used to his style by then. Baxter's stories are unique in that he constantly keeps hitting the boundaries of our understanding of the Universe and our purpose here, if there really is any. He puts his characters in extraordinary situations: encountering a whole galaxy that is miniaturized into a small box because their Sun was about to go Nova; meeting a civilization that is millions of years ahead of us and completely ignores us until the end of the Universe, where they leave a small condescending token behind for the poor Humans, like how we throw crumbs at pigeons; or a human being grafted onto an AI and then suspended inside the Sun to study why the Sun is dying so fast instead of hanging around for another 5 billion years. But his characters seem to lack depth because they are usually dwarfed by the engineering and astronomical marvels in the story working on colossal scales, like the aliens who are re-engineering the Milky Way in the Ring. Some books, especially the Ring, leave you reeling under the concepts.

The Light of Other Days was a lot more enjoyable. A lot of his novels are interlinked. You have to read all the novels in the right order before you finally get this "a-ha!" moment when all the pieces fall together - all episodes fall in line sometime along the "Time-like infinity". Start with "The Light of Other Days", which was a collaborative work with Clarke. Then move to Manifold: Origin and then Ring. If it still leaves you thirsting for more Hard Sci-Fi, read Coalescent - an entirely different thread. If the Ring and Manifold leave you numb and staring into deep space, wondering what your descendants 80,000 years from now will be doing, then you should read Coalescent to bring you back to present day. And if you are curious about Hiveminds you will like this book. Exultant, I felt, was too much like Orson Scott Card's Ender's Game.

You'll also notice Baxter recycling some of his stuff in other novels. But don't miss The Time Ships, a sequel to Wells' Time Machine. I loved this book, probably because he had to continue with Wells' style of writing instead of using his natural style. If you are interested in Evolutionary Biology, Genetic engineering, liked Huxley's Brave New World and are willing to make that leap of faith where a lot of things that we've come to accept as Society, Religion, Culture are all challenged; you should read this book. Well, Faith is the wrong word in this context, I suppose.

Saturday, January 06, 2007

StreamCruncher 1.04 Beta is out! A hasty release I must add. There was a Connection leak in 1.03 Beta, for ANTs and Solid Databases. In my previous Post I had mentioned the use of Proxies to bypass the Solid and ANTs Driver limitations. 1.04 onwards, Proxies are not used. I'm using concrete Classes to wrap the Connections, PreparedStatements etc.

But I'm still flummoxed by the ANTs Server. The SLAAlertTest keeps failing about 50% of the time. I spent the whole day trying to figure out why and I just couldn't. One thing I noticed was that the Timestamps get completely messed up in the results Table and even before that, along the way. The Kernel seems to be working fine, as demonstrated on the other 6 Databases. It's just ANTs. The Timestamps keep mysteriously jumping randomly to the "distant future". Maybe it's trying to hint at something - about my love for Science Fiction. Hmmm..? I've given up on ANTs for now.

Somebody also pointed out that SC does not work on JRE 1.5. I've tried to fix that by compiling the files with the "target 1.5" option. But, I have no way of checking if it works on 1.5. I strongly recommend 1.6, now that the Release Candidate is ready. 1.5 has some Memory Leaks in the Concurrent classes - in the parkAndWait(..) method or something like that. It was quite serious, last time I checked. After that I switched to 1.6.

And the Apache Commons library that is packaged with SC has been upgraded to 3.2 from 3.1.

Well, gotta go. Have a nice weekend.

Thursday, January 04, 2007

StreamCruncher 1.03 Beta is out! Changes include the "pinned" Aggregate Partition and the "entrance only" option for Aggregates. With this, the ReStockAlertTest has been modified to use the "Latest Rows/Events" Window instead of the "600 day Window", which in hindsight looks like it was not the right way to implement the Use Case.

I would've been thrilled to announce the support for ANTs Data Server and SOLID BoostEngine Databases. But the work involved to get those 2 DBs to work with SC was quite tiresome. So, yes StreamCruncher 1.03 Beta now supports SOLID BoostEngine and ANTs Data Server. That's a total of 7 Databases! Woo-hoo (Homer Simpson style)!

And if you want to be notified when there is a new release, I suggest you sign-up with FreshMeat and "Subscribe to new releases" at the bottom of their page.

However, there is a bug I've noticed in the Connection Pooling mechanism. SOLID integration also needed Proxy classes, just like the ones I had to write for ANTs in my previous Post.

SOLID Driver has a bug in the ResultSet.getTimestamp(..) method. It throws an "Invalid Date" error. But the same column when retrieved through the getDate(..) or getTime(..) returns the corresponding Date/Time components. getString(..) on that column also works. So, I had to add a whole series of Proxies for the Connection, Statements (plain Statement, Prepared and Callable) and finally the ResultSet to intercept the getTimestamp(..) and getObject(..) calls on TIMESTAMP columns. After intercepting the call, I use the getString(..) method to retrieve the String form of the Timestamp, parse it back to java.sql.Timestamp and then return that to the caller. Mmmm..And obviously, because of all these Proxies, there is a 10-20 millisecond overhead for each method call. I might have to replace these Proxies with proper implementations and then forward the calls like in the GoF Adapter Pattern. That's for later, anyway.
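The interception described above can be sketched with a java.lang.reflect.Proxy around the ResultSet. This is only a rough illustration and not the code that ships with SC: the TimestampFixProxy class name is made up, only the single-argument getTimestamp(..) overloads are handled, and it assumes the Driver's String form of the column is in the "yyyy-mm-dd hh:mm:ss[.f...]" format that Timestamp.valueOf(..) accepts.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class TimestampFixProxy {
    /** Wraps a ResultSet so that getTimestamp(..) is answered via getString(..). */
    public static ResultSet wrap(final ResultSet target) {
        InvocationHandler handler = new InvocationHandler() {
            public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                try {
                    boolean oneArg = args != null && args.length == 1;
                    if ("getTimestamp".equals(method.getName()) && oneArg) {
                        // Bypass the broken call: read the String form instead.
                        String text = (args[0] instanceof Integer)
                                ? target.getString((Integer) args[0])
                                : target.getString((String) args[0]);
                        // Assumes the JDBC escape format; SQL NULL stays null.
                        return text == null ? null : Timestamp.valueOf(text);
                    }
                    // Everything else is forwarded untouched.
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    // Re-throw the Driver's own exception, not the reflection wrapper.
                    throw e.getCause();
                }
            }
        };
        return (ResultSet) Proxy.newProxyInstance(
                TimestampFixProxy.class.getClassLoader(),
                new Class<?>[] {ResultSet.class},
                handler);
    }
}
```

To cover the whole chain, the Connection and Statement objects would need similar Proxies so that every ResultSet handed to the caller is already wrapped. Every call then goes through reflection, which is where the per-call overhead comes from.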

And SOLID does not support the PreparedStatement.getParameterMetaData() method. It just hasn't been implemented.

But the biggest problem I've noticed here is the damn Connection leak! I noticed this while testing the SOLID DB integration. The SOLID installation I have is a Demo copy and so the number of Connections is limited. Even though the Test I was running used at most 2 simultaneous Connections, the Pool kept throwing a "Datasource rejected request. Too many connections" error. And then after adding a whole lot of Sys-Out calls I realised that the Pool never closes the Connections. D'oh! It just kept creating new Connections, which makes it unusable except for Demos, until it is fixed. I noticed this in ANTs and SOLID. I'm wondering if the Proxies I've used are causing this problem. Not sure if the same thing is happening in the other DBs. Arghh...more debugging work for this weekend. I suspect Apache DBCP or the Commons Pool.

Wednesday, January 03, 2007

This weekend I worked on integrating the ANTs Database with StreamCruncher (SC). Well, where shall I begin? Unless I'm mistaken, their JDBC Driver is awful. To start with, the Driver does not even get registered with the DriverManager. Hard to believe, but that's how it is. ANTs is accessed using the JDBC-ODBC Bridge and their ANTs Driver never appears in the StackTraces. Because their (at least ANTs 3.6 GA) Driver does not get registered, the DriverManager falls back to the Sun implementation.

And as it is, Sun's default JDBC-ODBC Driver is littered with bugs. For the first day and a half, I kept banging my head, trying to get SC to work with the Database. The Sun implementation has this weird bug where you can't access any VARCHAR columns in the ResultSet in any order other than the sequence in which they are specified in the "SELECT A, B.." clause. And, you can't access them more than once. It keeps barfing with a "No data found" error. After looking at the StackTraces, I realized that ANTs was nowhere in the picture (er, StackTrace rather). It also had other such peculiarities, which I think were partly because of ANTs underneath. Batch Insert was not working correctly when the Table had BIGINT columns.

After all the "head banging", I had enough lumps on my head to make me stop and find a better solution. So, after some clever coding to forcibly register and instantiate the ANTs Driver, I thought I was getting somewhere. But alas, ANTs' compliance with JDBC is quite broken. The ParameterMetaData you get from their PreparedStatement.getParameterMetaData(..) returns a fixed String - "java.sql.ParameterMetaData" - from getParameterClassName(..), instead of the fully qualified Java Class name of the Parameter/Column. But then there were bigger problems. SC makes extensive use of BIGINT columns through the setLong(..) method. In ANTs, however, BIGINT and INT are not of the standard sizes as compared to other Databases. If a Long value is set to a BIGINT column using the PreparedStatement's setObject(..) or setLong(..) methods, like what SC does - you either get a "Null constraint violated error" or the Query won't return any results or, if it is an Insert statement, the value in the Table will be entirely different from the one you set. And, the largest negative value you can safely use on BIGINT columns is "-2147483647" (Integer.MIN_VALUE + 1). Anything beyond that will produce strange results or none at all! Their BIGINT actually should store Float, as per their error message when you use the PreparedStatement.setObject(new Long("some large number")).

It was quite frustrating to hit a roadblock at every stage. I couldn't re-write the whole Kernel just to make it work with ANTs. So, I chose the easier and messier way out. When the Kernel is configured to use ANTs, all the SQL Connections and PreparedStatements are intercepted and replaced with Proxies, transparently. So, whenever the Kernel or any Client code using the Connections obtained from the Kernel's Connection Pool uses the setObject(..) or setLong(..) on a BIGINT column, the Proxy silently replaces it with the Integer form of the number. The Kernel uses a reduced number range for ANTs - this is the only ANTs specific code I had to write. If the Integer part of the Long value does not match the original Long, because of "overflow", then the Kernel logs a Warning and continues with the Long value without replacing it with the Integer form, by assuming that it was intentional. Most DBs have java.lang.Long as their Java equivalent for the SQL BIGINT data type.
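The narrowing rule itself is small enough to show on its own. The following is a minimal sketch of the check described above, not the Kernel's actual code; the BigintNarrowing class and its narrow(..) method are names made up for this example.

```java
import java.util.logging.Logger;

public class BigintNarrowing {
    private static final Logger LOG = Logger.getLogger(BigintNarrowing.class.getName());

    /**
     * Returns the Integer form when the value survives the round trip through int,
     * otherwise logs a Warning and returns the original value as a Long.
     */
    public static Object narrow(long value) {
        int asInt = (int) value;
        if (asInt == value) {
            // Fits in 32 bits: safe to hand over in Integer form.
            return Integer.valueOf(asInt);
        }
        // "Overflow": assume the caller really meant the Long and pass it through.
        LOG.warning("Value " + value + " does not fit in an Integer, passing it on as a Long");
        return Long.valueOf(value);
    }
}
```

A Proxy around the PreparedStatement would call something like narrow(..) inside its setLong(..) and setObject(..) handling and then pick setInt(..) or setLong(..) on the real statement, depending on the type that comes back.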

Phew! It was very painful indeed. I guess using the Java Proxy for ANTs PreparedStatements and Connections might slow down the Kernel. Too bad, but it can't be helped unless those guys at ANTs do something about their Driver.

Next on the agenda is the SOLID Boost Engine. Some initial analysis revealed that their ResultSet.getTimestamp() on a TIMESTAMP column throws an "Invalid date" error. But getDate(), getTime() and getString() on the same column return a result. Arrghh.. I think I see a few more days of coding with the Proxy classes.