Saturday, December 29, 2007

Probably a power gen plant - Page, AZ

Friday, December 28, 2007

Sunset - Grand Canyon



Thursday, December 27, 2007

Sunrise - Grand Canyon



Full moon over Page, AZ

Bryce Canyon

Didn't notice the backdrop was so beautiful until I got back home

Wednesday, December 26, 2007

X'mas trip to Grand Canyon and Bryce

This weekend I drove up to the Grand Canyon and Bryce Canyon with friends. We spent a lot of time driving, though. The weather was surprisingly good. It had snowed a few days earlier, but not while we were there. It was, obviously, very cold - unbearably so during the early mornings and evenings, when we stayed out in the open to watch the sun rise and set.

The North Rim of the Grand Canyon was closed for the season, so we went to the South Rim. We didn't hike - I don't think the trails were open anyway.

From GC, we went to Bryce Canyon. On the way, we stopped at Page to visit Antelope Canyon - another wonderful, water-carved canyon. I forgot to carry my camera tripod and missed a good photo op, but I'm sure it wouldn't have made a big difference (beginner that I am). This place must be visited in summer, when the sun shines into the canyon in narrow shafts and lights up the canyon walls, making for a splendid display.

We should also have visited Lake Powell, Vermilion Cliffs and other places nearby, but didn't have the time to.

Bryce Canyon was an altogether different experience. It's hard to believe that 3 vastly different-looking canyons can be found within a day's drive of each other. Bryce was open for hiking, and we did the Navajo loop, starting from Sunrise Point and hiking up to Sunset Point. The weather was good here too. The skies were clear, with a blue so intense I had never imagined it was possible.

On our way back, we tried to visit Meteor Crater but it was closed for X'mas.

Well, all in all it was a very pleasant trip.

Sunday, December 02, 2007

Mission Peak, CA

Ok..one last entry for today. In case you were wondering, yes I've decided to log my hikes/trips/travels since I moved to the Bay Area.

Today we hiked up Mission Peak, very close to Mountain View, CA. A little too close, if you ask me - if you feel like driving, this is not the place. But if you feel like a moderately strenuous hike (for occasional hikers like me), it's not bad. There are no trees around or on the hills, and it was very windy at the top - a good thing we carried warm clothes. Carry water and hats/caps. Do not be fooled by all those wire-thin people carrying nothing but hiking poles and iPods; they probably go there every weekend, and it is not an easy hike. The view of the entire Bay Area from the top was well worth the effort.

Mt. Hamilton, South Bay Area

A few weeks ago I drove up to Mt. Hamilton, a little east of San Jose. There's an Observatory there, which closes at 5 PM. It's not too far from the city - a lazy Sunday afternoon's drive is all it takes. Nice view of San Jose by night on the way back.

Link: Mt. Hamilton, CA

Thursday, November 29, 2007

Thanksgiving Road trip

This Thanksgiving I went on a road trip with friends. We drove almost 1400 miles in 4 days.

We started from the South Bay Area, drove via Redwood State Park and took a short detour from 101 onto CA 1, which was very picturesque - that, along with a bright, sunny day, made it great for shooting the shore and landscapes.

We drove up to Crater Lake and Diamond Lake. Lots of snow there..mmmm snowball fights. If you drive through Shady Cove, make sure you stop for brunch at Two Pines Smokehouse (or a similar-sounding name, if I've got it wrong). Great service and good food, and you will need it, because there is nothing else to eat on the way.

On the way back we drove via the Klamath Nature Reserve. A short drive-by. Pretty decent hawk sightings, but not much else to see in this season - well, except for the pheasants (and it was hunting season; nope, did not spot Elmer Fudd). The Bald Eagle Reserve would've been nice if it were open to the public at this time of the year, when they nest in large numbers.

The highlight of the return trip was the Lava Beds National Monument, with its vast expanses of eerie volcanic landscapes and long lava tubes/tunnels. Black, porous lava deposits capped with snow..reminded me of chocolate fudge with whipped cream on top.

This is a rough trip record:



(Embedded Google Map of the route)

Wednesday, November 21, 2007

Hiking this weekend?

In case you were planning to go camping/hiking this weekend (long weekend in the US - Thanksgiving) then you should probably have a look at REI's checklist and advice columns. It's quite good.

Wednesday, October 31, 2007

Loose metal-strap watch: DIY

Have you ever bought a metal-strap watch and found the strap to be too loose? Well, loose but not so loose that removing another link in the strap would solve it?

I had this problem recently - my new watch kept slipping down my wrist all the time. I couldn't remove any more links because that would make it too tight. So, I added a strip of "Mounting tape" to the inner side of the clasp - the part of the strap that touches the underside of the wrist - and problem solved! Just wait for the glue on the double-sided sticky tape to dry, though.

And I found those plastic-topped push-pins to be very useful while removing those links. All you need is a mallet or something similar to force the link-pins out. Here's some info - How to.

Sunday, October 21, 2007

Radio - Classic Rock

If you are a Classic Rock fan, you'll love this Station - KFRC. Of course, if you live in the Bay Area you probably already know - 106.9 FM. You can even listen online for free. I absolutely love the songs they play. Now I don't have to spend a small fortune on CDs.

Saturday, October 20, 2007

ClassLoader (ab)use

It was a weird piece of code, and I had the misfortune of having to work around this very piece of code - a hand-me-down from an old version of the product.

It was some kind of store/map where the keys were Classes instead of the Strings you would've seen in most "regular" Maps. So, it automatically limited the range of keys one could use - new Classes had to be created and compiled and then used as keys, if required. I had to use this package because there was a lot of business logic tied to this store and it couldn't be modified. There were a lot of versions already in production.

I needed to use the very same store, but my keys were going to be dynamic. The only choices I had were to either compile new classes beforehand or generate new Classes on the fly in the JVM using BCEL or ASM or one of those scary Bytecode libraries. That was when I realized there was a simpler solution - one that didn't need anything so over-the-top. I chose to use (rather, abuse) ClassLoaders.

I realized that I didn't really need the Key-Classes anywhere other than to store data into the map. So, I created an empty Class, and every time I needed a new Key-Class I would just load my fixed Class from a new ClassLoader. Since the same Class can be loaded into the JVM by multiple ClassLoaders, and each load produces a distinct Class object, the Store/map would have no idea that they were all the same Class - just from different ClassLoaders. So, I created a small custom ClassLoader that would do this every time a new Key was required.

I must mention that I was allowed to sub-Class this store.


This is what the "weird" old Store looks like:


import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AbsurdLegacyStore {
    protected final Map<Class, Set<Object>> perTypeStore;

    public AbsurdLegacyStore() {
        this.perTypeStore = new HashMap<Class, Set<Object>>();
    }

    public Map<Class, Set<Object>> getPerTypeStore() {
        return perTypeStore;
    }

    public void addToStoreAndDoWork(Employee e) {
        Set<Object> set = perTypeStore.get(Employee.class);
        if (set == null) {
            set = new HashSet<Object>();
            perTypeStore.put(Employee.class, set);
        }

        set.add(e);
    }

    public void addToStoreAndDoWork(Customer c) {
        Set<Object> set = perTypeStore.get(Customer.class);
        if (set == null) {
            set = new HashSet<Object>();
            perTypeStore.put(Customer.class, set);
        }

        set.add(c);
    }

    public void addToStoreAndDoWork(Supplier s) {
        Set<Object> set = perTypeStore.get(Supplier.class);
        if (set == null) {
            set = new HashSet<Object>();
            perTypeStore.put(Supplier.class, set);
        }

        set.add(s);
    }
}


All the fixed key Classes:


public class Customer {
    protected final String name;

    public Customer(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    @Override
    public String toString() {
        return name;
    }
}


public class Employee {
    protected final String name;

    public Employee(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    @Override
    public String toString() {
        return name;
    }
}


public class Supplier {
    protected final String name;

    public Supplier(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    @Override
    public String toString() {
        return name;
    }
}


public class ChannelPartner {
    protected final String name;

    public ChannelPartner(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return name;
    }
}


The "hacky" sub-Class of the old Store with its own custom ClassLoader:


import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HackyTypeBasedStore extends AbsurdLegacyStore {
    protected Map<String, Class> dynamicSalesRegionTypes;

    public HackyTypeBasedStore() {
        this.dynamicSalesRegionTypes = new HashMap<String, Class>();
    }

    public void registerNewSalesRegion(String salesRegionName) throws ClassNotFoundException {
        if (dynamicSalesRegionTypes.containsKey(salesRegionName)) {
            return;
        }

        // A brand new ClassLoader for every Sales Region. Loading the same
        // SalesRegion class file again produces a distinct Class object.
        ClassLoader customLoader = new CustomClassLoader();
        Class dynamicType = customLoader.loadClass(SalesRegion.class.getName());
        dynamicSalesRegionTypes.put(salesRegionName, dynamicType);
    }

    public void addToStoreAndDoWork(String salesRegionName, ChannelPartner partner) {
        Class regionType = dynamicSalesRegionTypes.get(salesRegionName);

        Set<Object> set = perTypeStore.get(regionType);
        if (set == null) {
            set = new HashSet<Object>();
            perTypeStore.put(regionType, set);
        }

        set.add(partner);
    }

    public static void main(String[] args) throws ClassNotFoundException {
        HackyTypeBasedStore store = new HackyTypeBasedStore();

        String salesRegion1 = "Europe";
        String salesRegion2 = "Americas";
        String salesRegion3 = "Asia";

        store.registerNewSalesRegion(salesRegion1);
        store.registerNewSalesRegion(salesRegion2);
        store.registerNewSalesRegion(salesRegion3);

        ChannelPartner partnerABCD = new ChannelPartner("ABCD");
        store.addToStoreAndDoWork(salesRegion1, partnerABCD);
        store.addToStoreAndDoWork(salesRegion3, partnerABCD);

        ChannelPartner partnerKLMN = new ChannelPartner("KLMN");
        store.addToStoreAndDoWork(salesRegion2, partnerKLMN);
        store.addToStoreAndDoWork(salesRegion3, partnerKLMN);

        ChannelPartner partnerWXYZ = new ChannelPartner("WXYZ");
        store.addToStoreAndDoWork(salesRegion2, partnerWXYZ);

        // --------

        Customer customer = new Customer("Customer-A");
        store.addToStoreAndDoWork(customer);

        Employee employee = new Employee("Employee-1");
        store.addToStoreAndDoWork(employee);

        Supplier supplier5 = new Supplier("Supplier-5");
        store.addToStoreAndDoWork(supplier5);

        Supplier supplier6 = new Supplier("Supplier-6");
        store.addToStoreAndDoWork(supplier6);

        // ---------

        Map<Class, Set<Object>> wickedMap = store.getPerTypeStore();
        for (Class key : wickedMap.keySet()) {
            Set<Object> set = wickedMap.get(key);

            System.out.println("Key: " + key.getName() + ", ClassLoader: " + key.getClassLoader());
            System.out.println("Data: " + set);
            System.out.println();
        }
    }

    // --------

    public static class CustomClassLoader extends URLClassLoader {
        protected final String nameAsPath;

        public CustomClassLoader() {
            super(new URL[0]);

            String name = SalesRegion.class.getName().replace('.', '/');
            this.nameAsPath = "/" + name + ".class";
        }

        @Override
        public Class<?> loadClass(String name) throws ClassNotFoundException {
            if (name.equals(SalesRegion.class.getName()) == false) {
                // Delegate to System ClassLoader.
                return getParent().loadClass(name);
            }

            Class clazz = null;

            try {
                // Read the raw bytes of the SalesRegion class file from the classpath.
                InputStream inputStream = SalesRegion.class.getResourceAsStream(nameAsPath);
                if (inputStream == null) {
                    throw new NullPointerException("Could not find Class file");
                }

                ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

                byte[] buffer = new byte[512];
                int c = 0;
                while ((c = inputStream.read(buffer)) != -1) {
                    outputStream.write(buffer, 0, c);
                }

                byte[] classAsBytes = outputStream.toByteArray();

                // Define the class in this loader - a fresh Class object every time.
                clazz = defineClass(name, classAsBytes, 0, classAsBytes.length);
            }
            catch (Exception e) {
                throw new ClassNotFoundException(e.getMessage(), e);
            }

            return clazz;
        }
    }
}


The dummy Key that would be loaded by multiple ClassLoaders:


public interface SalesRegion {
}


Sample output:


Key: temp.demo.SalesRegion, ClassLoader: temp.demo.HackyTypeBasedStore$CustomClassLoader@c2ea3f
Data: [ABCD]

Key: temp.demo.Customer, ClassLoader: sun.misc.Launcher$AppClassLoader@fabe9
Data: [Customer-A]

Key: temp.demo.Supplier, ClassLoader: sun.misc.Launcher$AppClassLoader@fabe9
Data: [Supplier-6, Supplier-5]

Key: temp.demo.SalesRegion, ClassLoader: temp.demo.HackyTypeBasedStore$CustomClassLoader@1034bb5
Data: [KLMN, WXYZ]

Key: temp.demo.SalesRegion, ClassLoader: temp.demo.HackyTypeBasedStore$CustomClassLoader@b162d5
Data: [KLMN, ABCD]

Key: temp.demo.Employee, ClassLoader: sun.misc.Launcher$AppClassLoader@fabe9
Data: [Employee-1]

Wednesday, October 10, 2007

Expression Languages (OGNL and MVEL)

For those of you using Expression Languages in your programs to add those bits of pluggable logic, I'm sure you've evaluated or probably even use OGNL. StreamCruncher uses OGNL 2.7+ to handle some of the messier parts of Expression evaluation and I've found it to be a huge time saver. What's even better is that the 2.7 version also converts the Expressions into dynamically generated Bytecode.
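
If you haven't used it before, here's a minimal sketch of what OGNL evaluation looks like (the root map and the expression are just made-up examples; check the OGNL docs for the exact API of the version you pick):

import java.util.HashMap;
import java.util.Map;

import ognl.Ognl;
import ognl.OgnlException;

public class OgnlSketch {
    public static void main(String[] args) throws OgnlException {
        // A hypothetical "root" object - OGNL navigates its properties.
        Map<String, Object> root = new HashMap<String, Object>();
        root.put("price", 42.5);
        root.put("quantity", 10);

        // Parse once, evaluate many times. OGNL 2.7+ can compile such
        // expressions down to Bytecode behind the scenes.
        Object expression = Ognl.parseExpression("price * quantity");
        Object result = Ognl.getValue(expression, root);

        System.out.println("price * quantity = " + result);
    }
}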

So, if you want to evaluate OGNL, here is a list of useful links. They are not easily locatable, so I thought this would also be a good place for me to bookmark them for future use.

Here's the old link - OGNL
The 2.7+ versions are handled by this guy Jesse - his Blog
The latest releases - on OpenSymphony

I also strongly suggest evaluating another very well done Expression Language - MVEL, which is written by Mike - his Blog

Saturday, October 06, 2007

StreamCruncher 2.2 Release Candidate

The 2.2 Release Candidate is now available. This has some important performance related changes over the 2.2 Beta version. Over the past few releases, I've spent a considerable amount of time working on parts of the Kernel to perform end-to-end processing without having to go to the Database. This version performs Correlation Query processing and single Stream Query processing entirely in Memory. As a result, there are some things that don't work - like the "Order by" and "Group by" clauses. You might call it laziness, but there's only so much time I can spend on this, what with a day job and all.

Anyway, the performance has shot up to very respectable figures. The CorrelationPerfTest that I spoke about in my previous blog can now process a total of 168,000 Events per second on a single Processor, dual Core 1.8 GHz Centrino with 2 GB Memory. The Test has 3 Correlation Queries. Two of them correlate 3 Streams each and one Query correlates 2 Streams.

It's been a long journey. I'm so glad that SC can do this many Events per second now. I remember being quite worried a year and a half ago, when it could not do more than a few hundred events per second.

Monday, August 20, 2007

StreamCruncher 2.2 Beta

Over the past few weeks I've been meddling with the Correlation engine inside StreamCruncher. The 2.2 Release is a result of that ongoing work - "ongoing", thus the Beta status. Nevertheless, this release includes rather drastic changes to the Correlation code (the alert..using..when.. clause), in that the Kernel no longer requires the use of the Database to perform Pattern matching - thereby increasing performance and decreasing latency several fold!

A new TestCase has been added - CorrelationPerfTest, that demonstrates this wonderful improvement. In this TestCase, there are 4 Streams of Events that are monitored by 3 different Queries, each looking for a distinct Pattern. Since each Query runs in its own group of Threads, the Test scales well on Multi-Core/CPU systems. This Test also generates and consumes large amounts of data and thus serves as a Stress Test too.

Since this is an interim release, there are a few features that are still being developed - like the "case...when.." clause that used to work in previous releases. Now that the Database is no longer being used for Pattern matching, those features that were freely available in the Database are yet to be replicated by the Kernel.

Thursday, August 16, 2007

Instantiate and initialize a HashMap in one statement

There must've been quite a few occasions when you found yourself writing an Interface for your project that stored all the Constants - a centralized place for all the names and values.

In my case, I've had to store not just a bunch of names, but associate some additional descriptions with the names as well. So, how do you do it if you are using an Interface to store these Constants and at the same time initialize them?

Well.. there's a way of creating a HashMap and initializing it with values in a single statement. All it needs is an Anonymous Inner Class that extends HashMap and an unnamed code block between curly braces. Don't mistake this for the static code block that we use to load JDBC Drivers. This unnamed block is an instance initializer - it runs every time an instance is created, right after the HashMap's Constructor is invoked.

This is how you do it:
[Updated Apr 12, 2012]

public interface ProjectConstants {
    /* Remember, all constants in an interface 
       are "public static final" so you don't have 
       to declare it. */

    Map<String, String> NAMES = 
            Collections.unmodifiableMap(
                    /* This is really an anonymous 
                       inner class - a sub-class of j.u.HashMap */ 
                    new HashMap<String, String>() {
                        {
                            //Unnamed init block.
                            put("homer", "simpson");
                            put("mickey", "mouse");
                            put("eric", "cartman");
                        }

                    });

    .. .. .. .. more constants .. 

    int HIGHEST_ALLOWED_VALUE = 37;
}
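
If you were wondering what it looks like at the call site, here's a tiny usage sketch (the class ProjectConstantsDemo is hypothetical, just for illustration). Since the map is wrapped in Collections.unmodifiableMap, any attempt to modify it fails at runtime:

public class ProjectConstantsDemo {
    public static void main(String[] args) {
        // Reads like any other constant.
        System.out.println(ProjectConstants.NAMES.get("homer")); // simpson
        System.out.println(ProjectConstants.HIGHEST_ALLOWED_VALUE); // 37

        // The unmodifiableMap wrapper makes it read-only.
        try {
            ProjectConstants.NAMES.put("bugs", "bunny");
        }
        catch (UnsupportedOperationException expected) {
            System.out.println("NAMES is read-only, as intended");
        }
    }
}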

Who is James Randi?

James Randi is a world renowned rationalist and skeptic. A "MacArthur Genius Grant" winner, famous for his debunking of Uri Geller, Homeopathy, Auras, Psychics, Faith Healers, Psychic Crime solvers ... the list goes on. Here's an entertaining list of Videos on YouTube where he does all this.

Friday, July 20, 2007

Some fascinating videos I found on the Internet about Lost Civilizations, Archeology, strange artifacts...

Might be improbable, but not impossible. Enjoy!

Wednesday, July 18, 2007

StreamCruncher 2.1 Release Candidate

The follow up to the 2.0 Beta release is now ready. This release has some minor bug-fixes and feature enhancements:

1) Pre-Filters for Input Event Streams now support <, >, !=, =, *, /, +, -, in (..), not in (..), and, or. The in clause can refer to an SQL Sub-Query. Such Sub-Queries are cached by the Kernel to improve performance

2) An additional property cacherefresh.threads.num can be configured to specify the number of Sub-Query Cache processing Threads to use

3) 2 new Test cases have been added to test the new features - H2StartupShutdown3Test and ThreeEventOrderTest

Saturday, July 07, 2007

There's a Website called LibraryThing where you can create your own Reading list and look at what others are reading, search for books based on what you've read and a lot of other nice things. You can even share your list for others to see.

Wednesday, July 04, 2007

StreamCruncher 2.0 Beta is ready!

This 2.0 version is the result of a major refactoring job.

1) The API has been greatly simplified. The internal architecture has changed considerably, resulting in a vast improvement in performance. The TimeWindowFPerfTest (single Query on a Stream) can do 25,500 Events per second on a 1.6 GHz Centrino! More details and log files in my next Blog

2) The plain Window constructs that were available in previous versions have been removed entirely. Partitions are the only construct now. The syntax for Simple (anonymous) Window Partitions has changed slightly, in that there is no by keyword between partition and store

3) An additional property db.schema can be specified, as the Database Schema in which the Kernel creates its internal artifacts

4) Chained Partitions from now on, must always have a Pre-Filter clause starting with $row_status is new/dead

5) The Kernel can now accommodate Events that arrive out-of-order. OutOfOrderEventTest demonstrates this new ability

6) Pre-Filter temporarily does not support the complete SQL grammar like in, exists

Phew! It took me more than a month to make these changes. The internal Event stores for Input Event Streams and Partitions have changed considerably. The idea was to keep the Events inside the Kernel's process for as long as possible and delay inserting the Events into the Database to the last stages. As a result, Events get passed around as references most of the time. The latency per Event has dropped dramatically.

There are a few things that still need to be completed/cleaned up - the Pre-Filter clause syntax, for one, and Kernel restarts do not work correctly yet. This version comes with the latest version of the H2 Database, which now supports Table-level concurrency - a much needed feature for StreamCruncher.

Thursday, June 28, 2007

As mentioned in my previous Blog, StreamCruncher 2.0, when it is ready (hopefully in a week or two), will have a much simpler API and will make it easier to set up Input Streams, publish Events and consume the results from the Output Stream. The end User will not have to deal with the Database at all.

Even inside the Kernel, the use of the underlying Database has been minimized to the extent possible. Most of the operations are done inside the Kernel! This means that Events are passed around as references, which reduces the overall GC pause times because of reduced garbage (Event copies). Pre-Filtering in the Partition-Where clauses is also done inside the Kernel itself, instead of relying on the Database. Chained Partitions will be much faster because the intermediate Database Tables have been removed!

As a result of all these changes, the performance has gone up considerably and the average latency has gone down.

Saturday, June 16, 2007

Coming soon...a leaner and meaner StreamCruncher 2.0. With rather major changes to the internal architecture and API. Some preliminary tests on the TimeWindowFPerfTest revealed a 75% increase in performance!! Previous results.

Tuesday, May 29, 2007

StreamCruncher 1.14 is available! This release comes with a new feature/syntax - the self# clause, to perform efficient Self-Joins over Streams. Self-Joins are useful when the Events in a Window have to be scanned/matched against Events from the same or other Windows defined as part of the same Partition clause.

The StockPriceComparisonTest TestCase demonstrates the use of this new syntax. More details in the "StreamCruncher Basics" - documentation.

Sunday, April 22, 2007

StreamCruncher 1.13 Release Candidate is ready!

1) This version includes support for Oracle 10g and has been tested on Oracle Enterprise 10.2.0.

10g, being an Enterprise grade Database, requires tuning by a DB expert before you start using it as the underlying Database for StreamCruncher. I don't claim to be an Oracle expert, so I'd ask my DBA to set up the Database for very low latency, deferred disk flushes and larger Page and Cache sizes, and let him/her translate that into the necessary Oracle configuration changes. People have been creating TableSpaces on RAM Drives, mostly to host Indexes for Tables that are constantly modified and heavily contended. I'd also consider creating the whole Database on such a RAM Drive.

StreamCruncher also creates Tables and Indexes for internal purposes. You'll have to ensure that the Schema in which these get created (usually the User name provided in the StreamCruncher DB Config file) is on the TableSpace that is tuned & configured for this purpose.

Another good thing to tell the DBA would be the way Events/Rows are operated upon in the Database by StreamCruncher. In any Internal Table, Events are mostly pumped by one Thread and consumed by another Thread - very similar to a Queue or a Conveyor Belt. Updates are done on an Indexed Column, mostly on a small set of Rows that are usually in the Page Cache. All DB access (Insert/Update/Delete) by StreamCruncher on its Internal Tables is through Indexes - some Unique and some not.

Remember, Oracle Database Tuning is an Industry in itself. Make sure you've tuned your setup well!

2) There was another small Concurrency issue in the Kernel that has also been fixed - the last of these kinds of issues, hopefully. So, I've finally got rid of the "Unique Index Violation" errors I used to get only on Multi-processor machines. Version 1.12 had it fixed for Single Processor machines. I also have to admit that this fix affects performance on single Processor machines too, though the increase is only about 10-13%.

Tuesday, April 17, 2007

Event Processing in 2007 and beyond

An Essay, by Ashwin Jayaprakash (Apr 2007)
Website:
http://www.streamcruncher.com

A sure sign of a technology becoming mature and widely accepted is when developers start losing interest in it and crave something more. How many of us still take pride in designing and writing applications that use Stateless Session Beans, Message Driven Beans and ActionForms anymore? Let's face it: the tools have matured, and we have Blueprints and Design Patterns that guide us at every step of a project. All the (remaining) vendors now offer more or less similar sets of features. Clustering, WebService stacks, and basic tooling are givens now.
We have grudgingly moved towards some sort of interoperability. Some of us feel that XML is overused and misused, but still love XQuery and XPath. SOAP took its own time to evolve from Draft specifications. When it did, most of us realized that it was too raw and bloated to do anything really useful. So, we had to wait for all the sister specifications like Addressing, Transaction, Delivery to reach a decent level of sophistication - or switched to service APIs that don't have formal specifications, like REST or XML-RPC. In the meantime, there were other specifications fighting for relevance - BPEL, Workflow and others to handle Orchestration... that relied on WebServices to do the dirty work. And when we were not looking somebody slid in SOA and ESB.

Where are we really headed?

What will we poor Developers have to fiddle around with over the next few years? After years of maneuvering and fighting, we finally have specifications for almost everything - enough for the CTO and his/her team of Architects, in companies where Technology is not a core business, to think at a higher, more abstract level and treat IT Projects merely as "enablers". If we filter out all the fluff from the ESB, SOA, WS, WSA and other technologies that hide behind catchy acronyms, we have to acknowledge that Software is slowly moving towards commoditization. The Electronics/Hardware industry achieved that several decades ago. It was probably easier for them because they left all the hard parts of "Programmable logic and customization" to the Software industry.
We have 2 widely used programming languages. XML, XPath and XQuery to transform and massage data from one format to another. WebServices and their related protocols and formats to send and receive data, which usually happens to be an XML payload, mostly within the Company's Intranet. BPEL to handle more complex workflows, some of which extend beyond the Intranet. ESB provides the platform to streamline all the different formats and protocols that we use both internally and externally. So on and so forth. Essentially, we finally have things that resemble "Software Components". The situation sounds similar to the Motorcar revolution we've all heard of, which happened at the beginning of the previous century. The Internal combustion engine became more efficient (think of Java/JEE and .Net), Coach builders started making Car bodies, accessories and trimmings (think Packaging - XML and related technologies), good roads and extensive Highway networks got built (think WebServices, ESB etc) and then gradually Traffic cops got replaced by complex Traffic lights and other control networks (think BPEL and competing Workflow/Orchestration standards).
What we will see in the future will be more intelligent IT Systems - and if you want to carry the Motorcar analogy even further, something like Cars that can drive themselves and a "Big-Brother" like network that co-ordinates all traffic. We in the Software industry are definitely heading that way. RFID is finally, commercially viable. RFID and the technologies that it will foster will lead to more prevalent and intelligent Warehouses, smarter Assembly lines, better integration between Traffic control, Accident avoidance and GPS Navigation systems. A host of systems that can automatically respond to feedback and act quickly. Automatic Trading is one among them. One of the technologies crucial to such development is Event Processing. This is a broad category of complementary technologies - Business Activity Monitoring, Event Stream Processing, Complex Event Processing, Event Driven Architecture. Some of these might appear to be combinations of the others. To keep it simple, we'll label them all as Event Processing technologies.
Some of these technologies have been around for several decades in one form or another. Control Systems have custom, in-built, realtime, hardwired Event Processing. Some companies have been using Rule Engines either directly as Inference Engines or just to respond to Facts (as Events). Simple Database Triggers and Stored Procedures are still popular. In short, Event Processing in its new avatar is going to become more widely used in the future. We have also been observing many Event Processing Products and Projects stepping out of their niches by offering general purpose stacks that can be integrated into almost any kind of application.

The tenets of Event Processing

A facility to consume the state(s) of a System as discrete Events; a language or User Interface to specify the conditions/properties to look for in those Events, from one System or across Systems (Correlation); and a way to specify actions to be performed when such situations arise.
It is no accident that Event Processing looks like a cross between Database Query processing and Rule Engines. There is no denying that the technology evolved from existing technologies. However, the distinguishing characteristics of Event Processing technology are the constructs it provides to specify Time based boundaries, fixed size moving Windows, Aggregation and Partitioning/Grouping of Events on a stream of incoming Events, and a facility to continuously scan these Events for telltale signs - all this expressed as a concise Query.
Database Triggers, which in turn execute SQL/Stored Procedures every time an Event/Row is inserted, can provide rudimentary Event Processing functionality. Rule Engines can be used to discern information from Facts being asserted. But none of these technologies alone provides a concise and yet very expressive Domain Specific Language, along with a platform specifically built to constantly monitor/process live streams of Events.
There are just a handful of Event Processing Stacks today. Some are very high end, some are very high end but address a specific vertical like Automatic Trading, some are general purpose stacks meant for everyone, and a few have been re-packaged and/or re-labelled and shoved into the fray.
Each Stack is implemented in a different way, which shows in the way the Query constructs have different syntax, but the semantics are similar.
Let's look at how one such Event Processor can be used to solve those real-world problems that have been nagging us. No, I don't mean Global Warming, Poverty or the AIDS epidemic. We'll leave that to SOA and ESB.
StreamCruncher is an Event Processor written in Java. It uses a Database to maintain the Events and other artifacts. Since latency and performance are crucial to Event Processing, Embedded and In-Memory Databases are preferred. It exposes the SQL language and its capabilities that we have come to rely so much on, by extending it with an "Event Processing Query Language". The Event Processing Kernel is multi-threaded, and processes Events (and Queries) in stages, to leverage the power of Multi-Core Processors. The Database does some of the heavy lifting by doing the parts it does best like Joining, Sorting etc.
Imagine that there is a small community of farmers who grow organic Oats, Wheat, Corn and other stuff. They sell their produce and a few other products like Premium Breakfast Cereal, Peanut Butter, Free-range Eggs etc as a Co-operative. To reach a wider customer base, they opened a small Website 2 years ago to cater to online orders. Recently, when an international TV Station featured this Community's products on their Health show, their online sales skyrocketed and almost crashed and burned their Servers. So, they are compelled to upgrade their Website and add a few features to it to ease the burden on their overworked, small IT team.

Online Retail Store

This is what they want for their Online Store: They have quite a few Customers like Spas and Department store chains around the nation, which have arrangements with the Farm, where they place regular, automatic, bulk Orders. These Customers are treated as Priority Customers and they have Service Level Agreements to fulfill the Orders (Verify Payment, alert the Warehouse and Shipment team etc) within 30 minutes of placing the Order. If 30 minutes pass without an Order being "Fulfilled", then an alert is raised for further investigation.
We create 2 Streams to receive Events - one for Orders placed and another for Order Fulfillments. They receive a constant stream of Events, which have to be monitored for SLA failures. This can be accomplished by a simple Event Processing Query in StreamCruncher, using the Window constructs and the "Alert.. Using..When.." Correlation clause. This is a classic Correlation example.
To fully understand all the features and syntax of the Query language and the procedure to set up Input and Output Event Streams and other operations, please consult the Documentation. Here's the skinny on the syntax: A Sliding Window is defined like this - "store last N", a Time based Window like this - "store last N milliseconds/seconds/minutes/hours/days" and a Tumbling Window like this - "store latest N". The Stream can be divided into small Partitions with Windows defined at specific levels, like this - "partition by country, state, city store last N seconds". This creates Time based Windows at the "Country > State > City" levels.
SLA Failure Alert:
select unful_order_id, unful_cust_id
from
alert
order_event.order_id as unful_order_id,
order_event.customer_id as unful_cust_id
using
cust_order (partition by store last 30 minutes
          where customer_id in
          (select customer_id from priority_customer))
         as order_event correlate on order_id,
fulfillment (partition by store last 30 minutes
           where customer_id in
           (select customer_id from priority_customer))
         as fulfillment_event correlate on order_id
when
present(order_event and not fulfillment_event);
Let's analyze the Query. There are 2 Tables cust_order and fulfillment, into which the 2 Event streams flow. The Query looks like an SQL Join Query, although this syntax is more expressive, yet concise. The Alert clause is specifically for performing Event Correlation across multiple Streams. Other scenarios described further ahead use syntax that are only minor extensions of SQL, unlike the Alert clause. The Query keeps running on the 2 Streams, monitoring the 2 Streams and generating Output Events as SLA Failure Alerts. The "Alert..Using.." clause consumes Events from the 2 Streams. The "correlate on" clause is like the "join on" clause in SQL which specifies the column/Event property to use to match Events from different Streams. The "When.." clause specifies the Pattern to look for. This Query is meant to output an Event when the Customer Order Event expires without receiving a corresponding Fulfillment Order within 30 minutes since the Order was placed, thus producing an SLA Breach Alert. Such Queries are also called Pattern Matching Queries as they are configured to look for the presence or absence of specific combinations of Events.
Since the Farm has these contracts with their corporate consumers, they have to maintain a minimum number of items in stock in their Warehouses distributed around the nation, to avoid an SLA breach.
We have to setup a mechanism which constantly monitors the items in stock in the Warehouses. When an Order is placed, the products and their quantities are deducted from the Warehouse's stock. Products are supplied from different Warehouses based on the location closest to the Ship-to address. Warehouses are also stocked periodically and the products and quantities are added to the Warehouse's stock. So, we receive 2 kinds of Events - Orders, which reduce the Stock and so carry a negative number for quantity and Re-stocks, which increase the Stock.
We also use a regular Database Table, to store the minimum stock that must be maintained for each product. When the stock for that product goes below the prescribed quantity an alert should be raised.
Warehouse Re-Stock Alert:
select country, state, city, item_sku, sum_item_qty,
 stock_min_level
from warehouse (partition by country, state, city, item_sku
   store latest 500
   with pinned sum(item_qty) entrance only as sum_item_qty)
   as stock_level_events,
   stock_level_master
where stock_level_events.$row_status is new
    and stock_level_events.item_sku =
    stock_level_master.stock_item_sku
    and stock_level_events.sum_item_qty <
    stock_level_master.stock_min_level;
 
The "partition by country, state, city, item_sku" means that separate Windows are maintained based on the Event's Country, State, City and Product. An Aggregate function - "sum(item_qty)" is used to maintain the sum of the quantities of the products in the Window.
The Window is a Tumbling Window, where Events are consumed in one cycle and discarded in the next cycle. Their lifespan is just one cycle. Since the aggregate here is the Sum of the "item_qty" of all the Events currently in the Window, it changes even when Events expire. But we don't want the Sum to change when an Event expires - we just want to keep the sum of all the items passing through the Window. The "entrance only" clause in the Aggregate function tells the Kernel to calculate the aggregate when an Event enters the Window and to ignore it when the Event is expelled from the Window. Since the Order and re-stock quantities are negative and positive numbers respectively, the aggregate automatically tracks the total quantity in the Warehouse.
Whenever the Sum goes below the minimum value specified in the "stock_level_master" Table, an alert is raised.
The "pinned" clause is used to hold/pin up all the Aggregates even if all Events in the Window expire. Otherwise, for "Time based windows" and "Tumbling Windows", the Aggregate if declared, expires along with the Window when all Events in that Window expire. It gets re-created when new Events arrive.
The ability to merge an Event Stream with a regular Table reduces the effort required to maintain the Master data - minimum values, adding/removing Products etc - because it is a regular DB Table. And the Query's behaviour is driven by the Master Table (stock_level_master).
Orders are placed from all over the nation. To reduce the overhead of shipping smaller orders, the process is streamlined by aggregating orders being shipped to the same City, which are then taken care of by local distributors. This way, the Farm can take advantage of bulk-shipment offers.
Shipment directives are held for 30 minutes. If more than 5 such directives are received within those 30 minutes, to the same City, they are dispatched in batches of 5 even before the 30 minutes expire. Thus saving the overhead of shipping smaller orders individually.
Shipment Aggregator:
select country, state, city, item_sku, item_qty,
 order_time, order_id
from cust_order
(partition by country, state, city
store last 30 minutes max 5)
as order_events
where order_events.$row_status is dead;
This Query produces Output Events when an Event gets expelled from the Window, as indicated by the "$row_status is dead" clause after it has stayed in the Window for 30 minutes or expelled before that as part of an "Aggregated Order" when the Window fills up with 5 Order Events.
The Website itself has to be spruced up with some nifty features. To draw the attention of Online visitors, the top 5, best selling items over the past 30 days are to be displayed. This is accomplished by a Query that uses a Chained Partition clause, where the results from one Partition are re-directed to another Partition, which slices and dices the intermediate data into the final form.
Best Selling Items:
select order_country, order_state, order_category,
 order_item_sku, order_total_qty
from cust_order
  (partition by order_country, order_state, order_category,
   order_item_sku store last 30 days
   with sum(order_quantity) as order_total_qty)
to
  (partition by order_country, order_state, order_category
   store highest 5 using order_total_qty
   with update group order_country, order_state, order_category,
   order_item_sku where $row_status is new) as order_events
where order_events.$row_status is not dead;
This Query uses a "Chained Partition", where 2 different Partition clauses are chained by means of the "to" keyword and the results from the first Partition are fed to the second Partition, whose results form the input to the Query. Such chaining is required when a single Partition cannot accomplish the task. The first Partition stores a rolling sum of the total Quantity sold at the "Country > State > Category > SKU" level and those Sums are sent to the second Partition, which uses the "..is new" clause to pick up only the newly calculated/changed Sums. The second Partition uses a "Highest Rows/Events" Window, which is a slight variation of the Sliding Window, to maintain an up-to-date list of the "top 5" selling items in each "Country > State > Category".
As can be seen from the examples above, Event Processing is quite versatile and can be used to implement many Use Cases concisely and efficiently. All Event Processors are built to handle good data rates, which greatly simplifies the development and maintenance cost of the applications using them.


Friday, April 13, 2007

Finally!! StreamCruncher 1.12 is ready and it's no longer a Beta version. This is the Release Candidate.

(Also, performance test results - read further. Hint: 8,000 TPS !! on 1.6 GHz Laptop)

I found the time to review some parts of the Kernel code. It turned out that there were small things here and there that needed fixing. Since the Kernel is heavily multi-threaded, it was important that locking be reduced. As a result, the CAS (Compare and Set) operations (Java 1.5+) are used in many places. This is much faster than actually waiting on a lock and then realising that the logic in the protected section does not have to be executed anyway.
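
For those who haven't played with java.util.concurrent.atomic yet, here's a minimal, generic sketch (not the Kernel's actual code) of the kind of lock-free check-then-act that CAS makes possible:

import java.util.concurrent.atomic.AtomicBoolean;

public class CasSketch {
    // A one-shot flag: many Threads may race to do some piece of work,
    // but only the Thread whose CAS succeeds actually does it.
    private final AtomicBoolean done = new AtomicBoolean(false);

    public void doOnce() {
        // compareAndSet atomically flips false -> true for exactly one caller.
        if (done.compareAndSet(false, true)) {
            // ... perform the one-time work ...
        }
        // Everyone else falls through immediately - no lock acquired,
        // no waiting, unlike synchronized { if (!done) { ... } }.
    }
}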

After fixing these issues, I modified the "TimeWindowFPerfTest" class to capture more metrics. Apart from just calculating the Average Latency added to each Event by the Kernel in a Straight/Simple processing case, this Test now calculates the average total time it takes to insert rows into the Database and for the Kernel to publish them.

The Test was already described before. This time, with the bug fixes, there were no Index-violation exceptions. So, on my Laptop running Windows XP Home with 1 GB RAM and a single 1.6 GHz Intel Centrino Processor, I ran the "TimeWindowFPerfTest" performance test using the Sun JDK 1.6 and StreamCruncher 1.12 with the H2 Database.

I redirected the verbose Console output to a log file and thereby eliminated the otherwise excessive overhead added by the Console logging. This way, I also have proof of all the Tests that were performed.

The Test uses a Thread to generate and pump 'X' events in one shot without pausing. A Query with Time based Partition is defined on this Stream. The Window size is 5 seconds. A "$row_status is new" clause is used to output only the new Events that arrive at the Window and not the ones that exit the Window when their 5 seconds are over. This way, an accurate measurement of how much overhead the Kernel is imposing can be calculated. The total time taken for the entire batch to be inserted and for it to be pumped out of the Kernel can also be calculated. This can then be used to calculate the Transactions per second - the most important metric.

The Test pumps these 'X' Events, waits for some time that is sufficient for the Events to clear the area, and then pumps the same number again...and again. At the end of the Test, the results are retrieved and verified, and then the Averages are calculated.
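
The arithmetic behind the numbers below is simple. Here's a rough, hypothetical sketch of the kind of bookkeeping involved (not the actual TestCase code - the timestamps are made up, but in the same ballpark as the Set 2 results further down):

public class BatchMetricsSketch {
    public static void main(String[] args) {
        long batchSize = 8000;

        // Timestamps (in millis) recorded per batch by the real Test.
        long firstInsertTime = 0;     // first Event of the batch hits the DB
        long lastInsertTime = 600;    // last Event of the batch hits the DB
        long lastPublishTime = 1050;  // Kernel publishes the last Event of the batch

        long insertMillis = lastInsertTime - firstInsertTime;
        long endToEndMillis = lastPublishTime - firstInsertTime;

        // Batch size divided by first-insert-to-last-publish time.
        double eventsPerSecond = (batchSize * 1000.0) / endToEndMillis;

        System.out.println("Insert time (ms): " + insertMillis);
        System.out.println("End-to-end time (ms): " + endToEndMillis);
        System.out.println("Approx Events/sec: " + eventsPerSecond); // ~7600 here
    }
}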

Ok, here it comes..Keep in mind that this is a single CPU and the Event "pumper" and the Kernel are running in parallel. The H2 Database (current version) is completely single-Threaded - and so there's no concurrency at all, even though StreamCruncher supports concurrent operations.

I ran 3 rounds for each configuration and here are the results:

Set 1 (4000 Events per Batch):

Set 1 - Round 1
Total events published: 36000. Each batch was of size:4000. Avg time to publish each event (Latency in Msecs): 224.0
Avg time (in Msecs) to insert 4000 Events into the DB: 418.0
Avg time (in Msecs) to process 4000 Events by the Kernel: 376.0
Avg time (in Msecs) for the insertion of first Event in the batch of 4000 Events into DB to publication of last Event in batch by Kernel: 598.0

Set 1 - Round 2
Total events published: 36000. Each batch was of size:4000. Avg time to publish each event (Latency in Msecs): 199.0
Avg time (in Msecs) to insert 4000 Events into the DB: 428.0
Avg time (in Msecs) to process 4000 Events by the Kernel: 397.0
Avg time (in Msecs) for the insertion of first Event in the batch of 4000 Events into DB to publication of last Event in batch by Kernel: 600.0

Set 1 - Round 3
Total events published: 36000. Each batch was of size:4000. Avg time to publish each event (Latency in Msecs): 261.0
Avg time (in Msecs) to insert 4000 Events into the DB: 387.0
Avg time (in Msecs) to process 4000 Events by the Kernel: 336.0
Avg time (in Msecs) for the insertion of first Event in the batch of 4000 Events into DB to publication of last Event in batch by Kernel: 591.0

Set 2 (8000 Events per Batch):

Set 2 - Round 1
Total events published: 64000. Each batch was of size:8000. Avg time to publish each event (Latency in Msecs): 378.0
Avg time (in Msecs) to insert 8000 Events into the DB: 603.0
Avg time (in Msecs) to process 8000 Events by the Kernel: 699.0
Avg time (in Msecs) for the insertion of first Event in the batch of 8000 Events into DB to publication of last Event in batch by Kernel: 1044.0

Set 2 - Round 2
Total events published: 64000. Each batch was of size:8000. Avg time to publish each event (Latency in Msecs): 457.0
Avg time (in Msecs) to insert 8000 Events into the DB: 533.0
Avg time (in Msecs) to process 8000 Events by the Kernel: 666.0
Avg time (in Msecs) for the insertion of first Event in the batch of 8000 Events into DB to publication of last Event in batch by Kernel: 1013.0

Set 2 - Round 3
Total events published: 64000. Each batch was of size:8000. Avg time to publish each event (Latency in Msecs): 392.0
Avg time (in Msecs) to insert 8000 Events into the DB: 593.0
Avg time (in Msecs) to process 8000 Events by the Kernel: 839.0
Avg time (in Msecs) for the insertion of first Event in the batch of 8000 Events into DB to publication of last Event in batch by Kernel: 1064.0

Set 3 (10,000 Events per Batch):

Set 3 - Round 1
Total events published: 70000. Each batch was of size:10000. Avg time to publish each event (Latency in Msecs): 491.0
Avg time (in Msecs) to insert 10000 Events into the DB: 705.0
Avg time (in Msecs) to process 10000 Events by the Kernel: 783.0
Avg time (in Msecs) for the insertion of first Event in the batch of 10000 Events into DB to publication of last Event in batch by Kernel: 1220.0

Set 3 - Round 2
Total events published: 70000. Each batch was of size:10000. Avg time to publish each event (Latency in Msecs): 518.0
Avg time (in Msecs) to insert 10000 Events into the DB: 689.0
Avg time (in Msecs) to process 10000 Events by the Kernel: 845.0
Avg time (in Msecs) for the insertion of first Event in the batch of 10000 Events into DB to publication of last Event in batch by Kernel: 1198.0

Set 3 - Round 3
Total events published: 70000. Each batch was of size:10000. Avg time to publish each event (Latency in Msecs): 513.0
Avg time (in Msecs) to insert 10000 Events into the DB: 647.0
Avg time (in Msecs) to process 10000 Events by the Kernel: 743.0
Avg time (in Msecs) for the insertion of first Event in the batch of 10000 Events into DB to publication of last Event in batch by Kernel: 1151.0

While the Tests were running, I kept noticing that the CPU usage, even for the 10K Set, was not rising above ~15%, and that only for 1-second periods - which is quite puzzling. It might be because the Producer and the Consumer Threads are not really running in parallel, since the one common resource - the Database - is always locked by one of these 2 (sets of) Threads. I was expecting the CPU to peak and the Tests to crumble at the 10K Set. But it didn't, which is a very good sign.

This means that StreamCruncher can do about 8,000 Transactions Per Second (Straight/Simple Processing) on a very ordinary setup - roughly 8,000 Events end-to-end in about a second - and should perform considerably better on better hardware (more Cores and/or CPUs) and commercial Databases. This, combined with Horizontal Partitioning of the Stream data (using the Pre-Filters and multiple Queries to split the Events and process them in parallel), should produce fantastic performance.

The test results/logs can be downloaded from here.

Monday, April 09, 2007

StreamCruncher 1.11 Beta is available. No changes to the code though. I had forgotten to update the Syntax diagram in 1.10 Beta which included the "$diff" and other custom Provider changes.

Saturday, April 07, 2007

StreamCruncher 1.10 Beta is ready! This release includes:

1) $diff clause for In-built Aggregate Functions along with custom Baseline Provider feature. ClusterHealthTest demonstrates this new feature.

2) Custom Window size Provider should now be declared in the Query, statically. TimeWFPartitionWinSizeProviderTest demonstrates this change.

3) A simple class StandAloneDemo shows how to run a sample using just the Java main(..) method. sc_run_standalone.bat can be used to run the Demo

Monday, March 26, 2007

Here's another reason why using an RDBMS in StreamCruncher was not a bad design decision at all. RDBMSes (what's the plural of RDBMS?) have attained a status (in Middleware) that is unequalled by any other technology. We trust RDBMSes to store our Bank balances, a nation's entire Census data, Criminal records - the list goes on and on. So, when such a technology is always within easy reach, common sense tells us that it should not just be used, but embraced... Ok, enough of that, and moving to the point...

Many people who are familiar with ESP and CEP, when they learn that StreamCruncher uses an RDBMS underneath, probably wince at the mere mention of a Database. To them, Databases are good, but not good enough for Stream Processing. To many, the RDBMS is the antithesis of Stream Processing. Why? They hold forth vehemently on the "presumed" drawbacks of such an architecture. They start lecturing about Performance. Speed. "Sub-millisecond latency cannot be achieved in a regular RDBMS".. and so on.

Well, those objections are not entirely true. Even if you ignore the fact that StreamCruncher already supports several Embedded, In-Memory, Real-time Databases that are routinely used in Telecom (the DBs), there is a classic solution for regular RDBMSes that will make them as good as any In-Memory Database. The biggest hurdle is the Hard disk - Persistence. Databases are meant to store data for posterity, but storing data to the Disk also adds a lot of latency. All the other things offered by the Database - concurrency, scalability etc - are very much required for Event Processing; Persistence is not required for Event Stream Processing. So, a regular Enterprise class Database can still be used as the base for StreamCruncher, providing a solid/robust foundation on which to perform ESP/CEP - all this by creating the Database Tablespace on what is called a RAMDisk. That is just a soft drive, where a portion of the physical Memory/RAM is turned into a Storage Drive (like C:\ or /usr/home/myname). It acts like any other Drive where files can be created, but everything gets wiped out when the Machine is rebooted. This technique is not something new, but it fits perfectly well in the context of StreamCruncher. RAM Drives can be created for almost any Operating System - a simple search on the Internet reveals several products and techniques for creating them.
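
As a concrete illustration with H2 (which StreamCruncher already ships with) - and this is just a hypothetical sketch, the paths and credentials are made up - the only thing that distinguishes a purely in-memory setup from a file-based setup whose files happen to live on a RAM Drive is the JDBC URL:

import java.sql.Connection;
import java.sql.DriverManager;

public class RamBackedDbSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.h2.Driver");

        // Purely in-memory H2 database - nothing ever touches the disk.
        Connection inMemory =
                DriverManager.getConnection("jdbc:h2:mem:eventstore", "sa", "");

        // File-based H2 database whose files live on a RAM Drive
        // (R:/ here is just an example mount point for such a drive).
        Connection onRamDrive =
                DriverManager.getConnection("jdbc:h2:file:R:/eventstore", "sa", "");

        inMemory.close();
        onRamDrive.close();
    }
}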

So, all the licensing costs that a Company has incurred on acquiring and maintaining these "Big Daddy" Databases can still be leveraged for Event Stream Processing. In the end, StreamCruncher gets to use a time tested Database that Developers have been using in their other regular Projects and can rely on the stability of such DBs. The important Sorting, Pre-Filtering, Joining and other CPU-intensive operations happen in the Database Engine, which is in Native code. And of course the Developers' familiarity with such Databases also plays a crucial role in adoption and integration with other parts of the project.

Sunday, March 18, 2007

StreamCruncher 1.09 Beta is ready and it includes a major feature addition. From version 1.09 onwards, Multi-Stream Correlation a.k.a Pattern matching is possible. This feature enables the monitoring of and correlation across multiple Streams of Events using a simple SQL-like Query.

Although simple 2 Stream Correlation is possible using regular SQL plus Partitions, as demonstrated in the SLAAlertTest, watching for multiple Patterns across more than 2 Streams is now very easy with this new "alert..using..when.." clause.

MultiStreamEventGeneratorChainTest demonstrates this new feature.

Thursday, March 15, 2007

People have asked me how StreamCruncher performs. Until today I did not have a clear answer, because I did not have access to a good Server/PC. Last week I managed to borrow (just for a few minutes) a high end Intel/Windows 2K Server - 4 CPUs, 3.6 GHz each, with a total of 3 GB RAM.

Out of curiosity, I ran the streamcruncher.test.func.h2.H2TimeWindowFPerfTest test with all the default configurations to JVM, StreamCruncher 1.08 Beta, H2 Database etc, Parallel GC but with just 1 Collector Thread etc.

The test pumps 250 events in one burst, pauses for 6 seconds and repeats this for 4 cycles. So, it does 4 iterations of 250 events each. The test is simple Straight Through Processing - a simple Time Based Anonymous Partition, which holds Events for 5 seconds. The Test measures the average time it takes for an Event to get expelled from the Kernel (published by the Kernel) since the time it was created. This does not include the 5 seconds it stays in the Window; it measures the overhead added by the Kernel's processing.

Keep in mind that StreamCruncher is still in Beta. There are still a few rough edges to be taken care of - StreamCruncher has neither been Profiled nor have load & performance tests ever been run.

On my 1.6 GHz Win XP Laptop with 1 Gig RAM and JDK 1.6, the average Latency per Event for this test was a disappointing 380 Milliseconds. I was quite sure that the single CPU was bogging down the performance, because the Test creates and inserts live, randomly generated Data while the Kernel, which is heavily multi-threaded, shares the same CPU. Performance was below expectations. Since all these operations were in memory, there was practically no opportunity for the CPU to switch between Threads while another Thread was blocking on the Network or Disk. So, there was practically no concurrency.

But when I ran the same test on the 4 CPU Box, the average CPU utilization did not even change noticeably while it spat out Events at a fantastic 7-9 Millisecond latency. I was blown away by the numbers. Preliminary though it may be, the multi-threading change in Release 1.05 must've really paid off. Another important thing was that the Tests were printing verbose output to the Console, which slows down the whole application. So, with zero logging, or logging to a file, latency might've improved even further. However, an ugly bug reared its head when I re-directed the Output to a File - I kept getting a "Unique Constraint Violated" error quite often. That has to be fixed soon. Since these tests used the H2 Database, which in its current version does not support multi-threaded access, I'm hoping the performance of StreamCruncher will be exceptional on other In-Memory Databases like Oracle TimesTen.

Sometime over the next few months I'll conduct proper tests on a stabler version of StreamCruncher and publish my findings.

Wednesday, February 28, 2007

The JDBC Engineers at ANTs have been busy fixing the bugs (a & b) in their Driver. Today, I received an email from them saying:

We have been tracking your issue as case 1206. As per your statements, we had logged the following bugs, which have been resolved, fixed & verified.

Bug 1415: JDBC getParameterTypeName() returns unknown for Date Datatype.

Bug 1416: JDBC getParameterClassName() method returns hard-coded string as "java.sql.ParameterMetaData"

Bug 1596: ANTs JDBC driver doesn't get registered

Bug 1597: ANTs JDBC Issue - NULL is inserted when Long.MIN_VALUE is inserted into bigint column
Cool! So, we should be seeing these changes in their next release.

Saturday, February 24, 2007

Ah, a harrowing 46-hour journey with long stops in between, and now I'm in the US.

Here's a small list of things that one might consider to make a flight relatively easier - especially the Long-haul ones. First-time travellers on such flights might find it useful.

  • Buy one or two Calling cards before you fly. This way you can call home later to say that you've reached safely (Mumbai/India: Airtel Cards or Reliance cards are sold in the airport)
  • Carry sufficient cash, but store it safely across different Cabin bags. Carry coins to use in public phones - for calling a Cab etc
  • Visit the Airline's website to check what you can carry and what you cannot. There is some stuff you cannot carry in Cabin bags but is ok in Check-in bags
  • Long-haul flights wreak havoc on your lips, face and hands. The dry air in the Cabin and at Airports will cause serious lip cracks and very dry skin. Carry a small bottle of moisturizer and lip balm to rehydrate your skin. Believe me, it might sound silly, but skipping it will make you feel miserable later. Especially for kids. New rules specify that such liquids/creams cannot exceed a certain quantity if you are carrying them in your hand/cabin bag. Check with your Airline
  • Carry some basic medication for Headaches, something for Colds and Muscle pain too (Neck strain). Ask your Doc
  • Be prepared to be stranded in the Airport. Flights can get re-scheduled or cancelled. Carry a change of clothes in your Cabin bags. You might have everything in your Check-in bags. But what if you've already checked them in and then the flight gets delayed? Warm clothing is also important. Put these in your Cabin bags
  • Make Photocopies of your tickets, visa, passport and contact details (in the foreign country and at home), and keep them in a cabin bag separate from the one with the originals. Keep Emergency numbers in your wallet and a copy in your bag. Preserve your Boarding Passes until you've reached your final destination
  • If you are planning to drive, then get an International Driving Permit. It is accepted in some countries. Again do some research beforehand. Read up on the local driving rules. You can get a copy of the Driver's Handbook online - http://www.drivershandbook.com
  • Carry an unopened box of Breakfast Cereal or something like that in case you arrive at an odd time, like midnight and you can't go anywhere to eat
Bon voyage!

Wednesday, February 14, 2007

It's relocation time again. Singapore to Bangalore and now to the US. Which means that there probably won't be any StreamCruncher activity for the next few weeks. Not until I've settled down there.

Saturday, February 10, 2007

After a looong time, I spruced up my old Website - www.JavaForU.com. Over the past 5 years, I had struggled to keep the bookmarks on my Website up to date. Exporting from Netscape/Firefox to HTML, then cleaning it up into XML, even writing a couple of programs to convert the XML into a DOM Tree viewer..it was such a painful experience. So much so that I had entirely stopped updating my Bookmarks....until today.

Today, I bumped into a few cool Tools and Web Services that make Bookmark sharing a breeze. No, I'm not talking about those Social Bookmarking sites - I just don't like them. They don't preserve any hierarchy; everything gets dumped into a big Tag cloud. I'm talking about Grazr and other OPML renderers. Take a look at this. Cool!! Isn't it?!

It takes just 4 steps. I'll tell you how:

  1. If you are using Firefox, get this BookmarksSynchronizer plug-in. It allows you to export your Bookmarks into XBEL format (Don't ask me what that is)
  2. Upload your XBEL exported Bookmarks into this YabFog Site, which converts XBEL into OPML (Again, don't look at me)
  3. Save the OPML format file and upload it into your Website
  4. Link your OPML file (the one you uploaded to your Site) through its URL from Grazr
That's it! You now have a super cool Widget to show off your Bookmarks.

This is what it'll look like - My Bookmarks. You can even make it look like it's part of your Web Page, like I have in "www.JavaForU.com > My Favorites".

Friday, February 09, 2007

StreamCruncher 1.08 Beta is ready, with support for the pure-Java PointBase Database. PointBase is probably the most widely used embedded Java Database, though HSQL might beat it in terms of the sheer number of Desktops it is deployed on, as part of OpenOffice.

Even though PointBase does not support an In-Memory mode, the Database Page Size, Cache Flush Time and other settings can be modified to reduce Latency.

Wednesday, February 07, 2007

Speaking of computationally wasteful processing, yesterday's post listed a Query that needed re-writing.

select .... from
StreamA (partition by store last 10 minutes) as FirstStream,
StreamB (partition by store latest 25) as SecondStream

where FirstStream.eventId = SecondStream.eventId

and FirstStream.$row_status is not dead
and FirstStream.someColumn > 10
and SecondStream.$row_status is new
and SecondStream.otherColumn is not null;

Queries that use the "$row_status is not dead" clause must be careful not to perform too much filtering inside the main body of the Query, because those Criteria get re-evaluated every time the Query runs. So, if an Event remains in the Window for 20 Query Execution cycles, the "someColumn > 10" condition gets evaluated for that Event 20 times.

Had we pushed this Filter criterion into the Pre-Filter clause, it would've been evaluated only once per Event. And the Windows would not get polluted with Events that do not match the criterion, because such Events would not even make it into the Partitions.

Tuesday, February 06, 2007

There are a few things I've been meaning to write about regarding how to write StreamCruncher Queries that achieve good performance.

Say your Query filters Events based on some criteria in the Where clause, while correlating them with Events from other Streams, like this:


select .... from

StreamA (partition by store last 10 minutes) as FirstStream,
StreamB (partition by store latest 25) as SecondStream

where FirstStream.eventId = SecondStream.eventId

and FirstStream.$row_status is not dead
and FirstStream.someColumn > 10
and SecondStream.$row_status is new
and SecondStream.otherColumn is not null;

Here are some simple tips. The same concepts that you learned while optimizing SQL Queries apply here too.
a) Re-arrange the Filter conditions in the Where clause so that they appear before the Join (Correlation) and reduce the candidate Rows. In the Query above, the First and Second Streams are joined on eventId and the resulting combined Rows/Events are then filtered using the subsequent criteria like "FirstStream.$row_status is not dead and FirstStream.someColumn > 10". This is computationally wasteful because the Database joins all those additional Rows/Events from the 2 Streams and only then removes the ones that do not match the Criteria.
b) If the Events need to be Filtered, use the Pre-Filter clause in the Partition definition so that only the required Events are consumed

Thus, the final Query should look like this:

select .... from

StreamA (partition by store last 10 minutes
where someColumn > 10) as FirstStream,

StreamB (partition by store latest 25
where otherColumn is not null) as SecondStream


where SecondStream.$row_status is new

and FirstStream.$row_status is not dead
and FirstStream.eventId = SecondStream.eventId;

You will notice that the Events are Pre-Filtered in the Partition clause itself, where the unnecessary Events are weeded out even before they enter the Partitions. And since Events are fetched into Partitions in parallel with Query execution, you can shave off precious milliseconds by Pre-Filtering.

Another trick is to use the Table/EventStream with the least number of Rows/Events as the "Driving Table" (the first Table in the Join). Here, the SecondStream with "$row_status is new" will have fewer Events because it picks up only the newly arrived Events to join with the other Stream. This speeds up Join processing, if the Database underneath uses Hash-Joins.

It is also recommended to place the filter criteria like "$row_status .." (the ones that cannot be Pre-Filtered) before the Join clause. So, put "FirstStream.eventId = SecondStream.eventId" at the end, after culling the Rows that are not needed, so that only the required Rows are presented to the Join.

StreamCruncher 1.07 Beta is ready to be downloaded!

In this release:

  1. Solid BoostEngine's newer JDBC Driver (Build 04.50.0110 and above) does not have issues with ResultSet.getTimestamp() on Timestamp columns [Old Post]. The StreamCruncher patch for this has been removed as there is no need for it anymore
  2. Another cool feature in StreamCruncher is the ability to have Windows in the same Partition with different Window sizes. WindowSizeProvider and TimeWindowSizeProvider classes in the streamcruncher.api package can be used for such customizations. A detailed example is provided in TimeWFPartitionWinSizeProviderTest

Friday, February 02, 2007

2 more Bug reports for the ANTs driver:

  • Bug number [1596]: ANTs JDBC driver doesn't get registered
  • Bug number [1597]: ANTs JDBC Issue - NULL is inserted when Long.MIN_VALUE is inserted into bigint column

Friday, January 26, 2007

Following up on the ANTs Data Server issues, their Tech Support guys promptly sent me an email saying that the Bugs have been logged and that they are working on them:

  1. Bug number [1415]: JDBC getParameterTypeName() returns unknown for Date Datatype
  2. Bug number [1416]: JDBC getParameterClassName() method returns hard-coded string as "java.sql.ParameterMetaData"
  3. Bug number [401]: In ANTs, BIGINT is not of the standard size as compared to other Databases

StreamCruncher 1.06 Beta is now available!

This release contains 2 new features, both of which are improvements to the Query language. The first addition is the case..when..then..else..end clause. Most Databases support this in the select.. clause. It is an extremely useful feature for handling null column values, rewriting column values and so on.
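Just to give a feel for the shape of the clause (the Stream and column names below are made up for illustration), a Query could relabel null or out-of-range values like this:

select event_id,
       case when priority is null then 'unknown'
            when priority > 5 then 'high'
            else 'normal'
       end as priority_label
from alert_events;

In a real StreamCruncher Query the same case..when.. expression would simply sit in the select.. list over your partitioned Streams.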

The second addition is the first n or top n or the limit m offset n clauses. SC does not validate which of these clauses is appropriate for the Database being used underneath. If the Database you are using supports the first n clause, like Oracle TimesTen, then use it. If you are using the H2 Database and you want to truncate the ResultSet, then you should use the limit m offset n clause. Check your DB's manual to find out which clause to use.
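So, roughly speaking (the Stream name here is hypothetical), sampling the first 10 Rows would be written differently depending on the Database underneath:

-- Oracle TimesTen style
select first 10 * from alert_events;

-- H2 Database style
select * from alert_events limit 10 offset 0;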

Both features in this release work only if the Database being used supports those clauses. Have a look at CaseWhenClauseTest and TopOrLimitClauseTest to see how they work.

The first n clause, or its equivalent, will prove to be very useful if you want to sample just a few Rows/Events without having to retrieve all of them, which is always time consuming.

Tuesday, January 23, 2007

StreamCruncher 1.05 Beta is out!

Ah, finally..I got the time to re-do the Partitioning and Pre-Filtering logic. Until this 1.05 release, Partitions had to pull Events from the source Stream/Table just before Query Processing - one step before the final Query could be executed. So, the overhead of fetching the Events and Pre-Filtering them (the "where.." clause in the Partition definition) would add some latency to the overall Query processing. Even though each Partition ran in its own Thread, the whole process had to wait for all the Partitions to draw the new Events in. From the 1.05 release, the Events are pre-fetched for each Partition by a separate group of Threads. This should improve speed and CPU utilization on Multi-Core/Multi-Processor Hardware, and overall Latency should drop noticeably on such Hardware.

As a consequence of this change in architecture, Partitions with the Pre-Filter clause do not trigger the Query unless the Events have passed through the Pre-Filter. In previous releases, unfiltered Events would trigger the Query (if the total Event Weight reached 1.0 or higher) and then get filtered before reaching the Partition. This was very awkward for Queries with "New Events Windows", because Events that triggered the Query spuriously would cause the "New Events Windows" to discard their contents.

Load distribution across multiple Queries is possible now without any untoward consequences (like the "New Events Windows" problem) because the Pre-Filtering works in a Thread pool of its own.

Saturday, January 13, 2007

Following up on what I wrote on Marco's blog, StreamCruncher now supports the Solid BoostEngine, which is a dual-engine Database. Dual-engine means that it supports both In-memory Tables and Disk-based Tables. All the Streams (Input and Output) created via StreamCruncher are created on the Memory Engine and the Queries can combine data from both Disk-based and Memory Tables.

The ReStockAlertTest in the examples is a perfect illustration of this: it combines the "stock_level" Disk-based table (the default in Solid) and the "test_str" Stream defined on the In-memory "test" Table. StreamCruncher creates its artifacts using the "STORE MEMORY" clause.

Something similar is done for MySQL Databases, where SC creates artifacts using the "engine = MEMORY" clause (MySQL is a multi-engine DB).

For both Solid and MySQL, SC transparently adds this special clause to the Input and Output Table/Stream definitions.
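To give a feel for what that transparently added clause looks like (the table and columns below are made up - SC generates the real definitions for you), the same Stream Table would be declared roughly like this on the two Databases:

-- Solid BoostEngine: Table created on the Memory Engine
create table test_events (event_id bigint, event_time timestamp) store memory;

-- MySQL: the same Table on the MEMORY storage engine
create table test_events (event_id bigint, event_time timestamp) engine = MEMORY;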

If you've tried H2 Database, the pure-Java, embedded, in-memory, in-process Database that StreamCruncher ships with, you'd be amazed at how much work has gone into its development. Well, it supports other modes as well, but this mode is the default/recommended setting in SC.

Thomas Mueller, the guy who owns/develops H2, is the same chap who developed HSQL DB before someone else took over the responsibility from him. And now, HSQL DB is part of OpenOffice's Base application - Sun's answer to MS Access. That's really something.

H2 is turning into a full-fledged DB, what with its Spatial indexing, support for large (several GB) Database sizes, and support for all sorts of twisted but very handy SQL syntax.. From StreamCruncher's point of view, the embedded mode in H2 is highly suitable - the less latency at the DB, the better. The last time I spoke to Thomas (over email), H2 did not have any support for concurrency. H2 was thread-safe, but didn't allow concurrent operations - just like HSQL DB. I hope he spends some more time removing that major bottleneck. At least Table-level locking/concurrency would be very good. HSQL DB, on the other hand, locks the whole Database, which is ughh..

In any case, I admire the effort that he has put into H2.

PS: It's "relocation time" for me. I'm moving out of Singapore after having spent 1 year and almost 10 months here doing Consulting work. Nice little city, Singapore.

Monday, January 08, 2007

I thought I should share my Sci-Fi reading list. I'll start with the Stephen Baxter books I've enjoyed reading. Now, Baxter is considered to be a Hard Sci-Fi writer, like Clarke. But Baxter's writings are even more futuristic, and absolutely mind-bending. Naturally, because Baxter is more like Clarke's successor, carrying the baton into the 21st Century. Certainly not for the faint-hearted or for semi-luddites. His works are based on extrapolations of our current understanding of Quantum Mechanics, and his novels stretch across timescales one would never have imagined - from 500,000 years away, all the way up to several billion years into the future. How mankind will have evolved, what kind of entities we might encounter - not the usual sort of man-eating super roaches you see in B-grade movies, but civilizations that have evolved from Dark Matter..and ideas like that, which really stretch your imagination and force you to re-think your philosophy, if you have any, that is.

But I found his prose to be a little juddering, with haiku-like short sentences and abrupt context switches from chapter to chapter, especially when I read Manifold: Origin, which was the first Baxter novel I read. Subsequent novels were better, probably because I had got used to his style by then. Baxter's stories are unique in that he constantly keeps hitting the boundaries of our understanding of the Universe and our purpose here, if there really is any. He puts his characters in extraordinary situations: encountering a whole galaxy that has been miniaturized into a small box because its Sun was about to go Nova; meeting a civilization that is millions of years ahead of us and completely ignores us until the end of the Universe, where it leaves a small condescending token behind for the poor Humans, like how we throw crumbs at pigeons; or a human being grafted onto an AI and then suspended inside the Sun to study why the Sun is dying so fast instead of hanging around for another 5 billion years. But his characters seem to lack depth, because they are usually dwarfed by the engineering and astronomical marvels in the story, working on colossal scales - like the aliens who are re-engineering the Milky Way in Ring. Some books, Ring especially, leave you reeling under the concepts.

The Light of Other Days was a lot more enjoyable. A lot of his novels are interlinked. You have to read them in the right order, and then you finally get this "a-ha!" moment when all the pieces fall together - all the episodes line up somewhere along the "Time-like infinity". Start with "The Light of Other Days", which was a collaborative work with Clarke. Then move to Manifold: Origin and then Ring. If it still leaves you thirsting for more Hard Sci-Fi, read Coalescent - an entirely different thread. If Ring and Manifold leave you numb and staring into deep space, wondering what your descendants 80,000 years from now will be doing, then you should read Coalescent to bring you back to the present day. And if you are curious about Hive minds, you will like this book. Exultant, I felt, was too much like Orson Scott Card's Ender's Game.

You'll also notice Baxter recycling some of his material in other novels. But don't miss The Time Ships, a sequel to Wells' Time Machine. I loved this book, probably because he had to continue with Wells' style of writing instead of using his natural style. If you are interested in Evolutionary Biology and Genetic Engineering, liked Huxley's Brave New World, and are willing to make that leap of faith where a lot of the things we've come to accept as Society, Religion and Culture are all challenged, you should read this book. Well, Faith is the wrong word in this context, I suppose.

Saturday, January 06, 2007

StreamCruncher 1.04 Beta is out! A hasty release, I must add. There was a Connection leak in 1.03 Beta for the ANTs and Solid Databases. In my previous Post I had mentioned the use of Proxies to bypass the Solid and ANTs Driver limitations. From 1.04 onwards, Proxies are not used; I'm using concrete Classes to wrap the Connections, PreparedStatements etc.

But I'm still flummoxed by the ANTs Server. The SLAAlertTest keeps failing about 50% of the time. I spent the whole day trying to figure out why and I just couldn't. One thing I noticed was that the Timestamps get completely messed up in the results Table and even before that, along the way. The Kernel seems to be working fine, as demonstrated on the other 6 Databases. It's just ANTs. The Timestamps keep mysteriously jumping randomly to the "distant future". Maybe it's trying to hint at something - about my love for Science Fiction. Hmmm..? I've given up on ANTs for now.

Somebody also pointed out that SC does not work on JRE 1.5. I've tried to fix that by compiling the files with the "target 1.5" option. But, I have no way of checking if it works on 1.5. I strongly recommend 1.6, now that the Release Candidate is ready. 1.5 has some Memory Leaks in the Concurrent classes - in the parkAndWait(..) method or something like that. It was quite serious, last time I checked. After that I switched to 1.6.

And the Apache Commons library that is packaged with SC has been upgraded to 3.2 from 3.1.

Well, gotta go. Have a nice weekend.