Wednesday, December 29, 2010

Proximity search using SQLite's FTS feature

A few months ago I was playing with SQLite's Full Text Search (FTS) feature. I was especially interested in the NEAR operator (used inside a MATCH query), which lets you search for a set of terms that are within 'm' words of each other. Lucene also has this feature (obviously), called SpanQuery. This is known as proximity search, if you didn't already know.
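
To make "within 'm' words of each other" concrete, here is a minimal plain-Java sketch of the idea. It is not SQLite's implementation and it does not use the FTS engine; the class and method names are made up for illustration, and tokenizing on whitespace is a simplifying assumption. In SQLite FTS the same intent would be written inside the MATCH string, for example as `ibm NEAR/3 msft`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a proximity match; not SQLite code.
class ProximityMatch {

    // Word positions (0-based) at which the term occurs, ignoring case.
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> found = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equalsIgnoreCase(term)) {
                found.add(i);
            }
        }
        return found;
    }

    // True if some occurrence of a and some occurrence of b have
    // at most maxGap other words between them, in either order.
    static boolean near(String text, String a, String b, int maxGap) {
        String[] tokens = text.trim().split("\\s+");
        for (int i : positions(tokens, a)) {
            for (int j : positions(tokens, b)) {
                if (i != j && Math.abs(i - j) - 1 <= maxGap) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "IBM rose sharply while MSFT fell on heavy volume";
        System.out.println(near(doc, "IBM", "MSFT", 3));   // prints true
        System.out.println(near(doc, "IBM", "volume", 3)); // prints false
    }
}
```

A real full-text index stores term positions up front instead of rescanning the text, which is where the engine earns its keep.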

This kind of search has its limitations, and so does SQLite, especially performance problems with large data sets. I chose SQLite with the SQLite-JDBC driver because of its simplicity of setup and its SQL interface (duh!). I created the FTS table in an in-memory database and tried some simple queries. It's not too bad. I'll just file it away for later.

Here's the code. I just create 2 streams of stock ticks (all contrived, just like the rest of the code) and try to search for patterns in the 2 series. It does not exactly do what I wanted it to, but it was fun to play with the concept.

Friday, December 24, 2010

Enterprise technology landscape from 10,000 ft above - circa 2010

[Updated: Dec 25, 2010]

My personal view of where some technologies stand, today:





Happy holidays and a happy new year!

Wednesday, December 22, 2010

If you are going to lose your socks, make sure that you lose them in pairs.

Sunday, December 12, 2010

What they did not teach in OOP/OOAD class

For over a decade the Gang of Four Design Patterns, or the GoF patterns as they are fondly referred to, have been one of the favorite topics in job interviews. And for good reason. They have even led to the rise in popularity and acceptance of spin-offs like J2EE Blueprints, Enterprise Integration Patterns and even Anti-patterns.

Would it be too much to ask that these concepts be taught in school today, in the final semester, before sending graduates off into the real world?

However, knowledge of the GoF patterns is not sufficient to build elegant systems. In fact it must be said that over-reliance on the GoF patterns, and blindly following them, quite often leads to bloated and over-engineered systems. Case in point - Spring superseded a bloated J2EE. Google Guice and JEE CDI are in turn attempts to improve upon Spring, which itself has gained a lot of weight over the years.

In my experience, a more complete and proper understanding of the art of design and its application in a complex software system needs 2 other essential sets of design principles. These are the lesser known and underused ones:
    1) GRASP - General Responsibility Assignment Software Patterns
    2) SOLID - Single responsibility, Open-closed, Liskov substitution, Interface segregation and Dependency inversion

Where the GoF patterns and their spin-offs explain "how" to build the components, SOLID and GRASP help in understanding "why" those components, packages, abstractions, interfaces and dependencies have to be built and assembled in a certain way.

To drive home the point, sending off graduates with partial knowledge, i.e. GoF = "How" but without SOLID + GRASP = "Why", would be like teaching automobile engineers how to machine the parts of a car and not teaching them how to assemble the parts to make a drivable car.

If students are expected to figure out the "why" on their own at their first jobs, they will unwittingly design and build Rube Goldberg type software, afflicting the source code with factories of factories, bad adapters that don't do anything, redundant interfaces, singletons that resist unit testing, and the list goes on... until they learn from experience (if at all). For it takes years of experience and guidance to gain a holistic view of complex systems. Systems thinking also helps a great deal.
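
As a small illustration of one "why", here is a sketch of the dependency inversion principle from SOLID, set against the singleton-that-resists-unit-testing problem. All names here (`TickSource`, `AveragePriceReport`) are hypothetical and only serve the example; it is one possible shape, not a prescribed one.

```java
import java.util.List;

// Hypothetical example: the report depends on an abstraction it is handed,
// not on a global singleton it looks up by itself.
class DependencyInversionSketch {

    // Abstraction owned by the high-level code.
    interface TickSource {
        List<Double> latestPrices();
    }

    // High-level policy with a single responsibility: averaging prices.
    static class AveragePriceReport {
        private final TickSource source;

        AveragePriceReport(TickSource source) {
            this.source = source;
        }

        double average() {
            return source.latestPrices().stream()
                    .mapToDouble(Double::doubleValue)
                    .average()
                    .orElse(0.0); // empty source yields 0.0 by choice
        }
    }

    public static void main(String[] args) {
        // In a unit test a stub replaces the real feed with no framework at all.
        TickSource stub = () -> List.of(10.0, 20.0, 30.0);
        System.out.println(new AveragePriceReport(stub).average()); // prints 20.0
    }
}
```

Had `AveragePriceReport` called something like a static `getInstance()` on a feed class, the test would need the real feed, or some trickery, just to check an average.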

While we are discussing the subject of good design, there is another important aspect that is even less understood - API design. There is a remedy for that too: Joshua Bloch's "How to Design a Good API and Why it Matters".

In conclusion, I'd like to list some famous quotations and rules of thumb that help me when I'm designing a particularly tricky system:
   - Simple things should be simple, complex things should be possible (Alan Kay)
   - Simplicity before generality, use before reuse (97 Things ..)
   - Ask yourself if a feature or its design is necessary and sufficient. Anything more is a waste. Anything less means the job is not complete

Until next time!

Tuesday, December 07, 2010

LinkedIn's Kafka messaging project

Kudos to the LinkedIn team for making another highly focused and elegant project available as open source - Kafka. In spite of its name it is anything but Kafkaesque.

Kafka seems to be a serious attempt to address the messaging problem by starting from first principles. I have not played with the project yet, but from just reading the design doc it looks like a well-thought-out design.

I have written about the scalability limits of push systems that are somewhat common to JMS implementations - here about polling from NoSQL instead of push, a little here about the JMS spec needing an upgrade, and vaguely here when talking about alternatives to two-phase transactions.

The alternative systems like Flume, Scribe, Hedwig, Chukwa and such are too focused on log file collection, whereas Kafka looks more like a regular messaging system with a clean polling mechanism. Explicit polling with good storage automatically solves many of the problems that I had written about here, like retries, slow consumers/flow control and durable subscriptions. I'm particularly glad to see that they've read the Varnish article on OS disk caching, which Redis seems to have somewhat muddled up (comment #29). Funny, zero-copy was something I was exploring just a few weeks ago with Netty.
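
To show why explicit polling takes care of retries, slow consumers and durable subscriptions, here is a toy in-memory sketch. It is not Kafka's API or storage format; `MessageLog`, `Consumer` and their methods are hypothetical names, and a real system would keep the log on disk. The point is only that the log never deletes on delivery and each consumer owns its own offset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of pull-based messaging; not Kafka code.
class PollingLogSketch {

    // Append-only log: delivering a message does not remove it.
    static class MessageLog {
        private final List<String> messages = new ArrayList<>();

        // Appends a message and returns the offset it was stored at.
        synchronized long append(String message) {
            messages.add(message);
            return messages.size() - 1;
        }

        // Returns up to maxMessages starting at offset; empty if caught up.
        synchronized List<String> poll(long offset, int maxMessages) {
            int from = (int) Math.min(Math.max(offset, 0), messages.size());
            int to = Math.min(from + maxMessages, messages.size());
            return new ArrayList<>(messages.subList(from, to));
        }
    }

    // The consumer, not the broker, tracks how far it has read.
    static class Consumer {
        private final MessageLog log;
        private long offset = 0;

        Consumer(MessageLog log) {
            this.log = log;
        }

        // Pulls the next batch at the consumer's own pace (flow control).
        List<String> next(int maxMessages) {
            List<String> batch = log.poll(offset, maxMessages);
            offset += batch.size();
            return batch;
        }

        // A retry is just moving the offset back.
        void rewind(long toOffset) {
            offset = toOffset;
        }
    }

    public static void main(String[] args) {
        MessageLog log = new MessageLog();
        for (int i = 0; i < 5; i++) {
            log.append("tick-" + i);
        }
        Consumer fast = new Consumer(log);
        Consumer slow = new Consumer(log);
        System.out.println(fast.next(10)); // prints [tick-0, tick-1, tick-2, tick-3, tick-4]
        System.out.println(slow.next(2));  // prints [tick-0, tick-1]
        slow.rewind(0);                    // retry the same batch
        System.out.println(slow.next(2));  // prints [tick-0, tick-1]
    }
}
```

A slow consumer here costs the log nothing, and a consumer that went away can resume from its last offset, which is what a durable subscription amounts to.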

I don't, however, foresee any enterprise projects switching to Kafka immediately. Its performance and cost of license (ASL) might not be enough to motivate people to try it out. The strangely simplistic yet clever design does require some careful reading of the docs and understanding of the APIs. Hopefully it will gain a wider user base, unlike their other nice project Voldemort - another simple and elegant project.

Also be sure to have a look at their new disk store - Krati. I'm even more glad to see that all these projects are in Java (actually Scala).

Until next time!

Saturday, December 04, 2010

Clever Enum tricks and some things to read over the weekend

Here's a Java goody - changing the default maximum compile errors reported by Javac:
   http://forums.sun.com/thread.jspa?messageID=2250634#2250634

Clever things that you can do with Java Enums:
   Create a hierarchy - http://java.dzone.com/articles/enum-tricks-hierarchical-data
   Make it implement an interface - http://api.neo4j.org/current/org/neo4j/graphdb/RelationshipType.html
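
A minimal sketch of both tricks in one enum, assuming a made-up `AssetClass` hierarchy and a made-up `Priced` interface (neither comes from the linked articles): each constant points at its parent to form a hierarchy, and the enum implements an interface so that callers can depend on the interface instead of the enum.

```java
// Hypothetical example of a hierarchical enum that implements an interface.
class EnumTricks {

    // Callers can accept any Priced, enum or not.
    interface Priced {
        long feeCents(long amountCents);
    }

    enum AssetClass implements Priced {
        // A constant may refer to constants declared before it.
        SECURITY(null, 0),
        EQUITY(SECURITY, 100),
        BOND(SECURITY, 50),
        GOVERNMENT_BOND(BOND, 20);

        private final AssetClass parent;
        private final long feeBasisPoints;

        AssetClass(AssetClass parent, long feeBasisPoints) {
            this.parent = parent;
            this.feeBasisPoints = feeBasisPoints;
        }

        // Walks up the parent chain, so a constant is also "a" each of its ancestors.
        boolean isA(AssetClass other) {
            for (AssetClass c = this; c != null; c = c.parent) {
                if (c == other) {
                    return true;
                }
            }
            return false;
        }

        @Override
        public long feeCents(long amountCents) {
            return amountCents * feeBasisPoints / 10_000;
        }
    }

    public static void main(String[] args) {
        System.out.println(AssetClass.GOVERNMENT_BOND.isA(AssetClass.SECURITY)); // prints true
        System.out.println(AssetClass.EQUITY.isA(AssetClass.BOND));              // prints false
        Priced priced = AssetClass.EQUITY;
        System.out.println(priced.feeCents(100_000));                            // prints 1000
    }
}
```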

And some bizarre things to do like - converting Java to native:
    http://nestedvm.ibex.org/
    http://llvm.org/svn/llvm-project/java/trunk/docs/java-frontend.txt
    http://www.vishia.org/Java2C/html/features.html

What would happen if you tried to bypass the SQL engine in MySQL? You'd get 750,000 reads per second!
    http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html

Some Hadoop and cloud related presentations worth reading:
    http://www.cloudera.com/resource/hw10_hbase_in_production_at_facebook
    http://www.cloudera.com/resource/hw10_apache_zookeeper_at_yahoo
    http://www.slideshare.net/adrianco/netflix-on-cloud-combined-slides-for-dev-and-ops