{ Make this readable }

Sunday, December 15, 2013

Java/tech stuff I found on the internet (Dec 2013 edition)

Networking and big data:

Java/JVM perf:
Java memory model + arrays + visibility/ordering:
Happy holidays!

Sunday, November 24, 2013

Analyzing large Java heap dumps when Eclipse Memory Analyzer (MAT) UI fails

If you have ever tried to analyze a big heap dump (20-30GB) downloaded from your production server to your staging/test machines, only to find that X-over-SSH is too slow, then this article is for you.

As of Nov 2013, we have 2 options - Eclipse MAT and a hidden gem called Bheapsampler.

Option 1:
Eclipse Memory Analyzer is obviously the best tool for this job. However, trying to get the UI to run remotely is very painful. Launching Eclipse and updating the UI is an extra load on the JVM that is already busy analyzing a 30G heap dump. Fortunately, there is a script that comes with MAT to parse the heap dump and generate HTML reports without ever having to launch Eclipse! It's just that the command line option is not well advertised.

Command line heap analysis using Eclipse MAT:

Assuming Eclipse MAT is installed and we are inside the mat/ directory, modify MemoryAnalyzer.ini heap settings to use a large heap to handle large dumps:


Run MAT against the heap dump:

    ./ParseHeapDump.sh ../today_heap_dump/jvm.hprof

This takes a while to execute and generates indices and other files to make repeated analysis faster. Then use the indices created in the previous step and run a "Leak suspects" report on the heap dump.

    ./ParseHeapDump.sh ../today_heap_dump/jvm.hprof org.eclipse.mat.api:suspects

The output is a small and easy to download jvm_Leak_Suspects.zip. This has HTML files just like the MAT Eclipse UI. It can be easily SCP'ed/emailed around.

Other report types are possible too, for example org.eclipse.mat.api:overview and org.eclipse.mat.api:top_components.

More details - http://wiki.eclipse.org/index.php/MemoryAnalyzer/FAQ.

Option 2:
http://dr-brenschede.de/bheapsampler is something I chanced upon. It is a sampling heap dump reader and so it works for very large heap dumps where MAT sometimes fails. Being a sampling reader, the output is also a little imprecise but helps a great deal when you have nothing else. The tool seems to be closed source and is very sensitive to heap dump corruptions.

As an aside, here's something that might be useful for taking the initial heap dump quickly - https://blogs.atlassian.com/2013/03/so-you-want-your-jvms-heap/.

Sunday, November 17, 2013

Book review: Getting Started with Hazelcast

A few weeks ago Packt Publishing sent me a free copy of their new publication - Getting Started with Hazelcast by Mat Johns to read and write about. I have used distributed caches and compute grids quite a bit at work. So, I was happy to do a quick review of this book. I've used Oracle Coherence quite a lot and Hazelcast for some experiments.

The book is a gentle guide to building distributed compute and data grids. It assumes nothing about the reader and hence does a good job of doing what it says in the book's title - "getting started". I'd recommend this book to anyone who is completely new to this area, which is not to be confused with Hadoop, Storm, Cassandra or the other more "popular/hyped" cousins. I would say that for medium sized data, logic heavy, transactional/near real time applications, compute grids are the way to scale out.

Obviously this book is about using Hazelcast, which is a nice Apache software licensed, Java, distributed grid/cache. It is surprisingly feature rich and in terms of usability, features and elegance it comes very close to its more expensive, older, rock solid cousin which is Oracle Coherence.

The book explores the essential aspects of using such frameworks effectively, such as distributed maps, replication, network partitions, fault tolerance, data affinity, moving code closer to where the data is etc. It does this without being too overwhelming for first timers.

For a full and more thorough treatment I would obviously recommend the Hazelcast documentation. And if you are curious to know about other frameworks check out my old write up - Scalable compute & storage frameworks - A Refcard.


Friday, October 11, 2013

JVM memory management speed, performance related stuff and other links

Here's this season's link fest. Let's start with Java:

Some JavaOne related posts:
Other cool algos and stuff:
Until next time!

Sunday, September 08, 2013

Fort Bragg, Point Cabrillo lighthouse, Mendocino county trip

Fort Bragg's Glass Beach is (well.. how shall I put it) completely skippable. The Botanical garden is totally worth the visit.

En route to Mendocino

Little River - close to our rented cottage

Point Cabrillo Lighthouse


Mendocino sunset

Fort Bragg Botanical Garden

Until next time!

Java, GPU, interesting JVMLS talks, graphs, timing wheel etc.

Here's another dump of interesting/useful tech stuff I read over the past couple of months. 
(Notice how I'm getting lazier over time? I'm just dumping links and not even adding my notes or cleaning them up... just raw links)
Coherence rack safe: 

Floating point...yuk:

Saturday, August 10, 2013

Hiking around Stevens Creek Reservoir

We've been to this place quite a few times. I like the peace and quiet here. It's close to Mountain View, like Rancho San Antonio.

There are multiple trails here. Our favorite is the one around the reservoir. You have to walk on the reservoir wall, near the boat ramp to the inner side of the reservoir. The trail then reaches Stevens Canyon Road again but at the other end of the reservoir. You can turn around and come back the same way or walk back on the road along the reservoir's outer edge, completing a full circle.

Thursday, August 01, 2013

Some good Cassandra, Lucene presentations and misc Comp-Sci posts

About 2 months ago I attended the Cassandra Summit in San Francisco. Yes, I've been meaning to write this post for a while now. I was surprised (pleasantly) to see such a good turn out. Lots of energy and real world use cases. I didn't get to attend all the talks of course, but all the slides and videos are online. Here are some good ones:

A few JVM related posts worth reading:
Go language and reactions:
If you like the Markdown syntax and want a good, no fuss editor for writing documents:
Some Comp-Sci stuff to keep your (my) mind fit:
Scala and Spark related videos:
Until next time!

Sunday, June 23, 2013

Reading list (and RIP Mr. Iain Banks)

Here's my list of books I read these past few months:

  • RIP - Mr. Iain Banks
  • Seeker by Jack McDevitt - Watered down sci-fi. Like a direct-to-DVD sci-fi movie. Chapter after chapter of filler, like one long, boring episode of Star Trek, if you can stay awake through it
  • Mirror Dance (Miles Vorkosigan Adventures) - Smart, clever, crisp. Surprisingly interesting story and great character development
  • Paladin of Souls by Lois McMaster Bujold - A beautifully written fantasy novel. Engrossing and scary. At the same level as China Miéville and Dan Simmons
  • Night Watch by Terry Pratchett - my first Pratchett novel. Not bad at all, light and funny
  • Planesrunner by Ian McDonald - Interesting but definitely young adult sci-fi. The story and the worlds had a lot of promise but lack the sophistication of hard core sci-fi. Kiddie stuff
  • Small Gods by Terry Pratchett - Typical irreverent Pratchett style. Funny and not too bad
  • The Emperor's Soul: Brandon Sanderson. Novella. Makes for a nice, light, quick reading
  • The Martian by Andy Weir - Amazing piece of near sci-fi, survival. You'll love the detail especially if you are an engineer. Kindle only
  • Crystal sphere - Short story. Short but nice
  • Gabble - Some stories are great fun. Where it comes to Polity, it's uncomfortably close to the great Iain Banks' Culture. Neal Asher should've tried something original instead of ripping off Banks. Still, worth reading
  • Six directions of space - Alastair Reynolds. Multiple time lines. Abrupt ending. Short story. Should've gone with a longer, novel format
Until next time!

Sunday, June 16, 2013

ForkJoin - a quick exploration .. long overdue

ForkJoin has been available to us Java folks since Java 7 and if you consider the JSR 166 packages, then even longer. I found the time to explore this API only recently.

Having written about Phasers a couple of years ago and realizing that I'd still not found a use for them in production, I was not too eager to explore another "thread-pool" (just kidding - where would we be today without j.u.c classes).

Anyway, I downloaded the latest JDK 8 pre-release (b93), changed my IntelliJ 12 language mode to Java 8-with-lambdas and ran some simple tests.

Mind you, the JavaDocs for ForkJoin and related classes are quite elaborate and expect you to set aside some time to go through it in detail... which you can probably postpone if you read this post.

ForkJoin is recommended as a thread-pool if your main task has to divide itself into a lot of smaller tasks, usually recursively. Usually in such scenarios the number of child tasks is not known upfront. Technically, the work-stealing aspect of ForkJoin and the claim that it scales well when faced with a large number of tasks make it a good fit for such workloads.

Essentially, there are 3 ways in which you can write jobs/tasks to run in a ForkJoinPool - RecursiveAction, RecursiveTask and the new JDK 8 CountedCompleter.

The RecursiveAction is fairly simple. It embodies the logic to work on the root of your computation problem. It also splits its work into smaller sub-tasks recursively. This is very similar to a binary search, except that the search of each half is spawned off as a sub-task, recursively. The computation for this tree of tasks completes when the leaf nodes are processed.

I can think of a simplified but realistic use case where you'd want to do a mix of sync and async, parallel sub-tasks:

  1. Receive purchase order request from client
  2. Convert request payload (JSON, XML) to Java object
  3. Make synchronous authorization check with LDAP
  4. Make some async requests
    1. Make async request to inventory service to check and reserve stock
    2. Make async request to shipment service and find closest free shipment date to requested destination
    3. Make async request to fetch similar/recommended items to offer package deals
  5. Consolidate results of async requests
  6. Generate response JSON
You could do steps 2, 3 and 6 in a regular ThreadPoolExecutor. If you need to accommodate priority purchase order processing then you could easily do it with a combination of PriorityBlockingQueue and the right constructor on TPE.
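Here is a minimal sketch of that TPE and PriorityBlockingQueue combination. The OrderTask class, its priority numbers and the order ids are all made up for illustration. One gotcha I am fairly sure about: the tasks have to go in through "execute()", because "submit()" wraps them in a FutureTask which is not Comparable and the queue will reject it with a ClassCastException.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PriorityOrders {
    /** Hypothetical purchase order task; a lower priority value runs first. */
    static final class OrderTask implements Runnable, Comparable<OrderTask> {
        private final int priority;
        private final String orderId;
        private final List<String> processed;

        OrderTask(int priority, String orderId, List<String> processed) {
            this.priority = priority;
            this.orderId = orderId;
            this.processed = processed;
        }

        @Override
        public void run() {
            processed.add(orderId); // stand-in for steps 2, 3 and 6
        }

        @Override
        public int compareTo(OrderTask other) {
            return Integer.compare(priority, other.priority);
        }
    }

    static List<String> processInPriorityOrder() {
        // One worker thread, so the order in which the queue hands out tasks is visible.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new PriorityBlockingQueue<Runnable>());
        List<String> processed = new CopyOnWriteArrayList<>();
        CountDownLatch gate = new CountDownLatch(1);
        try {
            // Keep the single worker busy so that the orders below pile up in the queue.
            pool.execute(() -> {
                try {
                    gate.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // execute(), not submit(): submit() wraps tasks in a non-Comparable FutureTask.
            pool.execute(new OrderTask(5, "bulk", processed));
            pool.execute(new OrderTask(1, "rush", processed));
            pool.execute(new OrderTask(3, "normal", processed));
            gate.countDown();
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(processInPriorityOrder()); // [rush, normal, bulk]
    }
}
```

Since the PriorityBlockingQueue is unbounded, the maximum pool size never comes into play; only the core size matters here.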

In fact, there are so many implementations of BlockingQueue, for example LinkedTransferQueue and SynchronousQueue, which could be useful in some special cases. The Exchanger is another such nugget in the j.u.c package. Apparently CompletableFuture is ideal for such cases (like Scala's Promise and Google Guava's ListenableFuture) but I was surprised to see there were no examples in the JavaDoc.
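Since the JavaDoc has no examples, here is a rough sketch of how steps 4 and 5 of the purchase order flow might look with CompletableFuture. The three service methods are hypothetical stubs that return canned values; real ones would make remote calls.

```java
import java.util.concurrent.CompletableFuture;

class AsyncOrderSteps {
    // Hypothetical stand-ins for the inventory, shipment and recommendation services.
    static boolean reserveStock(String item) {
        return true;
    }

    static String closestShipmentDate(String destination) {
        return "2013-06-20";
    }

    static String recommend(String item) {
        return item + "-bundle";
    }

    static String process(String item, String destination) {
        // Step 4: fire off the three requests without waiting for any of them.
        CompletableFuture<Boolean> stock =
                CompletableFuture.supplyAsync(() -> reserveStock(item));
        CompletableFuture<String> shipDate =
                CompletableFuture.supplyAsync(() -> closestShipmentDate(destination));
        CompletableFuture<String> offer =
                CompletableFuture.supplyAsync(() -> recommend(item));

        // Step 5: consolidate the results once all three have completed.
        return stock
                .thenCombine(shipDate, (reserved, date) -> "reserved=" + reserved + ", ships=" + date)
                .thenCombine(offer, (summary, extra) -> summary + ", offer=" + extra)
                .join();
    }

    public static void main(String[] args) {
        // prints reserved=true, ships=2013-06-20, offer=book-bundle
        System.out.println(process("book", "SFO"));
    }
}
```

The one-argument "supplyAsync()" runs on the common ForkJoinPool. For calls that really block on the network, the two-argument overload that takes your own Executor is probably the safer choice.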

(Ok, this is turning out to be a longer post than I had expected. Not a quick exploration after all)

Going back to our example, the 3 asynchronous operations in step 4 might constitute sub-tasks of step 4. In reality though, the JavaDoc for ForkJoinPool says that ForkJoinTasks should ideally not block on external resources like I/O. This is called "unmanaged synchronization" as it involves waiting for resources outside the fork-join system. For that the ManagedBlocker is recommended, although to me it looks like it was added only as an afterthought.
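For what it's worth, here is a small sketch of the ManagedBlocker idea, modeled on the queue example in the ManagedBlocker JavaDoc. The "responses" queue is a made-up stand-in for an external service reply, and the pool size of 1 is only there to keep the sketch small.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.LinkedBlockingQueue;

class ManagedBlockingDemo {
    /** Wraps a blocking take() so that the pool knows this worker may block. */
    static final class QueueTaker<T> implements ForkJoinPool.ManagedBlocker {
        private final BlockingQueue<T> queue;
        private volatile T item;

        QueueTaker(BlockingQueue<T> queue) {
            this.queue = queue;
        }

        @Override
        public boolean block() throws InterruptedException {
            if (item == null) {
                item = queue.take();
            }
            return true; // no further blocking is needed
        }

        @Override
        public boolean isReleasable() {
            // Try a non-blocking poll first; blocking is only needed if this fails.
            return item != null || (item = queue.poll()) != null;
        }

        T getItem() {
            return item;
        }
    }

    static String awaitResponse(BlockingQueue<String> responses) {
        ForkJoinPool pool = new ForkJoinPool(1);
        try {
            return pool.submit(() -> {
                QueueTaker<String> taker = new QueueTaker<>(responses);
                // Lets the pool compensate with a spare thread while this worker waits.
                ForkJoinPool.managedBlock(taker);
                return taker.getItem();
            }).join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> responses = new LinkedBlockingQueue<>();
        responses.add("stock reserved"); // hypothetical reply from the inventory service
        System.out.println(awaitResponse(responses));
    }
}
```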

So, sadly the above seemingly real-world example might not be a good case for ForkJoin. Which means the ideal use case is something that involves recursively decomposing and pure computation - a.k.a in-memory map-reduce.

So, we make our way back to the overly geeky sort-merge example used in the JavaDocs. In my case, I decided to dispense with the sorting part and simplified the problem even further - purely for illustration purposes.

In my examples, I use ForkJoin to recursively split and list numbers from "start" to "end". At each step if the start to end range is larger than 5 it splits that range into 2 equal halves and forks them off as sub-tasks. Otherwise that task is the leaf level and just adds the numbers in a for-loop from start to end into a queue that is passed around to all tasks.

The first test is a naive implementation of RecursiveAction where it just keeps forking away sub-tasks till the leaf levels. The thread that created the root level task calls "invoke()" expecting to wait for the whole tree of computations to complete. However, each level spawns the next level of 2 sub-tasks asynchronously ("fork()") and does not wait for the children to complete, so each parent finishes before its sub-tree does. As a result, the caller thread in the "main()" method comes out of "invoke()" prematurely. This recursive task is almost what we wanted but not entirely.

Since the naive approach of forking did not suffice, we make a small change by making the parent task wait for its children to complete by calling the "join()" method on its children.
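A reconstruction of that idea as a sketch (not the exact test code from my repo): the class name, the half-open start/end convention and the threshold constant are my assumptions here. Removing the two "join()" calls turns it back into the naive version.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class RangeLister extends RecursiveAction {
    private static final int THRESHOLD = 5;

    private final int start; // inclusive
    private final int end;   // exclusive
    private final Queue<Integer> sink;

    RangeLister(int start, int end, Queue<Integer> sink) {
        this.start = start;
        this.end = end;
        this.sink = sink;
    }

    @Override
    protected void compute() {
        if (end - start <= THRESHOLD) {
            // Leaf level: just list the numbers into the shared queue.
            for (int i = start; i < end; i++) {
                sink.add(i);
            }
            return;
        }
        int mid = start + (end - start) / 2;
        RangeLister left = new RangeLister(start, mid, sink);
        RangeLister right = new RangeLister(mid, end, sink);
        left.fork();
        right.fork();
        // Without these joins the parent completes while its children are still running.
        right.join();
        left.join();
    }

    static Queue<Integer> list(int start, int end) {
        Queue<Integer> sink = new ConcurrentLinkedQueue<>();
        ForkJoinPool pool = new ForkJoinPool();
        try {
            pool.invoke(new RangeLister(start, end, sink));
        } finally {
            pool.shutdown();
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(list(0, 100).size()); // 100
    }
}
```

The fork/fork/join/join sequence can also be written as "invokeAll(left, right)", which does the same thing in one call.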

An even better approach is to allow each task to fork away sub-tasks and not have to "join()" on them, because waiting only means that a thread is now in an idle-wait state when it should've been "stealing" work from other threads and making progress. What we need is a way to let the sub-tasks notify the parent task that they have completed. We can let this bubble up all the way and register a listener at the root.

For the listener we will even use the fancy Lambda feature and something from the new java.util.function package. In fact, completion listeners can be registered at any level - for example to print to the console that a certain % of the tree is complete and so on. There are 2 versions of this. The first sub-classes CountedCompleter to simply let the completions bubble up and eventually notify the blocked calling thread in "main()".

The second is the more sophisticated implementation using Lambdas.
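Roughly, the CountedCompleter approach with a Lambda listener at the root could look like the sketch below. The class and method names are mine, and the listener here just records how many numbers were listed. The pending-count pattern follows the one shown in the CountedCompleter JavaDoc.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountedCompleter;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class CompletingRangeLister extends CountedCompleter<Void> {
    private static final int THRESHOLD = 5;

    private final int start; // inclusive
    private final int end;   // exclusive
    private final Queue<Integer> sink;
    private final Consumer<Queue<Integer>> listener; // only set on the root task

    CompletingRangeLister(CountedCompleter<?> parent, int start, int end,
                          Queue<Integer> sink, Consumer<Queue<Integer>> listener) {
        super(parent);
        this.start = start;
        this.end = end;
        this.sink = sink;
        this.listener = listener;
    }

    @Override
    public void compute() {
        if (end - start > THRESHOLD) {
            int mid = start + (end - start) / 2;
            setPendingCount(2); // must be set before forking the 2 children
            new CompletingRangeLister(this, start, mid, sink, null).fork();
            new CompletingRangeLister(this, mid, end, sink, null).fork();
        } else {
            for (int i = start; i < end; i++) {
                sink.add(i);
            }
        }
        // No join(): completions bubble up from the leaves through the pending counts.
        tryComplete();
    }

    @Override
    public void onCompletion(CountedCompleter<?> caller) {
        if (listener != null) {
            listener.accept(sink);
        }
    }

    static int listAndCount(int start, int end) {
        Queue<Integer> sink = new ConcurrentLinkedQueue<>();
        AtomicInteger seen = new AtomicInteger();
        ForkJoinPool pool = new ForkJoinPool();
        try {
            // The Lambda is the completion listener registered at the root.
            pool.invoke(new CompletingRangeLister(null, start, end, sink,
                    numbers -> seen.set(numbers.size())));
        } finally {
            pool.shutdown();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println(listAndCount(0, 100)); // 100
    }
}
```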

Here's an even more sophisticated example that wraps the fork-join pool as a CompletionService and submits 5 tasks and then picks up the results as they complete.
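A stripped down sketch of the CompletionService idea, assuming each of the 5 tasks is a trivial Callable that squares a number (my real tasks were the range listers). ForkJoinPool is an ExecutorService, so the stock ExecutorCompletionService can wrap it directly.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ForkJoinPool;

class ForkJoinCompletionDemo {
    static int sumOfFiveTasks() {
        ForkJoinPool pool = new ForkJoinPool();
        CompletionService<Integer> service = new ExecutorCompletionService<>(pool);
        try {
            for (int i = 1; i <= 5; i++) {
                final int n = i;
                service.submit(() -> n * n); // hypothetical task
            }
            int total = 0;
            for (int i = 0; i < 5; i++) {
                // take() hands back results in completion order, not submission order.
                total += service.take().get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfFiveTasks()); // 55
    }
}
```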

There are a few other things worth reading up about. I skipped the use of RecursiveTask. I also skipped mentioning the different methods to "steal" tasks. The RecursiveAction JavaDocs even have an example that keeps track of spawned sub-tasks and then follows that chain to try and complete them if another thread has not already done it. The reason I did not venture into this bit is that I'm not sure as of now whether it is worth doing this instead of letting the FJ framework do the scheduling internally.

Without studying the source code I can only guess that manually keeping track of spawned sub-tasks and then trying to unfork them helps complete that sub-tree of tasks quickly. If we were to use a simple queue to just dump sub-tasks like in the ThreadPoolExecutor, then they would get mixed up with other sub-tasks from other threads in the pool. This means that the current sub-tree may not complete on time because the dependent sub-tasks are somewhere at the back of the queue. This is where FJ shines, in addition to its scalability.

One thing we do lose with FJ is task priorities. Unlike a TPE with a PriorityBlockingQueue, FJ tasks do not have priorities, so you might end up using multiple FJ pools.

Until next time!

Saturday, June 08, 2013

Hiking in Huddart County Park

Nice, secluded, close to I-280. Trails always in the shade, good even in summer. They even have camp sites and picnic benches.

$6 entrance fee. No maps, trail directions are a little confusing. Especially when you are coming back to the parking area there are many roads and unmarked trails branching off. Funnily, we had trouble finding our parking lot at the end. We did not have such problems while hiking though. The single used map they did have was lacking in detail about the parking areas.

Wednesday, June 05, 2013

Diesel - DSL experiments on the JVM (Part 2)

In part 1 we explored some simple ways to fake a DSL using JSON and YAML. In part 2 we will expend a little more effort to build a more powerful mini-DSL.

By "powerful", I mean a DSL that not only supports the language structure we like but also allows for more complex constructs like method calls and expressions.

For this task, I chose Groovy, which is a really nice, Ruby-like language that is tightly integrated with Java. Later, I will also explore Ruby using JRuby just to show how similar Groovy and Ruby are in many aspects.

Groovy runs on the JVM and so it integrates seamlessly with Java. Even IntelliJ comes with native support for Groovy. Groovy can be used like a plain scripted, interpreted language or even compiled. It is a little slower than Java but offers a lot of powerful metaprogramming features to balance it out. I will not go into the details but suffice it to say that it is particularly useful for building (surprise!) DSLs. For full blown examples see Cloudify and Gradle.

At first, I chose the simple approach of just using Groovy as a scripting language to specify the stocks. It didn't really look like a DSL because it is not.

Next, with just a tiny bit of setup where I make some ready made expressions and methods available to the script, I was able to specify my stocks in a nicer and more powerful format. The cool "with" syntax in Groovy also helped.

To demonstrate that I could also write executable code, I had the script print the date and name of the file in the first line.

I've just scratched the surface of Groovy. I could've spent a lot more time on things like overloading numeric types, so that the goodFor property could be written as "30.days", but I didn't. There are obviously holes in my implementation but you cannot dismiss the speed at which you can get at least this much functionality. Perhaps with a little more Groovy proficiency and time, I could've done better.

Now on to Ruby. To show how similar Groovy and Ruby are, I used JRuby to build this DSL:

I also have a simpler, raw JRuby script but then it's not a DSL but just a script. Again note the similarities with Groovy.

Getting JRuby to integrate neatly with my Stocks beans was a little challenging. It required a slightly different setup. I also ran into some issues which are not documented well in JRuby. So, I asked for help on the mailing list but haven't heard from them. This coupled with the fact that JRuby startup takes several seconds made testing and experimenting a little frustrating.

Ruby in itself is used in a lot of places to build DSLs like Chef, RSpec and a lot of other projects.

One option I obviously overlooked is Scala. Scala is well known (among the Scala users) for its "apparently" powerful language features. However, in my opinion Scala's complex, sometimes bizarre, dense and obtuse syntax might keep it out of reach of average engineers like myself. I've shared this opinion earlier too.

So, that's it for now. I may extend this preliminary work on Diesel later when I have the time. Or better yet, you can fork it and share it.


Diesel - DSL experiments on the JVM (Part 1)

I've been meaning to write about my series of little experiments building a mini-DSL on the JVM.

I've worked with and written about expression evaluators before. However, there've been many instances where I've felt the need to quickly build a part pseudo-language and part configuration script.

Also, I did not have the time, resources nor the justification to build a full fledged grammar/parser. There were times where I did morph an existing ANTLR grammar for something else, but in the end I realized that a simple hand built tokenizer and AST would've done the trick. (Note to self: try Parboiled)

So, I was curious to see what options I had to build an "almost language", quickly. "Quickly" being the operative word. "Dirty" being the unsaid word.

I'm not going to go into the details of what a DSL is, or spend time debating over internal or external DSLs etc. Enough material is available on the internet and some books too.

If you want to learn more about Java API based DSLs - more commonly known as a Fluent DSL, there are several good places to start learning by example - Jooq, Google Guava ComparisonChain etc. I've built Fluent DSLs several times and it is a cleaner and better way to implement the Builder design pattern - like Google Protocol Buffers' Builder.

This time though, I wanted to evaluate options to build something that could read/parse/load files that look like structured, readable English. Some common cases where you'd need this:

  • Configuration scripts are prime candidates for this. In the early-mid 2000's, XML would've been the way to go; with XPath and XSDs/DTDs; built in support in the JDK and support for hierarchical structures
  • Glue to stitch together different modules in a program - something that usually involves some configuration code and basic expressions
  • Actual mini-languages that allow business analysts or IT/DevOps people to plug in some logic without writing complex Java code. Also without having to get developers and a full build cycle involved
So, let's cut to the chase and see what I came up with.

For my tests, I wanted to accomplish something very simple. I wanted a way to describe stock buying or selling instructions to my stock broker. It's a completely contrived example of course but it seemed valid for this test. I wanted a way to specify which stock to buy or sell, at what price, for how long the instruction is valid and some other little things.

Since I brought up XML, I'll talk about the simplest approach first - XML's slightly less ugly cousin JSON:

Using JSON and calling it a DSL is not only dumb but also cheating. But there are obviously a lot of places where this would suffice. Unlike XML, this is less verbose, but it still needs the user to know where to put double quotes, braces, square brackets and all this without a schema to validate the file.

It does have its advantages. All I had to do was create a JavaBean with all the possible combinations my "stock specification" could have and then use Google Gson to do the serialization/deserialization to/from JSON.

This is what the JavaBeans look like:

Assuming that this was enough, all I had to do was read the JSON into the Stocks bean and related inner classes and start using it.

The keen reader will notice that the Order class has some properties - limit, stopLimit, market which are really mutually exclusive. JSON does not prevent me from providing values to all 3 which would be wrong. I could've spent some more time fleshing those properties into an enum or a complex string but I'll leave that as an exercise for later (or the reader).
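As a starting point for that exercise, here is one way the check could go. The Order class below is a hypothetical, cut down stand-in for my bean (the field types are guesses), and PriceType is the enum the three properties would collapse into.

```java
class OrderValidation {
    enum PriceType { LIMIT, STOP_LIMIT, MARKET }

    /** Hypothetical bean carrying the three mutually exclusive properties. */
    static final class Order {
        Double limit;
        Double stopLimit;
        boolean market;
    }

    /** Returns the single price type that was specified, or fails loudly. */
    static PriceType priceTypeOf(Order order) {
        int specified = 0;
        if (order.limit != null) {
            specified++;
        }
        if (order.stopLimit != null) {
            specified++;
        }
        if (order.market) {
            specified++;
        }
        if (specified != 1) {
            throw new IllegalArgumentException(
                    "Exactly one of limit, stopLimit or market must be set, found " + specified);
        }
        if (order.limit != null) {
            return PriceType.LIMIT;
        }
        return order.stopLimit != null ? PriceType.STOP_LIMIT : PriceType.MARKET;
    }

    public static void main(String[] args) {
        Order order = new Order();
        order.limit = 10.5;
        System.out.println(priceTypeOf(order)); // LIMIT
    }
}
```

Running this right after deserialization keeps the JSON file format unchanged while still rejecting files that set more than one of the three.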

The full source along with scripts can be found on my GitHub Diesel repo for your reference.

So, I decided that JSON wouldn't cut it. A while ago I had played with YAML briefly, which is JSON's distant cousin. Actually, the latest YAML spec makes it JSON's parent (how convenient).

YAML is like JSON but without the frivolous double quotes and braces. Compare this YAML file with the previous JSON file, it speaks for itself:

It is, without doubt, cleaner and more usable than JSON. You use SnakeYAML to do automatic ser/deser into the Stocks JavaBean, just like Gson.

Also, like Gson, if you don't have a bean or your configuration makes it difficult to map directly to a bean, you can just read it free form as a map of maps. This would be a poor man's AST. Gson's free form structure is actually better that way, in that it almost looks like XML Nodes.

If YAML is good enough for you, you can stop reading right here. In fact, YAML is also my favorite for simple configuration files that involve lists and hierarchies. This is miles ahead of and better than the flat format used in Java Properties.

But, defining YAML still has the same issues that JSON had with regards to semantic validations like limit, market etc. However this is really an issue with the way I've created the beans. Think of the YAML file as a free form AST. You'd have to write your semantic and syntactic validations in your Java code by walking this AST. I prefer to do this in Java because it's easier to have all the validations and exception messages in one file than split it across multiple ANTLR and Java files.
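To make the "walk the AST in Java" idea concrete, here is a small sketch that validates the free form map of maps. It uses plain java.util collections so it does not depend on SnakeYAML, and the "stocks" and "symbol" key names are assumptions about the file layout, not necessarily what my files used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StockSpecValidator {
    private static final String[] PRICE_KEYS = {"limit", "stopLimit", "market"};

    /** Walks the parsed tree and collects all validation messages in one place. */
    static List<String> validate(Map<String, Object> root) {
        List<String> errors = new ArrayList<>();
        Object stocks = root.get("stocks"); // assumed top level key
        if (!(stocks instanceof List)) {
            errors.add("'stocks' must be a list");
            return errors;
        }
        int index = 0;
        for (Object entry : (List<?>) stocks) {
            String where = "stocks[" + index++ + "]";
            if (!(entry instanceof Map)) {
                errors.add(where + " must be a map");
                continue;
            }
            Map<?, ?> stock = (Map<?, ?>) entry;
            if (!(stock.get("symbol") instanceof String)) {
                errors.add(where + ".symbol is missing");
            }
            int priceKinds = 0;
            for (String key : PRICE_KEYS) {
                if (stock.containsKey(key)) {
                    priceKinds++;
                }
            }
            if (priceKinds != 1) {
                errors.add(where + " needs exactly one of limit, stopLimit or market");
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        Map<String, Object> stock = new HashMap<>();
        stock.put("symbol", "XYZ");
        stock.put("limit", 10.5);
        stock.put("market", true); // conflicts with limit
        Map<String, Object> root = new HashMap<>();
        root.put("stocks", Arrays.asList(stock));
        System.out.println(validate(root));
    }
}
```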

In part 2, we will explore other framework and language choices. Until next time, take care!

Wednesday, May 29, 2013

Camping at Joshua Tree National Park

This Memorial Day weekend we camped at Joshua Tree National Park. Campsites were full and we were just lucky to get (probably) the last campsite that wasn't already taken up. Next time we go camping we have to remember to get there a day early at these "first come first served" sites.

We spent a little less than 24 hours at the park. Camping was fun and we managed to do a moderately strenuous hike to 49 Palms Oasis. Overall it was not bad. Being a desert there's really not much to write home about.

So, we spent an evening at Knott's Soak City on our way back from Joshua Tree.

The next day we did a nice hike in LA, where the world famous Hollywood sign is posted on the side of a hill.

Overall the weather was warm and sunny, with cool winds blowing that made hiking at both places enjoyable.


Monday, May 13, 2013

A collection of "low level" JVM and JavaScript related articles and more

Here's a nice collection of "low level" JVM and JavaScript related articles, neither of which would be on anyone's list for low level programming.

While doing some reading on math and matrix operations in Java I came across many projects trying to overcome the limitations of the JVM while trying to implement numerical recipes:
While we are stuck with this, Dart seems to be making better progress by supporting SIMD instructions. JavaScript is getting weirder ("low level") with Emscripten and Asm.js.

Other interesting Java related reading material:
More next time! (Of course)

Monday, April 22, 2013

Graphs, machine learning, PostGres and other tidbits

I hadn't pushed out my "favorite reads of the season" for a while. So, here's a bunch of links to keep you occupied over the next few days.

Graphs, search and recommendations:
Discussion on Redis mailing list about SSD / Twitter Fatcache / Facebook McDipper and a follow up.
While doing some research on NoSQL systems, especially Cassandra, I was surprised to hear that newer releases of Cassandra are moving away from the flexible, semi-structured column families. Instead, with CQL, there is a, well, somewhat restrictive, repetitive schema that should work well for certain workloads. Is it me or does it look like NoSQL is grudgingly moving towards SQL?
Speaking of SQL, PostGres is moving in the other direction. Recent (9.x+) versions have some very interesting column data types - Array, HSTORE, JSON etc. Of course, its SQL support is obviously fantastic.

And finally, a nice talk on trade processing and a paper on MongoDB for finance.


Sunday, April 14, 2013

Importing OpenSSL/EC2 .pem keypair to Java keystore

I spent several hours scouring the internet looking for a way to import (OpenSSL) Amazon EC2's .pem keypair into a Java keystore. At the end of this frustrating exercise I was baffled to see how scattered the information was.

(FYI, doing this on Windows, especially the OpenSSL interaction part, for self-signing a certificate was painful even with Cygwin. I had to resort to using my Linux distro running in a VM)

To save myself time in the future and for those of you tearing your hair out looking for the same information, here it is. (The file paths are not real. You have to clean them up to match your setup):

Here are my references in no particular order:

To complement this, there were other things I had to do (being a first time user of EC2) to make my EC2 instance accept SSH connections:
And then to install Oracle JDK 7 on my EC2 Ubuntu image:

Thursday, March 21, 2013

Information Overload

Information overload a.k.a:

  • Too many websites, blogs, apps, social networks and not enough unification (and time)
  • Whatever happened to open formats? (ahem.. RSS/Atom?)

Wednesday, February 27, 2013

YesSQL, JVMs that need to be NUMA aware & other stories

Here's a whole bunch of fascinating reading material I've accumulated these past few months. You can tell there's a lot of love going on for SQL/RDBMS. Then some crazy JVM deployments that make you sit up and wonder. There are also quite a few performance related articles on UI/browser technologies.

Data tier:
Here's a nice tool that I've filed for later. Esp useful if you find yourself doing production/support calls - Your logs are your data: logstash + elasticsearch. Sort of a poor man's Splunk. 

UI (mostly beating the life out of HTTP and JavaScript):
After covering all 3 tiers - DBs, JVMs and UIs, why stop there when you can finish it off by learning something about QA/unit testing? Here are some relatively new JUnit features (they've finally caught up with TestNG):

That should keep you busy for several days. Until next time!