{ Make this readable }
Showing posts with label tech. Show all posts

Wednesday, August 12, 2015

Summer 2015 tech reading and goodies

Graph and other stores:
  • http://www.slideshare.net/HBaseCon/use-cases-session-5
  • http://www.datastax.com/dev/blog/tales-from-the-tinkerpop
  • TAO: Facebook's Distributed Data Store for the Social Graph
    Architecture & Implementation
    All of the data for objects and associations is stored in MySQL. A non-SQL store could also have been used, but when looking at the bigger picture SQL still has many advantages:
    …it is important to consider the data accesses that don’t use the API. These include back-ups, bulk import and deletion of data, bulk migrations from one data format to another, replica creation, asynchronous replication, consistency monitoring tools, and operational debugging. An alternate store would also have to provide atomic write transactions, efficient granular writes, and few latency outliers.
  • Twitter Heron: Stream Processing at Scale
    Storm has no backpressure mechanism. If the receiver component is unable to handle incoming data/tuples, then the sender simply drops tuples. This is a fail-fast mechanism, and a simple strategy, but it has the following disadvantages:
    Second, as mentioned in [20], Storm uses Zookeeper extensively to manage heartbeats from the workers and the supervisors. Use of Zookeeper limits the number of workers per topology, and the total number of topologies in a cluster, as at very large numbers, Zookeeper becomes the bottleneck.
    Hence in Storm, each tuple has to pass through four threads from the point of entry to the point of exit inside the worker process. This design leads to significant overhead and queue contention issues.
    Furthermore, each worker can run disparate tasks. For example, a Kafka spout, a bolt that joins the incoming tuples with a Twitter internal service, and another bolt writing output to a key-value store might be running in the same JVM. In such scenarios, it is difficult to reason about the behavior and the performance of a particular task, since it is not possible to isolate its resource usage. As a result, the favored troubleshooting mechanism is to restart the topology. After restart, it is perfectly possible that the misbehaving task could be scheduled with some other task(s), thereby making it hard to track down the root cause of the original problem.
    Since logs from multiple tasks are written into a single file, it is hard to identify any errors or exceptions that are associated with a particular task. The situation gets worse quickly if some tasks log a larger amount of information compared to other tasks. Furthermore, an unhandled exception in a single task takes down the entire worker process, thereby killing other (perfectly fine) running tasks. Thus, errors in one part of the topology can indirectly impact the performance of other parts of the topology, leading to high variance in the overall performance. In addition, disparate tasks make garbage collection-related issues extremely hard to track down in practice.
    For resource allocation purposes, Storm assumes that every worker is homogenous. This architectural assumption results in inefficient utilization of allocated resources, and often results in over-provisioning. For example, consider scheduling 3 spouts and 1 bolt on 2 workers. Assuming that the bolt and the spout tasks each need 10GB and 5GB of memory respectively, this topology needs to reserve a total of 15GB memory per worker since one of the workers has to run a bolt and a spout task. This allocation policy leads to a total of 30GB of memory for the topology, while only 25GB of memory is actually required; thus, wasting 5GB of memory resource. This problem gets worse with increasing number of diverse components being packed into a worker.
    A tuple failure anywhere in the tuple tree leads to failure of the entire tuple tree. This effect is more pronounced with high fan-out topologies where the topology is not doing any useful work, but is simply replaying the tuples.
    The next option was to consider using another existing open-source solution, such as Apache Samza [2] or Spark Streaming [18]. However, there are a number of issues with respect to making these systems work in their current form at our scale. In addition, these systems are not compatible with Storm’s API. Rewriting the existing topologies with a different API would have been time consuming, resulting in a very long migration process. Also note that there are different libraries that have been developed on top of the Storm API, such as Summingbird [8], and if we changed the underlying API of the streaming platform, we would have to change other components in our stack.
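To make the over-provisioning arithmetic in the Heron excerpt concrete, here is a small Python sketch. It is only an illustration of the numbers quoted above, not Storm's actual scheduler: the function name, the task-to-worker placement, and the "size every worker like the largest one" rule are my assumptions based on how the excerpt describes homogeneous workers.

```python
def memory_summary_gb(workers, per_task_gb):
    """Return (reserved, required) memory in GB for a placement.

    Assumption (from the excerpt): every worker is sized identically,
    so each one reserves as much as the most heavily loaded worker.
    """
    # Memory each worker actually needs for the tasks placed on it.
    per_worker = [sum(per_task_gb[task] for task in tasks) for tasks in workers]
    reserved = max(per_worker) * len(workers)  # homogeneous sizing
    required = sum(per_worker)                 # what the tasks really need
    return reserved, required


# Numbers from the excerpt: a bolt needs 10GB, a spout needs 5GB.
per_task_gb = {"bolt": 10, "spout": 5}

# Hypothetical placement of 3 spouts and 1 bolt on 2 workers.
workers = [["bolt", "spout"], ["spout", "spout"]]

reserved, required = memory_summary_gb(workers, per_task_gb)
wasted = reserved - required
print(f"reserved={reserved}GB required={required}GB wasted={wasted}GB")
```

With this placement one worker needs 15GB and the other 10GB, so sizing both at 15GB reserves 30GB against 25GB of real demand, which is the 5GB of waste the paper mentions.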
Until next time!

Monday, June 01, 2015

Spring 2015 reading list

Here's a giant list of articles I read and liked (hat tip to people I follow on Twitter/Blogs. I'm just re-sharing this):

Sunday, April 12, 2015

A simple guide to using Unix/GNU Linux command line tools for fiddling with log files (*runs on Windows too)

I've been meaning to write this post for years now. Every time I thought about compiling a basic list, I've told myself "Nah.. there must be tons of examples on the net". Yes, there are tons of them, but I couldn't find anything:

  • That helped absolute noobs with a consolidated list
  • That demonstrated actual fiddling with Java log files
  • Something that works on Windows(!) No, I don't mean the awful Cygwin tool but something like Busybox or the wonderful Gow
So, here it is:

Sunday, February 01, 2015

Starting 2015 with yet another link dump

A belated happy new year! Here's some reading material I've been accumulating for a few months.

Distributed systems:

Performance related:
On tuning:
Misc tech articles:
Formatting comments on Gerrit:
That's it for now!

Sunday, October 19, 2014

Fall 2014 tech reading

My posts are getting less frequent and when I do post something, I realize that they are mostly just links. Yes, work is keeping me busy.
Big data:
Really? Another Hadoop SQL layer? Another Storm?
For those of you who knew about the original "column oriented stores" and "in-memory stream processing" - KDB - http://queue.acm.org/detail.cfm?id=1531242

Java 8 - the good and ugly bits:
Networks and systems:
The usual Scala and Go hate:
Until next time!