Monday, February 3, 2014

Remote Dependencies, Convenience, Risk and Other Considerations for Operating Distributed Systems

One deeply held principle by experienced distributed system operators that I have worked with is that you should have no external dependencies to your software other than the ties to minimum requirements of the OS such as common system libraries, utilities, and the kernel of the base OS. This approach should enable recreating a distributed system deployment without any dependencies on the outside world. When something goes wrong, you should have control over your own destiny. Reliance on any external dependency that is managed or hosted by someone else introduces risk that something outside your system can affect your ability to restore and recreate the system any time you need to.
To use a simple metaphor, imagine your system is represented by Jenga blocks and it falls over as Jenga towers inevitably do. However, instead of being able to rebuild your tower you find out that a mandatory required component at the base of your tower is missing or unavailable now no matter what you cannot rebuild the tower exactly how it was before. Your new tower is going to behave differently in unexpected ways and you might topple over because you do not understand all the behaviors when using different building blocks combined in a different way.
Some of the original designers of the software deployment project for Cloud Foundry named BOSH (Mark LucovskyVadim SpiwakDerek Collison) embraced this principle and tried to create a prescriptive framework that encouraged this approach. They had experience managing large scale distributed systems at Google (the web services APIs). Kent Skaar also did similar for SaaS provider Zendesk. Given a software release that references specific versions of multiple software packages (known as a BOSH release), an instantiation of that release (a BOSH deployment) can be reconstructed at any time with the deployment configuration (a BOSH deployment manifest), the base OS images (the BOSH stemcells) and the software release (the BOSH packages and job templates for applying configuration). at any point in time, properly implemented BOSH releases of large scale distributed systems can be recreated without external dependencies. That means this holds true even when the internet is unavailable.
BOSH does give you the framework hooks to break out of this prescriptive principle and use external dependencies or at least external dependency formats if you choose to for convenience or other reasons. Dr Nic Williams recently implemented tooling to use apt packages instead of compiling from source. another example: some of the Pivotal big data software intentionally targets CentOS/RHEL only and therefore only ships rpm packages rather than compiling Hadoop. A guiding principle is that you should be mindful of the tradeoffs you are making of convenience vs risk and tying your release to only one OS distributor.
Examples of the tradeoffs:
  • relying on an externally hosted package manager like apt-get could affect the availability or correctness of that dependency when you need it most
  • relying on debian packages could prevent someone from using your release unmodified with a CentOS image
A recent real-world example demonstrated the risk of an external dependency changing unexpectedly. The coreos/etcdproject that Cloud Foundry is using for storing stateful configuration data for the new Cloud Foundry Health Manager codebase had one of the dependencies (goraft/raft) force push to master of their git repository that overwrote some git history required by git to work properly. This situation has limited the flexibility of some users to make code modifications on several previous releases of Cloud Foundry without some tedious intervention.
A common reaction when learning about Cloud Foundry BOSH is to question the prescriptive guidance to compile from source when commonly used distributed package management systems exist in the Linux distributions. My recommendation is to understand the tradeoffs involved and make the best choice for your situation. You should explicitly call out external dependencies if you have them in your system. When you tower inevitably falls over, know how to rebuild it.

Saturday, February 1, 2014

How to Find Java Mission Control on OSX 10.9

i was excited to show some of my colleagues how great Java Mission Control is to debug, troubleshoot and monitor local and remote Java applications. On my OSX 10.9 laptop i installed the latest Oracle JDK 7, which now includes Java Mission Control with the JDK, and i expected to be able to type jmc on the command line. that didn't work and resulted in a command not found! OSX Finder wouldn't find jmc it either. i found a hint on an OTN community thread. the Java installer only put installation files in the obscure and hard to find /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/ instead of also placing a shortcut in /usr/local/bin which should link to /System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/jmc that should change when you use /usr/libexec/java_home. you can see the install location by running this command:
$ find /Library/Java -name jmc
/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/missioncontrol/Java Mission

the solution i chose was to create this executable file in /usr/local/bin/jmc



maybe it's because i had JDK 7 installed originally before Mission Control was included with it. i'm not sure, but i hope this helps someone else.

Thursday, December 19, 2013

Friendly BOSH Labels in vCenter

thanks to the BOSH team for showing me how to do this. Cloud Foundry BOSH automatically tags vSphere deployed VMs with various attributions including the job and index. this way instead of justing have a GUID as the name in vCenter, you can add additional columns that are already populated with the BOSH job and job index. we'll be adding this to the documentation for pivotal cf soon. click the image to see the full size.

Thursday, December 12, 2013

Send Interactive Commands to a Cloud Foundry App with websocketd

on my way home from work tonight i saw the tweet below about connecting STDIN and STDOUT from remote processes with websocket. i tried it out quickly locally and it worked streaming output of numbers from 1 to 10 sent to STDOUT 1 second apart over websocket using localhost. can i apply this to cloud foundry easily?  it turns out the answer is yes!
you can easily include the small linux 64bit websocketd binary and this script with your app and remotely send commands over websocket that will execute in the app container and stream the STDOUT from the command back over websocket to the browser. this is helpful for sending commands like rake db:migrate or to explore the linux container file system after the buildpack has run. see the screenshot and video below. for the impatient, skip to about 3:00 of the short demo.

more instructions are on github. also see the websocketd project.

Sunday, December 8, 2013

Be Direct When You Communicate

"Be direct when communicating" is a common theme I've been hearing the last few days in Pivotal leadership discussions and other places.

When I listened to the Twitter CEO Dick Costolo fireside chat with PandoMonthly (start listening at around 35:30) it crystalized how important this is and how easy it is to fall out of this to appease the feelings of someone you're meeting with. Dick describes how he has a management / leadership training class where they do exercises for this and how many experienced people still yield to the temptation to "migrate along the Y-axis" and give up clarity instead of "optimize for the X-axis." Doing this definitely takes practice, but I'm going to remind myself to think of these axes in my communication. It's totally great if the person feels great about the discussion, but how they feel about it is not as important as receiving and understanding the message. There are certainly many conversations where someone will feel badly about the message, and that's a fine outcome if they receive the message and were unlikely to feel good about it under any circumstances.

Monday, October 14, 2013

Help Us Design and Build an Open Source Google Omega

One key Cloud Foundry design goal is that a single operations person should be able to manage hundreds or thousands of machines. We want to bring to the similar tools that are available privately to web scale-out companies like Google with Omega to the typical enterprise. We are looking for a hands-on practitioner that is passionate about creating and operating large systems, wants to act as the product owner role and enjoys working shoulder-to-shoulder with an extremely talented and experienced Cloud Foundry BOSH engineering team. You will also join an amazing user community with passionate technologists like Dr Nic Williams. Watch Dr Nic's talk describing why he fell in love with Cloud Foundry BOSH in his presentation at PlatformCF in September of 2013.

We’re looking for someone passionate and conceptually aware of:
  • cloud operator user experience
  • cluster and data lifecycle management
  • vm/container orchestration
  • complex network management
  • disk volumes
  • software package management
  • basic monitoring

Dr Nic metaphor for BOSH at PlatformCF 2013
Cloud Foundry runtime is a PaaS for running apps and services. Underneath the PaaS, there is a whole other aspect of Cloud Foundry named BOSH (Bosh Outer SHell), which was inspired by the systems in use at Google, Amazon, Facebook and Twitter to deploy and manage their software across many data centers around the world. There is not anything else available in open source that has the same scope and capabilities. Apache Mesos is in the same space, but has a different architecture that doesn't focus on IaaS orchestration, utilizing base operating system images and fine-grained control over network and disk. Cloud Foundry BOSH is the secret sauce that enables running the same Cloud Foundry runtime software on VMware, Amazon and OpenStack infrastructure-as-a-service changing only a minimal amount of infrastructure specific configuration. Not only does BOSH deploy these systems across hundreds or thousands of instances, but it is also able to keep them up-to-date without downtime as new software and fixes are released and rolled through the machines incrementally including kernel patches and middleware.

Cluster lifecycle management and operator user experience will be main focus areas for BOSH in 2014. Many stateful scale-out clustered data services that Cloud Foundry users want to deploy and manage with BOSH have nuances around startup sequence, whether cluster nodes can run in a mixed version cluster, and related considerations. This is a hard problem that when solved well, will be unique in the open source community.

Recently, we made it possible to use BOSH on a developer class machine with bosh-lite. The bosh-lite project uses vagrant and Linux Containers. Instead of traditional IaaS with multiple VMs for each role, bosh-lite uses linux containers inside of a single linux host. Therefore it is much faster to develop and test a BOSH release with fewer resources.

One of the best things the work is the collaboration process. Pivotal truly values the product owner / product manager (PM) role. This role makes the key product decisions with input from all stakeholders, but ultimately the product owner is responsible for defining the product. The PM is embedded shoulder-to-shoulder with the engineering team and is responsible for representing end-users and should meet with customers to build user empathy. The PM prioritizes the daily work items for the engineering team with the agile software tool Pivotal Tracker. if the PM prioritizes a feature in the engineering backlog, we can have it in production in days. The feeling of true influence and collaboration over the product destiny cannot be contrasted enough with the experience I had with a traditional waterfall process, such as my time in Product Management at Oracle.

Another incredible benefit is that you will be working with a truly world class product management team including James Watters, Matt Reider, Shannon Coen, Justin Richard, Scott Truitt, Mark Kropf, Ryan Morgan and Tammer Saleh. We work at 875 Howard St in downtown San Francisco, walking distance from many commute options and have incredible amenities.

If you read this far, then you should get in touch with and reference the BOSH PM position or reach out to me on twitter.

Additional reading:

Google Omega:

Mesos - open source clustering platform in use at Twitter and others:

Yarn - distributed workload management using Hadoop 2.0+

Systems and configuration management automation:

Distributed linux container management:

Saturday, October 5, 2013

A Quick Tour of the Cloud Foundry Router and Sticky Sessions

The Cloud Foundry Router is a key component of the platform that provides:
  • load balancing to multiple instances of applications with a dynamic routing table updated in a fraction of a second every time there is a change in the system
  • sticky session support
  • access log for apps that feeds into an application's Loggregator stream
  • support for Web Socket and other TCP protocols via HTTP Connect
Cloud Foundry has had support for "sticky" sessions in both v1 and v2. This week I had a meeting with a very large technology provider that has been using Cloud Foundry for quite awhile. They had no idea that sticky sessions were supported or how they worked and I realized we should have a better explanation and showcase of this capability. The use-case for this scenario involved a significant performance optimization when using sticky sessions because of user-specific data caching in the warmed-up app instance.

To enable Cloud Foundry sticky sessions all you need to do in the application is set a cookie named JSESSIONID. Java web applications have a history of using a cookie named JSESSIONID when you enable a session with code like request.getSession( true ) and because many Cloud Foundry users run Java apps, the project decided to use JSESSIONID as an indicator that sticky sessions should be used. To use another sticky sessions with another language like Ruby or Node.js, all you need to do is set a cookie named JSESSIONID and Cloud Foundry will attempt to consistently route requests to the same app instance.

I created a simple Java web application to show how this works and I hosted a copy of it on In the first screenshot you can see that sticky sessions are not enabled. First we can see that there are 4 instances of the app running.

Now when we visit the app and use something like Chrome Dev Tools to inspect the app. Refresh the app several times and observe that the printed out port changes as we are load balanced across the multiple app instances.

Now we'll click the "start a sticky session" link in the app, which creates a Java session in Tomcat and creates a JSESSIONID cookie. The router will notice the JSESSIONID cookie and set a Cloud Foundry specific cookie named __VCAP_ID__ with a value of the app instance GUID. The __VCAP_ID__ cookie is the hint to the router to send the request to the specific app instance. With these cookies set, refresh the page and the browser should consistently route to the single app instance.

If you use Chrome Dev Tools to delete the __VCAP_ID__ cookie and refresh the page, and you should see a randomly selected app instance value get set again for __VCAP_ID__. Subsequent page refreshes will be sticky to the instance reflected in the updated __VCAP_ID__ cookie value.

If you are interested in other features for the Cloud Foundry Router, let us know on the Cloud Foundry mailing list.