Being Alive and Ready

Hey, a guy I respect in DevOps told me that microservices implementing liveness and readiness probes are a good thing in a Kubernetes environment. That sounds obvious, but if an app is in a ready state, wouldn't that also mean it is live? Why the curve ball? Well, the container state machine goes something like startup -?-> liveness -?-> readiness, and then the pod is up and running. It is generally considered bad to send traffic to a pod that is not ready to process data. Otherwise, it can cause data loss if proper exception handling is not in place, or if a reliable connection is assumed where there isn't one, like UDP traffic. The liveness probe is there to detect a pod in a bad state and restart it, to get more use out of an iffy pod hosting an application of questionable quality. Hmm, that sounds like making the best of a bad situation until someone is notified. For a pod to transition from one state to another, some conditions need to be met. What is the condition for transitioning from the startup check to the liveness check, or from liveness to readiness?
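
To make that concrete, here is a rough sketch of wiring all three probes onto a container using the Go API types, assuming a recent k8s.io/api where the handler field is named ProbeHandler. The image name, port 8080, the /healthz and /readyz paths, and the thresholds are placeholders, not recommendations. As far as the transition question goes, the startup probe gates the other two: the kubelet holds off on liveness and readiness checks until the startup probe has succeeded.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    // httpProbe builds an HTTP GET probe against the given path on port 8080.
    func httpProbe(path string, period, failures int32) *corev1.Probe {
        return &corev1.Probe{
            ProbeHandler: corev1.ProbeHandler{
                HTTPGet: &corev1.HTTPGetAction{Path: path, Port: intstr.FromInt(8080)},
            },
            PeriodSeconds:    period,
            FailureThreshold: failures,
        }
    }

    func main() {
        app := corev1.Container{
            Name:  "app",
            Image: "example/app:latest", // placeholder image
            // Startup gates the other two: the kubelet disables liveness and
            // readiness checks until the startup probe succeeds once.
            StartupProbe: httpProbe("/healthz", 5, 30), // 5s x 30 = up to 150s to boot
            // A failing liveness probe restarts the container.
            LivenessProbe: httpProbe("/healthz", 10, 3),
            // A failing readiness probe pulls the pod out of the Service endpoints.
            ReadinessProbe: httpProbe("/readyz", 5, 2),
        }
        fmt.Println("probes configured for container:", app.Name)
    }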

So, the liveness probe relies on a self check, and it is often an HTTP GET against a health check URL of the application in question. This health check does not mean the app can talk to a backend DB and so on. It only means the app is able to receive an HTTP request and respond. The app owner should make the liveness check comprehensive, but only within the scope of the app or its container. I think people usually don't do much beyond request and respond. The app could also check its resource utilization and fail the liveness check to have itself restarted by K8s. There is no shame in that. People would understand your app is busy and from time to time needs to take a 40 second break. Besides, the app's buddies, the other replicas, would pick up the load and everything will be fine.
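
For illustration, here is a minimal sketch of such a handler in Go, assuming the app exposes the check at /healthz on port 8080 (both made up); the goroutine count cutoff is just an example of an in-process resource check, not a real threshold.

    package main

    import (
        "net/http"
        "runtime"
    )

    func main() {
        // Liveness only answers "can this process take an HTTP request and
        // respond?" It deliberately does not touch the DB or anything else
        // outside the container.
        http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
            // Example of an in-process resource check: if the goroutine count
            // blows past a made-up threshold, report unhealthy so Kubernetes
            // restarts the container.
            if runtime.NumGoroutine() > 10000 {
                http.Error(w, "too many goroutines", http.StatusInternalServerError)
                return
            }
            w.WriteHeader(http.StatusOK)
            w.Write([]byte("ok"))
        })
        http.ListenAndServe(":8080", nil)
    }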

Applications I have seen deployed to a cloud environment often do not handle traffic congestion well. They don't back off when downstream services are having a hard time serving requests. Even if they do back off, they don't signal upstream services to take it easy for a while until things are looking better. To make the traffic flow nicer, one can implement buffering in the traffic path, like a pub/sub system, but an application that lives in a pod should implement a readiness probe. If your app detects congestion downstream, you can have it stop responding to the readiness probe, which should cause the Kubernetes Service to stop routing traffic to it. How does one know there is congestion downstream? Maybe via timeout exceptions or other error handling events. The issue with this approach is that you assume downstream services are good at communicating a congestion condition to your app. Developers are usually in a rush, so who has time to test congestion signal propagation all the way back to the initial application? Usually this is QA's responsibility, but testing this scenario is not trivial. What ends up happening most of the time is that you have a live production problem and people gather information about it then. Further, once the production problem gets resolved, the on-call team disbands and goes its separate ways again. So, I don't see a straightforward solution to this. The minimum is to implement a readiness probe when it is your turn to give a damn.
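
Here is a minimal sketch of that idea in Go, assuming a /readyz endpoint on port 8080 and a made-up noteTimeout hook that the downstream client's error handling would call; none of these names come from a real library.

    package main

    import (
        "net/http"
        "sync/atomic"
        "time"
    )

    // downstreamCongested is flipped by whatever code calls the downstream
    // service; the wiring here is hypothetical.
    var downstreamCongested atomic.Bool

    // noteTimeout would be called from the downstream client's error path;
    // it marks the app not-ready for a cool-down period.
    func noteTimeout(coolDown time.Duration) {
        downstreamCongested.Store(true)
        time.AfterFunc(coolDown, func() { downstreamCongested.Store(false) })
    }

    func main() {
        http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
            if downstreamCongested.Load() {
                // Failing readiness only removes this pod from the Service
                // endpoints; unlike a liveness failure, it does not restart
                // the container.
                http.Error(w, "downstream congested", http.StatusServiceUnavailable)
                return
            }
            w.WriteHeader(http.StatusOK)
        })
        http.ListenAndServe(":8080", nil)
    }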

The startup, liveness, and readiness probes occupy distinct usage spaces. The liveness check is the state that follows the startup check. But since you don't have to have a startup probe to have a liveness probe, couldn't I just extend the liveness probe's initial wait long enough to account for the startup time? Maybe, but it may also be a little clunky doing it that way, because the liveness probe reports the status of the application when it is already in a running state. The app's ongoing health is monitored by the liveness state. It is in our interest to react to problems while the application is running; it is not in our interest to be fooled by a long startup time. So it makes sense to keep the startup and liveness checks as separate states. Lastly, the readiness state does not seem to depend on the liveness state. That is, liveness and readiness checks run in parallel. So the three states don't form a state machine after all.
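
To put some numbers on the clunkiness, here are the two alternatives side by side as Go probe definitions (again the k8s.io/api types, with made-up timings): padding initialDelaySeconds on the liveness probe versus letting a separate startup probe own the boot budget.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    func main() {
        healthz := corev1.ProbeHandler{
            HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8080)},
        }

        // Option A: no startup probe; pad the liveness probe's initial delay to
        // cover the slowest boot we expect. The padding only applies once per
        // container start, but a genuine hang inside that first 120s window
        // also goes unnoticed.
        padded := &corev1.Probe{
            ProbeHandler:        healthz,
            InitialDelaySeconds: 120,
            PeriodSeconds:       10,
            FailureThreshold:    3,
        }

        // Option B: a dedicated startup probe owns the boot budget
        // (10s x 30 failures = up to 300s), so the liveness probe can stay
        // tight once the app is actually running.
        startup := &corev1.Probe{
            ProbeHandler:     healthz,
            PeriodSeconds:    10,
            FailureThreshold: 30,
        }
        tight := &corev1.Probe{
            ProbeHandler:     healthz,
            PeriodSeconds:    10,
            FailureThreshold: 3,
        }

        fmt.Println(padded.InitialDelaySeconds, startup.FailureThreshold, tight.FailureThreshold)
    }

Either way the health endpoint itself is the same; what changes is who owns the startup budget.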

