Tuesday, October 21, 2014

Reviving Quality Circles to Continuously Learn in Fragile Systems

A "gift" from the team after my latest quality crusade
I presented this information at two Agile Austin QA SIG meetings in September and October 2014.  It was a great opportunity to share some of what I've learned since joining Socialware.  This work is helping me change the culture for the benefit of our customers and is a key part of "The Socialware Way".  Before too long, I expect nearly everyone to say, "Quality is sexy!"

Background

While greenfield development efforts always hold the promise of "doing it right" from the ground up, they can quickly devolve into "legacy" systems that are essentially unsafe to modify, because almost any change risks causing problems for users that the development team cannot anticipate. Further, many of today's applications exist in a complex and fragile ecosystem of APIs and other dependencies that are beyond the control of a team, a division, or an entire organization. A culture of continuous learning is key to combating these challenges and creating safe, valuable software for customers and for the development teams that build and maintain it.
At Socialware, we have started reviewing all critical production issues in an open and visible manner, using a Quality Circle approach as a team. We call this the Critical Defect Review Process.  This is especially important because our products exist on top of the constantly changing APIs of LinkedIn, Facebook, and Twitter--the world's largest social media companies.  Our company has the largest social media deployments by financial services firms in the world, and these firms have little, if any, tolerance for problems due to their scale and regulatory requirements, so the focus on quality is paramount.

The Goal 

The goal of the Critical Defect Review Process is to understand the root cause of why any high-priority defect exists in production and to take specific remediation actions to ensure that this issue and others like it do not occur again.  We also establish ownership of any actions going forward.  Finally, we keep a recorded log of these issues, which will be reviewed periodically.  Note that the goal is NOT a "witch hunt" in terms of assigning blame.  If we are not open to deep understanding, we will never be able to solve the problem.  The ultimate goal is to retire the process once we have zero critical defects in production, however that comes about.

Implementation

We continue to inspect and adapt this process, but this is the current basic framework of our Critical Defect Review Process.


Periodicity

This review occurs every three weeks (the same cadence as our sprints) and covers all resolved production defects marked as either a "Priority 1" or "Priority 2."  We wait until they are fixed because the highest priority is to resolve any customer-critical issues, regardless of their genesis.  Additionally, time for reflection generally helps us further understand the root cause of problems and the eventual solution.
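
As a rough illustration of that selection rule, here is a minimal Python sketch. The `Defect` class, its field names, the `PROD-…` keys, and the idea of filtering by a sprint date window are all assumptions made for the example; they are not a description of our actual tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Defect:
    key: str                     # hypothetical tracker ID, e.g. "PROD-101"
    priority: int                # 1 = "Priority 1", 2 = "Priority 2", ...
    resolved_on: Optional[date]  # None while the defect is still open


def select_for_review(defects: List[Defect],
                      sprint_start: date,
                      sprint_end: date) -> List[Defect]:
    """Return resolved Priority 1/2 defects fixed within the sprint window."""
    return [
        d for d in defects
        if d.priority in (1, 2)              # only the highest priorities
        and d.resolved_on is not None        # open defects wait until fixed
        and sprint_start <= d.resolved_on <= sprint_end
    ]


if __name__ == "__main__":
    backlog = [
        Defect("PROD-101", 1, date(2014, 10, 1)),
        Defect("PROD-102", 3, date(2014, 10, 2)),  # too low a priority
        Defect("PROD-103", 2, None),               # still open
    ]
    for defect in select_for_review(backlog, date(2014, 9, 29), date(2014, 10, 19)):
        print(defect.key)
```

Open defects are deliberately excluded: fixing the customer's problem comes first, and the review happens afterward.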


Participants

The following roles participate in these reviews:
  • Moderator/Scribe (generally a member of the Development Management team)
  • Product Technical Lead
  • Primary Engineer involved 
  • Primary QA Engineer involved 
  • QA Lead
  • Customer Support Lead
  • All team members are, of course, welcome to attend and listen

Format

The discussion captures the issue in question, any related issues, the primary engineer associated with the issue, the primary QA engineer associated with the issue, the root cause/synopsis, recommended actions moving forward (remediation), and agreed actions. All of this is captured in our wiki. Note that the agreed actions may be an acceptance of a certain risk within our process, although this is not the primary desired objective.  This information will be reviewed on a regular basis by Product Development Management, Executive Leadership, and the team, at the very least during the sprint retrospective.
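
For readers who want to keep a similar log, the fields listed above could be modeled roughly as follows. This is only a sketch: the `ReviewRecord` class, the `format_record` helper, and the plain-text layout are hypothetical and do not reflect our actual wiki template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReviewRecord:
    issue: str                   # the issue in question
    primary_engineer: str        # recorded for context, not for blame
    primary_qa_engineer: str
    root_cause: str              # root cause / synopsis
    related_issues: List[str] = field(default_factory=list)
    recommended_actions: List[str] = field(default_factory=list)
    agreed_actions: List[str] = field(default_factory=list)  # may include an accepted risk


def format_record(record: ReviewRecord) -> str:
    """Render one review record as plain text suitable for pasting into a wiki."""
    def bullets(items: List[str]) -> str:
        # Fall back to a placeholder so empty sections stay visible in the log.
        return "\n".join(f"  - {item}" for item in items) or "  - (none)"

    return "\n".join([
        f"Issue: {record.issue}",
        "Related issues:",
        bullets(record.related_issues),
        f"Primary engineer: {record.primary_engineer}",
        f"Primary QA engineer: {record.primary_qa_engineer}",
        f"Root cause/synopsis: {record.root_cause}",
        "Recommended actions:",
        bullets(record.recommended_actions),
        "Agreed actions:",
        bullets(record.agreed_actions),
    ])
```

Keeping recommended and agreed actions as separate lists makes it visible when the team chose to accept a risk rather than act on every recommendation.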

Again, there is no blame assigned, which can only work within the larger context of "The Socialware Way" where our team members are respected, trusted, and treated as our most precious assets as opposed to "resources."  


Early Results

While we have only been doing this for about two months, we've seen some fairly impressive results.  Not only are the issues trending down, we are also able to respond more quickly and have a deeper understanding of why quality problems exist.  Of course, we are implementing a number of other changes to our product development system that result in high-quality customer value as a natural outcome of our system rather than something that needs to be forced.  So, the attribution of the increased quality is manifold.  I will continue to monitor and share results.

Genesis

This process grew out of the Quality Circle, a concept introduced by Dr. Ishikawa.  It was later interpreted as "cross-functional teams," which were a key part of some Total Quality Management systems.

Thoughts?

I would love to hear your thoughts on this process.  Please feel free to send me a direct message or comment below.  Thank you for taking the time to read!

3 comments:

  1. I actually think a bigger impact is that things are supposed to work more than they are supposed to meet deadlines. Meeting deadlines is the fastest way to get shortcut code, and establishing at an executive level that bug-free code is more important than meeting deadlines has a big impact. Of course, still meet deadlines at a reasonable level. The actions and words of executives are echoed and reverberated in the engineers they bear.

  2. Hi Lawrence--good comment, thanks! You're bringing up some of the other changes we're making as well ("Of course, we are implementing a number of other changes to our product development system that result in high-quality customer value as a natural outcome of our system rather than something that needs to be forced. So, the attribution of the increased quality is manifold."), and why I titled the original blog post "Quality is sexy volume one." It's in the URL :)

  3. I thought the following paper was pertinent here: http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf. I especially liked the quote 7) Post-accident attribution accident to a ‘root cause’ is fundamentally wrong.
    Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident.
    There are multiple contributors to accidents. Each of these is necessary insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident.
    Indeed, it is the linking of these causes together that creates the circumstances required for the accident. Thus, no isolation of the ‘root cause’ of an accident is possible. The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the s
