Cloud-native developer. Distributed systems wannabe. DevOps and continuous delivery. 10x troublemaker. DevOps Manager at VHT.

Dew Drop – October 19, 2018 (#2827)


Top Links

Web & Development

XAML, UWP & Xamarin

Visual Studio & .NET

Design, Methodology & Testing

Mobile, IoT & Game Development

Podcasts, Screencasts & Videos

Community & Events


More Link Collections

The Geek Shelf

The post Dew Drop – October 19, 2018 (#2827) appeared first on Morning Dew.

Read the whole story
4 hours ago
Akron, OH
Share this story

Orca: differential bug localization in large-scale services


Orca: differential bug localization in large-scale services Bhagwan et al., OSDI’18

Earlier this week we looked at REPT, the reverse debugging tool deployed live in the Windows Error Reporting service. Today it’s the turn of Orca, a bug localisation service that Microsoft have in production usage for six of their large online services. The focus of this paper is on the use of Orca with ‘Orion,’ where Orion is a codename given to a ‘large enterprise email and collaboration service that supports several millions of users, run across hundreds of thousands of machines, and serves millions of requests per second.’ We could call it ‘Office 365’ perhaps? Like REPT, Orca won a best paper award (meaning MSR scooped two out of the three awards at OSDI this year!).

Orca is designed to support on-call engineers (OCEs) in quickly figuring out the change (commit) that introduced a bug to a service so that it can be backed out. (Fixes can come later!). That’s a much harder task than it sounds in highly dynamic and fast moving environments. In ‘Orion’ for example there are many developers concurrently committing code. Post review the changes are eligible for inclusion in a build. An administrator periodically creates new builds combining multiple commits. A build is the unit of deployment for the service, and may contain from one to hundreds of commits.

There’s a staged roll-out process where builds move through rings. A ring is just a pre-determined set of machines that all run the same build. Builds are first deployed onto the smallest ring, ring 0, and monitored. When it is considered safe the build will progress to the next ring, and so on until it is finally deployed world wide.

Roughly half of all Orion’s alerts are caused by bugs introduced through commits. An OCE trying to trace an alert back to a commit may need to reason through hundreds of commits across a hierarchy of builds. Orca reduced the OCE workload by 3x when going through this process. No wonder then that it seems to be spreading rapidly within Microsoft.

Example post-deployment bugs

Over a period of eight months, we analyzed various post-deployment bugs and the buggy source-code that caused them. Table 2 (below) outlines a few characteristic issues.

The commits column in the table above shows the number of candidate commits that an OCE had to consider while triaging the issue. Post-deployment bugs fall predominantly into the following categories:

  • Bugs specific to certain environments (aka ‘works for me!’)
  • Bugs due to uncaptured dependencies (e.g. a server-side implementation is modified but the corresponding client change has not been made)
  • Bugs that introduce performance overheads (that only emerge once a large number of users are active)
  • Bugs in the user-interface whereby a UI feature starts misbehaving and customers are complaining.

Studying the bugs and observing the OCEs gave a number of insights that informed Orca’s design:

  • Often the same meaningful terms occur both in the symptom and the cause. E.g. in bug no 1 in the above table the symptom is that the People Suggestion feature has stopped working, and the commit introducing the problem has a variable ‘suggest’.
  • Testing and anomaly detection algorithms don’t always find a bug immediately. A bug can start surfacing in a new build, despite being first introduced in a much older build. For example bug 3 in the table appeared in a build that contained 160 commits, but the root cause was in the previous build with 41 commits.
  • Builds may contain hundreds of commits, so manually attributing bugs can be a long task.
  • There are thousands of probes in the system, and probe failures and detections are continuously logged.

Based on these insights Orca was designed as a custom search engine using the symptom as the query text. A build provenance graph is maintained so that Orca knows the candidate set of commits to work through, and a custom ranking function is used to help rank commits in the search results based on a prediction of risk in the commit. Finally, the information already gathered from the logged probe failures gives a rich indication of likely symptom search terms, and this can be used to track the frequency of terms in queries and to use Inverse Query Frequency, IQF (cf. Inverse Document Frequency, IDF), in search ranking.

In addition to being used by OCEs to search symptoms, multiple groups have also integrated Orca with their alerting system to get a list of suspect commits for an alert and include it in the alert itself.

How Orca works

The input query to Orca is a symptom of the bug… The idea is to search for this query through changes in code or configurations that could have caused this bug. Thus the “documents” that the tool searches are properties that changed with a commit, such as the names of files that were added, removed or modified, commit and review comments, modified code and modified configuration parameters. Orca’s output is a ranked list of commits, with the most likely buggy commit displayed first.

The first step is to tokenize the terms from the presenting symptoms, using heuristics specially built for code and log messages (e.g. splitting large strings according to Camel-casing). In addition to single tokens, some n-gram tokens are also created. After tokenization, stop words are filtered out, as well as commonly used or irrelevant terms found through analysing about 8 million alerts in Orion’s log store (e.g., the ‘Exception’ postfix on class names).
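The tokenization step can be sketched in a few lines of Python. This is only an illustration of the idea described above, not Orca's implementation: the stop-word list is a hypothetical stand-in for the one the authors derived from Orion's alert logs, and the function names are made up for this example.

```python
import re

# Hypothetical stop-word list. The paper derives its list from roughly
# 8 million alerts; these entries are placeholders for illustration.
STOP_WORDS = {"the", "a", "an", "is", "has", "in", "of", "to", "exception"}


def split_identifier(text):
    """Split text on non-alphanumerics, then on camel-case boundaries."""
    tokens = []
    for chunk in re.split(r"[^A-Za-z0-9]+", text):
        # Alternatives: an acronym run (e.g. "HTTP" in "HTTPServer"),
        # a capitalised or lowercase word, or a run of digits.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [t.lower() for t in tokens]


def tokenize(symptom, max_n=2):
    """Return single tokens plus n-grams (up to max_n), minus stop words."""
    words = [w for w in split_identifier(symptom) if w not in STOP_WORDS]
    tokens = list(words)
    for n in range(2, max_n + 1):
        tokens.extend(
            " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
        )
    return tokens


print(tokenize("PeopleSuggestion has stopped working"))
```

Note that this sketch builds n-grams after stop-word removal, which is one of several reasonable orderings; the summary above does not say which one Orca uses.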

Now we need to find the set of commits to include in the search. Orca creates and maintains a build provenance graph. The graph captures dependencies between builds and across rings.

Builds that are considered stable in a given ring can be promoted to the next ring, which gives rise to an inter-ring edge. Interestingly not all fixes roll forward through this process though. A critical bug found in a build in a later ring can be fixed directly in that ring and then the fix is back-ported to earlier rings. Orca uses the graph to trace a path back from the build in which the symptom is exhibiting, to the origin build in ring 0. The candidate list of commits to search includes all commits in the builds on this path, together with any back-ported commits.
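As a rough sketch of that walk, assume each build records the build it was derived from plus any fixes applied directly to it. The `Build` class, its field names, and the build and commit identifiers below are all hypothetical; the paper's actual graph representation is not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Build:
    name: str
    ring: int
    commits: List[str]                 # commits first introduced in this build
    parent: Optional[str] = None       # build this one was derived/promoted from
    backported: List[str] = field(default_factory=list)  # fixes applied in-ring


def candidate_commits(builds: Dict[str, Build], failing: str) -> List[str]:
    """Walk from the failing build back to its origin, collecting commits.

    Assumes the parent links are acyclic and end at a ring 0 build.
    """
    seen, ordered = set(), []
    current: Optional[Build] = builds[failing]
    while current is not None:
        for commit in current.commits + current.backported:
            if commit not in seen:
                seen.add(commit)
                ordered.append(commit)
        current = builds[current.parent] if current.parent else None
    return ordered


builds = {
    "r0-b7": Build("r0-b7", ring=0, commits=["c1", "c2"]),
    "r1-b7": Build("r1-b7", ring=1, commits=[], parent="r0-b7", backported=["c9"]),
    "r2-b7": Build("r2-b7", ring=2, commits=[], parent="r1-b7"),
}
print(candidate_commits(builds, "r2-b7"))
```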

Given the set of commits, Orca uses differential code analysis to prune the search space. ASTs of the old and new versions of the code are compared to discover relevant parts of the source that have been added, removed and modified in a commit. The analysis finds differences in classes, methods, references, conditions, and loops. For all of the changed entities, a heuristic determines what information to include in the delta. For example, if two lines in a function have been changed the diff will include the entire text of the two changed lines (old version and new version) as well as the name of the function. This approach catches higher-level structures that a straight lexical analysis of the diff would miss.
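To make the idea of an AST-level diff concrete, here is a small analogy using Python's standard `ast` module on Python source. Orca analyses the service's own code with its own tooling, so treat this purely as an illustration of the technique: it only looks at functions, and the function names in the sample inputs are invented.

```python
import ast


def functions(source):
    """Map each function name to a structural dump of its AST.

    ast.dump omits line numbers by default, so moving a function
    without changing it does not count as a modification.
    """
    tree = ast.parse(source)
    return {
        node.name: ast.dump(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def diff_functions(old_source, new_source):
    """Report functions added, removed, or modified between two versions."""
    before, after = functions(old_source), functions(new_source)
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(
            name for name in before.keys() & after.keys()
            if before[name] != after[name]
        ),
    }


old = "def suggest(q):\n    return lookup(q)\n\ndef ping():\n    return 'ok'\n"
new = "def suggest(q):\n    return lookup(q.strip())\n\ndef rank(xs):\n    return sorted(xs)\n"
print(diff_functions(old, new))
```

Because the comparison is structural, the name of the enclosing function comes along with every change, which is the kind of context a line-based diff would not provide.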

The output of the differential code analysis is searched for tokens from the tokenised symptom description, using TF-IQF as a ‘relevance’ score. These scores are first computed on a per-file basis and then aggregated to give an overall score for the commit. In the evaluation, ‘max’ was found to work well as the aggregation function.
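A minimal sketch of this scoring follows. The exact TF-IQF weighting is not given in this summary, so the `iqf` formula below is an assumption made by analogy with smoothed IDF, and the file names, tokens, and query counts are invented for the example.

```python
import math
from collections import Counter


def iqf(token, query_counts, total_queries):
    """Inverse query frequency: tokens seen in many past queries score low.

    Assumed formula (smoothed, always positive), by analogy with IDF.
    """
    return math.log(1 + total_queries / (1 + query_counts.get(token, 0)))


def file_score(query_tokens, file_tokens, query_counts, total_queries):
    """Sum of term frequency times IQF over the distinct query tokens."""
    tf = Counter(file_tokens)
    return sum(
        tf[t] * iqf(t, query_counts, total_queries) for t in set(query_tokens)
    )


def commit_score(query_tokens, files, query_counts, total_queries):
    """Aggregate per-file scores with max, as in the paper's evaluation."""
    return max(
        (file_score(query_tokens, toks, query_counts, total_queries)
         for toks in files.values()),
        default=0.0,
    )


query = ["suggest", "failed"]
query_counts = {"failed": 900, "suggest": 3}   # hypothetical query log stats
commit_a = {"a.cs": ["suggest", "suggest", "cache"], "b.cs": ["failed"]}
commit_b = {"c.cs": ["failed", "failed"]}
print(commit_score(query, commit_a, query_counts, 1000))
print(commit_score(query, commit_b, query_counts, 1000))
```

In this made-up data ‘failed’ appears in most past queries, so a commit that mentions it twice still scores below a commit that mentions the rarer ‘suggest’.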

We can now return the ranked search results. However, the authors found that very often multiple commits had the same or very similar scores. To break ties between commits with the same scores, a commit risk prediction model is used.
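One simple way to express this tie-breaking is a two-part sort key: relevance first, predicted risk second. The rounding used to treat near-identical scores as ties is an assumption of this sketch, as are the commit names and numbers.

```python
def rank_commits(relevance, risk, top_n=10, precision=2):
    """Order commits by relevance, breaking ties with predicted risk.

    relevance: commit id -> relevance score
    risk:      commit id -> risk value in [0, 1]
    Scores are rounded to `precision` places so that very similar
    scores count as ties (an assumption made for this illustration).
    """
    ordered = sorted(
        relevance,
        key=lambda c: (-round(relevance[c], precision), -risk.get(c, 0.0)),
    )
    return ordered[:top_n]


relevance = {"c1": 4.201, "c2": 4.203, "c3": 1.0}
risk = {"c1": 0.2, "c2": 0.9, "c3": 0.99}
print(rank_commits(relevance, risk))
```

Here the high risk value of c3 does not lift it above the other two, because risk is only consulted when relevance scores tie.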

Commit risk prediction

We have built a regression tree-based model that, given a commit, outputs a risk value for it which falls between 0 and 1. This is based on data we have collected for around 93,000 commits made over 2 years. Commits that caused bugs in deployment are labeled ‘risky’ while those that did not, we labeled ‘safe.’ We have put in considerable effort into engineering the features for this task…

This is a super-interesting area in its own right. The paper only includes a brief summary of the main features that inform this model:

  • Developers who are new to the organisation and the code base tend to create more post-deployment bugs. So there are several experience-related features in the model.
  • Files mostly changed by a single developer tend to have fewer bugs than files touched by several developers. So the model includes features capturing whether a commit includes files with many owners or few.
  • Certain code-paths when touched tend to cause more post-deployment bugs than others, so the model includes features capturing this.
  • Features such as file types changed, number of lines changed, and number of reviewer comments capture the complexity of the commit.


Since its deployment for ‘Orion’ in October 2017, Orca has been used to triage 4,400 issues. The authors collected detailed information for 48 of these to help assess the contribution that Orca makes to the process. Recall that an OCE is presented with a symptom and may have to search through hundreds of commits to find the culprit. The following table shows how often Orca highlights the correct commit in a top-n list of results (look at the top line, and the ‘ALL’ columns):

As deployed, Orca presents 10 search results. So the correct commit is automatically highlighted to the OCE as part of this list in 77% of cases. The build provenance graph contributes 8% to the recall accuracy, and the commit risk-based ranking contributes 11%.
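For readers unfamiliar with top-n recall, the metric can be computed as below. The ranks in the example are made up purely to show the calculation and are not the study's data.

```python
def recall_at(ranks, n):
    """Fraction of issues whose correct commit appears in the top n results.

    ranks: position of the correct commit for each issue (1 = first),
           or None if the correct commit was not returned at all.
    """
    hits = sum(1 for r in ranks if r is not None and r <= n)
    return hits / len(ranks)


# Hypothetical ranks for five issues (not taken from the paper).
example_ranks = [1, 3, None, 12, 7]
print(recall_at(example_ranks, 10))
```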

The impact on overall OCE workload is significant. The following chart shows the number of expected commits an OCE investigates with and without Orca for the 48 issues in the study.

… using Orca causes a 6.5x reduction in median OCE workload and a 3x (67%) reduction in the OCE’s average workload… For the 4 bugs that were caught only because of the build provenance graph, the OCE had to investigate an average of 59.4 commits without Orca, and only 1.25 commits with it. This is a 47.5x improvement.
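The relationship between the ‘x’ factors and the percentages quoted here is simple arithmetic, shown below using only the figures from the quote above. The helper names are chosen for this example.

```python
def reduction_factor(before, after):
    """How many times smaller the workload is after the change."""
    return before / after


def percent_reduction(factor):
    """Convert an 'Nx' reduction into a percentage reduction."""
    return (1 - 1 / factor) * 100


# 59.4 commits without Orca versus 1.25 with it.
print(round(reduction_factor(59.4, 1.25), 1))
# A 3x reduction leaves one third of the work, i.e. about 67% less.
print(round(percent_reduction(3)))
```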

A closing thought:

Though we describe Orca in the context of a large service and post-deployment bugs, we believe the techniques we have used also apply generically to many Continuous Integration / Continuous Deployment pipelines. This is based on our experience with multiple services that Orca is operational on within our organization.


What is an SRE and how does it relate to DevOps?


Even though the site reliability engineer (SRE) role has become prevalent in recent years, many people—even in the software industry—don't know what it is or does. This article aims to clear that up by explaining what an SRE is, how it relates to DevOps, and how an SRE works when your entire engineering organization can fit in a coffee shop.


New Terraform Providers: F5 Networks, Nutanix, Tencent Cloud, and Helm


We are pleased to announce that F5 Networks, Nutanix, Tencent Cloud and Helm providers are now available for HashiCorp Terraform. This blog will detail the new providers and include links to additional resources.

Terraform Providers

F5 Networks

F5 Networks is an application delivery solutions company based in Seattle, WA. F5 focuses on helping firms securely deploy their applications in public or private cloud environments through hardware, software, or as-a-service solutions.

The F5 BIG-IP Terraform provider enables resources for BIG-IP Local Traffic Manager (LTM). BIG-IP LTM is a platform which assists in controlling network traffic for applications and provides monitoring services across infrastructure to ensure reliability and security. The BIG-IP Terraform provider enables operators to provision LTM resources as part of their normal Terraform workflow. For more information about the F5-BIGIP Terraform provider, please visit:


Nutanix

Nutanix is an enterprise cloud platform company which provides a single OS for organizations running public, private, or distributed cloud environments. They provide software-based solutions for assisting enterprises with the challenges of Hyperconverged Infrastructure (HCI).

The Nutanix Terraform provider enables operators to provision and manage resources and data sources on the Nutanix AHV virtualization solution. For more information about the Nutanix Terraform Provider, please visit:

Tencent Cloud

Tencent Cloud is a China-based public cloud service that offers a number of cloud computing capabilities for compute, storage, networking, and many more. Tencent offers services in 45 availability zones across 25 regions worldwide.

The Tencent Cloud Terraform provider enables operators to provision a number of resources, including data sources, container cluster services, and VPCs, as part of their infrastructure provisioning workflow. Authentication can be done through either static credentials in a Terraform configuration file or environment variables. For more information about the Tencent Cloud Terraform provider, please visit:


Helm

Helm is a package manager for Kubernetes; its packages are called charts. Recently, we announced an integration with Consul for using Helm charts to deploy and configure Consul on Kubernetes clusters, and now we are happy to announce the availability of a dedicated Helm provider for Terraform.

The Helm Terraform provider is used to deploy software packages onto Kubernetes clusters. In order to use this provider, you must be running a recent version of Kubernetes and have installed/configured a local copy of kubectl. For more information about the Helm Terraform provider, please visit:

For more information about HashiCorp Terraform please visit our product pages.


Roundup #22: OSS Library Guidance, .NET Core ThreadPool Starvation, VS2019 Roadmap, Know the Flow!


Here are the things that caught my eye this week.  I’d love to hear what you found most interesting this week.  Let me know in the comments or on Twitter.

Open-source library guidance

This guidance provides recommendations for developers to create high-quality .NET libraries. This documentation focuses on the what and the why when building a .NET library, not the how.

The issue of strong naming is addressed, and actual guidance is now in the docs.


Diagnosing .NET Core ThreadPool Starvation

This article is worth a read if you have:

  1. A service written in .NET (and in particular ASP.NET Core)
  2. Its ability to service incoming load has topped out
  3. But the machine’s CPUs are not fully utilized (often only using a small fraction of the total available CPU)



Visual Studio 2019 Roadmap

Notable improvements are:

  • A better performing and more reliable debugger, moving to an out-of-process 64-bit process.
  • Improved search accuracy for menus, commands, options, and installable components.
  • Visual Studio tooling for Windows Forms and WPF development on .NET Core 3.



Know the Flow! Events, Commands & Long-Running Services

Strategic design does not stop at defining boundaries around business capabilities – it should reach out for truly smart endpoints, emphasize autonomy and the need for more coarse-grained and asynchronous APIs. The long-running services behind such APIs feel responsible for their business and sort out most problems without leaking internal troubles and bothering their clients. While long-running services will leverage domain events for decoupling, they will often expose their core functions as commands – in order to minimise overall coupling! Extracting the customer-facing core processes of companies into dedicated, long-running services allows to keep sight of larger-scale flows – without violating bounded contexts or introducing god services. In this talk, Martin not only explores strategic design in the light of understanding the long-running nature of delivering many real-life business capabilities. He will also show the practical side of the equation: implementing long-running behaviour of services, requiring proper reactions on failures, timeouts and the compensating actions sagas are known for. A new generation of lightweight, embeddable and scalable saga managers and flow engines assist in that endeavour. Expect real-life experience and many examples!




The post Roundup #22: OSS Library Guidance, .NET Core ThreadPool Starvation, VS2019 Roadmap, Know the Flow! appeared first on CodeOpinion.


Never miss a beat—new integrations make it easy to insert content, trigger actions within Gmail


Raise your hand if you’ve ever had to attach a doc, insert an image or paste data from another app in an email? Now keep that hand up if that process took more time than you would have liked. You are not alone. Research shows that workers spend up to 8 hours per week searching for, or consolidating, information. Whether it’s digging through various file folders to find and attach a document, or hopping from your project management or CRM app to copy-paste links in email, all of that back-and-forth adds up.

This changes today with Compose Actions in Gmail Add-ons. Compose Actions make it easy for you to add attachments, reference records, or liven up your messages with content from your favorite third-party apps right as you draft your message in Gmail.

Stay in the flow

We first previewed Compose Actions at Google Cloud Next ’18 and are proud to make them generally available today to all Gmail users, with four fantastic integrations you can try right away (and more on the way).

  • Box: The Box Add-on for Gmail enables Box users to save valuable time by letting them quickly attach Box files to emails and save email attachments to Box, all within Gmail.
  • Dropbox: The Dropbox Add-on for Gmail lets users share Dropbox links as well as save files into their Dropbox account right from Gmail.
  • Atlassian: The Atlassian Cloud Add-on for Gmail brings helpful context from Jira and Bitbucket into your inbox. Users can easily add previews of Jira issues to their emails by browsing recent issues to quickly add them to any message.
  • Egnyte: The Egnyte Add-on for Gmail lets users save email attachments to Egnyte as well as link Egnyte files and folders all from within the Gmail compose window.

Convenient and secure

Compose Actions are a new feature of Gmail Add-ons, which means that the moment you authorize an add-on that offers them, they will work in Gmail across mobile and web. G Suite admins can also easily whitelist the add-ons they want to enable for their organization.

Try Compose Actions today

G Suite and Gmail users can check out the G Suite Marketplace to find and install add-ons, with more compose actions coming soon. Developers can also consult our documentation to build their own.
