
RED Method for Prometheus – 3 Key Metrics for Monitoring


Anita Buehrle is a developer advocate at Weaveworks.

On July 25th, Luke Marsden from Weaveworks and Bill Maxwell from Rancher Labs led a webinar on ‘A Practical Toolbox to Supercharge Your Kubernetes Cluster’. In the talk they described how you can use Rancher and Weave Cloud to set up, manage and monitor an app in Kubernetes.

In this blog, we’ll discuss how and why Weave developed the best-practice RED method for monitoring apps with Prometheus.

 

What is Prometheus Monitoring?

You may have heard a lot about Prometheus lately, especially when it comes to monitoring applications in Kubernetes. To provide a bit of background before we delve into the RED method: apps running in containers and orchestrated by Kubernetes are highly automated and dynamic, so when it comes to monitoring applications in these environments, traditional server-based monitoring tools designed for static services are not sufficient.

This is where Prometheus comes in.

Prometheus is an open source project that was originally developed by engineers at SoundCloud. It was built and designed specifically to monitor microservices that run in containers. Data is scraped from running services at time intervals and saved to a time-series database, where it can be queried via the PromQL language. Because the data is stored as a time series, you can explore those time intervals to diagnose problems when they occurred and also analyze long-term monitoring trends in your infrastructure: two awesomely powerful features of Prometheus.

At Weaveworks we built on the open source distribution of Prometheus and created a scalable, multi-tenant version that is part of our Software-as-a-Service called Weave Cloud.

After running this service for several months, and using Weave Cloud to monitor itself, we've learned a few things about monitoring cloud-native applications and devised a system we use to determine what to measure before instrumenting code.

 

What to Instrument?

One of the most important decisions to make when setting up Prometheus Monitoring is deciding on the type of metrics you need to collect about your app. The metrics you choose simplify troubleshooting when a problem occurs and also enable you to stay on top of the stability of your services and infrastructure. To help us think about what's important to instrument, we defined a system that we call the RED method.

The RED method builds on the principles outlined in the Four Golden Signals developed by Google Site Reliability Engineers (SREs), which focus on measuring the things end users care about when using your web services.

With the RED method, three key metrics are instrumented that monitor every microservice in your architecture:

  • (Request) Rate – the number of requests per second your services are serving.
  • (Request) Errors – the number of failed requests per second.
  • (Request) Duration – the amount of time each request takes, expressed as a time interval.

Rate, Errors and Duration attempt to cover the most obvious web service issues. These metrics also capture an error rate that is expressed as a proportion of request rate. With these three basic metrics, we believe the most common problems that can result in poor customer satisfaction are covered.
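
As an illustration of what that instrumentation can look like, here is a minimal sketch using the Python prometheus_client library (the metric and label names are our own, not something prescribed by the RED method): a single histogram labeled by method and status code yields all three metrics, since its count gives the request rate, the count filtered to error statuses gives the error rate, and its buckets give the duration distribution.

    import random
    import time

    from prometheus_client import Histogram, start_http_server

    # One histogram per service is enough to derive rate, errors and duration.
    REQUEST_DURATION = Histogram(
        "http_request_duration_seconds",
        "Time spent handling a request.",
        ["method", "status_code"],
    )

    def handle_request(method="GET"):
        start = time.time()
        time.sleep(random.uniform(0.01, 0.1))            # stand-in for real work
        status = "500" if random.random() < 0.01 else "200"
        REQUEST_DURATION.labels(method=method, status_code=status).observe(time.time() - start)

    if __name__ == "__main__":
        start_http_server(8000)   # Prometheus scrapes http://localhost:8000/metrics
        while True:
            handle_request()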

For even more detailed coverage, you may also include the Saturation metric. Saturation comes from another methodology, the USE (Utilization, Saturation and Errors) method, and refers to work that a resource cannot service immediately and therefore must queue for later processing.

 

USE vs. RED Methods

The USE method focuses more on monitoring performance and is meant to be used as a starting point in identifying the root cause of performance issues and other systemic bottlenecks.

Ideally, both the USE and the RED Methods can be used together when monitoring your applications.

 

Why you should measure the same metrics for every service

From a monitoring perspective, the benefit of treating each service the same is scalability in your operations team.

What does scalability of an operations team mean? We look at this from the point of view of how many services a given team can support. In an ideal world, the number of services the team can support would be independent of team size, and would instead depend on other factors, such as the response SLA you want and whether you need 24/7 coverage.

How do you decouple the number of services you can support from the team size? By making every service look, feel and taste the same. This reduces the amount of service-specific training the team needs, and also reduces the service-specific special cases on-call engineers need to memorize for those high-pressure incident-response scenarios, in other words the "cognitive load."

Capacity planning: Do it as a function of QPS and latency

Automating tasks and creating alerts

An advantage of the RED method is that it helps you think about how to display information in your dashboards. With just these three metrics, you can standardize the layout of your dashboards, making them simpler to read and simpler to alert on when there is a problem. For example, a possible layout is a different Weave Cloud notebook for each service, with PromQL queries for the request rate, errors and latency of that service.
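
For example, assuming a histogram metric named http_request_duration_seconds with a status_code label (hypothetical names, matching the sketch above) and the job label Prometheus attaches at scrape time, the three RED panels could be driven by PromQL along these lines:

    # Request rate: requests per second, per service (job)
    sum(rate(http_request_duration_seconds_count[1m])) by (job)

    # Errors: failed (5xx) requests per second, per service
    sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[1m])) by (job)

    # Duration: 99th-percentile request latency, per service
    histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le, job))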

It also goes without saying that if you treat all your services the same, repetitive tasks are more easily automated.

[Figure: RED method metrics monitored in Weave Cloud, with the corresponding PromQL query]

Limitations

It is fair to say this method only works for request-driven services; it breaks down for batch-oriented or streaming services, for instance. It is also not all-encompassing. There are times when you will need to monitor other things: the USE Method, for example, is great when applied to resources like host CPU and memory, or to caches.

 

What’s next?

These topics were covered in the latest Rancher and Weaveworks webinar, where we showed you how to monitor and troubleshoot a service deployed through Rancher using Weave Cloud. To get right to using Rancher and Weaveworks together, you can also view just the demo from the webinar online.

Watch a replay of the webinar with demos and real world examples or recreate it on your own with these how-to steps.

 

 



Announcing the preview of App Service domain

For a production web app, you probably want users to see a custom domain name. Today we are announcing the preview of App Service domain. App Service domain (preview) gives you a first-class experience in the Azure portal to create and manage domains that will be hosted on Azure DNS for your Azure services, such as Web Apps, Traffic Manager, Virtual Machines, and more.
 

Simplified domain management

App Service domains (preview) simplifies the life cycle of creating and managing a domain for Azure services by leveraging Azure DNS. Azure DNS then provides reliable, performant and secure hosting for your domains. App Service domains is currently limited to the following TLDs: com, net, co.uk, org, nl, in, biz, org.uk, and co.in. To get started with creating a domain, please see How to buy a domain for App Service.

Here are some benefits to using App Service domains:

  • Subdomain management and assignment

  • Auto-renew capabilities

  • Free cancellation within the first five days

  • Better security, performance, and reliability using Azure DNS

  • 'Privacy Protection' included for free, except for TLDs whose registry does not support privacy, such as .co.in and .co.uk.

Check out the following resources to manage your domain:

Submit your ideas/feedback in UserVoice. Please add [Domain] at the beginning of the title.

 

Model comparison and merging for Azure Analysis Services


Relational-database schema comparison and merging is a well-established market. Leading products include SSDT Schema Compare and Redgate SQL Compare, which is partially integrated into Visual Studio. These tools are used by organizations seeking to adopt a DevOps culture to automate build-and-deployment processes and increase the reliability and repeatability of mission-critical systems.

Comparison and merging of BI models also introduces opportunities to bridge the gap between self-service and IT-owned “corporate BI”. This helps organizations seeking to adopt a “bi-modal BI” strategy to mitigate the risk of competing IT-owned and business-owned models offering redundant solutions with conflicting definitions.

Such functionality is available for Analysis Services tabular models. Please see the Model Comparison and Merging for Analysis Services whitepaper for detailed usage scenarios, instructions and workflows.

This is made possible using PBIX import in the Azure Analysis Services web designer (see this post for more information) and BISM Normalizer, which we are pleased to announce now resides on the Analysis Services Git repo. BISM Normalizer is a popular open-source tool that works with Azure Analysis Services and SQL Server Analysis Services. All tabular model objects and compatibility levels, including the new 1400 compatibility level, are supported. As a Visual Studio extension, it is tightly integrated with source control systems, build and deployment processes, and model management workflows.

[Figure: Azure AS Schema Diff]

Thanks to Javier Guillen (Blue Granite), Chris Webb (Crossjoin Consulting), Marco Russo (SQLBI), Chris Woolderink (Tabular) and Bill Anton (Opifex Solutions) for their contributions to the whitepaper.


Velocity Widget available for Analytics Extensions

The Velocity Widget is now available for those who've installed the Analytics Extension. The Velocity Widget provides functionality not available in the Velocity Chart displayed on the Backlog view, such as:
  • Show velocity for any team, not just the current team
  • Show velocity for any backlog level or work item type, not just the Stories... Read More

Lessons Learned From Benchmarking Fast Machine Learning Algorithms


This post is authored by Miguel Fierro, Data Scientist, Mathew Salvaris, Data Scientist, Guolin Ke, Associate Researcher, and Tao Wu, Principal Data Science Manager, all at Microsoft.

Boosted decision trees are responsible for more than half of the winning solutions in machine learning challenges hosted at Kaggle, according to KDNuggets. In addition to superior performance, these algorithms have practical appeal as they require minimal tuning. In this post, we evaluate two popular tree boosting software packages: XGBoost and LightGBM, including their GPU implementations. Our results, based on tests on six datasets, are summarized as follows:

  1. XGBoost and LightGBM achieve similar accuracy metrics.
  2. LightGBM has lower training time than XGBoost and its histogram-based variant, XGBoost hist, for all test datasets, on both CPU and GPU implementations. The training time difference between the two libraries depends on the dataset, and can be as big as 25 times.
  3. XGBoost GPU implementation does not scale well to large datasets and ran out of memory in half of the tests.
  4. XGBoost hist may be significantly slower than the original XGBoost when feature dimensionality is high.

All our code is open-source and can be found in this repo. We will explain the algorithms behind these libraries and evaluate them across different datasets. Do you like your machine learning to be quick? Then keep reading.

The Basics of Boosted Decision Trees

Gradient boosting is a machine learning technique that produces a prediction model in the form of an ensemble of weak classifiers, optimizing a differentiable loss function. One of the most popular types of gradient boosting is boosted decision trees, which is internally made up of an ensemble of weak decision trees. There are two different strategies to compute the trees: level-wise and leaf-wise. The level-wise strategy grows the tree level by level. In this strategy, each node splits the data, prioritizing the nodes closer to the tree root. The leaf-wise strategy grows the tree by splitting the data at the nodes with the highest loss change. Level-wise growth is usually better for smaller datasets, where leaf-wise growth tends to overfit. Leaf-wise growth tends to excel on larger datasets, where it is considerably faster than level-wise growth.


Level-wise growth strategy vs. leaf-wise growth strategy. The level-wise strategy adds complexity by extending the depth of the tree level by level. By contrast, the leaf-wise strategy generates branches by optimizing the loss.
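
To make the distinction concrete, here is a minimal sketch (ours, not from the post) of how the two growth strategies surface as hyperparameters in the scikit-learn wrappers of both libraries: LightGBM grows leaf-wise and is capped by num_leaves, while XGBoost's histogram tree method can be switched from its default depthwise (level-wise) growth to leaf-wise growth with grow_policy='lossguide'.

    import lightgbm as lgb
    import xgboost as xgb
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

    # LightGBM: leaf-wise growth; num_leaves bounds model complexity.
    lgb_clf = lgb.LGBMClassifier(num_leaves=31, n_estimators=100)

    # XGBoost: histogram method with leaf-wise ("lossguide") growth instead of
    # the default level-wise ("depthwise") growth.
    xgb_clf = xgb.XGBClassifier(tree_method="hist", grow_policy="lossguide",
                                max_leaves=31, n_estimators=100)

    lgb_clf.fit(X, y)
    xgb_clf.fit(X, y)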

A key challenge in training boosted decision trees is the computational cost of finding the best split for each leaf. Conventional techniques find the exact split for each leaf, and require scanning through all the data in each iteration. A different approach approximates the split by building histograms of the features. That way, the algorithm doesn’t need to evaluate every single value of the features to compute the split, but only the bins of the histogram, which are bounded. This approach turns out to be much more efficient for large datasets, without adversely affecting accuracy.
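
As a toy illustration of why binning helps (our own sketch, not the libraries' actual code), the number of candidate split thresholds drops from the number of distinct feature values to the number of bin boundaries:

    import numpy as np

    feature = np.random.rand(1_000_000)          # one feature column

    # Exact split finding scans every distinct value of the feature.
    exact_candidates = np.unique(feature)

    # Histogram-based split finding only scans the bin boundaries (e.g. 255 bins).
    _, bin_edges = np.histogram(feature, bins=255)
    hist_candidates = bin_edges[1:-1]

    print(len(exact_candidates), "exact candidates vs", len(hist_candidates), "binned candidates")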

XGBoost started in 2014, and it has become popular due to its use in many winning Kaggle competition entries. Originally XGBoost was based on a level-wise growth algorithm, but recently has added an option for leaf-wise growth that implements split approximation using histograms. We refer to this version as XGBoost hist. LightGBM is a more recent arrival, started in March 2016 and open-sourced in August 2016. It is based on a leaf-wise algorithm and histogram approximation, and has attracted a lot of attention due to its speed (Disclaimer: Guolin Ke, a co-author of this blog post, is a key contributor to LightGBM). Apart from multithreaded CPU implementations, GPU acceleration is now available on both XGBoost and LightGBM too.
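
Enabling the GPU implementations is largely a matter of configuration in both libraries. A minimal sketch, with the parameter names as we know them (LightGBM additionally has to be compiled with GPU support):

    import lightgbm as lgb
    import xgboost as xgb

    # XGBoost: GPU-accelerated histogram algorithm.
    xgb_gpu = xgb.XGBClassifier(tree_method="gpu_hist", n_estimators=100)

    # LightGBM: select the GPU device (requires a GPU-enabled build).
    lgb_gpu = lgb.LGBMClassifier(device="gpu", n_estimators=100)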

Evaluating XGBoost and LightGBM

We performed machine learning experiments across six different datasets. These experiments are described in the Python notebooks in our GitHub repo, where we also show the specific compiled versions of XGBoost and LightGBM that we used and provide the steps to install them and set up the experiments. We tried classification and regression problems with both CPU and GPU. All experiments were run on an Azure NV24 VM with 24 cores, 224 GB of memory and NVIDIA M60 GPUs. The operating system was Ubuntu 16.04. In all experiments, we found XGBoost and LightGBM had similar accuracy metrics (F1-scores are shown here), so we focused on training times in this blog post. The table below shows training times and the training time ratios between the two libraries in both CPU and GPU implementations.
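
The timing methodology boils down to fitting each library on the same split and measuring wall-clock time. A simplified sketch of that loop (not the notebooks' exact code, and using a synthetic dataset in place of the six real ones):

    import time

    import lightgbm as lgb
    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    models = {
        "xgb":      xgb.XGBClassifier(n_estimators=100),
        "xgb_hist": xgb.XGBClassifier(n_estimators=100, tree_method="hist"),
        "lgbm":     lgb.LGBMClassifier(n_estimators=100),
    }

    for name, model in models.items():
        start = time.time()
        model.fit(X_train, y_train)
        elapsed = time.time() - start
        score = f1_score(y_test, model.predict(X_test))
        print(f"{name}: {elapsed:.1f}s, F1={score:.3f}")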


Benchmark of XGBoost vs LightGBM training times and training time ratios between the libraries in CPU and GPU. In the situations where XGBoost GPU training time does not appear (-), we got an out of memory error. In the Airline dataset with XGBoost in CPU (-*), we stopped the computation after 5 hours. The best training time for each dataset is in boldface text.

Learnings from the Benchmark

As it is usually said, no benchmark is true but some of them are useful. From our experiments, we found the leaf-wise implementation faster than the level-wise one in general. However, the CPU results for BCI and Planet Kaggle datasets, as well as the GPU result for BCI, show that XGBoost hist takes considerably longer than standard XGBoost. This is due to the large size of the datasets, as well as the large number of features, which causes considerable memory overhead for XGBoost hist.

We also found that XGBoost GPU implementation did not scale well, as it gave out of memory errors for 3 larger datasets. In addition, we had to terminate XGBoost training on the Airline dataset after 5 hours.

Finally, between LightGBM and XGBoost, we found that LightGBM is faster in all tests where XGBoost and XGBoost hist finished, with the biggest differences being 25 times versus XGBoost and 15 times versus XGBoost hist.

We wanted to investigate the effect of different data sizes and numbers of rounds on the performance of CPU vs GPU. In the next table, we show the results of using subsamples of the Airline dataset. When comparing the CPU and GPU training times, we see that the GPU version of LightGBM outperforms the CPU one when the dataset is large and the number of rounds is high. As expected, with small datasets, the additional IO overhead of copying the data between RAM and GPU memory overshadows the speed benefits of running the computation on GPU. Here, we did not observe any performance gains in using XGBoost hist on GPU. As a side note, the standard implementation of XGBoost (exact split instead of histogram-based) does not benefit from the GPU either, as compared to multi-core CPU, per this recent paper.
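
The data-size sweep follows the same pattern. A hedged sketch (synthetic data standing in for the Airline subsamples, and device="gpu" assuming a GPU-enabled LightGBM build):

    import time

    import lightgbm as lgb
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1_000_000, n_features=20, random_state=0)

    for n in (10_000, 100_000, 1_000_000):
        for device in ("cpu", "gpu"):
            model = lgb.LGBMClassifier(n_estimators=500, device=device)
            start = time.time()
            model.fit(X[:n], y[:n])
            print(f"n={n}, device={device}: {time.time() - start:.1f}s")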


Benchmark of XGBoost, XGBoost hist and LightGBM training time and AUC for different data sizes and rounds. As before, XGBoost on GPU for 100 million rows is not shown due to an out-of-memory error (-). For XGBoost with 100 million rows and 500 rounds, we stopped the computation after 5 hours (-*). The best training time and the highest AUC for each sample size are in boldface text.

Overall, we find that LightGBM is faster than XGBoost, in both CPU and GPU implementations. Furthermore, if XGBoost is used, we would recommend keeping a close eye on feature dimensionality and memory consumption. The significant speed advantage of LightGBM translates into the ability to do more iterations and/or quicker hyperparameter search, which can be very useful if you have a limited time budget for optimizing your model or want to experiment with different feature engineering ideas.

Happy coding!

Miguel, Mathew, Guolin & Tao.

Acknowledgment: We would like to thank Steve Deng, David Smith and Huseyin Yildiz from Microsoft for their assistance and feedback on this post.


Cloud Shell’s code editor goes GA



Last October we added an experimental web-based code editor to Cloud Shell that makes it easier to edit source code within the browser. Today we're happy to announce that this feature is in beta, and we've made additional improvements that will make writing code in Cloud Shell even more delightful.

The editor now lives side-by-side with the Cloud Shell window, so you don’t have to switch between tabs when going from editing to building and testing. You can launch the editor by clicking on the icon in Cloud Shell, or going directly to this URL.
Whether you're working on an existing software project, learning or exploring a new API or open-source library, Cloud Shell makes it extremely easy to start writing code, all from within your browser.

The editor is based on the open-source Eclipse Orion project, which comes with several convenience features. Here are just a few:
  • Key bindings to navigate, format or edit code. To see the available key bindings go to Options > Keyboard Shortcuts or type Alt+Shift+? (or Ctrl+Shift+? on Mac OS X).
  • Syntax highlighting for JavaScript, HTML, CSS and Java; basic content assist for JavaScript, CSS and HTML files. Type Ctrl+Space to open content assist at the current cursor position in the editor.

  • Find and replace

  • Font and UI customization

The Cloud Shell code editor in action


In the previous blog post, we showed you how to deploy and debug an App Engine application using Cloud Shell. Here's a quick tutorial on how to test and modify a NodeJS app written with the Express.js framework.

  1. Open Cloud Shell by clicking on the Shell icon on the top right section of the toolbar.
  2. Get sample code. Clone the repository that contains the Google Cloud Platform (GCP) Node samples by typing the following commands in the prompt, and navigate to the directory for the Hello World code:

     git clone https://github.com/GoogleCloudPlatform/nodejs-docs-samples
     cd nodejs-docs-samples/appengine/hello-world

  3. Install dependencies and start the app:

     npm install
     npm start

  4. Preview the app. Click on the web preview icon on the top right of the screen, and click to open port 8080.
  5. Modify the app to show the current time.
    • Open the code editor from the Cloud Shell toolbar.
    • In the file tree to the left of the editor, navigate to the directory ~/nodejs-docs-samples/appengine/hello-world and click on app.js.
    • Starting at line 23, replace the contents of the app.get function with the snippet below (changes are indicated in bold). As you start to type date.toTimeString(), you'll see the autocomplete functionality suggest all the functions available under the Date object.

      app.get('/', (req, res) => {
        var date = new Date();
        var time = date.toTimeString();
        res.status(200).send('Hello, world! It is now ' + time).end();
      });

    • On the bottom shell panel, type Ctrl+C to stop the previously running app, then restart it:

      npm start

    • Refresh the tab showing the "Hello World" message to see the new output.
Congratulations! You've just successfully created a new NodeJS application, all without ever leaving your browser. If you'd like to learn more about this example, including how to deploy the app to run in the App Engine flexible environment, click here. To learn more about Cloud Shell, click here.