
SQLAlchemy 1.2.7 Released


SQLAlchemy release 1.2.7 is now available.

Release 1.2.7 includes some dialect-specific fixes as well as a small number of SQL and ORM related fixes.

Changelog for 1.2.7 is at Changelog.

SQLAlchemy 1.2.7 is available on the Download Page.


API Management Sign-in Tenant


Azure API Management supports multiple identity providers for the Developer Portal. One of these is Azure Active Directory. A common complaint, however, was that when enabling AAD authentication on the developer portal, the sign-in experience would use the default look-and-feel of AAD rather than your organization’s customized sign-in pages.

The reason for this is that unlike many other products and services, API Management always works as a multi-tenant application allowing users from multiple AAD tenants (the ones you configure). Because of this it always uses the common AAD sign-in URL https://login.microsoftonline.com/common rather than the tenant-specific sign-in URL https://login.microsoftonline.com/{tenant_name_or_id}. You can read more about the common endpoint in the AAD documentation.

This changed a few weeks ago on the API Management side. Looking at the release notes, we find this little note:

When configuring Azure AD as an identity provider, it is now possible to designate one of the allowed tenants as a sign-in tenant. All developer portal users will be redirected to that tenant when logging in (instead of the “common” tenant)

The new configuration property on the AAD identity provider is called signinTenant, and can be configured in the Azure Portal experience when adding (or editing) an AAD identity provider:

AAD identity provider configuration

Note: When configuring a new provider, you need to both add the tenant ID to the allowed tenant list and provide it in the sign-in tenant field; sign-in will not work with only one of the two set.
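If you configure API Management through ARM or the REST API instead of the portal, the same two settings live on the AAD identity provider resource. Here is a rough sketch using the Azure CLI's generic az rest command; the resource path follows the standard identityProviders API, but the api-version, tenant value, and app credentials are placeholders you should check against the API Management REST reference:

az rest --method put \
  --url "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ApiManagement/service/<apim-name>/identityProviders/aad?api-version=2019-12-01" \
  --body '{
    "properties": {
      "clientId": "<aad-app-client-id>",
      "clientSecret": "<aad-app-client-secret>",
      "allowedTenants": [ "contoso.onmicrosoft.com" ],
      "signinTenant": "contoso.onmicrosoft.com"
    }
  }'

Note that the tenant appears both in allowedTenants and as the signinTenant, matching the portal requirement above.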

There is also a nice side-effect of configuring the signinTenant property: before this, using guest accounts in AAD directories, such as Microsoft Accounts (MSA) or B2B guest accounts, to sign in to the API Management Developer Portal was not supported. The reason is that when using the common endpoint, AAD has no way to know which target directory to sign the user into.

If you set the signinTenant property, however, this now works for both MSA and B2B guest accounts on that specific tenant. This enables a lot of useful scenarios for API Management users.


Making an AWS static website secure


So there I was, patting myself on the back for making an Azure static website secure (with all the right headers, natch), when I gave myself a quick nod: yep, let’s do the same for this other static website, one that’s hosted on Amazon S3. Morceau de gâteau!

Please, please, please, can I go back in time to stop myself? What a lengthy ordeal, a flippin’ slog. Sisyphus had it easy.

Let’s enumerate what you should do, in the right order (rather than what I did, which was all messed up).

Get your static website up and running

This is the easy part. In fact, I’d done it ages ago for WhoIsThisJulian.com. As it happens, the complete steps to do this are pretty easy to follow even in the official documentation. You’ve bought your domain. You then set up two buckets in Amazon S3 (one for the unadorned domain name, such as example.com, and one for the www-adorned version, that is www.example.com). You make the first one public, and then you set things up in Route 53 to properly point to these buckets.
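For the record, all of that can be scripted too. Here's a rough AWS CLI sketch (bucket names and file names are placeholders, and I did everything through the console, so treat this as an approximation rather than exactly what I ran):

# Create the two buckets; the names must match the domain names exactly
aws s3api create-bucket --bucket example.com --region us-east-1
aws s3api create-bucket --bucket www.example.com --region us-east-1

# Turn the main bucket into a static website
aws s3 website s3://example.com --index-document index.html --error-document error.html

# Upload the site content
aws s3 sync ./site s3://example.com

(Making the first bucket public also needs a bucket policy, which the official walkthrough spells out.)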

Already you’re starting to sink into the Amazon way of referring to these things. S3 stands for Simple Storage Service and is the way you, er, store things in the cloud with Amazon. Route 53 is a web service that, well, routes requests to web apps and sites (among other things). It’s a DNS service in essence. And the “53”? That’s the port DNS runs on (though it does look a bit like S3, too).

Fix your HTML to help make the site secure

The reason? I found doing this to be a right royal pain in the neck after I’d set up the basic HTTPS support. (Hint: it involves caching from somewhere I wasn’t expecting.) So, in no particular order:

  • Remove inline scripts, put them in external JS files and add <script> tags to load them. [Mine were Google Analytics bits of inline script. In fact, having gone through this twice now, DO NOT USE INLINE SCRIPTS. Ever. OK?]
  • Make a note of every external-to-your-domain file you load as part of your HTML. Things like externally hosted script & CSS files, images, and so on. You are going to have to eventually whitelist these other domains in your response headers (the Content Security Policy). If you can, host them yourself in your S3 bucket and update the references in the HTML. [For WhoIsThisJulian.com, I was using FontAwesome, so I had to note the URL for the external CSS.]
  • Change any external HTTP references to HTTPS. [Just being thorough here, to improve the eventual rendering speed.]

Test your changes to make sure everything still works the way you want, because now we hit the big time.

Set up a new CloudFront distribution

This is where it gets hairy. CloudFront, to quote Amazon, “is a global content delivery network (CDN) service that securely delivers data, videos, applications, and APIs to your viewers with low latency and high transfer speeds.” The important thing to note here is that we shall be changing how people get to your site: they will instead go through CloudFront to get to it. If you like, it’s inserting itself between your viewers out there and your site hosted on Amazon S3.

For this part I made use of a walkthrough on Medium, recommended by an old mate, Bryan Slatner. The big change for me from what it recommended is that I’d bought my domain from GoDaddy, with whom I have an email plan. I already had my MX records pointing to their servers, and didn’t have to do any of that Amazon SES stuff from the walkthrough. (That would be Amazon’s Simple Email Service.)

So, step 1 then is to create a new web distribution on CloudFront. The Origin Domain Name is simple: it drops down a list of your S3 buckets. Choose the non-www one for your domain. I ignored the rest of the Origin Settings section.

Onto the Default Cache Behavior Settings section. Here I chose Redirect HTTP to HTTPS, and not HTTPS Only, as suggested in the walkthrough. The Object Caching option got me worked up (I tried several different settings), but to be honest you can leave it alone. Leave all the other options as default.

For the Distribution Settings section, enter your domain name (for example, example.com) as an Alternate Domain Name (CNAMEs).

Now for the fun part. The next question is about the SSL Certificate. Choose the Custom SSL Certificate option and then click on the Request or Import a Certificate with ACM button. (ACM? AWS Certificate Manager.) Why? Because you are going to…

Create an SSL Certificate from Amazon

Clicking on that button opens up a new tab/window and takes you to the Request a certificate page. You have to fill in the domain names you want this new certificate to apply to. THIS IS WHERE I WENT WRONG TO BEGIN WITH, so pay attention. Enter your domain name twice (as it were): first as example.com, then add *.example.com as the second ‘name’ by clicking on the Add another name to this certificate button. In essence, you will have two edit fields filled in before you press the Next button.

The next screen asks you whether to validate through DNS or email. I chose email. As it turns out, this was where that Medium walkthrough led me off the beaten path. ACM will email a whole bunch of addresses that are @example.com (postmaster@example.com and so on), but it will also email the owner of the domain according to WHOIS. Fine by me: I use my Gmail address for that purpose, so I would just get the email in my usual inbox. (That means you don’t have to set up those email addresses and MX routes if you don’t want to.)

After a very short time, you’ll get the validation email. This email has a link to click to approve the certificate, so click on it. You get sent to a page with a big button saying I Approve. So click on that. Boom, done: you have a free SSL certificate from Amazon. Select and copy that long ARN value. (ARN: Amazon Resource Name.)
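If you prefer the command line to the console button, the equivalent request looks roughly like this (a sketch rather than what I actually did; note that a certificate used by CloudFront has to live in the us-east-1 region):

# Request a certificate covering both the bare domain and the wildcard, validated by email
aws acm request-certificate \
    --region us-east-1 \
    --domain-name example.com \
    --subject-alternative-names "*.example.com" \
    --validation-method EMAIL

# After approving the email, list certificates to grab the ARN
aws acm list-certificates --region us-east-1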

So, now, go back to your open page where you were creating a CloudFront distribution.

Finish your CloudFront distribution

…and paste that ARN into the field under the Custom SSL Certificate option.

Next (and this is the one I managed to forget, causing lots of Access Denied errors as I tried to work out what wasn’t working): set the Default Root Object for the site. For WhoIsThisJulian.com it’s default.html, but on another site I’ve since secured, it was index.html. Yes, I know that a viewer never has to type those default page names into the address bar, but CloudFront needs to know.

Leave the rest as is, and click on the Create Distribution button right at the bottom of the page. You get redirected to your CloudFront Distributions page with your new distribution showing a Status of a rotating pair of arrows and In Progress.

Sit back, go make a coffee, read the news, do the Sudoku puzzle, take the dog out for a walk, play some tennis, just occupy yourself for some looooong period of time. (I’ve read that it takes 8-10 minutes, but that must have been in the days when no one used CloudFront. For me, it’s easily 20-25 minutes, or more. That’s why I got so wackily frustrated with this process: if something didn’t work, I’d change something in the distribution, and bam another half hour of my life was gone.)

Eventually, that rotating arrows indicator will disappear and be replaced with Deployed. Yay! Except…

Check your Route 53 settings

On the line for your newly created distribution, there will be a special CloudFront Domain Name. It’ll be a random looking name of the form d012345abcdef.cloudfront.net. Select and copy it. (You can, if you want to, open up a new tab in your browser and navigate to that URL to see your site in all its secure glory.)

Switch over to the Route 53 dashboard and open up the Hosted Zone for your site. You’ll have several different DNS record sets visible, but there are two important ones that’ll need changing: one for your plain domain name (e.g., example.com) and one for the www version. You’ll see that the ALIASes point to some URL on amazonaws.com. We now have to change them to point to that special unique CloudFront URL. The unadorned domain name (example.com) needs an A record set (alias) pointing to that new URL, and the www version needs a CNAME record set pointing to the same place. You may find that this has already been done by setting up the CloudFront distribution. (I’m unsure on this point: the first time I did this, it had been done already; the second time, I had to do it myself. Your mileage may vary.)
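If you do have to make the change yourself and fancy scripting it, it’s the usual Route 53 change-resource-record-sets call. A rough sketch for the bare domain (your hosted zone ID and CloudFront domain name are placeholders; Z2FDTNDATAQYW2 is, as far as I can tell, the fixed hosted zone ID that CloudFront aliases use, but verify that against the AWS docs):

aws route53 change-resource-record-sets --hosted-zone-id <your-hosted-zone-id> --change-batch '{
  "Changes": [ {
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "example.com.",
      "Type": "A",
      "AliasTarget": {
        "HostedZoneId": "Z2FDTNDATAQYW2",
        "DNSName": "d012345abcdef.cloudfront.net.",
        "EvaluateTargetHealth": false
      }
    }
  } ]
}'

The www version gets a similar change, just as a CNAME record set pointing at the same CloudFront domain name.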

Once you’ve saved those changes, your site should be reachable and, more importantly, secure.

So we’re done?

No. Remember last time we had to change the security headers returned from your site? Well, we have to do something similar here on Amazon. But that will have to wait until Part 2.



Kubernetes best practices: How and why to build small container images



Editor’s note: Today marks the first installment in a seven-part video and blog series from Google Developer Advocate Sandeep Dinesh on how to get the most out of your Kubernetes environment. Today he tackles the theory and practicalities of keeping your container images as small as possible.

Docker makes building containers a breeze. Just put a standard Dockerfile into your folder, run the docker ‘build’ command, and shazam! Your container image is built!

The downside of this simplicity is that it’s easy to build huge containers full of things you don’t need—including potential security holes.

In this episode of “Kubernetes Best Practices,” let’s explore how to create production-ready container images using Alpine Linux and the Docker builder pattern, and then run some benchmarks that can determine how these containers perform inside your Kubernetes cluster.

The process for creating container images is different depending on whether you are using an interpreted language or a compiled language. Let’s dive in!

Containerizing interpreted languages


Interpreted languages such as Ruby, Python, Node.js and PHP send source code through an interpreter that runs the code. This gives you the benefit of skipping the compilation step, but has the downside of requiring you to ship the interpreter along with the code.

Luckily, most of these languages offer pre-built Docker containers that include a lightweight environment that allows you to run much smaller containers.

Let’s take a Node.js application and containerize it. First, let’s use the “node:onbuild” Docker image as the base. The “onbuild” version of a Docker container pre-packages everything you need to run so you don’t need to perform a lot of configuration to get things working. This means the Dockerfile is very simple (only two lines!). But you pay the price in terms of disk size— almost 700MB!

FROM node:onbuild
EXPOSE 8080

By using a smaller base image such as Alpine, you can significantly cut down on the size of your container. Alpine Linux is a small and lightweight Linux distribution that is very popular with Docker users because it’s compatible with a lot of apps, while still keeping containers small.

Luckily, there is an official Alpine image for Node.js (as well as other popular languages) that has everything you need. Unlike the default “node” Docker image, “node:alpine” removes many files and programs, leaving only enough to run your app.

The Alpine Linux-based Dockerfile is a bit more complicated to create as you have to run a few commands that the onbuild image otherwise does for you.

FROM node:alpine
WORKDIR /app
COPY package.json /app/package.json
RUN npm install --production
COPY server.js /app/server.js
EXPOSE 8080
CMD npm start

But it’s worth it, because the resulting image is much smaller at only 65MB!

Containerizing compiled languages


Compiled languages such as Go, C, C++, Rust, Haskell and others create binaries that can run without many external dependencies. This means you can build the binary ahead of time and ship it into production without having to ship the tools to create the binary such as the compiler.

With Docker’s support for multi-step builds, you can easily ship just the binary and a minimal amount of scaffolding. Let’s learn how.

Let’s take a Go application and containerize it using this pattern. First, let’s use the “golang:onbuild” Docker image as the base. As before, the Dockerfile is only two lines, but again you pay the price in terms of disk size—over 700MB!

FROM golang:onbuild
EXPOSE 8080

The next step is to use a slimmer base image, in this case the “golang:alpine” image. So far, this is the same process we followed for an interpreted language.

Again, creating the Dockerfile with an Alpine base image is a bit more complicated as you have to run a few commands that the onbuild image did for you.

FROM golang:alpine
WORKDIR /app
ADD . /app
RUN cd /app && go build -o goapp
EXPOSE 8080
ENTRYPOINT ./goapp

But again, the resulting image is much smaller, weighing in at only 256MB!
However, we can make the image even smaller: You don’t need any of the compilers or other build and debug tools that Go comes with, so you can remove them from the final container.

Let’s use a multi-step build to take the binary created by the golang:alpine container and package it by itself.

# Build stage: compile the Go binary using the full golang:alpine image
FROM golang:alpine AS build-env
WORKDIR /app
ADD . /app
RUN cd /app && go build -o goapp

# Final stage: start from plain Alpine and copy in only the compiled binary
FROM alpine
# Install CA certificates so the app can make HTTPS calls (see the note below)
RUN apk update && \
   apk add ca-certificates && \
   update-ca-certificates && \
   rm -rf /var/cache/apk/*
WORKDIR /app
COPY --from=build-env /app/goapp /app
EXPOSE 8080
ENTRYPOINT ./goapp

Would you look at that! This container is only 12MB in size!
While building this container, you may notice that the Dockerfile does strange things such as manually installing HTTPS certificates into the container. This is because the base Alpine Linux ships with almost nothing pre-installed. So even though you need to manually install any and all dependencies, the end result is super small containers!

Note: If you want to save even more space, you could statically compile your app and use the “scratch” container. Using “scratch” as a base container means you are literally starting from scratch with no base layer at all. However, I recommend using Alpine as your base image rather than “scratch” because the few extra MBs in the Alpine image make it much easier to use standard tools and install dependencies.
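For reference, here is a minimal sketch of that “scratch” variant, assuming the app can be built without cgo so the binary is fully static:

# Build stage: produce a statically linked binary (no cgo, so no libc dependency)
FROM golang:alpine AS build-env
WORKDIR /app
ADD . /app
RUN cd /app && CGO_ENABLED=0 go build -o goapp

# Final stage: no base layer at all
FROM scratch
COPY --from=build-env /app/goapp /goapp
EXPOSE 8080
ENTRYPOINT ["/goapp"]

Note the exec-form ENTRYPOINT: there is no shell in a scratch image, so the shell form used in the earlier Dockerfiles would not work here.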

Where to build and store your containers


In order to build and store the images, I highly recommend the combination of Google Container Builder and Google Container Registry. Container Builder is very fast and automatically pushes images to Container Registry. Most developers should easily get everything done in the free tier, and Container Registry is the same price as raw Google Cloud Storage (cheap!).
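As a concrete example (the project ID and image name are placeholders), a single command builds the image with Container Builder and pushes the result to Container Registry:

gcloud container builds submit --tag gcr.io/my-project/my-app .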

Platforms like Google Kubernetes Engine can securely pull images from Google Container Registry without any additional configuration, making things easy for you!

In addition, Container Registry gives you vulnerability scanning tools and IAM support out of the box. These tools can make it easier for you to secure and lock down your containers.

Evaluating performance of smaller containers


People claim that small containers’ big advantage is reduced time—both time-to-build and time-to-pull. Let’s test this, using containers created with onbuild, and ones created with Alpine in a multistage process!

TL;DR: No significant difference for powerful computers or Container Builder, but significant difference for smaller computers and shared systems (like many CI/CD systems). Small Images are always better in terms of absolute performance.

Building images on a large machine


For the first test, I am going to build using a pretty beefy laptop. I’m using our office WiFi, so the download speeds are pretty fast!


For each build, I remove all Docker images in my cache.

Build:
Go Onbuild: 35 seconds
Go Multistage: 23 seconds

The build takes about 12 seconds longer for the larger container. While this penalty is only paid on the initial build, your Continuous Integration system could pay this price with every build.

The next test is to push the containers to a remote registry. For this test, I used Container Registry to store the images.

Push:
Go Onbuild: 15 seconds
Go Multistage: 14 seconds

Well, this was interesting! Why does it take about the same amount of time to push a 12MB object and a 700MB object? Turns out that Container Registry uses a lot of tricks under the covers, including a global cache for many popular base images.

Finally, I want to test how long it takes to pull the image from the registry to my local machine.

Pull:
Go Onbuild: 26 seconds
Go Multistage: 6 seconds

At 20 seconds, this is the biggest difference between the two container images. You can start to see the advantage of using a smaller image, especially if you pull images often.

You can also build the containers in the cloud using Container Builder, which has the added benefit of automatically storing them in Container Registry.

Build + Push:
Go Onbuild: 25 seconds
Go Multistage: 20 seconds

So again, there is a small advantage to using the smaller image, but not as dramatic as I would have expected.

Building images on small machines


So is there an advantage for using smaller containers? If you have a powerful laptop with a fast internet connection and/or Container Builder, not really. However, the story changes if you’re using less powerful machines. To simulate this, I used a modest Google Compute Engine f1-micro VM to build, push and pull these images, and the results are staggering!

Pull:
Go Onbuild: 52 seconds
Go Multistage: 6 seconds

Build:
Go Onbuild: 54 seconds
Go Multistage: 28 seconds

Push:
Go Onbuild: 48 seconds
Go Multistage: 16 seconds

In this case, using smaller containers really helps!

Pulling on Kubernetes


While you might not care about the time it takes to build and push the container, you should really care about the time it takes to pull the container. When it comes to Kubernetes, this is probably the most important metric for your production cluster.

For example, let’s say you have a three-node cluster and one of the nodes crashes. If you are using a managed system like Kubernetes Engine, the system automatically spins up a new node to take its place.

However, this new node will be completely fresh, and will have to pull all your containers before it can start working. The longer it takes to pull the containers, the longer your cluster isn’t performing as well as it should!

This can occur when you increase your cluster size (for example, using Kubernetes Engine Autoscaling), or upgrade your nodes to a new version of Kubernetes (stay tuned for a future episode on this).

We can see that the pull performance of multiple containers from multiple deployments can really add up here, and using small containers can potentially shave minutes from your deployment times!

Security and vulnerabilities


Aside from performance, there are significant security benefits from using smaller containers. Small containers usually have a smaller attack surface as compared to containers that use large base images.

I built the Go “onbuild” and “multistage” containers a few months ago, so they probably contain some vulnerabilities that have since been discovered. Using Container Registry’s built-in Vulnerability Scanning, it’s easy to scan your containers for known vulnerabilities. Let’s see what we find.

Wow, that’s a big difference between the two! Only three “medium” vulnerabilities in the smaller container, compared with 16 critical and over 300 other vulnerabilities in the larger container.

Let’s drill down and see which issues the larger container has.

You can see that most of the issues have nothing to do with our app, but rather programs that we are not even using! Because the multistage image is using a much smaller base image, there are just fewer things that can be compromised.

Conclusion

The performance and security advantages of using small containers speak for themselves. Using a small base image and the “builder pattern” can make it easier to build small images, and there are many other techniques for individual stacks and programming languages to minimize container size as well. Whatever you do, you can be sure that your efforts to keep your containers small are well worth it!

Check in next week when we’ll talk about using Kubernetes namespaces to isolate clusters from one another. And don’t forget to subscribe to our YouTube channel and Twitter for the latest updates.

If you haven’t tried GCP and our various container services before, you can quickly get started with our $300 free credits.


Top stories from the VSTS community – 2018.04.20

Here are top stories we found in our streams this week related to DevOps, VSTS, TFS and other interesting topics, listed in no specific order: TOP STORIES: VSTS Gems – Marketplace, the one stop location for added functionality. Rui Melo highlights the place to find widgets and extensions to enhance your VSTS experience. Opps, I made...

Titus, the Netflix container management platform, is now open source


by Amit Joshi, Andrew Leung, Corin Dwyer, Fabio Kung, Sargun Dhillon, Tomasz Bak, Andrew Spyker, Tim Bozarth

Today, we are open-sourcing Titus, our container management platform.

Titus powers critical aspects of the Netflix business: video streaming, recommendations and machine learning, big data, content encoding, studio technology, internal engineering tools, and other Netflix workloads. Titus offers a convenient model for managing compute resources, allows developers to maintain just their application artifacts, and provides a consistent developer experience from a developer’s laptop to production by leveraging Netflix container-focused engineering tools.

Over the last three years, Titus evolved from initially supporting batch use cases to running service applications (both internal and, ultimately, critical customer-facing ones). Through that evolution, container use at Netflix has grown from thousands of containers launched per week to as many as three million containers launched per week in April 2018. Titus hosts thousands of applications globally over seven regionally isolated stacks across tens of thousands of EC2 virtual machines. Open-sourcing Titus shares the technology assembled through three years of production learnings in container management and execution.

Why are we open sourcing?

Over the past few years of talking about Titus, we’ve been asked over and over again, “When will you open source Titus?” It was clear that we were discussing ideas, problems, and solutions that resonated with those at a variety of companies, both large and small. We hope that by sharing Titus we are able to help accelerate like-minded teams, and to bring the lessons we’ve learned forward in the container management community.

Multiple container management platforms (Kubernetes, Mesosphere DC/OS, and Amazon ECS) have been adopted across the industry during the last two years, driving different benefits to a wide class of use cases. Additionally, a handful of web-scale companies have developed solutions on top of Apache Mesos to meet the unique needs of their organizations. Titus shares a foundation of Apache Mesos and was optimized to solve for Netflix’s production needs.

Our experience talking with peers across the industry indicates that other organizations are also looking for some of the same technologies in a container management platform. By sharing the code as open source, we hope others can help the overall container community absorb those technologies. We would also be happy for the concepts and features in Titus to land in other container management solutions. This has an added benefit for Netflix in the longer term, as it will provide us better off-the-shelf solutions in the future.

And finally, a part of why we are open-sourcing is our desire to give back and share with the community outside Netflix. We hope open sourcing will lead to active engagements with other companies who are working on similar engineering challenges. Our team members also enjoy being able to present their work externally and future team members can learn what they have an opportunity to work on.

How is Titus different from other container platforms?

To ensure we are investing wisely at Netflix, we stay well aware of off-the-shelf infrastructure technologies. In addition to the aforementioned container orchestration front, we also stay deeply connected with the direction and challenges of the underlying container runtime technologies such as Docker (Moby, container-d) and CRI-O. We regularly meet with engineering teams at the companies both building these solutions as well as the teams using them in their production infrastructures. By balancing the knowledge of what is available through existing solutions with our needs, we believe Titus is the best solution for container management at Netflix.

A few of those key reasons are highlighted below:

The first is a tight integration between Titus and both Amazon and Netflix infrastructure. Given that Netflix infrastructure leverages AWS so broadly, we decided to seamlessly integrate, and take advantage of as much functionality AWS had to offer. Titus has advanced ENI and security group management support spanning not only our networking fabric but also our scheduling logic. This allows us to handle ENIs and IPs as resources and ensure safe large scale deployments that consider EC2 VPC API call rate limits. Our IAM role support, which allows secure EC2 applications to run unchanged, is delivered through our Amazon EC2 metadata proxy. This proxy also allows Titus to give a container specific metadata view, which enables various application aspects such as service discovery. We have leveraged AWS Auto Scaling to provide container cluster auto scaling with the same policies that would be used for virtual machines. We also worked with AWS on the design of IP target groups for Application Load Balancers, which brings support for full IP stack containers and AWS load balancing. All these features together enable containerized applications to transparently integrate with internal applications and Amazon services.

In order to incrementally enable applications to transition to containers while keeping as many systems familiar as possible, we decided to leverage existing Netflix cloud platform technologies, making them container aware. We chose this path to ensure a common developer and operational approach between VMs and containers. This is evident through our Spinnaker enablement, support in our service discovery (Eureka), changes in our telemetry system (Atlas), and performance insight technologies.

Next is scale, which has many dimensions. First, we run over a thousand different applications, with some being very compute heavy (media encoding), some being critical Netflix customer facing services, some memory and GPU heavy (algorithm training), some being network bound (stream processing), some that are happy with resource over commitment (big data jobs) and some that are not. We launch up to a half million containers and 200,000 clusters per day. We also rotate hundreds of thousands of EC2 virtual machines per month to satisfy our elastic workloads. While there are solutions that help solve some of these problems, we do not believe there are off-the-shelf solutions that can take on each of these scale challenges.

Finally, Titus allows us to quickly and nimbly add features that are valuable as our needs evolve, and as we grow to support new use-cases. We always try to maintain a philosophy of “just enough” vs “just in case” with the goal of keeping things as simple and maintainable as possible. Below are a few examples of functionality we’ve been able to quickly develop in response to evolving business and user needs:

In the scheduling layer, we support advanced concepts such as capacity management, agent management, and dynamic scheduling profiles. Capacity management ensures all critical applications have the capacity they require. Agent management provides multiple functions required to support a fleet of thousands of hosts. Agent management is inclusive of host registration and lifecycle, automatic handling of failing hosts, and autoscaling hosts for efficiency. We have dynamic scheduling profiles that understand the differences needed in scheduling between application types (customer facing services vs. internal services vs. batch) and differences in scheduling needed during periods of normal or degraded health. These scheduling profiles help us optimize scheduling considering real world trade-offs between reliability, efficiency and job launch time latencies.

In container execution, we have a unique approach to container composition, Amazon VPC networking support, isolated support for log management, a unique approach to vacating decommissioned nodes, and an advanced operational health check subsystem. For container composition, we inject our system services into containers before running the user’s workload in the container. We classify container networking traffic using BPF and perform QoS using HTB/ECN ensuring we provide highly performant, burstable as well as sustained throughput to every container. We isolate log uploading and stdio processing within the container’s cgroup. Leveraging Spinnaker, we are able to offload upgrade node draining operations in an application specific way. We have operationalized the detection and remediation of kernel, container runtime, EC2, and container control plane health issues. For our security needs, we run all containers with user namespaces, and provide transparent direct user access to only the container.

Titus is designed to satisfy Netflix’s complex scalability requirements and deep Amazon and Netflix infrastructure integration needs, all while giving Netflix the ability to quickly innovate on the exact scheduling and container execution features we require. Hopefully, by walking through our goals in detail, you can see how Titus’s approach to container management may apply to your use cases.

Preparing for open sourcing

In the fourth quarter of 2017, we opened up Titus’s source code to a set of companies that had similar technical challenges as Netflix in the container management space. Some of these companies were looking for a modern container batch and service scheduler on Mesos. Others were looking for a container management platform that was tightly integrated with Amazon AWS. And others still were looking for a container management platform that works well with NetflixOSS technologies such as Spinnaker and Eureka.

By working with these companies to get Titus working in their AWS accounts, we learned how we could better prepare Titus for being fully open sourced. Those experiences taught us how to disconnect Titus from internal Netflix systems, the level of documentation needed to get people started with Titus, and what hidden assumptions we relied on in our EC2 configuration.

Through these partnerships, we received feedback that Titus really shined due to our Amazon AWS integration and the production focused operational aspects of the platform. We also heard how operating a complex container management platform (like Titus) is going to be challenging for many.

With all these learnings in mind, we strove to create the best documentation possible for getting Titus up and running. We’ve captured that information on the Titus documentation site.

Wrapping Up

Open sourcing Titus marks a major milestone after over three years of development, operational battle hardening, customer focus, and sharing/collaboration with our peers. We hope that this effort can help others with the challenges they are facing, and bring new options to container management across the whole OSS community.

In the near future we will keep feature development in Titus well aligned with Netflix’s product direction. We plan to share our roadmap in case others are interested in seeing our plans and contributing. We’d love to hear your feedback. We will be discussing Titus at our NetflixOSS meetup this evening and will post the video later in the week.

Appendix

Conference talks: Dockercon 2015, QCon NYC 2016, re:Invent 2016, QCon NYC 2017, re:Invent 2017, and Container World 2018
Articles: Netflix techblog posts (1, 2), and ACM Queue


Titus, the Netflix container management platform, is now open source was originally published in the Netflix TechBlog on Medium.
