
Preparing for the Certified Kubernetes Application Developer (CKAD) Exam Using Amazon EKS

· 32 min read
Scottie Enriquez
Senior Solutions Developer at Amazon Web Services

Motivation and Background

While I've used Kubernetes professionally in a few capacities (particularly in customer engagements while working at AWS), I wanted to cement my knowledge and increase my mastery with a systematic approach. I decided to prepare for the Certified Kubernetes Application Developer (CKAD) exam. I've taken and passed more than a dozen technology certification exams spanning AWS, Azure, HashiCorp, and more. This exam is unique in several ways. Namely, it's all hands-on in a lab environment. Azure exams often have a coding, configuration, or CLI command component, but even these are typically multiple-choice questions. The CKAD presents you with a virtual desktop and several Kubernetes clusters, making you tackle 15-20 tasks with a strict two-hour time limit. I put together this repository and post for a few reasons:

  • I wanted to document all of my hands-on preparation for when I have to recertify in two years
  • I wanted to share my knowledge with others and offer a supplemental guide to a CKAD course
  • Since the CKAD exam focuses on Kubernetes from a cloud-agnostic perspective, I wanted to fill in the gaps in my own knowledge of running Kubernetes in the AWS ecosystem (e.g., Karpenter, Container Insights, etc.)
  • Many courses and guides leverage Microk8s or minikube to run Kubernetes locally, but I wanted to focus on cloud-based infrastructure, especially for things like EBS volumes created via PVCs, ELBs created via a Service, etc.

In summary, this material focuses on hands-on exercises for preparing for the exam and other tools in the cloud-agnostic and AWS ecosystems.

Preparing for the Exam

While two hours may sound like plenty of time, you'll need to work quickly to complete the exam. With an average of six to eight minutes per exercise (each is not timed individually), ensuring you can work efficiently and ergonomically is paramount. The following items were incredibly useful for me:

  • Running through a practice exam to get a feel for the CKAD structure
  • Proficiency with Vim motions (since most of the exam takes place in a terminal) to efficiently edit code
  • Generating YAML manifests via the command line for new resources instead of copying and pasting from documentation (e.g., kubectl create namespace namespace-one -o yaml --dry-run=client), as shown in the combined example after this list
  • Generating YAML manifests for existing resources that do not have one (e.g., kubectl get namespace namespace-one -o yaml > namespace.yaml)
  • Leveraging the explain command instead of looking up resource properties in the web documentation (e.g., kubectl explain pod.spec)
  • Memorizing the syntax for running commands in a container (e.g., kubectl exec -it pod-one -- /bin/sh) and for quickly creating a new Pod to run commands from (e.g., kubectl run busybox-shell --image=busybox --rm -it --restart=Never -- sh)
  • Refreshing knowledge of Docker commands like exporting an image (i.e., docker save image:tag --output image.tar)
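Putting a few of these tips together, a typical flow is to scaffold a manifest with a client-side dry run, tweak it, and apply it. The resource and file names below are only illustrative:

# scaffold a Deployment manifest without creating anything on the cluster
kubectl create deployment web --image=nginx --replicas=3 --dry-run=client -o yaml > web-deployment.yaml
# adjust the generated manifest (labels, resources, probes, etc.)
vim web-deployment.yaml
# create the resource and confirm it exists
kubectl apply -f web-deployment.yaml
kubectl get deployment web -o wide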

Materials and Getting Started

All code shown here resides in this GitHub repository. In addition to this content, I highly recommend the following:

My preferred approach was to work through the Pluralsight course first. After reviewing the classroom material, I designed and implemented the examples below. If you have foundational Kubernetes knowledge, skip to the most useful exercises. Each one is designed to be a standalone experience.

00: eksctl Configuration

eksctl is a powerful CLI tool that quickly spins up and tears down Kubernetes clusters via Amazon EKS. Nearly all of the exercises below start by leveraging the tool to create a cluster:

00-eksctl-configuration/create-cluster.sh
# before running these commands, first authenticate with AWS (e.g., aws configure sso)
eksctl create cluster -f cluster.yaml
# if connecting to an existing cluster
eksctl utils write-kubeconfig --cluster=learning-kubernetes

The default cluster configuration uses a two-node cluster of t3.medium instances to keep hourly costs as low as possible. At the time of writing this blog post, the exam tests on Kubernetes version 1.30.

00-eksctl-configuration/cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: learning-kubernetes
  region: us-west-2
  version: "1.30"

nodeGroups:
  - name: node-group-1
    instanceType: t3.medium
    desiredCapacity: 2
    minSize: 2
    maxSize: 2

This cluster can be transient for learning purposes. To keep costs low, be sure to run the destroy-cluster.sh script to delete the cluster when not in use. I also recommend configuring an AWS Budget as an extra measure of cost governance.

00-eksctl-configuration/destroy-cluster.sh
eksctl delete cluster --config-file=cluster.yaml --disable-nodegroup-eviction

01: First Deployment with Nginx (CKAD Topic)

With the cluster created, we can now make our first Deployment. We'll start by creating a web server with three replicas using the latest Nginx image:

01-first-deployment-with-nginx/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx-container
          image: nginx:latest
          ports:
            - containerPort: 80

The following commands leverage the manifest to create three Pods and inspect them:

01-first-deployment-with-nginx/commands.sh
# assumes cluster created from 00-eksctl-configuration first
kubectl apply -f ./
# returns three pods (e.g., nginx-deployment-5449cb55b-jfgnc)
kubectl get pods -o wide
# clean up
kubectl delete -f ./

02: Pod Communication over IP (CKAD Topic)

The Deployment in this example is identical to the previous: a web server with three replicas. Use the following commands to explore how IP addressing works for Pods:

02-pod-communication-over-ip/commands.sh
# assumes cluster created from 00-eksctl-configuration first
kubectl apply -f ./
# 192.168.51.32 is the IP address of one of my pods, but yours will be different
# when the pod is replaced, this IP address changes
kubectl get pods -o wide
# creates a pod with the BusyBox image
# entering BusyBox container shell to communicate with pods in the cluster
kubectl run -it --rm --restart=Never busybox --image=busybox sh
# replace the IP address as needed
wget 192.168.51.32
# displays the nginx homepage code
cat index.html
# returning to default shell and deletes the BusyBox pod
exit
# clean up
kubectl delete -f ./

03: First Service (CKAD Topic)

Since each Pod has a separate IP address that can change, we can use a Service to keep track of Pod IP addresses on our behalf. This abstraction allows us to group Pods via a selector and reference them through a single Service. In the Service manifest (leveraging the same Deployment as before), we specify how to select which Pods to target, which port to expose, and the type of Service:

03-first-service/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    name: nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  # nodePort is used for external access
  # ClusterIP services are only accessible within the cluster
  # NodePort services are a way to expose ClusterIP services externally without using a cloud provider's load balancer
  # LoadBalancer is covered in the next section
  type: ClusterIP

Using the Service, we have a single interface to the three nginx replicas. We can also use the Service name instead of its IP address.

03-first-service/commands.sh
# assumes cluster created from 00-eksctl-configuration first
kubectl apply -f ./
# 10.100.120.203 is the service IP address
kubectl describe service nginx-service
# entering BusyBox container shell
kubectl run -it --rm --restart=Never busybox --image=busybox sh
# can also use the IP address instead
wget nginx-service
cat index.html
# returning to default shell
exit
# clean up
kubectl delete -f ./

04: Elastic Load Balancers for Kubernetes Service (CKAD Topic)

A significant benefit of Kubernetes is that it can create and manage resources in AWS on our behalf. Using the AWS Load Balancer Controller, we can specify annotations to create a Service of type LoadBalancer that leverages an Elastic Load Balancer. Using the same Deployment from the past two sections, this manifest illustrates how to leverage a Network Load Balancer for the Service:

04-load-balancer/load-balancer.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-load-balancer
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    # by default, a Classic Load Balancer is created
    # https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/introduction.html
    # this annotation creates a Network Load Balancer
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  selector:
    name: nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
      - ip: "192.0.2.127"

The following commands deploy the LoadBalancer Service:

04-load-balancer/commands.sh
# assumes cluster created from 00-eksctl-configuration first
kubectl apply -f ./
# entering BusyBox container shell
kubectl run -it --rm --restart=Never busybox --image=busybox sh
wget nginx-load-balancer
cat index.html
# returning to default shell
exit
# clean up
# this command ensures that the load balancer is deleted
# be sure to run before destroying the cluster
kubectl delete -f ./

05: Ingress (CKAD Topic)

Services of type ClusterIP only support internal cluster networking. The NodePort configuration allows for external communication by exposing the same port on every node (i.e., EC2 instances in our case). However, this introduces a different challenge because the consumer must know the nodes' IP addresses (and nodes are often transient). The LoadBalancer configuration has a 1:1 relationship with the Service. If you have numerous Services, the cost of load balancers may not be feasible. Ingress alleviates some of these challenges by providing a single external interface over HTTP or HTTPS with support for path-based routing. Leveraging the Nginx example one last time, we can create an Ingress that exposes a Service with the NodePort configuration via an Application Load Balancer.

05-ingress/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nginx-service
                port:
                  number: 80
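The Ingress above targets an nginx-service, whose manifest is not shown here. A minimal NodePort version (the name and selector are assumed to match the earlier examples, so treat this as a sketch rather than the repository's exact file) would look something like this:

apiVersion: v1
kind: Service
metadata:
  # assumed name; must match the backend referenced by the Ingress
  name: nginx-service
spec:
  selector:
    name: nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  # NodePort exposes the Service on every node so the ALB can reach it
  type: NodePort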

The following commands install the AWS Load Balancer Controller, configure required IAM permissions, and deploy the Ingress. Be sure to set the $AWS_ACCOUNT_ID environment variable first.

05-ingress/commands.sh
# assumes cluster created from 00-eksctl-configuration first
# install AWS Load Balancer Controller
# https://docs.aws.amazon.com/eks/latest/userguide/lbc-manifest.html
curl -O https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.7.2/docs/install/iam_policy.json
aws iam create-policy \
--policy-name AWSLoadBalancerControllerIAMPolicy \
--policy-document file://iam_policy.json
rm iam_policy.json
eksctl utils associate-iam-oidc-provider --region=us-west-2 --cluster=learning-kubernetes --approve
eksctl create iamserviceaccount \
--cluster=learning-kubernetes \
--namespace=kube-system \
--name=aws-load-balancer-controller \
--role-name AmazonEKSLoadBalancerControllerRole \
--attach-policy-arn=arn:aws:iam::$AWS_ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy \
--approve
kubectl apply \
--validate=false \
-f https://github.com/jetstack/cert-manager/releases/download/v1.13.5/cert-manager.yaml
curl -Lo v2_7_2_full.yaml https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases/download/v2.7.2/v2_7_2_full.yaml
sed -i.bak -e '596,604d' ./v2_7_2_full.yaml
sed -i.bak -e 's|your-cluster-name|learning-kubernetes|' ./v2_7_2_full.yaml
kubectl apply -f v2_7_2_full.yaml
rm v2_7_2_full.yaml*
kubectl get deployment -n kube-system aws-load-balancer-controller
# apply manifests
kubectl apply -f ./
# gets address (e.g., http://k8s-default-ingress-08daebdfec-204015293.us-west-2.elb.amazonaws.com/) that can be opened in a web browser
kubectl describe ingress
# clean up
kubectl delete -f ./

06: Jobs and CronJobs (CKAD Topic)

Jobs are a powerful mechanism for reliably running Pods to completion. CronJobs extend this functionality by supporting a recurring schedule.

06-jobs-and-cronjobs/job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
        - name: pi
          image: perl:5.34.0
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
06-jobs-and-cronjobs/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  # runs every minute
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: hello
              image: busybox:1.28
              imagePullPolicy: IfNotPresent
              command:
                - /bin/sh
                - -c
                - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
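To try these out, commands along these lines work (a sketch; the repository may include its own commands.sh):

# assumes cluster created from 00-eksctl-configuration first
kubectl apply -f job.yaml
# wait for the Job to complete, then view the computed digits of pi
kubectl get jobs
kubectl logs job/pi
kubectl apply -f cronjob.yaml
# after a minute or two, a Job is created for each scheduled run
kubectl get cronjobs
kubectl get jobs --watch
# clean up
kubectl delete -f ./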

07: Metrics Server and Pod Autoscaling (CKAD Topic)

Metrics Server provides container-level resource metrics for autoscaling within Kubernetes. It is not installed by default and is meant only for autoscaling purposes. There are other options, such as Container Insights, Prometheus, and Grafana, for more accurate resource usage metrics (all covered later in this post). With Metrics Server installed, a HorizontalPodAutoscaler resource can be configured with values such as target metric, minimum replicas, maximum replicas, etc.

07-metrics-server-and-pod-autoscaling/horizontal-pod-autoscaler.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
status:
  observedGeneration: 1
  currentReplicas: 1
  desiredReplicas: 1
  currentMetrics:
    - type: Resource
      resource:
        name: cpu
        current:
          averageUtilization: 0
          averageValue: 0
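To see the autoscaler in action, a minimal sketch (following the upstream Kubernetes HPA walkthrough; the repository's own scripts may differ) is to install Metrics Server, deploy the php-apache example, and generate load:

# assumes cluster created from 00-eksctl-configuration first
# install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# deploy the php-apache Deployment and Service from the upstream example
kubectl apply -f https://k8s.io/examples/application/php-apache.yaml
# apply the HorizontalPodAutoscaler above
kubectl apply -f horizontal-pod-autoscaler.yaml
# generate load from a temporary Pod
kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"
# watch the replica count scale up in a separate terminal
kubectl get hpa php-apache --watch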

HorizontalPodAutoscalers create and destroy Pods based on metric usage. On the other hand, vertical autoscaling rightsizes the resource limits (covered in the next section) for Pods.

08: Resource Management (CKAD Topic)

When creating a Pod, you can optionally specify an estimate of the resources a container needs (e.g., CPU and RAM). This baseline estimate is specified in the requests parameter. The limits parameter specifies the threshold at which a container is throttled (CPU) or terminated (memory) to prevent starvation of other processes. Limits also help with cluster capacity planning (e.g., EKS node groups). Below is the Nginx Deployment from earlier with resource management applied:

08-resource-management/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      name: nginx
  template:
    metadata:
      labels:
        name: nginx
    spec:
      containers:
        - name: nginx-container
          image: nginx:latest
          ports:
            - containerPort: 80
          resources:
            # estimated resources for container to run optimally
            requests:
              cpu: 100m
              memory: 128Mi
            # kills the container if threshold is crossed
            limits:
              cpu: 200m
              memory: 256Mi

09: Karpenter

In the previous two sections, we covered how additional Pods are created (i.e., horizontal scaling) and how resources (e.g., CPU and RAM) are requested and limited in Kubernetes. The next topic is managing the underlying compute when additional infrastructure is required. There are two primary options for scaling compute using EKS on EC2: Cluster Autoscaler and Karpenter. On AWS, Cluster Autoscaler leverages EC2 Auto Scaling Groups (ASGs) to manage node groups. Cluster Autoscaler typically runs as a Deployment in the cluster. Karpenter does not leverage ASGs, allowing for the ability to select from a wide array of instance types that match the exact requirements of the additional containers. Karpenter also allows for easy adoption of Spot for further cost savings on top of better matching the workload to compute resources. The cluster defined in 00-eksctl-configuration uses an unmanaged node group and does not leverage Cluster Autoscaler or Karpenter. To demonstrate how to leverage Karpenter, we'll need a different cluster configuration file. We can dynamically generate it like so:

09-karpenter/commands.sh
# set environment variables
export KARPENTER_NAMESPACE=karpenter
export KARPENTER_VERSION=v0.32.10
export K8S_VERSION=1.28
export AWS_PARTITION="aws"
export CLUSTER_NAME="${USER}-karpenter-demo"
export AWS_DEFAULT_REGION="us-west-2"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export ARM_AMI_ID="$(aws ssm get-parameter --name /aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2-arm64/recommended/image_id --query Parameter.Value --output text)"
export AMD_AMI_ID="$(aws ssm get-parameter --name /aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2/recommended/image_id --query Parameter.Value --output text)"
export GPU_AMI_ID="$(aws ssm get-parameter --name /aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2-gpu/recommended/image_id --query Parameter.Value --output text)"
# deploy resources to support Karpenter
aws cloudformation deploy \
--stack-name "Karpenter-${CLUSTER_NAME}" \
--template-file karpenter-support-resources-cfn.yaml \
--capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides "ClusterName=${CLUSTER_NAME}"
# generate cluster file and deploy
cat <<EOF > cluster.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${AWS_DEFAULT_REGION}
  version: "${K8S_VERSION}"
  tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}

iam:
  withOIDC: true
  serviceAccounts:
    - metadata:
        name: karpenter
        namespace: "${KARPENTER_NAMESPACE}"
      roleName: ${CLUSTER_NAME}-karpenter
      attachPolicyARNs:
        - arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:policy/KarpenterControllerPolicy-${CLUSTER_NAME}
      roleOnly: true

iamIdentityMappings:
  - arn: "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}"
    username: system:node:{{EC2PrivateDNSName}}
    groups:
      - system:bootstrappers
      - system:nodes

managedNodeGroups:
  - instanceType: t3.medium
    amiFamily: AmazonLinux2
    name: ${CLUSTER_NAME}-ng
    desiredCapacity: 2
    minSize: 2
    maxSize: 5
EOF
eksctl create cluster -f cluster.yaml

Next, we install Karpenter on the EKS cluster:

09-karpenter/commands.sh
# set additional environment variables
export CLUSTER_ENDPOINT="$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.endpoint" --output text)"
export KARPENTER_IAM_ROLE_ARN="arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/${CLUSTER_NAME}-karpenter"
# install Karpenter
helm registry logout public.ecr.aws
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
--set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=${KARPENTER_IAM_ROLE_ARN}" \
--set "settings.clusterName=${CLUSTER_NAME}" \
--set "settings.interruptionQueue=${CLUSTER_NAME}" \
--set controller.resources.requests.cpu=1 \
--set controller.resources.requests.memory=1Gi \
--set controller.resources.limits.cpu=1 \
--set controller.resources.limits.memory=1Gi \
--wait

Finally, we create a node pool that specifies what compute our workload can support. In this case, Karpenter can provision EC2 Spot instances from the c, m, or r families from any generation greater than two running Linux on AMD64 architecture.

09-karpenter/commands.sh
# create NodePool
cat <<EOF > node-pool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  amiSelectorTerms:
    - id: "${ARM_AMI_ID}"
    - id: "${AMD_AMI_ID}"
EOF
kubectl apply -f node-pool.yaml

With the new EKS cluster deployed and Karpenter installed, we can add new Pods and see new EC2 instances created on our behalf.

09-karpenter/commands.sh
# deploy pods and scale
kubectl apply -f deployment.yaml
kubectl scale deployment inflate --replicas 5

In less than a minute after scaling the inflate Deployment, a new EC2 instance is created that matches the node pool specifications. In my case, a c5n.2xlarge instance was deployed.

Karpenter instances

As expected, the node pool leverages Spot instances.

Karpenter instances

You can monitor the Karpenter logs via the command below. Less than a minute after deleting the Deployment, the c5n.2xlarge instance was terminated. Be sure to follow the cleanup steps when done to ensure no resources become orphaned.

09-karpenter/commands.sh
# monitor Karpenter events
kubectl logs -f -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter -c controller
# scale down
kubectl delete deployment inflate
# clean up
helm uninstall karpenter --namespace "${KARPENTER_NAMESPACE}"
aws cloudformation delete-stack --stack-name "Karpenter-${CLUSTER_NAME}"
aws ec2 describe-launch-templates --filters Name=tag:karpenter.k8s.aws/cluster,Values=${CLUSTER_NAME} |
jq -r ".LaunchTemplates[].LaunchTemplateName" |
xargs -I{} aws ec2 delete-launch-template --launch-template-name {}
eksctl delete cluster --name "${CLUSTER_NAME}"

10: Persistent Volumes Using EBS (CKAD Topic)

Storage in Kubernetes can be classified as either ephemeral or persistent. Without leveraging PersistentVolumes (PVs), containers read and write data to the volume attached to the node they run on. Ephemeral storage is temporary and tied to the Pod's lifecycle. If requirements dictate that the storage persist beyond the Pod's lifecycle or be shared across Pods, there are some prerequisites before EBS can be leveraged for PVs.

The first step is installing the AWS EBS Container Storage Interface (CSI) driver. The next step is to define a StorageClass (SC) that includes configuration such as volume type (e.g., gp3), encryption, etc. The final step is to reference a PersistentVolumeClaim (PVC) when deploying a Pod in order to dynamically provision the EBS volume and attach it to the containers.

In practice, this goes as follows:

10-persistent-volumes/commands.sh
# assumes cluster created from 00-eksctl-configuration first
# create an OIDC provider
eksctl utils associate-iam-oidc-provider --cluster learning-kubernetes --approve
# install aws-ebs-csi-driver
eksctl create iamserviceaccount \
--name ebs-csi-controller-sa \
--namespace kube-system \
--cluster learning-kubernetes \
--role-name AmazonEKS_EBS_CSI_DriverRole \
--role-only \
--attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
--approve
eksctl create addon --name aws-ebs-csi-driver --cluster learning-kubernetes --service-account-role-arn arn:aws:iam::$AWS_ACCOUNT_ID:role/AmazonEKS_EBS_CSI_DriverRole --force

Once completed, the add-on will appear in the AWS Console.

EBS CSI

Next, define the StorageClass and PersistentVolumeClaim:

10-persistent-volumes/storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: xfs
  type: gp3
  encrypted: "true"
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.ebs.csi.aws.com/zone
        values:
          - us-west-2a
          - us-west-2b
          - us-west-2c
10-persistent-volumes/persistent-volume-claim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 4Gi

Finally, attach the PVC to the Pod and deploy:

10-persistent-volumes/persistent-volume-claim.yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: ebs-claim
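To deploy and verify the dynamic provisioning, something like the following works (a sketch; the repository's commands.sh may differ):

# apply the StorageClass, PersistentVolumeClaim, and Pod
kubectl apply -f ./
# the claim moves from Pending to Bound once the Pod is scheduled
kubectl get pvc ebs-claim
kubectl get pv
# confirm the container is writing to the EBS-backed volume
kubectl exec app -- tail /data/out.txt
# clean up (deletes the dynamically provisioned volume)
kubectl delete -f ./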

As soon as the Pod is created, a gp3 volume is provisioned.

PVC

11: Prometheus and Grafana

The next several sections focus on observability. Prometheus is an open-source monitoring system commonly leveraged in Kubernetes clusters. As a de facto standard, it's widely used with Grafana to provide cluster monitoring. Using Helm, we can quickly deploy both of these tools to our cluster.

11-prometheus-and-grafana/commands.sh
# assumes cluster created from 00-eksctl-configuration first
# install Helm on local machine
# https://helm.sh/docs/intro/install/
brew install helm
# install Helm charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install v60-0-1 prometheus-community/kube-prometheus-stack --version 60.0.1
# use http://localhost:9090 to access Prometheus
kubectl port-forward svc/prometheus-operated 9090
# get Grafana password for admin
kubectl get secret v60-0-1-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
# use http://localhost:3000 to access Grafana
kubectl port-forward svc/v60-0-1-grafana 3000:80

Using port forwarding, we can quickly access Prometheus:

Prometheus

And Grafana:

Grafana

12: Container Insights

Prometheus and Grafana are both open-source and cloud-agnostic. AWS has a native infrastructure monitoring offering called Container Insights that integrates cluster data with the AWS Console via CloudWatch and can be enabled with two simple commands:

12-container-insights/commands.sh
# assumes cluster created from 00-eksctl-configuration first
# configure permissions
# change role to the one created by eksctl
aws iam attach-role-policy \
--role-name $EKSCTL_NODEGROUP_ROLE_NAME \
--policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
# wait until add-on is installed and give time for data to propagate
aws eks create-addon --cluster-name learning-kubernetes --addon-name amazon-cloudwatch-observability

Container Insights

It's worth noting that Container Insights can also ingest Prometheus metrics.

13: EKS Split Cost Allocation Data in Cost and Usage Reports

The AWS Cost and Usage Report (CUR) is the most comprehensive and detailed billing data available to customers. It offers a well-defined schema that we can use to write SQL queries against via Athena. CUR data offers resource-level time series data for in-depth AWS cost and usage analysis. In April 2024, AWS released EKS split cost allocation data for CUR. Previously, the lowest resource level available was an EC2 instance. This feature adds billing data for container-level resources in EKS (e.g., Pods).

Create a new CUR via Data Exports in the Billing and Cost Management Console if required. If you have an existing CUR without split cost allocation data, you can modify the report content configuration to add this.

Data Exports

With this configured, we can use the following SQL query in Athena to gather cost and usage data for the EKS cluster resources:

13-cur-split-cost-allocation/query.sql
SELECT
  DATE_FORMAT(DATE_TRUNC('day', "line_item_usage_start_date"), '%Y-%m-%d') AS "date",
  "line_item_resource_id" AS "resource_id",
  ARBITRARY(CONCAT(
    REPLACE(SPLIT_PART("line_item_resource_id", '/', 1), 'pod', 'cluster'),
    '/',
    SPLIT_PART("line_item_resource_id", '/', 2)
  )) AS "cluster_arn",
  ARBITRARY(SPLIT_PART("line_item_resource_id", '/', 2)) AS "cluster_name",
  ARBITRARY("split_line_item_parent_resource_id") AS "node_instance_id",
  ARBITRARY("resource_tags_aws_eks_node") AS "node_name",
  ARBITRARY(SPLIT_PART("line_item_resource_id", '/', 3)) AS "namespace",
  ARBITRARY("resource_tags_aws_eks_workload_type") AS "controller_kind",
  ARBITRARY("resource_tags_aws_eks_workload_name") AS "controller_name",
  ARBITRARY("resource_tags_aws_eks_deployment") AS "deployment",
  ARBITRARY(SPLIT_PART("line_item_resource_id", '/', 4)) AS "pod_name",
  ARBITRARY(SPLIT_PART("line_item_resource_id", '/', 5)) AS "pod_uid",
  SUM(CASE WHEN "line_item_usage_type" LIKE '%EKS-EC2-vCPU-Hours' THEN "split_line_item_split_cost" + "split_line_item_unused_cost" ELSE 0.0 END) AS "cpu_cost",
  SUM(CASE WHEN "line_item_usage_type" LIKE '%EKS-EC2-GB-Hours' THEN "split_line_item_split_cost" + "split_line_item_unused_cost" ELSE 0.0 END) AS "ram_cost",
  SUM("split_line_item_split_cost" + "split_line_item_unused_cost") AS "total_cost"
FROM
  cur
WHERE
  "line_item_operation" = 'EKSPod-EC2'
  AND CURRENT_DATE - INTERVAL '7' DAY <= "line_item_usage_start_date"
GROUP BY
  1,
  2
ORDER BY
  "cluster_arn",
  "date" DESC

Athena

AWS also offers open-source QuickSight dashboards that provide a visualization of this data.

14: ConfigMap (CKAD Topic)

The following two sections focus on configuration management. A ConfigMap is a Kubernetes construct that stores non-sensitive key-value pairs (e.g., URLs, feature flags, etc.). There are several ways to consume ConfigMaps, but we'll set an environment variable for a container below. First, I created a TypeScript Cloud Development Kit (CDK) application to deploy a FastAPI container to Elastic Container Registry (ECR). The API is simple:

14-configmap/api-cdk/container/app/main.py
import os

import fastapi

api = fastapi.FastAPI()


@api.get('/api/config')
def config():
    return {
        'message': os.getenv('CONFIG_MESSAGE', 'Message not set')
    }

We publish the container to ECR via CDK:

14-configmap/api-cdk/lib/api-cdk-stack.ts
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { DockerImageAsset } from 'aws-cdk-lib/aws-ecr-assets';

export class ApiCdkStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    const dockerImageAsset = new DockerImageAsset(this, 'MyDockerImage', {
      directory: './container/'
    });
  }
}

Next, we define the ConfigMap:

14-configmap/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-configmap
data:
  config-message: "Hello from ConfigMap!"

Finally, we reference the ConfigMap in the Deployment:

14-configmap/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: config-api-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      name: config-api
  template:
    metadata:
      labels:
        name: config-api
    spec:
      containers:
        - name: config-api-container
          # deployed via CDK
          # replace with your image
          image: 196736724465.dkr.ecr.us-west-2.amazonaws.com/cdk-hnb659fds-container-assets-196736724465-us-west-2:afbbd8d7b43a7f833eb07c26a13d5344fa7656c136b1e27b545490fa58dad983
          ports:
            - containerPort: 8000
          env:
            - name: CONFIG_MESSAGE
              valueFrom:
                configMapKeyRef:
                  name: api-configmap
                  key: config-message
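The verification commands below reach the Pods through a config-api-service, which is not shown above. A minimal ClusterIP Service sketch for it (the name and selector are assumed from the Deployment and the commands; the repository's manifest may differ):

apiVersion: v1
kind: Service
metadata:
  # assumed name; referenced by the wget command below
  name: config-api-service
spec:
  selector:
    name: config-api
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: ClusterIP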

With the API deployed, we can verify that the configuration propagates correctly.

14-configmap/commands.sh
# entering BusyBox container shell
kubectl run -it --rm --restart=Never busybox --image=busybox sh
wget config-api-service:80/api/config
cat config

15: Secrets (CKAD Topic)

Secrets are very similar to ConfigMaps except that they are intended for sensitive information. Opaque is the default Secret type for arbitrary user data; other types exist for SSH credentials, TLS certificates, ~/.dockercfg files, etc. For a complete list of types, see the documentation. Kubernetes Secrets do not encrypt the data on your behalf; that responsibility falls on the developer.

15-secrets/secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: busybox-password
type: Opaque
data:
  password: MWYyZDFlMmU2N2Rm
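The manifest above only defines the Secret. As a short hedged example of consuming it, a container can reference the key through secretKeyRef (the Pod below is illustrative and not part of the repository):

apiVersion: v1
kind: Pod
metadata:
  name: secret-consumer
spec:
  containers:
    - name: busybox
      image: busybox:1.28
      command: ["sh", "-c", "echo $PASSWORD && sleep 3600"]
      env:
        - name: PASSWORD
          valueFrom:
            secretKeyRef:
              name: busybox-password
              key: password

Keep in mind that the data field is only base64-encoded (echo 'MWYyZDFlMmU2N2Rm' | base64 --decode returns 1f2d1e2e67df), which is why access control and encryption at rest still matter.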

16: Multi-Container Pods (CKAD Topic)

In the examples so far, Pods and containers had a 1:1 relationship. Two common patterns for multi-container Pods in Kubernetes are init containers and sidecars. To illustrate these patterns, we'll use a PostgreSQL database with a backend that relies on it. Given that the backend container depends on the database, we must ensure that PostgreSQL is available before starting it. To do so, we can use an init container that verifies the ability to connect to the database. All init containers must run to completion before the Pod's main containers start, and if an init container fails, the Pod cannot start successfully.

16-multi-container-pods/backend.deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-with-database
  namespace: default
spec:
  selector:
    matchLabels:
      app: backend
  replicas: 1
  template:
    metadata:
      labels:
        app: backend
    spec:
      initContainers:
        - name: verify-database-online
          image: postgres
          command: [ 'sh', '-c',
            'until pg_isready -h database-service -p 5432;
            do echo waiting for database; sleep 2; done;' ]
      containers:
        - name: backend
          image: nginx

An example of a sidecar container is a GUI called Adminer for the database. The GUI has a lifecycle tightly coupled to the Postgres container (i.e., if we don't need the database anymore, we don't need the GUI). To configure a sidecar, append another container to the Deployment's spec:

16-multi-container-pods/database.deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-database
  namespace: default
spec:
  selector:
    matchLabels:
      app: database
  replicas: 1
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
        - name: database
          image: postgres
          envFrom:
            - configMapRef:
                name: database-access
          ports:
            - containerPort: 5432
        - name: database-admin
          image: adminer
          ports:
            - containerPort: 8080

With the sidecar in place, we can deploy and leverage the GUI to log into our database.

16-multi-container-pods/commands.sh
# assumes cluster created from 00-eksctl-configuration first
kubectl apply -f database.configmap.yaml
kubectl apply -f backend.deployment.yaml
# check that the primary container is not yet running because the init container has not completed
# STATUS shows as Init:0/1
kubectl get pods
# deploy database and service
kubectl apply -f database.deployment.yaml
kubectl apply -f database.service.yaml
# verify that init container has completed
# get database pod
kubectl get pods
# forward ports
kubectl port-forward pod/postgres-database-697695b774-xcp9p 9000:8080
# open Adminer in browser
# see screenshot for logging in
wget http://localhost:9000
# clean up
kubectl delete -f ./

17: Deployment Strategies (CKAD Topic)

The four most common deployment strategies are rolling, blue/green, canary, and recreate. Rolling updates involve deploying new Pods in a batch while decreasing old Pods at the same rate. This is the default behavior in Kubernetes. Blue/green deployments provision an entirely new environment (green) parallel to the existing one (blue), then perform a Service selector cutover when approved for production release. Canary deployments allow developers to test a new deployment with a subset of users in parallel with the current production release. Recreating an environment involves destroying the old environment and then provisioning a new one, which may result in downtime.

For a blue/green release, let's start by creating the blue and green deployments. The following YAML for the blue deployment is nearly identical to the green. The only difference is the Docker image used.

17-deployment-strategies/blue-green-deployment/blue.deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: blue-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
      role: blue
  template:
    metadata:
      labels:
        app: nginx
        role: blue
    spec:
      # the green deployment uses the green Docker image
      containers:
        - name: blue
          image: scottenriquez/blue-nginx-app
          imagePullPolicy: Always
          ports:
            - containerPort: 80
          resources:
            limits:
              memory: "128Mi"
              cpu: "200m"

By default, the production Service should point to the blue environment.

17-deployment-strategies/blue-green-deployment/production.service.yaml
kind: Service
apiVersion: v1
metadata:
  name: production-service
  labels:
    env: production
spec:
  type: ClusterIP
  selector:
    app: nginx
    # initially targets the blue Deployment; the cutover below switches the selector to the green one
    role: blue
  ports:
    - port: 9000
      targetPort: 80

To perform the release, change the selector on the production Service. Then, verify that the web application serves the green release instead of the blue.

17-deployment-strategies/blue-green-deployment/commands.sh
# perform cutover
# can also be done via manifest
kubectl set selector service production-service 'role=green'
# entering BusyBox container shell
kubectl run -it --rm --restart=Never busybox --image=busybox sh
# verify green in HTML
wget production-service:9000
cat index.html

Switching gears to a canary release, we start by creating stable and canary Deployments. In this code example, the two web applications are nearly identical, except that the canary has a yellow message in a <h1> tag. We control the percentage of canary Pods by splitting the number of canary and stable replicas. For this example, there is a 20% chance of using a canary Pod because there is one canary replica and four stable replicas.

17-deployment-strategies/canary-deployment/canary.deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: canary-deployment
spec:
  # the stable Deployment has four replicas
  replicas: 1
  selector:
    matchLabels:
      track: canary
  template:
    metadata:
      labels:
        app: nginx
        track: canary
    spec:
      # the stable deployment uses the stable Docker image
      containers:
        - name: canary-deployment
          image: scottenriquez/canary-nginx-app
          imagePullPolicy: Always
          ports:
            - containerPort: 80
          resources:
            limits:
              memory: "128Mi"
              cpu: "200m"
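Both the stable and canary Pods carry the app: nginx label, so a single Service selecting on that label spreads traffic across them in proportion to their replica counts. A minimal sketch (the Service name is an assumption, and the repository's manifest may differ):

apiVersion: v1
kind: Service
metadata:
  # assumed name for illustration
  name: canary-demo-service
spec:
  # matches both the stable and canary Pods, so roughly 1 in 5 requests hits the canary
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
  type: ClusterIP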

With this approach, traffic will be directed to the canary pod on average 20% of the time. It may take several requests to the Service, but a canary webpage will eventually be returned.

Canary

18: Probes (CKAD Topic)

There are two primary types of probes: readiness and liveness. Kubernetes uses liveness probes to determine when to restart a container (i.e., a health check). It uses readiness probes to determine when a container is ready to accept traffic. These two probes are independent and unaware of each other. Probes of type HTTP, TCP, gRPC, and shell commands are supported. For this example, we'll use HTTP for both and add them as endpoints to an API:

18-probes-and-health-checks/api-cdk/container/app/main.py
import fastapi

api = fastapi.FastAPI()


@api.get('/api/healthy')
def healthy():
    return {
        'healthy': True
    }


@api.get('/api/ready')
def ready():
    return {
        'ready': True
    }

In the Deployment manifest, we simply map the probes to the endpoints:

18-probes-and-health-checks/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: probe-api-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      name: probe-api
  template:
    metadata:
      labels:
        name: probe-api
    spec:
      containers:
        - name: probe-api-container
          # deployed via CDK
          # replace with your image
          image: 196736724465.dkr.ecr.us-west-2.amazonaws.com/cdk-hnb659fds-container-assets-196736724465-us-west-2:86b591781a296c7b2980608eeb67e30aaf316c732c92b6a47e536555bce0dc93
          ports:
            - containerPort: 8000
          resources:
            limits:
              cpu: 250m
              memory: 256Mi
          livenessProbe:
            httpGet:
              path: /api/healthy
              port: 8000
          readinessProbe:
            httpGet:
              path: /api/ready
              port: 8000

19: SecurityContext (CKAD Topic)

A SecurityContext resource configures a Pod's privilege and access control settings, such as enabling or disabling Linux capabilities and running as a specific user ID. This topic is straightforward but critical for the exam.

19-security-context/pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-pod
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  volumes:
    - name: security-context-pod-volume
      emptyDir: {}
  containers:
    - name: security-context-container
      image: busybox:1.28
      command: [ "sh", "-c", "id" ]
      volumeMounts:
        - name: security-context-pod-volume
          mountPath: /data/security-context-volume
      securityContext:
        allowPrivilegeEscalation: false

20: ServiceAccounts and Role-Based Access Control (CKAD Topic)

Just as Identity and Access Management (IAM) in AWS grants principals permissions to perform specific actions on specific resources, Kubernetes Roles and ServiceAccounts allow workloads within the cluster to call the control plane and perform operations on the cluster. For this example, let's grant a Pod access to get other Pods. We start by creating a Role:

20-service-accounts-and-rbac/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
  # "" indicates the core API group
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list"]

A ServiceAccount provides an identity for processes that run inside Pods. We generate one next:

20-service-accounts-and-rbac/service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    kubernetes.io/enforce-mountable-secrets: "true"
  name: sa-pod-reader

Next, we bind the Role to the ServiceAccount:

20-service-accounts-and-rbac/role-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
  - kind: ServiceAccount
    name: sa-pod-reader
    apiGroup: ""
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Then, we create a Pod that leverages the ServiceAccount:

20-service-accounts-and-rbac/pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: reader-pod
spec:
  serviceAccountName: sa-pod-reader
  containers:
    - name: reader-container
      image: alpine:3.12
      resources:
        limits:
          memory: "128Mi"
          cpu: "500m"
      command: ["/bin/sh"]
      args: ["-c", "sleep 3600"]

Finally, we can test specific operations against the API server to validate that certain actions are allowed and others are denied.

20-service-accounts-and-rbac/commands.sh
# entering Pod shell
kubectl exec -it reader-pod -- sh
# install curl
apk --update add curl
# get Pods
# allowed by Role
curl -s --header "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt https://kubernetes/api/v1/namespaces/default/pods
# get Secrets
# denied by Role
curl -s --header "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt https://kubernetes/api/v1/namespaces/default/secrets

21: NetworkPolicy (CKAD Topic)

By default, network traffic between Pods is unrestricted. In other words, any Pod can communicate with any other Pod. A NetworkPolicy is a Kubernetes resource that uses selectors to implement granular ingress and egress rules. However, a network plugin must first be installed in the cluster to leverage NetworkPolicies. For this example, we will use Calico. For EKS, this only requires two commands:

21-network-policy/commands.sh
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.1/manifests/tigera-operator.yaml
kubectl create -f - <<EOF
kind: Installation
apiVersion: operator.tigera.io/v1
metadata:
  name: default
spec:
  kubernetesProvider: EKS
  cni:
    type: AmazonVPC
  calicoNetwork:
    bgp: Disabled
EOF

With the network plugin installed, we can define a simple NetworkPolicy based on three example Pods:

21-network-policy/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: pod-one-and-two-network-policy
spec:
  podSelector:
    matchLabels:
      # this label is also attached to the first two Pods but not the third
      network: allow-pod-one-and-two
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              network: allow-pod-one-and-two
  egress:
    - to:
        - podSelector:
            matchLabels:
              network: allow-pod-one-and-two
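The three example Pods themselves are not shown above. A hedged sketch of what pod-one might look like (pod-two would carry the same network label, while pod-three omits it):

apiVersion: v1
kind: Pod
metadata:
  name: pod-one
  labels:
    # pod-three would not have this label, so the policy blocks its traffic
    network: allow-pod-one-and-two
spec:
  containers:
    - name: busybox
      image: busybox:1.28
      command: ["sh", "-c", "sleep 3600"]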

After applying the manifest above, we can validate that the network traffic now behaves as expected (i.e., the first two Pods can communicate with each other without allowing traffic from the third).

21-network-policy/commands.sh
# get Pod IP addresses
kubectl get pods -o wide
# enter pod-three shell
kubectl exec -it pod-three -- sh
# ping pod-one and pod-two IP address (replace with yours)
# these commands should hang
ping 192.168.2.246
ping 192.168.21.2
# returning to default shell
exit
# enter pod-one shell
kubectl exec -it pod-one -- sh
# ping pod-two IP address (replace with yours)
# this command should be successful
ping 192.168.21.2
# ping pod-three IP address (replace with yours)
# this command should hang
ping 192.168.71.123

22: ArgoCD

ArgoCD is a declarative continuous delivery tool for Kubernetes that leverages the GitOps pattern (i.e., storing configuration files in a Git repository to serve as the single source of truth). Instead of developers constantly typing kubectl apply -f manifest.yaml, ArgoCD monitors a specified Git repository for changes to manifests. ArgoCD applications can be configured to automatically update when deltas are detected or require manual intervention. Applications can be created using a GUI or through a YAML file. To get started, we install ArgoCD on the cluster and configure port forwarding to access the UI locally.

22-argocd/commands.sh
# install CLI
brew install argocd
# install ArgoCD on the cluster
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# get initial password
argocd admin initial-password -n argocd
# forward ports to access the ArgoCD UI locally
kubectl port-forward svc/argocd-server -n argocd 8080:443

Once we've navigated to the UI in the browser, we create an ArgoCD application using the following YAML:

22-argocd/argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: helm-webapp-dev
spec:
  destination:
    name: ''
    namespace: default
    server: https://kubernetes.default.svc
  source:
    path: helm-webapp
    # all credit to devopsjourney1 for the repository
    # https://github.com/devopsjourney1
    # https://www.youtube.com/@DevOpsJourney
    repoURL: https://github.com/scottenriquez/argocd-examples
    targetRevision: HEAD
    helm:
      valueFiles:
        - values-dev.yaml
  sources: []
  project: default
  syncPolicy:
    automated:
      prune: false
      selfHeal: false

Based on the configuration, the ArgoCD application will automatically be updated when we commit to the specified GitHub repository. Via the UI, we can monitor the resources that have been created, sync status, commit information, etc.

ArgoCD

23: cdk8s

The AWS Cloud Development Kit (CDK) is an open-source software development framework that brings the capabilities of general-purpose programming languages (e.g., unit testing, adding robust logic, etc.) to infrastructure as code. In addition to being more ergonomic for those with a software engineering background, CDK also provides higher levels of abstraction through constructs and patterns. HashiCorp also created a spinoff called CDK for Terraform (CDKTF). Using a similar design, AWS created a project called Cloud Development Kit for Kubernetes (cdk8s). Rather than managing the cloud infrastructure, cdk8s only manages the resources within a Kubernetes cluster. cdk8s synthesizes the TypeScript (or the language of your choice) into a YAML manifest file. Below is an example:

23-cdk8s/cluster/main.ts
// 'constructs' and 'cdk8s' are standard dependencies; './imports/k8s' contains the bindings generated by `cdk8s import`
import { Construct } from 'constructs';
import { App, Chart, ChartProps } from 'cdk8s';
import { KubeDeployment } from './imports/k8s';

export class MyChart extends Chart {
  constructor(scope: Construct, id: string, props: ChartProps = { }) {
    super(scope, id, props);
    new KubeDeployment(this, 'my-deployment', {
      spec: {
        replicas: 3,
        selector: { matchLabels: { app: 'frontend' } },
        template: {
          metadata: { labels: { app: 'frontend' } },
          spec: {
            containers: [
              {
                name: 'app-container',
                image: 'nginx:latest',
                ports: [{ containerPort: 80 }]
              }
            ]
          }
        }
      }
    });
  }
}

const app = new App();
new MyChart(app, 'cluster');
app.synth();

Which produces:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-my-deployment-c8e7fb18
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - image: nginx:latest
          name: app-container
          ports:
            - containerPort: 80
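To go from TypeScript to applied resources, the usual cdk8s workflow is roughly the following (a sketch; the scripts in the repository may differ):

# install dependencies and generate the Kubernetes API bindings (./imports/k8s)
npm install
cdk8s import
# synthesize the chart to dist/
cdk8s synth
# apply the generated manifest to the cluster
kubectl apply -f dist/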

24: OpenFaaS

OpenFaaS is a nifty project that allows you to run serverless functions on Kubernetes. We start by installing OpenFaaS to our cluster and as a CLI:

24-openfaas/commands.sh
# install CLI on local machine
# https://docs.openfaas.com/cli/install/
brew install faas-cli
# create namespace
kubectl apply -f namespace.yaml
# add Helm charts to cluster
helm repo add openfaas https://openfaas.github.io/faas-netes
helm install my-openfaas openfaas/openfaas --version 14.2.49 --namespace openfaas
# forward the API's port in a separate terminal tab
kubectl port-forward svc/gateway 8080 --namespace openfaas
# fetch password and log in
faas-cli login --password $(kubectl -n openfaas get secret basic-auth -o jsonpath="{.data.basic-auth-password}" | base64 --decode)

Next, we create a simple function that looks similar to AWS Lambda:

24-openfaas/commands.sh
# create a function
faas-cli new --lang python openfaas-python-function
# requires Docker running locally
faas-cli build -f openfaas-python-function.yml
# push to DockerHub
faas-cli publish -f openfaas-python-function.yml
# deploy to cluster
faas-cli deploy -f openfaas-python-function.yml
24-openfaas/openfaas-python-function/handler.py
def handle(request):
    return "Hello from OpenFaaS!"
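Once deployed, the function can also be invoked from the command line (assuming the port-forward to the gateway from earlier is still running):

# invoke via the CLI through the forwarded gateway
echo "" | faas-cli invoke openfaas-python-function
# or via HTTP
curl http://127.0.0.1:8080/function/openfaas-python-function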

Finally, we can invoke the function through the web UI:

OpenFaaS

Disclaimer

At the time of writing this blog post, I currently work for Amazon Web Services. The opinions and views expressed here are my own and not the views of my employer.

The Nature of Code Companion Series: Chapter One

· 8 min read
Scottie Enriquez
Senior Solutions Developer at Amazon Web Services

About the Book

Recently, I started reading a fantastic book called The Nature of Code by Daniel Shiffman. From the description:

How can we capture the unpredictable evolutionary and emergent properties of nature in software? How can understanding the mathematical principles behind our physical world help us to create digital worlds? This book focuses on a range of programming strategies and techniques behind computer simulations of natural systems, from elementary concepts in mathematics and physics to more advanced algorithms that enable sophisticated visual results. Readers will progress from building a basic physics engine to creating intelligent moving objects and complex systems, setting the foundation for further experiments in generative design.

Daniel implements numerous examples using a programming language called Processing. Instead, I decided to write my own versions using JavaScript, React, Three.js, and D3. For this blog series, I intend to implement my learnings from each chapter.

Previous Entries in the Blog Series

Source Code

The source code for this post is located on GitHub.

Introduction to Euclidean Vectors

The book references example code in the Processing programming language that simulates a bouncing ball in two-dimensional space. Below is the core logic:

Bounce.pde
void draw()
{
  xpos = xpos + (xspeed * xdirection);
  ypos = ypos + (yspeed * ydirection);
  // width-rad and rad refer to the boundaries of the screen
  if (xpos > width-rad || xpos < rad) {
    // invert direction if an edge has been hit
    xdirection *= -1;
  }
  if (ypos > height-rad || ypos < rad) {
    // invert direction if an edge has been hit
    ydirection *= -1;
  }
  ellipseMode(RADIUS);
  fill(random(256));
  ellipse(xpos, ypos, rad, rad);
}

This image from the Processing code output shows the ball's movement through vector space. The circle's path is tracked by selecting a random color for the ball on each iteration:

Bouncing Ball Processing

From Wikipedia:

In mathematics, physics, and engineering, a Euclidean vector or simply a vector (sometimes called a geometric vector or spatial vector) is a geometric object that has magnitude (or length) and direction. Euclidean vectors can be added and scaled to form a vector space. A Euclidean vector is frequently represented by a directed line segment, or graphically as an arrow connecting an initial point A with a terminal point B.

To expand this example to the third dimension, additional variables called zpos and zspeed are required. Obviously, this approach does not scale well to n dimensions since each needs new speed and position variables. While vectors alone don't expand the physics functionality (e.g., the circle's motion), they streamline and minimize the amount of code required to include new dimensions. In JavaScript, we can write a simple class to organize the components and implement vector operations such as addition.

src/components/NatureOfCode/One/NVector/nVector.js
class NVector {
  constructor(...components) {
    this.components = components;
  }

  get dimensions() {
    return this.components.length;
  }

  // assumes that the second vector has the same dimensions as the first
  add(otherVector) {
    return new NVector(
      ...this.components.map((component, index) => component + otherVector.components[index])
    );
  }
}

After instantiating two NVector objects, a third vector can be created to capture the sum. This vector addition is the basis for simulating motion:

let circleLocation = new NVector(1, 2, 3);
const circleVelocity = new NVector(4, 5, 6);
circleLocation = circleLocation.add(circleVelocity);
// { components: [5, 7, 9] }
console.log(circleLocation);

In other words (with location as l and velocity as v):

\overrightarrow{l} = \overrightarrow{l} + \overrightarrow{v}

Or:

l_{x} = l_{x} + v_{x}
l_{y} = l_{y} + v_{y}
l_{z} = l_{z} + v_{z}

Vector subtraction behaves the same way as addition:

src/components/NatureOfCode/One/NVector/nVector.js
// assumes that the second vector has the same dimensions as the first
subtract(otherVector) {
  return new NVector(
    ...this.components.map((component, index) => component - otherVector.components[index])
  );
}

For multiplication, there are both scalar and vector products:

src/components/NatureOfCode/One/NVector/nVector.js
scale(scalar) {
  return new NVector(...this.components.map(component => component * scalar));
}

// assumes that the second vector has the same dimensions as the first
dot(otherVector) {
  return this.components.reduce((sum, component, index) => sum + component * otherVector.components[index], 0);
}

Scalar multiplication can be written as (where n is a single number):

\overrightarrow{w} = \overrightarrow{u} * n

Or:

w_{x} = u_{x} * n
w_{y} = u_{y} * n
w_{z} = u_{z} * n

const u = new NVector(1, 3, 5);
const n = 3
const w = u.scale(n);
// { components: [3, 9, 15] }
console.log(w);

For dot products (where n is the dimension of vector space):

\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_{i} v_{i} = u_{1} v_{1} + \cdots + u_{n} v_{n}

const u = new NVector(1, 3, -5);
const v = new NVector(4, -2, -1);
const w = u.dot(v);
// 3
console.log(w);

Finally, a vector's magnitude (or length) can be calculated like so:

\|\mathbf{u}\| = \sqrt{u_{1}^{2} + \cdots + u_{n}^{2}}

This is useful for normalizing a vector (which we do in the bouncing sphere example later in the post):

\mathbf{\hat{u}} = \frac{\mathbf{u}}{\|\mathbf{u}\|}
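The NVector class above doesn't implement these two operations. A small sketch of magnitude and normalize methods that could be added to NVector (these helpers are mine, not the book's code):

// returns the Euclidean length of the vector
magnitude() {
  return Math.sqrt(this.components.reduce((sum, component) => sum + component * component, 0));
}

// returns a new unit vector pointing in the same direction
normalize() {
  const magnitude = this.magnitude();
  return new NVector(...this.components.map(component => component / magnitude));
}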

With this code in hand (plus a graphics library called Three.js), we can begin to model vectors in three-dimensional space:

// additionally, a line is drawn between the two vectors
const vectors = [
  // use the graphics library's implementation of a vector instead of NVector
  new THREE.Vector3(-10, -10, -10),
  new THREE.Vector3(10, 10, 10)
];

Bouncing Sphere

Using our knowledge of vectors and a Vector class (THREE.Vector3, which is our graphics library's implementation of our NVector class above), we can expand the bouncing ball Processing example into the third dimension. First, we create a sphere with a random starting position vector within the bounds of our space (i.e., -50 to 50).

src/components/NatureOfCode/One/BouncingSphere/boucingSphere.js
const generateSphere = () => {
  const x = Math.random() * 100 - 50;
  const y = Math.random() * 100 - 50;
  const z = Math.random() * 100 - 50;
  const sphereLocationVector = new THREE.Vector3(x, y, z);
  const sphereGeometry = new THREE.SphereGeometry(5, 32, 32);
  const sphereMaterial = new THREE.MeshStandardMaterial({color: 0x50fa7b, roughness: 0});
  const sphere = new THREE.Mesh(sphereGeometry, sphereMaterial);
  sphere.position.set(sphereLocationVector.x, sphereLocationVector.y, sphereLocationVector.z);
  return sphere;
};

Similarly to the draw function in the two-dimensional example above, the animate function uses the screen's boundaries as a signal to invert the direction of the sphere's motion by negating the corresponding component in the velocity vector:

src/components/NatureOfCode/One/BouncingSphere/sceneInit.js
generateVelocityVector() {
const x = (this.isXPositiveDirection ? 1 : -1) * 15;
const y = (this.isYPositiveDirection ? 1 : -1) * 15;
const z = (this.isZPositiveDirection ? 1 : -1) * 15;
return new THREE.Vector3(x, y, z);
}

animate() {
const sphere = this.scene.getObjectByName('sphere');
window.requestAnimationFrame(this.animate.bind(this));
if (sphere.position.x > this.topXPosition || sphere.position.x < this.bottomXPosition) {
this.isXPositiveDirection = !this.isXPositiveDirection;
}
if (sphere.position.y > this.topYPosition || sphere.position.y < this.bottomYPosition) {
this.isYPositiveDirection = !this.isYPositiveDirection;
}
if (sphere.position.z > this.topZPosition || sphere.position.z < this.bottomZPosition) {
this.isZPositiveDirection = !this.isZPositiveDirection;
}
const timeDelta = this.clock.getDelta();
const sphereVelocityVector = this.generateVelocityVector();
sphereVelocityVector.multiplyScalar(timeDelta);
sphereVelocityVector.normalize();
sphere.position.add(sphereVelocityVector);
this.render();
this.controls.update();
}

After leveraging the clock's time delta (i.e., the time elapsed since the last frame) for scalar multiplication and normalizing the sphere velocity vector, we see the sphere move through our three-dimensional environment.

Bouncing Sphere with Random Acceleration

In the first bouncing sphere model, the initial position is random. However, the path that the sphere takes is a deterministic loop. Next, we add random acceleration to the velocity. In other words, the final algorithm is (with \overrightarrow{u} as the initial velocity, \overrightarrow{a} as acceleration, and t as time):

\overrightarrow{v} = \overrightarrow{u} + \overrightarrow{a} * t

With the final velocity as \mathbf{v}:

\mathbf{\hat{v}} = \frac{\mathbf{v}}{\|\mathbf{v}\|}

With the normalized vector as \mathbf{\hat{v}} and location as \overrightarrow{l}:

\overrightarrow{l} = \overrightarrow{l} + \mathbf{\hat{v}}

Implemented in JavaScript, this is:

src/components/NatureOfCode/One/BouncingSphereWithAcceleration/sceneInit.js
generateVelocityVector() {
const x = (this.isXPositiveDirection ? 1 : -1) * 15;
const y = (this.isYPositiveDirection ? 1 : -1) * 15;
const z = (this.isZPositiveDirection ? 1 : -1) * 15;
return new THREE.Vector3(x, y, z);
}

generateRandomAccelerationVector() {
const x = Math.random() * 30 - 15;
const y = Math.random() * 30 - 15;
const z = Math.random() * 30 - 15;
return new THREE.Vector3(x, y, z);
}

animate() {
const sphere = this.scene.getObjectByName('sphere-acceleration');
window.requestAnimationFrame(this.animate.bind(this));
if (sphere.position.x > this.topXPosition || sphere.position.x < this.bottomXPosition) {
this.isXPositiveDirection = !this.isXPositiveDirection;
}
if (sphere.position.y > this.topYPosition || sphere.position.y < this.bottomYPosition) {
this.isYPositiveDirection = !this.isYPositiveDirection;
}
if (sphere.position.z > this.topZPosition || sphere.position.z < this.bottomZPosition) {
this.isZPositiveDirection = !this.isZPositiveDirection;
}
const timeDelta = this.clock.getDelta();
const sphereVelocityVector = this.generateVelocityVector();
const sphereAccelerationVector = this.generateRandomAccelerationVector();
sphereVelocityVector.multiplyScalar(timeDelta);
sphereVelocityVector.add(sphereAccelerationVector);
sphereVelocityVector.normalize();
sphere.position.add(sphereVelocityVector);
this.render();
this.controls.update();
}

Next Section

Chapter two examines forces and laws of motion.

AWS Billing Conductor SP/RI Benefit Utility

· 10 min read
Scottie Enriquez
Senior Solutions Developer at Amazon Web Services

About

This is a tool that I developed and open sourced at AWS. Find the latest in the GitHub repository here. It's released under MIT-0.

AWS Billing Conductor (ABC) Overview

AWS Billing Conductor is a priced service in the AWS billing suite designed to support showback and chargeback workflows for any AWS customer who needs to enforce visibility boundaries within their Organization or add custom rates unique to their business. This alternative version of the monthly bill is called a pro forma bill.

How Billing Conductor Handles Savings Plans (SPs) and Reserved Instances (RIs)

It’s important to note that AWS Billing Conductor does not change the application of SPs or RIs in the account’s billing family; it only affects how that application is presented in the pro forma views. To conceptualize the difference, consider the intention behind each product. When applying SPs and RIs, the AWS billing system prioritizes maximizing the discount benefit of each product to save customers the most money possible. When calculating pro forma costs, AWS Billing Conductor prioritizes creating the prescribed view for each billing group by enforcing strict visibility boundaries within the Organization.

By default, AWS Billing Conductor shares the benefits of Savings Plans and Reserved Instances that were purchased in a linked account belonging to a billing group with all accounts placed in the same billing group. However, benefits from any Savings Plans or Reserved Instances owned outside a billing group are not included in that billing group’s pro forma cost. A few examples of how SP and RI benefits will or will not appear in pro forma data:

  • A Savings Plan was purchased in the payer account, which is not in any billing group. Billing groups will not see any SP benefit in their pro forma view. Sharing purchases made outside of billing groups (e.g., in the payer account) is the primary use case for this tool.
  • Linked account 1 is in billing group A, and linked account 1 has received benefit from a Savings Plan or Reserved Instance that was purchased outside the billing group A in the consolidated bill (i.e., what the customer pays to AWS). Linked account 1 will not see any benefit in their pro forma view.
  • Linked account 2 owns an RI and is in billing group A. Linked account 2 consumed its own RI during the month. It will see the benefit of that RI in its pro forma view as well.
  • Linked account 2 owns an RI and is in billing group A. Linked account 2 did not have any usage that the RI could apply to and neither did any account in billing group A. In the consolidated bill (i.e., what the customer pays to AWS), the RI benefit was applied to linked account 3, which is not in billing group A. Linked account 2’s pro forma view will show the RI as unused.
  • Linked account 2 owns an RI and is in billing group A. Linked account 2 does not have usage that the RI could apply to. However, linked account 3 (also in billing group A) does have usage that the RI could apply to. In the consolidated bill (i.e., what the customer pays to AWS), the RI applied to linked account 4, which does not belong to billing group A. In the pro forma view, the RI was applied to linked account 3 because ABC constrains a commitment’s application to the billing group where it was purchased, regardless of whether sharing is enabled.

Utility Logic Overview

This utility shows how ABC custom line items can be used to distribute the benefits of SPs and RIs purchased outside of billing groups (e.g., in a payer account) to linked accounts belonging to billing groups. The solution's logic is as follows:

  • Trigger on the fifth of every month using EventBridge
  • Determine the date range of the previous billing period (i.e., the first and last day of the previous full month)
    • For example, if the current date is January 5th, the previous billing period would be December 1st to 31st
  • Get the account associations from Billing Conductor for the previous billing period (i.e., which accounts belonged to which billing group during the last full month)
  • Pull the number of EC2 running hours by instance type via the Cost Explorer API and calculate normalized hours based on normalization factor
    • Include normalized Fargate usage if the INCLUDE_FARGATE_FOR_SAVINGS_PLANS feature flag is enabled (disabled by default)
    • Include normalized Lambda usage if the INCLUDE_LAMBDA_FOR_SAVINGS_PLANS feature flag is enabled (disabled by default)
  • Pull the number of RDS running hours by instance type via the Cost Explorer API and calculate normalized hours based on normalization factor
  • Pull the net savings for the previous billing period per Savings Plan and Reserved Instance
  • Divide the net savings for each commitment proportionally across the linked accounts that belong to a billing group (see the sketch after this list)
    • Each linked account's percentage is its normalized usage divided by the total normalized usage for all accounts belonging to a billing group
  • Create a custom line per commitment per account
  • Write the custom line items to Billing Conductor if the DRY_RUN flag is disabled (enabled by default for testing purposes), otherwise only return the output (viewable in the Lambda Console)
  • If an error occurs, an SNS topic is notified
  • In regard to managed services, the initial solution only covers RDS, but includes code comments about how to expand to other services such as ElastiCache and OpenSearch
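
To make the proportional distribution step concrete, here is a minimal Python sketch. The account IDs, normalized hours, and net savings below are hypothetical placeholders; the real utility pulls these values from the Cost Explorer and Billing Conductor APIs.

# hypothetical normalized usage hours for the linked accounts in one billing group
normalized_hours_by_account = {
    '111111111111': 600.0,
    '222222222222': 300.0,
    '333333333333': 100.0,
}
# hypothetical net savings for a single commitment during the previous billing period
commitment_net_savings = 250.00

total_normalized_hours = sum(normalized_hours_by_account.values())
for account_id, hours in normalized_hours_by_account.items():
    # each account's percentage is its normalized usage divided by the group's total
    share = hours / total_normalized_hours
    credit = round(commitment_net_savings * share, 2)
    # the utility would then create a credit custom line item per commitment per account
    print(f'{account_id}: {share:.0%} of normalized usage -> {credit} USD')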

Architecture

abc-sp-ri-utility-architecture-diagram.png

The core functionality resides in a Lambda function built using AWS Serverless Application Model (SAM). The infrastructure is defined using a CloudFormation template with the following resources intended to be deployed in the payer account:

  • Lambda function using Python 3.12
  • EventBridge rule using a cron expression to trigger on the fifth day of every month (i.e., so that the bill for the previous month is finalized)
  • An SNS topic to subscribe to on errors
  • An IAM policy and execution role with the minimum required permissions

Minimum IAM Permissions Required

From the CloudFormation file:

template.yaml
Policies:
- Statement:
- Sid: BillingConductorAndCostExplorer
Effect: Allow
Action:
- billingconductor:ListAccountAssociations
- billingconductor:CreateCustomLineItem
- ce:GetCostAndUsage
- ce:GetReservationUtilization
- ce:GetSavingsPlansUtilizationDetails
- organizations:ListAccounts
Resource:
- '*'
- Statement:
- Sid: SNSPublishToFailureTopic
Effect: Allow
Action:
- sns:Publish
Resource:
- !Ref rLambdaFailureTopic

Local Setup

Creating a Virtual Environment

git clone git@github.com:aws-samples/aws-billing-conductor-sp-ri-benefit-utility.git
cd aws-billing-conductor-sp-ri-benefit-utility
python3.12 -m venv '.venv'
. .venv/bin/activate
pip install -r sam_sp_ri_utility/requirements.txt

Running Unit Tests

pytest

Deployment

Building Using AWS SAM

sam build

Deploying Using AWS SAM

First, ensure that local AWS credentials are configured correctly.

sam deploy --guided

Leveraging the Sample

By default, the Lambda function does not write the custom line items to Billing Conductor. To disable dry run mode, change the Lambda environment variable called DRY_RUN to Disabled either via the Console or CloudFormation template. Before doing so, we strongly recommend that you review what would have been written to ensure that the benefit distribution meets your business requirements. For feature ideas and/or questions that could apply to all ABC users, please open an issue in this repository. Contact your account team or open an AWS support case for 1:1 discussions that require specifics that cannot be shared publicly.

Edge Cases and Additional Considerations

  • This utility assumes that ABC has been configured and some linked accounts are associated with billing groups. The payer (or account where purchases are centrally made) must not belong to a billing group. It also assumes that there is at least one month of ABC data for each billing group given that it looks back to the previous month.
  • If some or all of the linked accounts that belong to billing groups have commitment purchases, be aware that these accounts would receive benefits in the pro forma data both from the purchases made at the linked account level and outside the billing group (e.g., in the payer) as well. However, this utility does not allocate any benefits for purchases made within a linked account belonging to a billing group since ABC does this already.
  • Spot usage is ignored by the EC2 and Fargate normalized usage calculation functions. This is intended to mirror the way that Savings Plans and Reserved Instances are applied by AWS billing systems. Unused ODCRs (i.e., usage types containing UnusedBox and UnusedDed) are also excluded from the total eligible usage and every linked account's eligible usage.
  • Not all sizes (e.g., m6i.metal) are contained in the normalization map. If an instance size is not found, the normalization factor defaults to 1.0 (see the sketch after this list). In addition, only size is currently considered. Users may also want to customize the weights based on instance family (e.g., for GPU usage).
  • It is possible that a commitment can have negative net savings due to low utilization. If this occurs, the negative value will be distributed to the eligible linked accounts as a fee.
  • The distribution logic does not match benefits to usage (e.g., only applying the benefits for an RDS RI to accounts with the region, instance type, etc. specified by the commitment). This aims to support centralized purchasing strategies.
  • In regard to non-US currency payers, since the sample utility distributes benefits based on normalized usage hours, it is not reliant on any one currency for its calculation logic (other than net savings). However, we recommend that every customer using the utility, regardless of the currency they use, validates that the custom line items produce the expected results.
  • Fargate and Lambda usage are ignored by default because Compute Savings Plans cover EC2 usage first due to higher savings percentage over On-Demand. To enable these features, change the INCLUDE_FARGATE_FOR_SAVINGS_PLANS and/or INCLUDE_LAMBDA_FOR_SAVINGS_PLANS Lambda environment variable(s) to Enabled.
  • Lambda has a 15-minute maximum timeout. If the function cannot complete within that time period, the code may need to leverage a different offering like Fargate that can support longer run times.
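
To illustrate the size-based normalization mentioned above, here is a minimal sketch using the standard AWS size normalization factors and the 1.0 default for unknown sizes; the actual map and weighting in the repository may differ:

# standard AWS size normalization factors (small = 1.0); not exhaustive
NORMALIZATION_FACTORS = {
    'nano': 0.25, 'micro': 0.5, 'small': 1.0, 'medium': 2.0,
    'large': 4.0, 'xlarge': 8.0, '2xlarge': 16.0, '4xlarge': 32.0,
}

def normalized_hours(instance_type, running_hours):
    # 'm6i.large' -> 'large'; unknown sizes (e.g., 'metal') default to 1.0
    size = instance_type.split('.')[-1]
    return running_hours * NORMALIZATION_FACTORS.get(size, 1.0)

print(normalized_hours('m6i.large', 100))  # 400.0
print(normalized_hours('m6i.metal', 100))  # 100.0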

Cost

For a complete list of resources deployed by this utility, see the template.yaml file. The Lambda function leverages ARM and runs once per month by default. The number of seconds will vary based on the environment. See the Lambda pricing for details. The core logic leverages the Cost Explorer API for the following:

  • Pulling Savings Plans and Reserved Instances utilization
  • Fetching cost and usage for EC2, RDS, Lambda, and Fargate

Each Cost Explorer API request costs 0.01 USD. Monitor these costs via Cost Explorer by filtering to the Cost Explorer service and/or by API operation (i.e., GetCostAndUsage, GetReservationUtilization, GetSavingsPlansUtilDetails, and GetSavingsPlansUtilization). The unit tests located in the sam_sp_ri_utility/test directory mock API calls by patching SDK methods. To test locally without incurring costs, modify these Python objects to emulate API calls.
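
The response shape below is a minimal, hypothetical GetCostAndUsage payload (placeholder account ID, usage type, and dates); the repository's tests patch the real boto3 client with unittest.mock, but a standalone MagicMock demonstrates the same pattern:

from unittest.mock import MagicMock

# stand-in for the boto3 Cost Explorer client used by the function code
mock_cost_explorer = MagicMock()
mock_cost_explorer.get_cost_and_usage.return_value = {
    'ResultsByTime': [{
        'TimePeriod': {'Start': '2023-12-01', 'End': '2024-01-01'},
        'Groups': [{
            'Keys': ['111111111111', 'BoxUsage:m6i.large'],
            'Metrics': {'UsageQuantity': {'Amount': '100', 'Unit': 'Hrs'}}
        }]
    }]
}

# code under test that calls get_cost_and_usage receives the canned response instead of a billed API call
response = mock_cost_explorer.get_cost_and_usage(
    TimePeriod={'Start': '2023-12-01', 'End': '2024-01-01'},
    Granularity='MONTHLY',
    Metrics=['UsageQuantity']
)
print(response['ResultsByTime'][0]['Groups'][0]['Metrics']['UsageQuantity']['Amount'])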

Writing Optimized Functions Using AWS Lambda Power Tuning

· 8 min read
Scottie Enriquez
Senior Solutions Developer at Amazon Web Services

Solution Overview

As I wrote about previously, AWS users are shifting left on costs using DevOps and automation. While tools like Infracost are powerful for estimating costs for Lambda and other services, they alone do not provide optimization or tuning feedback during the development lifecycle. This is where a tool like AWS Lambda Power Tuning assists:

AWS Lambda Power Tuning is an open-source tool that can help you visualize and fine-tune the memory and power configuration of Lambda functions. It runs in your own AWS account, powered by AWS Step Functions, and it supports three optimization strategies: cost, speed, and balanced.

Lambda pricing is determined by the number of invocations and the execution duration. There are several strategies for decreasing duration costs including using Graviton for 20% savings (which this solution does for both Lambda and CodeBuild), leveraging the latest runtime versions, taking advantage of execution reuse, etc. In addition to these, optimizing memory allocation is a key mechanism for efficiency. From the documentation:

The [duration] price depends on the amount of memory you allocate to your function. In the AWS Lambda resource model, you choose the amount of memory you want for your function and are allocated proportional CPU power and other resources. An increase in memory size triggers an equivalent increase in CPU available to your function.

Without running the Lambda function with different configurations, it is unclear which memory allocation is optimal for cost and/or performance. This solution demonstrates how to integrate AWS Lambda Power Tuning with a CodeSuite CI/CD pipeline to bring Lambda tuning information into the pull request process and code review discussion. The source code is hosted on GitHub.
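
To see why memory tuning matters, consider a rough duration-cost calculation; the ARM rate below is illustrative and should be checked against current Lambda pricing, and the duration values are hypothetical:

# illustrative ARM duration rate in USD per GB-second; confirm against current Lambda pricing
RATE_PER_GB_SECOND = 0.0000133334

def monthly_duration_cost(memory_mb, average_duration_ms, invocations):
    gb_seconds = (memory_mb / 1024) * (average_duration_ms / 1000) * invocations
    return gb_seconds * RATE_PER_GB_SECOND

# more memory means more CPU, so the duration usually drops as memory rises;
# if doubling memory halves the duration, the cost is unchanged, but if the
# duration only drops by 25%, the larger size costs roughly 50% more
print(monthly_duration_cost(128, 200, 1_000_000))  # ~0.33 USD
print(monthly_duration_cost(256, 100, 1_000_000))  # ~0.33 USD
print(monthly_duration_cost(256, 150, 1_000_000))  # ~0.50 USD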

Solution Architecture

Diagram

This solution deploys several resources:

  • The AWS Lambda Power Tuning application
  • A CodeCommit repository preloaded with Terraform code for a Lambda function to tune
  • A CodeBuild project triggered by pull request state changes that invokes the AWS Lambda Power Tuning state machine
  • A CodePipeline with manual approvals to deploy the Terraform for changes pushed to the main branch
  • An S3 bucket to store Terraform state remotely
  • An S3 bucket to store CodePipeline artifacts

Preparing Your Development Environment

While this solution is for writing and deploying Terraform HCL syntax, I wrote the infrastructure code for the deployment pipeline and dependent resources using AWS CDK, which is my daily driver for infrastructure as code. I intentionally used Terraform for the target Lambda function to clearly differentiate between the code for resources managed by the pipeline and the pipeline itself.

The following dependencies are required to deploy the pipeline infrastructure:

Rather than installing Node.js, CDK, Terraform, and all other dependencies on your local machine, you can alternatively create a Cloud9 IDE with these pre-installed via the Console or with a CloudFormation template:

Resources:
rCloud9Environment:
Type: AWS::Cloud9::EnvironmentEC2
Properties:
AutomaticStopTimeMinutes: 30
ConnectionType: CONNECT_SSH
Description: Environment for writing and deploying CDK
# AWS Free Tier eligible
InstanceType: t2.micro
Name: PowerTuningCDKPipelineCloud9Environment
# https://docs.aws.amazon.com/cloud9/latest/user-guide/vpc-settings.html#vpc-settings-create-subnet
SubnetId: subnet-EXAMPLE

Installation and Deployment

To install and deploy the pipeline, use the following commands:

git clone https://github.com/scottenriquez/lambda-power-tuned.git
cd lambda-power-tuned
python3 -m venv .venv
. .venv/bin/activate
cd lambda_power_tuned
pip install -r requirements.txt
# https://docs.aws.amazon.com/cdk/v2/guide/bootstrapping.html
cdk bootstrap
cdk deploy

Using the Deployment Pipeline

The CodePipeline is triggered at creation, but there are manual approval stages to prevent any infrastructure from being created without intervention. Feel free to deploy the Terraform, but it is not required for generating tuning information via a pull request. The pipeline is also triggered by changes to main.

Pipeline

Next, make some code changes to see the performance impact. To modify the Lambda code, either use the CodeCommit GUI in the Console or clone the repository to your development environment. First, create a branch called feature off of main. Then make some kind of code change, commit to feature, and open a pull request. This automatically triggers the build, which does the following:

  • Add a comment to the pull request with a hyperlink back to the CodeBuild run
  • Initialize Terraform against the deployment state to detail resources changed relative to main
  • Add a comment to the pull request with the resource_changes property from the Terraform plan
  • Reinitialize the environment to create a transient deployment of the feature branch infrastructure to leverage for tuning purposes
  • Generate an input file for AWS Lambda Power Tuning
  • Run the execute-power-tuning.sh Bash code to invoke the state machine and capture results
  • Add a comment to the pull request with a hyperlink to the tuning results for easy consumption

PR

The results are encoded into the query string of the hyperlink, so the tuning results can easily be shared. As shown by the results of the example function included in the repository, 128MB is the cheapest configuration.

Diving Into the Pull Request Build Logic

The Python code for describing the deployment pipeline lives in lambda_power_tuned_stack.py. The build logic is spread across the pull request project's buildspec and a Bash script residing in the CodeCommit repository. The CodeBuild logic is responsible for creating and destroying the transient testing environment, while execute-power-tuning.sh contains the specific logic needed to tune the target Lambda function(s). The following code snippets (with comments explaining the build phase) contain the core logic for integrating AWS Lambda Power Tuning into the pull request:

lambda_power_tuned/lambda_power_tuned/lambda_power_tuned_stack.py
pull_request_codebuild_project = aws_codebuild.Project(self, 'PullRequestCodeBuildProject',
build_spec=aws_codebuild.BuildSpec.from_object({
'version': '0.2',
'phases': {
'install': {
'commands': [
'git checkout $CODEBUILD_SOURCE_VERSION',
'yum -y install unzip util-linux jq',
f'wget https://releases.hashicorp.com/terraform/{terraform_version}/terraform_{terraform_version}_linux_arm64.zip',
f'unzip terraform_{terraform_version}_linux_arm64.zip',
'mv terraform /usr/local/bin/',
'export BUILD_UUID=$(uuidgen)'
]
},
'build': {
'commands': [
'aws codecommit post-comment-for-pull-request --repository-name $REPOSITORY_NAME --pull-request-id $PULL_REQUEST_ID --content \"The pull request CodeBuild project has been triggered. See the [logs for more details]($CODEBUILD_BUILD_URL).\" --before-commit-id $SOURCE_COMMIT --after-commit-id $DESTINATION_COMMIT',
# create plan against the production function (i.e., what is currently in main)
f'terraform init -backend-config="bucket={terraform_state_s3_bucket.bucket_name}"',
'terraform plan -out tfplan-pr-$BUILD_UUID.out',
# format plan output into Markdown
'terraform show -json tfplan-pr-$BUILD_UUID.out > plan-$BUILD_UUID.json',
'echo "\`\`\`json\n$(cat plan-$BUILD_UUID.json | jq \'.resource_changes\')\n\`\`\`" > plan-formatted-$BUILD_UUID.json',
# write plan to the pull request comments
# limit to 10,000 bytes due to the CodeCommit pull request content limit
'aws codecommit post-comment-for-pull-request --repository-name $REPOSITORY_NAME --pull-request-id $PULL_REQUEST_ID --content \"Terraform resource changes:\n$(cat plan-formatted-$BUILD_UUID.json | head -c 10000)\" --before-commit-id $SOURCE_COMMIT --after-commit-id $DESTINATION_COMMIT',
# reinitialize and create a new state file to manage the transient environment for performance tuning
f'terraform init -reconfigure -backend-config="bucket={terraform_state_s3_bucket.bucket_name}" -backend-config="key=pr-$BUILD_UUID.tfstate"',
'terraform apply -auto-approve',
# execute the state machine and get tuning results
# defer tuning logic and configuration to the repository for developer customization
'sh execute-power-tuning.sh',
# destroy transient environment
'terraform destroy -auto-approve'
]
}
}
}),
source=aws_codebuild.Source.code_commit(
repository=lambda_repository),
badge=True,
environment=aws_codebuild.BuildEnvironment(
build_image=aws_codebuild.LinuxBuildImage.AMAZON_LINUX_2_ARM_3,
environment_variables={
'REPOSITORY_NAME': aws_codebuild.BuildEnvironmentVariable(
value=lambda_repository.repository_name),
'STATE_MACHINE_ARN': aws_codebuild.BuildEnvironmentVariable(
value=power_tuning_tools_application.get_att('Outputs.StateMachineARN').to_string())
},
compute_type=aws_codebuild.ComputeType.SMALL,
privileged=True
),
role=terraform_apply_codebuild_iam_role)

Since the CodeBuild project does not have contextual awareness of what the Terraform HCL in the CodeCommit repository is describing (e.g., how many Lambda functions exist), the developer implements the tuning logic in execute-power-tuning.sh. For this example, the script simply grabs the Lambda ARN, formats the AWS Lambda Power Tuning input file, and executes the state machine. However, this logic could be expanded for multiple Lambda functions and other use cases.

lambda_power_tuned/lambda_power_tuned/terraform/execute-power-tuning.sh
#!/bin/bash
# obtain ARN from Terraform and build input file
TARGET_LAMBDA_ARN=$(terraform output -raw arn)
echo $(jq --arg arn $TARGET_LAMBDA_ARN '. += {"lambdaARN" : $arn}' power-tuning-input.json) > power-tuning-input-$BUILD_UUID.json
POWER_TUNING_INPUT_JSON=$(cat power-tuning-input-$BUILD_UUID.json)
# start execution
EXECUTION_ARN=$(aws stepfunctions start-execution --state-machine-arn $STATE_MACHINE_ARN --input "$POWER_TUNING_INPUT_JSON" --query 'executionArn' --output text)
echo -n "Execution started..."
# poll execution status until completed
while true;
do
# retrieve execution status
STATUS=$(aws stepfunctions describe-execution --execution-arn $EXECUTION_ARN --query 'status' --output text)
if test "$STATUS" == "RUNNING"; then
# keep looping and wait if still running
echo -n "."
sleep 1
elif test "$STATUS" == "FAILED"; then
# exit if failed
echo -e "\nThe execution failed, you can check the execution logs with the following script:\naws stepfunctions get-execution-history --execution-arn $EXECUTION_ARN"
break
else
# print execution output if succeeded
echo $STATUS
echo "Execution output: "
# retrieve output
aws stepfunctions describe-execution --execution-arn $EXECUTION_ARN --query 'output' --output text > power-tuning-output-$BUILD_UUID.json
break
fi
done
# get output URL and comment on pull request
POWER_TUNING_OUTPUT_URL=$(cat power-tuning-output-$BUILD_UUID.json | jq -r '.stateMachine .visualization')
aws codecommit post-comment-for-pull-request --repository-name $REPOSITORY_NAME --pull-request-id $PULL_REQUEST_ID --content "Lambda tuning is complete. See the [results for full details]($POWER_TUNING_OUTPUT_URL)." --before-commit-id $SOURCE_COMMIT --after-commit-id $DESTINATION_COMMIT

Lastly, note that there is an AWS Lambda Power Tuning input file included in the CodeCommit repository that can be modified as well. The "lambdaARN" property is excluded because it will be dynamically added by the build for the transient environment. For more details on the input and output configurations, see the documentation on GitHub.

lambda_power_tuned/lambda_power_tuned/terraform/power-tuning-input.json
{
"powerValues": [
128,
256,
512,
1024
],
"num": 50,
"payload": {},
"parallelInvocation": true,
"strategy": "cost"
}

Cleanup

If you deployed resources via the deployment pipeline, be sure to either use the DestroyTerraform CodeBuild project or run:

# set the bucket name variable or replace with a value
# the bucket name nomenclature is 'terraform-state-' followed by a UUID
# this can also be found via the Console
terraform init -backend-config="bucket=$TERRAFORM_STATE_S3_BUCKET_NAME"
terraform destroy

To destroy the pipeline itself, run:

cdk destroy

If you spun up a Cloud9 environment, be sure to delete that as well.

Disclaimer

At the time of writing this blog post, I currently work for Amazon Web Services. The opinions and views expressed here are my own and not the views of my employer.

AWS re:Invent 2022

· 10 min read
Scottie Enriquez
Senior Solutions Developer at Amazon Web Services

Overview

I learn best by doing, so with every release cycle, I take the time to build fully functional examples and digest the blog posts and video content. Below are some of my favorite releases from re:Invent 2022. You can find all source code in this GitHub repository.

Compute Optimizer Third-Party Metrics

Compute Optimizer is a powerful and free offering from AWS that analyzes resource usage and provides recommendations. Most commonly, it produces rightsizing and termination opportunities for EC2 instances. However, in my experience, the most significant limitation for customers is that Compute Optimizer does not factor memory or disk utilization into findings by default. As a result, AWS customers that publish these metrics to CloudWatch have their findings enhanced, but customers who use third-party tools to capture memory and disk utilization did not. At re:Invent, AWS announced third-party metric support for Compute Optimizer, including Datadog.

To test this new feature, we need a few things:

  • Compute Optimizer enabled for the proper AWS account(s)
  • Datadog AWS integration enabled
  • An EC2 instance (i.e., candidate for rightsizing) with the Datadog agent installed

First, opt in to Compute Optimizer in your AWS account. Next, enable AWS integration in your Datadog account. This can be done in an automated fashion via a CloudFormation stack. It's also worth noting that Datadog offers a 14-day free trial.
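
If you prefer to script the Compute Optimizer opt-in rather than use the Console, a minimal boto3 sketch (assuming credentials for the target account are already configured) looks like this:

import boto3

# opt the current account in to Compute Optimizer
compute_optimizer = boto3.client('compute-optimizer')
response = compute_optimizer.update_enrollment_status(status='Active')
print(response['status'])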

datadog-aws-integration.png

Back in the AWS Console for Compute Optimizer, select Datadog as an external metrics ingestion source.

compute-optimizer-third-party.png

Lastly, we need to deploy an EC2 instance. The following CDK stack creates a VPC, EC2 instance (t3.medium; be aware of charges) with the Datadog agent installed, security group, and an IAM role. Before deploying the stack, be sure to set DD_API_KEY and DD_SITE environment variables. The EC2 instance, role, and security group are also configured for Instance Connect.

ec2-instance-with-datadog/lib/ec2-instance-with-datadog-stack.ts
export class Ec2InstanceWithDatadogStack extends cdk.Stack {
constructor(scope: Construct, id: string, props?: cdk.StackProps) {
super(scope, id, props);

// networking
const vpc = new ec2.Vpc(this, 'VPC', {
ipAddresses: ec2.IpAddresses.cidr('10.0.0.0/16'),
natGateways: 0
});
const selection = vpc.selectSubnets({
// using public subnets as to not incur NAT Gateway charges
subnetType: ec2.SubnetType.PUBLIC
});
const datadogInstanceSecurityGroup = new ec2.SecurityGroup(this, 'datadog-instance-sg', {
vpc: vpc,
allowAllOutbound: true,
});
// IP range for EC2 Instance Connect
datadogInstanceSecurityGroup.addIngressRule(ec2.Peer.ipv4('18.206.107.24/29'), ec2.Port.tcp(22), 'allow SSH access for EC2 Instance Connect');

// IAM
const datadogInstanceRole = new iam.Role(this, 'datadog-instance-role', {
assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'),
managedPolicies: [
iam.ManagedPolicy.fromAwsManagedPolicyName('EC2InstanceConnect'),
],
});

// EC2 instance
const userData = ec2.UserData.forLinux();
userData.addCommands(
'sudo yum install -y ec2-instance-connect',
// set these environment variables with your Datadog API key and site
`DD_API_KEY=${process.env.DD_API_KEY} DD_SITE="${process.env.DD_SITE}" bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"`,
);
const ec2Instance = new ec2.Instance(this, 'ec2-instance', {
vpc: vpc,
vpcSubnets: {
subnetType: ec2.SubnetType.PUBLIC,
},
role: datadogInstanceRole,
securityGroup: datadogInstanceSecurityGroup,
// note: this will incur a charge
instanceType: ec2.InstanceType.of(
ec2.InstanceClass.T3,
ec2.InstanceSize.MEDIUM,
),
machineImage: new ec2.AmazonLinuxImage({
generation: ec2.AmazonLinuxGeneration.AMAZON_LINUX_2,
}),
userData: userData
});
}
}

Once successfully deployed, metrics for the EC2 instance will appear in your Datadog account.

datadog-ec2-metrics.png

Finally, wait up to 30 hours for a finding to appear in Compute Optimizer with the proper third-party APM metrics.

AWS Lambda SnapStart

Cold starts are one of the most common drawbacks of serverless adoption. Specific runtimes, such as Java, are more affected by this, especially in conjunction with frameworks like Spring Boot. SnapStart aims to address this:

After you enable Lambda SnapStart for a particular Lambda function, publishing a new version of the function will trigger an optimization process. The process launches your function and runs it through the entire Init phase. Then it takes an immutable, encrypted snapshot of the memory and disk state, and caches it for reuse. When the function is subsequently invoked, the state is retrieved from the cache in chunks on an as-needed basis and used to populate the execution environment. This optimization makes invocation time faster and more predictable, since creating a fresh execution environment no longer requires a dedicated Init phase.

For now, SnapStart only supports the Java runtime.

With the release came support via CloudFormation and CDK. However, at the time of writing, CDK only supports SnapStart via the L1 construct: CfnFunction. The L2 Function class does not yet have support, so this may be a temporary blocker for CDK projects. Using CDK, I wrote a simple stack to test a trivial function:

java11-snapstart-lambda/lib/java11-snapstart-lambda-stack.ts
export class Java11SnapstartLambdaStack extends cdk.Stack {
constructor(scope: Construct, id: string, props?: cdk.StackProps) {
super(scope, id, props);
// artifact bucket and ZIP deployment
const artifactBucket = new s3.Bucket(this, 'ArtifactBucket');
const artifactDeployment = new s3Deployment.BucketDeployment(this, 'DeployFiles', {
sources: [s3Deployment.Source.asset('./artifact')],
destinationBucket: artifactBucket,
});

// IAM role
const lambdaExecutionRole = new iam.Role(this, 'LambdaExecutionRole', {
assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
});
lambdaExecutionRole.addManagedPolicy(iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSLambdaBasicExecutionRole'));

// Lambda functions
const withSnapStart = new lambda.CfnFunction(this, 'WithSnapStart', {
code: {
s3Bucket: artifactDeployment.deployedBucket.bucketName,
s3Key: 'corretto-test.zip'
},
functionName: 'withSnapStart',
handler: 'example.Hello::handleRequest',
role: lambdaExecutionRole.roleArn,
runtime: 'java11',
snapStart: { applyOn: 'PublishedVersions' }
});
const withoutSnapStart = new lambda.CfnFunction(this, 'WithoutSnapStart', {
code: {
s3Bucket: artifactDeployment.deployedBucket.bucketName,
s3Key: 'corretto-test.zip'
},
functionName: 'withoutSnapStart',
handler: 'example.Hello::handleRequest',
role: lambdaExecutionRole.roleArn,
runtime: 'java11'
});
}
}

In Jeff Barr's post, he used a Spring Boot function and achieved significant performance benefits. Next, I wanted to see if there were any benefits to a barebones Java 11 function, given that there is no additional charge for SnapStart. With a few tests, I reproduced a slight decrease in total duration.

Cold start without SnapStart (577.84 milliseconds): without-snapstart.png

Cold start with SnapStart (537.94 milliseconds): with-snapstart.png

A few cold start tests are hardly conclusive, but I'm excited to see how AWS customers' performance and costs fare at scale. One thing to note is that in both my testing and the Jeff Barr example, the billed duration increased with SnapStart while the total duration decreased (i.e., this may be faster but come with an indirect cost).

AWS CodeCatalyst

I started my career as a .NET developer writing C#. My first experience with professional software development involved using Team Foundation Server. Even when I was a consultant focused on AWS about a year ago, many of the customers I worked with primarily used Azure DevOps to manage code, CI/CD pipelines, etc. While it may seem strange to use a Microsoft tool for AWS, the developer experience felt more unified than AWS CodeSuite in my opinion. CodeCommit, CodeBuild, and CodePipeline feel like entirely separate services within the AWS Console. While they are easily integrated via automation like CloudFormation or CDK, navigating between the services in the UI often takes several clicks.

Enter CodeCatalyst. In addition to the release blog post, there is an excellent AWS Developers podcast episode outlining the vision for the product. I'm paraphrasing, but these are the four high-level problems that CodeCatalyst aims to solve in addition to the feedback above:

  • Setting up the project itself
  • Setting up CI/CD
  • Setting up infrastructure and environments
  • Onboarding new developers

CodeCatalyst does not live in the AWS Console. It's a separate offering that integrates via Builder ID authentication. While CodeCatalyst can be used to create resources that reside in an account (i.e., via infrastructure as code), the underlying repositories, pipelines, etc. that power the developer experience are not exposed to the user. In addition to this, the team recognized that many customers have at least some of the tooling in place that CodeCatalyst provides. As such, it supports third-party integration for various components (e.g., Jira for issues, GitHub for a repository, GitHub Actions for CI/CD, etc.).

One of the most compelling features of CodeCatalyst is blueprints. Blueprints aim to provide fully functional starter kits encapsulating useful defaults and best practices. For example, I chose the .NET serverless blueprint that provisioned a Lambda function's source code and IaC in a Git repository with a CI/CD pipeline.

codecatalyst.png

AWS Application Composer Preview

Application Composer is a new service from AWS that allows developers to map out select resources using a GUI with the feel of an architecture diagram. These resources can be connected to one another (e.g., an EventBridge Schedule to trigger Lambda). A subset of attributes can also be modified, such as the Lambda runtime.

aws-application-composer.png

While the creation process is UI-driven, the output is a SAM template (i.e., a CloudFormation template with a Transform statement). For example, the diagram above creates the following:

Transform: AWS::Serverless-2016-10-31
Resources:
Bucket:
Type: AWS::S3::Bucket
Properties:
BucketName: !Sub ${AWS::StackName}-bucket-${AWS::AccountId}
BucketEncryption:
ServerSideEncryptionConfiguration:
- ServerSideEncryptionByDefault:
SSEAlgorithm: aws:kms
KMSMasterKeyID: alias/aws/s3
PublicAccessBlockConfiguration:
IgnorePublicAcls: true
RestrictPublicBuckets: true
BucketBucketPolicy:
Type: AWS::S3::BucketPolicy
Properties:
Bucket: !Ref Bucket
PolicyDocument:
Id: RequireEncryptionInTransit
Version: '2012-10-17'
Statement:
- Principal: '*'
Action: '*'
Effect: Deny
Resource:
- !GetAtt Bucket.Arn
- !Sub ${Bucket.Arn}/*
Condition:
Bool:
aws:SecureTransport: 'false'
S3Function:
Type: AWS::Serverless::Function
Properties:
Description: !Sub
- Stack ${AWS::StackName} Function ${ResourceName}
- ResourceName: S3Function
CodeUri: src/Function
Handler: index.handler
Runtime: nodejs18.x
MemorySize: 3008
Timeout: 30
Tracing: Active
Events:
Bucket:
Type: S3
Properties:
Bucket: !Ref Bucket
Events:
- s3:ObjectCreated:*
- s3:ObjectRemoved:*
S3FunctionLogGroup:
Type: AWS::Logs::LogGroup
DeletionPolicy: Retain
Properties:
LogGroupName: !Sub /aws/lambda/${S3Function}

This service has the potential to offer the best of both worlds: an easy-to-use GUI and a deployable artifact. There's a clear focus on serverless design for now, but I'd like to see if this expands to other areas (e.g., VPC design). It's also worth noting that Application Composer uses the browser's file API in Google Chrome and Microsoft Edge to save the latest template changes locally. I'd also love to see CDK L2 construct support here in addition to CloudFormation.

Amazon RDS Managed Blue/Green Deployments

When updating databases, using a blue/green deployment technique is an appealing option for users to minimize risk and downtime. This method of making database updates requires two database environments: your current production environment, or blue environment, and a staging environment, or green environment.

I find this release particularly valuable, given that many AWS customers are trying to maximize their use of Graviton for managed services, including RDS. Graviton processors are designed by AWS and achieve significant price-performance improvements. They also offer savings versus Intel chips. Typically, the adoption of Graviton for EC2 is a high-lift engineering activity since code and dependencies must support ARM. However, with managed services, AWS handles software dependency management. This makes RDS an excellent candidate for Graviton savings. Due to the stateful nature of databases, changes introduce additional risks. Blue/Green Deployments mitigate much of this risk by having two fully functional environments coexisting.

To test this feature, I provisioned a MySQL RDS instance with an older engine version on an Intel instance type with a previous-generation general-purpose SSD. A Blue/Green Deployment can then be created via the Console or CLI, which spawns a second instance. I then modified the Green instance to use gp3 storage, a Graviton instance type (db.t4g.medium), and the latest version of MySQL.
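
For reference, creating the deployment programmatically looks roughly like the boto3 sketch below; the identifiers and target engine version are placeholders, and I used the Console for my own test:

import boto3

rds = boto3.client('rds')
# 'Source' is the ARN of the existing (blue) instance; all values are placeholders
response = rds.create_blue_green_deployment(
    BlueGreenDeploymentName='mysql-upgrade',
    Source='arn:aws:rds:us-east-1:111111111111:db:blue-mysql-instance',
    TargetEngineVersion='8.0.31'
)
print(response['BlueGreenDeployment']['Status'])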

rds-blue-green-deployment.png

Once the Green instance modifications were finished, I then switched over the instances.

rds-blue-green-switch-over.png

Amazon CodeWhisperer Support for C# and TypeScript

CodeWhisperer, Amazon's response to GitHub Copilot, is described as an ML-powered coding companion. I had yet to test the preview, but this release is relevant to me, given I write mostly TypeScript and C# these days. Moreover, TypeScript is particularly interesting to the cloud community, given that it is the de facto standard for CDK as the first language supported. CodeWhisperer is available as part of the AWS Toolkit for Visual Studio Code and the JetBrains suite, but I opted to give it a test run in Cloud9, AWS's cloud-based IDE.

amazon-codewhisperer.png

CodeWhisperer is proficient at generating code against the AWS SDK, such as functions to stop an EC2 instance or fetch objects from an S3 bucket. With regard to CDK, it generated simple constructs sufficiently for me. However, CodeWhisperer tended to generate recommendations line-by-line instead of in large blocks for larger and more complex constructs. In addition, the recommendations seemed to be context-aware (i.e., recommending valid properties and methods based on class definitions). These two use cases alone provide a great deal of opportunity since most of the time I spend writing code with the AWS SDK and CDK tends to be spent reading documentation.

The Nature of Code Companion Series: Introduction Chapter

· 4 min read
Scottie Enriquez
Senior Solutions Developer at Amazon Web Services

About the Book

Recently, I started reading a fantastic book called The Nature of Code by Daniel Shiffman. From the description:

How can we capture the unpredictable evolutionary and emergent properties of nature in software? How can understanding the mathematical principles behind our physical world help us to create digital worlds? This book focuses on a range of programming strategies and techniques behind computer simulations of natural systems, from elementary concepts in mathematics and physics to more advanced algorithms that enable sophisticated visual results. Readers will progress from building a basic physics engine to creating intelligent moving objects and complex systems, setting the foundation for further experiments in generative design.

Daniel implements numerous examples using a programming language called Processing. Instead, I decided to write my own versions using JavaScript, React, Three.js, and D3. For this blog series, I intend to implement my learnings from each chapter. This first post covers the introduction section of the book.

Random Walk

A random walk traces a path through a Cartesian plane going in a random direction with each step (i.e., one pixel). The walks are built by plotting individual pixels as rectangles in Scalable Vector Graphics (SVGs). The program starts at (200, 400) for each walk to represent the center of the Cartesian plane. The walk function chooses a random direction and updates the internal state to indicate that a step has been taken.

walk(pixels) {
const step = Math.floor(Math.random() * 4);
switch (step) {
case 0:
this.coordinates.x++;
break;
case 1:
this.coordinates.x--;
break;
case 2:
this.coordinates.y++;
break;
default:
this.coordinates.y--;
break;
}
pixels.push({
x: this.coordinates.x,
y: this.coordinates.y
});
}

The walkWeightedRight function illustrates the same functionality but with a non-uniform distribution. In this code, there's a 70% chance of moving to the right.

walkWeightedRight(pixels) {
const step = Math.floor(Math.random() * 10);
if (step <= 6) {
this.coordinates.x++;
}
else if (step === 7) {
this.coordinates.x--;
}
else if (step === 8) {
this.coordinates.y++;
}
else {
this.coordinates.y--;
}
pixels.push({
x: this.coordinates.x,
y: this.coordinates.y
});
}

The randomWalk function calls the walk or walkWeightedRight function until an edge is hit. The SVG is then rendered based on the pixels stored in memory representing the path.

randomWalk(weightedRight) {
const pixels = [];
this.steps.current = 0;
while (this.steps.current <= this.steps.max &&
this.coordinates.x < width - 1 && this.coordinates.x > 0
&& this.coordinates.y < height - 1
&& this.coordinates.y > 0)
{
if (weightedRight) {
this.walkWeightedRight(pixels);
}
else {
this.walk(pixels);
}
this.steps.current++;
}
return pixels;
}

The random walks are capped at 10,000 pixels for performance reasons.


Random Numbers with Uniform Distribution

This example plots random numbers generated with JavaScript's Math.random function, which produces a uniform distribution (i.e., no specific weights).

generateRandomData() {
const datasetSize = 100;
const maxValue = 100;
const data = [];
for(let index = 0; index < datasetSize; index++) {
data[index] = {
index: index,
value: Math.floor(Math.random() * maxValue)
}
}
return data;
}

Bell Curve (Frequency Distribution)

This example shows how to create a bell curve for one thousand monkeys ranging in height from 200 to 300 pixels with a normal distribution. First, the code generates the data.

generateHeightData() {
const data = [];
const datasetSize = 1000;
const baseHeight = 200;
const maxRandomValue = 100;
for(let index = 0; index < datasetSize; index++) {
data[index] = {
index: index,
// generate a height between 200 and 300
value: baseHeight + (Math.floor(Math.random() * maxRandomValue))
}
}
return data.sort((current, next) => { return current.value - next.value });
}

Next, the code computes the standard deviation.
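
For reference, the mean and population standard deviation computed below are:

\mu = \frac{1}{N}\sum_{i=1}^{N} x_{i} \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_{i} - \mu)^{2}}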

computeMean(array) {
let sum = 0;
for(let index = 0; index < array.length; index++) {
sum += array[index].value;
}
return sum / array.length;
}

computeStandardDeviation(data, mean) {
let sumSquareDeviation = 0;
for(let index = 0; index < data.length; index++) {
sumSquareDeviation += Math.pow(data[index].value - mean, 2);
}
return Math.sqrt(sumSquareDeviation / data.length);
}

Lastly, the code groups each monkey by standard deviations for the x-axis and plots the frequency counts for the y-axis.

generateHeightBellCurve() {
const data = this.generateHeightData();
const meanHeight = this.computeMean(data);
const standardDeviationHeight = this.computeStandardDeviation(data, meanHeight);
const bellCurveData = {};
for(let index = 0; index < data.length; index++) {
data[index].standardDeviations = Math.round((data[index].value - meanHeight) / standardDeviationHeight);
if(!bellCurveData[data[index].standardDeviations]) {
bellCurveData[data[index].standardDeviations] = {
standardDeviations: data[index].standardDeviations,
count: 1
}
}
else {
bellCurveData[data[index].standardDeviations].count++;
}
}
return Object.keys(bellCurveData).map(key => bellCurveData[key]).sort((one, other) => { return one.standardDeviations - other.standardDeviations });
}

Next Section

Chapter one explores Euclidean vectors and the basics of motion.

Writing Cost-Conscious Terraform Using Infracost and AWS Developer Tools

· 7 min read
Scottie Enriquez
Senior Solutions Developer at Amazon Web Services

Solution Overview

My current role focuses on every facet of AWS cost optimization. Much of this entails helping to remediate existing infrastructure and usage. Many customers ask how they can shift left on cloud costs, like they do with security. Ultimately, cost consciousness needs to be injected into every aspect of the engineering lifecycle: from the initial architecture design to implementation and upkeep.

One such aspect is providing developers visibility into the impact of their code changes. Infrastructure as code has made it easy to deploy cloud resources faster and at larger scale than ever before, but this means that cloud bills can also scale up quickly in parallel. This solution demonstrates how to integrate Infracost into a deployment pipeline to bring cost impact to the pull request process and code review discussion. The source code is hosted on GitHub.

Solution Architecture

Diagram

This solution deploys several resources:

  • A CodeCommit repository pre-loaded with Terraform code for a VPC, EC2 instance, S3 bucket, and Lambda function to serve as some example infrastructure costs to monitor
  • A CodeBuild project triggered by pull request state changes that analyzes cost changes relative to the main branch
  • A CodePipeline with manual approvals to deploy the Terraform for changes pushed to the main branch
  • An SNS topic to notify developers of cost changes
  • An S3 bucket to store Terraform state remotely
  • An S3 bucket to store CodePipeline artifacts

Preparing Your Development Environment

While this solution is for writing, deploying, and analyzing Terraform HCL syntax, I wrote the infrastructure code for the deployment pipeline and dependent resources using AWS CDK, which is my daily driver for infrastructure as code. Of course, the source code could be rewritten using Terraform or CDK for Terraform, but I used CDK for the sake of a quick prototype that only creates AWS resources (i.e., no need for additional providers). In addition, Infracost currently only supports Terraform, but there are plans for CloudFormation and CDK in the future.

The following dependencies are required to deploy the pipeline infrastructure:

Rather than installing Node.js, CDK, Terraform, and all other dependencies on your local machine, you can alternatively create a Cloud9 IDE with these pre-installed via the Console or with a CloudFormation template:

Resources:
rCloud9Environment:
Type: AWS::Cloud9::EnvironmentEC2
Properties:
AutomaticStopTimeMinutes: 30
ConnectionType: CONNECT_SSH
Description: Environment for writing and deploying CDK
# AWS Free Tier eligible
InstanceType: t2.micro
Name: InfracostCDKPipelineCloud9Environment
# https://docs.aws.amazon.com/cloud9/latest/user-guide/vpc-settings.html#vpc-settings-create-subnet
SubnetId: subnet-EXAMPLE

Installation, Deployment, and Configuration

Before deploying the CDK application, store the Infracost API key in an SSM parameter SecureString called /terraform/infracost/api_key.

To install and deploy the pipeline, use the following commands:

git clone https://github.com/scottenriquez/infracost-cdk-pipeline.git
cd infracost-cdk-pipeline/infracost-cdk-pipeline/
npm install
# https://docs.aws.amazon.com/cdk/v2/guide/bootstrapping.html
cdk bootstrap
cdk deploy

Before testing the pipeline, subscribe to the SNS topic via the Console. For testing purposes, use email to get the cost change data delivered.

Using the Deployment Pipeline

The CodePipeline resource is triggered at creation, but there are manual approval stages to prevent any infrastructure from being created without intervention. Feel free to deploy the Terraform, but it is not required for generating cost differences via a pull request. The CodePipeline is triggered by changes to main.

Approval

Make some code changes to see the cost impact. To modify the Terraform code, either use the CodeCommit GUI in the Console or clone the repository to your development environment. First, create a branch called feature off of main. Then modify ec2.tf to use a different instance type:

infracost-cdk-pipeline/lib/terraform/ec2.tf
resource "aws_instance" "server" {
# Amazon Linux 2 Kernel 5.10 AMI 2.0.20220606.1 x86_64 HVM in us-east-1
# if deploying outside of us-east-1, you must use the corresponding AL2 AMI for your region
ami = "ami-0cff7528ff583bf9a"
# changed from t3.micro
instance_type = "m5.large"
subnet_id = module.vpc.private_subnets[0]

root_block_device {
volume_type = "gp3"
volume_size = 50
}
}

Infracost also supports usage estimates in addition to resource costs. For example, changing the storage GBs for the S3 bucket in infracost-usage.yml will also update the cost comparison and estimate. These values are hardcoded and version-controlled here, but Infracost is also experimenting with fetching actual usage data via CloudWatch.

infracost-cdk-pipeline/lib/terraform/infracost-usage.yml
version: 0.1
resource_usage:
aws_lambda_function.function:
monthly_requests: 10000
request_duration_ms: 250
aws_s3_bucket.bucket:
standard:
# changed from 10000
storage_gb: 15000
monthly_tier_1_requests: 1000

Commit these changes to the feature branch and open a pull request. Doing so will trigger the CodeBuild project that computes the cost delta and publishes the payload to the SNS topic if the amount increases. Assuming you subscribed to the SNS topic via email, some JSON should be in your inbox. Here's an abridged example output:

{
"version": "0.2",
"currency": "USD",
"projects": [{
"name": "codecommit::us-east-1://TerraformRepository/.",
"metadata": {
"path": "/tmp/main",
"infracostCommand": "breakdown",
"type": "terraform_dir",
"branch": "main",
"commit": "2e6eafd94811a0c9ac814a8c31132dc3badc0b9f",
"commitAuthorName": "AWS CodeCommit",
"commitAuthorEmail": "noreply-awscodecommit@amazon.com",
"commitTimestamp": "2022-07-16T05:47:50Z",
"commitMessage": "Initial commit by AWS CodeCommit",
"vcsRepoUrl": "codecommit::us-east-1://TerraformRepository",
"vcsSubPath": "."
}
}],
"totalHourlyCost": "0.41661461198630137000733251",
"totalMonthlyCost": "304.12866675",
"pastTotalHourlyCost": "0.33101461198630137000733251",
"pastTotalMonthlyCost": "241.64066675",
"diffTotalHourlyCost": "0.0856",
"diffTotalMonthlyCost": "62.488",
"timeGenerated": "2022-07-16T06:21:02.155239211Z",
"summary": {
"totalDetectedResources": 3,
"totalSupportedResources": 3,
"totalUnsupportedResources": 0,
"totalUsageBasedResources": 3,
"totalNoPriceResources": 0,
"unsupportedResourceCounts": {},
"noPriceResourceCounts": {}
}
}

Diving Into the Pull Request Build Logic

The TypeScript for describing the deployment pipeline lives in infracost-cdk-pipeline-stack.ts. The following code snippet (with comments explaining the install and build phases) contains the core logic for integrating Infracost into the pull request:

infracost-cdk-pipeline/lib/infracost-cdk-pipeline-stack.ts
const pullRequestCodeBuildProject = new codebuild.Project(this, 'TerraformPullRequestCodeBuildProject', {
buildSpec: codebuild.BuildSpec.fromObject({
version: '0.2',
phases: {
install: {
commands: [
// checkout the feature branch
'git checkout $CODEBUILD_SOURCE_VERSION',
'sudo yum -y install unzip python3-pip jq',
'sudo pip3 install git-remote-codecommit',
`wget https://releases.hashicorp.com/terraform/${terraformVersion}/terraform_${terraformVersion}_linux_amd64.zip`,
`unzip terraform_${terraformVersion}_linux_amd64.zip`,
'sudo mv terraform /usr/local/bin/',
'curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh',
// clone the main branch
`git clone ${terraformRepository.repositoryCloneUrlGrc} --branch=${mainBranchName} --single-branch /tmp/main`,
// generate Infracost baseline file for main
'infracost breakdown --path /tmp/main --usage-file infracost-usage.yml --format json --out-file infracost-main.json'
]
},
build: {
commands: [
// initialize Terraform with remote state
`terraform init -backend-config="bucket=${terraformStateBucket.bucketName}"`,
'terraform plan',
// compute diff based on baseline created from main
'infracost diff --path . --compare-to infracost-main.json --usage-file infracost-usage.yml --format json --out-file infracost-pull-request.json',
// parse JSON to get total monthly difference
`DIFF_TOTAL_MONTHLY_COST=$(jq '.diffTotalMonthlyCost | tonumber | floor' infracost-pull-request.json)`,
// if there's a cost increase, publish the diff to the SNS topic
`if [[ $DIFF_TOTAL_MONTHLY_COST -gt 0 ]]; then aws sns publish --topic-arn ${terraformCostTopic.topicArn} --message file://infracost-pull-request.json; fi`
]
}
}
})
});

More advanced notification logic, such as using the percentage increase for an alert threshold, could be implemented to minimize noise for developers. Additionally, offloading the logic to a Lambda function and invoking it via the CLI or SNS would allow for more robust and testable logic than a simple shell script. Alternatively, the cost delta could be added as a comment on the source pull request. Choose the option that makes the most sense for your code review process.
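
As a sketch of the threshold idea (a hypothetical Lambda handler that reuses the pastTotalMonthlyCost and diffTotalMonthlyCost fields from the Infracost JSON above and assumes a TOPIC_ARN environment variable):

import json
import os

import boto3

sns = boto3.client('sns')
# hypothetical threshold: only notify on a 10% or greater monthly cost increase
PERCENTAGE_THRESHOLD = 10.0

def handler(event, context):
    # assumes the Infracost diff JSON is passed as the event payload
    past_monthly_cost = float(event['pastTotalMonthlyCost'])
    diff_monthly_cost = float(event['diffTotalMonthlyCost'])
    if past_monthly_cost == 0:
        percentage_increase = 100.0 if diff_monthly_cost > 0 else 0.0
    else:
        percentage_increase = (diff_monthly_cost / past_monthly_cost) * 100
    if percentage_increase >= PERCENTAGE_THRESHOLD:
        sns.publish(TopicArn=os.environ['TOPIC_ARN'], Message=json.dumps(event))
    return {'percentageIncrease': percentage_increase}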

Conclusion

Technology alone will not resolve all cost optimization challenges. However, integrating cost analysis into code reviews is integral to shaping a cost-conscious culture. It is much better to find and address cost spikes before infrastructure is deployed. Seeing a large cost increase from infracost diff is scary, but seeing it in Cost Explorer later is far scarier.

Cleanup

If you deployed resources via the deployment pipeline, be sure to either use the DestroyTerraform CodeBuild project or run:

# set the bucket name variable or replace with a value
# the bucket name nomenclature is 'terraform-state-' followed by a UUID
# this can also be found via the Console
terraform init -backend-config="bucket=$TERRAFORM_STATE_S3_BUCKET_NAME"
terraform destroy

To destroy the pipeline itself, run:

cdk destroy

If you spun up a Cloud9 environment, be sure to delete that as well.

Disclaimer

At the time of writing this blog post, I work for Amazon Web Services. The opinions and views expressed here are my own and not those of my employer.

A CDK Companion for Rahul Nath's .NET Lambda Course

· 6 min read
Scottie Enriquez
Senior Solutions Developer at Amazon Web Services

The Course and Companion

Rahul Nath recently released a course called AWS Lambda for the .NET Developer on Udemy and Gumroad. I had a ton of fun going through the exercises and highly recommend purchasing a copy. While working through the material, I implemented the solutions with infrastructure as code using AWS CDK in C# and .NET 6. I also containerized most of the Lambda functions and wrote unit tests for both the functions and infrastructure. You can find all my source code on GitHub.

Technology Decisions and Benefits

While infrastructure as code (IaC) has existed within the AWS ecosystem for over a decade, adoption has exploded in recent years due to the ability to manage large amounts of infrastructure at scale and standardize design across an organization. There are many options, including CloudFormation (CFN), CDK, and Terraform for IaC, plus Serverless Application Model (SAM) and Serverless Framework for serverless development. This article from A Cloud Guru quickly sums up the pros and cons of each IaC option. I chose this particular stack for a few key reasons:

  • Docker ensures that the Lambda functions run consistently across local development, builds, and production environments and simplifies dependency management
  • CDK allows the infrastructure to be described as C# instead of YAML, JSON, or HCL
  • CDK provides the ability to inject more robust logic than CloudFormation intrinsic functions and offers greater modularity while still being an AWS-supported offering
  • CDK supports unit testing

Elaborating on the final point, here is an example unit test for ensuring that a DynamoDB table is destroyed when the stack is. The default behavior is for the table to be retained, leading to clutter and cost since this is a non-production project. This is an example of how IaC can be meaningfully tested:

[Fact]
public void Stack_DynamoDb_ShouldHaveDeletionPolicyDelete()
{
    // arrange
    App app = new App();
    LambdaWithApiGatewayStack stack = new LambdaWithApiGatewayStack(app, "LambdaWithApiGatewayStack");

    // act
    Template template = Template.FromStack(stack);

    // assert
    template.HasResource("AWS::DynamoDB::Table", new Dictionary<string, object>()
    {
        {"DeletionPolicy", "Delete"}
    });
}

Dependencies

To build and run this codebase, the following dependencies must be installed:

  • .NET 6
  • Node.js
  • Docker
  • AWS CDK
  • Credentials configured in ~/.aws/credentials (easily done with the AWS CLI)

My Development Environment and CPU Architecture Considerations

I developed all the code on my M1 MacBook Pro using JetBrains Rider. Because of my machine's ARM processor, it's key to note that all of my Dockerfiles use ARM images (e.g., public.ecr.aws/lambda/dotnet:6-arm64) and are deployed to Graviton2 Lambda environments. I suspect that most folks reading this are using x86 Windows machines, so here is a modified Dockerfile illustrating the requisite changes:

LambdaWithAPIGateway/src/LambdaWithApiGateway.DockerFunction/src/LambdaWithApiGateway.DockerFunction/Dockerfile
# ARM
# FROM public.ecr.aws/lambda/dotnet:6-arm64 AS base
# x86
FROM public.ecr.aws/lambda/dotnet:6 AS base

# ARM
# FROM mcr.microsoft.com/dotnet/sdk:6.0-bullseye-slim-arm64v8 as build
# x86
FROM mcr.microsoft.com/dotnet/sdk:6.0-bullseye-slim as build
WORKDIR /src
COPY ["LambdaWithApiGateway.DockerFunction.csproj", "LambdaWithApiGateway.DockerFunction/"]
RUN dotnet restore "LambdaWithApiGateway.DockerFunction/LambdaWithApiGateway.DockerFunction.csproj"

WORKDIR "/src/LambdaWithApiGateway.DockerFunction"
COPY . .
RUN dotnet build "LambdaWithApiGateway.DockerFunction.csproj" --configuration Release --output /app/build

FROM build AS publish
RUN dotnet publish "LambdaWithApiGateway.DockerFunction.csproj" \
--configuration Release \
# ARM
# --runtime linux-arm64
# x86
--runtime linux-x64 \
--self-contained false \
--output /app/publish \
-p:PublishReadyToRun=true

FROM base AS final
WORKDIR /var/task
COPY --from=publish /app/publish .
CMD ["LambdaWithApiGateway.DockerFunction::LambdaWithApiGateway.DockerFunction.Function::FunctionHandler"]

The CDK code for the Lambda function also requires a slight change:

LambdaWithAPIGateway/src/LambdaWithApiGateway/LambdaWithApiGatewayStack.cs
DockerImageFunction sqsDockerImageFunction = new DockerImageFunction(this, "LambdaFunction",
    new DockerImageFunctionProps()
    {
        // ARM
        // Architecture = Architecture.ARM_64,
        // x86
        Architecture = Architecture.X86_64,
        Code = sqsDockerImageCode,
        Description = ".NET 6 Docker Lambda function for polling SQS",
        Role = sqsDockerFunctionExecutionRole,
        Timeout = Duration.Seconds(30)
    }
);

Using Cloud9

AWS offers a browser-based IDE called Cloud9 that has nearly all required dependencies installed. The IDE can be provisioned from the AWS Console or via infrastructure as code. Unfortunately, Cloud9 does not support Graviton-based instances yet. Below is a CloudFormation template for provisioning an environment with the source code pre-loaded:

Resources:
  rCloud9Environment:
    Type: AWS::Cloud9::EnvironmentEC2
    Properties:
      AutomaticStopTimeMinutes: 30
      ConnectionType: CONNECT_SSM
      Description: Web-based cloud development environment
      InstanceType: m5.large
      Name: Cloud9Environment
      Repositories:
        - PathComponent: /repos/rahul-nath-dotnet-lambda-course-cdk-companion
          RepositoryUrl: https://github.com/scottenriquez/rahul-nath-dotnet-lambda-course-cdk-companion.git

Note that the instance must be deployed to a public subnet. The Cloud9 AMI does not have .NET 6 pre-installed. To install it, run the following commands:

sudo rpm -Uvh https://packages.microsoft.com/config/centos/7/packages-microsoft-prod.rpm
sudo yum install dotnet-sdk-6.0

Code Structure

Each section of the course has a separate solution in the repository:

  • FirstLambda is a simple ZIP Lambda function that returns the uppercase version of a string
  • LambdaWithDynamoDb is a simple Lambda function that queries a DynamoDB table
  • LambdaWithApiGateway is a full CRUD app using DynamoDB for storage
  • LambdaTriggers are event-driven Lambda functions triggered by SNS and SQS

Each solution is structured in the same way. I generated the CDK app using the CLI and used the Lambda templates to create my functions like so:

# create the CDK application
# the name is derived from the directory
# this snippet assumes the directory is called Lambda
cdk init app --language csharp
# install the latest version of the .NET Lambda templates
dotnet new -i Amazon.Lambda.Templates
cd src/
# create the function
dotnet new lambda.image.EmptyFunction --name Lambda.DockerFunction
# add the projects to the solution file
dotnet sln add Lambda.DockerFunction/src/Lambda.DockerFunction/Lambda.DockerFunction.csproj
dotnet sln add Lambda.DockerFunction/test/Lambda.DockerFunction.Tests/Lambda.DockerFunction.Tests.csproj
# build the solution and run the sample unit test to verify that everything is wired up correctly
dotnet test Lambda.sln

Each Lambda function has projects for the handler code and unit tests. All CDK code for infrastructure resides in the corresponding *Stack.cs file. Here is some example IaC for a Lambda function triggered by SQS:

LambdaTriggers/src/LambdaTriggers/LambdaTriggersStack.cs
public class LambdaTriggersStack : Stack
{
    public LambdaTriggersStack(Construct scope, string id, IStackProps props = null) : base(scope, id, props)
    {
        Queue queue = new Queue(this, "Queue");
        Role sqsDockerFunctionExecutionRole = new Role(this, "SqsDockerFunctionExecutionRole", new RoleProps {
            AssumedBy = new ServicePrincipal("lambda.amazonaws.com"),
            ManagedPolicies = new IManagedPolicy[]
            {
                new ManagedPolicy(this, "ManagedPolicy", new ManagedPolicyProps()
                {
                    Document = new PolicyDocument(new PolicyDocumentProps()
                    {
                        Statements = new []
                        {
                            new PolicyStatement(new PolicyStatementProps()
                            {
                                Actions = new [] { "sqs:*" },
                                Resources = new [] { queue.QueueArn }
                            }),
                            new PolicyStatement(new PolicyStatementProps
                            {
                                Actions = new []
                                {
                                    "logs:CreateLogGroup",
                                    "logs:CreateLogStream",
                                    "logs:PutLogEvents"
                                },
                                Effect = Effect.ALLOW,
                                Resources = new [] { "*" }
                            })
                        }
                    })
                })
            }
        });
        DockerImageCode sqsDockerImageCode = DockerImageCode.FromImageAsset("src/LambdaTriggers.SqsDockerFunction/src/LambdaTriggers.SqsDockerFunction");
        DockerImageFunction sqsDockerImageFunction = new DockerImageFunction(this, "LambdaFunction",
            new DockerImageFunctionProps()
            {
                Architecture = Architecture.ARM_64,
                Code = sqsDockerImageCode,
                Description = ".NET 6 Docker Lambda function for polling SQS",
                Role = sqsDockerFunctionExecutionRole,
                Timeout = Duration.Seconds(30)
            }
        );
        SqsEventSource sqsEventSource = new SqsEventSource(queue);
        sqsDockerImageFunction.AddEventSource(sqsEventSource);
    }
}

Resource Deployment

To deploy the infrastructure, navigate to the corresponding section folder and use the CDK CLI like so:

cd LambdaTriggers
cdk deploy

Resource Cleanup

To destroy resources, run this command in the same directory:

cdk destroy

Using the New Terraform for CDK Convert Feature

· 5 min read
Scottie Enriquez
Senior Solutions Developer at Amazon Web Services

I previously wrote a blog post about getting started with CDK for Terraform and its benefits. At that time, the latest version was 0.3. Last week, version 0.5 was released. This version includes some new experimental features that could make adopting CDK for Terraform significantly easier.

The Convert Command

The CLI command takes in a Terraform file and converts it to the language specified.

cat terraform.tf | cdktf convert --language csharp

I started with a single terraform.tf file that creates an Azure App Service.

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "=2.46.0"
    }
  }
}

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "cdktf_convert_rg" {
  name     = "cdktf-convert-resource-group"
  location = "Central US"
}

resource "azurerm_app_service_plan" "cdktf_convert_app_service_plan" {
  name                = "cdktf-convert-appserviceplan"
  location            = azurerm_resource_group.cdktf_convert_rg.location
  resource_group_name = azurerm_resource_group.cdktf_convert_rg.name
  sku {
    tier = "Free"
    size = "F1"
  }
}

resource "azurerm_app_service" "cdktf_convert_app_service" {
  name                = "cdktf-convert-app-service"
  location            = azurerm_resource_group.cdktf_convert_rg.location
  resource_group_name = azurerm_resource_group.cdktf_convert_rg.name
  app_service_plan_id = azurerm_app_service_plan.cdktf_convert_app_service_plan.id
}

The command produces the following C# snippet:

using Gen.Providers.Azurerm;

new AzurermProvider(this, "azurerm", new Struct {
    Features = new [] { new Struct { } }
});

var azurermResourceGroupCdktfConvertRg = new ResourceGroup(this, "cdktf_convert_rg", new Struct {
    Location = "Central US",
    Name = "cdktf-convert-resource-group"
});

var azurermAppServicePlanCdktfConvertAppServicePlan =
    new AppServicePlan(this, "cdktf_convert_app_service_plan", new Struct {
        Location = azurermResourceGroupCdktfConvertRg.Location,
        Name = "cdktf-convert-appserviceplan",
        ResourceGroupName = azurermResourceGroupCdktfConvertRg.Name,
        Sku = new [] { new Struct {
            Size = "F1",
            Tier = "Free"
        } }
    });

new AppService(this, "cdktf_convert_app_service", new Struct {
    AppServicePlanId = azurermAppServicePlanCdktfConvertAppServicePlan.Id,
    Location = azurermResourceGroupCdktfConvertRg.Location,
    Name = "cdktf-convert-app-service",
    ResourceGroupName = azurermResourceGroupCdktfConvertRg.Name
});

While this alone is extremely powerful, the C# code cannot be executed until the provider objects (i.e., Gen.Providers.Azurerm from the using statement) are generated with cdktf get. The --language flag currently supports all of the languages that CDK does. I see the use case for this command as translating individual files for migration into an existing CDK for Terraform project. For converting entire solutions, the option to generate a whole project from a folder seems much more helpful.

Initializing from an Existing Terraform Project

Rather than converting a single file, the init command has been updated to support bootstrapping a new project from an existing Terraform project. At the time of writing, only TypeScript is supported.

cdktf init --from-terraform-project terraform-project-folder --template typescript

I tested against a Terraform example on GitHub from Futurice that creates a scheduled Lambda function. I forked and updated the template to work with Terraform version 1.0.3. The HCL is split across multiple files (i.e., main, variables, outputs, and permissions). I also created a Lambda function via the SAM CLI and built a ZIP artifact. The updated init command was smart enough to merge all of the .tf files into a single stack. However, the command does not migrate folders and assets outside of Terraform (i.e., my Lambda code, SAM folders, etc.). For now, these will need to be copied manually. Find the full output project here.
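
To illustrate the merged result structurally, here is a minimal sketch of the standard CDK for Terraform TypeScript entry point that init scaffolds: a single stack class whose constructor holds all of the translated resources. The class and app names below are illustrative rather than the actual generated output.

import { Construct } from "constructs";
import { App, TerraformStack } from "cdktf";

class ScheduledLambdaStack extends TerraformStack {
  constructor(scope: Construct, name: string) {
    super(scope, name);
    // resources originally spread across main.tf, variables.tf, outputs.tf,
    // and permissions.tf are all translated into this one constructor
  }
}

const app = new App();
new ScheduledLambdaStack(app, "scheduled-lambda");
app.synth();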

Notes About Conversion

Interacting with the Provider

My source HCL did not specify the region in the provider block, which would normally look like this in Terraform:

provider "aws" {
region = "us-east-1"
}

To modify the provider settings in CDK for Terraform, instantiate a provider object. The convert command will translate a provider block like the one above, but it was not immediately apparent to me how to write this by hand:

new AwsProvider(this, 'aws', {
  region: 'us-east-1'
});

Counts

At the time of writing, the count meta-argument does not work consistently yet. I've opened up an issue on GitHub accordingly. The following HCL throws an error when converting:

resource "aws_instance" "multiple_server" {
count = 4
ami = "ami-0c2b8ca1dad447f8a"
instance_type = "t2.micro"
tags = {
Name = "Server ${count.index}"
}
}

I'm not sure if the intent is that this will be translated into a for loop or if the count meta-argument will just be modified. In any case, this can easily be rewritten using the general-purpose language in a much cleaner way (i.e., a loop).
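
For instance, here is a minimal sketch of the count example rewritten as an ordinary for loop in TypeScript, assuming the AWS provider bindings (i.e., aws.Instance) have been generated with cdktf get; the construct ID format is my own choice.

// inside the stack's constructor
for (let index = 0; index < 4; index++) {
  new aws.Instance(this, `multiple_server_${index}`, {
    ami: "ami-0c2b8ca1dad447f8a",
    instanceType: "t2.micro",
    tags: {
      Name: `Server ${index}`,
    },
  });
}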

I've seen a common pattern in Terraform templates that uses the count meta-argument to create resources conditionally. In the snippet below, a Lambda function resource is created based on whether or not an S3 bucket name is specified.

resource "aws_lambda_function" "local_zipfile" {
count = var.function_s3_bucket == "" ? 1 : 0
filename = var.function_zipfile
}

This pattern does not convert directly because in CDK for Terraform, count is set via an escape hatch using the addOverride method. The underlying Terraform configuration will be modified, but there is no way to access the individual constructs in the resulting list from code. However, this is another opportunity to leverage the benefits of a general-purpose language by using conditionals, lists, for loops, and so on.
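
As a sketch of what the conditional version could look like in TypeScript, again assuming generated AWS provider bindings; the hardcoded variable values and the additional required Lambda properties are illustrative assumptions rather than a translation of the original template.

// values that mirror the Terraform variables, hardcoded here for illustration
const functionS3Bucket = "";
const functionZipfile = "function.zip";

// inside the stack's constructor: only create the resource when no S3 bucket is specified
if (functionS3Bucket === "") {
  new aws.LambdaFunction(this, "local_zipfile", {
    functionName: "scheduled-function",
    role: "arn:aws:iam::123456789012:role/lambda-execution-role",
    handler: "index.handler",
    runtime: "nodejs14.x",
    filename: functionZipfile,
  });
}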

Built-In Functions

Terraform built-in functions are converted and supported by CDK for Terraform. Below is a simple example using the max() function in the instance's tag:

resource "aws_instance" "ec2_instance" {
ami = "ami-0c2b8ca1dad447f8a"
instance_type = "t2.micro"
tags = {
Name = "Server ${max(1, 2, 12)}"
}
}

This converts to the following TypeScript:

new aws.Instance(this, "ec2_instance", {
  ami: "ami-0c2b8ca1dad447f8a",
  instanceType: "t2.micro",
  tags: {
    name: "Server ${max(1, 2, 12)}",
  },
});

The string containing the built-in function is preserved in the cdk.tf.json build artifact file and evaluated accordingly. As best practices form, I'm curious how often built-in functions will be used versus their corresponding equivalents in the general-purpose language. While this is useful for easily converting templates with built-in functions, I would argue that there are many benefits to rewriting this logic in TypeScript (i.e., unit testing, readability, etc.).
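
For instance, the tag value could be computed natively in TypeScript instead of deferring to the Terraform built-in, which keeps the logic visible to unit tests and the type checker:

new aws.Instance(this, "ec2_instance", {
  ami: "ami-0c2b8ca1dad447f8a",
  instanceType: "t2.micro",
  tags: {
    Name: `Server ${Math.max(1, 2, 12)}`,
  },
});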

Configuring AWS SAM Pipelines for GitHub Actions

· 3 min read
Scottie Enriquez
Senior Solutions Developer at Amazon Web Services

About AWS SAM Pipelines

Last week, AWS announced the public preview of SAM Pipelines. This feature expands the SAM CLI, allowing users to quickly create multi-account CI/CD pipelines for serverless applications across several providers such as GitHub Actions, GitLab CI/CD, and Jenkins. Along with CDK Pipelines, AWS tooling keeps making it easier to standardize on best practices.

Preparing Your Machine

I opted to build my Lambda function as a container image for my testing, so the core dependencies are the AWS CLI, the SAM CLI, and Docker.

# aws-cli/2.2.23 Python/3.9.6 Darwin/20.6.0 source/x86_64 prompt/off
aws --version
# SAM CLI, version 1.27.2
sam --version
# Docker version 20.10.7, build f0df350
docker --version

Creating the SAM Application and Pipeline

First, create a starter application. I chose amazon/nodejs14.x-base for a base image. Then, run the pipeline command with the --bootstrap flag to configure the CI/CD provider and requisite AWS resources like IAM policies.

sam init
sam pipeline init --bootstrap

The pipeline command walks you through a series of configuration steps. For the CI/CD provider, choose GitHub Actions, which generates a two-stage pipeline. For each stage, provide the following information:

  • Name (i.e., pre-production, production)
  • Account details (i.e., access keys provided for AWS CLI)
  • Reference application build resources (i.e., pipeline execution role, CloudFormation execution role, S3 bucket for build artifacts, ECR repository for container images)

The pipeline user's access key and secret key are displayed in the terminal; these will be required for configuring GitHub Actions. Repeat the steps for the second stage. The CLI creates .aws-sam/pipeline/pipelineconfig.toml to store the configuration.

.aws-sam/pipeline/pipelineconfig.toml
version = 0.1
[default]
[default.pipeline_bootstrap]
[default.pipeline_bootstrap.parameters]
pipeline_user = "arn:aws:iam::123456789199:user/aws-sam-cli-managed-Pre-production-pi-PipelineUser-CGSL85Y74RRL"

[Pre-production]
[Pre-production.pipeline_bootstrap]
[Pre-production.pipeline_bootstrap.parameters]
pipeline_execution_role = "arn:aws:iam::123456789199:role/aws-sam-cli-managed-Pre-prod-PipelineExecutionRole-HKCRZ2IX8SOY"
cloudformation_execution_role = "arn:aws:iam::123456789199:role/aws-sam-cli-managed-Pre-p-CloudFormationExecutionR-1XKKSR1ZGOTH3"
artifacts_bucket = "aws-sam-cli-managed-pre-productio-artifactsbucket-g2pauw42amc"
image_repository = "123456789199.dkr.ecr.us-east-1.amazonaws.com/aws-sam-cli-managed-pre-production-pipeline-resources-imagerepository-qjnaif21ukb0"
region = "us-east-1"

[Production]
[Production.pipeline_bootstrap]
[Production.pipeline_bootstrap.parameters]
pipeline_execution_role = "arn:aws:iam::123456789199:role/aws-sam-cli-managed-Producti-PipelineExecutionRole-1ANR2SNKQD638"
cloudformation_execution_role = "arn:aws:iam::123456789199:role/aws-sam-cli-managed-Produ-CloudFormationExecutionR-17RL86055A01I"
artifacts_bucket = "aws-sam-cli-managed-production-pi-artifactsbucket-177nd7ab4h4bz"
image_repository = "123456789199.dkr.ecr.us-east-1.amazonaws.com/aws-sam-cli-managed-production-pipeline-resources-imagerepository-nhdrmzfnssnr"
region = "us-east-1"

The CLI will prompt you for the secret name to use for the IAM pipeline user in GitHub Actions (i.e., ${{ secrets.AWS_ACCESS_KEY_ID }} instead of a hardcoded value). These credentials should never be exposed in the source code. pipeline.yaml is created in the .github/workflows folder.

.github/workflows/pipeline.yaml
name: Pipeline

on:
  push:
    branches:
      - 'main'
      - 'feature**'

env:
  PIPELINE_USER_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  PIPELINE_USER_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  SAM_TEMPLATE: sam-pipelines-app/template.yaml
  TESTING_STACK_NAME: sam-pipelines-app
  TESTING_PIPELINE_EXECUTION_ROLE: arn:aws:iam::123456789199:role/aws-sam-cli-managed-Pre-prod-PipelineExecutionRole-HKCRZ2IX8SOY
  TESTING_CLOUDFORMATION_EXECUTION_ROLE: arn:aws:iam::123456789199:role/aws-sam-cli-managed-Pre-p-CloudFormationExecutionR-1XKKSR1ZGOTH3
  TESTING_ARTIFACTS_BUCKET: aws-sam-cli-managed-pre-productio-artifactsbucket-g2pauw42amc
  TESTING_IMAGE_REPOSITORY: 123456789199.dkr.ecr.us-east-1.amazonaws.com/aws-sam-cli-managed-pre-production-pipeline-resources-imagerepository-qjnaif21ukb0
  TESTING_REGION: us-east-1
  PROD_STACK_NAME: sam-pipelines-app
  PROD_PIPELINE_EXECUTION_ROLE: arn:aws:iam::123456789199:role/aws-sam-cli-managed-Producti-PipelineExecutionRole-1ANR2SNKQD638
  PROD_CLOUDFORMATION_EXECUTION_ROLE: arn:aws:iam::123456789199:role/aws-sam-cli-managed-Produ-CloudFormationExecutionR-17RL86055A01I
  PROD_ARTIFACTS_BUCKET: aws-sam-cli-managed-production-pi-artifactsbucket-177nd7ab4h4bz
  PROD_IMAGE_REPOSITORY: 123456789199.dkr.ecr.us-east-1.amazonaws.com/aws-sam-cli-managed-production-pipeline-resources-imagerepository-nhdrmzfnssnr
  PROD_REGION: us-east-1

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: |
          # trigger the tests here

  build-and-deploy-feature:
    ...

  build-and-package:
    ...

  deploy-testing:
    ...

  integration-test:
    ...

  deploy-prod:
    ...

Before pushing the changes to the remote origin, add the IAM pipeline user credentials as secrets in GitHub.

github-actions-secret.png

Adding Approvers

The default pipeline does not have any approval mechanisms in place, so when pushing to main, the application goes directly to production. To require approval, create an environment in GitHub, add approvers to it, and then reference the environment in the pipeline YAML.

.github/workflows/pipeline.yaml
deploy-prod:
  if: github.ref == 'refs/heads/main'
  needs: [integration-test]
  runs-on: ubuntu-latest
  environment: production

Feature Branch Environments

For feature branches (i.e., branches matching the feature** pattern), the pipeline will create a new CloudFormation stack and deploy the branch automatically. This is powerful for quickly testing in a live environment outside of the two stages created by the pipeline.

feature-branches-action.png

Note that there is no functionality in the default pipeline to delete the CloudFormation stack when the feature branch is deleted.