Codecov Self-Hosted

Deployment Guide and Reference Architecture

This reference architecture and technical deployment guide outlines best practices and recommendations for a successful launch, management, and maintenance of Codecov Self-Hosted.

The intended audience of this guide is DevOps engineers, team leads, and other technical team members interested in the installation, administration, and maintenance of Codecov Self-Hosted. It also offers some insight into the internal workings of the application.

Codecov Self-Hosted uses Kubernetes to provide a modern deployment architecture as well as ease of scaling and maintenance. Throughout this documentation, we refer to our architecture as IaC (Infrastructure as Code).


Codecov Application Architecture

Codecov is built using Python (server) and JavaScript (client), leveraging various open-source frameworks (e.g. Django, ReactJS).

To summarize the application and infrastructure stack requirements at a high level, the main elements are listed below:

  • Nginx (web server, reverse proxy)
  • Python (server-side application language)
  • JavaScript / ReactJS
  • Celery Queue System
  • PostgreSQL (database)
  • Redis (key/value storage caching and session sharing between application servers)
  • MinIO-compatible storage (see the detailed explanation in the MinIO section below)
  • Managed Kubernetes cluster

Codecov Self-Hosted: What is provided?

As shown in the diagram above, Codecov offers infrastructure setup as IaC using Helm, Terraform, or other native deployment methods for the “Big 3” cloud providers. More information and the latest enterprise releases are available in our enterprise repository.

The middle tier (the solid box outlining the web, worker, and API pods) is the application layer that Codecov provides to the customer. We maintain this IaC and offer regularly scheduled releases to provide our customers with the latest features, bug fixes, and improvements to the underlying architecture.

Incoming traffic, load balancers, and ingress to the application are highly dependent on the customer’s existing infrastructure. Due to the wide variety of possible configurations, we do not offer sample configurations for ingress, load balancers, or API gateways. Likewise, the backend storage, namely the PostgreSQL database, Redis, and MinIO-compatible storage, should be provided by the customer. We recommend leveraging the managed services available from the Big 3 cloud providers. Codecov engineers have experience with GCP, Azure, and AWS; however, supporting these services on custom-built infrastructure falls outside the scope of our support responsibilities. That said, we will always do our best to address customers’ needs within our capabilities and available resources.

Application inter-component communication

All communication between the client and server components should be configured to use secure SSL/TLS connections. This includes communication with the git provider and with Codecov Self-Hosted.

System Response Times

Application response times must be 3 seconds at most and should optimally target sub-second responses. Load testing needs to be organized and performed by the customer, in a load-testing environment mirroring production infrastructure, to confirm that response times meet these requirements. Enough time should be allowed for scaling infrastructure capacity and for performance profiling and optimization wherever required.

Some operations require synchronous API calls to external systems. In these cases, the response time is the sum of both systems’ response times and may occasionally exceed the target.

API calls to external systems (e.g. GitHub, GitLab, Bitbucket, or CI providers) need to be taken into consideration when measuring system response times.

Infrastructure Setup

The customer will leverage their internal infrastructure, tech support, and DevOps teams to complete the prerequisites required by the Codecov Self-Hosted application. In most cases, this is a relatively simple process of designating a Kubernetes cluster and installing the Codecov application with “helm install”.

The role of the Codecov team is to provide guidance during the infrastructure setup, offer stack blueprints, and suggest initial configurations to be compatible with the customer’s internal systems.

To help the Codecov team troubleshoot customer issues, it is imperative that your internal team has adequate access to the staging environment, logs, and monitoring tools.

Application aspects to consider when designing a Codecov infrastructure are:

  • Number of coverage uploads per hour
  • The average size of the uploaded coverage report
  • Integrations and other systems requirements and their requests per minute
  • Database storage size
  • Redis size
  • Worker size
  • Data growth rate per year, and how many years you are planning ahead (1-3)

Sizing is an approximate estimate. The system needs use-case-specific data points and feedback loops so it can be adjusted over time based on infrastructure monitoring. The feedback loop is created through monitoring and complemented with load testing to verify the initial sizing and adjust and plan accordingly.

Storage, Database and Redis

Codecov Self-Hosted requires a MinIO-compatible storage backend. This is where Codecov stores raw coverage reports and archives of all previous uploads. It is typically backed by storage solutions such as AWS S3, Azure Blob Storage, or GCP buckets. Custom storage solutions compatible with MinIO can be used, but Codecov can only offer limited support without intimate knowledge of the customer’s infrastructure.

Similar to the storage solution, Codecov relies on the customer’s infrastructure team to provide the database (PostgreSQL) and Redis (non-clustered only) services.

MinIO

  1. Codecov always uses MinIO as a software-level SDK/dependency. In enterprise, we use it to interface with all forms of storage.
  2. MinIO can also be run as infrastructure. Even then, it is still accessed through the MinIO SDK described in point 1.
  3. If the user has AWS S3 or GCP buckets, they do not have to run MinIO as infrastructure; the MinIO SDK we use in software can connect directly to those providers (see the sketch after this list).
  4. MinIO will not connect directly to Azure Blob Storage, at least at the time of writing this guide. If the user wants to use Azure’s blob storage, they have to run MinIO as infrastructure in front of it.
  5. If the user brings their own storage (e.g. a data center, attached file storage in AWS, etc.), they must run MinIO as infrastructure to interface with that storage; these use cases are generally supported.
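Since all storage access goes through the MinIO SDK, configuration amounts to pointing the SDK at whichever backend is in play. Below is a minimal sketch using the MinIO Python SDK; the endpoint, credentials, and bucket name are illustrative placeholders, not Codecov’s actual settings.

```python
# A minimal sketch of the MinIO SDK talking to any S3-compatible
# backend. Endpoint, credentials, and bucket name are placeholders.
from minio import Minio

client = Minio(
    "s3.amazonaws.com",           # or a self-hosted MinIO gateway endpoint
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    secure=True,                  # always use TLS
)

# Verify the archive bucket exists before the application uses it.
if not client.bucket_exists("codecov-archive"):
    client.make_bucket("codecov-archive")
```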

Load Testing

Load testing is to be completed by the customer in UAT and on a replica of production at a predefined cadence. Together with monitoring, it will provide additional data to make sure that all parts of the infrastructure and application perform according to expectations.

In the load test environment, a customer will need tools like LoadRunner or JMeter to simulate a production-capacity load. Machines will also need to be available to generate the load; usually 2-3 driver machines are needed if a customer plans to upload many reports per hour and multiple reports per CI run (for example, to test against various OS versions). Think time should initially be set at 15 seconds.
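As a rough illustration of such a driver script, here is a minimal sketch using Locust, a Python-based load tool (named here as an alternative to LoadRunner or JMeter), with the 15-second think time suggested above. The endpoint path, query parameters, and payload are hypothetical placeholders.

```python
# A minimal Locust sketch simulating coverage uploads.
# Endpoint and payload are illustrative, not Codecov's real API.
from locust import HttpUser, task, between


class CoverageUploadUser(HttpUser):
    wait_time = between(15, 15)  # initial think time of 15 seconds

    @task
    def upload_report(self):
        # Hypothetical upload endpoint; replace with your real one.
        self.client.post(
            "/upload/v4",
            params={"commit": "deadbeef", "token": "REPO_UPLOAD_TOKEN"},
            data=b"sample coverage payload",
        )
```

Run it with, for example, `locust -f loadtest.py --host https://codecov.example.com`, scaling out over the driver machines described above.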

The load test environment is then attached to a test integration environment to make sure the integration load is taken into consideration. The database size should be similar to the expected production database size over the next year, including a similar record mix. Data can be scrambled.

Key factors that affect system performance are the frequency of coverage uploads and the size of the coverage reports.

The information needed during performance testing includes:

  • Database query CPU usage
  • Database index advisor
  • Servers CPU usage in at least 1-minute increments
  • Servers Memory usage in at least 1-minute increments
  • Codecov logs with PostgreSQL slow query log turned on
  • Network traffic for each server
  • I/O profile for DB, Redis, and S3
  • Celery queue size, which should be trending toward zero (a sketch for checking this follows the list)
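For the last point, a simple way to watch queue depth is to poll the broker directly. A minimal sketch, assuming Celery’s default Redis queue name of “celery” and a placeholder Redis host:

```python
# Poll the Celery broker queue depth in Redis. "celery" is the
# default queue name; adjust if your deployment uses custom queues.
import time

import redis

r = redis.Redis(host="redis.internal", port=6379)  # placeholder host

while True:
    depth = r.llen("celery")
    print(f"celery queue depth: {depth}")  # should trend toward zero
    time.sleep(60)
```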

System Scaling

The Kubernetes-based infrastructure will address the customer’s need to scale over time, as required, informed by APM (application performance monitoring) data.

Two types of scalability are supported:

  1. horizontal (add more instances) and 
  2. vertical (add more resources to an instance). 

Load testing, monitoring, and performance tuning will help gather more information regarding how to improve the infrastructure to make the system scale over time. 

Scaling is often based on request count per minute per target, while the threshold recommendation should be based on CPU utilization. For example, 80% CPU usage on the web pods may trigger a scale-up action by one node. Likewise, auto-scaling can be scheduled around business hours or expected times of heavy load, such as significant changes to the CI process that increase the number of uploads or test runs.

One of the main benefits Kubernetes offers out of the box is the Horizontal Pod Autoscaler, which is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The controller periodically adjusts the number of replicas in a replication controller or deployment to match observed metrics, such as average CPU utilization, average memory utilization, or any other custom metric, to the target specified by the user.
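As an illustration, the following sketch creates such an autoscaler with the Kubernetes Python client, using the 80% CPU threshold mentioned above. The deployment name, namespace, and replica bounds are illustrative placeholders.

```python
# A minimal sketch creating a Horizontal Pod Autoscaler with the
# Kubernetes Python client. Names and bounds are placeholders.
from kubernetes import client, config

config.load_kube_config()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="codecov-web-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="codecov-web"
        ),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=80,  # scale up past 80% CPU
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="codecov", body=hpa
)
```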

High Availability

High availability is achieved by having no single point of failure, load balancing, and redundancy. The Codecov topology separates the web nodes, API nodes, and worker nodes to distribute the load and survive single service failures. 

It is recommended to have a DB replica that can be used for failover and for directing report requests; it can likewise be used for swapping during blue/green deployments (see more on blue/green deployments below).

Backups

Database backups will follow the customer’s internal policies and standards. The application file system will be backed up immediately before executing a database backup to maintain application consistency at a specific point in time, as files can be tightly coupled to the database.

Monitoring and Logging

Monitoring of every infrastructure and application element is required for:

  1. Infrastructure-level health purposes – This usually involves monitoring at the pod/node level to ensure it is up and operational.
  2. System-level component health purposes – This monitors health information within each machine, such as CPU usage, network, memory, I/O, etc. This information is collected outside of the Codecov Self-Hosted application. Components that should be monitored are:
    1. Database
    2. Web Servers
    3. Load balancers
    4. Celery Queue
    5. Redis
    6. S3
  3. Performance monitoring for real-time and trending purposes – This is primarily done through an APM tool of the customer’s choice.

Codecov.io communicates any issues with performance, security, and application health via statsd metrics. 
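As a rough sketch of what emitting such metrics looks like from a Python service, the snippet below uses the statsd client library; the host and metric names are illustrative, not Codecov’s actual metric namespace.

```python
# A hedged sketch of statsd instrumentation from a Python service.
# Host and metric names below are illustrative placeholders.
import statsd

stats = statsd.StatsClient("statsd.internal", 8125)  # placeholder host

stats.incr("codecov.uploads.received")          # counter
stats.timing("codecov.report.process_ms", 420)  # timer, in milliseconds
stats.gauge("codecov.celery.queue_depth", 3)    # gauge
```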

The logs from Codecov Self-hosted services should be harvested to a central location and analyzed using tools such as Kibana, Splunk, or DataDog. An automated process and alerting can be created to react to log events.

OAuth

Codecov utilizes OAuth, a standard authentication methodology that maintains state between requests. With OAuth, a user is uniquely identified and granted access to a system via a unique string called an “access token”, issued after a valid login with the user’s credentials. The access token is valid for a limited time and is stored encrypted by the client accessing the system. To be authorized and receive a valid response from the git provider, the client must pass a valid OAuth token in the HTTP headers of every authenticated request. Once the access token expires, it can be renewed with a valid “refresh token” if required; alternatively, a new login flow can be initiated to obtain a valid access token.

Codecov’s login and user creation flows are built on GitHub’s OAuth application flow, which is compliant with the OAuth 2 standard.

Bitbucket and GitLab use similar, industry-standard, 3-legged OAuth flows.

To provide a rough outline of what occurs when one clicks the “Login with GitHub” button in Codecov’s UI:

  1. The user is redirected to https://<GitProviderURL>.com/login/oauth/authorize
  2. They are presented with the scopes the Codecov application is requesting for the user’s Git provider account.
  3. If the user clicks “Authorize” they are redirected back to the Codecov application.
  4. During this redirect, the Codecov server is granted an access code from the Git provider. The server exchanges this code with the Git provider (along with Codecov’s GitHub OAuth app client ID and secret) and receives an access token (a sketch of this exchange follows the list).
  5. This access token is encrypted and stored in Codecov’s database.
  6. When a request requiring that user’s authorization needs to be made, the OAuth token is accessed in the database, decrypted, and used to make a corresponding API call to GitHub servers.
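For concreteness, here is a minimal sketch of the code-for-token exchange in step 4 against GitHub’s token endpoint; the client ID, secret, and code values are placeholders.

```python
# A minimal sketch of exchanging the authorization code for an
# access token with GitHub. Credential values are placeholders.
import requests

resp = requests.post(
    "https://github.com/login/oauth/access_token",
    headers={"Accept": "application/json"},
    data={
        "client_id": "CODECOV_OAUTH_APP_CLIENT_ID",
        "client_secret": "CODECOV_OAUTH_APP_CLIENT_SECRET",
        "code": "CODE_FROM_REDIRECT",
    },
)
access_token = resp.json()["access_token"]

# The token is then used on subsequent API calls (step 6):
user = requests.get(
    "https://api.github.com/user",
    headers={"Authorization": f"Bearer {access_token}"},
).json()
```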

It is worth noting that Codecov also deploys a GitHub App to support some of the Codecov application’s core functionality. While GitHub Apps allow for authorization and login flows similar to those of a GitHub OAuth app, Codecov does not use the GitHub App for user auth. All user authorization and authentication are handled via the GitHub OAuth app.

Codecov Request Flow

  1. Upload with token: yes or no?
    1. If yes: query the DB for the token and grab the matching repo.
    2. If no: check the “service” param and confirm it is in the allowed list of tokenless uploaders.
  2. The upload request is sent from the CI tool to the backend to get a pre-signed PUT URL for storage (see the sketch after this list).
    1. MinIO responds with the storage URL and result URL.
      1. A Celery task is created to process the upload.
        1. A worker grabs the task.
          1. If no worker is available, the task waits in the queue until one becomes available.
    2. The report file is sent via the pre-signed PUT to the storage URL.
    3. On success, output is shown to the user.
    4. Otherwise, an error output is returned.
  3. Codecov reaches out to the git provider to get information about the repo the uploaded commit belongs to.
  4. Bot enabled?
    1. Yes: use the bot for OAuth.
    2. No: try to use the user’s account.
  5. The commit is validated against the API.
  6. Codecov checks whether a YAML is available in the git provider (repo); if so, it is combined with any existing global or team-level YAML.
  7. The processing task is picked up by the worker.
  8. The Codecov raw report is created by the worker for the given commit.
  9. The report is processed for notification and storage by Codecov.
  10. The report is archived.
  11. The notification task is received.
  12. A notification post (PR comment) to the git provider is attempted.
  13. Notification(s) are sent.
  14. Request complete.
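To make step 2 concrete, here is a minimal sketch of generating a pre-signed PUT URL with the MinIO SDK and uploading a raw report to it; the endpoint, bucket, and object names are illustrative placeholders.

```python
# A hedged sketch of the pre-signed PUT flow. The backend issues a
# short-lived URL; the CI uploader then PUTs the report directly to
# storage without needing storage credentials. Names are placeholders.
from datetime import timedelta

import requests
from minio import Minio

client = Minio(
    "minio.internal:9000", access_key="KEY", secret_key="SECRET", secure=True
)

# Backend side: generate a URL valid for 10 minutes.
url = client.presigned_put_object(
    "codecov-archive", "v4/raw/deadbeef.txt", expires=timedelta(minutes=10)
)

# Uploader side: send the raw report to the pre-signed URL.
requests.put(url, data=b"raw coverage report contents")
```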

Troubleshooting

In order to troubleshoot common issues with your Codecov installation, it is important to ensure that log files from the web, API, and worker pods are sent to a central repository and are available to search and filter by commit ID (SHA) and other facets. Common issues include:

  1. Error messages in the UI (e.g. random 500s, 404s, etc.)
  2. PR comments aren’t posting 

First, it is recommended to view the recent log files and look for any “Error” messages or logs explaining why a specific error might occur.

A common way to troubleshoot a particular problem is to first search the logs for a commit SHA and view all the events associated with that commit. Codecov outputs a variety of information in the logs, tracing the flow of the request from the initial report upload to the final step of sending notifications (i.e. a Pull Request or Merge Request comment and status checks).
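As a trivial illustration, filtering harvested logs by a commit SHA can be as simple as the following sketch; the log path and SHA are placeholders for your central log store.

```python
# Trace a request through harvested logs by commit SHA.
# Path and SHA below are placeholders.
sha = "deadbeefcafe"

with open("/var/log/codecov/combined.log") as logs:
    events = [line for line in logs if sha in line]

# Events from web, API, and worker pods, in order, show where the
# flow stalled (upload, processing, or notification).
for event in events:
    print(event, end="")
```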

DevOps & CI/CD Development and Deployment

A few key principles are:

  • Always make necessary config and IaC changes and test them on stage
  • The CI/CD process will help with automated deployment and reduce potential downtime
  • Once changes are tested on stage, be sure to commit all code to git, add any migrations if required, and proceed to deploy following a strategy such as blue/green deployment.

The deployment between environments should be consistent, repeatable, and scripted to avoid human error. In addition, the deployment process will at times need to interact with the infrastructure to make sure the software is deployed correctly and in a timely manner. Such health checks are handled by GCP monitoring or APM tools and go beyond the scope of this document.

Blue / Green Deployments

Blue/green deployment is considered the “safest” strategy (compared to rolling or ad-hoc deployments) and is used by many for production workloads. To provide that safety, the currently running system is maintained while an entire copy of the application is provisioned. Once the new version is live, you can test it however, and for as long as, you need. Since the current version is still live and serving traffic, there is no rush to finalize the deployment. At this point, users are still unaware of the update.

When you are confident enough in the quality of the new version, you start routing users to it instead of the old version. This can be done for all users at once (a.k.a. cut-off or cut-over), or by gradually shifting more users to the new version.

Draining

A common practice is “draining” the old deployment before transitioning: allowing currently active sessions to end naturally while starting new sessions only on the new version.

Switch Back

Usually, you will want to keep one previous version available even after the full transition has been completed. This ensures you can undo and switch back to the old working version easily.

Stage

A common variant of this strategy is staged deployment/slot swapping. The scope of the deployment (or scope of isolation) in this case is the entire application, meaning we keep a complete copy of the whole system on standby. In this practice, we always have two copies of our application infrastructure, even when an upgrade is not in progress. One environment is production, and the other is pre-production, usually called ‘stage’. The stage environment can be used regularly for QA/UAT/reproductions/CI. We always deploy to stage, never to production; therefore, stage is always ahead of prod, representing the latest candidate for production. As in regular blue/green, we test and validate the stage environment without affecting prod, and once we are ready to deploy, we simply swap the routing between the two environments, so that stage becomes prod and prod becomes stage. This is usually (but not necessarily) done by updating DNS records. Again, we can easily swap back if something goes wrong.
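One way the swap can be implemented on Kubernetes, instead of DNS updates, is by repointing a Service selector between the two environments. A hedged sketch with the Kubernetes Python client, using illustrative service names and labels:

```python
# A hedged sketch of the cut-over step: repoint a Service's selector
# from the "blue" pods to the "green" pods. Names are placeholders.
from kubernetes import client, config

config.load_kube_config()

patch = {"spec": {"selector": {"app": "codecov-web", "version": "green"}}}

client.CoreV1Api().patch_namespaced_service(
    name="codecov-web", namespace="codecov", body=patch
)
# Swapping back is the same call with version: "blue".
```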
