Reference architectures
DETAILS: Tier: Free, Premium, Ultimate. Offering: Self-managed.
The GitLab reference architectures provide recommended scalable and elastic deployments as starting points for target loads. They are designed and tested by the GitLab Test Platform and Support teams.
Available reference architectures
The following reference architectures are available as recommended starting points for your environment.
The architectures are named in terms of peak load, based on user count or requests per second (RPS). RPS is calculated based on average real data.
NOTE: Each architecture is designed to be scalable and elastic. You can adjust it upwards or downwards based on your workload, for example in known heavy scenarios such as large monorepos or notable additional workloads.
For details about what each reference architecture is tested against, see the Testing Methodology section of each page.
GitLab package (Omnibus)
The following is the list of Linux package-based reference architectures:
- Up to 20 RPS or 1,000 users (API: 20 RPS, Web: 2 RPS, Git (Pull): 2 RPS, Git (Push): 1 RPS)
- Up to 40 RPS or 2,000 users (API: 40 RPS, Web: 4 RPS, Git (Pull): 4 RPS, Git (Push): 1 RPS)
- Up to 60 RPS or 3,000 users (API: 60 RPS, Web: 6 RPS, Git (Pull): 6 RPS, Git (Push): 1 RPS)
- Up to 100 RPS or 5,000 users (API: 100 RPS, Web: 10 RPS, Git (Pull): 10 RPS, Git (Push): 2 RPS)
- Up to 200 RPS or 10,000 users (API: 200 RPS, Web: 20 RPS, Git (Pull): 20 RPS, Git (Push): 4 RPS)
- Up to 500 RPS or 25,000 users (API: 500 RPS, Web: 50 RPS, Git (Pull): 50 RPS, Git (Push): 10 RPS)
- Up to 1000 RPS or 50,000 users (API: 1000 RPS, Web: 100 RPS, Git (Pull): 100 RPS, Git (Push): 20 RPS)
Cloud native hybrid
The following is a list of Cloud Native Hybrid reference architectures, where select recommended components can be run in Kubernetes:
- Up to 40 RPS or 2,000 users (API: 40 RPS, Web: 4 RPS, Git (Pull): 4 RPS, Git (Push): 1 RPS)
- Up to 60 RPS or 3,000 users (API: 60 RPS, Web: 6 RPS, Git (Pull): 6 RPS, Git (Push): 1 RPS)
- Up to 100 RPS or 5,000 users (API: 100 RPS, Web: 10 RPS, Git (Pull): 10 RPS, Git (Push): 2 RPS)
- Up to 200 RPS or 10,000 users (API: 200 RPS, Web: 20 RPS, Git (Pull): 20 RPS, Git (Push): 4 RPS)
- Up to 500 RPS or 25,000 users (API: 500 RPS, Web: 50 RPS, Git (Pull): 50 RPS, Git (Push): 10 RPS)
- Up to 1000 RPS or 50,000 users (API: 1000 RPS, Web: 100 RPS, Git (Pull): 100 RPS, Git (Push): 20 RPS)
Before you start
First, consider whether a self-managed approach is the right choice for you and your requirements.
Running any application in production is complex, and the same applies for GitLab. While we aim to make this as smooth as possible, there are still the general complexities based on your design. Typically you have to manage all aspects such as hardware, operating systems, networking, storage, security, GitLab itself, and more. This includes both the initial setup of the environment and the longer term maintenance.
You must have a working knowledge of running and maintaining applications in production if you decide to go down this route. If you aren't in this position, our Professional Services team offers implementation services. If you want a more managed solution long term, you can explore our other offerings such as GitLab SaaS or GitLab Dedicated.
If you are considering the self-managed approach, we encourage you to read through this page in full.
Deciding which architecture to start with
The reference architectures are designed to strike a balance between three important factors: performance, resilience, and cost. They are designed to make it easier to set up GitLab at scale. However, it can still be a challenge to know which one meets your requirements and where to start accordingly.
As a general guide, the more performant and/or resilient you want your environment to be, the more complex it is.
This section explains the things to consider when picking a reference architecture.
Expected load (RPS or user count)
The right architecture size depends primarily on your environment's expected peak load. The most objective measure of this load is through peak Requests per Second (RPS) coming into the environment.
Each architecture is designed to handle specific RPS targets for different types of requests (API, Web, Git). These details are described in the Testing Methodology section on each page.
Finding out the RPS can depend notably on the specific environment setup and monitoring stack. Some potential options include:
- GitLab Prometheus with queries like sum(irate(gitlab_transaction_duration_seconds_count{controller!~'HealthController|MetricsController|'}[1m])) by (controller, action).
- Other monitoring solutions.
- Load Balancer statistics.
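If your load balancer or another monitoring source can export request timestamps, one rough way to estimate peak RPS is to count requests per one-second bucket and take the maximum. The following is a minimal sketch under that assumption. The `peak_rps` helper and the sample data are hypothetical, and parsing your specific log format is left out.

```python
from collections import Counter
from datetime import datetime


def peak_rps(timestamps):
    """Return the highest number of requests seen in any one-second bucket."""
    # Truncate each timestamp to whole seconds and count requests per bucket.
    buckets = Counter(ts.replace(microsecond=0) for ts in timestamps)
    return max(buckets.values(), default=0)


# Hypothetical sample: three requests in one second, one in the next.
sample = [
    datetime(2024, 1, 1, 12, 0, 0, 100000),
    datetime(2024, 1, 1, 12, 0, 0, 500000),
    datetime(2024, 1, 1, 12, 0, 0, 900000),
    datetime(2024, 1, 1, 12, 0, 1, 200000),
]
print(peak_rps(sample))  # 3
```

For sizing purposes, run this over a period that includes your busiest hours, and split the input by request type (API, Web, Git) if your logs allow it.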
If you can't determine your RPS, we provide an alternative sizing method based on equivalent User Count by Load Category. This count is mapped to typical RPS values, considering both manual and automated usage.
Initial sizing guide
To determine which architecture to pick for the expected load, see the following initial sizing guide table:
Load Category | API RPS | Web RPS | Git Pull RPS | Git Push RPS | Typical User Count | Reference Architecture |
---|---|---|---|---|---|---|
X Small | 20 | 2 | 2 | 1 | 1,000 | Up to 20 RPS or 1,000 users |
Small | 40 | 4 | 4 | 1 | 2,000 | Up to 40 RPS or 2,000 users |
Medium | 60 | 6 | 6 | 1 | 3,000 | Up to 60 RPS or 3,000 users |
Large | 100 | 10 | 10 | 2 | 5,000 | Up to 100 RPS or 5,000 users |
X Large | 200 | 20 | 20 | 4 | 10,000 | Up to 200 RPS or 10,000 users |
2X Large | 500 | 50 | 50 | 10 | 25,000 | Up to 500 RPS or 25,000 users |
3X Large | 1000 | 100 | 100 | 20 | 50,000 | Up to 1000 RPS or 50,000 users |
NOTE: Before you select an initial architecture, review this section thoroughly. Consider other factors such as High Availability (HA) or use of large monorepos, as they may impact the choice beyond just RPS or user count.
NOTE: After you select an initial reference architecture, you can scale up and down according to your needs if your metrics support it.
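The lookup in the initial sizing guide can be sketched as a small function. This is only an illustration of reading the table: the `pick_load_category` helper is hypothetical, it picks the smallest category whose targets cover every value you pass in, and it does not account for the other factors discussed in this section, such as HA or large monorepos.

```python
# Rows copied from the initial sizing guide table:
# (load category, API RPS, Web RPS, Git pull RPS, Git push RPS, typical users)
SIZING_GUIDE = [
    ("X Small", 20, 2, 2, 1, 1_000),
    ("Small", 40, 4, 4, 1, 2_000),
    ("Medium", 60, 6, 6, 1, 3_000),
    ("Large", 100, 10, 10, 2, 5_000),
    ("X Large", 200, 20, 20, 4, 10_000),
    ("2X Large", 500, 50, 50, 10, 25_000),
    ("3X Large", 1000, 100, 100, 20, 50_000),
]


def pick_load_category(api=0, web=0, git_pull=0, git_push=0, users=0):
    """Return the smallest category whose targets cover every given peak value."""
    observed = (api, web, git_pull, git_push, users)
    for name, *targets in SIZING_GUIDE:
        if all(value <= target for value, target in zip(observed, targets)):
            return name
    return None  # Beyond the largest listed architecture.


print(pick_load_category(api=55, web=5, git_pull=3, git_push=1))  # Medium
print(pick_load_category(users=2_000))  # Small
```

Pass only the values you actually know. Unknown values default to zero and therefore do not influence the result.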
Standalone (non-HA)
For environments serving 2,000 or fewer users, we recommend a standalone approach by deploying a non-HA, single or multi-node environment. With this approach, you can employ strategies such as automated backups for recovery. These strategies provide a good level of recovery time objective (RTO) or recovery point objective (RPO) while avoiding the complexities that come with HA.
With standalone setups, especially single node environments, various options are available for installation and management. The options include the ability to deploy directly by using select cloud provider marketplaces that reduce the complexity a little further.
High Availability (HA)
High Availability ensures every component in the GitLab setup can handle failures through various mechanisms. However, achieving this is complex, and the environments required can be sizable.
For environments serving 3,000 or more users, we generally recommend using an HA strategy. At this level, outages have a bigger impact because they affect more users. All the architectures in this range have HA built in by design for this reason.
Do you need High Availability (HA)?
As mentioned previously, achieving HA comes at a cost. The environment requirements are sizable as each component needs to be multiplied, which comes with additional actual and maintenance costs.
For many of our customers with fewer than 3,000 users, we've found that a backup strategy is sufficient and even preferable. While this does have a slower recovery time, it also means you have a much smaller architecture and lower maintenance costs as a result.
As a general guideline, employ HA only in the following scenarios:
- When you have 3,000 or more users.
- When GitLab being down would critically impact your workflow.
Scaled-down High Availability (HA) approach
If you still need HA for fewer users, you can achieve it with an adjusted 3K architecture.
Zero-downtime upgrades
Zero-downtime upgrades are available for standard environments with HA (Cloud Native Hybrid is not supported). This allows for an environment to stay up during an upgrade. However, this process is more complex as a result and has some limitations as detailed in the documentation.
When going through this process, it's worth noting that there may still be brief moments of downtime when the HA mechanisms take effect.
In most cases, the downtime required for doing an upgrade shouldn't be substantial. Use this approach only if it's a key requirement for you.
Cloud Native Hybrid (Kubernetes HA)
As an additional layer of HA resilience, you can deploy select components in Kubernetes, known as a Cloud Native Hybrid reference architecture. For stability reasons, stateful components such as Gitaly cannot be deployed in Kubernetes.
Cloud Native Hybrid is an alternative and more advanced setup compared to a standard reference architecture. Running services in Kubernetes is complex. Use this setup only if you have strong working knowledge and experience in Kubernetes.
GitLab Geo (Cross Regional Distribution / Disaster Recovery)
With GitLab Geo, you can achieve distributed environments in different regions with a full Disaster Recovery (DR) setup in place. GitLab Geo requires at least two separate environments:
- One primary site.
- One or more secondary sites that serve as replicas.
If the primary site becomes unavailable, you can fail over to one of the secondary sites.
Use this advanced and complex setup only if DR is a key requirement for your environment. You must also make additional decisions on how each site is configured. For example, you must decide whether each secondary site uses the same architecture as the primary, and whether each site is configured for HA.
Large monorepos / Additional workloads
Large monorepos or significant additional workloads can affect the performance of the environment notably. Some adjustments may be required depending on the context.
If this situation applies to you, reach out to your Customer Success Manager or our Support team for further guidance.
Cloud provider services
For all the previously described strategies, you can run select GitLab components, such as the PostgreSQL database or Redis, on equivalent cloud provider services.
For more information, see the recommended cloud providers and services.
Decision Tree
Read through the above guidance in full first before you refer to the following decision tree.
%%{init: { 'theme': 'base' } }%%
graph TD
L0A(<b>What Reference Architecture should I use?</b>)
L1A(<b>What is your <a href=#expected-load-rps--user-count>expected load</a>?</b>)
L2A("60 RPS / 3,000 users or more?")
L2B("40 RPS / 2,000 users or less?")
L3A("<a href=#do-you-need-high-availability-ha>Do you need HA?</a><br>(or zero-downtime upgrades)")
L3B[Do you have experience with<br/>and want additional resilience<br/>with select components in Kubernetes?]
L4A><b>Recommendation</b><br><br>60 RPS / 3,000 user architecture with HA<br>and supported reductions]
L4B><b>Recommendation</b><br><br>Architecture closest to <a href=#expected-load-rps--user-count>expected load</a> with HA]
L4C><b>Recommendation</b><br><br>Cloud Native Hybrid architecture<br>closest to <a href=#expected-load-rps--user-count>expected load</a>]
L4D>"<b>Recommendation</b><br><br>Standalone 20 RPS / 1,000 user or 40 RPS / 2,000 user<br/>architecture with Backups"]
L0A --> L1A
L1A --> L2A
L1A --> L2B
L2A -->|Yes| L3B
L3B -->|Yes| L4C
L3B -->|No| L4B
L2B --> L3A
L3A -->|Yes| L4A
L3A -->|No| L4D
L5A("<a href=#gitlab-geo-cross-regional-distribution--disaster-recovery>Do you need cross regional distribution<br/> or disaster recovery?</a>") --> |Yes| L6A><b>Additional Recommendation</b><br><br> GitLab Geo]
L4A ~~~ L5A
L4B ~~~ L5A
L4C ~~~ L5A
L4D ~~~ L5A
L5B("Do you have <a href=#large-monorepos>Large Monorepos</a> or expect<br/> to have substantial <a href=#additional-workloads>additional workloads</a>?") --> |Yes| L6B><b>Additional Recommendation</b><br><br> Contact Customer Success Manager or Support]
L4A ~~~ L5B
L4B ~~~ L5B
L4C ~~~ L5B
L4D ~~~ L5B
classDef default fill:#FCA326
linkStyle default fill:none,stroke:#7759C2
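The main branches of the decision tree can also be read as a short function. This is a hedged sketch rather than an official tool: the `recommend` and `additional_recommendations` helpers are hypothetical, and loads that fall between the two thresholds shown in the tree (for example, 50 RPS) are treated here as the smaller branch, which is an assumption.

```python
def recommend(rps, users, needs_ha=False, wants_kubernetes=False):
    """Walk the main branch of the decision tree and return a recommendation."""
    if rps >= 60 or users >= 3_000:
        # Larger environments: HA is built in, optionally as Cloud Native Hybrid.
        if wants_kubernetes:
            return "Cloud Native Hybrid architecture closest to expected load"
        return "Architecture closest to expected load with HA"
    if needs_ha:
        return "60 RPS / 3,000 user architecture with HA and supported reductions"
    return "Standalone 20 RPS / 1,000 user or 40 RPS / 2,000 user architecture with backups"


def additional_recommendations(needs_geo=False, heavy_workloads=False):
    """Return the extra recommendations from the two side questions."""
    extras = []
    if needs_geo:
        extras.append("GitLab Geo")
    if heavy_workloads:
        extras.append("Contact Customer Success Manager or Support")
    return extras


print(recommend(rps=30, users=1_500))
print(additional_recommendations(needs_geo=True))
```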
Requirements
Before implementing a reference architecture, see the following requirements and guidance.
Supported CPUs
The architectures are built and tested across various cloud providers, primarily GCP and AWS. To ensure the widest range of compatibility, CPU targets are intentionally set to the lowest common denominator across these platforms:
Depending on other requirements such as memory or network bandwidth and cloud provider availability, different machine types are used accordingly throughout the architectures. We expect that the target CPUs above perform well.
If you want, you can select a newer machine type series and have improved performance as a result.
Additionally, ARM CPUs are supported for Linux package environments and for any cloud provider services.
NOTE: Any "burstable" instance types are not recommended due to inconsistent performance.
Supported disk types
Most standard disk types are expected to work for GitLab. However, be aware of the following specific call-outs:
- Gitaly requires at least 8,000 input/output operations per second (IOPS) for read operations, and 2,000 IOPS for write operations.
- We don't recommend the use of any disk types that are "burstable" due to inconsistent performance.
Other disk types are expected to work with GitLab. Choose based on your requirements such as durability or cost.
Supported infrastructure
GitLab should run on most infrastructures such as reputable cloud providers (AWS, GCP, Azure) and their services, or self-managed (ESXi) that meet both:
- The specifications detailed in each architecture.
- Any requirements in this section.
However, this does not guarantee compatibility with every potential permutation.
See Recommended cloud providers and services for more information.
Large Monorepos
The architectures were tested with repositories of varying sizes that follow best practices.
However, large monorepos (several gigabytes or more) can significantly impact the performance of Git and in turn the environment itself. Their presence and how they are used can put a significant strain on the entire system from Gitaly to the underlying infrastructure.
The performance implications are largely software in nature. Additional hardware resources lead to diminishing returns.
WARNING: If this applies to you, we strongly recommend you follow the linked documentation and reach out to your Customer Success Manager or our Support team for further guidance.
Large monorepos come with notable cost. If you have such a repository, follow this guidance to ensure good performance and to keep costs in check:
- Optimize the large monorepo. Using features such as LFS to not store binaries, and other approaches for reducing repository size, can dramatically improve performance and reduce costs.
- Depending on the monorepo, increased environment specifications may be required to compensate. Gitaly might require additional resources along with Praefect, GitLab Rails, and Load Balancers. This depends on the monorepo itself and its usage.
- When the monorepo is significantly large (20 gigabytes or more), further strategies may be required, such as even higher specifications or, in some cases, a separate Gitaly backend for the monorepo alone.
- Network and disk bandwidth are another potential consideration with large monorepos. In very heavy cases, bandwidth saturation is possible if there's a high number of concurrent clones (such as with CI). Reduce full clones wherever possible in this scenario. Otherwise, additional environment specifications may be required to increase bandwidth. This differs based on cloud providers.
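To get a feel for when concurrent clones might saturate bandwidth, a back-of-the-envelope estimate can help. The following sketch assumes every clone transfers the full repository size and ignores compression, protocol overhead, and caching, so treat the result as a rough upper bound. The `clone_bandwidth_gbps` helper and the numbers are hypothetical.

```python
def clone_bandwidth_gbps(repo_size_gb, concurrent_clones, window_seconds):
    """Average bandwidth in gigabits per second needed to serve the clones."""
    # Gigabytes to gigabits (x8), multiplied by the number of clones.
    total_gigabits = repo_size_gb * 8 * concurrent_clones
    return total_gigabits / window_seconds


# Hypothetical example: 50 CI jobs each fully cloning a 20 GB repository
# within the same 10-minute window.
print(round(clone_bandwidth_gbps(20, 50, 600), 1))  # 13.3
```

Compare the result against the network bandwidth of your Gitaly and load balancer machines to judge whether reducing full clones or increasing specifications is the better option.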
Additional workloads
These architectures have been designed and tested for standard GitLab setups based on real data.
However, additional workloads can multiply the impact of operations by triggering follow-up actions. You may need to adjust the suggested specifications to compensate if you use:
- Security software on the nodes.
- Hundreds of concurrent CI jobs for large repositories.
- Custom scripts that run at high frequency.
- Integrations in many large projects.
- Server hooks.
- System hooks.
Generally, you should have robust monitoring in place to measure the impact of any additional workloads and to inform any changes that need to be made. Reach out to your Customer Success Manager or our Support team for further guidance.
Load Balancers
The architectures make use of up to two load balancers depending on the class:
- External load balancer - Serves traffic to any external facing components, primarily Rails.
- Internal load balancer - Serves traffic to select internal components that are deployed in an HA fashion such as Praefect or PgBouncer.
The specifics of which load balancer to use, or its exact configuration, are beyond the scope of GitLab documentation. The most common options are to set up load balancers on machine nodes or to use a service such as one offered by cloud providers. If deploying a Cloud Native Hybrid environment, the charts can handle the external load balancer setup by using Kubernetes Ingress.
Each architecture class includes a recommended base machine size to deploy directly on machines. However, the sizes may need adjustment based on factors such as the chosen load balancer and expected workload. Note that machines can have varying network bandwidth, which should also be taken into consideration.
The following sections provide additional guidance for load balancers.
Balancing algorithm
To ensure equal spread of calls to the nodes and good performance, use a least-connection-based load balancing algorithm or equivalent wherever possible.
We don't recommend the use of round-robin algorithms as they are known to not spread connections equally in practice.
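As an illustration of the idea behind least-connection balancing, the following sketch picks the backend with the fewest active connections for each new request. It is not the implementation of any particular load balancer, and the node names and counts are hypothetical.

```python
def pick_least_connections(active_connections):
    """Return the node that currently has the fewest active connections."""
    return min(active_connections, key=active_connections.get)


# Hypothetical node names and current connection counts.
nodes = {"rails-1": 12, "rails-2": 7, "rails-3": 9}

target = pick_least_connections(nodes)
nodes[target] += 1  # The new connection is now counted against that node.
print(target)  # rails-2
```

Because the choice depends on the current connection counts, a node that is busy with long-running requests receives fewer new connections. A round-robin algorithm does not take this into account.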
Network Bandwidth
The total network bandwidth available to a load balancer when deployed on a machine can vary notably across cloud providers. Some cloud providers, like AWS, may operate on a burst system with credits to determine the bandwidth at any time.
The required network bandwidth for your load balancers depends on factors such as data shape and workload. The recommended base sizes for each architecture class have been selected based on real data. However, in some scenarios such as consistent clones of large monorepos, the sizes may need to be adjusted accordingly.
No swap
Swap is not recommended in the reference architectures. It's a failsafe that impacts performance greatly. The architectures are designed to have enough memory in most cases to avoid the need for swap.
Praefect PostgreSQL
Praefect requires its own database server. To achieve full HA, a third-party PostgreSQL database solution is required.
We hope to offer a built-in solution for these restrictions in the future. In the meantime, a non-HA PostgreSQL server can be set up using the Linux package as the specifications reflect. For more details, see the following issues:
Recommended cloud providers and services
NOTE: The following lists are non-exhaustive. Other cloud providers not listed here may work with the same specifications, but they have not been validated. For the cloud provider services not listed here, use caution, as each implementation can be notably different. Test thoroughly before using them in production.
The following architectures are recommended for the following cloud providers based on testing and real life usage:
Reference Architecture | GCP | AWS | Azure | Bare Metal |
---|---|---|---|---|
Linux package | | | | |
Cloud Native Hybrid | | | | |
Additionally, the following cloud provider services are recommended for use as part of the architectures:
Cloud Service | GCP | AWS | Azure | Bare Metal |
---|---|---|---|---|
Object Storage | | | | |
Database | | | | |
Redis | | | | |
- For optimal performance, especially in larger environments (500 RPS / 25k users or higher), use the Enterprise Plus edition for GCP Cloud SQL. You might need to adjust the maximum connections higher than the service's defaults, depending on your workload.
- To ensure good performance, deploy the Premium tier of Azure Cache for Redis.
Best practices for the database services
Use an external database service that runs a standard, performant, and supported PostgreSQL version.
If you choose to use a third-party external service:
- The HA Linux package PostgreSQL setup encompasses PostgreSQL, PgBouncer, and Consul. None of these components are required when using a third-party external service.
- The number of nodes required for HA may vary depending on the service. The requirements for one deployment may vary from those for Linux package installations.
- For optimal performance, enable Database Load Balancing with Read Replicas. Match the node counts to those used in standard Linux package deployments. This approach is particularly important for larger environments (more than 200 requests per second or 10,000+ users).
- Ensure that if a pooler is included in a service, it can handle the total load without bottlenecks. For example, Azure Database for PostgreSQL flexible server can optionally deploy a PgBouncer pooler in front of the database. However, PgBouncer is single threaded, which may cause bottlenecks under heavy load. To mitigate this issue, you can use database load balancing to distribute the pooler across multiple nodes.
- To use GitLab Geo, the service should support cross-region replication.
Unsupported database services
The following database cloud provider services are not recommended due to lack of support or known issues:
- Amazon Aurora is incompatible and not supported. For more details, see 14.4.0.
- Azure Database for PostgreSQL Single Server is not supported as the service is now deprecated and runs on an unsupported version of PostgreSQL. It also has notable performance and stability issues.
- Google AlloyDB and Amazon RDS Multi-AZ DB cluster are not tested and are not recommended. Neither solution is expected to work with GitLab Geo.
  - Amazon RDS Multi-AZ DB instance is a separate product and is supported.
Best practices for the Redis services
Use an external Redis service that runs a standard, performant, and supported version. Do not run the Redis service in Cluster mode as it is unsupported by GitLab.
Redis is primarily single threaded. For environments sized at 200 RPS or 10,000 users and above, separate the instances into cache and persistent data to achieve optimum performance at this scale.
Best practices for object storage
GitLab has been tested against various object storage providers that are expected to work.
Use a reputable solution that has full S3 compatibility.
Deviating from the suggested reference architectures
The further away you move from the reference architectures, the harder it is to get support. With each deviation, you introduce a layer of complexity that complicates troubleshooting potential issues.
These architectures use the official Linux packages or Helm Charts to install and configure the various components. The components are installed on separate machines (virtualized or Bare Metal). Machine hardware requirements are listed in the Configuration column. Equivalent VM standard sizes are listed in the GCP/AWS/Azure columns of each available architecture.
You can run GitLab components on Docker, including Docker Compose. Docker is well supported and provides consistent specifications across environments.
However, it is still an additional layer and might add some support complexities. For example, you might not be able to run strace in containers.
Unsupported designs
While we try to have a good range of support for GitLab environment designs, certain approaches don't work effectively. The following sections detail these unsupported approaches.
Stateful components in Kubernetes
Running stateful components in Kubernetes, such as Gitaly Cluster, is not supported.
Gitaly Cluster is only supported on conventional virtual machines. Kubernetes strictly limits memory usage. However, the memory usage of Git is unpredictable, which can cause sporadic out of memory (OOM) termination of Gitaly pods. The OOM termination leads to significant disruptions and potential data loss. Hence, Gitaly is not tested or supported in Kubernetes. For more information, see epic 6127.
This also applies to other stateful components such as Postgres and Redis. You can use other supported cloud provider services, unless specifically called out as unsupported.
Autoscaling of stateful nodes
As a general guidance, only stateless components of GitLab can be run in autoscaling groups, namely GitLab Rails and Sidekiq. Other components that have state, such as Gitaly, are not supported in this fashion. For more information, see issue 2997.
This also applies to other stateful components such as Postgres and Redis. You can use other supported cloud provider services, unless specifically called out as unsupported.
Cloud Native Hybrid setups are generally preferred over autoscaling groups. Kubernetes better handles components that can only run on one node, such as database migrations and Mailroom.
Deploying one environment over multiple data centers
GitLab doesn't support deploying a single environment across multiple data centers. These setups can result in significant issues, such as network latency or split-brain scenarios if a data center fails.
Several GitLab components require an odd number of nodes to function correctly, such as Consul, Redis Sentinel, and Praefect. Splitting these components across multiple data centers can negatively impact their functionality.
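The reasoning behind odd node counts can be shown with simple arithmetic. The sketch below assumes a plain majority vote, which is a simplification of how each component actually elects a leader. The helper names are hypothetical.

```python
def quorum_size(nodes):
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can be lost while a majority remains."""
    return nodes - quorum_size(nodes)


def survives_loss(total_nodes, lost_nodes):
    """True if the remaining nodes can still form a majority."""
    return total_nodes - lost_nodes >= quorum_size(total_nodes)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

Under this assumption, a fourth node adds cost but tolerates no more failures than three. It also shows the data center problem: with three nodes split two and one across two data centers, `survives_loss(3, 2)` is `False`, so losing the data center that holds two nodes leaves the remaining node without a quorum.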
This limitation applies to all potential GitLab environment setups, including Cloud Native Hybrid alternatives.
For deploying GitLab over multiple data centers or regions, we offer GitLab Geo as a comprehensive solution.
Validation and test results
The Test Platform team does regular smoke and performance tests for these architectures to ensure they remain compliant.
Why we perform the tests
The Quality Department measures and improves the performance of GitLab. They create and validate architectures to ensure reliable configurations for self-managed users.
For more information, see our handbook page.
How we perform the tests
Testing occurs against all architectures and cloud providers in an automated and ad-hoc fashion. Two tools are used for testing:
- The GitLab Environment Toolkit Terraform and Ansible scripts for building the environments.
- The GitLab Performance Tool for performance testing.
Network latency on the test environments between components on all cloud providers was measured at <5 ms. This is an observation, not a recommendation.
We aim to have a test-smart approach where the architectures tested cover a good range and the results can also apply to others. Testing focuses on a 10k Linux package installation on GCP. This approach serves as a reliable indicator for other architectures, cloud providers, and Cloud Native Hybrids.
The architectures are cross-platform. Everything runs on VMs through the Linux package. Testing occurs primarily on GCP. However, they perform similarly on hardware with equivalent specifications on other cloud providers or if run on-premises (bare-metal).
GitLab tests these architectures using the GitLab Performance Tool. We use specific coded workloads based on sample customer data. Select the architecture that matches your scale.
Each endpoint type is tested with the following number of RPS per 1,000 users:
- API: 20 RPS
- Web: 2 RPS
- Git (Pull): 2 RPS
- Git (Push): 0.4 RPS (rounded to the nearest integer)
The above RPS targets were selected based on real customer data of total environmental loads corresponding to the user count, including CI and other workloads.
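The per-1,000-user rates can be scaled to any user count with a small calculation. This is a sketch, not the Test Platform team's tooling. The floor of 1 RPS is an assumption made here so that the result matches the 1 RPS Git push target listed for the 1,000-user architecture, where 0.4 would otherwise round to zero.

```python
# Rates per 1,000 users, as listed above.
RPS_PER_1000_USERS = {"API": 20, "Web": 2, "Git (Pull)": 2, "Git (Push)": 0.4}


def rps_targets(users):
    """Scale the per-1,000-user rates to the given user count."""
    targets = {}
    for endpoint, rate in RPS_PER_1000_USERS.items():
        # Round to the nearest integer, with an assumed minimum of 1 RPS.
        targets[endpoint] = max(1, round(rate * users / 1000))
    return targets


print(rps_targets(10_000))
# {'API': 200, 'Web': 20, 'Git (Pull)': 20, 'Git (Push)': 4}
```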
How to interpret the results
NOTE: Read our blog post on how our QA team leverages the GitLab performance testing tool.
Testing is done publicly, and all results are shared.
The following table details the testing done against the architectures along with the frequency and results. Additional testing is continuously evaluated, and the table is updated accordingly.
Reference Architecture | GCP (* also proxy for Bare-Metal): Linux package | GCP: Cloud Native Hybrid | AWS: Linux package | AWS: Cloud Native Hybrid | Azure: Linux package | |
---|---|---|---|---|---|---|
Up to 20 RPS or 1,000 users | Weekly | |||||
Up to 40 RPS or 2,000 users | Weekly | Planned | ||||
Up to 60 RPS or 3,000 users | Weekly | Weekly | ||||
Up to 100 RPS or 5,000 users | Weekly | |||||
Up to 200 RPS or 10,000 users | Daily | Weekly | Weekly | Weekly | ||
Up to 500 RPS or 25,000 users | Weekly | |||||
Up to 1000 RPS or 50,000 users | Weekly |
Cost calculator templates
The following table lists initial cost templates for the different architectures across GCP, AWS, and Azure. These costs were calculated using each cloud provider's official calculator.
However, be aware of the following caveats:
- The table lists only rough estimate compute templates for Linux package architectures.
- They do not take into account dynamic elements such as disk, network, or object storage, which can notably impact costs.
- Due to the nature of Cloud Native Hybrid, it's not possible to give a static cost calculation for that deployment.
- Committed use discounts apply if they are set as default in the cloud provider calculator.
- Bare metal costs are also not included here as they vary depending on each configuration.
For an accurate estimate of costs for your environment, take the closest template and adjust it to match your specifications and expected usage.
Maintaining a reference architecture environment
Maintaining a reference architecture environment is generally the same as any other GitLab environment.
In this section you can find links to documentation for relevant areas and specific architecture notes.
Scaling an environment
The reference architectures are designed as a starting point, and are elastic and scalable throughout. You might want to adjust the environment for your specific needs after deployment for reasons such as additional performance capacity or reduced costs. This behavior is expected. Scaling can be done iteratively or wholesale to the next architecture size, if metrics suggest that a component is exhausted.
NOTE: If a component is continuously exhausting its given resources, reach out to our Support team before performing any significant scaling.
For most components, vertical and horizontal scaling can be applied as usual. However, before doing so, be aware of the following caveats:
- When scaling Puma or Sidekiq vertically, the number of workers must be adjusted to use the additional specifications. Puma is scaled automatically on the next reconfigure. However, you might have to change Sidekiq configuration beforehand.
- Redis and PgBouncer are primarily single threaded. If these components are seeing CPU exhaustion, they may need to be scaled out horizontally.
- The Consul, Redis Sentinel, and Praefect components require an odd number of nodes for a voting quorum when deployed in HA form.
- Scaling certain components significantly can result in notable knock on effects that affect the performance of the environment. For more guidance, see Scaling knock on effects.
Conversely, if you have robust metrics in place that show the environment is over-provisioned, you can scale downwards. You should take an iterative approach when scaling downwards, to ensure there are no issues.
Scaling knock on effects
In some cases, scaling a component significantly may result in knock on effects for downstream components, impacting performance. The architectures are designed with balance in mind to ensure components that depend on each other are congruent in terms of specifications. Scaling a component notably may result in additional throughput being passed to the components it depends on. As a result, they may need to be scaled as well.
NOTE: The architectures have been designed to have elasticity to accommodate an upstream component being scaled. However, reach out to our Support team before you make any significant changes to your environment to be safe.
The following components can impact others when they have been significantly scaled:
- Puma and Sidekiq - Notable scale ups of either Puma or Sidekiq workers will result in higher concurrent connections to the internal load balancer, PostgreSQL (via PgBouncer if present), Gitaly (via Praefect if present), and Redis.
- Redis is primarily single-threaded. In some cases, you may need to split Redis into separate instances (for example, cache and persistent) if the increased throughput causes CPU exhaustion in a combined cluster.
- PgBouncer is also single threaded, but scaling it out might add a new pool, which in turn might increase the total connections to Postgres. It's strongly recommended to only do this if you have experience in managing Postgres connections, and to seek assistance if in doubt.
- Gitaly Cluster / PostgreSQL - A notable scale out of additional nodes can have a detrimental effect on the HA system and performance due to increased replication calls to the primary node.
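The PgBouncer caveat above can be checked with a rough connection budget before adding nodes. This is a sketch under simplifying assumptions: each PgBouncer instance is assumed to open up to `pool_size` server connections per database, and every name and number here is illustrative rather than a recommended setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolPlan:
    """Hypothetical description of a PgBouncer layout."""
    pgbouncer_instances: int
    pool_size: int       # assumed server connections each instance may open per database
    databases: int = 1

    def backend_connections(self) -> int:
        """Upper bound on connections this layout can open to Postgres."""
        return self.pgbouncer_instances * self.pool_size * self.databases


def fits(plan: PoolPlan, max_connections: int, reserved: int = 0) -> bool:
    """True if the plan stays within the Postgres limit, keeping `reserved` slots free."""
    return plan.backend_connections() <= max_connections - reserved


if __name__ == "__main__":
    limit = 500  # illustrative Postgres maximum connections value
    for instances in (3, 5):
        plan = PoolPlan(pgbouncer_instances=instances, pool_size=100)
        print(instances, plan.backend_connections(), fits(plan, limit, reserved=50))
```

In this example, going from three to five instances raises the upper bound from 300 to 500 connections, which no longer leaves the reserved headroom. Real pool behavior depends on the actual PgBouncer and Postgres settings in use.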
Scaling from a non-HA to an HA architecture
In most cases, only vertical scaling is required to increase an environment's resources. However, if you are moving to an HA environment, additional steps are required for the following components to switch over to their HA forms.
For more information, see the following documentation:
- Redis to multi-node Redis w/ Redis Sentinel
- Postgres to multi-node Postgres w/ Consul + PgBouncer
- Gitaly to Gitaly Cluster w/ Praefect
Upgrades
Upgrading a reference architecture environment is the same as upgrading any other GitLab environment. The main Upgrade GitLab section has detailed steps on how to approach this. Zero-downtime upgrades are also available.
NOTE: You should upgrade a reference architecture in the same order as you created it.
Monitoring
You can monitor your infrastructure and GitLab using various options. See the selected monitoring solution's documentation for more information.
NOTE: The GitLab application is bundled with Prometheus and various Prometheus-compatible exporters that can be hooked into your solution.
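As one way to check whether a node is approaching CPU exhaustion, a monitoring script could query the Prometheus HTTP API. The sketch below only builds the request URL for an instant query. The base URL is a hypothetical placeholder, and the PromQL expression assumes that node exporter CPU metrics are being scraped in your environment.

```python
from urllib.parse import urlencode

# Fraction of CPU time spent busy per instance over the last 5 minutes
# (assumes the node exporter metric node_cpu_seconds_total is available).
CPU_BUSY_RATIO = '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))'


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Hypothetical Prometheus address; replace with your monitoring node.
    print(instant_query_url("http://prometheus.example.internal:9090", CPU_BUSY_RATIO))
```

The resulting URL can be fetched with any HTTP client. Sustained values close to 1 for a component's nodes are the kind of signal the scaling guidance above refers to.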
Update history
The following is a history of notable updates for reference architectures (2021-01-01 onward, descending order). We aim to update it at least once per quarter.
You can find a full history of changes on the GitLab project.
2024:
- 2024-08: Updated Expected Load section with some more examples on how to calculate RPS.
- 2024-08: Updated Redis configuration on 40 RPS or 2k User page to have correct Redis configuration.
- 2024-08: Updated Sidekiq configuration for Prometheus in Monitoring node on 2k.
- 2024-08: Added Next Steps breadcrumb section to the pages to help discoverability of additional features.
- 2024-05: Updated the 60 RPS or 3k User and 100 RPS or 5k User pages to have latest Redis guidance on co-locating Redis Sentinel with Redis itself.
- 2024-05: Renamed Cost to run section to Cost calculator templates to better reflect the calculators are only a starting point and need to be adjusted with specific usage to give more accurate cost estimates.
- 2024-04: Updated recommended sizing for Webservice nodes for Cloud Native Hybrids on GCP. Also adjusted NGINX pod recommendation to be run on Webservice node pool as a DaemonSet.
- 2024-04: Updated 20 RPS / 1,000 User architecture specs to follow recommended memory target of 16 GB.
- 2024-04: Updated Reference Architecture titles to include RPS for further clarity and to help right sizing.
- 2024-02: Updated recommended sizing for Load Balancer nodes if deployed on VMs. Also added notes on network bandwidth considerations.
- 2024-02: Removed the Sidekiq Maximum Concurrency setting in examples as this is deprecated and no longer required to be set explicitly.
- 2024-02: Adjusted the Sidekiq recommendations on 2k to disable Sidekiq on Rails nodes and updated architecture diagram.
- 2024-01: Updated recommendations for Azure for all Reference Architecture sizes and latest cloud services.
2023:
- 2023-12-12: Updated notes on Load Balancers to be more reflective that any reputable offering is expected to work.
- 2023-11-03: Expanded details on what each Reference Architecture is designed for, the testing methodology used and added details on how to scale environments.
- 2023-11-03: Added expanded notes on disk types, object storage and monitoring.
- 2023-10-25: Adjusted Sidekiq configuration example to use Linux Package role.
- 2023-10-15: Adjusted the Sidekiq recommendations to include a separate node for 2k and tweaks to instance type and counts for 3k and 5k.
- 2023-10-08: Added more expanded notes throughout to warn about the use of Large Monorepos and their impacts for increased awareness.
- 2023-10-04: Updated name of Task Runner pod to its new name of Toolbox.
- 2023-10-02: Expanded guidance on using an external service for Redis further, in particular for separated Cache and Persistent services with 10k and up.
- 2023-09-21: Expanded details on the challenges of running Gitaly in Kubernetes.
- 2023-09-20: Removed references to Grafana after deprecation and removal.
- 2023-08-30: Expanded section on Geo under the Decision Tree.
- 2023-08-08: Switched configuration example to use the Sidekiq role for Linux package.
- 2023-08-03: Fixed an AWS Machine type typo for the 50k architecture.
- 2023-06-30: Updated PostgreSQL configuration examples to remove a now unneeded setting and instead use the Linux package default.
- 2023-06-30: Added explicit example on main page that reflects Google Memorystore is recommended.
- 2023-06-11: Fixed IP examples for the 3k and 5k architectures.
- 2023-05-25: Expanded notes on usage of external Cloud Provider Services and the recommendation of separated Redis servers for 10k environments and up.
- 2023-05-03: Updated documentation to reflect correct requirement of Redis 6 instead of 5.
- 2023-04-28: Added a note that Azure Active Directory authentication method is not supported for use with Azure PostgreSQL Flexible service.
- 2023-03-23: Added more details about known unsupported designs.
- 2023-03-16: Updated Redis configuration examples for multi-node to have correct configuration to ensure all components can connect.
- 2023-03-15: Updated Gitaly configuration examples to the new format.
- 2023-03-14: Updated cost estimates to no longer include NFS VMs.
- 2023-02-17: Updated Praefect configuration examples to the new format.
- 2023-02-14: Added examples of what automation may be considered additional workloads.
- 2023-02-13: Added a new before you start section that gives more context about what's involved with running production software self-managed. Also added more details for Standalone setups and cloud provider services in the decision tree section.
- 2023-02-01: Switched to use the more common term "complex" instead of the less known "involved".
- 2023-01-31: Expanded and centralized the requirements section on the main page.
- 2023-01-26: Added notes on migrating Git data from NFS, that object data is still supported on NFS and handling SSH keys correctly across multiple Rails nodes.
2022:
- 2022-12-14: Removed guidance for using NFS for Git data as support for this is now ended with 15.6 or later.
- 2022-12-12: Added a note to clarify the difference between Amazon RDS Multi-AZ DB cluster and instance, with the latter being supported. Also, increase PostgreSQL maximum connections setting to new default of 500.
- 2022-12-12: Updated Sidekiq maximum concurrency configuration to match new default of 20.
- 2022-11-16: Corrected guidance for Praefect and Gitaly in reduced 3k architecture section that an odd number quorum is required.
- 2022-11-15: Added guidance on how to handle GitLab Secrets in Cloud Native Hybrids and further links to the GitLab Charts documentation.
- 2022-11-14: Fixed a typo with Sidekiq configuration for the 10k architecture.
- 2022-11-09: Added guidance on large monorepos and additional workloads impact on performance. Also, expanded Load Balancer guidance around SSL and a recommendation for least connection based routing methods.
- 2022-10-18: Adjusted Object Storage guidance to make it clearer that it's recommended over NFS.
- 2022-10-11: Updated guidance for Azure to recommend up to 2k only due to performance issues.
- 2022-09-27: Added the decision tree section to help users better decide what architecture to use.
- 2022-09-22: Added explicit step to enable Incremental Logging when only Object Storage is being used.
- 2022-09-22: Expanded guidance on recommended cloud providers and services.
- 2022-09-09: Expanded Object Storage guidance and updated that NFS support for Git data ends with 15.6.
- 2022-08-24: Added a clearer note about Gitaly Cluster not being supported in Kubernetes.
- 2022-08-24: Added a section on supported CPUs and types.
- 2022-08-18: Updated architecture tables to be clearer for Object Storage support.
- 2022-08-17: Increased Cloud Native Hybrid pool specifications for 2k architecture to ensure enough resources present for pods. Also, increased Sidekiq worker count.
- 2022-08-02: Added note to use newer Gitaly check command from GitLab 15.0 and later.
- 2022-07-25: Moved the troubleshooting section to a more general location.
- 2022-07-14: Added guidance that Amazon Aurora is no longer compatible and not supported from GitLab 14.4.0 and later.
- 2022-07-07: Added call out note to remove the default section from Gitaly storages configuration as it's required.
- 2022-06-08: Moved Incremental Logging guidance to a separate section.
- 2022-04-29: Expanded testing results section with new regular pipelines.
- 2022-04-26: Updated Praefect configuration to reflect setting name changes.
- 2022-04-15: Added missing setting to enable Object Storage correctly.
- 2022-04-14: Expanded Cloud Native Hybrid guidance with AWS machine types.
- 2022-04-08: Added cost estimates for AWS and Azure.
- 2022-04-06: Updated configuration examples for most components to be correctly included for Prometheus monitoring auto discovery.
- 2022-03-30: Expanded validation and testing results section with clearer language and more detail.
- 2022-03-21: Added a note saying additional specifications may be needed for Gitaly in some scenarios.
- 2022-03-04: Added guidance for preventing the GitLab kas service running on nodes where not required.
- 2022-03-01: Fixed a typo for Praefect TLS port in configuration examples.
- 2022-02-22: Added guidance to enable the Gitaly Pack-objects cache.
- 2022-02-22: Added a general section on recommended Cloud Providers and services.
- 2022-02-14: Added link to a blog post about GPT testing.
- 2022-01-26: Merged testing process and cost estimates into one section with expanded details.
- 2022-01-13: Expanded guidance on recommended Kubernetes platforms.
2021:
- 2021-12-31: Fix typo for 25k Redis AWS machine size.
- 2021-12-28: Add Cloud Provider breakdowns to testing process & results section.
- 2021-12-17: Add more detail to testing process and results section.
- 2021-12-17: Add note on Database Load Balancing requirements when using a modified 3k architecture.
- 2021-12-17: Add diagram for 1k architecture (single node).
- 2021-12-15: Add sections on estimated costs (GCP), testing process and results and further Cloud Provider service details.
- 2021-12-14: Expanded external database service guidance for components and what cloud provider services are recommended.
- 2021-11-24: Added recommendations for Database Load Balancing.
- 2021-11-04: Added more details about testing targets used for the architectures.
- 2021-10-13: Added guidance around optionally enabling Incremental Logging by using Redis.
- 2021-10-07: Updated Sidekiq configuration to include required external_url setting.
- 2021-10-02: Expanded guidance around Gitaly Cluster and Gitaly Sharded.
- 2021-09-29: Added a note on what Cloud Native Hybrid architecture to use with small user counts.
- 2021-09-27: Changed guidance to now co-locate Redis Sentinel beside Redis on the same node.
- 2021-08-18: Added 2k Cloud Native Hybrid architecture.
- 2021-08-04: Added links to performance test results for each architecture.
- 2021-07-30: Fixed the replication settings in PostgreSQL configuration examples to have correct values.
- 2021-07-22: Added 3k Cloud Native Hybrid architecture.
- 2021-07-16: Updated architecture diagrams to correctly reflect no direct connection between Rails and Sidekiq.
- 2021-07-15: Updated Patroni configuration to include Rest API authentication settings.
- 2021-07-15: Added 5k Cloud Native Hybrid architecture.
- 2021-07-08: Added 25k Cloud Native Hybrid architecture.
- 2021-06-29: Added 50k Cloud Native Hybrid architecture.
- 2021-06-23: Made additions to main page for Cloud Native Hybrid and reduced 3k architecture.
- 2021-06-16: Updated PostgreSQL steps and configuration to use the latest roles and prep for any Geo replication.
- 2021-06-14: Updated configuration examples for Monitoring node to follow latest.
- 2021-06-11: Expanded notes on external services with more detail.
- 2021-06-09: Added additional guidance and expanded on how to correctly manage GitLab secrets and database migrations.
- 2021-06-09: Updated Praefect configuration examples to follow the new storages format.
- 2021-06-03: Removed references for the Unicorn webserver, which has been replaced by Puma.
- 2021-04-23: Updated Sidekiq configuration examples to show how to correctly configure multiple workers on each node.
- 2021-04-23: Added initial guidance on how to modify the 3k Reference Architecture for lower user counts.
- 2021-04-13: Added further clarification on using external services (PostgreSQL, Redis).
- 2021-04-12: Added additional guidance on using Load Balancers and their routing methods.
- 2021-04-08: Added additional guidance on how to correctly configure only one node to do database migrations for Praefect.
- 2021-04-06: Expanded 10k Cloud Native Hybrid documentation with more details and clear naming.
- 2021-03-04: Expanded Gitaly Cluster documentation to all other applicable Reference Architecture sizes.
- 2021-02-19: Added additional Object Storage guidance of using separated buckets for different data types as per recommendations.
- 2021-02-12: Added documentation for setting up Object Storage with Rails and Sidekiq.
- 2021-02-12: Added documentation for setting up Gitaly Cluster for the 10k Reference Architecture.
- 2021-02-09: Added the first iteration of the 10k Cloud Native Hybrid reference architecture.
- 2021-01-07: Added documentation for using Patroni as PostgreSQL replication manager.