🔎 SUMMARY

SaaS uptime doesn’t guarantee you can recover fast or restore the collaboration context you need.

The gap in the Shared Responsibility model (also called the Limited Liability model) shows up as money lost through rework, downtime, compliance drag, and trust friction. It’s usually driven by short retention, missing metadata after exports or migrations, untested restores, and recovery copies that are not separated from production admins.

To bridge the gap, use a simple operating model: keep independent recovery copies, generate audit-ready artifacts continuously, and run regular restore drills.

In January 2017, GitLab.com experienced a major outage that lasted around 18 hours after an engineer accidentally deleted data from the main database server. About six hours’ worth of database changes were lost permanently.

The postmortem shows how recovery paths can look good on paper, then fall apart under pressure. And, of course, the business only realizes this after the outage has already become expensive.

The same failure pattern shows up across SaaS portfolios. Vendors focus on uptime and making collaboration smooth at scale. They roll out controls for retention, recovery, and audit. But defaults, plan limits, and everyday operational constraints often fall short of enterprise requirements unless customers configure, test, and prove those controls.

A lot of organizations treat the Shared Responsibility model—sometimes called the Limited Liability Model—more like a box to check during procurement rather than something they fund, test, and review. That’s where the losses for enterprises start to add up.

We’ll circle back to the GitLab incident later as a case study for execution failure. But first, let’s take a look at the highest-cost gaps between what platforms do and what enterprises need.

The Economics of the Shared Responsibility Model Gap

Gaps in the Shared Responsibility model can lead to unexpected costs and missed commitments. Weak recovery and weak proof create predictable expenses. Uptime alone does not cap them.

Four Cost Buckets

Reconstruction and rework

When restore paths only give back partial results, teams end up spending time and money to rebuild lost data, metadata, and relationships. 

Take this example: Jira CSV-based recovery can break attachments and needs some special handling for issue links. If 10,000 issues are affected and around 20% need manual fixes after export/import, at 5 minutes per issue, we are talking about 167 hours of rework. With loaded labor rates of $66 to $87/hour, that adds up to roughly $11.0k to $14.5k in direct labor, not even counting the extra downtime. 
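The arithmetic above can be sketched as a small cost model; all inputs (issue count, fix rate, minutes per issue, hourly rates) are the assumptions stated in the example, not measured figures.

```python
# Rough rework-cost model for a partial restore, using the
# assumed figures from the Jira example above.
def rework_cost(issues_affected, manual_fix_rate, minutes_per_issue, hourly_rate):
    """Return (hours of manual rework, direct labor cost in dollars)."""
    issues_to_fix = issues_affected * manual_fix_rate
    hours = issues_to_fix * minutes_per_issue / 60
    return hours, hours * hourly_rate

hours, low = rework_cost(10_000, 0.20, 5, 66)
_, high = rework_cost(10_000, 0.20, 5, 87)
print(round(hours))               # 167 hours of rework
print(round(low), round(high))    # ~$11,000 to ~$14,500 in direct labor
```

Plugging in your own incident scope and loaded rates turns this from an illustration into a budget line.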

Downtime and productivity loss

Every hour that core tools are down or just not usable is time you’re paying for but can’t turn into work. 

For instance, if you have 100 developers at $66 to $87/hr facing 6 hours of downtime, you’re looking at around $40k to $53k in lost paid time. Even once the tools are back up, the time it takes to switch contexts adds more cost. A commonly cited estimate from Microsoft suggests it takes about 23 minutes to get back on track after an interruption, which could add another $2.5k to $3.3k for those 100 developers.
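The same headcount figures can be checked with two one-line formulas; the 23-minute refocus window and the hourly rates are the assumptions from the paragraph above.

```python
# Downtime and context-switch cost, using the assumed inputs above.
def downtime_cost(headcount, hours_down, hourly_rate):
    return headcount * hours_down * hourly_rate

def context_switch_cost(headcount, refocus_minutes, hourly_rate):
    # One refocus period per person after the interruption.
    return headcount * refocus_minutes * hourly_rate / 60

low = downtime_cost(100, 6, 66)              # 39600  (~$40k)
high = downtime_cost(100, 6, 87)             # 52200  (~$53k)
cs_low = context_switch_cost(100, 23, 66)    # 2530.0 (~$2.5k)
cs_high = context_switch_cost(100, 23, 87)   # 3335.0 (~$3.3k)
print(low, high, cs_low, cs_high)
```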

Compliance and audit drag

The costs here aren’t just about fines. You’ve also got to factor in the work needed for remediation, audit exceptions, and delays in procurement when you can’t prove that controls are working.

For example, enterprise deals that hinge on SOC 2 Type II depend on a reporting period that can stretch on for months. Significant control gaps can push back audit readiness and, with it, the deals waiting on the report.

Trust and commercial friction

Incidents come with downstream costs that go beyond the technical side. There is a loss of confidence, added scrutiny, and extra process overhead. Internally, teams might lose trust and have to create workarounds. Externally, customers and users might slow down adoption, ask for more support, renegotiate contracts, or even churn.

Take security questionnaires, for example. They’re pretty standard and often seen as a hassle in B2B settings. After an incident, they get even more complicated and time-sensitive. Producing answers can take up to 15 hours each time, and it adds up with each review cycle.

Five Gaps Where Default Platform Controls Fail Enterprise Obligations

Enterprise exposure often stems from defaults, retention windows, and operational constraints. Here are five gaps that highlight these issues.

Note on terminology: organizations tier systems differently. In this post, “critical systems” means the applications where loss of access, integrity, or collaboration context would cause material operational or compliance impact. Start with the top tier, then expand.

Gap 1: Retention Duration

Service providers often maintain long-term retention of their own operational and critical system data. This helps them run, secure, and troubleshoot their services. Platforms like GitLab, GitHub, or Atlassian follow standards like SOC 2 or ISO 27001, which usually require them to retain certain internal data for much longer than 365 days.

This can give customers a false sense of security. They often assume that the same long-term retention applies to their own data too. In reality, that’s usually not the case, and customer-level data retention and recoverability may be much more limited. For instance, GitHub Enterprise audit log events and Microsoft Purview Audit (Standard) audit records cover 180 days by default. Enterprise obligations (legal, contractual, and regulatory) often call for records to be kept for several years. Examples include HIPAA documentation retention requirements of six years and SEC Rule 17a-4 broker-dealer record retention requirements of not less than three to six years, depending on the record type.

Shared Responsibility model boundary

The provider takes care of audit logging and retention controls, but they come with defaults and limits based on your plan. It’s up to you to define the necessary retention periods, configure those settings, and ensure coverage across critical systems.

Impact if you do nothing

  • If an incident happens, late detection makes it tough to figure out what went wrong because the trail of “who did what when” is gone.
  • Litigation risks increase if you can’t keep or produce electronically stored information under Rule 37(e).
  • Not producing required historical records can lead to audit findings, remediation, and enforcement exposure in regulated environments.

Test
Can you piece together “what changed” and “who did it” over the required timeframe?

How to close the gap

  • Implement a backup strategy with flexible retention, up to unlimited where your obligations require it.
  • Set clear retention requirements for each system with designated owners and a regular review schedule.
  • Centralize and keep audit and configuration evidence beyond the platform defaults (SIEM, archive, backup) and check coverage periodically.

Metric to track

The percentage of critical systems where the evidence lookback meets or exceeds the required duration (goal: 100%)
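Tracking this metric only needs a system inventory with two numbers per entry. A minimal sketch, where the inventory entries and retention figures are illustrative:

```python
# Percentage of systems whose evidence lookback meets the required duration.
def retention_coverage(systems):
    """systems: list of dicts with 'lookback_days' and 'required_days'."""
    meeting = [s for s in systems if s["lookback_days"] >= s["required_days"]]
    return 100 * len(meeting) / len(systems)

# Hypothetical inventory: the 180-day default falls far short of a
# six-year (2190-day) obligation, while an archived system passes.
inventory = [
    {"name": "source-control", "lookback_days": 180, "required_days": 2190},
    {"name": "issue-tracker",  "lookback_days": 730, "required_days": 365},
]
print(retention_coverage(inventory))  # 50.0 -> goal is 100.0
```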

Gap 2: Metadata Preservation

One big pitfall is treating migration tools like they’re a backup solution. 

“Migration” can refer to shifting data within the same platform (cloud-to-cloud, or on-premises to cloud) or switching to a completely different platform. In any case, exports often bring back the primary records, but they can miss the collaboration layer that makes systems usable. We’re talking about relationships, discussions, review context and history here.

Cross-platform migrations raise the stakes because each platform structures and stores metadata slightly differently, so one-to-one preservation is not guaranteed. Vendors warn about this in their documentation. For instance, GitHub notes that migration archives are just migration artifacts, not backups. They also point out that migration outcomes might include warnings where data was skipped or migrated with caveats.

Shared Responsibility model boundary

The provider puts out APIs and migration or export mechanisms. It’s up to you to ensure that recovery keeps the metadata and relationships essential for smooth operation.

Impact if you do nothing

  • Teams might get their code back, but they’ll lose the decision-making context.
  • Rework can pile up since intent and review history are missing.
  • Incident response can drag on because scope and intent are harder to reconstruct.

Test

After a restore, can teams explain what changed and why using the recovered collaboration context, or do they only have the raw repository?

How to close the gap

  • Go for backups that keep full metadata and relationships.
  • Run granular restore tests to make sure the metadata is intact.
  • Define and document key metadata for each system.

Metric to track

  • The percentage of systems where the last restore test confirmed metadata integrity. Aim for 100%.
  • Measure the average time to reconstitute in-scope systems. Set a target and check it quarterly.
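A metadata integrity check can be as simple as diffing the fields you documented as key metadata between a pre-incident snapshot and the restored record. The field names below are illustrative, not any vendor’s schema:

```python
# Post-restore metadata check: compare declared key fields between a
# pre-incident snapshot and the restored record. Field names are
# hypothetical examples of "collaboration layer" metadata.
KEY_FIELDS = ("assignee", "labels", "linked_issues", "comment_count")

def metadata_intact(snapshot, restored, fields=KEY_FIELDS):
    """Return the list of key fields that did not survive the restore."""
    return [f for f in fields if snapshot.get(f) != restored.get(f)]

before = {"assignee": "kim", "labels": ["bug"], "linked_issues": ["PROJ-7"], "comment_count": 4}
after  = {"assignee": "kim", "labels": ["bug"], "linked_issues": [], "comment_count": 4}
print(metadata_intact(before, after))  # ['linked_issues'] -> restore test fails
```

An empty result counts toward the 100% target; any non-empty result is a failed restore test to investigate.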

Gap 3: Tested Recovery

Many organizations run backups, but recovery fails because restores are not tested in real-world scenarios. In SaaS, restore speed is often bound by API and platform limits. Even a correct backup can miss RTO if restoration depends on thousands of API operations that can’t be executed fast enough. Breakpoints often show up only during incidents, such as corrupted archives, missing credentials, and unclear runbooks.

Shared Responsibility model boundary

The provider sets the stage with APIs and exports that have rate limits and operational constraints. Meanwhile, you have to show that recovery meets RTO/RPO by routinely testing restores in realistic conditions and keeping a record of the results.

Impact if you do nothing

  • A restore might work, but if it misses the RTO, it turns into a business outage. Teams end up waiting, rebuilding things manually, or dealing with only partial restorations.
  • The scope of investigation and remediation grows.
  • Restore throughput becomes a bottleneck that competes with ongoing delivery work. 

Test

Can you restore within RTO if your most privileged production credentials are compromised and primary data is deleted or corrupted?

How to close the gap

  • Conduct quarterly restore tests for your most important systems with documented RTO/RPO results and pass/fail criteria.
  • Automate restore testing wherever you can.
  • Maintain restore runbooks that include escalation paths and evidence collection.

Metric to track

  • Restore success rate, defined as the percentage of quarterly tests that meet RTO/RPO. Start by measuring the critical systems, then gradually expand to other systems based on priority.
  • Percentage of critical systems with a documented, tested restore runbook. Goal: 100%.
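The pass/fail criteria and success-rate metric above can be sketched in a few lines; the RTO/RPO targets and test results are made-up examples:

```python
# A restore test passes only if it meets both RTO and RPO.
def restore_test_passed(actual_restore_hours, rto_hours, data_loss_minutes, rpo_minutes):
    return actual_restore_hours <= rto_hours and data_loss_minutes <= rpo_minutes

def success_rate(results):
    """results: list of booleans from quarterly restore tests."""
    return 100 * sum(results) / len(results)

# Hypothetical quarter with a 4-hour RTO and 15-minute RPO target.
quarter = [
    restore_test_passed(3.5, 4, 10, 15),  # pass
    restore_test_passed(6.0, 4, 10, 15),  # missed RTO
    restore_test_passed(2.0, 4, 30, 15),  # missed RPO
    restore_test_passed(1.5, 4, 5, 15),   # pass
]
print(success_rate(quarter))  # 50.0
```

Note that a restore that completes but misses RTO still counts as a failure; that is exactly the distinction the gap is about.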

Gap 4: Independent Recovery

Recovery copies that are stored in the same tenant/account and under the same credentials as production can be deleted, encrypted, or invalidated by someone using a compromised admin account. Replication doesn’t solve this issue. It improves availability, but it also spreads deletions and corruption around.

Shared Responsibility model boundary

The provider builds availability and redundancy. You make sure recovery points exist outside the danger zone of production-admin compromises.

Impact if you do nothing

  • One compromised account can wipe out all your recovery points. Even if some copies survive, you might not have a reliable restore point after a destructive event.
  • Logical corruption and ransomware can spread across replicated systems. With time-limited rollback tools, if you catch it too late, you may end up with no clean point-in-time restore.

Test

If privileged admin credentials get stolen, can that account delete backups or invalidate recovery copies? 

How to close the gap

  • Store backups in a different account/tenant with separate admin credentials.
  • Add deletion safeguards that match your risk level, like immutable storage (WORM) or time-delayed deletion.
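As one illustration of the immutable-storage option, WORM protection can be sketched with Amazon S3 Object Lock. The bucket name is a placeholder, and COMPLIANCE mode with a 90-day window is an example choice, not a recommendation for every risk level:

```shell
# Object Lock must be enabled when the bucket is created.
aws s3api create-bucket \
  --bucket example-backup-vault \
  --object-lock-enabled-for-bucket

# Default COMPLIANCE-mode retention: object versions cannot be deleted
# or overwritten by any user, including root, for 90 days.
aws s3api put-object-lock-configuration \
  --bucket example-backup-vault \
  --object-lock-configuration \
    '{"ObjectLockEnabled": "Enabled", "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 90}}}'
```

GOVERNANCE mode is a softer alternative that lets specially permissioned principals lift the lock; COMPLIANCE mode is the one that survives a compromised admin.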

Metric to track

Percentage of critical systems where production admin can delete backups. Goal: 0%.

Gap 5: Compliance Evidence

Auditors and enterprise buyers want proof that backups run, restores are tested, retention is applied, and failures are reviewed. SOC 2 Type II is built around this concept. It checks how well controls are working over a specific audit period. In the Availability criteria, A1.2 covers backup and recovery infrastructure, and A1.3 dives into testing recovery procedures, including the regular check on backup completeness.

This is where a lot of teams trip up. They run backups, but cannot prove when the last successful restore test was, when a retention policy was in effect, or who reviewed failures.

Shared Responsibility model boundary

The provider offers platform audit logs, often with retention limits. You produce evidence packs: backup coverage, retention settings, restore-test results, exception handling, and exported logs.

Impact if you do nothing

  • Weak evidence can slow down audits, procurement processes, and customer security reviews. Missing elements can lead to SOC 2 exceptions, requiring you to fix things and push reporting deadlines back.
  • Collecting data manually turns into weeks of screenshot chasing and log reconstruction across IT, Sec, and GRC teams.

Test

If evidence lives only in the compromised tenant, can an attacker tamper with or delete it? 

How to close the gap

  • Automate evidence capture for backup coverage, retention settings, and restore-test results.
  • Export and protect audit and config logs outside the production admin plane.
  • Collect evidence continuously.

Metrics to track

  • Time to produce an audit-ready evidence pack for in-scope systems. Aim for under 24 hours.
  • Percentage of systems with automated evidence reporting. Goal: 100%.
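An automated evidence pack is ultimately just a timestamped, machine-generated document assembled from the systems above. A minimal sketch; the keys are illustrative, not tied to any auditor’s checklist:

```python
import json
from datetime import datetime, timezone

# Assemble an audit-ready evidence pack from already-collected records.
def build_evidence_pack(backup_jobs, retention_policies, restore_tests):
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "backup_coverage": {
            "systems": len(backup_jobs),
            "last_success": {j["system"]: j["last_success"] for j in backup_jobs},
        },
        "retention_policies": retention_policies,
        "restore_tests": restore_tests,
    }

pack = build_evidence_pack(
    backup_jobs=[{"system": "jira", "last_success": "2024-05-01T02:00:00Z"}],
    retention_policies={"jira": "unlimited"},
    restore_tests=[{"system": "jira", "date": "2024-04-15", "met_rto": True}],
)
print(json.dumps(pack, indent=2))
```

Because the pack is generated rather than screenshotted, hitting the under-24-hour target becomes a matter of running a job, and storing the output outside the production admin plane answers the tamper-resistance test above.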

Learn more about the Shared Responsibility model for different service providers:

📌 Shared Responsibility Model in Azure DevOps
📌 Atlassian Cloud Shared Responsibility Model
📌 GitLab Shared Responsibility Model
📌 GitHub Shared Responsibility Model
📌 Microsoft 365: What Are Your Duties Within The Shared Responsibility Model

Three Real-Life Cases: When the Gap Caused Loss

Let’s take a look at three incidents that map directly to one or more gaps above. 

GitLab.com (2017)

Gaps exposed: metadata preservation, tested recovery

As we talked about in the intro, in 2017, GitLab.com accidentally removed data from the main database server. This outage affected roughly 5,000 projects, 5,000 comments, and 700 new users. Issues and snippets were also impacted. 

They noted that several recovery options just weren’t available when they needed them. The replica wasn’t usable. Azure disk snapshots hadn’t been enabled for the database servers. pg_dump backups were failing due to a PostgreSQL version mismatch, leaving the expected S3 backups out of commission. Recovery relied on an LVM snapshot created for staging, and copying and restoring it took many hours due to slow disks. 

Even though GitLab was the provider here, this kind of failure can happen anywhere. “Backup exists” and “recovery works” are not the same without a tested, metadata-complete recovery. If you can’t restore the collaboration layer, you risk losing operational continuity.

What would reduce the impact?

  • Independent backups that capture both repos and the collaboration layer, allowing for point-in-time restores.
  • Restore testing that proves RTO/RPO and checks the metadata integrity.

Atlassian Cloud (2022)

Gaps exposed: tested recovery and independent recovery

On April 5, 2022, Atlassian reported that an internal maintenance script deleted 883 cloud sites, affecting 775 customers. This meant that many organizations lost access to Jira, Confluence, and other Atlassian Cloud products. For some customers, the outage dragged on for as long as 14 days.

This incident shows the tested recovery gap at scale. The presence of backups did not translate into a fast, predictable recovery path when hundreds of tenants had to be restored under pressure. Plus, it exposed how many customers lacked independence in their recovery options. Most organizations didn’t have a way to control their recovery points or maintain continuity while waiting for the vendor to sort things out.

Losing access to key systems for several days led to problems like missed approvals, manual reconciliations, and a whole backlog to tackle once access returned.

What would reduce the impact?

  • Customer-controlled copies outside Atlassian Cloud to support continuity of operations and decision-making while restoration progressed.
  • The ability to restore important projects or spaces in a separate environment (self-managed instance or alternative workspace) to maintain workflow.
  • Restore drills that test “time-to-operate” under real constraints (volume, permissions, and dependencies).

Code Spaces (2014)

Gaps exposed: independent recovery and tested recovery

In June 2014, Code Spaces, a code-hosting SaaS provider, faced a DDoS attack along with an extortion demand. The attacker got into the company’s AWS console. When Code Spaces tried to take back control, the attacker deleted resources like EBS snapshots, S3 buckets, AMIs, and several instances.

Later on, Code Spaces announced that most of its data, backups, machine configurations, and “offsite backups” were either partially or completely deleted over about a 12-hour period. Shortly after, the company announced it would cease trading.

This incident shows the independent recovery gap. Recovery copies were reachable from the same compromised control plane. Because the privileged credentials were compromised, both production and recovery got hit. Plus, there wasn’t any solid restore path that could hold up under those circumstances.

You see a similar failure mode in SaaS when tenant admins can delete recovery copies.

What would reduce the impact?

  • Backups kept in a separate AWS account with different credentials, so the attacker couldn’t have reached them.
  • Immutable backup storage, so even if someone had console access, backups couldn’t be deleted for a set retention period.
  • A third-party backup outside AWS entirely to spread out the risk across different providers.

How to Close the Shared Responsibility Model Gap

SaaS vendors keep platforms running. You need to make sure recovery and proof are predictable. In practice, this operating model breaks down into three main parts.

First, keep recovery points you control away from the same tenant and admin blast radius as your production. At the same time, make sure retention matches your obligations, not just platform defaults.

Next, connect your evidence to your assurance needs (SOC 2, ISO 27001, customer security reviews) without turning compliance into a manual project.

Lastly, prove it works. Run restore tests for your key systems on a regular schedule and under real constraints.

📌 For teams using GitHub, GitLab, Bitbucket, Azure DevOps, Jira, Confluence, and Microsoft 365, GitProtect could be a solid way to get that operating model in place.

It offers customer-controlled backups that sit outside the source platform, supports policy-driven retention, and gives you exportable reports for audits and security reviews.

It backs up core data and collaboration metadata, allows for granular restores, and reduces reliance on default platform recovery windows.

GitProtect supports flexible deployment and storage options that reduce dependency on a single cloud or admin plane. If it fits into your continuity strategy, cross-platform migration can also give you another recovery path.
