Production Incident Response on Launchverse: A 10-Minute Playbook

Production incidents are not the time to read documentation. This is the documentation you read before the incident so the first 10 minutes are muscle memory. We focus on the three things you'll do most often: cancelling a bad deploy mid-flight, rolling back to a known-good commit, and confirming the live URL is healthy again.

Before anything else: is the platform up?

Open status.launchverse.app in another tab. If the platform itself is degraded, the response is "wait" — your rollback won't help and you'll just waste build minutes thrashing.

If status is green, the incident is yours and the playbook below is the right tool.

Scenario 1 — A bad deploy is still building

You pushed a commit, the build is in flight, and you've spotted the bug before it reaches Ready.

What to do

Open the project's Deployments tab.
Find the in-flight row. Status will be one of Queued, Building, Provisioning, Deploying.
Hit Cancel.

What happens

The engine receives a cancel signal within ~1 second.
The container is stopped (or its build is aborted before the container starts).
The build slot is released back into your team's concurrent-deploy pool, so the next push can start immediately.
The row is marked Canceled. You aren't billed for the partial build minutes.
The previous green container keeps serving traffic the entire time — your live URL never sees the failing deploy.

Cancel works regardless of which trigger started the build: push, manual redeploy, env-var-triggered rebuild, rollback, promote, and PR preview build. They all share the same cancel path.
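The cancel behaviour described above can be pictured as a small state model. The sketch below is illustrative only: `DeployPool` and its methods are hypothetical names, not Launchverse APIs. The only facts taken from this guide are the four in-flight statuses, the Canceled state, and the build slot being released on cancel.

```python
from dataclasses import dataclass, field

# Statuses this guide lists as in-flight (and therefore cancelable).
IN_FLIGHT = {"Queued", "Building", "Provisioning", "Deploying"}


@dataclass
class DeployPool:
    """Hypothetical model of a team's concurrent-deploy pool."""
    capacity: int
    deploys: dict = field(default_factory=dict)  # deploy id -> status

    def slots_in_use(self) -> int:
        return sum(1 for s in self.deploys.values() if s in IN_FLIGHT)

    def start(self, deploy_id: str) -> bool:
        """Admit a deploy only if a slot is free."""
        if self.slots_in_use() >= self.capacity:
            return False
        self.deploys[deploy_id] = "Queued"
        return True

    def cancel(self, deploy_id: str) -> bool:
        """Cancel an in-flight deploy; in this model its slot frees at once."""
        if self.deploys.get(deploy_id) not in IN_FLIGHT:
            return False  # nothing in flight to cancel
        self.deploys[deploy_id] = "Canceled"
        return True


pool = DeployPool(capacity=1)
pool.start("bad-deploy")
blocked = pool.start("fix")      # pool is full while the bad deploy builds
pool.cancel("bad-deploy")
admitted = pool.start("fix")     # the freed slot admits the next push
```

With a one-slot pool, the fix cannot start until the bad deploy is canceled, which is why cancelling early is usually faster than waiting for the bad build to fail on its own.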

Scenario 2 — A bad deploy is already live

The build went Ready, the platform swapped traffic to it, and now your live URL is throwing 5xx (or something subtler — a broken redirect, a missing env var, a UI regression that healthchecks won't notice).

Option A — Let auto-rollback do its job

If your project has auto-rollback configured and your health probe returns 5xx for the broken state, the platform will roll back automatically within 60–90 seconds of the deploy finishing. You'll receive:

A deploy-failed email with the diagnostic data.
A Failed commit status posted back to GitHub on the offending SHA.
A new Deployments row stamped auto-rolled-back to <prev commit> showing the recovery deploy.
A 30-minute cooldown stamped on the project so a chain reaction can't rollback-loop you.

If you're inside the cooldown window or the regression doesn't fail your healthcheck (a silent functional bug, a UI-only break), auto-rollback won't fire. Move to option B.
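The conditions under which auto-rollback fires can be summarised as a single predicate. This is a minimal sketch of the rules as this guide states them, not the platform's implementation; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

COOLDOWN = timedelta(minutes=30)  # cooldown length stated in this guide


def should_auto_rollback(
    configured: bool,
    probe_status: int,
    last_auto_rollback: Optional[datetime],
    now: datetime,
) -> bool:
    """True when, per this guide, auto-rollback would be expected to fire."""
    if not configured:
        return False
    if last_auto_rollback is not None and now - last_auto_rollback < COOLDOWN:
        return False  # still inside the cooldown window
    return 500 <= probe_status <= 599  # only 5xx probe results count


now = datetime(2024, 1, 1, 12, 0)  # arbitrary example time
fires = should_auto_rollback(True, 503, None, now)
silent_bug = should_auto_rollback(True, 200, None, now)
cooling_down = should_auto_rollback(True, 503, now - timedelta(minutes=10), now)
```

The second and third cases are the ones that send you to Option B: a regression that still returns 200, and a 5xx that lands inside the cooldown.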

Option B — Roll back manually

Open the project's Deployments tab.
Find the last Ready deploy before the bad one. Use the commit message and build-finished timestamp to identify it — not just the most recent green, in case the most recent green is also bad.
Hit Rollback.
Watch the new deploy stream live. Cancel is available throughout. Status is realtime.
When the new deploy hits Ready, the platform swaps traffic back to the older commit.

The rollback is itself a deploy, billable in build minutes (unless it triggers within the auto-rollback path, which is treated as platform-driven recovery and not billed). It goes through the same admission gate as any other deploy — concurrent-deploy cap applies, plan-tier caps apply, idempotency-key dedup applies.
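Choosing the rollback target in step 2 is a small selection problem: the newest Ready deploy that finished before the bad one, skipping any commit you already know is bad. The sketch below assumes you have the deployment history available as plain records; the `Deploy` fields and `pick_rollback_target` are hypothetical names for illustration, not a Launchverse API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deploy:
    commit: str
    status: str        # e.g. "Ready", "Failed", "Canceled"
    finished_at: int   # build-finished time as a Unix timestamp


def pick_rollback_target(
    history: list[Deploy],
    bad_commit: str,
    known_bad: frozenset[str] = frozenset(),
) -> Optional[Deploy]:
    """Newest Ready deploy that finished before the bad one and is not known bad."""
    # Raises StopIteration if bad_commit is not in the history.
    bad = next(d for d in history if d.commit == bad_commit)
    candidates = [
        d for d in history
        if d.status == "Ready"
        and d.finished_at < bad.finished_at
        and d.commit not in known_bad
    ]
    return max(candidates, key=lambda d: d.finished_at, default=None)


history = [
    Deploy("a1", "Ready", 100),
    Deploy("b2", "Ready", 200),
    Deploy("c3", "Failed", 250),
    Deploy("d4", "Ready", 300),   # the bad deploy that is currently live
]
target = pick_rollback_target(history, "d4")
older = pick_rollback_target(history, "d4", known_bad=frozenset({"b2"}))
```

The `known_bad` parameter covers the case this guide warns about, where the most recent green deploy is itself broken.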

Confirm it actually worked

After the rollback reaches Ready, before you go back to bed:

Open the live URL in a private window. Bypass cache.
Hit your healthcheck endpoint directly: curl -i https://<your-domain>/healthz.
Look at the Observability tab's deploy-success-rate over the last hour — the dip should be small and self-recovering, not still falling.

If the live URL is still broken after the rollback, you rolled back to another broken commit. Roll back again to the one before that, and treat fixing the underlying bug as more urgent than continuing the chain.
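A single successful curl can be a lucky hit, so it may help to require several consecutive healthy responses before calling the incident over. The sketch below uses only the Python standard library. The `/healthz` path comes from this guide; the function names, the three-in-a-row threshold, and the timings are assumptions to adjust.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code          # 4xx/5xx responses arrive as exceptions
    except (urllib.error.URLError, OSError):
        return 0                 # DNS failure, refused connection, timeout


def stable_healthy(check: Callable[[], int], needed: int = 3,
                   attempts: int = 10, delay: float = 2.0) -> bool:
    """True once check() returns 2xx `needed` times in a row within `attempts` tries."""
    streak = 0
    for _ in range(attempts):
        streak = streak + 1 if 200 <= check() < 300 else 0
        if streak >= needed:
            return True
        time.sleep(delay)
    return False


# Example call (replace the placeholder domain before running):
# stable_healthy(lambda: probe("https://<your-domain>/healthz"))
```

Passing the check as a callable keeps the retry logic separate from the network call, so the loop can be exercised without touching production.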

Scenario 3 — The deploy succeeded, but the application is misbehaving

This is the one auto-rollback can't help with: the healthcheck is fine, but customers are complaining. Three diagnostic paths:

Read the build log

Open the deployment row from the Deployments tab. The log is split into stages (clone → install → build → start). Working backwards from the end:

Did start log a stack trace? That's usually your answer.
Did build complete without error but warn about a missing env var? Check the Environment tab — a variable might not have been promoted to the production environment.
Did install succeed but produce a different package-lock hash than your local? Lockfile drift can break runtime behaviour even when the build is green.
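To check for lockfile drift yourself, compare a digest of your local lockfile with a digest of the lockfile the build used. This guide does not say how the platform computes its package-lock hash, so the sketch below assumes a SHA-256 over the raw file bytes and assumes you have obtained a copy of the build's lockfile by some other means. Both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def lockfiles_match(local: Path, from_build: Path) -> bool:
    """True when the two lockfiles are byte-for-byte identical."""
    return file_sha256(local) == file_sha256(from_build)


# Example call, with paths that are placeholders:
# lockfiles_match(Path("package-lock.json"), Path("build-copy/package-lock.json"))
```

A byte-level comparison is strict: a change in line endings alone will register as a mismatch, so treat a mismatch as a prompt to diff the files rather than as proof of a dependency change.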

Read the live container's runtime log

If the build is fine but the running container is throwing, the right log isn't the build log; it's the container's stdout / stderr after start. Project → Deployments → click the running row → Live logs tab. The stream is bounded so a noisy log can't fill the buffer indefinitely; if your application emits more than ~1 MB/min, consider sending logs to an external sink.
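To judge whether your application is near the ~1 MB/min mark, you can estimate log volume from a short sample. This is a rough sketch: it treats 1 MB as 10^6 bytes and counts one newline byte per line, both of which are assumptions, and the function names are hypothetical.

```python
LIMIT_BYTES_PER_MIN = 1_000_000  # the ~1 MB/min figure, taken here as 10^6 bytes


def bytes_per_minute(lines: list[str], window_seconds: float) -> float:
    """Average UTF-8 log volume per minute over a sampled window."""
    total = sum(len(line.encode("utf-8")) + 1 for line in lines)  # +1 per newline
    return total * 60.0 / window_seconds


def needs_external_sink(lines: list[str], window_seconds: float) -> bool:
    """True when the sampled rate exceeds the assumed limit."""
    return bytes_per_minute(lines, window_seconds) > LIMIT_BYTES_PER_MIN


sample = ["x" * 99] * 1000          # 1000 lines of 100 bytes each, newline included
calm = needs_external_sink(sample, window_seconds=60)
noisy = needs_external_sink(sample, window_seconds=5)
```

The same 100,000 bytes are well under the limit when spread over a minute and over it when emitted in five seconds, so sample during peak traffic rather than at idle.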

Re-run with a known-good env

If you suspect an environment-variable regression but the variables look right to you, the cleanest test is to roll back to the previous deploy and verify it goes green. If the rollback also fails, the regression isn't in your code; it's almost certainly an env or upstream-dependency change (a database password rotated, a third-party API rate-limited you, a CDN purged a critical asset).
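When variables "look right", a mechanical comparison is more reliable than eyeballing them. The sketch below assumes you can get both sets of variables as key/value pairs, which this guide does not cover; `diff_env` is a hypothetical helper. It reports variable names only, so secret values never reach your terminal or chat.

```python
def diff_env(known_good: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Names of variables removed, added, or changed between two snapshots."""
    good, cur = set(known_good), set(current)
    return {
        "missing": sorted(good - cur),
        "added": sorted(cur - good),
        "changed": sorted(k for k in good & cur if known_good[k] != current[k]),
    }


before = {"DATABASE_URL": "postgres://old", "API_KEY": "k1", "FEATURE_X": "on"}
after = {"DATABASE_URL": "postgres://new", "API_KEY": "k1", "LOG_LEVEL": "debug"}
report = diff_env(before, after)
```

An empty report with a still-failing rollback points away from configuration and towards an upstream dependency.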

Reading the AI-Explain panel

Every Failed deploy row has an AI-Explain button. It runs against the first red line in the log and proposes:

A plain-English explanation of the failure.
A code-level fix as a draft PR against the offending repo.

Two things to know about it:

The PR path requires the Launchverse GitHub App to be installed on the repo with write scope on contents. If it isn't, the panel gives you the exact install URL inline; you can fix it in two clicks. If the installation is there but scopes are wrong, you'll see the specific scope that's missing.
The explanation is generated against the log only, not your application's source. It's good at "missing env var", "syntax error on line N", "package not found", "port conflict". It's not good at "this is a logic bug that compiled clean and is now silently producing the wrong output". For those you still need a human.

Post-incident checklist

After the live URL is healthy again, before you close the laptop:

Did the team get notified? Deploy-failed emails fire automatically; confirm at least one teammate saw it. If not, check that team members have email notifications enabled.
Is the offending commit fixed forward? A rollback is a holding pattern, not a fix. Open a PR with the actual repair, let CI go through the preview environment, promote when green.
Did auto-rollback fire when it should have? If the answer is "no" and the live URL was unambiguously 5xx, check that your health probe is configured. Auto-rollback can only catch what your healthcheck catches.
Did Cancel work? If you tried to cancel and it took longer than ~5 seconds to land, email support with the project name and deploy ID — that's worth a platform-level look.

What the platform won't do for you

Setting expectations honestly:

The platform will not detect functional regressions that pass your healthcheck. Visual breaks, broken redirects, slow-but-not-failing endpoints: auto-rollback is blind to these. You need synthetic monitoring on the user-facing flow for that, not infrastructure-level probes.
The platform will not roll back database schema changes. If your deploy migrated the schema, rolling back the container leaves you on the new schema running old code. Make migrations backwards-compatible (see zero-downtime deployments).
The platform will not undo destructive data changes. A deleted row from a buggy seeder cannot be restored by a deploy rollback. Use point-in-time recovery on your managed database for that.

Each of these is a sharp edge that's been the subject of post-mortems on platforms older than Launchverse. The fix is process — backwards-compatible migrations, PITR-enabled databases, synthetic monitoring on the critical path — not waiting for the platform to grow telepathy.
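A backwards-compatible migration is easiest to see in a small example. The sketch below uses SQLite in memory; the `users` table and `display_name` column are hypothetical. It shows that an additive, nullable column leaves the previous release's queries working, which is the property that makes a container rollback safe after the schema has already changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")


def old_code_insert(db: sqlite3.Connection, email: str) -> None:
    """What the previous release does: it knows nothing about new columns."""
    db.execute("INSERT INTO users (email) VALUES (?)", (email,))


def old_code_read(db: sqlite3.Connection) -> list:
    """The previous release names its columns explicitly."""
    return db.execute("SELECT id, email FROM users ORDER BY id").fetchall()


old_code_insert(conn, "a@example.com")

# Additive migration shipped with the new release: a nullable column with no
# NOT NULL constraint, so rows written by old code remain valid.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# After a container rollback, old code runs against the new schema unchanged.
old_code_insert(conn, "b@example.com")
rows = old_code_read(conn)
```

Had the migration added a NOT NULL column without a default, or renamed `email`, the old release's insert would fail after the rollback. Destructive steps like those belong in a later deploy, once no running release depends on the old shape.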

Related guides

Rollback and promote deployments: the configuration that makes auto-rollback do its job without false positives.
PR preview deployments: fix-forward on a preview before promoting to production.
Zero-downtime deployments: the deploy-side primitives that make rollback safe in the first place.

