If you're running a small team of three or four engineers, paying $400 a month for Datadog or New Relic is hard to justify. Especially when 80% of what you actually need is "what's running, did the last deploy succeed, what's it doing right now, and what broke last night."
This guide covers the four observability signals that matter for a small product, how Launchverse exposes them out of the box (with no extra setup), and when paying for a vendor product genuinely starts to make sense.
The four signals
For most small or early-stage products, observability boils down to four questions:
- Is it up? — container status, healthcheck, last response time.
- Did the last deploy work? — pass/fail history, build minutes, rollback confidence.
- What's it doing right now? — CPU and memory time series.
- What broke last night? — application logs, error timeline.
Datadog answers all four with infinite knobs. You can answer all four with built-in PaaS tooling for free, with two compromises: coarser granularity and shorter retention.
Signal 1 — Is it up?
The simplest signal is also the most important. Launchverse projects display container status in real time on the project's overview page, with three buckets:
- Live — container running, healthcheck passing.
- Live (degraded) — running, but healthcheck failing. The app is still serving requests; something inside isn't quite right.
- Offline / Errored — container stopped; you have an outage.
For small projects this is enough; for projects where uptime contracts matter, add an external uptime check (Better Stack, UptimeRobot, Cloudflare Health Checks) that pings your domain every minute. That's a $0 add-on for one site.
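To make those buckets concrete, here's a minimal healthcheck endpoint sketch in TypeScript. It assumes an Express app; `checkDatabase()` is a hypothetical stand-in for whatever dependency probe your service actually needs. A 200 maps to Live, a 503 while the process still serves traffic maps to Live (degraded), and no response at all means Offline.

```ts
// Minimal healthcheck sketch. Assumes Express; checkDatabase() is a
// hypothetical probe: replace it with whatever your app depends on
// (a `SELECT 1` against the pool, a cache ping, a queue-depth check).
import express from 'express';

const app = express();

async function checkDatabase(): Promise<boolean> {
  // Hypothetical placeholder: always healthy.
  return true;
}

app.get('/healthz', async (_req, res) => {
  if (await checkDatabase()) {
    res.status(200).json({ status: 'ok' });       // platform bucket: Live
  } else {
    // Process is up and serving, but a dependency is failing:
    // platform bucket: Live (degraded).
    res.status(503).json({ status: 'degraded' });
  }
});

app.listen(3000);
```

Point the external pinger at the same endpoint, so both checks agree on what "up" means.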
Signal 2 — Did the last deploy work?
Every deploy is recorded with status, duration, and build-minute consumption. The Observability tab on a Launchverse project gives you:
- A 30-day deploy histogram (success bars in green, failed bars in red).
- Total deploys, success rate, total build minutes.
- The 10 most recent deploys with status, commit hash, and duration.
This is the dashboard that stops you from deploying broken code at 02:00: a quick glance shows whether recent runs were stable or chaotic.
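Those panel numbers are easy to reproduce if you want them in a script or a weekly Slack summary. A sketch; `DeployRecord` is a hypothetical shape, so adapt it to whatever your platform's API or CLI actually returns:

```ts
// Sketch: computing the Signal 2 numbers from a deploy history.
// DeployRecord is a hypothetical shape, not a platform API.
interface DeployRecord {
  commit: string;
  status: 'success' | 'failed';
  buildMinutes: number;
  finishedAt: Date;
}

function deployStats(deploys: DeployRecord[], windowDays = 30) {
  const cutoff = Date.now() - windowDays * 24 * 60 * 60 * 1000;
  const recent = deploys.filter((d) => d.finishedAt.getTime() >= cutoff);
  const succeeded = recent.filter((d) => d.status === 'success').length;
  return {
    total: recent.length,
    // No deploys in the window counts as "nothing failed".
    successRate: recent.length ? succeeded / recent.length : 1,
    buildMinutes: recent.reduce((sum, d) => sum + d.buildMinutes, 0),
  };
}
```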
Signal 3 — What's it doing right now?
Live CPU and memory time series require the platform's Sentinel agent to be enabled on the underlying server. When it is, the Observability tab shows a 30-minute moving window of percent-of-limit usage for both CPU and RAM, refreshed every minute.
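Percent-of-limit is worth understanding even when an agent computes it for you. Inside a cgroup v2 Linux container the raw numbers live in two kernel files; here's a sketch that reads the standard paths (plain cgroup plumbing, not a Sentinel API, and it assumes the usual /sys/fs/cgroup mount):

```ts
// Sketch: percent-of-limit memory usage from inside a cgroup v2 container.
// These are standard kernel paths; cgroup v1 or non-Linux hosts won't have
// them, and the mount point can differ outside containers.
import { readFileSync } from 'node:fs';

function memoryPercentOfLimit(): number | null {
  const current = Number(readFileSync('/sys/fs/cgroup/memory.current', 'utf8'));
  const maxRaw = readFileSync('/sys/fs/cgroup/memory.max', 'utf8').trim();
  if (maxRaw === 'max') return null; // no limit set, so percent is meaningless
  return (current / Number(maxRaw)) * 100;
}
```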
When Sentinel isn't enabled, the panel says so honestly — no synthetic graphs, no fake data. You see what the platform actually knows. Synthetic graphs are something we explicitly avoid: a graph that looks busy is worse than no graph if the data isn't real, because it lulls you into thinking you have observability you don't.
Signal 4 — What broke last night?
Logs stream live from the application page. Every container's stdout/stderr is captured for at least 7 days on Free, 30 days on Pro, and indefinitely on Enterprise. Search is full-text; filtering by deploy is one click.
For deeper analysis (say, correlating a 5xx spike with a deploy event), open the Analytics tab — it shows deploy reliability and engine status alongside request count and 5xx rate, which together cover the most common "what changed" investigations.
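If you'd rather do that correlation yourself from raw logs, the core of it is lining up two timestamp streams. A sketch; `RequestEvent` and `DeployEvent` are hypothetical shapes, and in practice the real work is parsing them out of your log lines:

```ts
// Sketch: count 5xx responses landing within 10 minutes after each deploy.
// RequestEvent and DeployEvent are hypothetical shapes.
interface RequestEvent { at: Date; status: number; }
interface DeployEvent { at: Date; commit: string; }

function errorsNearDeploys(
  requests: RequestEvent[],
  deploys: DeployEvent[],
  windowMs = 10 * 60 * 1000,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const req of requests) {
    if (req.status < 500) continue;
    for (const dep of deploys) {
      const delta = req.at.getTime() - dep.at.getTime();
      if (delta >= 0 && delta <= windowMs) {
        counts.set(dep.commit, (counts.get(dep.commit) ?? 0) + 1);
      }
    }
  }
  return counts; // commit -> 5xx count in the window after that deploy
}
```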
When to graduate
There are two clear signals that you've outgrown built-in PaaS observability:
- You have customers paying you for SLAs. SLA accounting needs immutable audit logs and tools to compute uptime over arbitrary windows. PaaS tooling doesn't do that; pick Better Stack or Datadog.
- You have multiple services that need to share traces. Distributed tracing (OpenTelemetry → some collector → some viewer) is its own commitment. If your architecture involves more than two services, add OpenTelemetry to your code and export to Honeycomb or Tempo.
Until then, the built-in tooling on a modern PaaS is genuinely sufficient. We've shipped products with thousands of daily users on nothing but the four signals above.
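If you do cross that second threshold, the first step is smaller than it sounds. A minimal sketch of an OpenTelemetry Node setup; the package names are the real `@opentelemetry` ones, while the service name and collector URL are placeholders, and it assumes a collector listening on the default OTLP/HTTP port:

```ts
// tracing.ts: minimal tracing bootstrap sketch.
// Assumes @opentelemetry/sdk-node, @opentelemetry/exporter-trace-otlp-http
// and @opentelemetry/auto-instrumentations-node are installed. Recent SDK
// versions accept serviceName directly; older ones want a Resource instead.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'api',                         // placeholder service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',   // placeholder collector endpoint
  }),
  // Patches http, express, pg and friends without touching app code.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Load this file before the rest of the app (for example via `node -r ./tracing.js`) so the auto-instrumentation can patch modules as they're loaded.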
Anti-patterns
- Synthetic dashboards. A dashboard that looks busy but isn't pulling real data is worse than no dashboard. Always check the source.
- Aggregating noise. If your 5xx graph spikes every time CI runs, the data is too noisy to be useful — fix it at the source by excluding synthetic and CI traffic from the metrics.
- Hoarding logs. 90-day retention on every log line costs more than most teams realise. Cap retention at 7 days for non-prod and 30 days for prod, then ship what you actually need long-term to S3 (a lifecycle sketch follows this list).
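For the S3 half of that advice, the cheap version is a one-time lifecycle rule that moves shipped logs to cold storage. A sketch using the AWS SDK v3; the bucket name and key prefix are placeholders:

```ts
// Sketch: lifecycle rule that moves shipped logs to Glacier after 30 days.
// Bucket and prefix are placeholders; the client and command are the real
// AWS SDK v3 ones.
import {
  S3Client,
  PutBucketLifecycleConfigurationCommand,
} from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-east-1' });

async function archiveShippedLogs(): Promise<void> {
  await s3.send(
    new PutBucketLifecycleConfigurationCommand({
      Bucket: 'acme-shipped-logs',           // placeholder bucket
      LifecycleConfiguration: {
        Rules: [
          {
            ID: 'archive-shipped-logs',
            Status: 'Enabled',
            Filter: { Prefix: 'logs/' },     // placeholder prefix
            // Keep the bytes, stop paying hot-storage prices for them.
            Transitions: [{ Days: 30, StorageClass: 'GLACIER' }],
          },
        ],
      },
    }),
  );
}

archiveShippedLogs().catch(console.error);
```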