Why Organizations Are Not Moving to Proxmox
Proxmox is not a VMware replacement you install over a weekend. After three years running 1000+ VMs in production, here is what actually holds organizations back, and what it takes to succeed.
What's inside
1. Why Proxmox has no guard rails (and why that matters)
2. High availability: what 'check the box' actually requires
3. Monitoring is not optional: building real observability
4. Ceph storage: the backbone you need to get right
5. The real cost of a VMware migration
6. How to succeed: build vs. partner
Since Broadcom's acquisition of VMware, Proxmox has appeared in every infrastructure cost-cutting conversation. The license math looks compelling: a fraction of VMware Enterprise Plus pricing, open source, full feature set. Budget-conscious CTOs are asking why they shouldn't just swap it in.
Here is the honest answer: not as a drop-in replacement.
Proxmox trades vendor guardrails for full control. That trade has real costs, in expertise, tooling, and operational maturity. Organizations that approach Proxmox as a direct swap typically struggle. Organizations that treat it as a platform shift, with either internal expertise or a managed partner, succeed.
After three years running Proxmox in production with 1000+ VMs at Natron, here is what most evaluations get wrong.
Proxmox has no guard rails and that is the point (and the problem)
VMware holds your hand. It has a Hardware Compatibility List. It has validated reference architectures. It will warn you, block you, or flat-out refuse if your setup does not meet its expectations. For many enterprises, that comfort can be a double-edged sword.
Proxmox will let you do anything. Any hardware, any configuration, any topology. Two-node cluster with no quorum device? Sure. Consumer SSDs as Ceph journals? Go ahead. Overloading your cluster with more VMs than it can handle? No problem. Proxmox will not stop you. It assumes you know what you are doing.
That is a meaningful trade-off.
The freedom to choose your own hardware, your own network design, your own storage layout means you can build exactly the infrastructure you need, optimized for your workloads and your budget. No vendor telling you that your perfectly good servers are not on the blessed list. No forced hardware refresh cycles because a compatibility matrix changed.
But it also means Proxmox assumes you know what you are doing. There is no wizard that validates your architecture. No pre-flight check that tells you your Ceph network is undersized or your HA fencing will not work with that hardware. You are the guard rail.
This is where a lot of VMware migrations stall. Teams used to a platform that constrains them into good decisions suddenly have total freedom, and total responsibility. You need deep awareness of Linux, networking, storage, and hardware. You need to understand why a design works, not just follow a vendor's reference guide.
This is not a criticism. It is the core reason organizations do not make the move. If your team has strong Linux and infrastructure skills (or a managed Proxmox partner who does), the freedom is a superpower.
HA sounds simple. It is not.
Proxmox has built-in High Availability. Check a box, assign a VM to an HA group, done. If a node dies, the VM restarts on another node.
In theory.
In practice, Corosync needs reliable, low-latency links between nodes. If those links flap, you get split-brain scenarios, and split-brain in a hypervisor cluster is the kind of problem that ruins your day.
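Redundant Corosync links are configured directly in the cluster configuration. A minimal sketch of what that looks like in a Proxmox `corosync.conf` (node names, IPs, and subnets are illustrative, not our production layout; each link should run over a physically separate network):

```
totem {
  cluster_name: prod-cluster
  config_version: 4
  ip_version: ipv4
  link_mode: passive
  secauth: on
  version: 2
  # Two kNet links: link 0 on the primary cluster network,
  # link 1 on an independent fallback network.
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.10.11   # primary cluster network
    ring1_addr: 10.0.20.11   # independent fallback network
  }
  # ...one node block per cluster member
}
```

With `link_mode: passive`, Corosync uses link 0 and fails over to link 1 when it degrades; `corosync-cfgtool -s` shows the live status of both links.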
Things we learned the hard way:
- Redundant Corosync links. Corosync is the heartbeat of your cluster. A single link that flaps at the wrong moment can trigger a split-brain. Redundancy is not optional.
- Failover testing is mandatory. HA configured is not HA verified. Pull a power cable, simulate a network partition, kill a node. If you have not tested it, you do not know whether it works.
- Resource capacity planning. When a node fails, its VMs restart on the remaining nodes. If those nodes are already running at 80% capacity, you do not have an HA cluster: you have a cluster that fails twice.
HA is essential. But treat it as something you engineer, not something you enable.
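The capacity-planning point above is easy to sanity-check with arithmetic. A minimal sketch (node count, utilization, and the 90% ceiling are illustrative values, not a recommendation):

```shell
#!/bin/sh
# N+1 capacity check: when one node fails, its load lands on the survivors.
# With n nodes each at u% average utilization, post-failure utilization
# is roughly u * n / (n - 1). Keep that number under a safety ceiling.

nodes=5
avg_util=70     # current average utilization in percent
ceiling=90      # utilization you tolerate after a single node failure

post_failure=$(( avg_util * nodes / (nodes - 1) ))

echo "post-failure utilization: ${post_failure}%"
if [ "$post_failure" -gt "$ceiling" ]; then
    echo "NOT N+1 safe: a node failure pushes survivors past ${ceiling}%"
else
    echo "N+1 safe"
fi
```

Run the same check with a node at 80% average utilization and five nodes, and you get 100% post-failure: the "cluster that fails twice" scenario described above.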
Monitoring is not optional: it is the product
The built-in Proxmox GUI gives you basics: CPU graphs, memory usage, a task log. That is enough to know something is broken. It tells you nothing about why, nothing about what will break next, and nothing about whether your cluster is healthy or just quiet.
In production, the hard work is not installing a monitoring stack. It is figuring out what you actually need to watch. Proxmox does not hand you an answer. You have to work out which signals matter: is your storage network keeping up, or is it silently throttling VM performance? Are your OSD disks healthy, or degrading quietly? Is your HA fencing reliable under real failure conditions, or only under the conditions you tested?
This takes time and incidents. You learn what to monitor by running into problems you did not see coming. Every production issue teaches you something that should become an alert or a dashboard. Over three years, we have accumulated that knowledge. We know which metrics predict failures before they happen, and which alerts are just noise.
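Most of those lessons end up as simple threshold checks. A sketch of the shape they take, here against a hard-coded sample in the format of `ceph osd perf` output (OSD id, commit latency, apply latency in ms); the 50 ms threshold is illustrative and workload-dependent:

```shell
#!/bin/sh
# Flag OSDs whose commit latency exceeds a threshold.
# In production the input comes from 'ceph osd perf'; this sample is
# hard-coded so the logic is visible: columns are osd-id, commit ms, apply ms.

threshold_ms=50

sample='0 4 3
1 6 5
2 83 71'

# awk: field 2 is commit latency; print the OSD name when it exceeds the threshold
slow=$(echo "$sample" | awk -v t="$threshold_ms" '$2 > t { print "osd." $1 }')

if [ -n "$slow" ]; then
    echo "slow OSDs over ${threshold_ms}ms: $slow"
else
    echo "all OSD latencies under ${threshold_ms}ms"
fi
```

The real version of this check feeds an alert rule rather than stdout, but the principle is the same: every incident becomes a threshold you watch from then on.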
This is one of the bigger gaps in a VMware migration. VMware comes with decades of tooling, integrations, and certified consultants who have seen your problem before. Proxmox comes with a great platform and a blank page. You have to build the observability layer yourself, and it takes real production experience to build it well.
Ceph storage: the backbone you need to get right
Most Proxmox deployments at scale use Ceph for distributed storage. It is deeply integrated, open source, and scales horizontally. It is also the component that requires the most careful configuration to run reliably at scale.
What we have learned running Ceph in production:
- Network separation is non-negotiable. Ceph needs its own dedicated network, separate from VM traffic and management. We use 25Gbit links for the Ceph cluster network and for the public network. Mixing Ceph with VM traffic on the same links is asking for latency spikes during rebalancing.
- OSD count and sizing matters. We standardize on enterprise NVMe drives, 3.84TB, 7.68TB or 15.36TB per OSD. Consumer drives are not an option for production workloads.
- Recovery thundering herd. When an OSD fails, Ceph rebalances data across the remaining drives. If your cluster is already at 90%+ capacity, this rebalancing competes with production I/O and can degrade the entire cluster. We maintain a strict capacity ceiling.
- Placement groups matter. Too few PGs and your data distribution is uneven. Too many and your OSDs spend more time managing PGs than serving I/O. The formula is not complicated, but getting it wrong at scale is painful.
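The classic PG sizing rule of thumb is (OSD count × 100) / replica count, rounded up to the next power of two. A sketch with illustrative numbers (modern Ceph can manage this with the PG autoscaler, but knowing the target helps you sanity-check what it chooses):

```shell
#!/bin/sh
# Rule-of-thumb PG count: (OSDs * 100) / replicas, rounded up
# to the next power of two. Cluster size here is illustrative.

osds=24
replicas=3

target=$(( osds * 100 / replicas ))   # raw target before rounding

# round up to the next power of two
pg=1
while [ "$pg" -lt "$target" ]; do
    pg=$(( pg * 2 ))
done

echo "suggested pg_num for ${osds} OSDs at ${replicas}x replication: ${pg}"
```

For 24 OSDs at 3x replication the raw target is 800, which rounds up to 1024. Getting this an order of magnitude wrong in either direction is the "painful at scale" case the bullet above describes.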
```shell
# Check Ceph cluster health and capacity
ceph status
ceph df
ceph osd pool ls detail

# Monitor OSD performance
ceph osd perf

# Check for slow OSDs (common precursor to disk failure)
ceph daemon osd.0 perf dump | jq '.osd.op_latency'
```

Ceph is not a black box. It tells you everything, if you know where to look. The challenge is building the monitoring and alerting that turns those signals into actionable insights before they become incidents.
The real cost of a VMware migration
License savings are what get the conversation started. But the total cost of a Proxmox migration includes more than license fees.
| Cost Factor | VMware | Proxmox (Self-managed) | Proxmox (Managed) |
|---|---|---|---|
| Licensing | CHF 50-100k+/year | ~CHF 5k/year (subscription) | Included in service |
| Hardware freedom | Restricted to HCL | Any enterprise hardware | Validated by partner |
| Operational expertise | Vendor + consultants | Must build internally | Provided by partner |
| Monitoring & alerting | Aria suite | Build from scratch | Production-ready stack |
| HA validation | Reference architectures | Design and test yourself | Battle-tested designs |
| Migration effort | N/A | 3-6 months typical | 1-3 months guided |
| Ongoing maintenance | Vendor patches + updates | Your team | Managed for you |
The organizations that get the best ROI from Proxmox are those that either invest in building a dedicated platform team, or partner with a managed Proxmox provider who brings the operational knowledge from day one.
How to succeed: build versus partner
After three years and dozens of customer engagements, we see two paths that work:
Path 1: Build internally. Invest in 2-3 engineers with deep Linux, networking, and storage experience. Plan for 6-12 months of learning curve and budget accordingly. This path works for organizations with 50+ servers who want full control and have the talent pipeline to sustain it.
Path 2: Partner with a managed provider. Leverage someone else's three years of production experience. Get a validated architecture, a monitoring stack, and operational runbooks from day one. Your team focuses on what runs on the platform, not the platform itself. This path works for organizations of any size who want the cost benefits of Proxmox without building the expertise from scratch.
Both paths work. The approach that tends to run into trouble is treating Proxmox like VMware: expecting the platform to guide you, underinvesting in monitoring, and skipping HA validation. Without those foundations in place, the migration rarely sticks.
What we are still figuring out
Honesty matters. Three years in, there are still hard problems:
- DRS-like workload balancing. Proxmox HA restarts VMs on other nodes, but it does not automatically balance workloads across the cluster. Proxmox is building this as we write, but for now keeping the cluster balanced and no node overloaded is a manual process.
- Enterprise backup integration. Proxmox Backup Server is solid for the basics, but integrating with enterprise backup tools (Veeam, Commvault) requires work. The ecosystem is improving rapidly but is not at VMware parity yet.
- Multi-site Ceph. Stretched clusters across data centers are possible but operationally complex. We currently run site-local Ceph clusters with application-level replication instead.
- GPU passthrough at scale. Single GPU passthrough works well. Managing a fleet of GPU-enabled VMs with live migration is still a manual process compared to VMware's mature vGPU ecosystem.
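The manual balancing mentioned in the first point reduces to a recurring question: which node is overloaded, and where is there room? A sketch of that check against a hard-coded sample (node, CPU %); on a live cluster the input would come from `pvesh get /cluster/resources` instead:

```shell
#!/bin/sh
# Flag the busiest and least busy node so an operator can decide
# what to migrate. Input format: node name, CPU utilization percent.
# Sample data is illustrative; a live version queries the Proxmox API.

sample='pve1 82
pve2 41
pve3 55'

# sort numerically by the utilization column
busiest=$(echo "$sample" | sort -k2 -n | tail -n 1 | awk '{print $1}')
idlest=$(echo "$sample"  | sort -k2 -n | head -n 1 | awk '{print $1}')

echo "consider migrating VMs from $busiest to $idlest"
```

This only surfaces candidates; deciding which VM to move (and when) still needs a human until the built-in balancer lands.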
These are not dealbreakers. They are areas where the ecosystem is younger and requires more operational investment. We share them because organizations deserve to know what they are signing up for.
You do not have to figure this out alone
The knowledge gap is exactly what we close at Natron. We are not a reseller who read the docs last week. We are engineers who have been running Proxmox in production for three years, with the operational history to back it up.
What working with us looks like:
- Managed Proxmox clusters: we design, deploy, monitor, and maintain your environment so your team can focus on what runs on the platform, not the platform itself.
- Migration support: whether you are coming from VMware, Hyper-V, or bare metal, we have done it before and know where the pitfalls are.
- A real monitoring stack from day one: Prometheus, Grafana, Alertmanager, custom dashboards. Not a checkbox; the same stack we use for our own infrastructure.
- An honest conversation first: we will tell you if Proxmox is the right fit for your workloads. If it is not, we would rather say so upfront than sell you something that does not work.
We are Natron, based in Bern, Switzerland. We build and operate Natron Cloud: managed cloud infrastructure on Proxmox for businesses that want digital sovereignty without the operational burden.
Get the Full Guide
Enter your email and get instant access to the full guide as a downloadable PDF.
- 3 years of production Proxmox experience with 1000+ VMs
- Honest assessment of what VMware migrations actually require
- HA, monitoring, and Ceph storage lessons learned the hard way
- Decision framework: when Proxmox is right and when it isn't
Free download. No spam. We never share your data with third parties.