Hi all,
Suppose you wanted to deploy ZK in a virtualized environment, despite
all of the known drawbacks. Assume the ZK servers could all be given
independent CPUs and storage, though not dedicated disks. The disks
would be shared with other, non-ZK VMs on the same hypervisor, so ZK
would occasionally hit the default session timeout, and you would need
to raise the session timeout to something like 30 seconds.
I'm curious whether there would be any technical drawbacks to adding a
second heartbeat mechanism between the clients and the servers, with the
goal of detecting network-only failures faster than the existing
heartbeat mechanism does. The idea is that a new thread, dedicated to
processing these heartbeats, would never block on I/O. Clients could
then configure a second, smaller timeout value, and any timeout on that
heartbeat would be assumed to indicate a real problem. The existing
mechanism would stay in place to catch I/O-related errors.
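To make the proposal concrete, here is a minimal sketch of the client-side decision with two timeouts. This is not ZooKeeper code: the class name, the method names, and the 2-second fast timeout are all hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    NETWORK_FAILURE = "network_failure"  # fast, I/O-free heartbeat missed
    SESSION_TIMEOUT = "session_timeout"  # existing, disk-aware heartbeat missed


@dataclass
class DualTimeoutMonitor:
    """Hypothetical client-side tracker for the two heartbeat channels."""
    fast_timeout: float      # seconds; the proposed second, smaller timeout
    session_timeout: float   # seconds; the existing (raised) session timeout
    last_fast_reply: float = 0.0
    last_session_reply: float = 0.0

    def on_fast_reply(self, now: float) -> None:
        # Reply from the dedicated heartbeat thread, which never waits on disk.
        self.last_fast_reply = now

    def on_session_reply(self, now: float) -> None:
        # Reply from the existing heartbeat path, which can stall on slow I/O.
        self.last_session_reply = now

    def check(self, now: float) -> Verdict:
        # The fast channel is checked first: silence there is assumed to be
        # a real network failure or crash, not a slow disk.
        if now - self.last_fast_reply > self.fast_timeout:
            return Verdict.NETWORK_FAILURE
        if now - self.last_session_reply > self.session_timeout:
            return Verdict.SESSION_TIMEOUT
        return Verdict.HEALTHY
```

With a 2-second fast timeout and a 30-second session timeout, a server that keeps answering fast heartbeats while its disk stalls is only reported after 30 seconds, while a server that goes completely silent is reported after 2 seconds.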
I understand the philosophy that there should be some heartbeat
mechanism that takes the disk into account, but I'm having trouble
coming up with technical reasons not to add a second mechanism.
Obviously, the advantage would be that the clients could detect network
failures and system crashes more quickly in an environment with slow
disks, and fail over to other servers more quickly. The only
disadvantages I can come up with are:
1) More code complexity, and slightly more heartbeat traffic on the wire.
2) I think the servers have to log session expirations to disk, so if
sessions expire faster than the disk can log them, a large backlog
could build up.
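The backlog concern in point 2 is simple arithmetic. A rough sketch follows, where the function name and the rates are made-up illustrations rather than measured ZooKeeper figures, and where it is assumed (as above) that each expiration costs one logged entry:

```python
def expiration_backlog(expire_rate: float, log_rate: float, seconds: float) -> float:
    """Pending expirations after `seconds`, given expirations arriving at
    `expire_rate` per second and the disk logging `log_rate` per second.
    Hypothetical helper; the backlog cannot go below zero."""
    return max(0.0, (expire_rate - log_rate) * seconds)
```

For example, if 500 sessions per second expired while the shared disk could only log 200 per second, `expiration_backlog(500, 200, 60)` gives 18000 pending entries after one minute. If the disk keeps up, the backlog stays at zero.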
Are there other drawbacks I am missing? Would a patch that added
something like this be considered, or is it dead from the start?

Thanks,
Jeremy