Batch service using OneFlow - best practices


I am looking for tips on how to configure a OneFlow service for CPU-intensive batch processing. I have many jobs, each taking several minutes on four CPU cores. The jobs are CPU-local: no network communication is needed apart from downloading the input and uploading the result afterwards. I came up with the following architecture:

  • a service with two roles:
      • a master VM with outside network access, which holds the job queue (each job in its own directory)
      • computing VMs connected to the private VNet, which mount the job directory over NFS, fetch job inputs from there, and upload the results back
  • a private VNet for the service, connecting the computing VMs to the master
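For reference, the two-role layout above might be sketched as a OneFlow service template like the following. This is a minimal sketch, not a drop-in file: the template IDs, thresholds, and the `QUEUED_JOBS` attribute are assumptions, and `QUEUED_JOBS` would have to be reported by the master through OneGate (`onegate vm update --data ...`) for the elasticity expressions to evaluate:

```json
{
  "name": "batch-service",
  "deployment": "straight",
  "roles": [
    {
      "name": "master",
      "cardinality": 1,
      "vm_template": 10
    },
    {
      "name": "worker",
      "cardinality": 2,
      "vm_template": 11,
      "parents": ["master"],
      "min_vms": 1,
      "max_vms": 10,
      "elasticity_policies": [
        {
          "type": "CHANGE",
          "adjust": 1,
          "expression": "QUEUED_JOBS > 20",
          "period_number": 3,
          "period": 60,
          "cooldown": 300
        },
        {
          "type": "CHANGE",
          "adjust": -1,
          "expression": "QUEUED_JOBS < 2",
          "period_number": 5,
          "period": 60,
          "cooldown": 300
        }
      ]
    }
  ]
}
```

With `"deployment": "straight"` and `"parents": ["master"]`, the worker role is only deployed once the master role is RUNNING, which fits the NFS dependency.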

Now I have some questions:

  1. When using autoscaling (when the queue has more than X waiting jobs for longer than Y, spawn a new computing VM; when the queue is near-empty for longer than Z, delete one) - how can I properly drain a server? That is, how can OneFlow inform a computing VM that it is the one about to be decommissioned, and once that VM finishes its current job, how can it inform the OneFlow server that it can be safely destroyed?
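As far as I know OneFlow has no built-in drain handshake, but OneGate gives you the pieces for a cooperative one: something on the master side sets a flag such as `DRAIN=YES` in the chosen VM's user template, the worker polls its own metadata between jobs, finishes the job in flight, and reports back before being destroyed. Below is a minimal worker-side sketch; the `DRAIN`/`DRAIN_DONE` attribute names and the `run_next_job()`/`report_drained()` hooks into the NFS job queue are my own assumptions, not stock OneFlow behaviour:

```python
# Sketch of a cooperative drain loop for a computing VM.
# Assumed protocol (NOT built into OneFlow): the master sets DRAIN=YES in
# this VM's user template via OneGate; run_next_job()/report_drained() are
# hypothetical hooks into the NFS-mounted job queue, supplied by the caller.
import json
import urllib.request


def onegate_get_vm(endpoint: str, token: str, vmid: str) -> dict:
    """Fetch this VM's metadata from the OneGate REST API (GET /vm)."""
    req = urllib.request.Request(
        f"{endpoint}/vm",
        headers={"X-ONEGATE-TOKEN": token, "X-ONEGATE-VMID": vmid},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def drain_requested(vm_info: dict) -> bool:
    """True once the master has flagged this VM for decommissioning."""
    user_template = vm_info.get("VM", {}).get("USER_TEMPLATE", {})
    return user_template.get("DRAIN") == "YES"


def worker_loop(get_vm, run_next_job, report_drained):
    """Run jobs until a drain flag appears, then finish up and report back."""
    while True:
        if drain_requested(get_vm()):
            report_drained()  # e.g. PUT /vm with DRAIN_DONE=YES via OneGate
            return
        run_next_job()        # blocks for the duration of one batch job
```

The endpoint and token come from the VM's contextualization (`ONEGATE_ENDPOINT`, the token file written by the context packages). The scale-down side then needs to pick only VMs with `DRAIN_DONE=YES`, which is the part I am unsure OneFlow's own scheduled policies can guarantee.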

  2. Is it possible to lower the priority of the QEMU processes for that role, so that they can use all otherwise-idle CPU time on the OpenNebula nodes without hindering the performance of the non-batch VMs?
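On KVM hosts the knob I would look at first is libvirt CPU shares, which only bite under contention: an idle node still gives the batch VMs full CPU time, but they yield to higher-weight VMs when the node is busy. If I understand the docs correctly, OpenNebula derives the cgroup CPU weight from the `CPU` attribute of the VM template, and you can also inject raw libvirt XML via `RAW`. A sketch for the worker role's VM template (the concrete shares value is an assumption; 1024 is the libvirt default per VM, so lower means "yield under contention"):

```
VCPU = 4
CPU  = 0.5   # low CPU weight relative to VMs with CPU = VCPU

# alternatively, set the libvirt cputune block directly:
RAW = [
  TYPE = "kvm",
  DATA = "<cputune><shares>256</shares></cputune>"
]
```

Please double-check the `CPU`-to-shares mapping against your OpenNebula version's KVM driver documentation before relying on it.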

  3. Is it possible to somehow attach the OneGate endpoint to that private VNet, so that OneGate traffic does not need to pass through the master node? I would prefer not to route/NAT the private VNet to the outside world; I want the computing VMs to remain as isolated as possible.
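One option, assuming the frontend can get an interface on that VNet's bridge: OneGate is a separate daemon whose bind address and port are set in `/etc/one/onegate-server.conf` (`:host`, `:port`), so you could bind it to an address on the private VNet and then override the endpoint the VMs receive in the role's context. A sketch with an assumed address of 10.0.0.1 for the frontend on the private VNet:

```
CONTEXT = [
  NETWORK = "YES",
  TOKEN   = "YES",
  ONEGATE_ENDPOINT = "http://10.0.0.1:5030"
]
```

I believe newer OpenNebula releases also ship a OneGate proxy running on the hypervisor hosts precisely for VMs on isolated VNets, so it may be worth checking whether your version supports that before wiring the frontend into the VNet.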