GitLab Appliance - weird issues - performance?

After numerous attempts to deploy the GitLab appliance from the marketplace ( https://marketplace.opennebula.io/appliance/6b54a412-03a5-11e9-8652-f0def1753696 ), I still cannot get past the login page.

I am not currently configuring SSL/HTTPS - this is all HTTP, so I am leaving the relevant contextualization entries blank.

The appliance appears to deploy correctly, shows as “All set and ready to serve” on the console, and the logfiles in /etc/one-appliance don’t show any errors.

However, when I connect to the URL, the login page takes several minutes to load - and when it finally appears, the formatting is badly broken: Times New Roman font, no layout (but, weirdly, the logo appears!). I can try to log in, but again, after a very long time (5+ minutes?) it eventually gives me a 401 page - again with the logo, but this time with better formatting.

Initially I believed this was a performance problem, but I’ve now deployed this onto a fast NFS server (I can see relatively low disk latency and no queuing) and it has 24GB of RAM and 4 vCPUs. It’s not resource-starved.

The hypervisor is KVM on CentOS 7. I’ve tried a variety of network configurations (VXLAN + vRouter, adding it to the management VLAN) but I seem to get the same result. Can anyone suggest what I am missing?


Versions of the related components and OS (frontend, hypervisors, VMs):
GitLab Appliance 13.0.5-5.12.0-1.20200609
OpenNebula 5.10.1
CentOS 7

Steps to reproduce:
Using the Iguane Solutions Terraform provider for OpenNebula:

resource "opennebula_image" "GitLab" {
  name = "GitLab - KVM"
  description = "CentOS 7 based GitLab CE appliance"
  datastore_id = 1 
  persistent = false
  lock = "UNLOCK"
  path = "https://marketplace.opennebula.io//appliance/6b54a412-03a5-11e9-8652-f0def1753696/download/0"
  dev_prefix = "vd"
  driver = "qcow2"
  permissions = "660"
  group = var.one_target_group
}
resource "opennebula_template" "GitLabSrvTemplate" {
  name = "GitLab Server"
  template = templatefile("${path.module}/gitlab_srv.tmpl", {
    hostname              = local.gitlab_name
    domain                = var.domain
    vcpu                  = 4
    cpu                   = "4.0"
    mbram                 = "18000"
    oneapp_site_hostname  = "${local.gitlab_name}.${var.domain}"
    oneapp_admin_username = "theusername"
    oneapp_admin_password = "secretpassword"
    oneapp_admin_email    = "geunine@email.provider"
  })
  permissions = "660"
}

resource "opennebula_virtual_machine" "GitLabSrv" {
  name         = local.gitlab_name
  permissions  = "660"
  template_id  = opennebula_template.GitLabSrvTemplate.id
  nic {
    model      = "virtio"
    network_id = var.subnet_id
  }
}
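
For completeness, the quickest way I’ve found to check what context the provider actually rendered from gitlab_srv.tmpl is to dump the VM template on the front-end once it is deployed (the VM ID is whatever OpenNebula assigned):

# print the rendered CONTEXT section of the running VM
onevm show <vm_id> | grep -A 20 CONTEXT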


Hi @SteveB

I think that the problem is bad DNS resolution - for example, I do not know how the "${local.gitlab_name}.${var.domain}" expression used for ONEAPP_SITE_HOSTNAME gets expanded.

But even if it expands to the correct, intended result, it may still fail because of this:

    hostname              = local.gitlab_name
    domain                = var.domain

HOSTNAME and DOMAIN are not valid context variables (I am not familiar with the Terraform provider, so maybe I am wrong here) - it should be SET_HOSTNAME, as described in the docs. I could not find anything relating to DOMAIN.

If none of the above is the issue (the Terraform provider is actually using SET_HOSTNAME under the hood and the appliance itself resolves its own correct domain name), then it could be that your client (browser) is the one not resolving that domain correctly.

Basically, let’s say that your domain is mygitlab.mylocaldomain. If everything is set up correctly, then inside the appliance VM you should be able to do:

curl -L http://mygitlab.mylocaldomain

and get the login page HTML source on stdout. That is achieved, for example, by a correct record in /etc/hosts:

127.0.0.1 mygitlab.mylocaldomain mygitlab

(with SET_HOSTNAME = "mygitlab.mylocaldomain" this should already have been done)

You should also be able to resolve mygitlab.mylocaldomain from your laptop/PC and get either the IP of the VM, or the IP of a router/firewall/NAT that correctly forwards port 80 to the VM.

In the simple scenario where your laptop/PC is on the same network as the VM (with the address 192.168.111.111), you should also have a record in your DNS nameserver, or simply in /etc/hosts:

192.168.111.111 mygitlab.mylocaldomain mygitlab
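
To quickly verify the resolution on both sides, something along these lines (the hostname and address are the example ones from above):

# inside the appliance VM - should print 127.0.0.1 (or the VM address)
getent hosts mygitlab.mylocaldomain

# on your laptop/PC - should print the VM address, and GitLab should answer on port 80
getent hosts mygitlab.mylocaldomain
curl -sI http://mygitlab.mylocaldomain | head -n 1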

Let me know if my hints were useful :slight_smile:

UPDATE: please also note that OpenNebula replaces underscores in the hostname with dashes, so gitlab_name becomes gitlab-name in /etc/hosts; the appliance will then never succeed in verifying that GitLab is running, because resolution fails again… this is partially an OpenNebula limitation as of now. Reference: dns - Can (domain name) subdomains have an underscore "_" in it? - Stack Overflow
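
A quick way to see whether you are affected is to compare the name you configured with what actually ended up inside the VM:

# inside the appliance VM - the two should agree (and contain dashes, not underscores)
hostname -f
grep -i gitlab /etc/hosts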

-osp-

Thanks for the detailed answer, -osp-, and apologies - I should have included the template file that Terraform uses to build the contextualization data. I’ll break it out properly in a moment, but in short, if I look at the appliance context in Sunstone:
HOSTNAME = my-GitLab
ONEAPP_ADMIN_EMAIL = me@gmail.com
ONEAPP_ADMIN_USERNAME = admin
ONEAPP_ADMIN_PASSWORD = sneaky!P4ss
ONEAPP_SITE_HOSTNAME = my-GitLab@mydomain.test
SET_HOSTNAME = my-GitLab
DOMAIN = }mydomain.test

So the sneaky brace at the start of the DOMAIN entry is a problem :slight_smile: I will try fixing that and see what happens, but I don’t think that is the cause.
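
As an extra sanity check I’m also dumping the raw context the VM actually received - it is delivered on a small CD-ROM image, so roughly this (the device name may differ, e.g. /dev/hdb on some setups):

# inside the appliance VM
mount -o ro /dev/sr0 /mnt
grep -E 'HOSTNAME|DOMAIN|ONEAPP' /mnt/context.sh
umount /mnt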

DNS is definitely resolving the GitLab appliance name from my client - but I haven’t tested it from the appliance itself - I’ll test that now.

Revised and re-tested.

Sadly, it’s definitely not DNS - I can curl from the appliance itself using both DNS shortname and long-name. Response is (as you would imagine) instantaneous.

The same is true from my machine - I can curl both the IP and the FQDN and immediately receive a page full of HTML. Interestingly, I piped that to a file, and it appears to be the same basic layout (no pretty panes) that I get from the server when I point a browser at it (picture above). Visiting the server with a browser sadly gives me a white page for several minutes before eventually showing the same sparse page. I’ve tried profiling the page load in Chrome, and interestingly it takes nearly 300 seconds, showing ERR_CONNECTION_RESET 200 (OK) a couple of times during that process when loading stylesheets (CSS), before also returning "unexpected end of JSON input" when loading
/assets/webpack/commons~pages.admin.sessions~pages.ldap.omniauth_callbacks~pages.omniauth_callbacks~pages.sessions~p~9253e31e.90c611c5.chunk.js.map

So there is something weird going on in that appliance - any clues which would be the best logfiles to start with?
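
For reference, this is roughly how I’ve been timing the asset fetches outside the browser (the FQDN is a stand-in for my real one, the path is the one Chrome reported); it needs to be run a few times, since the resets are intermittent:

curl -s -o /dev/null \
  -w 'HTTP %{http_code} in %{time_total}s (%{size_download} bytes)\n' \
  'http://my-gitlab.mydomain.test/assets/webpack/commons~pages.admin.sessions~pages.ldap.omniauth_callbacks~pages.omniauth_callbacks~pages.sessions~p~9253e31e.90c611c5.chunk.js.map'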

Hi @SteveB

I still think that the problem is somewhere in your context and/or addresses/DNS. This behavior often occurs when a web app (like GitLab) tries to talk dynamically to its backend and fails - that is why there are timeouts and the page only loads partially: the static portion was served, but the dynamic one was not (the frontend talking over a JSON API or similar). The reason is probably a misconfiguration - GitLab is configured with some ONEAPP_SITE_HOSTNAME, and it requires that both the browser and the web app itself (internally) can reach it, and that it points to the same place for both.
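
One quick way to separate slow static file serving from an unhealthy Rails backend is to hit GitLab’s built-in health endpoints from inside the appliance (by default these only answer requests from localhost):

curl -s http://127.0.0.1/-/health      # should print: GitLab OK
curl -s http://127.0.0.1/-/readiness   # JSON with the state of the internal services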

I suggest the following:

  1. remove Terraform from the equation and try to deploy the appliance manually, as simply as possible (see the sketch after this list)
  2. post your appliance’s /etc/one-appliance/config here (after it has bootstrapped successfully)
  3. show us hostname -f from inside the appliance
  4. show us ping -c 1 my-gitlab.mydomain.test from within your appliance and from your PC
  5. from the PC where you are running the browser, do:
$ ssh -NfL 8080:127.0.0.1:80 myuser@my-gitlab.mydomain.test # or similar, to tunnel the HTTP port (set up SSH keys first)
$ echo "127.0.0.1 my-gitlab.mydomain.test" | sudo tee -a /etc/hosts
$ xdg-open "http://my-gitlab.mydomain.test:8080"
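
For point 1, something along these lines should be enough to pull the appliance out of the marketplace onto your front-end without Terraform (check onemarketapp help export for the exact options on your version):

# find the exact marketplace app name first
onemarketapp list | grep -i gitlab
# export it into an image datastore - this creates the image and (usually) a matching VM template
onemarketapp export '<app name from the list above>' gitlab-manual --datastore default

Then instantiate the created template from Sunstone and fill in the ONEAPP_* context fields by hand.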

There is still a possibility that you have found a bug, but the last time I tested a deployment similar to yours (without Terraform) it worked fine.

-osp-

Okay - weirder and weirder - it’s something filesystem-related, somehow.

I can tail * in /root - there are no files in there, and it tells me so.
I can create a couple of text files in there and tail * shows me the contents of both. So far, so normal.

cd /var/log/one-appliance/
tail ONE_bootstrap.log works okay
tail ONE_configure.log works okay
tail ONE_install.log works okay
tail * and it shows the files.
tail * a second time and the SSH session hangs indefinitely.

So I tried ls * from the filesystem root a few times. The third time - the SSH session hung.

I now suspect that this isn’t a problem with the appliance at all, and there is some serious underlying issue.
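
Next thing I’ll check is whether the kernel logs anything at the moment a session hangs (from a second SSH or VNC session):

dmesg -T | tail -n 50
dmesg -T | grep -iE 'hung task|blocked for more than|i/o error'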

OK, just a few obvious pointers, in case you have not checked these already:

$ df -h
$ mount | grep ro

Did you resize the image so it has room to breathe? What about memory - GitLab needs 4GB+…

Good luck

-osp-

Thanks, man - I appreciate the pointers. Sadly, there is more than 25% free on all filesystems, only tmpfs is mounted RO (as usual), and there is more than 2GB of memory free once everything is loaded (the appliance is configured with 6GB).

Time to start looking for kvm issues, I think :frowning:

Weirder and weirder.
The issue persists irrespective of whether the appliance runs from an NFS store or from local disk.
I’ve tried it on a KVM node running inside vSphere AND on a KVM node on raw hardware, and I get the same issue. Significantly, I can’t reproduce the issue when using a VNC session to the appliance… it’s time to look at the network.
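
First test: can a full-size frame actually cross the path without fragmentation? 1472 bytes of ICMP payload plus 28 bytes of headers is exactly 1500, so this fails if anything on the path can’t carry a 1500-byte packet:

# run from the client towards the appliance (and the other way round from inside the VM)
ping -M do -s 1472 my-gitlab.mydomain.test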

Finally nailed it. The appliance defaults to an MTU size of 1500. For reasons I don’t yet understand, this is too large to make it across the network, but isn’t getting fragmented. I’ve reduced the MTU size on the VM and everything now seems to behave normally.
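
For the record, this is roughly how I dropped it for the test (the interface name may differ; the change is not persistent, and the proper fix will be setting the MTU on the virtual network once I understand why the path can’t carry 1500):

# inside the appliance - temporary, lost on reboot
ip link set dev eth0 mtu 1400
ip link show eth0 | grep mtu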

Thanks for your help, Petr: we were looking in the wrong place, but you were right - it was a network issue all along - and thanks for the appliance (can’t wait to try it out, as soon as I find the root cause of the MTU issue!)

Glad to help @SteveB - it is great that you managed to find the root cause :wink:

P.S. I am not sure why a tmpfs would be mounted as read-only, though - maybe it is some custom mount and not one of /dev/shm or /tmp…

-osp-