Notes on a painful GitLab upgrade

2017-9-23 22:26:44

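I suddenly noticed that GitLab 10.0.0 had been released. The office is closed today and nobody is using GitLab, so I decided to upgrade.

Our GitLab has had a lurking problem for a long time, though: the bundled PostgreSQL could never be upgraded. Every dnf upgrade gitlab-ce failed at the database upgrade step, but since nothing ever broke in daily use, I kept ignoring it.

This time, however, GitLab 10.0.0 finally dropped support for the old PostgreSQL 9.2.18. The upgrade stopped right away with a message that the database is not supported, so it was time to fix this properly.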

First attempt

[Image]

Fine, upgrade it by hand then. I went straight for gitlab-ctl pg-upgrade, and it failed:

/var/opt/gitlab/postgresql/analyze_new_cluster.sh : No such file or directory

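A quick Google search turned up a GitLab issue: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1934. I had indeed moved the database at some point, so I found this file in the database directory and copied it over. Then came the next error: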

STDERR: initdb: directory "/var/opt/gitlab/postgresql/data.9.6.3" exists but is not empty
If you want to create a new database system, either remove or empty
the directory "/var/opt/gitlab/postgresql/data.9.6.3" or run initdb
with an argument other than "/var/opt/gitlab/postgresql/data.9.6.3".

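Fine. The upgrade had not completed anyway, so I deleted that directory first.

At last I saw ==== Upgrade has completed ====. That was hard-won, but the web page returned a 500, and when I checked the logs pg was spewing errors: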

2017-09-23_12:43:24.63534 LOG: unrecognized configuration parameter "unix_socket_directories" in file "/data/postgresql/data/postgresql.conf" line 72
2017-09-23_12:43:24.63535 LOG: unrecognized configuration parameter "min_wal_size" in file "/data/postgresql/data/postgresql.conf" line 173
2017-09-23_12:43:24.63536 LOG: unrecognized configuration parameter "max_wal_size" in file "/data/postgresql/data/postgresql.conf" line 174
2017-09-23_12:43:24.63536 LOG: unrecognized configuration parameter "max_replication_slots" in file "/data/postgresql/data/postgresql.conf" line 177
2017-09-23_12:43:24.63536 LOG: unrecognized configuration parameter "min_wal_size" in file "/data/postgresql/data/runtime.conf" line 14
2017-09-23_12:43:24.63537 LOG: unrecognized configuration parameter "max_wal_size" in file "/data/postgresql/data/runtime.conf" line 15
2017-09-23_12:43:24.63537 LOG: unrecognized configuration parameter "idle_in_transaction_session_timeout" in file "/data/postgresql/data/runtime.conf" line 102
2017-09-23_12:43:24.63537 FATAL: configuration file "/data/postgresql/data/postgresql.conf" contains errors
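
To see at a glance which settings the server is rejecting, the log can be reduced to a list of parameter names. The following is a minimal sketch that assumes the log format shown above; `unrecognized_params` is a hypothetical helper name, and the sample text is two lines copied from the log.

```python
import re

# Matches PostgreSQL's "unrecognized configuration parameter" log lines.
PATTERN = re.compile(
    r'unrecognized configuration parameter "(?P<name>[^"]+)" '
    r'in file "(?P<file>[^"]+)" line (?P<line>\d+)'
)


def unrecognized_params(log_text):
    """Return {parameter name: [(file, line), ...]} for every rejected setting."""
    found = {}
    for match in PATTERN.finditer(log_text):
        location = (match.group("file"), int(match.group("line")))
        found.setdefault(match.group("name"), []).append(location)
    return found


sample = """\
2017-09-23_12:43:24.63534 LOG: unrecognized configuration parameter "unix_socket_directories" in file "/data/postgresql/data/postgresql.conf" line 72
2017-09-23_12:43:24.63536 LOG: unrecognized configuration parameter "min_wal_size" in file "/data/postgresql/data/runtime.conf" line 14
"""

for name, locations in unrecognized_params(sample).items():
    print(name, locations)
```

The rejected names are, as far as I know, all settings that only exist in PostgreSQL releases newer than 9.2 (for example, 9.2 spells the socket setting unix_socket_directory). That pattern suggests a newer configuration being read by an older server binary.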

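So I went looking for the pg binary to see what on earth was going on, and found that /opt/gitlab/embedded/bin/postgres pointed to /opt/gitlab/embedded/postgresql/9.2.18/bin/postgres. After some thought, I decided on a brute-force fix:

[Image]

Now pg could finally start.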
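
The mismatch diagnosed here (a 9.2.18 binary started against an upgraded data directory) can be checked by resolving the symlink and comparing it with the PG_VERSION file that PostgreSQL keeps in its data directory. This is only a sketch: the function names are hypothetical, the version is parsed from a path laid out like /opt/gitlab/embedded/postgresql/9.2.18/bin/postgres, and it assumes pre-10 versions where the major version has two parts. The demo builds a throwaway directory tree instead of touching a real installation.

```python
import os
import re
import tempfile


def binary_major_version(postgres_path):
    """Follow symlinks and extract the major version (e.g. '9.2') from the real path."""
    real_path = os.path.realpath(postgres_path)
    match = re.search(r"/postgresql/(\d+\.\d+)(?:\.\d+)?/", real_path)
    return match.group(1) if match else None


def data_dir_version(data_dir):
    """Read the major version recorded in the data directory's PG_VERSION file."""
    with open(os.path.join(data_dir, "PG_VERSION")) as handle:
        return handle.read().strip()


def demo():
    """Mimic the mismatch in a temporary tree and report what the checks see."""
    with tempfile.TemporaryDirectory() as root:
        bin_dir = os.path.join(root, "postgresql", "9.2.18", "bin")
        os.makedirs(bin_dir)
        real_binary = os.path.join(bin_dir, "postgres")
        open(real_binary, "w").close()  # empty stand-in for the real binary
        link = os.path.join(root, "postgres")
        os.symlink(real_binary, link)

        data_dir = os.path.join(root, "data")
        os.makedirs(data_dir)
        with open(os.path.join(data_dir, "PG_VERSION"), "w") as handle:
            handle.write("9.6\n")

        binary = binary_major_version(link)
        data = data_dir_version(data_dir)
        return binary, data, binary == data


print(demo())
```

On a real system the two arguments would be the binary path and the data directory path from the log above; a False in the last position means the server binary and the data directory disagree about the version.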

Sadly, things were not going to go that smoothly... Once unicorn came up, pg started logging errors again:

2017-09-23_12:42:48.52264 STATEMENT: SELECT COUNT(*) FROM projects LEFT JOIN namespaces ON projects.namespace_id = namespaces.id WHERE namespaces.id IS NULL;
2017-09-23_12:42:48.52274 ERROR: relation "snippets" does not exist at character 22
2017-09-23_12:42:48.52274 STATEMENT: SELECT COUNT(*) FROM snippets WHERE type='PersonalSnippet';
2017-09-23_12:42:48.52280 ERROR: relation "snippets" does not exist at character 22
2017-09-23_12:42:48.52281 STATEMENT: SELECT COUNT(*) FROM snippets WHERE type='ProjectSnippet';
2017-09-23_12:42:48.52292 ERROR: relation "uploads" does not exist at character 22
2017-09-23_12:42:48.52293 STATEMENT: SELECT COUNT(*) FROM uploads ;
2017-09-23_12:42:48.52306 ERROR: relation "users" does not exist at character 22
2017-09-23_12:42:48.52307 STATEMENT: SELECT COUNT(*) FROM users ;
2017-09-23_12:42:48.52319 ERROR: relation "ci_runners" does not exist at character 22
2017-09-23_12:42:48.52320 STATEMENT: SELECT COUNT(*) FROM ci_runners WHERE active=true AND contacted_at >= NOW() - '1 day'::INTERVAL;
2017-09-23_12:42:49.58394 ERROR: relation "ci_runners" does not exist at character 323
2017-09-23_12:42:49.58397 STATEMENT: SELECT a.attname, format_type(a.atttypid, a.atttypmod),
2017-09-23_12:42:49.58398 pg_get_expr(d.adbin, d.adrelid), a.attnotnull, a.atttypid, a.atttypmod
2017-09-23_12:42:49.58398 FROM pg_attribute a LEFT JOIN pg_attrdef d
2017-09-23_12:42:49.58398 ON a.attrelid = d.adrelid AND a.attnum = d.adnum
2017-09-23_12:42:49.58399 WHERE a.attrelid = '"ci_runners"'::regclass
2017-09-23_12:42:49.58399 AND a.attnum > 0 AND NOT a.attisdropped
2017-09-23_12:42:49.58399 ORDER BY a.attnum
2017-09-23_12:42:49.58399
2017-09-23_12:42:49.59044 ERROR: relation "ci_runners" does not exist at character 323
2017-09-23_12:42:49.59045 STATEMENT: SELECT a.attname, format_type(a.atttypid, a.atttypmod),
2017-09-23_12:42:49.59046 pg_get_expr(d.adbin, d.adrelid), a.attnotnull, a.atttypid, a.atttypmod
2017-09-23_12:42:49.59046 FROM pg_attribute a LEFT JOIN pg_attrdef d
2017-09-23_12:42:49.59046 ON a.attrelid = d.adrelid AND a.attnum = d.adnum
2017-09-23_12:42:49.59046 WHERE a.attrelid = '"ci_runners"'::regclass
2017-09-23_12:42:49.59046 AND a.attnum > 0 AND NOT a.attisdropped
2017-09-23_12:42:49.59047 ORDER BY a.attnum

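This was awkward: it could not even find the tables... so how had it been running all this time?
At this point I had no choice but to roll back first. If I could not fix it in time, it would disrupt normal work.

Second attempt

After rolling back I was not ready to give up, and never upgrading GitLab again was not an option. Resolved to try once more and reinstall if that failed too, I began the second attempt.

This time a new error appeared: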

STDOUT:
connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.50432"?

could not connect to old postmaster started with the command:
"/opt/gitlab/embedded/postgresql/9.2.18/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/opt/gitlab/postgresql/data" -o "-p 50432 -c autovacuum=off -c autovacuum_freeze_max_age=2000000000 -c listen_addresses='' -c unix_socket_permissions=0700" start
Failure, exiting
STDERR:
Upgrading the data: NOT OK

This is an intermittent failure, probably because pg had not finished starting yet. Retrying is usually enough.
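
For a transient failure like this, "just retry" can be wrapped in a small helper. The sketch below is generic: `run_with_retry`, its parameters, and the delay are illustrative assumptions, not part of GitLab. The demo runs a harmless Python one-liner rather than the real upgrade command.

```python
import subprocess
import sys
import time


def run_with_retry(command, attempts=3, delay_seconds=5.0):
    """Run command until it exits 0; return the attempt number that succeeded, or None."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay_seconds)  # give the server time to come up
    return None


# Demo with a command that always succeeds on the first try.
print(run_with_retry([sys.executable, "-c", "raise SystemExit(0)"], delay_seconds=0))
```

In this post the command would be ["gitlab-ctl", "pg-upgrade"]. A blind retry is not always safe, though: as seen in the first attempt, a failed run can leave a non-empty data.9.6.3 directory behind, and initdb then refuses to continue until it is cleaned up.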

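While googling this error I came across a command, gitlab-rake db:migrate. Wasn't this exactly the command I needed?

I pressed on and eventually reached the same point as before: upgrade finished, tables reported missing. I ran the migrate, it completed without trouble, and after restarting all GitLab services I had finally brought pg up to 9.6.3.

[Image]

But then I noticed that something seemed to be missing.

[Image]

My heart sank. I was the only user left, the projects were all gone, and only the very first group remained. Had I lost the data? This was serious now, a real-life "GitLab: From Getting Started to Deleting the Repositories".

On second thought, though, git is distributed. Even in the worst case I would only need to ask everyone to push their projects again, so nothing much would be lost. Still, that would hardly be elegant. After confirming that the projects were all still sitting on disk, I decided to calm down and work out what had actually gone wrong.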

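First I looked up the documentation for gitlab-rake db:migrate and confirmed that this command does not reinitialize data, so the problem had to be somewhere else.

Once I had calmed down, it occurred to me: if the path /var/opt/gitlab/postgresql/analyze_new_cluster.sh is hardcoded, maybe the location of the database being migrated is hardcoded too. When I relocated the database, I was probably still more or less the only user, which would explain why the upgraded database looked like a snapshot from that time. If that was the case, there was a way out.

Third attempt

After rolling pg back to 9.2, I first moved the standalone database data directory back to the original location and ran the whole upgrade again. When it finished, I copied the upgraded data directory out over the standalone database directory and ran the migrate step again. At last, the familiar projects were all back.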

[Image]

Embracing 10.0.0

dnf upgrade gitlab-ce finally ran without any obstacles.

[Image]