avatar
@bangbang93

修复gitlab升级后,后台迁移任务一直卡在0% active状态不运行

3/23/2025, 1:18:30 AM

我自托管的gitlab一直运行在16.10.0版本,这两天心血来潮想要升级,于是熟练得打开gitlab升级路线查询
https://gitlab-com.gitlab.io/support/toolbox/upgrade-path/?current=16.10.0
发现需要经历6次升级,那只好慢慢升了。

结果在升级到16.11.10之后,发现有好几个后台迁移任务,我知道这玩意执行很慢,所以就打算先睡觉,等第二天一早任务执行完后继续升级。结果第二天早上一看,这几个后台迁移任务居然一个都没有少。
当时的查询方向是后台迁移任务卡主,所以大约按照"gitlab background migration stack 0%",进行了一圈google,按照官方文档https://docs.gitlab.com/update/background_migrations/?tab=From+the+database#check-for-pending-background-migrations的内容,手动挨个执行了升级脚本,没有报错全部都能执行完毕。

这时开始升级第二个版本,升级到17.1.8。此时安装时报错了,给出的错误是

Error executing action run on resource 'bash_hide_env[migrate gitlab-rails database]'
Mixlib::ShellOut::ShellCommandFailed
Expected process to exit with [0], but received '1'
---- Begin output of "bash"  ----
STDOUT: rake aborted!
StandardError: An error has occurred, all later migrations canceled:

按照这个错误,我google到了gitlab官方的这个issue https://gitlab.com/gitlab-org/gitlab/-/issues/525265,但是issue里执行报错的migration和我并不是同一个,但是值得参考。
我报错的文件是https://gitlab.com/gitlab-org/gitlab/-/blob/adacc97ecff50e1dab70c16acd784a12dc7cdbe0/db/migrate/20240421143211_remove_container_registry_migration_workers.rb,所以按照issue里的指令,替换job class执行

WORKER_CLASSES = [
      container_registry_migration_guard_worker
      container_registry_migration_observer_worker
      container_registry_migration_enqueuer_worker
    ]
.freeze

WORKER_CLASSES.each do |wc|
  Gitlab::SidekiqSharding::Validator.allow_unrouted_sidekiq_calls do
    Sidekiq.redis do |conn|
      puts conn.hgetall("cron_job:#{wc}")
    end
  end
end

确实看到了这三个定时任务,并且也如issue所描述的其last_enqueue_time是"2025-02-29 16:00:08 +0000"这么一个并不存在的时间
再按照issue挨个执行

Gitlab::SidekiqSharding::Validator.allow_unrouted_sidekiq_calls do
  Sidekiq.redis do |conn|
    conn.hdel("cron_job:ensure_merge_requests_prepared_worker", "last_enqueue_time")
  end
end

后升级完成。

但是这时又有10个后台迁移任务在等待执行,并且等了好几个小时之后所有任务都还是0%的进度,但是这次并不急着睡觉,所以打算研究一下。

在后台瞎点的时候我发现,只要点进后台sidekiq的cron页面,就会直接500报错
1742693550256.png
并且此时查看sidekiq日志有报错信息

invalid date
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:631:in `parse'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:631:in `rescue in parse_enqueue_time'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:628:in `parse_enqueue_time'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:290:in `initialize'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:228:in `new'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:228:in `block in all'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:226:in `collect'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:226:in `all'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:534:in `destroy_removed_jobs'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:177:in `load_from_hash!'
/opt/gitlab/embedded/service/gitlab-rails/config/initializers/sidekiq.rb:9:in `load_cron_jobs!'
/opt/gitlab/embedded/service/gitlab-rails/config/initializers/sidekiq.rb:118:in `block in <top (required)>'
/opt/gitlab/embedded/service/gitlab-rails/vendor/gems/sidekiq-7.1.6/lib/sidekiq.rb:98:in `configure_server'
/opt/gitlab/embedded/service/gitlab-rails/config/initializers/sidekiq.rb:44:in `<top (required)>'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/engine.rb:667:in `load'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/engine.rb:667:in `block in load_config_initializer'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/activesupport-7.0.8.4/lib/active_support/notifications.rb:208:in `instrument'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/engine.rb:666:in `load_config_initializer'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/engine.rb:620:in `block (2 levels) in <class:Engine>'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/engine.rb:619:in `each'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/engine.rb:619:in `block in <class:Engine>'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/initializable.rb:32:in `instance_exec'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/initializable.rb:32:in `run'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/initializable.rb:61:in `block in run_initializers'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:228:in `block in tsort_each'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:350:in `block (2 levels) in each_strongly_connected_component'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:422:in `block (2 levels) in each_strongly_connected_component_from'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:431:in `each_strongly_connected_component_from'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:421:in `block in each_strongly_connected_component_from'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/initializable.rb:50:in `each'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/initializable.rb:50:in `tsort_each_child'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:415:in `call'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:415:in `each_strongly_connected_component_from'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:349:in `block in each_strongly_connected_component'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:347:in `each'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:347:in `call'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:347:in `each_strongly_connected_component'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:226:in `tsort_each'
/opt/gitlab/embedded/lib/ruby/3.1.0/tsort.rb:205:in `tsort_each'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/initializable.rb:60:in `run_initializers'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/application.rb:372:in `initialize!'
/opt/gitlab/embedded/service/gitlab-rails/config/environment.rb:7:in `<top (required)>'
<internal:/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
<internal:/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/rubygems/core_ext/kernel_require.rb>:37:in `require'
/opt/gitlab/embedded/service/gitlab-rails/vendor/gems/sidekiq-7.1.6/lib/sidekiq/cli.rb:303:in `boot_application'
/opt/gitlab/embedded/service/gitlab-rails/vendor/gems/sidekiq-7.1.6/lib/sidekiq/cli.rb:42:in `run'
/opt/gitlab/embedded/service/gitlab-rails/vendor/gems/sidekiq-7.1.6/bin/sidekiq:31:in `<top (required)>'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/bin/sidekiq:25:in `load'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/bin/sidekiq:25:in `<top (required)>'
/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/bundler/cli/exec.rb:58:in `load'
/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/bundler/cli/exec.rb:58:in `kernel_load'
/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/bundler/cli/exec.rb:23:in `run'
/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/bundler/cli.rb:455:in `exec'
/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/bundler/vendor/thor/lib/thor/command.rb:28:in `run'
/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/bundler/vendor/thor/lib/thor.rb:527:in `dispatch'
/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/bundler/cli.rb:35:in `dispatch'
/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/bundler/vendor/thor/lib/thor/base.rb:584:in `start'
/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/bundler/cli.rb:29:in `start'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/bundler-2.5.11/exe/bundle:28:in `block in <top (required)>'
/opt/gitlab/embedded/lib/ruby/site_ruby/3.1.0/bundler/friendly_errors.rb:117:in `with_friendly_errors'
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/bundler-2.5.11/exe/bundle:20:in `<top (required)>'
/opt/gitlab/embedded/bin/bundle:25:in `load'
/opt/gitlab/embedded/bin/bundle:25:in `<main>'
{"severity":"INFO","time":"2025-03-22T16:18:52.348Z","message":"stopped","memwd_reason":"background task stopped","memwd_handler_class":"Gitlab::Memory::Watchdog::Handlers::SidekiqHandler","memwd_sleep_time_s":3,"pid":1680601,"worker_id":"sidekiq_0","memwd_rss_bytes":428371968,"retry":0}
{"severity":"INFO","time":"2025-03-22T16:18:57.392Z","message":"A worker terminated, shutting down the cluster"}

还是熟悉的错误,直觉告诉我应该是sidekiq的cron里还有2025年2月29日的日期。但是我并没有使用过sidekiq,所以对其原理并不是很了解,只好先去翻翻gitlab的文档。在其中找到了Troubshoting Sidekiq的说明https://docs.gitlab.com/administration/sidekiq/sidekiq_troubleshooting/#manually-trigger-a-cron-job 于是摸到gitlab-rake console,执行Sidekiq::Cron::Job.all,但是还是会报错

/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:631:in `parse': invalid date (Date::Error)
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:631:in `rescue in parse_enqueue_time'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:628:in `parse_enqueue_time'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:290:in `initialize'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:228:in `new'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:228:in `block in all'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:226:in `collect'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:226:in `all'
        from (irb):2:in `<main>'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/commands/console/console_command.rb:74:in `start'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/commands/console/console_command.rb:19:in `start'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/commands/console/console_command.rb:106:in `perform'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/thor-1.3.1/lib/thor/command.rb:28:in `run'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/thor-1.3.1/lib/thor/invocation.rb:127:in `invoke_command'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/thor-1.3.1/lib/thor.rb:527:in `dispatch'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/command/base.rb:87:in `perform'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/command.rb:48:in `invoke'
        ... 4 levels...
/opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:629:in `strptime': invalid date (Date::Error)
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:629:in `parse_enqueue_time'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:290:in `initialize'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:228:in `new'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:228:in `block in all'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:226:in `collect'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/sidekiq-cron-1.12.0/lib/sidekiq/cron/job.rb:226:in `all'
        from (irb):2:in `<main>'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/commands/console/console_command.rb:74:in `start'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/commands/console/console_command.rb:19:in `start'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/commands/console/console_command.rb:106:in `perform'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/thor-1.3.1/lib/thor/command.rb:28:in `run'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/thor-1.3.1/lib/thor/invocation.rb:127:in `invoke_command'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/thor-1.3.1/lib/thor.rb:527:in `dispatch'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/command/base.rb:87:in `perform'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/command.rb:48:in `invoke'
        from /opt/gitlab/embedded/lib/ruby/gems/3.1.0/gems/railties-7.0.8.4/lib/rails/commands.rb:18:in `<top (required)>'
        ... 3 levels...

看来没法通过程序解决了,似乎sidekiq无论如何都会尝试去解析这个并不存在的时间然后报错。只能从存储入手了。
简单阅读了一下gitlab和sidekiq的源码,发现这里的cron是存储在redis里的,所以可以使用gitlab-redis-cli命令摸到gitlab的redis里,然后在redis离手动scan cron_job:*,找到了hash类型的key
scan 0 cron_job:* type hash
1742693526004.png
果然被我找到了好几个last_enqueue_time是2月29日的记录
于是用scan找到所有hash类型的key,其实就是cron_job:开头,并且没有其他suffix结尾的,挨个执行hdel key last_enqueue_time
大概在删了20个之后是在受不了手动处理了,感觉没完没了。所以掏出github copilot,让它给gitlab-rails console写了一个脚本

# This script is intended to be run in the GitLab Rails console
# It assumes that the Redis connection is already configured and available

# Initialize cursor for SCAN
cursor = "0"
deleted_count = 0

begin
  # Fetch a batch of keys with the prefix 'cron_job:' and type 'hash'
  cursor, keys = Gitlab::Redis::SharedState.with { |redis| redis.scan(cursor, match: 'cron_job:*', count: 100, type: 'hash') }

  keys.each do |key|
    # Fetch the 'last_enqueue_time' attribute
    last_enqueue_time = Gitlab::Redis::SharedState.with { |redis| redis.hget(key, 'last_enqueue_time') }
    
    # Check if 'last_enqueue_time' starts with '2025-02-29'
    if last_enqueue_time && last_enqueue_time.start_with?('2025-02-29')
      # Log the key and attribute for dry run
      puts "Would delete 'last_enqueue_time' from key: #{key} with value: #{last_enqueue_time}"
      # Uncomment the next line to actually delete the attribute
      # Gitlab::Redis::SharedState.with { |redis| redis.hdel(key, 'last_enqueue_time') }
      deleted_count += 1
    # else
    #   # Log the key and attribute for dry run
    #   puts "Skipping key: #{key} with 'last_enqueue_time': #{last_enqueue_time}"
    end
  end
end while cursor != "0"

puts "Total keys deleted: #{deleted_count}"

最终输出了Total keys deleted: 40,看上去还有40个key需要处理。
执行完成后,终于能打开sidekiq的cron界面了,并且稍等了几分钟后发现后台迁移任务也开始执行了
1742693519515.png