服务器异常关机导致的GP集群非正常停止如何启动?

有时候我们会遇到一些特殊情况,并不是由于集群故障而导致Greenplum不可用。比如机房断电导致服务器异常关机。 此时正常情况下,如果没有更换硬盘,理论上机器加电后是可以直接启动的,但是需要先将GP记录启动状态的一些文件清理掉。这些文件通常包括如下几类:
  • .s.PGSQL.xxxx – GP启动时开启的端口文件,是位于/tmp/目录下的隐藏文件
  • postmaster.pid – 记录GP启动进程等相关信息的文件,位于每个实例的数据主目录下,例如:/home/gpadmin1/GPData/mirror/gpsne0

文件清理操作

如果启动服务器后直接执行启动数据库操作,通常会有如下报错信息:
[gpadmin1@centos7 ~]$ gpstart -a
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[INFO]:-Starting gpstart with args: -a
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[INFO]:-Gathering information and validating the environment...
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 6.3.0 build commit:77aa1b6e4486adbaede9f5f2864a04fc3a512e93'
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[INFO]:-Greenplum Catalog Version: '301908232'
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[WARNING]:-postmaster.pid file exists on Master, checking if recovery startup required
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[INFO]:-Commencing recovery startup checks
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[INFO]:-Have lock file /tmp/.s.PGSQL.5432 but no process running on port 5432
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[INFO]:-No Master instance process, entering recovery startup mode
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[INFO]:-Clearing Master instance lock files
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[INFO]:-Clearing Master instance pid file
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[INFO]:-Starting Master instance in admin mode
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[CRITICAL]:-Failed to start Master instance in admin mode
20200323:07:22:32:004099 gpstart:centos7:gpadmin1-[CRITICAL]:-Error occurred: non-zero rc: 1
 Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /home/gpadmin1/GPData/master/gpsne-1 -l /home/gpadmin1/GPData/master/gpsne-1/pg_log/startup.log -w -t 600 -o " -p 5432 -c gp_role=utility " start'
rc=1, stdout='waiting for server to start.... stopped waiting
', stderr='pg_ctl: could not start server
Examine the log output.
'
[gpadmin1@centos7 ~]$ ps -ef | grep postgres
gpadmin1  4263  4048  0 07:22 pts/0    00:00:00 grep --color=auto postgres
上面日志的WARNING中已经提示了[WARNING]:-postmaster.pid file exists on Master, checking if recovery startup required。此时我们去执行一下文件清理操作再启动数据库即可,通常如下:
rm -rf /tmp/.s.PGSQL*
rm -rf /home/gpadmin1/GPData/mirror/gpsne0/postmaster.pid

# 如果机器节点很多,可以通过gpssh批量执行
[gpadmin1@centos7 ~]$ gpssh -f hostfile_singlenode -e "rm -rf /tmp/.s.PGSQL*"
[gpadmin1@centos7 ~]$ gpssh -f hostfile_singlenode -e "rm -rf /home/gpadmin1/GPData/*/gp*/postmaster.pid"

正常启动数据库

[gpadmin1@centos7 ~]$ gpstart -a
20200323:07:51:18:011881 gpstart:centos7:gpadmin1-[INFO]:-Starting gpstart with args: -a
20200323:07:51:18:011881 gpstart:centos7:gpadmin1-[INFO]:-Gathering information and validating the environment...
20200323:07:51:18:011881 gpstart:centos7:gpadmin1-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 6.3.0 build commit:77aa1b6e4486adbaede9f5f2864a04fc3a512e93'
20200323:07:51:18:011881 gpstart:centos7:gpadmin1-[INFO]:-Greenplum Catalog Version: '301908232'
20200323:07:51:18:011881 gpstart:centos7:gpadmin1-[INFO]:-Starting Master instance in admin mode
20200323:07:51:18:011881 gpstart:centos7:gpadmin1-[INFO]:-Obtaining Greenplum Master catalog information
20200323:07:51:18:011881 gpstart:centos7:gpadmin1-[INFO]:-Obtaining Segment details from master...
20200323:07:51:18:011881 gpstart:centos7:gpadmin1-[INFO]:-Setting new master era
20200323:07:51:18:011881 gpstart:centos7:gpadmin1-[INFO]:-Master Started...
20200323:07:51:18:011881 gpstart:centos7:gpadmin1-[INFO]:-Shutting down master
20200323:07:51:18:011881 gpstart:centos7:gpadmin1-[WARNING]:-Skipping startup of segment marked down in configuration: on centos-7 directory /home/gpadmin1/GPData/primary/gpsne0 <<<<<
20200323:07:51:18:011881 gpstart:centos7:gpadmin1-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[INFO]:-Process results...
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[INFO]:-----------------------------------------------------
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[INFO]:-   Successful segment starts                                            = 1
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[INFO]:-   Failed segment starts                                                = 0
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[WARNING]:-Skipped segment starts (segments are marked down in configuration)   = 1   <<<<<<<<
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[INFO]:-----------------------------------------------------
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[INFO]:-Successfully started 1 of 1 segment instances, skipped 1 other segments
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[INFO]:-----------------------------------------------------
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[WARNING]:-****************************************************************************
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[WARNING]:-There are 1 segment(s) marked down in the database
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[WARNING]:-To recover from this current state, review usage of the gprecoverseg
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[WARNING]:-management utility which will recover failed segment instance databases.
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[WARNING]:-****************************************************************************
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[INFO]:-Starting Master instance centos-7 directory /home/gpadmin1/GPData/master/gpsne-1
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[INFO]:-Command pg_ctl reports Master centos-7 instance active
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[INFO]:-Connecting to dbname='template1' connect_timeout=15
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[INFO]:-No standby master configured.  skipping...
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[WARNING]:-Number of segments not attempted to start: 1
20200323:07:51:19:011881 gpstart:centos7:gpadmin1-[INFO]:-Check status of database with gpstate utility
[gpadmin1@centos7 ~]$ psql
此时用psql去访问数据库是可以正常访问的。但是细心的朋友可能会发现,我上面的镜像节点稍微有点问题,可以执行一下镜像恢复操作第一时间解决该问题:
[gpadmin1@centos7 ~]$ gprecoverseg
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Starting gprecoverseg with args:
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.3.0 build commit:77aa1b6e4486adbaede9f5f2864a04fc3a512e93'
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.3.0 build commit:77aa1b6e4486adbaede9f5f2864a04fc3a512e93) on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jan  9 2020 23:10:47'
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Obtaining Segment details from master...
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Heap checksum setting is consistent between master and the segments that are candidates for recoverseg
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Greenplum instance recovery parameters
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:----------------------------------------------------------
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Recovery type              = Standard
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:----------------------------------------------------------
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Recovery 1 of 1
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:----------------------------------------------------------
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-   Synchronization mode                 = Incremental
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-   Failed instance host                 = centos-7
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-   Failed instance address              = centos-7
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-   Failed instance directory            = /home/gpadmin1/GPData/primary/gpsne0
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-   Failed instance port                 = 6000
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-   Recovery Source instance host        = centos-7
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-   Recovery Source instance address     = centos-7
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-   Recovery Source instance directory   = /home/gpadmin1/GPData/mirror/gpsne0
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-   Recovery Source instance port        = 7000
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-   Recovery Target                      = in-place
20200323:07:51:29:012043 gprecoverseg:centos7:gpadmin1-[INFO]:----------------------------------------------------------

Continue with segment recovery procedure Yy|Nn (default=N):
> y
20200323:07:51:31:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-1 segment(s) to recover
20200323:07:51:31:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Ensuring 1 failed segment(s) are stopped
20200323:07:51:32:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Ensuring that shared memory is cleaned up for stopped segments
20200323:07:51:32:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Updating configuration with new mirrors
20200323:07:51:32:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Updating mirrors
20200323:07:51:32:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Running pg_rewind on required mirrors
20200323:07:51:33:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Starting mirrors
20200323:07:51:33:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-era is a895efa805c1e937_200323075118
20200323:07:51:33:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Commencing parallel segment instance startup, please wait...
20200323:07:51:34:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Process results...
20200323:07:51:34:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Triggering FTS probe
20200323:07:51:34:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-******************************************************************
20200323:07:51:34:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Updating segments for streaming is completed.
20200323:07:51:34:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-For segments updated successfully, streaming will continue in the background.
20200323:07:51:34:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-Use  gpstate -s  to check the streaming progress.
20200323:07:51:34:012043 gprecoverseg:centos7:gpadmin1-[INFO]:-******************************************************************
至此集群启动结束~

关注微信公众号

VMware 中国研发中心