Analysis of the culprit behind a Hive memory leak
On Tuesday this week, around 6:30 PM, we received SMS and email alerts: the heap of the Hive metastore was climbing steadily.
The first SMS arrived around 18:20. We immediately had ops restart the affected node and, since that afternoon was busy, didn't look into it any further.
A while later the other metastore node started alerting as well. Now it was time to take this seriously, so below is a walkthrough of our analysis.
Heap analysis
First we checked the heap graphs in the spoor monitoring system: the metastore's sustained heap growth began at around 18:00.
I then asked ops to take an object histogram with jmap:
num #instances #bytes class name
----------------------------------------------
1: 432489 37733856 [C
2: 282981 17037232 [B
3: 79583 10426296 [I
4: 234208 7494656 java.util.concurrent.ConcurrentHashMap$Node
5: 225221 7207072 java.util.Hashtable$Entry
6: 285220 6845280 java.lang.String
7: 96252 4951304 [Ljava.lang.Object;
8: 138504 4432128 java.util.HashMap$Node
9: 33265 3968536 [Ljava.util.HashMap$Node;
10: 623 3004352 [Ljava.util.concurrent.ConcurrentHashMap$Node;
11: 56040 2689920 java.util.HashMap
12: 600 2207280 [Ljava.util.Hashtable$Entry;
13: 85342 2048208 java.util.ArrayList
14: 12919 1343576 org.datanucleus.state.ReferentialJDOStateManager
15: 7273 1280048 com.mysql.jdbc.JDBC4ResultSet
16: 31544 1261760 java.util.LinkedHashMap$Entry
17: 15085 1208056 [S
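As a side note, a histogram like the one above can be ranked programmatically. The sketch below is not part of the original investigation; it assumes the standard `jmap -histo` column layout (`num: #instances #bytes class`), and the function name is ours:

```python
def top_classes(histo_text, n=3):
    """Parse `jmap -histo` output and return the n biggest classes by bytes.

    Assumes the standard 'num: #instances #bytes class name' row layout;
    header and separator lines are skipped.
    """
    rows = []
    for line in histo_text.splitlines():
        parts = line.split()
        # data rows look like: "1: 432489 37733856 [C"
        if len(parts) >= 4 and parts[0].rstrip(":").isdigit():
            instances, nbytes, cls = int(parts[1]), int(parts[2]), parts[3]
            rows.append((nbytes, instances, cls))
    # biggest byte consumers first
    return sorted(rows, reverse=True)[:n]
```

Fed the histogram above, this would put `[C` (char arrays) at the top, matching what we saw.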
The biggest consumer is char[] ([C). I asked ops whether a heap dump was feasible; it turned out it was, so we opened the dump in MAT for a closer look. The thread overview surfaced stacks like the following:
"pool-3-thread-126" #1641 prio=5 os_prio=0 tid=0x00007f4a553dd800 nid=0x23310 runnable [0x00007f4a1e820000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at com.mysql.jdbc.util.ReadAheadInputStream.fill(ReadAheadInputStream.java:114)
at com.mysql.jdbc.util.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:161)
at com.mysql.jdbc.util.ReadAheadInputStream.read(ReadAheadInputStream.java:189)
- locked <0x00000006d45e8980> (a com.mysql.jdbc.util.ReadAheadInputStream)
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3036)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3489)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3478)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4019)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2490)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2651)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2734)
- locked <0x00000006c94c02b0> (a com.mysql.jdbc.JDBC4Connection)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155)
- locked <0x00000006c94c02b0> (a com.mysql.jdbc.JDBC4Connection)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2322)
- locked <0x00000006c94c02b0> (a com.mysql.jdbc.JDBC4Connection)
at com.jolbox.bonecp.PreparedStatementHandle.executeQuery(PreparedStatementHandle.java:174)
at org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeQuery(ParamLoggingPreparedStatement.java:381)
at org.datanucleus.store.rdbms.SQLController.executeStatementQuery(SQLController.java:504)
at org.datanucleus.store.rdbms.scostore.JoinListStore.listIterator(JoinListStore.java:738)
at org.datanucleus.store.rdbms.scostore.AbstractListStore.listIterator(AbstractListStore.java:92)
at org.datanucleus.store.rdbms.scostore.AbstractListStore.iterator(AbstractListStore.java:82)
at org.datanucleus.store.types.backed.List.loadFromStore(List.java:298)
at org.datanucleus.store.types.backed.List.toArray(List.java:615)
- locked <0x00000006ba0d1ff0> (a org.datanucleus.store.types.backed.List)
at java.util.ArrayList.<init>(ArrayList.java:177)
at com.google.common.collect.Lists.newArrayList(Lists.java:119)
at org.apache.hadoop.hive.metastore.ObjectStore.convertList(ObjectStore.java:1047)
at org.apache.hadoop.hive.metastore.ObjectStore.convertToPart(ObjectStore.java:1593)
at org.apache.hadoop.hive.metastore.ObjectStore.listPartitionsPsWithAuth(ObjectStore.java:1951)
at sun.reflect.GeneratedMethodAccessor58.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:114)
at com.sun.proxy.$Proxy2.listPartitionsPsWithAuth(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_ps_with_auth(HiveMetaStore.java:3870)
at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
at com.sun.proxy.$Proxy4.get_partitions_ps_with_auth(Unknown Source)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partitions_ps_with_auth.getResult(ThriftHiveMetastore.java:10114)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partitions_ps_with_auth.getResult(ThriftHiveMetastore.java:10098)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hadoop.hive.metastore.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:48)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
From the thread overview we saw that the top consumers were all related to Hive Partition handling. My first instinct was that someone had queried a table with a huge number of partitions, so the most important thing now was to find out who had queried which heavily partitioned table.
MySQL slow log analysis
A query pulling that many partitions ought to leave traces in MySQL's slow log. Since the monitoring dashboard showed the sustained growth starting around 18:00, we asked the DBA to pull the slow log for that window.
The Rows_sent column led us to the specific table.
In hindsight, having already seen get_partitions_ps_with_auth in the stack, analyzing the MySQL slow log wasn't strictly necessary.
Metastore log analysis
The problem now became a bit simpler. We searched the metastore logs around that time for operations on the table and found a get_partitions_ps_with_auth call whose source IP pointed to node 159.
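Filtering those logs for the call and its caller IP can be sketched like this. The `ugi=... ip=... cmd=...` layout is an assumption about the metastore audit-log format (adjust the regex to your deployment), and `callers_of` is a name of ours:

```python
import re

# assumed audit-log layout: "... ugi=hive ip=/10.x.y.z cmd=get_partitions... : db=... tbl=..."
AUDIT = re.compile(r"ip=(?P<ip>[\d./]+)\s+cmd=(?P<cmd>\w+)")

def callers_of(log_lines, cmd="get_partitions_ps_with_auth"):
    """Return the set of source IPs that issued the given metastore command."""
    ips = set()
    for line in log_lines:
        m = AUDIT.search(line)
        if m and m.group("cmd") == cmd:
            # the audit log often prefixes the IP with '/', strip it
            ips.add(m.group("ip").lstrip("/"))
    return ips
```

In our case this kind of filter would collapse the whole log window down to a single IP, the 159 node.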
Tracing the caller
People always suspect themselves first, so we began by checking xql (whose driver IP is 159) for accesses to that table around that time. There were indeed reads of the table, but eyeballing them turned up nothing suspicious: they all appeared to carry partition predicates. The trail seemed to have gone cold.
But the Hive metastore logs don't lie. We logged onto host 159, listed the containers running at that time, mapped each container ID to its YARN application name, and attacked the problem again.
Analyzing the business behind those application names pointed the finger back at xql, so we went over the statements from that window once more. Combined with their SQL error messages, we found the offending statement: the very one we had skipped earlier after seeing that it "carried a partition predicate", without reading it closely.
Here is a simplified version of the user's SQL:
select ....
from dw.base_event_hour
where date = '20190831'
and hour between '00' and '01'
and props.id in ('...') or (... ) or ( ... and ...)
The partition columns do appear in the WHERE clause, but the trailing OR conditions are not wrapped in parentheses. Since AND binds more tightly than OR, the predicate parses as (date AND hour AND props.id ...) OR (...) OR (...), so the OR branches carry no partition filter at all and force a scan over every partition.
So yes, in the end it was the classic mistake of writing OR without adding parentheses.
Along the way we also found a bug in our analyzer.
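To make the precedence issue concrete, here is a small sketch of the predicate as Python booleans (the row fields and the `cond2`/`cond3` stand-ins are illustrative, not the user's real columns):

```python
def where_as_written(row):
    # Mirrors the user's SQL: AND binds tighter than OR, so this parses as
    # (date AND hour AND id-filter) OR cond2 OR cond3 -- the cond2/cond3
    # branches are not restricted to the partition at all.
    return (row["date"] == "20190831"
            and "00" <= row["hour"] <= "01"
            and row["id"] in ("a",)) or row["cond2"] or row["cond3"]

def where_intended(row):
    # With parentheses, every branch stays inside the partition filter.
    return (row["date"] == "20190831"
            and "00" <= row["hour"] <= "01"
            and (row["id"] in ("a",) or row["cond2"] or row["cond3"]))

# a row from a completely different partition
old_row = {"date": "20180101", "hour": "23", "id": "x",
           "cond2": True, "cond3": False}
```

For `old_row`, `where_as_written` returns True while `where_intended` returns False: the unparenthesized form matches rows from every partition, which is exactly why the metastore was asked to list them all.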