Analysis of the culprit behind a Hive memory leak
On Tuesday this week, around 6:30 PM, we received SMS and email alerts: the heap of the Hive metastore was climbing steadily.
The first SMS arrived around 18:20. We immediately had ops restart the affected node and, since that afternoon was busy, didn't look into it any further.
A while later the other metastore node started alerting as well. Now it was time to take this seriously, so below is a walkthrough of our analysis.
Heap analysis
First we checked the heap graphs in the spoor monitoring system: the metastore's sustained heap growth began at around 18:00.
I then asked ops to take an object histogram with jmap:
num #instances #bytes class name
----------------------------------------------
1: 432489 37733856 [C
2: 282981 17037232 [B
3: 79583 10426296 [I
4: 234208 7494656 java.util.concurrent.ConcurrentHashMap$Node
5: 225221 7207072 java.util.Hashtable$Entry
6: 285220 6845280 java.lang.String
7: 96252 4951304 [Ljava.lang.Object;
8: 138504 4432128 java.util.HashMap$Node
9: 33265 3968536 [Ljava.util.HashMap$Node;
10: 623 3004352 [Ljava.util.concurrent.ConcurrentHashMap$Node;
11: 56040 2689920 java.util.HashMap
12: 600 2207280 [Ljava.util.Hashtable$Entry;
13: 85342 2048208 java.util.ArrayList
14: 12919 1343576 org.datanucleus.state.ReferentialJDOStateManager
15: 7273 1280048 com.mysql.jdbc.JDBC4ResultSet
16: 31544 1261760 java.util.LinkedHashMap$Entry
17: 15085 1208056 [S
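As a side note, a histogram like the one above can be ranked programmatically. The sketch below is not part of the original investigation; it assumes the standard `jmap -histo` column layout (`num: #instances #bytes class`), and the function name is ours:

```python
def top_classes(histo_text, n=3):
    """Parse `jmap -histo` output and return the n biggest classes by bytes.

    Assumes the standard 'num: #instances #bytes class name' row layout;
    header and separator lines are skipped.
    """
    rows = []
    for line in histo_text.splitlines():
        parts = line.split()
        # data rows look like: "1: 432489 37733856 [C"
        if len(parts) >= 4 and parts[0].rstrip(":").isdigit():
            instances, nbytes, cls = int(parts[1]), int(parts[2]), parts[3]
            rows.append((nbytes, instances, cls))
    # biggest byte consumers first
    return sorted(rows, reverse=True)[:n]
```

Fed the histogram above, this would put `[C` (char arrays) at the top, matching what we saw.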
The biggest consumer is char[] ([C). I asked ops whether a heap dump was feasible; it turned out it was, so we opened the dump in MAT for a closer look. The thread overview surfaced stacks like the following:
"pool-3-thread-126" #1641 prio=5 os_prio=0 tid=0x00007f4a553dd800 nid=0x23310 runnable [0x00007f4a1e820000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at com.mysql.jdbc.util.ReadAheadInputStream.fill(ReadAheadInputStream.java:114)
at com.mysql.jdbc.util.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:161)
at com.mysql.jdbc.util.ReadAheadInputStream.read(ReadAheadInputStream.java:189)
- locked <0x00000006d45e8980> (a com.mysql.jdbc.util.ReadAheadInputStream)
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3036)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3489)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3478)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4019)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2490)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2651)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2734)
- locked <0x00000006c94c02b0> (a com.mysql.jdbc.JDBC4Connection)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155)
- locked <0x00000006c94c02b0> (a com.mysql.jdbc.JDBC4Connection)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2322)
- locked <0x00000006c94c02b0> (a com.mysql.jdbc.JDBC4Connection)
at com.jolbox.bonecp.PreparedStatementHandle.executeQuery(PreparedStatementHandle.java:174)
at org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeQuery(ParamLoggingPreparedStatement.java:381)
at org.datanucleus.store.rdbms.SQLController.executeStatementQuery(SQLController.java:504)
at org.datanucleus.store.rdbms.scostore.JoinListStore.listIterator(JoinListStore.java:738)
at org.datanucleus.store.rdbms.scostore.AbstractListStore.listIterator(AbstractListStore.java:92)
at org.datanucleus.store.rdbms.scostore.AbstractListStore.iterator(AbstractListStore.java:82)
at org.datanucleus.store.types.backed.List.loadFromStore(List.java:298)
at org.datanucleus.store.types.backed.List.toArray(List.java:615)
- locked <0x00000006ba0d1ff0> (a org.datanucleus.store.types.backed.List)
at java.util.ArrayList.<init>(ArrayList.java:177)
at com.google.common.collect.Lists.newArrayList(Lists.java:119)
at org.apache.hadoop.hive.metastore.ObjectStore.convertList(ObjectStore.java:1047)
at org.apache.hadoop.hive.metastore.ObjectStore.convertToPart(ObjectStore.java:1593)
at org.apache.hadoop.hive.metastore.ObjectStore.listPartitionsPsWithAuth(ObjectStore.java:1951)
at sun.reflect.GeneratedMethodAccessor58.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:114)
at com.sun.proxy.$Proxy2.listPartitionsPsWithAuth(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_ps_with_auth(HiveMetaStore.java:3870)
at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
at com.sun.proxy.$Proxy4.get_partitions_ps_with_auth(Unknown Source)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partitions_ps_with_auth.getResult(ThriftHiveMetastore.java:10114)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partitions_ps_with_auth.getResult(ThriftHiveMetastore.java:10098)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hadoop.hive.metastore.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:48)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
From the thread overview we saw that the top consumers were all related to Hive Partition handling. My first instinct was that someone had queried a table with a huge number of partitions, so the most important thing now was to find out who had queried which heavily partitioned table.
MySQL slow log analysis
A query pulling that many partitions ought to leave traces in MySQL's slow log. Since the monitoring dashboard showed the sustained growth starting around 18:00, we asked the DBA to pull the slow log for that window.
The Rows_sent column led us to the specific table.
In hindsight, having already seen get_partitions_ps_with_auth in the stack, analyzing the MySQL slow log wasn't strictly necessary.
Metastore log analysis
The problem now became a bit simpler. We searched the metastore logs around that time for operations on the table and found a get_partitions_ps_with_auth call whose source IP pointed to node 159.
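Filtering those logs for the call and its caller IP can be sketched like this. The `ugi=... ip=... cmd=...` layout is an assumption about the metastore audit-log format (adjust the regex to your deployment), and `callers_of` is a name of ours:

```python
import re

# assumed audit-log layout: "... ugi=hive ip=/10.x.y.z cmd=get_partitions... : db=... tbl=..."
AUDIT = re.compile(r"ip=(?P<ip>[\d./]+)\s+cmd=(?P<cmd>\w+)")

def callers_of(log_lines, cmd="get_partitions_ps_with_auth"):
    """Return the set of source IPs that issued the given metastore command."""
    ips = set()
    for line in log_lines:
        m = AUDIT.search(line)
        if m and m.group("cmd") == cmd:
            # the audit log often prefixes the IP with '/', strip it
            ips.add(m.group("ip").lstrip("/"))
    return ips
```

In our case this kind of filter would collapse the whole log window down to a single IP, the 159 node.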
Tracing the caller
People always suspect themselves first, so we began by checking xql (whose driver IP is 159) for accesses to that table around that time. There were indeed reads of the table, but eyeballing them turned up nothing suspicious: they all appeared to carry partition predicates. The trail seemed to have gone cold.
But the Hive metastore logs don't lie. We logged onto host 159, listed the containers running at that time, mapped each container ID to its YARN application name, and attacked the problem again.
Analyzing the business behind those application names pointed the finger back at xql, so we went over the statements from that window once more. Combined with their SQL error messages, we found the offending statement: the very one we had skipped earlier after seeing that it "carried a partition predicate", without reading it closely.
Here is a simplified version of the user's SQL:
select ....
from dw.base_event_hour
where date = '20190831'
and hour between '00' and '01'
and props.id in ('...') or (... ) or ( ... and ...)
The partition columns do appear in the WHERE clause, but the trailing OR conditions are not wrapped in parentheses. Since AND binds more tightly than OR, the predicate parses as (date AND hour AND props.id ...) OR (...) OR (...), so the OR branches carry no partition filter at all and force a scan over every partition.
So yes, in the end it was the classic mistake of writing OR without adding parentheses.
Along the way we also found a bug in our analyzer.
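To make the precedence issue concrete, here is a small sketch of the predicate as Python booleans (the row fields and the `cond2`/`cond3` stand-ins are illustrative, not the user's real columns):

```python
def where_as_written(row):
    # Mirrors the user's SQL: AND binds tighter than OR, so this parses as
    # (date AND hour AND id-filter) OR cond2 OR cond3 -- the cond2/cond3
    # branches are not restricted to the partition at all.
    return (row["date"] == "20190831"
            and "00" <= row["hour"] <= "01"
            and row["id"] in ("a",)) or row["cond2"] or row["cond3"]

def where_intended(row):
    # With parentheses, every branch stays inside the partition filter.
    return (row["date"] == "20190831"
            and "00" <= row["hour"] <= "01"
            and (row["id"] in ("a",) or row["cond2"] or row["cond3"]))

# a row from a completely different partition
old_row = {"date": "20180101", "hour": "23", "id": "x",
           "cond2": True, "cond3": False}
```

For `old_row`, `where_as_written` returns True while `where_intended` returns False: the unparenthesized form matches rows from every partition, which is exactly why the metastore was asked to list them all.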