SQL上线引发的血案

网友投稿 999 2022-11-04

本站部分文章、图片属于网络上可搜索到的公开信息,均用于学习和交流用途,不能代表睿象云的观点、立场或意见。我们接受网民的监督,如发现任何违法内容或侵犯了您的权益,请第一时间联系小编邮箱jiasou666@gmail.com 处理。

SQL上线引发的血案

背景

排查过程

SELECT a.task_id, a.task_type, a.task_type_name, a.task_owner, a.task_owner_name, a.task_title, a.start_time, b.customer_account_id, b.customer_name, c.owner, c.owner_name, c.intention_level, c.oppty_status_category, c.oppty_status, c.last_follow_up_time, c.last_follow_up_status, c.last_follow_up_content, c.priority FROM saos_base.tsa_task a LEFT JOIN saos_tur.tur_customer b ON a.customer_Account = b.customer_account_id LEFT JOIN ( SELECT oo.customer_account_id, oo.OWNER, oo.owner_name, oo.intention_level, oo.oppty_status_category, oo.oppty_status, oo.last_follow_up_time, oo.last_follow_up_status, oo.last_follow_up_content, oo.priority FROM saos_tur.rb_opportunity oo WHERE oo.oppty_status_category = 10 ) c ON b.customer_account_id = c.customer_account_id WHERE a.biz_app = 'RB' AND a.task_owner = '5123' AND a.task_type = 7134 AND a.task_state = 11 AND a.start_time < '2020-09-10 00:00:00' ORDER BY c.priority desc, a.start_time ASC LIMIT 0, 10;

TopN和Limit下推规则

那么这个SQL还能再快嘛,8s业务肯定是无法接受的,我想到了用tiflash加速,然后我把这个SQL涉及到的表都开启了tiflash,explain 这个SQL发现查询竟然没有走tiflash,然后我手动设置了session 级别的set @@session.tidb_isolation_read_engines = "tidb,tiflash";把tikv禁止掉,再次explain 这个SQL都走tiflash了,手动执行时间从8s将到了500ms

既然走tiflash查询效率提升比较明显,那就让开发人员改下SQL,让这个查询强制走tiflash,比较尴尬的是我们按官方文档把SQL调整了还是不能走tiflash,强制走tiflash的SQL如下:

SELECT /*+ read_from_storage(tiflash[a,b]) */ a.task_id, a.task_type, a.task_type_name, a.task_owner, a.task_owner_name, a.task_title, a.start_time, b.customer_account_id, b.customer_name, c.owner, c.owner_name, c.intention_level, c.oppty_status_category, c.oppty_status, c.last_follow_up_time, c.last_follow_up_status, c.last_follow_up_content, c.priority FROM saos_base.tsa_task a LEFT JOIN saos_tur.tur_customer b ON a.customer_Account = b.customer_account_id LEFT JOIN ( SELECT oo.customer_account_id, oo.OWNER, oo.owner_name, oo.intention_level, oo.oppty_status_category, oo.oppty_status, oo.last_follow_up_time, oo.last_follow_up_status, oo.last_follow_up_content, oo.priority FROM saos_tur.rb_opportunity oo WHERE oo.oppty_status_category = 10 ) c ON b.customer_account_id = c.customer_account_id WHERE a.biz_app = 'RB' AND a.task_owner = '5123' AND a.task_type = 7134 AND a.task_state = 11 AND a.start_time < '2020-09-10 00:00:00' ORDER BY c.priority desc, a.start_time ASC LIMIT 0, 10;

tiflash的用法

上面问题反馈给群里PingCAP的技术支持小伙伴们然后继续排查,收集了表的健康度、表的统计信息、统计信息直方图,信息如下:

#表健康度MySQL [(none)]> SHOW STATS_HEALTHY where Table_name in ( 'tsa_task','tur_customer','rb_opportunity');+-----------+----------------+----------------+---------+| Db_name | Table_name | Partition_name | Healthy |+-----------+----------------+----------------+---------+| saos_base | tsa_task | | 51 || saos_tur | tur_customer | | 80 || saos_tur | rb_opportunity | | 72 |+-----------+----------------+----------------+---------+#表统计信息MySQL [(none)]> SHOW stats_meta where Table_name in ( 'tsa_task','tur_customer','rb_opportunity');+-----------+----------------+----------------+---------------------+--------------+-----------+| Db_name | Table_name | Partition_name | Update_time | Modify_count | Row_count |+-----------+----------------+----------------+---------------------+--------------+-----------+| saos_tur | tur_customer | | 2020-09-11 14:25:23 | 157738 | 795808 || saos_tur | rb_opportunity | | 2020-09-11 14:27:06 | 144563 | 517778 || saos_base | tsa_task | | 2020-09-11 14:27:51 | 1112555 | 2311207 |+-----------+----------------+----------------+---------------------+--------------+-----------+#统计信息直方图MySQL [(none)]> SHOW STATS_HISTOGRAMS where Table_name in ( 'tsa_task','tur_customer','rb_opportunity');+-----------+----------------+----------------+-------------------------------+----------+---------------------+----------------+------------+---------------------+-------------+| Db_name | Table_name | Partition_name | Column_name | Is_index | Update_time | Distinct_count | Null_count | Avg_col_size | Correlation |+-----------+----------------+----------------+-------------------------------+----------+---------------------+----------------+------------+---------------------+-------------+| saos_base | tsa_task | | biz_code | 0 | 2020-08-13 08:51:18 | 1510400 | 0 | 22.83 | 0.198177 || saos_base | tsa_task | | task_type | 0 | 2020-08-13 08:51:18 | 93 | 0 | 7.37 | -0.124592 || saos_base | tsa_task | | create_time | 0 | 2020-08-13 08:51:21 | 1309184 | 0 | +-----------+----------------+----------------+-------------------------------+----------+---------------------+----------------+------------+---------------------+-------------+65 rows in set (0.00 sec)

查看表的健康度信息查看表的总行数以及修改的行数查看统计信息直方图

通过上面信息发现这3个表的健康度都不高,官方建议健康度最好在80以上,所以我们先重新收集下表的统计信息,命令如下:

analyze table tsa_task;analyze table tur_customer;analyze table rb_opportunity;

analyze 用法

然后再次执行发现还是无法走tiflash,但是走tikv变快了,通过执行计划可以看到查询优化器用到了TopN和IndexMergeJoin这两个优化算子,执行计划如下:

explain analyze SELECT a.task_id, a.task_type, a.task_type_name, a.task_owner, a.task_owner_name, a.task_title, a.start_time, b.customer_account_id, b.customer_name, c.owner, c.owner_name, c.intention_level, c.oppty_status_category, c.oppty_status, c.last_follow_up_time, c.last_follow_up_status, c.last_follow_up_content, c.priority FROM saos_base.tsa_task a LEFT JOIN saos_tur.tur_customer b ON a.customer_Account = b.customer_account_id LEFT JOIN ( SELECT oo.customer_account_id, oo.OWNER, oo.owner_name, oo.intention_level, oo.oppty_status_category, oo.oppty_status, oo.last_follow_up_time, oo.last_follow_up_status, oo.last_follow_up_content, oo.priority FROM saos_tur.rb_opportunity oo WHERE oo.oppty_status_category = 10 ) c ON b.customer_account_id = c.customer_account_id WHERE a.biz_app = 'RB' AND a.task_owner = '5123' AND a.task_type = 7134 AND a.task_state = 11 AND a.start_time < '2020-09-10 00:00:00' ORDER BY c.priority desc, a.start_time ASC LIMIT 0, 10 \G*************************** 1. row *************************** id: Projection_13 estRows: 10.00 actRows: 0 task: root access object: execution info: time:146.177049ms, loops:1, Concurrency:OFF operator info: saos_base.tsa_task.task_id, saos_base.tsa_task.task_type, saos_base.tsa_task.task_type_name, saos_base.tsa_task.task_owner, saos_base.tsa_task.task_owner_name, saos_base.tsa_task.task_title, saos_base.tsa_task.start_time, saos_tur.tur_customer.customer_account_id, saos_tur.tur_customer.customer_name, saos_tur.rb_opportunity.owner, saos_tur.rb_opportunity.owner_name, saos_tur.rb_opportunity.intention_level, saos_tur.rb_opportunity.oppty_status_category, saos_tur.rb_opportunity.oppty_status, saos_tur.rb_opportunity.last_follow_up_time, saos_tur.rb_opportunity.last_follow_up_status, saos_tur.rb_opportunity.last_follow_up_content, saos_tur.rb_opportunity.priority memory: 7.62890625 KB disk: N/A*************************** 2. row *************************** id: └─TopN_16 estRows: 10.00 actRows: 0 task: root access object: execution info: time:146.17435ms, loops:1 operator info: saos_tur.rb_opportunity.priority:desc, saos_base.tsa_task.start_time:asc, offset:0, count:10 memory: 0 Bytes disk: N/A*************************** 3. row *************************** id: └─IndexMergeJoin_31 estRows: 762.40 actRows: 0 task: root access object: execution info: time:146.13307ms, loops:2, Concurrency:OFF operator info: left outer join, inner:Projection_29, outer key:saos_tur.tur_customer.customer_account_id, inner key:saos_tur.rb_opportunity.customer_account_id memory: 0 Bytes disk: N/A*************************** 4. row *************************** id: ├─IndexMergeJoin_46(Build) estRows: 735.86 actRows: 0 task: root access object: execution info: time:146.13931ms, loops:1, Concurrency:OFF operator info: left outer join, inner:Projection_44, outer key:saos_base.tsa_task.customer_account, inner key:saos_tur.tur_customer.customer_account_id memory: 0 Bytes disk: N/A*************************** 5. row *************************** id: │ ├─IndexLookUp_69(Build) estRows: 730.81 actRows: 0 task: root access object: execution info: time:147.392197ms, loops:1, rpc num: 1, rpc time:12.901444ms, proc keys:20185 operator info: memory: 464 KB disk: N/A*************************** 6. row *************************** id: │ │ ├─IndexRangeScan_66(Build) estRows: 20967.69 actRows: 20174 task: cop[tikv] access object: table:a, index:idx_owner(task_owner)execution info: time:11ms, loops:24 operator info: range:["5123","5123"], keep order:false memory: N/A disk: N/A*************************** 7. row *************************** id: │ │ └─Selection_68(Probe) estRows: 730.81 actRows: 0 task: cop[tikv] access object: execution info: proc max:96ms, min:17ms, p80:72ms, p95:96ms, iters:69, tasks:11 operator info: eq(saos_base.tsa_task.biz_app, "RB"), eq(saos_base.tsa_task.task_state, 11), eq(saos_base.tsa_task.task_type, 7134), lt(saos_base.tsa_task.start_time, 2020-09-10 00:00:00.000000) memory: N/A disk: N/A*************************** 8. row *************************** id: │ │ └─TableRowIDScan_67 estRows: 20967.69 actRows: 20174 task: cop[tikv] access object: table:aexecution info: proc max:96ms, min:17ms, p80:72ms, p95:96ms, iters:69, tasks:11 operator info: keep order:false memory: N/A disk: N/A*************************** 9. row *************************** id: │ └─Projection_44(Probe) estRows: 1.00 actRows: task: root access object: execution info: time:0ns, loops:0 operator info: saos_tur.tur_customer.customer_name, saos_tur.tur_customer.customer_account_id memory: N/A disk: N/A*************************** 10. row *************************** id: │ └─IndexLookUp_43 estRows: 1.00 actRows: task: root access object: execution info: time:0ns, loops:0 operator info: memory: N/A disk: N/A*************************** 11. row *************************** id: │ ├─IndexRangeScan_41(Build) estRows: 1.00 actRows: task: cop[tikv] access object: table:b, index:uniq_customeraccountid(customer_account_id)execution info: time:0ns, loops:0 operator info: range: decided by [eq(saos_tur.tur_customer.customer_account_id, saos_base.tsa_task.customer_account)], keep order:true memory: N/A disk: N/A*************************** 12. row *************************** id: │ └─TableRowIDScan_42(Probe) estRows: 1.00 actRows: task: cop[tikv] access object: table:bexecution info: time:0ns, loops:0 operator info: keep order:false memory: N/A disk: N/A*************************** 13. row *************************** id: └─Projection_29(Probe) estRows: 1.04 actRows: task: root access object: execution info: time:0ns, loops:0 operator info: saos_tur.rb_opportunity.customer_account_id, saos_tur.rb_opportunity.owner, saos_tur.rb_opportunity.owner_name, saos_tur.rb_opportunity.oppty_status_category, saos_tur.rb_opportunity.oppty_status, saos_tur.rb_opportunity.priority, saos_tur.rb_opportunity.last_follow_up_time, saos_tur.rb_opportunity.last_follow_up_content, saos_tur.rb_opportunity.last_follow_up_status, saos_tur.rb_opportunity.intention_level memory: N/A disk: N/A*************************** 14. row *************************** id: └─IndexLookUp_28 estRows: 1.04 actRows: task: root access object: execution info: time:0ns, loops:0 operator info: memory: N/A disk: N/A*************************** 15. row *************************** id: ├─IndexRangeScan_25(Build) estRows: 2.54 actRows: task: cop[tikv] access object: table:oo, index:idx_customer_account_id(customer_account_id)execution info: time:0ns, loops:0 operator info: range: decided by [eq(saos_tur.rb_opportunity.customer_account_id, saos_tur.tur_customer.customer_account_id)], keep order:true memory: N/A disk: N/A*************************** 16. row *************************** id: └─Selection_27(Probe) estRows: 1.04 actRows: task: cop[tikv] access object: execution info: time:0ns, loops:0 operator info: eq(saos_tur.rb_opportunity.oppty_status_category, 10) memory: N/A disk: N/A*************************** 17. row *************************** id: └─TableRowIDScan_26 estRows: 2.54 actRows: task: cop[tikv] access object: table:ooexecution info: time:0ns, loops:0 operator info: keep order:false memory: N/A disk: N/A17 rows in set (0.15 sec)

至此,我们业务的问题解决了(查询快了),但是改写SQL让其强制走tiflash不生效的问题还没结论,最后经过PingCAP后台小伙伴确认还是我们的写法问题,这个SQL涉及到的表不在同一个db里面,所以强制走tiflash的写法需要指定db,即/*+ read_from_storage(tiflash[a,b]) */ 改成/*+ read_from_storage(tiflash[saos_base.a, saos_tur.b]) */ 就可以了,当然了这个问题在官方文档里面并没有写清楚,官方后面会完善使用手册,增加这个场景的说明。PS:TiFlash hint不生效的情况还有两种:1.低于5.7.7的版本要加上–comments才能让hint生效,MySQL 命令行客户端在 5.7.7 版本之前默认清除了 Optimizer Hints。如果需要在这些早期版本的客户端中使用 Hint 语法,需要在启动客户端时加上 --comments 选项;2.如果表使用了别名,那么hint里也要使用别名才能生效

总结

查看表的总行数以及修改的行数等信息SHOW stats_meta where Table_name = ‘tablename’查看表的健康度

通过 SHOW STATS_HEALTHY 可以查看表的统计信息健康度,并粗略估计表上统计信息的准确度。当 modify_count >= row_count 时,健康度为 0;当 modify_count < row_count 时,健康度为 (1 - modify_count/row_count) * 100

自动触发表统计信息更新的条件

触发策略:新表达到1000条,并且在1分钟内没有写入,会自动触发。当表的(修改数/当前总行数,这个数据对应state_meta里面total和modify)大于 tidb_auto_analyze_ratio 的时候,会自动触发 analyze语句。tidb_auto_analyze_ratio的默认值为0.5,即默认开启此功能。为了保险起见,在开启此功能的时候,保证了其最小值为0.3。但是不能大于等于 pseudo-estimate-ratio(默认值为 0.8),否则会有一段时间使用pseudo统计信息,建议设置值为0.5。

查看tidb目前支持的读引擎以及修改session级别的读引擎

#修改回话级别的读引擎,下面这个设置是禁止读tikv,需要确保查询的表都已经同步到tiflash,不然查询会报错MySQL [(none)]> set @@session.tidb_isolation_read_engines = "tidb,tiflash";Query OK, 0 rows affected (0.00 sec)#查看目前支持的读引擎MySQL [(none)]> show variables like 'tidb_isolation_read_engines'; +-----------------------------+--------------+| Variable_name | Value |+-----------------------------+--------------+| tidb_isolation_read_engines | tidb,tiflash |+-----------------------------+--------------+1 row in set (0.01 sec)

手动更新表统计信息的方法,更多细节和使用方法见手动更新表统计信息

ANALYZE table Tablename;

Tidb虽然高度兼容Mysql,但是有些用法和Mysql还是不一样的,包括SQL优化,下面是SQL优化相关的原理文章,要想把Tidb用好,SQL的优化这块还需要从原理方面深入学习,SQL优化原理相关

下面这三篇原理相关的文章可以了解下说存储说计算谈调度

写在最后

特别感谢PingCAP的小伙伴们的大力支持,每次遇到问题会帮忙积极主动跟进处理,再次感谢仲舒、贤净、苏丹、振姣。

上一篇:软件测试培训之兼容性测试是如何做的?
下一篇:软件测试培训之自动化测试的重要性
相关文章

 发表评论

暂时没有评论,来抢沙发吧~