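Best practices for monitoring Project resource quotas with CloudLens for SLS - Alibaba Cloud (云淘科技)
This article describes how to use the global error logs and metrics in CloudLens for SLS to monitor the usage level of Project resource quotas and to detect quota-exceeded events.
Background
Alibaba Cloud Lens builds unified observability for cloud products on top of SLS. It supports one-click enabling of collection for instance logs (important logs, detailed logs, job run logs) and global logs (audit logs, billing logs, error logs, metrics).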
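Log category | Subcategory | Monitoring scenarios |
Instance logs | Detailed logs (paid) | Access traffic monitoring; access exception monitoring |
 | Important logs (free) | Consumer group monitoring; Logtail collection monitoring |
 | Job run logs (free) | Data transformation (new version) monitoring; Scheduled SQL job monitoring |
Global logs | Audit logs (free) | Resource operation monitoring |
 | Error logs (free) | Quota exceeded monitoring; access exception monitoring; operation exception monitoring |
 | Metrics (free) | Access traffic monitoring; access exception monitoring; resource quota usage level monitoring |
 | Billing logs (free) | Resource usage tracking |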
For a description of each log type, see the CloudLens log index table: https://help.aliyun.com/document_detail/456901.html?spm=a2c4g.456864.0.0.e979723c8We7zA
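Use cases
This article describes how to use the global error logs and metrics in CloudLens for SLS to monitor the usage level of Project resource quotas, to detect quota-exceeded events, and how to submit a request to raise a resource quota.
Prerequisites
To build real-time monitoring of resource quota usage levels, the global monitoring logs involved (error logs and metrics) must be stored in the same Project. To keep the monitoring logs from consuming the quota of a business Project, you can choose a fixed target Project in one region, for example in the Hangzhou region: log-service-{user ID}-cn-hangzhou.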
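The target Project name above follows a simple pattern. A minimal sketch of building it is shown here; the function name and the sample account ID are hypothetical, and applying the same pattern to regions other than cn-hangzhou is an assumption that the text does not confirm.

```python
def monitoring_project_name(account_id: str, region: str = "cn-hangzhou") -> str:
    """Build the name of the fixed Project that stores the global monitoring logs.

    Only the cn-hangzhou form is given in the article; other regions are assumed
    to follow the same pattern.
    """
    return f"log-service-{account_id}-{region}"


# "1234567890" is a made-up account ID used only for illustration.
print(monitoring_project_name("1234567890"))
```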
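CloudLens for SLS quota monitoring dashboard
Resource quota warning overview
The dashboard provides an overview of resource quota warnings (usage level above 80%) and the distribution of quota-exceeded events.
Real-time usage level details for key Project resource quotas
Shows the real-time usage level of some of the Project's basic resource quotas and of its data read/write resource quotas.
Project resource quota exceeded details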
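Monitoring practices
Category | Monitoring item | Description |
Real-time usage level monitoring | Basic resource quota usage level monitoring | |
 | Data read/write quota usage level monitoring | |
Quota exceeded monitoring | Resource quota exceeded count monitoring | |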
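Category | Scenario | Monitoring item | Description |
Basic resource quotas | LogStore | Real-time usage level monitoring | |
 | | Quota exceeded monitoring | |
 | Machine group | Usage level monitoring | |
 | | Quota exceeded monitoring | |
 | Logtail collection configuration | Usage level monitoring | |
 | | Quota exceeded monitoring | |
Data read/write resource quotas | Project write traffic | Usage level monitoring | |
 | | Quota exceeded monitoring | |
 | Project write count | Usage level monitoring | |
 | | Quota exceeded monitoring | |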
Basic monitoring
Basic resource quota usage level monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the usage level of the LogStore count, the machine group count, or the Logtail collection configuration count has reached the alert threshold.
Note: a query returns 100 rows by default. To return more rows, append a limit clause to the SQL; for example, limit 1000 returns up to 1,000 rows. An alert can evaluate at most 1,000 rows of the result against the trigger condition, so it is recommended to filter by usage level inside the alert SQL first, for example logstore_ratio > 80 or machine_group_ratio > 80 or logtail_config_ratio > 80.
Query SQL:
* | select Project, region, logstore_ratio, machine_group_ratio, logtail_config_ratio from
(SELECT A.id as Project , A.region as region,
round(COALESCE(SUM(B.count_logstore), 0)/cast(json_extract(A.quota, '$.logstore') as double) * 100, 3) as logstore_ratio, cast(json_extract(A.quota, '$.logstore') as double) as quota_logstore,
round(COALESCE(SUM(C.count_machine_group), 0)/cast(json_extract(A.quota, '$.machine_group') as double) * 100, 3) as machine_group_ratio, cast(json_extract(A.quota, '$.machine_group') as double) as quota_machine_group,
round(COALESCE(SUM(D.count_logtail_config), 0)/cast(json_extract(A.quota, '$.config') as double) * 100, 3) as logtail_config_ratio, cast(json_extract(A.quota, '$.config') as double) as quota_logtail_config
FROM "resource.sls.cmdb.project" as A
LEFT JOIN (
SELECT project, COUNT(*) AS count_logstore
FROM "resource.sls.cmdb.logstore" as B
GROUP BY project
) AS B ON A.id = B.project
LEFT JOIN (
SELECT project, COUNT(*) AS count_machine_group
FROM "resource.sls.cmdb.machine_group" as C
GROUP BY project
) AS C ON A.id = C.project
LEFT JOIN (
SELECT project, COUNT(*) AS count_logtail_config
FROM "resource.sls.cmdb.logtail_config" as D
GROUP BY project
) AS D ON A.id = D.project
group by A.id, A.quota, A.region)
where quota_logstore is not null and quota_machine_group is not null and quota_logtail_config is not null and (logstore_ratio > 80 or machine_group_ratio > 80 or logtail_config_ratio > 80) limit 10000
2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:
- If any one of a Project's LogStore count, machine group count, or Logtail collection configuration count exceeds 90% of its quota, the alert severity is Critical.
- If any one of a Project's LogStore count, machine group count, or Logtail collection configuration count exceeds 80% of its quota, the alert severity is Medium.
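The two thresholds above can be sketched in a few lines of Python. This is only an illustration of the rule, not how SLS evaluates alerts internally; the function names, the sample counts, and the quota of 200 are hypothetical.

```python
from typing import Dict, Optional

CRITICAL_THRESHOLD = 90.0  # percent of quota, from the first rule
MEDIUM_THRESHOLD = 80.0    # percent of quota, from the second rule


def usage_ratio(count: int, quota: float) -> float:
    """Percentage of the quota in use, rounded to 3 decimals like the SQL."""
    return round(count / quota * 100, 3)


def severity(ratios: Dict[str, float]) -> Optional[str]:
    """Return the severity for one Project row, or None when no rule matches."""
    highest = max(ratios.values())
    if highest > CRITICAL_THRESHOLD:
        return "critical"
    if highest > MEDIUM_THRESHOLD:
        return "medium"
    return None


# Made-up counts against a made-up quota of 200 for each resource type.
row = {
    "logstore_ratio": usage_ratio(185, 200),
    "machine_group_ratio": usage_ratio(20, 200),
    "logtail_config_ratio": usage_ratio(81, 200),
}
print(row, severity(row))
```

Taking the maximum of the three ratios matches the "any one of" wording: a single resource type above the threshold is enough to raise the alert.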
Data read/write quota usage level monitoring
1. Confirm the alert SQL: every minute, check whether the usage level of the Project write traffic or the Project write count has reached the alert threshold.
Note: a query returns 100 rows by default. To return more rows, append a limit clause to the SQL; for example, limit 1000 returns up to 1,000 rows. An alert can evaluate at most 1,000 rows of the result against the trigger condition, so it is recommended to filter by write traffic or write count inside the alert SQL first, for example where inflow_ratio > 80 or write_cnt_ratio > 80.
Query SQL:
(*)| select Project, region, inflow_ratio, write_cnt_ratio from
(SELECT cmdb.id as Project, cmdb.region as region,
round(COALESCE(M.name1,0)/round(cast(json_extract(cmdb.quota, '$.inflow_per_min') as double)/1000000000, 3) * 100, 3) as inflow_ratio,
round(COALESCE(M.name2,0)/cast(json_extract(cmdb.quota, '$.write_cnt_per_min') as double) * 100, 3) as write_cnt_ratio
from "resource.sls.cmdb.project" as cmdb
LEFT JOIN (
select project, round(MAX(name1)/1000000000, 3) as name1, MAX(name2) as name2 from
(SELECT __time_nano__ as time, element_at( split_to_map(__labels__, '|', '#$#') , 'project') as project,
sum(CASE WHEN __name__ = 'logstore_origin_inflow_bytes' THEN __value__ ELSE NULL END) AS name1,
sum(CASE WHEN __name__ = 'logstore_write_count' THEN __value__ ELSE NULL END) AS name2
FROM "internal-monitor-metric.prom" where __name__ in ('logstore_origin_inflow_bytes','logstore_write_count' ) and regexp_like(element_at( split_to_map(__labels__, '|', '#$#') , 'project') , '.*')
group by project,time )
group by project) AS M ON cmdb.id = M.project)
where inflow_ratio > 80 or write_cnt_ratio > 80 limit 10000
2. Alert configuration: set the query time range to the last 5 minutes (relative), then configure the trigger conditions and the alert policy according to your business scenario:
- If either the write traffic or the write count of a Project exceeds 90% of its quota, the alert severity is Critical.
- If either the write traffic or the write count of a Project exceeds 80% of its quota, the alert severity is Medium.
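The ratio arithmetic in the query can be worked through outside SQL. The sketch below mirrors the scaling by 1000000000 and the rounding to 3 decimals; treating the raw values as bytes is an assumption, and the function names and sample numbers are hypothetical.

```python
SCALE = 1_000_000_000  # the SQL divides both the metric and the quota by this value


def inflow_ratio(peak_inflow: float, quota_inflow_per_min: float) -> float:
    """Write-traffic usage level in percent, mirroring the SQL step by step."""
    peak_scaled = round(peak_inflow / SCALE, 3)             # round(MAX(name1)/1000000000, 3)
    quota_scaled = round(quota_inflow_per_min / SCALE, 3)   # round(quota/1000000000, 3)
    return round(peak_scaled / quota_scaled * 100, 3)


def write_cnt_ratio(peak_write_count: float, quota_write_cnt_per_min: float) -> float:
    """Write-count usage level in percent; no unit scaling is applied."""
    return round(peak_write_count / quota_write_cnt_per_min * 100, 3)


# Made-up sample values: a peak of 8.5e9 against a quota of 1e10, and
# 270,000 writes against a quota of 300,000 writes per minute.
print(inflow_ratio(8_500_000_000, 10_000_000_000))
print(write_cnt_ratio(270_000, 300_000))
```

With these sample values the first ratio falls in the Medium band (above 80%) and the second sits exactly at 90%, which does not exceed the Critical threshold.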
Resource quota exceeded count monitoring
1. Confirm the alert SQL: every 15 minutes, check whether any quota has been exceeded.
Query SQL:
((* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed or ErrorCode: ShardWriteQuotaExceed or ErrorCode: ShardReadQuotaExceed)))| SELECT Project,
CASE
WHEN ErrorMsg like '%Project write quota exceed: inflow%' then 'Project write traffic exceeded'
WHEN ErrorMsg like '%Project write quota exceed: qps%' then 'Project write count exceeded'
WHEN ErrorMsg like '%dashboard quota exceed%' then 'Dashboard quota exceeded'
WHEN ErrorMsg like '%config count%' then 'Logtail collection configuration quota exceeded'
WHEN ErrorMsg like '%machine group count%' then 'Machine group quota exceeded'
WHEN ErrorMsg like '%Alert count %' then 'Alert quota exceeded'
WHEN ErrorMsg like '%logstore count %' then 'LogStore count exceeded'
WHEN ErrorMsg like '%shard count%' then 'Shard count exceeded'
WHEN ErrorMsg like '%shard write bytes%' then 'Shard write exceeded'
WHEN ErrorMsg like '%shard write quota%' then 'Shard write exceeded'
WHEN ErrorMsg like '%user can only run%' then 'SQL analysis concurrency exceeded'
ELSE ErrorMsg
END AS ErrorMsg,
COUNT(1) AS count GROUP BY Project, ErrorMsg Limit 1000
2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:
- If any quota is exceeded 10 times, the alert severity is Critical.
- If any quota is exceeded once, the alert severity is Medium.
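The CASE expression and the two count-based rules above can be sketched in Python to make the mapping easier to test offline. The substrings come from the query; the sample error messages are invented, and reading "10 times" as "at least 10" is an assumption.

```python
from collections import Counter
from typing import Optional

# (substring, label) pairs in the same order as the CASE expression.
PATTERNS = [
    ("Project write quota exceed: inflow", "Project write traffic exceeded"),
    ("Project write quota exceed: qps", "Project write count exceeded"),
    ("dashboard quota exceed", "Dashboard quota exceeded"),
    ("config count", "Logtail collection configuration quota exceeded"),
    ("machine group count", "Machine group quota exceeded"),
    ("Alert count ", "Alert quota exceeded"),
    ("logstore count ", "LogStore count exceeded"),
    ("shard count", "Shard count exceeded"),
    ("shard write bytes", "Shard write exceeded"),
    ("shard write quota", "Shard write exceeded"),
    ("user can only run", "SQL analysis concurrency exceeded"),
]


def classify(error_msg: str) -> str:
    """Return the label of the first matching pattern, else the raw message."""
    for needle, label in PATTERNS:
        if needle in error_msg:  # case-sensitive, like SQL LIKE '%needle%'
            return label
    return error_msg


def severity(count: int) -> Optional[str]:
    """Map an exceeded count to a severity (assumes 'at least N' semantics)."""
    if count >= 10:
        return "critical"
    if count >= 1:
        return "medium"
    return None


# Invented messages that only reuse the substrings from the query.
messages = ["Project write quota exceed: qps: 100"] * 12 + ["logstore count 200 exceeded"]
counts = Counter(classify(m) for m in messages)
for label, n in counts.items():
    print(label, n, severity(n))
```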
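Advanced monitoring
The items below are finer-grained breakdowns of the basic monitoring. They are usually unnecessary; refer to them if you need more granular alerting.
LogStore monitoring
Usage level monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the usage level of the LogStore count has reached the alert threshold.
Note: a query returns 100 rows by default. To return more rows, append a limit clause to the SQL; for example, limit 1000 returns up to 1,000 rows.
Query SQL:
* | select Project, region, round(count_logstore/quota_logstore * 100, 3) as logstore_ratio from
(SELECT A.id as Project , A.region as region, COALESCE(SUM(B.count_logstore), 0) AS count_logstore , cast(json_extract(A.quota, '$.logstore') as double) as quota_logstore
FROM "resource.sls.cmdb.project" as A
LEFT JOIN (
SELECT project, COUNT(*) AS count_logstore
FROM "resource.sls.cmdb.logstore" as B
GROUP BY project
) AS B ON A.id = B.project
group by A.id, A.quota, A.region) where quota_logstore is not null order by logstore_ratio desc limit 1000
2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:
- If a Project's LogStore count exceeds 90% of its quota, the alert severity is Critical.
- If a Project's LogStore count exceeds 80% of its quota, the alert severity is Medium.
Note that when multiple trigger conditions are configured, they are evaluated from top to bottom, so logstore_ratio>90 must be placed above logstore_ratio>80.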
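The effect of the top-to-bottom evaluation order can be illustrated with a small first-match loop. This is a sketch of the ordering pitfall only; the function and rule lists are hypothetical and do not represent the SLS alert engine.

```python
from typing import List, Optional, Tuple


def first_match(value: float, rules: List[Tuple[float, str]]) -> Optional[str]:
    """Evaluate rules from top to bottom and return the first level that matches."""
    for threshold, level in rules:
        if value > threshold:
            return level
    return None


correct_order = [(90, "critical"), (80, "medium")]  # stricter rule on top
wrong_order = [(80, "medium"), (90, "critical")]    # looser rule shadows the stricter one

print(first_match(95, correct_order))
print(first_match(95, wrong_order))
```

With the looser rule on top, a usage level of 95% already satisfies the 80% condition, so the Critical rule is never reached.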
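Quota exceeded monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the LogStore quota has been exceeded.
Query SQL:
* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
COUNT(1) AS count where ErrorMsg like '%logstore count %' GROUP BY Project ORDER BY count DESC LIMIT 1000
2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:
- If a Project's LogStore quota is exceeded 10 times, the alert severity is Critical.
- If a Project's LogStore quota is exceeded once, the alert severity is Medium.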
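Machine group monitoring
Usage level monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the usage level of the machine group count has reached the alert threshold.
Note: a query returns 100 rows by default. To return more rows, append a limit clause to the SQL; for example, limit 1000 returns up to 1,000 rows.
Query SQL:
* | select Project, region, round(count_machine_group/quota_machine_group * 100, 3) as machine_group_ratio from
(SELECT A.id as Project , A.region as region, COALESCE(SUM(B.count_machine_group), 0) AS count_machine_group , cast(json_extract(A.quota, '$.machine_group') as double) as quota_machine_group
FROM "resource.sls.cmdb.project" as A
LEFT JOIN (
SELECT project, COUNT(*) AS count_machine_group
FROM "resource.sls.cmdb.machine_group" as B
GROUP BY project
) AS B ON A.id = B.project
group by A.id, A.quota, A.region) where quota_machine_group is not null order by machine_group_ratio desc limit 1000
2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:
- If a Project's machine group count exceeds 90% of its quota, the alert severity is Critical.
- If a Project's machine group count exceeds 80% of its quota, the alert severity is Medium.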
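Quota exceeded monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the machine group quota has been exceeded.
Query SQL:
* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
COUNT(1) AS count where ErrorMsg like '%machine group count%' GROUP BY Project ORDER BY count DESC LIMIT 1000
2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:
- If a Project's machine group quota is exceeded 10 times, the alert severity is Critical.
- If a Project's machine group quota is exceeded once, the alert severity is Medium.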
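Logtail collection configuration monitoring
Usage level monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the usage level of the Logtail collection configuration count has reached the alert threshold.
Note: a query returns 100 rows by default. To return more rows, append a limit clause to the SQL; for example, limit 1000 returns up to 1,000 rows.
Query SQL:
* | select Project, region, round(count_logtail_config/quota_logtail_config * 100, 3) as logtail_config_ratio from
(SELECT A.id as Project , A.region as region, COALESCE(SUM(B.count_logtail_config), 0) AS count_logtail_config , cast(json_extract(A.quota, '$.config') as double) as quota_logtail_config
FROM "resource.sls.cmdb.project" as A
LEFT JOIN (
SELECT project, COUNT(*) AS count_logtail_config
FROM "resource.sls.cmdb.logtail_config" as B
GROUP BY project
) AS B ON A.id = B.project
group by A.id, A.quota, A.region) where quota_logtail_config is not null order by logtail_config_ratio desc limit 1000
2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:
- If a Project's Logtail collection configuration count exceeds 90% of its quota, the alert severity is Critical.
- If a Project's Logtail collection configuration count exceeds 80% of its quota, the alert severity is Medium.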
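Quota exceeded monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the Logtail collection configuration quota has been exceeded.
Query SQL:
* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
COUNT(1) AS count where ErrorMsg like '%config count%' GROUP BY Project ORDER BY count DESC LIMIT 1000
2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:
- If a Project's Logtail collection configuration quota is exceeded 10 times, the alert severity is Critical.
- If a Project's Logtail collection configuration quota is exceeded once, the alert severity is Medium.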
Project write traffic monitoring
Usage level monitoring
1. Confirm the alert SQL: every minute, check whether the usage level of the Project write traffic over the last 5 minutes (relative) has reached the alert threshold.
Query SQL:
(*)| SELECT Project, region , round(count_inflow/cast(quota_inflow as double) * 100, 3) as inflow_ratio
FROM
(SELECT cmdb.id as Project, cmdb.region as region, COALESCE(M.name1,0) as count_inflow, round(cast(json_extract(cmdb.quota, '$.inflow_per_min') as double)/1000000000, 3) as quota_inflow from "resource.sls.cmdb.project" as cmdb
LEFT JOIN (
select project, round(MAX(name1)/1000000000, 3) as name1 from (SELECT __time_nano__ as time, element_at( split_to_map(__labels__, '|', '#$#') , 'project') as project, sum(CASE WHEN __name__ = 'logstore_origin_inflow_bytes' THEN __value__ ELSE NULL END) AS name1
FROM "internal-monitor-metric.prom" where __name__ ='logstore_origin_inflow_bytes' and regexp_like(element_at( split_to_map(__labels__, '|', '#$#') , 'project') , '.*') group by project,time )group by project) AS M ON cmdb.id = M.project )order by inflow_ratio desc limit 1000
2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:
- If a Project's write traffic exceeds 90% of its quota, the alert severity is Critical.
- If a Project's write traffic exceeds 80% of its quota, the alert severity is Medium.
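The nested aggregation in the write traffic query (sum per Project and timestamp, then the maximum per Project) can be sketched in Python as follows. The tuple layout and the sample values are hypothetical; only the two-step aggregation and the scaling by 1000000000 are taken from the query.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

SCALE = 1_000_000_000  # same divisor as in the SQL


def peak_inflow(samples: Iterable[Tuple[str, int, float]]) -> Dict[str, float]:
    """samples: (project, timestamp, value) rows of the inflow metric."""
    # Inner query: sum(...) group by project, time
    per_point: Dict[Tuple[str, int], float] = defaultdict(float)
    for project, ts, value in samples:
        per_point[(project, ts)] += value

    # Outer query: MAX(name1) group by project
    peaks: Dict[str, float] = {}
    for (project, _ts), total in per_point.items():
        peaks[project] = max(peaks.get(project, 0.0), total)

    # round(MAX(name1)/1000000000, 3)
    return {project: round(total / SCALE, 3) for project, total in peaks.items()}


# Made-up samples: two series of Project "demo" report at timestamp 1.
samples = [
    ("demo", 1, 2_000_000_000),
    ("demo", 1, 3_000_000_000),
    ("demo", 2, 4_000_000_000),
    ("other", 1, 1_000_000_000),
]
print(peak_inflow(samples))
```

Using the per-timestamp peak rather than an average means a short burst inside the 5-minute window is enough to push the usage level over the threshold.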
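Quota exceeded monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the Project write traffic quota has been exceeded.
Query SQL:
* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
COUNT(1) AS count where ErrorMsg like '%Project write quota exceed: inflow%' GROUP BY Project ORDER BY count DESC LIMIT 1000
2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:
- If a Project's write traffic quota is exceeded 10 times, the alert severity is Critical.
- If a Project's write traffic quota is exceeded once, the alert severity is Medium.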
Project write count monitoring
Usage level monitoring
1. Confirm the alert SQL: every minute, check whether the usage level of the Project write count over the last 5 minutes (relative) has reached the alert threshold.
Query SQL:
(*)| SELECT Project, region, round(count_write_cnt/cast(quota_write_cnt as double) * 100, 3) as write_cnt_ratio
FROM
(SELECT cmdb.id as Project, cmdb.region as region, COALESCE(M.name1,0) as count_write_cnt,
cast(json_extract(cmdb.quota, '$.write_cnt_per_min') as bigint) as quota_write_cnt from "resource.sls.cmdb.project" as cmdb
LEFT JOIN (
select project, MAX(name1) as name1 from (SELECT __time_nano__ as time, element_at( split_to_map(__labels__, '|', '#$#') , 'project') as project,
sum(CASE WHEN __name__ = 'logstore_write_count' THEN __value__ ELSE NULL END) AS name1
FROM "internal-monitor-metric.prom" where __name__ = 'logstore_write_count' and regexp_like(element_at( split_to_map(__labels__, '|', '#$#') , 'project') , '.*') group by project,time )group by project) AS M ON cmdb.id = M.project ) order by write_cnt_ratio desc limit 1000
2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:
- If a Project's write count exceeds 90% of its quota, the alert severity is Critical.
- If a Project's write count exceeds 80% of its quota, the alert severity is Medium.
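Quota exceeded monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the Project write count quota has been exceeded.
Query SQL:
* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
COUNT(1) AS count where ErrorMsg like '%Project write quota exceed: qps%' GROUP BY Project ORDER BY count DESC LIMIT 1000
2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:
- If a Project's write count quota is exceeded 10 times, the alert severity is Critical.
- If a Project's write count quota is exceeded once, the alert severity is Medium.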
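The quota exceeded queries in the advanced section differ only in the ErrorMsg pattern. A hedged sketch of a helper that generates them from one template is shown below; the function name and the dictionary keys are hypothetical, while the error codes and patterns are copied from the queries in this article.

```python
ERROR_CODES = ["ExceedQuota", "QuotaExceed", "ProjectQuotaExceed", "WriteQuotaExceed"]

# ErrorMsg substrings used by the advanced quota exceeded queries.
PATTERNS = {
    "logstore": "logstore count ",
    "machine_group": "machine group count",
    "logtail_config": "config count",
    "write_traffic": "Project write quota exceed: inflow",
    "write_count": "Project write quota exceed: qps",
}


def exceeded_query(pattern: str, limit: int = 1000) -> str:
    """Build the search-plus-SQL statement for one ErrorMsg pattern."""
    search = " or ".join(f"ErrorCode: {code}" for code in ERROR_CODES)
    return (
        f"* and ({search})| SELECT Project, COUNT(1) AS count "
        f"where ErrorMsg like '%{pattern}%' "
        f"GROUP BY Project ORDER BY count DESC LIMIT {limit}"
    )


for name, pattern in PATTERNS.items():
    print(name, "->", exceeded_query(pattern))
```

Keeping the patterns in one place makes it harder for a copied query to end up with the wrong pattern for its section.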