Why is processing requests with batch size 1 much slower than with batch size > 1? #142

mapcan opened this issue Jun 8, 2023 · 0 comments

I ran some performance tests of a 3.5B BLOOM model on 1 GPU using perf_analyzer; the results are:

| batch size | avg latency (us) |
|------------|------------------|
| 1          | 6533769          |
| 2          | 2819328          |
| 4          | 2953732          |

I then tested the 2-GPU version of the same model; the results are:

| batch size | avg latency (us) |
|------------|------------------|
| 1          | 1769113          |
| 2          | 3032188          |
| 4          | 3461972          |

The 1-GPU model is much slower at batch size 1 than at batch size 2 or 4. The 2-GPU model is faster than the 1-GPU model at batch size 1, but slower when batch size > 1. How can these results be explained? Please help.
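For comparison, it can help to view the numbers above as throughput rather than latency: with batch size b and average latency t, throughput is b / t requests per second. A quick sketch using the measured values from the tables above (latencies in microseconds):

```python
# Convert the measured average latencies (microseconds) into throughput
# (requests per second), using the numbers reported above.

LATENCIES_US = {
    "1-gpu": {1: 6533769, 2: 2819328, 4: 2953732},
    "2-gpu": {1: 1769113, 2: 3032188, 4: 3461972},
}

def throughput_rps(batch_size: int, latency_us: int) -> float:
    """Requests completed per second with one batch in flight."""
    return batch_size / (latency_us / 1_000_000)

for model, results in LATENCIES_US.items():
    for b, lat in sorted(results.items()):
        print(f"{model} b={b}: {throughput_rps(b, lat):.3f} req/s")
```

Note that even where batch 1 latency looks anomalously high, throughput still rises with batch size on the 1-GPU model, which is the usual batching behavior.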

GPU: Tesla T4
CUDA Version: 11.8
model config: almost the same as all_models/bloom
input data:

{
    "data": [
        {
            "INPUT_0": {
                "content": [
                    "话说大宋仁宗天子在位,嘉祐三年三月三日五更三点,天子驾坐紫宸殿,受百官朝贺。但见:祥云迷凤阁,瑞气罩龙楼。含烟御柳拂旌旗,带露宫花迎剑戟。天香影里,玉簪朱履聚丹墀;仙乐声中,绣袄锦衣扶御驾。珍珠帘卷,黄金殿上现金轝,凤羽扇开,白玉阶前停宝辇。隐隐净鞭三下响,层层文武两班齐。当有殿头官喝道:“有事出班早奏,无事卷帘退朝。”只见班部丛中,宰相赵哲、参政文彦博出班奏曰:“目今京师瘟疫盛行,伤损军民甚多。伏望陛下释罪宽恩,省刑薄税,祈禳天灾,救济万民。”天子听奏,急敕翰林院随即草诏,一面降赦天下罪囚,应有民间税赋,悉皆赦免;一面命在京宫观寺院,修设好事禳灾。不料其年瘟疫转盛,仁宗天子闻知,龙体不安,复会百官计议。向那班部中,有一大臣,越班启奏。天子看时,乃是参知政事范仲淹,拜罢起居,奏曰:“目今天灾盛行,军民涂炭,日夕不能聊生。以臣愚意,要禳此灾,可宣嗣汉天师星夜临朝,就京师禁院,修设三千六百分罗天大醮,奏闻上帝,可以禳保民间瘟疫。”仁宗天子准奏,急令翰林学士草诏一道,天子御笔亲书,并降御香一炷,钦差内外提点殿前太尉洪信为天使,前往江西信州龙虎山,宣请嗣汉天师张真人星夜来朝,祈禳瘟疫。就金殿上焚起御香,亲将丹诏付与洪太尉,即便登程前去。洪信领了圣敕,辞别天子,背了诏书,盛了御香,带了数十人,上了铺马,一行部队,离了东京,取路径投信州贵溪县来。但见:遥山叠翠,远水"
                ],
                "shape": [1]
            },
            "INPUT_1": {
                "content": [
                    200
                ],
                "shape": [1]
            },
            "INPUT_2": {
                "content": [
                    ""
                ],
                "shape": [1]
            },
            "INPUT_3": {
                "content": [
                    ""
                ],
                "shape": [1]
            },
            "temperature": {
                "content": [
                    1.0
                ],
                "shape": [1]
            },
            "repetition_penalty": {
                "content": [
                    1.03
                ],
                "shape": [1]
            },
            "random_seed": {
                "content": [
                    3
                ],
                "shape": [1]
            },
            "is_return_log_probs": {
                "content": [
                    true
                ],
                "shape": [1]
            },
            "is_return_context_embeddings": {
                "content": [
                    true
                ],
                "shape": [1]
            },
            "beam_width": {
                "content": [
                    1
                ],
                "shape": [1]
            },
            "start_id": {
                "content": [
                    1
                ],
                "shape": [1]
            },
            "end_id": {
                "content": [
                    2
                ],
                "shape": [1]
            }
        }
    ]
}
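Since a truncated or malformed entry in this file can silently skew a perf_analyzer run, a small sanity check can be useful. The sketch below assumes the convention used above: every tensor object under "data" should carry both a "content" and a "shape" key (the function name `check_input_data` is just for illustration):

```python
import json

REQUIRED_KEYS = {"content", "shape"}

def check_input_data(doc: dict) -> list:
    """Return a list of (entry_index, tensor_name, missing_keys) problems."""
    problems = []
    for i, entry in enumerate(doc.get("data", [])):
        for name, tensor in entry.items():
            missing = REQUIRED_KEYS - set(tensor)
            if missing:
                problems.append((i, name, sorted(missing)))
    return problems

# Example: a well-formed entry passes, a truncated one is flagged.
good = {"data": [{"INPUT_1": {"content": [200], "shape": [1]}}]}
bad = {"data": [{"INPUT_3": {"content": [""]}}]}  # no "shape" key
print(check_input_data(good))  # []
print(check_input_data(bad))
```

Running `json.load` over the actual file before handing it to perf_analyzer also catches raw syntax errors such as an unclosed object.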

perf_analyzer command:

perf_analyzer -m ensemble --input-data input_data --measurement-mode=count_windows --concurrency-range 1 -b {batch size}
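To sweep the three batch sizes in one go instead of editing `-b` by hand, the command above can be templated; a minimal sketch (it only prints each invocation so it can be inspected or piped to a shell; pass the list to `subprocess.run` to actually execute it):

```python
import shlex

# Base perf_analyzer invocation, copied from the command used above.
BASE_CMD = [
    "perf_analyzer",
    "-m", "ensemble",
    "--input-data", "input_data",
    "--measurement-mode=count_windows",
    "--concurrency-range", "1",
]

# Append -b for each batch size measured in the tables above and print a
# copy-pasteable command line.
for batch_size in (1, 2, 4):
    cmd = BASE_CMD + ["-b", str(batch_size)]
    print(shlex.join(cmd))
```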