GLM-130B model evaluation on the 4 x RTX 3090 GPU machine #94

Open · Tomas0413 opened this issue Mar 1, 2023 · 0 comments

I have completed an initial evaluation of the GLM-130B model on a 4 x RTX 3090 GPU machine (64 GB of RAM and an 8 TB SSD).

The model underwent INT4 quantization, which reduced the GPU memory requirement from 240 GB to 63 GB.
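
As a sanity check on those numbers: with roughly 130 billion parameters, FP16 weights cost 2 bytes each and INT4 weights half a byte, so quantizing the weights should cut the footprint by about 4x. A minimal back-of-the-envelope sketch (assuming the nominal 130e9 parameter count and ignoring activations, KV cache, and framework overhead):

```python
# Rough weight-memory estimate for GLM-130B at different precisions.
# Assumes the nominal 130e9 parameter count; ignores activations,
# KV cache, and framework overhead, so measured usage will differ.
N_PARAMS = 130e9

def weight_memory_gb(bits_per_param: int) -> float:
    """Approximate weight storage in GB for a given precision."""
    return N_PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(bits):.0f} GB")
# FP16: ~260 GB, INT8: ~130 GB, INT4: ~65 GB
```

That lines up reasonably well with the measured 240 GB → 63 GB; the small gap presumably comes from parts of the model that are not stored at the headline precision.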

The time taken in the evaluation table is per task; for convenience, each line reports a different metric. I also tried to find the relevant benchmark for each task and added a link to the Papers With Code website.

For now, this is only the first evaluation run, but I plan to spend more time closely examining the tasks and exploring FasterTransformer to improve inference time.

I'm interested in seeing if I could use GLM-130B in combination with RLHF in the future, so I'll explore that direction too.

Below the table, I also share a partial raw log file.

| Task | Config | Benchmark | Evaluation Metric | Max | Median | Average | Time Taken (per task) | Error |
|---|---|---|---|---|---|---|---|---|
| glue_qnli | tasks/bloom/glue_qnli.yaml | GLUE QNLI | Accuracy (validation) | 88.395 | 86.235 | 84.386 | 1d 8h 30m | |
| superglue_axb | tasks/bloom/superglue_axb.yaml | SuperGLUE Broadcoverage Diagnostics (AX-b) | Accuracy (test) | 79.982 | 79.167 | 78.922 | 10h 19m | |
| mc_taco | tasks/bloom/mc_taco.yaml | MC-TACO | EM (validation) | 10.330 | 10.330 | 10.330 | 1h 40m | |
| mc_taco | tasks/bloom/mc_taco.yaml | MC-TACO | F1 (validation) | 16.724 | 16.724 | 16.724 | 1h 40m | |
| mc_taco | tasks/bloom/mc_taco.yaml | MC-TACO | EM (test) | 11.273 | 11.273 | 11.273 | 1h 40m | |
| mc_taco | tasks/bloom/mc_taco.yaml | MC-TACO | F1 (test) | 17.461 | 17.461 | 17.461 | 1h 40m | |
| math_qa | tasks/bloom/math_qa.yaml | MathQA | Accuracy (validation) | 26.145 | 23.866 | 24.371 | 2d 14h 4m | |
| math_qa | tasks/bloom/math_qa.yaml | MathQA | Accuracy (test) | 26.968 | 24.958 | 25.085 | 2d 14h 4m | |
| pubmed_qa | tasks/bloom/pubmed_qa.yaml | PubMedQA | Accuracy (train) | 70.200 | 70.200 | 70.200 | 4h 9m | |
| glue_mnli | tasks/bloom/glue_mnli.yaml | GLUE MNLI | Accuracy (validation-matched) | 86.765 | 84.707 | 84.942 | 1w 6d 3h 21m | |
| glue_mnli | tasks/bloom/glue_mnli.yaml | GLUE MNLI | Accuracy (validation-mismatched) | 87.429 | 85.466 | 85.761 | 1w 6d 3h 21m | |
| glue_wnli | tasks/bloom/glue_wnli.yaml | GLUE WNLI | Accuracy (validation) | 69.014 | 66.197 | 64.789 | 23m | |
| superglue_axg | tasks/bloom/superglue_axg.yaml | SuperGLUE Winogender Schema Diagnostics (AX-g) | Accuracy (test) | 88.202 | 86.798 | 87.022 | 2h 18m | |
| openbook_qa | tasks/bloom/openbook_qa.yaml | OpenBookQA | | | | | | CUDA out of memory. Tried to allocate 994.00 MiB (GPU 3; 23.69 GiB total capacity; 19.62 GiB already allocated; 841.12 MiB free; 21.37 GiB reserved in total by PyTorch) |
| glue_cola | tasks/bloom/glue_cola.yaml | CoLA | Accuracy (validation) | 64.334 | 56.184 | 56.031 | 3h 7m | |
| C3 | tasks/chinese/clue/c3.yaml | C3 | Accuracy (all) | 74.895 | 74.895 | 74.895 | 8h 33m | |
| DRCD | tasks/chinese/clue/drcd.yaml | DRCD | EM (all) | 75.284 | 40.323 | 48.916 | 2d 20h 42m | |
| DRCD | tasks/chinese/clue/drcd.yaml | DRCD | F1 (all) | 75.359 | 40.344 | 48.954 | 2d 20h 42m | |
| OCNLI_50K | tasks/chinese/clue/ocnli.yaml | CLUE (OCNLI_50K) | Accuracy (all) | 73.767 | 73.767 | 73.767 | 2h 16m | |
| CMNLI | tasks/chinese/clue/cmnli.yaml | CLUE (CMNLI) | Accuracy (all) | 75.189 | 75.189 | 75.189 | 12h 6m | |
| CSL | tasks/chinese/clue/csl.yaml | CSL | Accuracy (all) | 50.000 | 50.000 | 50.000 | 7h 14m | |
| CMRC2018 | tasks/chinese/clue/cmrc2018.yaml | CLUE (CMRC2018) | EM (all) | 53.091 | 28.611 | 29.574 | 2d 22h 15m | |
| CMRC2018 | tasks/chinese/clue/cmrc2018.yaml | CLUE (CMRC2018) | F1 (all) | 53.885 | 29.042 | 30.057 | 2d 22h 15m | |
| CLUEWSC2020 | tasks/chinese/clue/cluewsc.yaml | FewCLUE (CLUEWSC-FC) | Accuracy (all) | 82.237 | 69.243 | 64.364 | 3h 42m | |
| AFQMC | tasks/chinese/clue/afqmc.yaml | CLUE (AFQMC) | Accuracy (all) | 71.640 | 69.856 | 67.822 | 16h | |
| EPRSTMT | tasks/chinese/fewclue/eprstmt.yaml | FewCLUE (EPRSTMT) | Accuracy (dev) | 92.500 | 92.500 | 92.500 | 51m | |
| EPRSTMT | tasks/chinese/fewclue/eprstmt.yaml | FewCLUE (EPRSTMT) | Accuracy (test) | 91.475 | 91.475 | 91.475 | 51m | |
| CLUEWSCF | tasks/chinese/fewclue/cluewscf.yaml | CLUE (WSC1.1) | Accuracy (dev) | 62.893 | 56.604 | 56.604 | 2h 28m | |
| CLUEWSCF | tasks/chinese/fewclue/cluewscf.yaml | CLUE (WSC1.1) | Accuracy (test) | 65.984 | 58.094 | 58.094 | 2h 28m | |
| CSLF | tasks/chinese/fewclue/cslf.yaml | CSL | Accuracy (dev) | 49.375 | 49.375 | 49.375 | 8h 1m | |
| CSLF | tasks/chinese/fewclue/cslf.yaml | CSL | Accuracy (test) | 50.000 | 50.000 | 50.000 | 8h 1m | |
| CHIDF | tasks/chinese/fewclue/chidf.yaml | (Few-Shot) on ChID | Accuracy (dev) | 91.089 | 91.089 | 91.089 | 2h 58m | |
| CHIDF | tasks/chinese/fewclue/chidf.yaml | (Few-Shot) on ChID | Accuracy (test) | 92.358 | 92.358 | 92.358 | 2h 58m | |
| BUSTM | tasks/chinese/fewclue/bustm.yaml | FewCLUE (BUSTM) | | | | | | CUDA out of memory. Tried to allocate 1.01 GiB (GPU 0; 23.69 GiB total capacity; 19.77 GiB already allocated; 912.75 MiB free; 21.28 GiB reserved in total by PyTorch) |
| OCNLIF | tasks/chinese/fewclue/ocnlif.yaml | (Few-Shot) on OCNLI | Accuracy (dev) | 71.875 | 71.875 | 71.875 | 2h 2m | |
| OCNLIF | tasks/chinese/fewclue/ocnlif.yaml | (Few-Shot) on OCNLI | Accuracy (test) | 74.167 | 74.167 | 74.167 | 2h 2m | |
| LAMBADA | tasks/lambada/lambada.yaml | LAMBADA | | | | | | CUDA out of memory. Tried to allocate 1.65 GiB (GPU 1; 23.69 GiB total capacity; 18.99 GiB already allocated; 1.22 GiB free; 20.97 GiB reserved in total by PyTorch) |
| LAMBADA-unidirectional | tasks/lambada/lambada-unidirectional.yaml | LAMBADA | | | | | | CUDA out of memory. Tried to allocate 1.67 GiB (GPU 1; 23.69 GiB total capacity; 19.02 GiB already allocated; 1.56 GiB free; 20.70 GiB reserved in total by PyTorch) |
| Pile | tasks/language-modeling/pile.yaml | The Pile | | | | | | IndexError: list index out of range |
| Penn Treebank | tasks/language-modeling/ptb.yaml | Penn Treebank | | | | | 0 | |
| WikiText-103 | tasks/language-modeling/wikitext-103.yaml | WikiText-103 | | | | | 0 | |
| WikiText-2 | tasks/language-modeling/wikitext-2.yaml | WikiText-2 | | | | | 0 | |
| MMLU | tasks/mmlu/mmlu.yaml | MMLU | Accuracy (stem) | 64.000 | 38.298 | 38.408 | 4d 13h 21m | |
| MMLU | tasks/mmlu/mmlu.yaml | MMLU | Accuracy (social_sciences) | 60.104 | 52.728 | 48.456 | 4d 13h 21m | |
| MMLU | tasks/mmlu/mmlu.yaml | MMLU | Accuracy (humanities) | 64.135 | 49.691 | 41.573 | 4d 13h 21m | |
| MMLU | tasks/mmlu/mmlu.yaml | MMLU | Accuracy (other) | 64.957 | 47.085 | 50.756 | 4d 13h 21m | |
| MMLU | tasks/mmlu/mmlu.yaml | MMLU | Accuracy (Overall) | 64.957 | 46.207 | 44.403 | 4d 13h 21m | |
| CROWS | tasks/ethnic/crows-pair/crows-pair.yaml | CrowS-Pairs | | | | | 0 | |
| ETHOS_zeroshot | tasks/ethnic/ethos/ethos-zeroshot.yaml | Ethos Binary | | | | | 0 | |
| ETHOS_oneshot | tasks/ethnic/ethos/ethos-oneshot.yaml | Ethos Binary | | | | | 0 | |
| ETHOS_fewshot_single | tasks/ethnic/ethos/ethos-fewshot-single.yaml | Ethos Binary | | | | | | |
| ETHOS_fewshot_multi | tasks/ethnic/ethos/ethos-fewshot-multi.yaml | Ethos MultiLabel | | | | | 0 | |
| StereoSet | tasks/ethnic/stereoset/stereoset.yaml | StereoSet | | | | | 4d 13h 21m | |
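
The Time Taken column comes from the `Finish task <name> in <seconds>s.` lines in the raw log below. Converting those second counts into the day/hour/minute form used above can be done with a small helper like this (a minimal sketch; the regex, function name, and `evaluation.log` file name are mine, not part of the evaluation harness):

```python
import re

def humanize(seconds: float) -> str:
    """Render a duration in seconds as e.g. '4d 13h 21m'."""
    minutes, _ = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    parts = [(days, "d"), (hours, "h"), (minutes, "m")]
    return " ".join(f"{v}{u}" for v, u in parts if v) or "0m"

# Pull per-task durations out of the shared log.
pattern = re.compile(r"Finish task (?P<task>.+) in (?P<secs>[\d.]+)s")
with open("evaluation.log") as f:  # hypothetical log file name
    for match in pattern.finditer(f.read()):
        print(f"{match.group('task')}: {humanize(float(match.group('secs')))}")
```

For example, the `393690.1s` reported for MMLU comes out to the `4d 13h 21m` in the table, and the run total of `393690.3s` at the end of the log is effectively the same figure.
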
    Task glue_qnli loaded from config tasks/bloom/glue_qnli.yaml
    Task superglue_axb loaded from config tasks/bloom/superglue_axb.yaml
    Task mc_taco loaded from config tasks/bloom/mc_taco.yaml
    Task math_qa loaded from config tasks/bloom/math_qa.yaml
    Task pubmed_qa loaded from config tasks/bloom/pubmed_qa.yaml
    Task glue_mnli loaded from config tasks/bloom/glue_mnli.yaml
    Task glue_wnli loaded from config tasks/bloom/glue_wnli.yaml
    Task superglue_axg loaded from config tasks/bloom/superglue_axg.yaml
    Task openbook_qa loaded from config tasks/bloom/openbook_qa.yaml
    Task glue_cola loaded from config tasks/bloom/glue_cola.yaml
    Task C3 loaded from config tasks/chinese/clue/c3.yaml
    Task DRCD loaded from config tasks/chinese/clue/drcd.yaml
    Task OCNLI_50K loaded from config tasks/chinese/clue/ocnli.yaml
    Task CMNLI loaded from config tasks/chinese/clue/cmnli.yaml
    Task CSL loaded from config tasks/chinese/clue/csl.yaml
    Task CMRC2018 loaded from config tasks/chinese/clue/cmrc2018.yaml
    Task CLUEWSC2020 loaded from config tasks/chinese/clue/cluewsc.yaml
    Task AFQMC loaded from config tasks/chinese/clue/afqmc.yaml
    Task EPRSTMT loaded from config tasks/chinese/fewclue/eprstmt.yaml
    Task CLUEWSCF loaded from config tasks/chinese/fewclue/cluewscf.yaml
    Task CSLF loaded from config tasks/chinese/fewclue/cslf.yaml
    Task CHIDF loaded from config tasks/chinese/fewclue/chidf.yaml
    Task BUSTM loaded from config tasks/chinese/fewclue/bustm.yaml
    Task OCNLIF loaded from config tasks/chinese/fewclue/ocnlif.yaml
    Task LAMBADA loaded from config tasks/lambada/lambada.yaml
    Task LAMBADA-unidirectional loaded from config tasks/lambada/lambada-unidirectional.yaml
    Task Pile loaded from config tasks/language-modeling/pile.yaml
    Task Penn Treebank loaded from config tasks/language-modeling/ptb.yaml
    Task WikiText-103 loaded from config tasks/language-modeling/wikitext-103.yaml
    Task WikiText-2 loaded from config tasks/language-modeling/wikitext-2.yaml
    Task MMLU loaded from config tasks/mmlu/mmlu.yaml
    Task CROWS loaded from config tasks/ethnic/crows-pair/crows-pair.yaml
    Task ETHOS_zeroshot loaded from config tasks/ethnic/ethos/ethos-zeroshot.yaml
    Task ETHOS_oneshot loaded from config tasks/ethnic/ethos/ethos-oneshot.yaml
    Task ETHOS_fewshot_single loaded from config tasks/ethnic/ethos/ethos-fewshot-single.yaml
    Task ETHOS_fewshot_multi loaded from config tasks/ethnic/ethos/ethos-fewshot-multi.yaml
    Task StereoSet loaded from config tasks/ethnic/stereoset/stereoset.yaml
> Successfully load 37 tasks


MultiChoiceTaskConfig(name='glue_qnli')
Evaluation results of task glue_qnli:
    Group validation Accuracy: max = 88.395, median = 86.235, average = 84.386
Finish task glue_qnli in 117054.9s.


MultiChoiceTaskConfig(name='superglue_axb')
Evaluation results of task superglue_axb:
    Group test Accuracy: max = 79.982, median = 79.167, average = 78.922
Finish task superglue_axb in 37182.9s.


GenerationTaskConfig(name='mc_taco')
Evaluation results of task mc_taco:
      Group validation: 
        Metric EM: max = 10.330, median = 10.330, average = 10.330
        Metric F1: max = 16.724, median = 16.724, average = 16.724
      Group test: 
        Metric EM: max = 11.273, median = 11.273, average = 11.273
        Metric F1: max = 17.461, median = 17.461, average = 17.461
Finish task mc_taco in 6043.5s.


MultiChoiceTaskConfig(name='math_qa')
Evaluation results of task math_qa:
    Group validation Accuracy: max = 26.145, median = 23.866, average = 24.371
    Group test Accuracy: max = 26.968, median = 24.958, average = 25.085
Finish task math_qa in 223462.6s.


MultiChoiceTaskConfig(name='pubmed_qa')
Evaluation results of task pubmed_qa:
    Group train Accuracy: max = 70.200, median = 70.200, average = 70.200
Finish task pubmed_qa in 14989.6s.


MultiChoiceTaskConfig(name='glue_mnli')
Evaluation results of task glue_mnli:
    Group validation-matched Accuracy: max = 86.765, median = 84.707, average = 84.942
    Group validation-mismatched Accuracy: max = 87.429, median = 85.466, average = 85.761
Finish task glue_mnli in 1135279.5s.


MultiChoiceTaskConfig(name='glue_wnli')
Evaluation results of task glue_wnli:
    Group validation Accuracy: max = 69.014, median = 66.197, average = 64.789
Finish task glue_wnli in 1398.8s.


MultiChoiceTaskConfig(name='superglue_axg')
Evaluation results of task superglue_axg:
    Group test Accuracy: max = 88.202, median = 86.798, average = 87.022
Finish task superglue_axg in 8315.5s.


MultiChoiceTaskConfig(name='openbook_qa')
CUDA out of memory. Tried to allocate 994.00 MiB (GPU 3; 23.69 GiB total capacity; 19.62 GiB already allocated; 841.12 MiB free; 21.37 GiB reserved in total by PyTorch)

MultiChoiceTaskConfig(name='glue_cola')
Evaluation results of task glue_cola:
    Group validation Accuracy: max = 64.334, median = 56.184, average = 56.031
Finish task glue_cola in 11242.4s.


MultiChoiceTaskConfig(name='C3')
Evaluation results of task C3:
    Group all Accuracy: max = 74.895, median = 74.895, average = 74.895
Finish task C3 in 30808.4s.


GenerationTaskConfig(name='DRCD')
Evaluation results of task DRCD:
      Group all: 
        Metric EM: max = 75.284, median = 40.323, average = 48.916
        Metric F1: max = 75.359, median = 40.344, average = 48.954
Finish task DRCD in 247329.3s.


MultiChoiceTaskConfig(name='OCNLI_50K')
Evaluation results of task OCNLI_50K:
    Group all Accuracy: max = 73.767, median = 73.767, average = 73.767
Finish task OCNLI_50K in 8183.8s.


MultiChoiceTaskConfig(name='CMNLI')
Evaluation results of task CMNLI:
    Group all Accuracy: max = 75.189, median = 75.189, average = 75.189
Finish task CMNLI in 43612.5s.


MultiChoiceTaskConfig(name='CSL')
Evaluation results of task CSL:
    Group all Accuracy: max = 50.000, median = 50.000, average = 50.000
Finish task CSL in 26050.0s.


GenerationTaskConfig(name='CMRC2018')
Evaluation results of task CMRC2018:
      Group all: 
        Metric EM: max = 53.091, median = 28.611, average = 29.574
        Metric F1: max = 53.885, median = 29.042, average = 30.057
Finish task CMRC2018 in 252918.1s.


MultiChoiceTaskConfig(name='CLUEWSC2020')
Evaluation results of task CLUEWSC2020:
    Group all Accuracy: max = 82.237, median = 69.243, average = 64.364
Finish task CLUEWSC2020 in 13329.4s.


MultiChoiceTaskConfig(name='AFQMC')
Evaluation results of task AFQMC:
    Group all Accuracy: max = 71.640, median = 69.856, average = 67.822
Finish task AFQMC in 57646.9s.


MultiChoiceTaskConfig(name='EPRSTMT')
Evaluation results of task EPRSTMT:
    Group dev Accuracy: max = 92.500, median = 92.500, average = 92.500
    Group test Accuracy: max = 91.475, median = 91.475, average = 91.475
Finish task EPRSTMT in 3073.1s.


MultiChoiceTaskConfig(name='CLUEWSCF')
Evaluation results of task CLUEWSCF:
    Group dev Accuracy: max = 62.893, median = 56.604, average = 56.604
    Group test Accuracy: max = 65.984, median = 58.094, average = 58.094
Finish task CLUEWSCF in 8895.2s.


MultiChoiceTaskConfig(name='CSLF')
Evaluation results of task CSLF:
    Group dev Accuracy: max = 49.375, median = 49.375, average = 49.375
    Group test Accuracy: max = 50.000, median = 50.000, average = 50.000
Finish task CSLF in 28883.4s.


MultiChoiceTaskConfig(name='CHIDF')
Evaluation results of task CHIDF:
    Group dev Accuracy: max = 91.089, median = 91.089, average = 91.089
    Group test Accuracy: max = 92.358, median = 92.358, average = 92.358
Finish task CHIDF in 10732.3s.


MultiChoiceTaskConfig(name='BUSTM')
CUDA out of memory. Tried to allocate 1.01 GiB (GPU 0; 23.69 GiB total capacity; 19.77 GiB already allocated; 912.75 MiB free; 21.28 GiB reserved in total by PyTorch)


MultiChoiceTaskConfig(name='OCNLIF')
Evaluation results of task OCNLIF:
    Group dev Accuracy: max = 71.875, median = 71.875, average = 71.875
    Group test Accuracy: max = 74.167, median = 74.167, average = 74.167
Finish task OCNLIF in 7344.2s.


GenerationTaskConfig(name='LAMBADA')
CUDA out of memory. Tried to allocate 1.65 GiB (GPU 1; 23.69 GiB total capacity; 18.99 GiB already allocated; 1.22 GiB free; 20.97 GiB reserved in total by PyTorch)


GenerationTaskConfig(name='LAMBADA-unidirectional')
CUDA out of memory. Tried to allocate 1.67 GiB (GPU 1; 23.69 GiB total capacity; 19.02 GiB already allocated; 1.56 GiB free; 20.70 GiB reserved in total by PyTorch)


LanguageModelTaskConfig(name='Pile')
IndexError: list index out of range


LanguageModelTaskConfig(name='Penn Treebank')
Evaluation results of task Penn Treebank:
      Group test: 
Finish task Penn Treebank in 0.0s.


LanguageModelTaskConfig(name='WikiText-103')
Evaluation results of task WikiText-103:
      Group test: 
Finish task WikiText-103 in 0.0s.


LanguageModelTaskConfig(name='WikiText-2')
Evaluation results of task WikiText-2:
      Group test: 
Finish task WikiText-2 in 0.0s.


MultiChoiceTaskConfig(name='MMLU')
Evaluation results of task MMLU:
    Group stem Accuracy: max = 64.000, median = 38.298, average = 38.408
    Group social_sciences Accuracy: max = 60.104, median = 52.728, average = 48.456
    Group humanities Accuracy: max = 64.135, median = 49.691, average = 41.573
    Group other Accuracy: max = 64.957, median = 47.085, average = 50.756
    Group Overall Accuracy: max = 64.957, median = 46.207, average = 44.403
Finish task MMLU in 393690.1s.


MultiChoiceTaskConfig(name='CROWS')
Evaluating task CROWS:
    Evaluating group test:
Evaluation results of task CROWS:
Finish task CROWS in 0.0s.


MultiChoiceTaskConfig(name='ETHOS_zeroshot')
Evaluation results of task ETHOS_zeroshot:
      Group test: 
Finish task ETHOS_zeroshot in 0.0s.


MultiChoiceTaskConfig(name='ETHOS_oneshot')
Evaluation results of task ETHOS_oneshot:
      Group test: 
Finish task ETHOS_oneshot in 0.0s.


MultiChoiceTaskConfig(name='ETHOS_fewshot_single')
Evaluation results of task ETHOS_fewshot_single:
      Group test: 
Finish task ETHOS_fewshot_single in 0.0s.


MultiChoiceTaskConfig(name='ETHOS_fewshot_multi')
Evaluation results of task ETHOS_fewshot_multi:
      Group test: 
Finish task ETHOS_fewshot_multi in 0.0s.


MultiChoiceTaskConfig(name='StereoSet')
Evaluation results of task StereoSet:
Finish task StereoSet in 0.0s.
Finish 10 tasks in 393690.3s
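
A side note on the CUDA OOM failures (openbook_qa, BUSTM, LAMBADA): the messages show roughly 1 GiB allocations failing while a comparable amount is nominally free but already reserved by PyTorch, which on nearly full 24 GiB cards often points at allocator fragmentation. One knob worth trying before shrinking batch sizes is PyTorch's caching-allocator setting (a hedged suggestion I have not verified on this setup):

```python
import os

# Cap the size of blocks the caching allocator will split, which can
# reduce fragmentation-induced OOMs at the cost of some throughput.
# Must be set before torch initializes CUDA, hence before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402  (imported after setting the env var on purpose)
```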