GLM-130B model evaluation on the 4 x RTX 3090 GPU machine #94

Open · Tomas0413 opened this issue Mar 1, 2023 · 0 comments

I have completed an initial evaluation of the GLM-130B model on a 4 x RTX 3090 GPU machine (64 GB of RAM and an 8 TB SSD).

The model underwent INT4 quantization, which reduced the GPU memory requirement from 240 GB to 63 GB.
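
As a sanity check on those numbers: with roughly 130 billion parameters, FP16 weights cost 2 bytes each and INT4 weights half a byte, so quantizing the weights should cut the footprint by about 4x. A minimal back-of-the-envelope sketch (assuming the nominal 130e9 parameter count and ignoring activations, KV cache, and framework overhead):

```python
# Rough weight-memory estimate for GLM-130B at different precisions.
# Assumes the nominal 130e9 parameter count; ignores activations,
# KV cache, and framework overhead, so measured usage will differ.
N_PARAMS = 130e9

def weight_memory_gb(bits_per_param: int) -> float:
    """Approximate weight storage in GB for a given precision."""
    return N_PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(bits):.0f} GB")
# FP16: ~260 GB, INT8: ~130 GB, INT4: ~65 GB
```

That lines up reasonably well with the measured 240 GB → 63 GB; the small gap presumably comes from parts of the model that are not stored at the headline precision.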

The time taken in the evaluation table is per task; for convenience, each line reports a different metric. I also tried to find the relevant benchmark for each task and added a link to the Papers With Code website.

For now, this is only the first evaluation run, but I plan to spend more time closely examining the tasks and exploring FasterTransformer to improve inference time.

I'm interested in seeing if I could use GLM-130B in combination with RLHF in the future, so I'll explore that direction too.

Below the table, I also share a partial raw log file.

| Task | Config | Benchmark | Evaluation Metric | Max | Median | Average | Time Taken (per task) | Error |
|---|---|---|---|---|---|---|---|---|
| glue_qnli | tasks/bloom/glue_qnli.yaml | GLUE QNLI | Accuracy (validation) | 88.395 | 86.235 | 84.386 | 1d 8h 30m | |
| superglue_axb | tasks/bloom/superglue_axb.yaml | SuperGLUE Broadcoverage Diagnostics (AX-b) | Accuracy (test) | 79.982 | 79.167 | 78.922 | 10h 19m | |
| mc_taco | tasks/bloom/mc_taco.yaml | MC-TACO | EM (validation) | 10.330 | 10.330 | 10.330 | 1h 40m | |
| mc_taco | tasks/bloom/mc_taco.yaml | MC-TACO | F1 (validation) | 16.724 | 16.724 | 16.724 | 1h 40m | |
| mc_taco | tasks/bloom/mc_taco.yaml | MC-TACO | EM (test) | 11.273 | 11.273 | 11.273 | 1h 40m | |
| mc_taco | tasks/bloom/mc_taco.yaml | MC-TACO | F1 (test) | 17.461 | 17.461 | 17.461 | 1h 40m | |
| math_qa | tasks/bloom/math_qa.yaml | MathQA | Accuracy (validation) | 26.145 | 23.866 | 24.371 | 2d 14h 4m | |
| math_qa | tasks/bloom/math_qa.yaml | MathQA | Accuracy (test) | 26.968 | 24.958 | 25.085 | 2d 14h 4m | |
| pubmed_qa | tasks/bloom/pubmed_qa.yaml | PubMedQA | Accuracy (train) | 70.200 | 70.200 | 70.200 | 4h 9m | |
| glue_mnli | tasks/bloom/glue_mnli.yaml | GLUE MNLI | Accuracy (validation-matched) | 86.765 | 84.707 | 84.942 | 1w 6d 3h 21m | |
| glue_mnli | tasks/bloom/glue_mnli.yaml | GLUE MNLI | Accuracy (validation-mismatched) | 87.429 | 85.466 | 85.761 | 1w 6d 3h 21m | |
| glue_wnli | tasks/bloom/glue_wnli.yaml | GLUE WNLI | Accuracy (validation) | 69.014 | 66.197 | 64.789 | 23m | |
| superglue_axg | tasks/bloom/superglue_axg.yaml | SuperGLUE Winogender Schema Diagnostics (AX-g) | Accuracy (test) | 88.202 | 86.798 | 87.022 | 2h 18m | |
| openbook_qa | tasks/bloom/openbook_qa.yaml | OpenBookQA | | | | | | CUDA out of memory. Tried to allocate 994.00 MiB (GPU 3; 23.69 GiB total capacity; 19.62 GiB already allocated; 841.12 MiB free; 21.37 GiB reserved in total by PyTorch) |
| glue_cola | tasks/bloom/glue_cola.yaml | CoLA | Accuracy (validation) | 64.334 | 56.184 | 56.031 | 3h 7m | |
| C3 | tasks/chinese/clue/c3.yaml | C3 | Accuracy (all) | 74.895 | 74.895 | 74.895 | 8h 33m | |
| DRCD | tasks/chinese/clue/drcd.yaml | DRCD | EM (all) | 75.284 | 40.323 | 48.916 | 2d 20h 42m | |
| DRCD | tasks/chinese/clue/drcd.yaml | DRCD | F1 (all) | 75.359 | 40.344 | 48.954 | 2d 20h 42m | |
| OCNLI_50K | tasks/chinese/clue/ocnli.yaml | CLUE (OCNLI_50K) | Accuracy (all) | 73.767 | 73.767 | 73.767 | 2h 16m | |
| CMNLI | tasks/chinese/clue/cmnli.yaml | CLUE (CMNLI) | Accuracy (all) | 75.189 | 75.189 | 75.189 | 12h 6m | |
| CSL | tasks/chinese/clue/csl.yaml | CSL | Accuracy (all) | 50.000 | 50.000 | 50.000 | 7h 14m | |
| CMRC2018 | tasks/chinese/clue/cmrc2018.yaml | CLUE (CMRC2018) | EM (all) | 53.091 | 28.611 | 29.574 | 2d 22h 15m | |
| CMRC2018 | tasks/chinese/clue/cmrc2018.yaml | CLUE (CMRC2018) | F1 (all) | 53.885 | 29.042 | 30.057 | 2d 22h 15m | |
| CLUEWSC2020 | tasks/chinese/clue/cluewsc.yaml | FewCLUE (CLUEWSC-FC) | Accuracy (all) | 82.237 | 69.243 | 64.364 | 3h 42m | |
| AFQMC | tasks/chinese/clue/afqmc.yaml | CLUE (AFQMC) | Accuracy (all) | 71.640 | 69.856 | 67.822 | 16h | |
| EPRSTMT | tasks/chinese/fewclue/eprstmt.yaml | FewCLUE (EPRSTMT) | Accuracy (dev) | 92.500 | 92.500 | 92.500 | 51m | |
| EPRSTMT | tasks/chinese/fewclue/eprstmt.yaml | FewCLUE (EPRSTMT) | Accuracy (test) | 91.475 | 91.475 | 91.475 | 51m | |
| CLUEWSCF | tasks/chinese/fewclue/cluewscf.yaml | CLUE (WSC1.1) | Accuracy (dev) | 62.893 | 56.604 | 56.604 | 2h 28m | |
| CLUEWSCF | tasks/chinese/fewclue/cluewscf.yaml | CLUE (WSC1.1) | Accuracy (test) | 65.984 | 58.094 | 58.094 | 2h 28m | |
| CSLF | tasks/chinese/fewclue/cslf.yaml | CSL | Accuracy (dev) | 49.375 | 49.375 | 49.375 | 8h 1m | |
| CSLF | tasks/chinese/fewclue/cslf.yaml | CSL | Accuracy (test) | 50.000 | 50.000 | 50.000 | 8h 1m | |
| CHIDF | tasks/chinese/fewclue/chidf.yaml | (Few-Shot) on ChID | Accuracy (dev) | 91.089 | 91.089 | 91.089 | 2h 58m | |
| CHIDF | tasks/chinese/fewclue/chidf.yaml | (Few-Shot) on ChID | Accuracy (test) | 92.358 | 92.358 | 92.358 | 2h 58m | |
| BUSTM | tasks/chinese/fewclue/bustm.yaml | FewCLUE (BUSTM) | | | | | | CUDA out of memory. Tried to allocate 1.01 GiB (GPU 0; 23.69 GiB total capacity; 19.77 GiB already allocated; 912.75 MiB free; 21.28 GiB reserved in total by PyTorch) |
| OCNLIF | tasks/chinese/fewclue/ocnlif.yaml | (Few-Shot) on OCNLI | Accuracy (dev) | 71.875 | 71.875 | 71.875 | 2h 2m | |
| OCNLIF | tasks/chinese/fewclue/ocnlif.yaml | (Few-Shot) on OCNLI | Accuracy (test) | 74.167 | 74.167 | 74.167 | 2h 2m | |
| LAMBADA | tasks/lambada/lambada.yaml | LAMBADA | | | | | | CUDA out of memory. Tried to allocate 1.65 GiB (GPU 1; 23.69 GiB total capacity; 18.99 GiB already allocated; 1.22 GiB free; 20.97 GiB reserved in total by PyTorch) |
| LAMBADA-unidirectional | tasks/lambada/lambada-unidirectional.yaml | LAMBADA | | | | | | CUDA out of memory. Tried to allocate 1.67 GiB (GPU 1; 23.69 GiB total capacity; 19.02 GiB already allocated; 1.56 GiB free; 20.70 GiB reserved in total by PyTorch) |
| Pile | tasks/language-modeling/pile.yaml | The Pile | | | | | | IndexError: list index out of range |
| Penn Treebank | tasks/language-modeling/ptb.yaml | Penn Treebank | | | | | 0 | |
| WikiText-103 | tasks/language-modeling/wikitext-103.yaml | WikiText-103 | | | | | 0 | |
| WikiText-2 | tasks/language-modeling/wikitext-2.yaml | WikiText-2 | | | | | 0 | |
| MMLU | tasks/mmlu/mmlu.yaml | MMLU | Accuracy (stem) | 64.000 | 38.298 | 38.408 | 4d 13h 21m | |
| MMLU | tasks/mmlu/mmlu.yaml | MMLU | Accuracy (social_sciences) | 60.104 | 52.728 | 48.456 | 4d 13h 21m | |
| MMLU | tasks/mmlu/mmlu.yaml | MMLU | Accuracy (humanities) | 64.135 | 49.691 | 41.573 | 4d 13h 21m | |
| MMLU | tasks/mmlu/mmlu.yaml | MMLU | Accuracy (other) | 64.957 | 47.085 | 50.756 | 4d 13h 21m | |
| MMLU | tasks/mmlu/mmlu.yaml | MMLU | Accuracy (Overall) | 64.957 | 46.207 | 44.403 | 4d 13h 21m | |
| CROWS | tasks/ethnic/crows-pair/crows-pair.yaml | CrowS-Pairs | | | | | 0 | |
| ETHOS_zeroshot | tasks/ethnic/ethos/ethos-zeroshot.yaml | Ethos Binary | | | | | 0 | |
| ETHOS_oneshot | tasks/ethnic/ethos/ethos-oneshot.yaml | Ethos Binary | | | | | 0 | |
| ETHOS_fewshot_single | tasks/ethnic/ethos/ethos-fewshot-single.yaml | Ethos Binary | | | | | | |
| ETHOS_fewshot_multi | tasks/ethnic/ethos/ethos-fewshot-multi.yaml | Ethos MultiLabel | | | | | 0 | |
| StereoSet | tasks/ethnic/stereoset/stereoset.yaml | StereoSet | | | | | 4d 13h 21m | |
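
The Time Taken column comes from the `Finish task <name> in <seconds>s.` lines in the raw log below. Converting those second counts into the day/hour/minute form used above can be done with a small helper like this (a minimal sketch; the regex, function name, and `evaluation.log` file name are mine, not part of the evaluation harness):

```python
import re

def humanize(seconds: float) -> str:
    """Render a duration in seconds as e.g. '4d 13h 21m'."""
    minutes, _ = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    parts = [(days, "d"), (hours, "h"), (minutes, "m")]
    return " ".join(f"{v}{u}" for v, u in parts if v) or "0m"

# Pull per-task durations out of the shared log.
pattern = re.compile(r"Finish task (?P<task>.+) in (?P<secs>[\d.]+)s")
with open("evaluation.log") as f:  # hypothetical log file name
    for match in pattern.finditer(f.read()):
        print(f"{match.group('task')}: {humanize(float(match.group('secs')))}")
```

For example, the `393690.1s` reported for MMLU comes out to the `4d 13h 21m` in the table, and the run total of `393690.3s` at the end of the log is effectively the same figure.
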
    Task glue_qnli loaded from config tasks/bloom/glue_qnli.yaml
    Task superglue_axb loaded from config tasks/bloom/superglue_axb.yaml
    Task mc_taco loaded from config tasks/bloom/mc_taco.yaml
    Task math_qa loaded from config tasks/bloom/math_qa.yaml
    Task pubmed_qa loaded from config tasks/bloom/pubmed_qa.yaml
    Task glue_mnli loaded from config tasks/bloom/glue_mnli.yaml
    Task glue_wnli loaded from config tasks/bloom/glue_wnli.yaml
    Task superglue_axg loaded from config tasks/bloom/superglue_axg.yaml
    Task openbook_qa loaded from config tasks/bloom/openbook_qa.yaml
    Task glue_cola loaded from config tasks/bloom/glue_cola.yaml
    Task C3 loaded from config tasks/chinese/clue/c3.yaml
    Task DRCD loaded from config tasks/chinese/clue/drcd.yaml
    Task OCNLI_50K loaded from config tasks/chinese/clue/ocnli.yaml
    Task CMNLI loaded from config tasks/chinese/clue/cmnli.yaml
    Task CSL loaded from config tasks/chinese/clue/csl.yaml
    Task CMRC2018 loaded from config tasks/chinese/clue/cmrc2018.yaml
    Task CLUEWSC2020 loaded from config tasks/chinese/clue/cluewsc.yaml
    Task AFQMC loaded from config tasks/chinese/clue/afqmc.yaml
    Task EPRSTMT loaded from config tasks/chinese/fewclue/eprstmt.yaml
    Task CLUEWSCF loaded from config tasks/chinese/fewclue/cluewscf.yaml
    Task CSLF loaded from config tasks/chinese/fewclue/cslf.yaml
    Task CHIDF loaded from config tasks/chinese/fewclue/chidf.yaml
    Task BUSTM loaded from config tasks/chinese/fewclue/bustm.yaml
    Task OCNLIF loaded from config tasks/chinese/fewclue/ocnlif.yaml
    Task LAMBADA loaded from config tasks/lambada/lambada.yaml
    Task LAMBADA-unidirectional loaded from config tasks/lambada/lambada-unidirectional.yaml
    Task Pile loaded from config tasks/language-modeling/pile.yaml
    Task Penn Treebank loaded from config tasks/language-modeling/ptb.yaml
    Task WikiText-103 loaded from config tasks/language-modeling/wikitext-103.yaml
    Task WikiText-2 loaded from config tasks/language-modeling/wikitext-2.yaml
    Task MMLU loaded from config tasks/mmlu/mmlu.yaml
    Task CROWS loaded from config tasks/ethnic/crows-pair/crows-pair.yaml
    Task ETHOS_zeroshot loaded from config tasks/ethnic/ethos/ethos-zeroshot.yaml
    Task ETHOS_oneshot loaded from config tasks/ethnic/ethos/ethos-oneshot.yaml
    Task ETHOS_fewshot_single loaded from config tasks/ethnic/ethos/ethos-fewshot-single.yaml
    Task ETHOS_fewshot_multi loaded from config tasks/ethnic/ethos/ethos-fewshot-multi.yaml
    Task StereoSet loaded from config tasks/ethnic/stereoset/stereoset.yaml
> Successfully load 37 tasks


MultiChoiceTaskConfig(name='glue_qnli')
Evaluation results of task glue_qnli:
    Group validation Accuracy: max = 88.395, median = 86.235, average = 84.386
Finish task glue_qnli in 117054.9s.


MultiChoiceTaskConfig(name='superglue_axb')
Evaluation results of task superglue_axb:
    Group test Accuracy: max = 79.982, median = 79.167, average = 78.922
Finish task superglue_axb in 37182.9s.


GenerationTaskConfig(name='mc_taco')
Evaluation results of task mc_taco:
      Group validation: 
        Metric EM: max = 10.330, median = 10.330, average = 10.330
        Metric F1: max = 16.724, median = 16.724, average = 16.724
      Group test: 
        Metric EM: max = 11.273, median = 11.273, average = 11.273
        Metric F1: max = 17.461, median = 17.461, average = 17.461
Finish task mc_taco in 6043.5s.


MultiChoiceTaskConfig(name='math_qa')
Evaluation results of task math_qa:
    Group validation Accuracy: max = 26.145, median = 23.866, average = 24.371
    Group test Accuracy: max = 26.968, median = 24.958, average = 25.085
Finish task math_qa in 223462.6s.


MultiChoiceTaskConfig(name='pubmed_qa')
Evaluation results of task pubmed_qa:
    Group train Accuracy: max = 70.200, median = 70.200, average = 70.200
Finish task pubmed_qa in 14989.6s.


MultiChoiceTaskConfig(name='glue_mnli')
Evaluation results of task glue_mnli:
    Group validation-matched Accuracy: max = 86.765, median = 84.707, average = 84.942
    Group validation-mismatched Accuracy: max = 87.429, median = 85.466, average = 85.761
Finish task glue_mnli in 1135279.5s.


MultiChoiceTaskConfig(name='glue_wnli')
Evaluation results of task glue_wnli:
    Group validation Accuracy: max = 69.014, median = 66.197, average = 64.789
Finish task glue_wnli in 1398.8s.


MultiChoiceTaskConfig(name='superglue_axg')
Evaluation results of task superglue_axg:
    Group test Accuracy: max = 88.202, median = 86.798, average = 87.022
Finish task superglue_axg in 8315.5s.


MultiChoiceTaskConfig(name='openbook_qa')
CUDA out of memory. Tried to allocate 994.00 MiB (GPU 3; 23.69 GiB total capacity; 19.62 GiB already allocated; 841.12 MiB free; 21.37 GiB reserved in total by PyTorch)

MultiChoiceTaskConfig(name='glue_cola')
Evaluation results of task glue_cola:
    Group validation Accuracy: max = 64.334, median = 56.184, average = 56.031
Finish task glue_cola in 11242.4s.


MultiChoiceTaskConfig(name='C3')
Evaluation results of task C3:
    Group all Accuracy: max = 74.895, median = 74.895, average = 74.895
Finish task C3 in 30808.4s.


GenerationTaskConfig(name='DRCD')
Evaluation results of task DRCD:
      Group all: 
        Metric EM: max = 75.284, median = 40.323, average = 48.916
        Metric F1: max = 75.359, median = 40.344, average = 48.954
Finish task DRCD in 247329.3s.


MultiChoiceTaskConfig(name='OCNLI_50K')
Evaluation results of task OCNLI_50K:
    Group all Accuracy: max = 73.767, median = 73.767, average = 73.767
Finish task OCNLI_50K in 8183.8s.


MultiChoiceTaskConfig(name='CMNLI')
Evaluation results of task CMNLI:
    Group all Accuracy: max = 75.189, median = 75.189, average = 75.189
Finish task CMNLI in 43612.5s.


MultiChoiceTaskConfig(name='CSL')
Evaluation results of task CSL:
    Group all Accuracy: max = 50.000, median = 50.000, average = 50.000
Finish task CSL in 26050.0s.


GenerationTaskConfig(name='CMRC2018')
Evaluation results of task CMRC2018:
      Group all: 
        Metric EM: max = 53.091, median = 28.611, average = 29.574
        Metric F1: max = 53.885, median = 29.042, average = 30.057
Finish task CMRC2018 in 252918.1s.


MultiChoiceTaskConfig(name='CLUEWSC2020')
Evaluation results of task CLUEWSC2020:
    Group all Accuracy: max = 82.237, median = 69.243, average = 64.364
Finish task CLUEWSC2020 in 13329.4s.


MultiChoiceTaskConfig(name='AFQMC')
Evaluation results of task AFQMC:
    Group all Accuracy: max = 71.640, median = 69.856, average = 67.822
Finish task AFQMC in 57646.9s.


MultiChoiceTaskConfig(name='EPRSTMT')
Evaluation results of task EPRSTMT:
    Group dev Accuracy: max = 92.500, median = 92.500, average = 92.500
    Group test Accuracy: max = 91.475, median = 91.475, average = 91.475
Finish task EPRSTMT in 3073.1s.


MultiChoiceTaskConfig(name='CLUEWSCF')
Evaluation results of task CLUEWSCF:
    Group dev Accuracy: max = 62.893, median = 56.604, average = 56.604
    Group test Accuracy: max = 65.984, median = 58.094, average = 58.094
Finish task CLUEWSCF in 8895.2s.


MultiChoiceTaskConfig(name='CSLF')
Evaluation results of task CSLF:
    Group dev Accuracy: max = 49.375, median = 49.375, average = 49.375
    Group test Accuracy: max = 50.000, median = 50.000, average = 50.000
Finish task CSLF in 28883.4s.


MultiChoiceTaskConfig(name='CHIDF')
Evaluation results of task CHIDF:
    Group dev Accuracy: max = 91.089, median = 91.089, average = 91.089
    Group test Accuracy: max = 92.358, median = 92.358, average = 92.358
Finish task CHIDF in 10732.3s.


MultiChoiceTaskConfig(name='BUSTM')
CUDA out of memory. Tried to allocate 1.01 GiB (GPU 0; 23.69 GiB total capacity; 19.77 GiB already allocated; 912.75 MiB free; 21.28 GiB reserved in total by PyTorch)


MultiChoiceTaskConfig(name='OCNLIF')
Evaluation results of task OCNLIF:
    Group dev Accuracy: max = 71.875, median = 71.875, average = 71.875
    Group test Accuracy: max = 74.167, median = 74.167, average = 74.167
Finish task OCNLIF in 7344.2s.


GenerationTaskConfig(name='LAMBADA')
CUDA out of memory. Tried to allocate 1.65 GiB (GPU 1; 23.69 GiB total capacity; 18.99 GiB already allocated; 1.22 GiB free; 20.97 GiB reserved in total by PyTorch)


GenerationTaskConfig(name='LAMBADA-unidirectional')
CUDA out of memory. Tried to allocate 1.67 GiB (GPU 1; 23.69 GiB total capacity; 19.02 GiB already allocated; 1.56 GiB free; 20.70 GiB reserved in total by PyTorch)


LanguageModelTaskConfig(name='Pile')
IndexError: list index out of range


LanguageModelTaskConfig(name='Penn Treebank')
Evaluation results of task Penn Treebank:
      Group test: 
Finish task Penn Treebank in 0.0s.


LanguageModelTaskConfig(name='WikiText-103')
Evaluation results of task WikiText-103:
      Group test: 
Finish task WikiText-103 in 0.0s.


LanguageModelTaskConfig(name='WikiText-2')
Evaluation results of task WikiText-2:
      Group test: 
Finish task WikiText-2 in 0.0s.


MultiChoiceTaskConfig(name='MMLU')
Evaluation results of task MMLU:
    Group stem Accuracy: max = 64.000, median = 38.298, average = 38.408
    Group social_sciences Accuracy: max = 60.104, median = 52.728, average = 48.456
    Group humanities Accuracy: max = 64.135, median = 49.691, average = 41.573
    Group other Accuracy: max = 64.957, median = 47.085, average = 50.756
    Group Overall Accuracy: max = 64.957, median = 46.207, average = 44.403
Finish task MMLU in 393690.1s.


MultiChoiceTaskConfig(name='CROWS')
Evaluating task CROWS:
    Evaluating group test:
Evaluation results of task CROWS:
Finish task CROWS in 0.0s.


MultiChoiceTaskConfig(name='ETHOS_zeroshot')
Evaluation results of task ETHOS_zeroshot:
      Group test: 
Finish task ETHOS_zeroshot in 0.0s.


MultiChoiceTaskConfig(name='ETHOS_oneshot')
Evaluation results of task ETHOS_oneshot:
      Group test: 
Finish task ETHOS_oneshot in 0.0s.


MultiChoiceTaskConfig(name='ETHOS_fewshot_single')
Evaluation results of task ETHOS_fewshot_single:
      Group test: 
Finish task ETHOS_fewshot_single in 0.0s.


MultiChoiceTaskConfig(name='ETHOS_fewshot_multi')
Evaluation results of task ETHOS_fewshot_multi:
      Group test: 
Finish task ETHOS_fewshot_multi in 0.0s.


MultiChoiceTaskConfig(name='StereoSet')
Evaluation results of task StereoSet:
Finish task StereoSet in 0.0s.
Finish 10 tasks in 393690.3s
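
A side note on the CUDA OOM failures (openbook_qa, BUSTM, LAMBADA): the messages show roughly 1 GiB allocations failing while a comparable amount is nominally free but already reserved by PyTorch, which on nearly full 24 GiB cards often points at allocator fragmentation. One knob worth trying before shrinking batch sizes is PyTorch's caching-allocator setting (a hedged suggestion I have not verified on this setup):

```python
import os

# Cap the size of blocks the caching allocator will split, which can
# reduce fragmentation-induced OOMs at the cost of some throughput.
# Must be set before torch initializes CUDA, hence before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402  (imported after setting the env var on purpose)
```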