I have completed an initial evaluation of the GLM-130B model on a machine with 4x RTX 3090 GPUs, 64 GB of RAM, and an 8 TB SSD.
The model was quantized to INT4, which reduced the GPU memory requirement from 240 GB to 63 GB.
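As a sanity check on those numbers, here is a back-of-envelope estimate (my own arithmetic, not a figure from the GLM-130B docs): at 4 bits per parameter, the weights of a 130B-parameter model alone come to roughly 65 GB, which lines up with the 63 GB I observed.

```python
# Back-of-envelope weight-memory arithmetic for a 130B-parameter model.
# Rough estimates of weight storage only, ignoring activations, buffers,
# and any layers the quantization scheme may keep at higher precision.
params = 130e9

fp16_gb = params * 2 / 1e9    # 2 bytes per parameter at FP16
int4_gb = params * 0.5 / 1e9  # 4 bits = 0.5 bytes per parameter at INT4

print(f"FP16 weights: ~{fp16_gb:.0f} GB")                    # ~260 GB
print(f"INT4 weights: ~{int4_gb:.0f} GB")                    # ~65 GB
print(f"INT4 per GPU (4x RTX 3090): ~{int4_gb / 4:.1f} GB")  # ~16 GB of 24 GB
```

The gap between those ~16 GB of weights per GPU and the ~19-20 GiB shown as allocated in the OOM messages below is presumably activations and intermediate buffers, which is what pushes the 24 GB cards over the edge on some tasks.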
The time reported in the evaluation table is per task, and for readability each metric is on its own line. For each task I also tried to find the relevant benchmark and added a link to it on the Papers With Code website.
This is only a first evaluation run; I plan to spend more time closely examining the tasks and exploring FasterTransformer to improve inference time.
I'm also interested in whether GLM-130B could be combined with RLHF, so I'll explore that direction in the future too.
Alongside the table, I'm sharing a partial raw log file. Four tasks failed with CUDA out-of-memory errors, reproduced here (a fifth task, Pile, failed with an IndexError; see the log below):
CUDA out of memory. Tried to allocate 994.00 MiB (GPU 3; 23.69 GiB total capacity; 19.62 GiB already allocated; 841.12 MiB free; 21.37 GiB reserved in total by PyTorch)
CUDA out of memory. Tried to allocate 1.01 GiB (GPU 0; 23.69 GiB total capacity; 19.77 GiB already allocated; 912.75 MiB free; 21.28 GiB reserved in total by PyTorch)
CUDA out of memory. Tried to allocate 1.65 GiB (GPU 1; 23.69 GiB total capacity; 18.99 GiB already allocated; 1.22 GiB free; 20.97 GiB reserved in total by PyTorch)
CUDA out of memory. Tried to allocate 1.67 GiB (GPU 1; 23.69 GiB total capacity; 19.02 GiB already allocated; 1.56 GiB free; 20.70 GiB reserved in total by PyTorch)
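A quick check on the first message above (my own arithmetic on the logged numbers, not output from the harness) suggests allocator fragmentation plays a role: PyTorch has reserved noticeably more memory than it has allocated, and that slack alone would cover the failed request.

```python
# Fragmentation check on the first OOM message above (GPU 3).
# The three input numbers are copied from the log; the calculation is mine.
GiB = 1024**3
MiB = 1024**2

requested = 994 * MiB         # "Tried to allocate 994.00 MiB"
allocated = 19.62 * GiB       # "19.62 GiB already allocated"
reserved = 21.37 * GiB        # "21.37 GiB reserved in total by PyTorch"

slack = reserved - allocated  # reserved by PyTorch but not handed out
print(f"reserved-but-unallocated: {slack / GiB:.2f} GiB")  # ~1.75 GiB
print(f"would cover the request: {slack > requested}")     # True
```

If that holds, setting PyTorch's documented allocator option `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:<n>` or lowering the micro-batch size for these tasks might let them finish on the same hardware. The partial raw log follows.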
Task glue_qnli loaded from config tasks/bloom/glue_qnli.yaml
Task superglue_axb loaded from config tasks/bloom/superglue_axb.yaml
Task mc_taco loaded from config tasks/bloom/mc_taco.yaml
Task math_qa loaded from config tasks/bloom/math_qa.yaml
Task pubmed_qa loaded from config tasks/bloom/pubmed_qa.yaml
Task glue_mnli loaded from config tasks/bloom/glue_mnli.yaml
Task glue_wnli loaded from config tasks/bloom/glue_wnli.yaml
Task superglue_axg loaded from config tasks/bloom/superglue_axg.yaml
Task openbook_qa loaded from config tasks/bloom/openbook_qa.yaml
Task glue_cola loaded from config tasks/bloom/glue_cola.yaml
Task C3 loaded from config tasks/chinese/clue/c3.yaml
Task DRCD loaded from config tasks/chinese/clue/drcd.yaml
Task OCNLI_50K loaded from config tasks/chinese/clue/ocnli.yaml
Task CMNLI loaded from config tasks/chinese/clue/cmnli.yaml
Task CSL loaded from config tasks/chinese/clue/csl.yaml
Task CMRC2018 loaded from config tasks/chinese/clue/cmrc2018.yaml
Task CLUEWSC2020 loaded from config tasks/chinese/clue/cluewsc.yaml
Task AFQMC loaded from config tasks/chinese/clue/afqmc.yaml
Task EPRSTMT loaded from config tasks/chinese/fewclue/eprstmt.yaml
Task CLUEWSCF loaded from config tasks/chinese/fewclue/cluewscf.yaml
Task CSLF loaded from config tasks/chinese/fewclue/cslf.yaml
Task CHIDF loaded from config tasks/chinese/fewclue/chidf.yaml
Task BUSTM loaded from config tasks/chinese/fewclue/bustm.yaml
Task OCNLIF loaded from config tasks/chinese/fewclue/ocnlif.yaml
Task LAMBADA loaded from config tasks/lambada/lambada.yaml
Task LAMBADA-unidirectional loaded from config tasks/lambada/lambada-unidirectional.yaml
Task Pile loaded from config tasks/language-modeling/pile.yaml
Task Penn Treebank loaded from config tasks/language-modeling/ptb.yaml
Task WikiText-103 loaded from config tasks/language-modeling/wikitext-103.yaml
Task WikiText-2 loaded from config tasks/language-modeling/wikitext-2.yaml
Task MMLU loaded from config tasks/mmlu/mmlu.yaml
Task CROWS loaded from config tasks/ethnic/crows-pair/crows-pair.yaml
Task ETHOS_zeroshot loaded from config tasks/ethnic/ethos/ethos-zeroshot.yaml
Task ETHOS_oneshot loaded from config tasks/ethnic/ethos/ethos-oneshot.yaml
Task ETHOS_fewshot_single loaded from config tasks/ethnic/ethos/ethos-fewshot-single.yaml
Task ETHOS_fewshot_multi loaded from config tasks/ethnic/ethos/ethos-fewshot-multi.yaml
Task StereoSet loaded from config tasks/ethnic/stereoset/stereoset.yaml
> Successfully load 37 tasks
MultiChoiceTaskConfig(name='glue_qnli')
Evaluation results of task glue_qnli:
Group validation Accuracy: max = 88.395, median = 86.235, average = 84.386
Finish task glue_qnli in 117054.9s.
MultiChoiceTaskConfig(name='superglue_axb')
Evaluation results of task superglue_axb:
Group test Accuracy: max = 79.982, median = 79.167, average = 78.922
Finish task superglue_axb in 37182.9s.
GenerationTaskConfig(name='mc_taco')
Evaluation results of task mc_taco:
Group validation:
Metric EM: max = 10.330, median = 10.330, average = 10.330
Metric F1: max = 16.724, median = 16.724, average = 16.724
Group test:
Metric EM: max = 11.273, median = 11.273, average = 11.273
Metric F1: max = 17.461, median = 17.461, average = 17.461
Finish task mc_taco in 6043.5s.
MultiChoiceTaskConfig(name='math_qa')
Evaluation results of task math_qa:
Group validation Accuracy: max = 26.145, median = 23.866, average = 24.371
Group test Accuracy: max = 26.968, median = 24.958, average = 25.085
Finish task math_qa in 223462.6s.
MultiChoiceTaskConfig(name='pubmed_qa')
Evaluation results of task pubmed_qa:
Group train Accuracy: max = 70.200, median = 70.200, average = 70.200
Finish task pubmed_qa in 14989.6s.
MultiChoiceTaskConfig(name='glue_mnli')
Evaluation results of task glue_mnli:
Group validation-matched Accuracy: max = 86.765, median = 84.707, average = 84.942
Group validation-mismatched Accuracy: max = 87.429, median = 85.466, average = 85.761
Finish task glue_mnli in 1135279.5s.
MultiChoiceTaskConfig(name='glue_wnli')
Evaluation results of task glue_wnli:
Group validation Accuracy: max = 69.014, median = 66.197, average = 64.789
Finish task glue_wnli in 1398.8s.
MultiChoiceTaskConfig(name='superglue_axg')
Evaluation results of task superglue_axg:
Group test Accuracy: max = 88.202, median = 86.798, average = 87.022
Finish task superglue_axg in 8315.5s.
MultiChoiceTaskConfig(name='openbook_qa')
CUDA out of memory. Tried to allocate 994.00 MiB (GPU 3; 23.69 GiB total capacity; 19.62 GiB already allocated; 841.12 MiB free; 21.37 GiB reserved in total by PyTorch)
MultiChoiceTaskConfig(name='glue_cola')
Evaluation results of task glue_cola:
Group validation Accuracy: max = 64.334, median = 56.184, average = 56.031
Finish task glue_cola in 11242.4s.
MultiChoiceTaskConfig(name='C3')
Evaluation results of task C3:
Group all Accuracy: max = 74.895, median = 74.895, average = 74.895
Finish task C3 in 30808.4s.
GenerationTaskConfig(name='DRCD')
Evaluation results of task DRCD:
Group all:
Metric EM: max = 75.284, median = 40.323, average = 48.916
Metric F1: max = 75.359, median = 40.344, average = 48.954
Finish task DRCD in 247329.3s.
MultiChoiceTaskConfig(name='OCNLI_50K')
Evaluation results of task OCNLI_50K:
Group all Accuracy: max = 73.767, median = 73.767, average = 73.767
Finish task OCNLI_50K in 8183.8s.
MultiChoiceTaskConfig(name='CMNLI')
Evaluation results of task CMNLI:
Group all Accuracy: max = 75.189, median = 75.189, average = 75.189
Finish task CMNLI in 43612.5s.
MultiChoiceTaskConfig(name='CSL')
Evaluation results of task CSL:
Group all Accuracy: max = 50.000, median = 50.000, average = 50.000
Finish task CSL in 26050.0s.
GenerationTaskConfig(name='CMRC2018')
Evaluation results of task CMRC2018:
Group all:
Metric EM: max = 53.091, median = 28.611, average = 29.574
Metric F1: max = 53.885, median = 29.042, average = 30.057
Finish task CMRC2018 in 252918.1s.
MultiChoiceTaskConfig(name='CLUEWSC2020')
Evaluation results of task CLUEWSC2020:
Group all Accuracy: max = 82.237, median = 69.243, average = 64.364
Finish task CLUEWSC2020 in 13329.4s.
MultiChoiceTaskConfig(name='AFQMC')
Evaluation results of task AFQMC:
Group all Accuracy: max = 71.640, median = 69.856, average = 67.822
Finish task AFQMC in 57646.9s.
MultiChoiceTaskConfig(name='EPRSTMT')
Evaluation results of task EPRSTMT:
Group dev Accuracy: max = 92.500, median = 92.500, average = 92.500
Group test Accuracy: max = 91.475, median = 91.475, average = 91.475
Finish task EPRSTMT in 3073.1s.
MultiChoiceTaskConfig(name='CLUEWSCF')
Evaluation results of task CLUEWSCF:
Group dev Accuracy: max = 62.893, median = 56.604, average = 56.604
Group test Accuracy: max = 65.984, median = 58.094, average = 58.094
Finish task CLUEWSCF in 8895.2s.
MultiChoiceTaskConfig(name='CSLF')
Evaluation results of task CSLF:
Group dev Accuracy: max = 49.375, median = 49.375, average = 49.375
Group test Accuracy: max = 50.000, median = 50.000, average = 50.000
Finish task CSLF in 28883.4s.
MultiChoiceTaskConfig(name='CHIDF')
Evaluation results of task CHIDF:
Group dev Accuracy: max = 91.089, median = 91.089, average = 91.089
Group test Accuracy: max = 92.358, median = 92.358, average = 92.358
Finish task CHIDF in 10732.3s.
MultiChoiceTaskConfig(name='BUSTM')
CUDA out of memory. Tried to allocate 1.01 GiB (GPU 0; 23.69 GiB total capacity; 19.77 GiB already allocated; 912.75 MiB free; 21.28 GiB reserved in total by PyTorch)
MultiChoiceTaskConfig(name='OCNLIF')
Evaluation results of task OCNLIF:
Group dev Accuracy: max = 71.875, median = 71.875, average = 71.875
Group test Accuracy: max = 74.167, median = 74.167, average = 74.167
Finish task OCNLIF in 7344.2s.
GenerationTaskConfig(name='LAMBADA')
CUDA out of memory. Tried to allocate 1.65 GiB (GPU 1; 23.69 GiB total capacity; 18.99 GiB already allocated; 1.22 GiB free; 20.97 GiB reserved in total by PyTorch)
GenerationTaskConfig(name='LAMBADA-unidirectional')
CUDA out of memory. Tried to allocate 1.67 GiB (GPU 1; 23.69 GiB total capacity; 19.02 GiB already allocated; 1.56 GiB free; 20.70 GiB reserved in total by PyTorch)
LanguageModelTaskConfig(name='Pile')
IndexError: list index out of range
LanguageModelTaskConfig(name='Penn Treebank')
Evaluation results of task Penn Treebank:
Group test:
Finish task Penn Treebank in 0.0s.
LanguageModelTaskConfig(name='WikiText-103')
Evaluation results of task WikiText-103:
Group test:
Finish task WikiText-103 in 0.0s.
LanguageModelTaskConfig(name='WikiText-2')
Evaluation results of task WikiText-2:
Group test:
Finish task WikiText-2 in 0.0s.
MultiChoiceTaskConfig(name='MMLU')
Evaluation results of task MMLU:
Group stem Accuracy: max = 64.000, median = 38.298, average = 38.408
Group social_sciences Accuracy: max = 60.104, median = 52.728, average = 48.456
Group humanities Accuracy: max = 64.135, median = 49.691, average = 41.573
Group other Accuracy: max = 64.957, median = 47.085, average = 50.756
Group Overall Accuracy: max = 64.957, median = 46.207, average = 44.403
Finish task MMLU in 393690.1s.
MultiChoiceTaskConfig(name='CROWS')
Evaluating task CROWS:
Evaluating group test:
Evaluation results of task CROWS:
Finish task CROWS in 0.0s.
MultiChoiceTaskConfig(name='ETHOS_zeroshot')
Evaluation results of task ETHOS_zeroshot:
Group test:
Finish task ETHOS_zeroshot in 0.0s.
MultiChoiceTaskConfig(name='ETHOS_oneshot')
Evaluation results of task ETHOS_oneshot:
Group test:
Finish task ETHOS_oneshot in 0.0s.
MultiChoiceTaskConfig(name='ETHOS_fewshot_single')
Evaluation results of task ETHOS_fewshot_single:
Group test:
Finish task ETHOS_fewshot_single in 0.0s.
MultiChoiceTaskConfig(name='ETHOS_fewshot_multi')
Evaluation results of task ETHOS_fewshot_multi:
Group test:
Finish task ETHOS_fewshot_multi in 0.0s.
MultiChoiceTaskConfig(name='StereoSet')
Evaluation results of task StereoSet:
Finish task StereoSet in 0.0s.
Finish 10 tasks in 393690.3s