using Google Colab's GPU #914

Ryosuke-254 · 2024-12-03T15:43:25Z

I want to run the following code on Google Colab's GPU. The GPU is used briefly, but for most of the run it is not being used, which is causing problems. Do you have any suggestions for improvement?
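To put a number on "briefly utilized", GPU utilization can be sampled while the search runs. The following is a minimal sketch, not part of the original notebook: the helper names `parse_gpu_utilization` and `sample_gpu_utilization` are hypothetical, and it assumes `nvidia-smi` is on the PATH, as it is on a Colab GPU runtime.

```python
import subprocess
import time


def parse_gpu_utilization(nvidia_smi_output: str) -> list[int]:
    """Parse one utilization percentage per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def sample_gpu_utilization(samples: int = 10, interval_s: float = 1.0) -> list[list[int]]:
    """Poll nvidia-smi `samples` times and return one list of readings per poll."""
    cmd = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]
    readings = []
    for _ in range(samples):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        readings.append(parse_gpu_utilization(out))
        time.sleep(interval_s)
    return readings
```

Because a notebook cell blocks while `mmseqs search` is running, the sampling would have to run in a background thread or a separate terminal to overlap with the search.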

# Install required tools

!apt-get update -qq
!apt-get install -y -qq wget tar cmake build-essential

# Download and extract MMseqs2 (GPU build)

!wget https://mmseqs.com/latest/mmseqs-linux-gpu.tar.gz -O mmseqs-linux-gpu.tar.gz
!tar xvzf mmseqs-linux-gpu.tar.gz
!mv mmseqs/bin/mmseqs /usr/local/bin/

# Install the CUDA toolkit

!apt-get install -y -qq nvidia-cuda-toolkit
!nvcc --version # Check that CUDA is installed

# Install PyCUDA and other Python libraries

!pip install -q pycuda biopython pandas

# Check GPU status on Google Colab

!nvidia-smi

# Create the MMseqs2 working directory

import os
work_dir = "./mmseqs_work"
os.makedirs(work_dir, exist_ok=True)

# Specify the input FASTA file

input_fasta = "/content/Book2test.fasta" # Change the file path as needed

# Create the MMseqs2 database (only once)

!mmseqs createdb {input_fasta} {work_dir}/db

# Convert the database to a GPU-compatible format (using makepaddedseqdb)

!mmseqs makepaddedseqdb {work_dir}/db {work_dir}/db_gpu

# Pairwise search of the database against itself (using the GPU)

search_result_path = os.path.join(work_dir, "search_result")
tmp_dir = os.path.join(work_dir, "tmp")
os.makedirs(tmp_dir, exist_ok=True)

!mmseqs search {work_dir}/db {work_dir}/db_gpu {search_result_path} {tmp_dir} --min-seq-id 0.8 --threads 4 --search-type 3 --gpu 1 || echo "Search failed!"

# Check whether the .m8 file was generated

!ls {search_result_path}.m8 || echo "No .m8 file found!"

# Parse the output

import pandas as pd
from Bio import SeqIO

# Specify the MMseqs2 output file

search_result_m8 = f"{search_result_path}.m8" # Path to the MMseqs2 output file

# Read the MMseqs2 output format

columns = ["query", "target", "pident", "alnlen", "mismatch", "gapopen", "qstart", "qend", "tstart", "tend", "evalue", "bits"]

try:
    results = pd.read_csv(search_result_m8, sep="\t", names=columns)

    # Extract query sequences with less than 80% sequence identity
    filtered_results = results[results["pident"] < 80]
    unique_query_ids = set(filtered_results["query"])

    # Extract the matching sequences from the original FASTA
    filtered_sequences = {rec.id: rec for rec in SeqIO.parse(input_fasta, "fasta") if rec.id in unique_query_ids}
    output_fasta = "/content/filtered_sequences.fasta"

    # Save the extracted sequences in FASTA format
    with open(output_fasta, "w") as f:
        SeqIO.write(filtered_sequences.values(), f, "fasta")

    print(f"Filtered sequences saved to: {output_fasta}")

except FileNotFoundError:
    print(f"Error: MMseqs2 output file not found at {search_result_m8}")
except Exception as e:
    print(f"Unexpected error: {e}")
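One thing to check in the filtering step above: the search is run with `--min-seq-id 0.8`, so hits below 80% identity should already be excluded from the result, and a `pident < 80` filter may then select nothing. The sketch below isolates the filter logic on in-memory text so it can be checked without running MMseqs2. The example rows are made up, and the `identity_is_fraction` flag is a hypothetical helper for the case where the identity column is reported as a 0-1 fraction rather than a percentage (which scale applies depends on the MMseqs2 version and `--format-output` setting, so treat it as an assumption).

```python
import csv
import io

M8_COLUMNS = ["query", "target", "pident", "alnlen", "mismatch", "gapopen",
              "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def queries_below_identity(m8_text: str, max_identity_percent: float,
                           identity_is_fraction: bool = False) -> set[str]:
    """Return query IDs that have at least one hit below the identity cutoff."""
    reader = csv.DictReader(io.StringIO(m8_text), fieldnames=M8_COLUMNS, delimiter="\t")
    selected = set()
    for row in reader:
        identity = float(row["pident"])
        if identity_is_fraction:
            identity *= 100.0  # normalize a 0-1 fraction to a percentage
        if identity < max_identity_percent:
            selected.add(row["query"])
    return selected


# Made-up rows for illustration only (not real MMseqs2 output).
example = (
    "seqA\tseqB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "seqC\tseqD\t72.5\t80\t22\t0\t1\t80\t1\t80\t1e-20\t90\n"
)
print(queries_below_identity(example, 80))
```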

low deletions false
Filter MSA 1
Use filter only at N seqs 0
Maximum seq. id. threshold 0.9
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Pseudo count mode 0
Profile output mode 0
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 3
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files false
Translation mode 0

ungappedprefilter ./mmseqs_work/db ./mmseqs_work/db_gpu ./mmseqs_work/tmp/14843528504956813129/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -c 0 -e 0.001 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --min-ungapped-score 15 --max-seqs 300 --db-load-mode 0 --gpu 1 --gpu-server 0 --prefilter-mode 1 --threads 4 --compressed 0 -v 3

[=================================================================] 100.00% 25.33K 3m 2s 739ms
Time for merging to pref_0: 0h 0m 0s 4ms
Time for processing: 0h 3m 2s 790ms
align ./mmseqs_work/db ./mmseqs_work/db_gpu ./mmseqs_work/tmp/14843528504956813129/pref_0 ./mmseqs_work/search_result --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.8 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 4 --compressed 0 -v 3

Compute score and coverage
Query database size: 25329 type: Aminoacid
Target database size: 25329 type: Aminoacid
Calculation of alignments
^C
ls: cannot access './mmseqs_work/search_result.m8': No such file or directory
No .m8 file found!
Error: MMseqs2 output file not found at ./mmseqs_work/search_result.m8
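
Two things stand out in the log above. First, the run was interrupted (`^C`) during the `align` stage, so no result was written. Second, only the `ungappedprefilter` command line carries `--gpu 1`; the `align` command line does not, which would be consistent with the GPU being busy only for part of the run. Separately, `mmseqs search` writes a result database rather than a `.m8` file, so a `convertalis` step is normally needed before `search_result.m8` can exist. The sketch below only builds that command. The helper name `build_convertalis_cmd` is hypothetical, and passing the padded database as the target is an assumption.

```python
import shlex


def build_convertalis_cmd(query_db: str, target_db: str, result_db: str, out_m8: str) -> list[str]:
    """Build the argument list for `mmseqs convertalis` (result DB -> tabular file)."""
    return ["mmseqs", "convertalis", query_db, target_db, result_db, out_m8]


cmd = build_convertalis_cmd(
    "./mmseqs_work/db",
    "./mmseqs_work/db_gpu",          # assumption: same target DB that was searched
    "./mmseqs_work/search_result",
    "./mmseqs_work/search_result.m8",
)
print(shlex.join(cmd))
```

In the notebook, the printed command could be run as a `!` cell after the search has finished, or the list could be passed to `subprocess.run(cmd, check=True)`.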
