DNASequence/README.md
2024-09-21 23:28:41 +08:00


DNASequence

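Link to the asker's original question

How to build this project

Make sure the xmake build tool and a C++ toolchain are installed, and that both are on your PATH.

How to install xmake: link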

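# Build; -v enables verbose output with detailed compile information (see the xmake link above for details)
xmake b -v dna_pybind
# Run
xmake r

# Generate a Visual Studio folder; open the .sln file inside it to edit and debug with Visual Studio
xmake project -k vsxmake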

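Code logic overview

DNASequence processing

DNA is double-stranded, and the two strands are complementary. When a DNA sample is sequenced, there is no way to tell which strand a read came from, so the program computes the complementary strand of every DNA fragment and puts the results together with the original file for assembly.

The input format is only a demonstration sample; its biological accuracy is not guaranteed. The default maximum supported DNA sequence length is 5e4; modify the code if you need a larger limit.

The program opens filteredReads.txt in the project root directory and processes DNA sequences like the following: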

@SRR13280199.1 1 length=32
ACGTACACATTGCTGTCTGCTGAACCACCTAG
@SRR13280199.1 2 length=32
ACGTACACATTGCTGTCTGCTGAACCACCTAG
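
The per-record transformation can be sketched in Python as follows. This is a minimal sketch, not the project's C++ implementation. It assumes each record is a header line starting with `@` followed by one sequence line (as in the sample above), and that the program emits the reverse complement, as the function name `reverseComplement` suggests. The helper names `reverse_complement` and `complement_reads` are hypothetical, and the output layout (header kept, sequence replaced) is also an assumption.

```python
# Translation table mapping each base to its complement (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def complement_reads(lines):
    """Yield header lines unchanged and sequence lines reverse-complemented.

    Assumes the two-line record layout from the sample input; a full
    four-line FASTQ record would need extra handling.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("@"):
            yield line
        elif line:
            yield reverse_complement(line)


sample = [
    "@SRR13280199.1 1 length=32\n",
    "ACGTACACATTGCTGTCTGCTGAACCACCTAG\n",
]
result = list(complement_reads(sample))
```

Applying `reverse_complement` twice returns the original read, which makes a convenient sanity check.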

pybind support

After building, you get the dna.pyd module and the Python stub file dna.pyi. Usage:

import dna

help(dna)

dna.dna_reverse("filteredReads.txt","reversedSequence.txt")
Help on module dna:

NAME
    dna - DNASequence processing functions

FUNCTIONS
    dna_reverse(...) method of builtins.PyCapsule instance
        dna_reverse(input_file_path: str, output_file_path: str) -> None

        DNA is double-stranded and complementary to each other, and when sequencing a DNA sample you can't be sure which strand is being measured, so the complementary strands of all the DNA fragments are counted and assembled together with the original file.

FILE
    e:\file\dev\cpp\dnasequence\build\windows\x64\release\dna.pyd


Open input file stream to value [input_file_stream] ok , from ["filteredReads.txt"]
Open output file stream to value [output_file_stream] ok , from ["reversedSequence.txt"]
Chunk size :4294967296 bytes
[Timer: All spent] Start timing
[Timer: chunk_id:[1]] Start timing
[Timer: read_chunk_id:[1]] Start timing
[Timer: read_chunk_id:[1]] Stop timing , used 1253ms
buf_len : 897963094
[Timer: calculate_chunk_id:[1]] Start timing
[Timer: calculate_chunk_id:[1]] Stop timing , used 204ms
[Timer: write_chunk_id:[1] , [Wrote bytes] start_pos : 897963094] Start timing
[Timer: write_chunk_id:[1] , [Wrote bytes] start_pos : 897963094] Stop timing , used 1727ms
[Timer: chunk_id:[1]] Stop timing , used 3185ms
[Timer: All spent] Stop timing , used 3186ms

Note!

Please keep the newline at the end of the last line of the input file.
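
One way to guard against a missing final newline before calling the program is a small pre-check. The sketch below is an illustration only: `ensure_trailing_newline` is a hypothetical helper and not part of the `dna` module, and the demo runs on a temporary file rather than a real read set.

```python
import os
import tempfile


def ensure_trailing_newline(path: str) -> bool:
    """Append a newline to the file if its last byte is not one.

    Returns True when the file was modified, False otherwise.
    """
    if os.path.getsize(path) == 0:
        return False  # nothing to terminate in an empty file
    with open(path, "rb+") as f:
        f.seek(-1, os.SEEK_END)
        if f.read(1) == b"\n":
            return False
        f.seek(0, os.SEEK_END)
        f.write(b"\n")
        return True


# Demo on a temporary file whose last line lacks a newline.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "filteredReads.txt")
    with open(demo_path, "wb") as f:
        f.write(b"@read 1\nACGT")
    first_call = ensure_trailing_newline(demo_path)   # file gets fixed
    second_call = ensure_trailing_newline(demo_path)  # already terminated
    with open(demo_path, "rb") as f:
        fixed = f.read()
```

Only the last byte is inspected, so the check stays cheap even for input files of several hundred megabytes.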

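Prerequisites

If you want to run the compiled code on a machine without Visual Studio installed, install the VC Runtime from the URL below.

The required Windows DLLs are also in the dll folder; you can try those.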

https://learn.microsoft.com/zh-cn/cpp/windows/latest-supported-vc-redist?view=msvc-170#visual-studio-2015-2017-2019-and-2022

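This project uses OpenMP for parallel acceleration, and it is enabled by default. OpenMP support generally ships with C++ compilers, but please use a reasonably recent one. On Windows, MinGW does not seem to get any speedup from OpenMP, while Visual Studio (MSVC) and Clang both build the project fine. Do not build with MinGW: on Windows use Clang or VS (MSVC); on Linux, gcc (g++) is enough.

The chunk memory size defaults to 4 GiB; increase it if you have more memory available.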

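// How it works: define the processing function here and pass it to dna::open_file_and_calculate, which then calls the function it was given
// Parameters: <file chunk memory size, maximum length of a single DNA sequence>("input file name", "output file name", sequence processing function);
// This function lives in src/tools/dna
dna::open_file_and_calculate<(size_t)4 * 1024 * 1024 * 1024, (size_t)5e4 + 5>("filteredReads.txt", "reversedSequence.txt", reverseComplement);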
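
The callback design described in the comments above can be illustrated with a short Python sketch. It is not the project's implementation: `process_in_chunks` and `to_upper` are hypothetical names, the chunk size is a runtime argument rather than a template parameter, and carrying a partial line over to the next block is an assumption about how chunk boundaries might be handled.

```python
import os
import tempfile


def process_in_chunks(input_path, output_path, handle_line, chunk_size=1 << 20):
    """Read the input in fixed-size blocks and pass each complete line to handle_line."""
    carry = b""  # partial line left over from the previous block
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        while True:
            block = src.read(chunk_size)
            if not block:
                break
            data = carry + block
            cut = data.rfind(b"\n") + 1  # 0 when no complete line is available yet
            carry = data[cut:]
            # split leaves one empty trailing element after the last newline; drop it
            for line in data[:cut].split(b"\n")[:-1]:
                dst.write(handle_line(line) + b"\n")
        if carry:  # final line that had no trailing newline
            dst.write(handle_line(carry) + b"\n")


def to_upper(line: bytes) -> bytes:
    """Stand-in for a sequence-processing function."""
    return line.upper()


# Demo with a tiny chunk size so lines straddle block boundaries.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.txt")
    dst_path = os.path.join(tmp, "out.txt")
    with open(src_path, "wb") as f:
        f.write(b"ab\ncdefg\nhi\n")
    process_in_chunks(src_path, dst_path, to_upper, chunk_size=4)
    with open(dst_path, "rb") as f:
        output = f.read()
```

Passing the handler as an argument keeps the file handling separate from the per-sequence logic, so a different transformation can reuse the same chunking loop.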

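Please read the xmake.lua project configuration file carefully; its settings may affect performance optimization and numerical precision.

Preferably avoid MinGW; use MinGW + Clang (that is, Clang) or MSVC (Visual Studio).

MinGW's I/O optimization is poor.

Performance

Which environment performs best?

In testing, builds compiled with Visual Studio on Windows performed best.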

perf

Samples: 6K of event 'task-clock:ppp', Event count (approx.): 1541250000
Overhead  Command  Shared Object         Symbol
  73.72%  test     test                  [.] reverseComplement(char*, char*)
   5.47%  test     [unknown]             [k] 0xffffffffa842aee0
   3.02%  test     [unknown]             [k] 0xffffffffc06abd30
   2.12%  test     [unknown]             [k] 0xffffffffa7a5ba37
   1.98%  test     [unknown]             [k] 0xffffffffa84435e1
   1.15%  test     [unknown]             [k] 0xffffffffa8443ee5
   0.99%  test     [unknown]             [k] 0xffffffffa760ecee
   0.92%  test     [unknown]             [k] 0xffffffffa83ab787
   0.86%  test     [unknown]             [k] 0xffffffffa72d138b
   0.76%  test     [unknown]             [k] 0xffffffffa7309ed4
   0.57%  test     libc.so.6             [.] __memchr_evex
   0.44%  test     [unknown]             [k] 0xffffffffc0671725
   0.26%  test     libc.so.6             [.] __memset_evex_unaligned_erms
   0.24%  test     [unknown]             [k] 0xffffffffa76c08f8
   0.23%  test     [unknown]             [k] 0xffffffffa7a5ad88
   0.21%  test     libgomp.so.1.0.0      [.] 0x0000000000024ab2
   0.21%  test     [unknown]             [k] 0xffffffffa8443d57
   0.18%  test     [unknown]             [k] 0xffffffffa739e005
   0.13%  test     [unknown]             [k] 0xffffffffa75b7cdb
   0.13%  test     [unknown]             [k] 0xffffffffc067013b
   0.11%  test     [unknown]             [k] 0xffffffffa76168b4

Performance on an 800 MB FASTQ DNA sequence file

Open input file stream to value [input_file_stream] ok , from ["filteredReads.txt"]
Open output file stream to value [output_file_stream] ok , from ["reversedSequence.txt"]
Chunk size :4294967296 bytes
[Timer: All spent] Start timing
[Timer: chunk_id:[1]] Start timing
[Timer: read_chunk_id:[1]] Start timing
[Timer: read_chunk_id:[1]] Stop timing , used 1031ms
buf_len : 897963094
[Timer: calculate_chunk_id:[1]] Start timing
omp_get_num_threads() : 12
[Timer: calculate_chunk_id:[1]] Stop timing , used 277ms
[Timer: write_chunk_id:[1] , [Wrote bytes] start_pos : 897963094] Start timing
[Timer: write_chunk_id:[1] , [Wrote bytes] start_pos : 897963094] Stop timing , used 1169ms
[Timer: chunk_id:[1]] Stop timing , used 2479ms
[Timer: All spent] Stop timing , used 2479ms
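
To compare runs, the timer lines in a log like the one above can be reduced to per-stage durations and an overall throughput. The sketch below is an illustration only: it assumes the log line format shown above stays stable, and `stage_durations` is a hypothetical helper, not something the project provides.

```python
import re

# Matches lines such as "[Timer: All spent] Stop timing , used 2479ms".
TIMER_RE = re.compile(r"\[Timer: (.+)\] Stop timing , used (\d+)ms")


def stage_durations(log_text):
    """Map each timer name to its reported duration in milliseconds."""
    return {m.group(1): int(m.group(2)) for m in TIMER_RE.finditer(log_text)}


log = """\
[Timer: read_chunk_id:[1]] Stop timing , used 1031ms
buf_len : 897963094
[Timer: calculate_chunk_id:[1]] Stop timing , used 277ms
[Timer: write_chunk_id:[1] , [Wrote bytes] start_pos : 897963094] Stop timing , used 1169ms
[Timer: All spent] Stop timing , used 2479ms
"""

durations = stage_durations(log)
total_ms = durations["All spent"]
buf_len = int(re.search(r"buf_len : (\d+)", log).group(1))

# Bytes processed per second of wall time, expressed in MB/s (1 MB = 1e6 bytes).
throughput_mb_s = buf_len / 1e6 / (total_ms / 1000)
# Fraction of the total time spent in the calculate stage.
compute_share = durations["calculate_chunk_id:[1]"] / total_ms
```

In this run the read and write stages together take 2200 ms of the 2479 ms total, so file I/O rather than the sequence computation dominates the wall time.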

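Copyright

Copyright in this project's algorithm belongs to the asker, who is not required to follow the open-source license. Anyone else using it must follow the open-source license; you are also welcome to contact me with questions.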