DNASequence/README.md
2024-09-21 23:53:49 +08:00

# DNASequence
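[Link to the original question](https://www.zhihu.com/question/36143261/answer/3624848144)
## How to build this project
> Make sure the xmake build tool and a C++ toolchain are installed and that both are on your PATH.
[How to install xmake](https://gitee.com/tboox/xmake#%E5%AE%89%E8%A3%85)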
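```bash
# Build; -v enables verbose output with detailed compile information (see the xmake link above for details)
xmake b -v
# Run
xmake r
# Generate a Visual Studio project folder; open the .sln file inside it to edit and debug in Visual Studio
xmake project -k vsxmake
```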
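## Code logic overview
DNASequence processing
DNA is double-stranded, and the two strands are complementary. When a DNA sample is sequenced, there is no way to tell which strand a read came from, so the program computes the complementary strand of every DNA fragment and puts the results together with the original file for assembly.
> The input format below is only a demonstration sample; its biological accuracy is not guaranteed. The default maximum supported DNA sequence length is 5e4; modify the code if you need more.
>
> The program opens filteredReads.txt in the project root directory and processes DNA sequences like the following: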
```
@SRR13280199.1 1 length=32
ACGTACACATTGCTGTCTGCTGAACCACCTAG
@SRR13280199.1 2 length=32
ACGTACACATTGCTGTCTGCTGAACCACCTAG
```
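The complement-and-reverse step described above can be sketched as follows. This is a minimal illustration, not the project's `reverseComplement(char*, char*)`: the names `complement_base` and `reverse_complement` are hypothetical, and the choice to pass through unknown characters such as `N` unchanged is an assumption.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (not from the project): maps one base to its
// Watson-Crick partner. Any other character (for example 'N') is kept as-is.
inline char complement_base(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:  return base;
    }
}

// Hypothetical helper: walks the read from its last base to its first and
// writes the complement of each base, producing the reverse complement.
inline std::string reverse_complement(const std::string& read) {
    std::string out(read.size(), '\0');
    std::transform(read.rbegin(), read.rend(), out.begin(), complement_base);
    return out;
}
```

Applied to the sample read above, `reverse_complement("ACGTACACATTGCTGTCTGCTGAACCACCTAG")` yields `CTAGGTGGTTCAGCAGACAGCAATGTGTACGT`.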
## pybind support
> After building, you get the dna.pyd file and the Python stub file dna.pyi. Usage:
```python
import dna
help(dna)
dna.dna_reverse("filteredReads.txt","reversedSequence.txt")
```
```
Help on module dna:
NAME
dna - DNASequence processing functions
FUNCTIONS
dna_reverse(...) method of builtins.PyCapsule instance
dna_reverse(input_file_path: str, output_file_path: str) -> None
DNA is double-stranded and complementary to each other, and when sequencing a DNA sample you can't be sure which strand is being measured, so the complementary strands of all the DNA fragments are counted and assembled together with the original file.
FILE
e:\file\dev\cpp\dnasequence\build\windows\x64\release\dna.pyd
Open input file stream to value [input_file_stream] ok , from ["filteredReads.txt"]
Open output file stream to value [output_file_stream] ok , from ["reversedSequence.txt"]
Chunk size :4294967296 bytes
[Timer: All spent] Start timing
[Timer: chunk_id:[1]] Start timing
[Timer: read_chunk_id:[1]] Start timing
[Timer: read_chunk_id:[1]] Stop timing , used 1253ms
buf_len : 897963094
[Timer: calculate_chunk_id:[1]] Start timing
[Timer: calculate_chunk_id:[1]] Stop timing , used 204ms
[Timer: write_chunk_id:[1] , [Wrote bytes] start_pos : 897963094] Start timing
[Timer: write_chunk_id:[1] , [Wrote bytes] start_pos : 897963094] Stop timing , used 1727ms
[Timer: chunk_id:[1]] Stop timing , used 3185ms
[Timer: All spent] Stop timing , used 3186ms
```
# Note!
> Please do not delete the newline at the end of the last line of the input.
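
If you want to verify this before running the tool, a check along the following lines may help. It is a sketch under assumptions: `ends_with_newline` is a hypothetical helper that is not part of this project, and it only inspects the final byte of the file.

```cpp
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper (not from the project): returns true only if the file
// can be opened, is non-empty, and its last byte is '\n'.
inline bool ends_with_newline(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;                 // missing or unreadable file

    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size <= 0) return false;           // empty file or failed tellg

    in.seekg(-1, std::ios::end);           // position on the final byte
    char last = '\0';
    in.get(last);
    return last == '\n';
}
```

For example, `ends_with_newline("filteredReads.txt")` should return `true` for an input file that still has its trailing newline.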
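## Prerequisites
> To run the compiled code on a machine without Visual Studio installed, install the VC runtime from the URL below.
>
> The required Windows DLLs are also in the dll folder; you can try those.
[https://learn.microsoft.com/zh-cn/cpp/windows/latest-supported-vc-redist?view=msvc-170#visual-studio-2015-2017-2019-and-2022](https://learn.microsoft.com/zh-cn/cpp/windows/latest-supported-vc-redist?view=msvc-170#visual-studio-2015-2017-2019-and-2022)
> This project uses OpenMP for parallel acceleration, and it is enabled by default. OpenMP ships with C++ compilers by default; please use a reasonably recent compiler.
> On Windows, OpenMP acceleration does not seem to work with MinGW, but both Visual Studio (MSVC) and Clang can build the project on Windows.
> Do not build with MinGW. On Windows use Clang or VS (MSVC); on Linux, gcc (g++) is sufficient.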
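
To illustrate the kind of loop that OpenMP can split across threads, here is a minimal sketch. It is not the project's code: `complement_in_place` is a hypothetical function, and it assumes the compiler is invoked with its OpenMP switch (`-fopenmp` for GCC and Clang, `/openmp` for MSVC). Without that switch the pragma is ignored and the loop simply runs on one thread.

```cpp
#include <cstddef>
#include <string>

// Hypothetical example (not from the project): complements every base of a
// buffer in place. Each iteration reads and writes only its own byte, so the
// iterations are independent and need no synchronization.
inline void complement_in_place(std::string& bases) {
    const long long n = static_cast<long long>(bases.size());
    // A signed loop index is used because OpenMP 2.0, which MSVC implements
    // by default, requires one.
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        char& b = bases[static_cast<std::size_t>(i)];
        switch (b) {
            case 'A': b = 'T'; break;
            case 'T': b = 'A'; break;
            case 'C': b = 'G'; break;
            case 'G': b = 'C'; break;
            default: break;  // newlines, 'N', and anything else stay unchanged
        }
    }
}
```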
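## Chunk memory size defaults to 4 GB; increase it if you have more memory
```cpp
// How it works: define the processing function here and pass it to dna::open_file_and_calculate, which calls the function it is given.
// Parameter list: <file chunk size in memory, maximum length of a single DNA sequence>("input file name", "output file name", sequence-processing function);
// This function is located in src/tools/dna
dna::open_file_and_calculate<(size_t)4 * 1024 * 1024 * 1024, (size_t)5e4 + 5>("filteredReads.txt", "reversedSequence.txt", reverseComplement);
```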
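## ***Please read the xmake.lua project configuration file carefully; it may affect performance optimization and calculation precision***
> Preferably avoid MinGW; use mingw+clang (that is, Clang) or MSVC (Visual Studio).
>
> MinGW's I/O optimization is poor.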
## Performance
> Which environment gives the best performance?
> In testing, builds compiled with Visual Studio on Windows performed best.
> perf
```
Samples: 6K of event 'task-clock:ppp', Event count (approx.): 1541250000
Overhead  Command  Shared Object     Symbol
  73.72%  test     test              [.] reverseComplement(char*, char*)
   5.47%  test     [unknown]         [k] 0xffffffffa842aee0
   3.02%  test     [unknown]         [k] 0xffffffffc06abd30
   2.12%  test     [unknown]         [k] 0xffffffffa7a5ba37
   1.98%  test     [unknown]         [k] 0xffffffffa84435e1
   1.15%  test     [unknown]         [k] 0xffffffffa8443ee5
   0.99%  test     [unknown]         [k] 0xffffffffa760ecee
   0.92%  test     [unknown]         [k] 0xffffffffa83ab787
   0.86%  test     [unknown]         [k] 0xffffffffa72d138b
   0.76%  test     [unknown]         [k] 0xffffffffa7309ed4
   0.57%  test     libc.so.6         [.] __memchr_evex
   0.44%  test     [unknown]         [k] 0xffffffffc0671725
   0.26%  test     libc.so.6         [.] __memset_evex_unaligned_erms
   0.24%  test     [unknown]         [k] 0xffffffffa76c08f8
   0.23%  test     [unknown]         [k] 0xffffffffa7a5ad88
   0.21%  test     libgomp.so.1.0.0  [.] 0x0000000000024ab2
   0.21%  test     [unknown]         [k] 0xffffffffa8443d57
   0.18%  test     [unknown]         [k] 0xffffffffa739e005
   0.13%  test     [unknown]         [k] 0xffffffffa75b7cdb
   0.13%  test     [unknown]         [k] 0xffffffffc067013b
   0.11%  test     [unknown]         [k] 0xffffffffa76168b4
```
> Processing performance on an 800 MB FASTQ DNA sequence file
```
Open input file stream to value [input_file_stream] ok , from ["filteredReads.txt"]
Open output file stream to value [output_file_stream] ok , from ["reversedSequence.txt"]
Chunk size :4294967296 bytes
[Timer: All spent] Start timing
[Timer: chunk_id:[1]] Start timing
[Timer: read_chunk_id:[1]] Start timing
[Timer: read_chunk_id:[1]] Stop timing , used 1031ms
buf_len : 897963094
[Timer: calculate_chunk_id:[1]] Start timing
omp_get_num_threads() : 12
[Timer: calculate_chunk_id:[1]] Stop timing , used 277ms
[Timer: write_chunk_id:[1] , [Wrote bytes] start_pos : 897963094] Start timing
[Timer: write_chunk_id:[1] , [Wrote bytes] start_pos : 897963094] Stop timing , used 1169ms
[Timer: chunk_id:[1]] Stop timing , used 2479ms
[Timer: All spent] Stop timing , used 2479ms
```
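## Copyright
The copyright of this project's algorithm belongs to the person who asked the original question, who does not have to follow the open-source license. Everyone else must follow the open-source license when using it; otherwise, feel free to contact me.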