Tracking down a puzzling Python bug

Example input table (tab-separated, saved as qc.stat):

library reads_num reads_len bases_num GC avgQ Q20 Q30
A-Ctrl 46,875,170 150.0 7.00 Gb 49.90% 36.6 93.24% 85.97%
B-Ctrl 154,351,882 150.0 23.00 Gb 45.40% 36.9 93.80% 86.93%
C-Ctrl 58,126,912 150.0 8.00 Gb 60.05% 36.05 92.19% 83.96%
D-Ctrl 143,761,494 150.0 21.00 Gb 42.00% 37.25 94.46% 88.15%

First attempt

# Produces the wrong result
table = open('qc.stat', 'r')
header = [[],] * len(table.readline().strip().split('\t'))
for i in table:
    for x, y in enumerate(i.strip().split('\t')):
        header[x].append(y)
header
# table.close()

Output

[['A',
  '46,875,170',
  '150.0',
  '7.00 Gb',
  '49.90%',
  '36.6',
  '93.24%',
  '85.97%',
  'B',
  '154,351,882',
  '150.0',
  '23.00 Gb',
  '45.40%',
  '36.9',
  '93.80%',
  '86.93%',
  'C',
  '58,126,912',
  '150.0',
  '8.00 Gb',
  '60.05%',
  '36.05',
  '92.19%',
  '83.96%',
  'D',
  '143,761,494',
  '150.0',
  '21.00 Gb',
  '42.00%',
  '37.25',
  '94.46%',
  '88.15%'],
 ['A',
  '46,875,170',
  '150.0',
  '7.00 Gb',
  '49.90%',
  '36.6',
  '93.24%',
  '85.97%',
  'B',
  '154,351,882',
  '150.0',
  '23.00 Gb',
  '45.40%',
  '36.9',
  '93.80%',
  '86.93%',
  'C',
  '58,126,912',
  '150.0',
  '8.00 Gb',
  '60.05%',
  '36.05',
  '92.19%',
  '83.96%',
  'D',
  '143,761,494',
  '150.0',
  '21.00 Gb',
  '42.00%',
  '37.25',
  '94.46%',
  '88.15%'],
 ...]

Correct parsing

# Correct code: each column gets its own list, seeded with its header name
table = open('qc.stat', 'r')
header = [[i,] for i in table.readline().strip().split('\t')]
for i in table:
    for x, y in enumerate(i.strip().split('\t')):
        header[x].append(y)
header
# table.close()

Output

[['library', 'A', 'B', 'C', 'D'],
 ['reads_num', '46,875,170', '154,351,882', '58,126,912', '143,761,494'],
 ['reads_len', '150.0', '150.0', '150.0', '150.0'],
 ['bases_num', '7.00 Gb', '23.00 Gb', '8.00 Gb', '21.00 Gb'],
 ['GC', '49.90%', '45.40%', '60.05%', '42.00%'],
 ['avgQ', '36.6', '36.9', '36.05', '37.25'],
 ['Q20', '93.24%', '93.80%', '92.19%', '94.46%'],
 ['Q30', '85.97%', '86.93%', '83.96%', '88.15%']]
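The same column-wise result can also be reached by transposing the rows instead of appending cell by cell. The following is only a sketch under stated assumptions, not the code used in this post: it assumes Python 3, the helper name `read_columns` is made up for illustration, and a small in-memory sample (two data rows, three columns) stands in for qc.stat.

```python
import csv
import io

# Hypothetical sample standing in for the tab-separated qc.stat file.
SAMPLE = (
    "library\treads_num\treads_len\n"
    "A\t46,875,170\t150.0\n"
    "B\t154,351,882\t150.0\n"
)


def read_columns(handle):
    """Return one list per column, with the header name as its first element."""
    rows = csv.reader(handle, delimiter="\t")
    # zip(*rows) transposes rows into columns; list() builds a separate
    # list object for every column, so no two columns share storage.
    return [list(column) for column in zip(*rows)]


columns = read_columns(io.StringIO(SAMPLE))
print(columns)
```

With a real file, the handle would typically come from `with open('qc.stat', newline='') as handle:`, which also closes the file when the block ends. Note that `zip` stops at the shortest row, so a ragged table would be silently truncated.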

Root cause analysis

# Finding the cause: compare the identity of each inner list

head = '#library\treads_num\treads_len\tbases_num\tGC\tavgQ\tQ20\tQ30'

header = [[], ] * len(head.strip().split('\t'))
print('error method...')
for i in header:
    print(id(i))

header = [[i,] for i in head.strip().split('\t')]
print('\nright method...')
for i in header:
    print(id(i))

Output

error method...
4435237272
4435237272
4435237272
4435237272
4435237272
4435237272
4435237272
4435237272

right method...
4435236984
4435354904
4435318688
4435237416
4435318040
4435319768
4435318472
4435319624

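The cause lies in how Python binds names to objects: a name (or a list slot) is a reference to an object in memory, one object can have any number of references, and a mutable object can be modified through any of them. `[[],] * n` does not create n empty lists; it creates one empty list and stores n references to it, which is why all eight ids printed by the first method are identical (in CPython, `id()` returns the object's memory address). Every `header[x].append(y)` therefore appended to the same list, whatever the value of `x`. The list comprehension builds a new inner list on each iteration, so each column gets its own object.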
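The effect can be reproduced without any input file. The snippet below is a minimal sketch; the names `shared` and `independent` are chosen here for illustration only.

```python
# Multiplying a list repeats references to the same inner object.
shared = [[]] * 3
shared[0].append("x")
print(shared)                   # all three slots show the appended value
print(shared[0] is shared[1])   # True: both slots refer to one list

# A list comprehension evaluates [] anew on every iteration.
independent = [[] for _ in range(3)]
independent[0].append("x")
print(independent)              # only the first slot changed
print(independent[0] is independent[1])  # False: distinct lists
```

The `is` operator compares object identity, so it gives the same answer as comparing the `id()` values printed above.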
