The algorithm

Input file

The input is bam file with mapped reads sorted by the start mapping point.

Step 1 - grouping read according start mapping position and finding their average position

The algorithm go over the reads and check if the next read is “close” (parameter) to the average of the start mapping points of the previous reads in the “last positions” (parameter).

  • If the read is “close”, it is attached to the group of the previous reads and the algorithm recalculate the average in “last positions” with the mapping position the current read and continues to check the next read.
  • Else, the read don’t belong to the group of the previous reads, and the group is closed and is sent to the next step.

Parameters:

–local-average-num-positions (“last positions”): number of positions before the position of the current read for them we calculate the average of the mapped reads.

–max-dist-external-edge-from-average (“close”): the maximum distance in bases of the current read from the average of the previous reads (in “last positions”), for which the read called “close” to the group of the previous reads.

Step 2 - finding average of each start and end edges of the exons in the group from step 1

For each group from step 1:

Go over the start and end of the exons of the reads in the group from step 1, and for each start of an exon i (called “internal edge i”) do:

  1. Sort ascendingly the reads according the mapping point of “internal edge i”.
  2. Go over the “internal edge i” of the reads and check if it is “close” (parameter) to the average of the “internal edges i” of the previous reads in the “last positions” (parameter).
  • If “internal edge i” is “close”, it is attached to the sub-group of “internal edges i” and the algorithm recalculate the average of “internal edges i” in “last positions” with the mapping position the current exon and continues to check the “internal edge i” of the next read

  • Else, the “internal edge i” of the read don’t belong to the sub-group of the “internal edge i” and became to be “internal edge i+1” (and thus the enumerate of the all next start points of the exons of this read are uploaded by 1). In the next iteration of the loop the current “internal edge i” will be recalculate with the i+1 sub-group.

    Parameters:

    –local-average-num-positions (“last positions”): - the same parameter like in the step 1. number of positions before the position of the current read for them we calculate the average of the mapped reads.

    –max-dist-internal-edge-from-average (“close”): the maximum distance in bases of the current read from the average of the previous reads (in “last positions”), for which the read called “close” to the group of the previous reads.

  1. Return on step 2 also for the end points of the exons.
  • In the end edge of the last exon we defined “close” with the parameter from step 1: –max-dist-external-edge-from-average

Stranded protocol

  • In non-stranded protocol we cannot know which of the external edges of the read is the start of the transcript and which is the end. Therefore we use only in 2 parameters for declaring edges “external” - for the two edges of the read, and “internal” for the internal edges of the exons:

    –max-dist-internal-edge-from-average and

    –max-dist-external-edge-from-average.

  • In stranded protocol we know what is the start and the end sides of the transcript, so we use with other 2 parameters instead the parameter –max-dist-internal-edge-from-average:

    –max-dist-first-edge-from-average and

    –max-dist-end-edge-from-average.

Generally, the end edge is more noisy, so the user can enable more flexibility.

In stranded protocol the algorithm get separate bam file for each strand and run on them separately. The package contains a script for the separating the bam to the differenct strands.

Step 3 - build transcripts

For each group from step 2, go over on each read in the group, and combine the reads with the same average of edges - internal and external. Each such group called transcript.

Output

Gtf file - contains the transcripts and its exons. Each transcript has the counts of reads in this transcript.

Step 4 (optional) - intersection with known gtf

If the user supply input of gtf with known genes, the output will contain also the intersection of each transcript and exon with the nearest known gene and the percent of the overlapping.