News


Jan. 1, 2017

Whole-genome sequencing product

Note: SNP and INDEL variants are annotated with amino acid, intron/exon, protein, and related information using VEP and SnpEff. Additional data are provided in BAM and/or FASTQ format.

As a whole-genome sequencing (WGS) product, this represents the most comprehensive genetic analysis available today and greatly expands the scope of our product line. We recently completed a pilot study, and the impressive results have led us to officially launch this as a standard product. It will supplement our existing Y Elite and Y Prime tests, which target the Y chromosome and are only applicable to males. By using Illumina's groundbreaking HiSeq X next-generation sequencing platform, we are able to provide high-quality WGS results at a best-in-market sub-$2000 price for individual customers.

Details of results delivery are still being developed and refined, but we are currently set up to provide the following:

  • BAM file (roughly 50 GB)
  • variant summary reports from SnpEff and VEP
  • autosomal and chrX variant identification (as two VCF files):
    1. novel variants (annotated with SnpEff and VEP)
    2. results for a set of over 100 million known SNPs from dbSNP build 142
  • mtDNA sequence (as FASTA file)
  • Y-DNA analysis (for males)

In terms of the technical details of the underlying raw data, the sequencing produces:

  • 2 x 150 bp paired-end reads
  • approximately 30x average depth of coverage
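
As a rough back-of-the-envelope check on these figures, the sketch below estimates how many reads are implied by 30x average depth with 150 bp reads. The genome size of about 3.1 billion bases is an assumption that is not stated above, and the function name is purely illustrative.

```python
def reads_for_depth(depth, genome_size_bp, read_length_bp):
    """Approximate number of reads needed to reach a given average depth."""
    # average depth = (number of reads * read length) / genome size
    return depth * genome_size_bp / read_length_bp


# Assumed approximate size of the human genome (not taken from the text above).
GENOME_SIZE_BP = 3_100_000_000

reads = reads_for_depth(30, GENOME_SIZE_BP, 150)
pairs = reads / 2  # paired-end sequencing yields two reads per DNA fragment
print(f"about {reads:.2e} reads, or {pairs:.2e} read pairs")
```

Under these assumptions the estimate comes to roughly 620 million reads, or 310 million read pairs.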

Although Full Genomes is not providing any interpretation of autosomal and X chromosome results, the raw data are provided in a format compatible with a number of tools that provide the opportunity for in-depth analysis. Examples of currently-available online analysis tools include:

As with other Full Genomes products, this is intended for ancestry/research-use only, and should not be relied upon for medical or diagnostic purposes.

Interested customers can order the WGS product through the Full Genomes website here.

Happy SNP'ing!

The FGC Team


Oct. 24, 2014

October 2014 analysis update release notes

Full Genomes Corporation (FGC) has begun releasing new mitochondrial DNA (mtDNA) analyses to customers. The new analysis for each kit is distributed as a FASTA file. FASTA is a widely recognized sequence representation format that is compatible with many mtDNA databases and analysis tools.

The FGC team would like to thank Ian Logan and Dr. Ann Turner for very helpful feedback during the development of these new mtDNA files.

The team also thanks George Jones for the suggestion of distributing these files, as well as the FGC customers who volunteered their mtDNA data for testing and refinement of the FASTA generation process.


Who will receive these new mtDNA analysis results?

These mitochondrial sequence FASTA files will be available to all FGC customers, including those with results from FGC's Y Elite and Y Prime tests, whole-genome sequencing, and Big Y analysis. The FGC team is currently targeting distribution to all customers with available sequencing results over the course of the next week.


How does this new FASTA file differ from the mtDNA analysis reports I've already received?

Firstly, the new results are in a format that is widely-used and compatible with many current mtDNA databases and analysis tools.

Secondly, the results are based on a newly-developed bioinformatics pipeline, designed specifically for performing this mitochondrial sequence analysis. The approach uses more advanced techniques, designed to improve mutation detection and reduce false positives. As a result, the mutations that are indicated in the results may differ slightly from the earlier analysis.

Finally, FGC customers who ordered analysis of Big Y .bam files will be able to more easily interpret their mtDNA results. The previous analysis of Big Y files reported mtDNA results using the Yoruba reference sequence, rather than the more widely-used rCRS reference sequence and the more recent RSRS reference sequence. The FASTA sequence representation is not tied to a particular reference sequence, and many analysis tools will readily analyze the FASTA sequence in the context of these more commonly-used rCRS and RSRS reference sequences.


What can I do with the FASTA file?

There are several opportunities to use the mtDNA results in the FASTA file, which can allow you to get a better understanding of your mtDNA, determine your mtDNA haplogroup, and possibly even contribute to ongoing research of mtDNA and the human mtDNA tree.

  • Opening the file: The first thing you might want to try is to simply open the file in a text editor (Windows users can typically double-click on the file to view it in Notepad or WordPad). The first line provides several statistics to describe the sequence, including its completeness and quality. The actual mtDNA sequence appears on the next line. This sequence may be copied from within the text editor and pasted to other applications (web tools, e-mails, etc.).
  • Determining your mtDNA haplogroup: One particularly useful mtDNA analysis tool available on the web is James Lick's mthap tool, which provides haplogroup classification based on mtDNA data supplied in a number of formats, including FASTA. To use the tool with your FASTA file:

    1. Go to the mthap page.
    2. Click the Choose File or Browse button at the top left and select the location of the FASTA file on your computer.
    3. Finally, click Upload and wait a minute or so for the report to appear, including identification of mutations and haplogroup classification.

    This tool is regularly updated as the mtDNA tree is refined, so you may want to check back periodically for updates. You can also visit PhyloTree to see how your haplogroup fits within the human mtDNA phylogenetic tree.

  • GenBank submission: If you are a whole-genome sequencing customer or one of the lucky few who happened to obtain a near-complete mtDNA sequence from a Y sequencing test (with at least 16545 bp covered, as indicated by the second number at the top of the file) then you may wish to consider submitting your mtDNA sequence to the GenBank database, used by researchers around the world. An mtDNA expert, Ian Logan (ianlogan22@btinternet.com) is graciously offering his time and expertise to take a look at your results, determine whether they are suitable for submission to GenBank, and help with the submission process; he has already helped a significant number of individuals submit their mtDNA sequences to GenBank.

  • Custom mtDNA reports: Dr. Ann Turner is able to provide customized mtDNA reports based on FASTA files; further details may be found on ISOGG's page here.

What is a FASTA file?

FASTA is a format that is commonly used to represent DNA sequences. It consists of a comment line (starting with ">"), followed by one or more lines with the actual DNA sequence of interest. (Further details and examples are at Wikipedia's "FASTA format" article.)

In this case, the sequence starts with the origin (position 1) of the "+" strand of the circular mitochondrial genome and continues (in the standard 5'-3' direction) to higher position numbers.
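
For readers who prefer to work with the file programmatically, here is a minimal sketch of reading FASTA text in Python using only the standard library. The function name and the example header and sequence are invented for illustration and do not reproduce the exact header layout of an FGC file.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new comment line starts a new record; store the previous one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are joined, since a sequence may span several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented example text; a real file would be read with open(path).read().
example = ">example_kit\nGATCACAGnn\ntc\n"
records = read_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

Returning a list of records rather than a single sequence means the same function also handles files that contain more than one FASTA sequence.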


I ordered a Y chromosome sequencing test, so why do I have mitochondrial sequence results?

Although FGC's Y chromosome tests are designed to sequence the Y chromosome, the "targeting" is not perfect. The mitochondrial results that are obtained from these Y chromosome tests are considered "off-target" coverage. In addition, mtDNA is relatively abundant, as there are typically a number of copies per cell, which makes it more likely that some mtDNA is sequenced. Ultimately, the mtDNA results are a fortuitous side effect for anyone interested in mtDNA for genetic genealogy, anthropology, or other applications. However, the quality and completeness of the mtDNA results can vary significantly from kit to kit, and much of the mtDNA sequence (20% or more) will be undetermined in many cases.

On the other hand, mitochondrial DNA results from whole-genome sequencing are completely "intentional" and should generally allow determination of complete or near-complete mtDNA sequence with relatively high reliability.


What do the statistics at the top of the FASTA file mean?

Three statistics are reported at the top of the file:

  1. Sequence length: This first number is the number of letters shown in the sequence on the following line, and should correspond to the size of the mitochondrial genome for the person tested. This can vary slightly from person to person due to insertions and/or deletions of base pairs that accumulate as the mtDNA is passed down from generation to generation. The length is typically within a few base pairs of 16570.
  2. Coverage: The second number indicates the number of letters in the mtDNA sequence that were determined or "called"; this is effectively a count of all letters that are not "n" or "N" in the sequence on the following line. This may be considered roughly as the number of mitochondrial sites with at least one "read" in the sequencing results. (This is not necessarily exact, as there may be a handful of cases per mtDNA sequence where "N" is used to indicate that the base cannot be reliably determined even though the position has been "read".) For kits with Y sequencing results, this number is highly variable and might effectively be attributed to "the luck of the draw" (for further detail, see the answer below to "Why do I have so many "N"s and lower-case letters?"). For kits with whole-genome sequencing results, this number should generally be nearly as high as the first number indicating the length of the sequence.
  3. High-reliability coverage: The last number indicates the number of letters in the mtDNA sequence that are UPPER-CASE and not "N". This effectively counts the number of mitochondrial sites for which the results are sufficient to provide a result with high confidence. Further details about the differences between the "high-reliability" UPPER-CASE letters and the lower-reliability, lower-case letters are discussed in a separate answer below.
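
The three definitions above can be sketched in a few lines of Python. This is an illustration of the counting rules as described in this list, not FGC's actual pipeline; the function name and the short example sequence are invented.

```python
def mtdna_stats(sequence):
    """Return (length, coverage, high_reliability_coverage) for a sequence string."""
    # 1. Sequence length: every letter in the sequence.
    length = len(sequence)
    # 2. Coverage: letters that are not "n" or "N".
    coverage = sum(1 for base in sequence if base not in "nN")
    # 3. High-reliability coverage: UPPER-CASE letters that are not "N".
    high_reliability = sum(1 for base in sequence if base.isupper() and base != "N")
    return length, coverage, high_reliability


# Invented toy sequence; a real mtDNA sequence is roughly 16570 letters long.
stats = mtdna_stats("GATCACAGnntcN")
print(stats)
```

For the toy sequence, the three values are 13, 10, and 8: thirteen letters in total, ten that are not "n" or "N", and eight that are UPPER-CASE and not "N".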

What are the "N"s and "n"s?

The letter "N" is used to indicate cases where the base cannot be determined. Most of these cases arise when there are no available "reads" of a particular mtDNA site, though in some cases "N" is used to indicate a base that cannot be reliably determined even though the position has been "read".


Why are some of the letters shown in UPPER-CASE and some in lower-case?

Letters shown in lower-case correspond to bases that are more uncertain, whereas bases in UPPER-CASE may be considered "high reliability".

Bases shown in UPPER-CASE are supported by at least four reads. UPPER-CASE bases are also required to satisfy additional requirements in cases where there is evidence of variation from the reference sequence.


Why do I have so many "N"s and lower-case letters? / Why are the mtDNA quality and coverage so different from one set of results to another?

In the case of Y sequencing tests, the mtDNA results are essentially incidental. Therefore, these Y sequencing tests generally result in relatively few reads from mitochondrial DNA. Small changes in the abundance of mtDNA in the sequencing library can have a relatively large effect on the ability to determine the mtDNA sequence. (When there are a small number of mtDNA reads, it is essentially "the luck of the draw" whether a particular mtDNA site will be covered.) A number of factors can impact the abundance of mtDNA that is sequenced, including the "copy number" of mtDNA in the submitted DNA sample and Y chromosome targeting efficiency at the lab.

mtDNA results from whole-genome sequencing can be expected to be much less susceptible to these issues, and should provide nearly complete mtDNA coverage in all cases.
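
The "luck of the draw" effect can be illustrated with a simple statistical model. The sketch below assumes that reads land uniformly at random, so that the number of reads at a site follows a Poisson distribution; this is a textbook simplification, not a description of FGC's data, and real coverage is typically less even. The function names are illustrative.

```python
import math


def fraction_uncovered(mean_depth):
    """Expected fraction of sites with no reads under a Poisson model."""
    return math.exp(-mean_depth)


def fraction_below(mean_depth, min_reads):
    """Expected fraction of sites with fewer than min_reads reads."""
    return sum(
        math.exp(-mean_depth) * mean_depth**k / math.factorial(k)
        for k in range(min_reads)
    )


for depth in (0.5, 1.6, 5, 30):
    print(
        f"mean depth {depth}: "
        f"{fraction_uncovered(depth):.1%} of sites with no reads, "
        f"{fraction_below(depth, 4):.1%} with fewer than four reads"
    )
```

Under this simple model, a mean mtDNA depth of about 1.6 leaves roughly 20% of sites with no reads at all, while at much higher depths essentially every site is covered. A small shift in the amount of mtDNA sequenced can therefore change the results considerably when the depth is low.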


What are the other letters besides A, T, C, G, and N?

These other letters are standardized "ambiguity codes" that may be used to represent multiple nucleotides (the standard A/T/C/G "letters"). For example, "K" represents either "G" or "T". (See Wikipedia's "Nucleic acid notation" article for further details.)

In these mitochondrial FASTA sequences, the ambiguity codes can indicate either ambiguous sequencing results or heteroplasmy (i.e. a mixture of different mtDNA sequences within the cells of your body). If a particular ambiguity code is shown as a lower-case letter in the FASTA file, it is more likely to represent an ambiguous result, possibly due to a sequencing artifact. On the other hand, if it is shown as an UPPER-CASE letter, it is more likely to be a genuine heteroplasmy.
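
As an illustration, the sketch below lists the ambiguity codes found in a sequence, using the standard IUPAC nucleotide codes, and records whether each one is UPPER-CASE or lower-case. The function name and the example sequence are invented, and "N" is deliberately left out because it is covered separately above.

```python
# Standard IUPAC ambiguity codes (excluding "N") and the bases they represent.
IUPAC_CODES = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def ambiguous_sites(sequence):
    """Return (position, letter, possible_bases, is_upper_case) for each code."""
    sites = []
    for position, letter in enumerate(sequence, start=1):
        bases = IUPAC_CODES.get(letter.upper())
        if bases is not None:
            sites.append((position, letter, bases, letter.isupper()))
    return sites


# Invented toy sequence with one UPPER-CASE and one lower-case ambiguity code.
for position, letter, bases, is_upper in ambiguous_sites("GATKACyG"):
    kind = "possible heteroplasmy" if is_upper else "possibly ambiguous result"
    print(f"position {position}: {letter} = {' or '.join(bases)} ({kind})")
```

Positions are counted from 1 within the sequence as it appears in the file, which may not match reference-sequence numbering if the sequence contains insertions or deletions.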


What does it mean if the file indicates a "potential length variant"?

Currently, there is no standard format for representing ambiguous results or a mixture of sequences with different lengths (corresponding to insertions or deletions) as a single FASTA sequence. Therefore, these cases are instead reported as separate FASTA sequences, labeled as "potential length variants".

As with the "ambiguity codes" discussed above, these can indicate either ambiguous results or heteroplasmy. If the length variation occurs in a region shown in lower-case letters, it is more likely to be an ambiguous result, possibly due to a sequencing artifact. If it is in a region shown in UPPER-CASE letters, it is more likely to be a genuine heteroplasmy.


What if my results differ from those obtained from mtDNA-specific testing (e.g. Family Tree DNA mtDNA tests)?

As noted above, Y chromosome sequencing tests are designed to target the Y chromosome, not mitochondrial DNA or other portions of the genome. The mitochondrial results that are obtained from such tests are considered "off-target" coverage (though fortuitous to those interested in mtDNA). As a result, the sequencing provides relatively few mitochondrial DNA "reads" and mtDNA sequence coverage will vary significantly from test to test. So, sequencing performed elsewhere that specifically targets mitochondrial DNA should generally be expected to provide results that are more reliable. (Note that reliability should be much less of an issue with whole-genome sequencing results, as offered in FGC's recent pilot product.)

Discrepancies are more likely with heteroplasmies and in the lower-quality, lower-case portions of the sequence in the FASTA file. Also, as mentioned above, mtDNA results based on Y chromosome targeted sequencing will tend to be less reliable than those based on whole-genome sequencing (WGS).

The FGC team would appreciate feedback about any discrepancies, particularly any that are seen in the "high reliability" UPPER-CASE portions of the sequence or in analysis of whole-genome sequencing results. We are also particularly interested in examining any cases where the new FASTA report appears to contain errors that were not present in the older "mttype" mtDNA variant reports. Please send details of any such discrepancies to support@fullgenomes.com.


What about potential health implications of the mtDNA results?

FGC is providing only raw sequence and mutation information for mtDNA results, with a focus on ancestry, genealogical, and anthropological use. The results are provided without any analysis or interpretation with regard to potential health implications. Customers are urged to bear in mind potential health implications and the potential for inaccuracy of results when sharing or interpreting the provided mtDNA information.


If you have additional questions that are not addressed above, please feel free to contact us via e-mail at support@fullgenomes.com; you may also wish to consult with one of the mtDNA experts who participate in the genetic genealogy community.

The FGC Team