Genomes Download (FTP) FAQ
- Why is the NCBI genomes FTP site being reorganized?
- What are the highlights of the redesigned FTP site?
- Will the content of the old FTP site go away? What is the timeline for transitioning to the new FTP site?
- Are all genomes available in NCBI nucleotide available on the new FTP site?
- Are files on the new FTP site updated following annotation updates?
- My organism of interest is available in both GenBank and RefSeq. Is the genome the same? Which one should I use?
- How are the new FTP directories structured?
- What is the file content within each specific assembly directory?
- How can I find the sequence and annotation of my genome of interest?
- Where can I find information to help me choose between the many different assemblies for a species?
- Why was the sequence identifier format in the FASTA files changed?
- Do you provide assembly data formatted for use by sequence read alignment pipelines?
- Are you planning to add more content or formats?
- Are repetitive sequences in eukaryotic genomes masked?
- How do alignment programs treat the lower-case masking in genomic fasta files?
- How can sequence with lower-case masking be converted to unmasked sequence?
- How can sequence with lower-case masking be converted to sequence masked with Ns?
-
Why is the NCBI genomes FTP site being reorganized?
Historically, the genomes FTP site was populated by different process flows and NCBI working groups, which led to undesirable differences in available content and file formats. In addition, data for GenBank genomes and RefSeq genomes were located in different areas of the NCBI FTP site that were organized differently.
NCBI has redesigned the genomes FTP site to expand the content and facilitate data access through an organized, predictable directory hierarchy with consistent file names and formats. The updated site provides greater support for downloading assembled genome sequences and/or corresponding annotation data. The new FTP site structure provides a single entry point to access content representing either GenBank or RefSeq data.
-
What are the highlights of the redesigned FTP site?
The updated genomes FTP provides more uniformity across species. It offers a consistent core set of files for the genome sequence and annotation products of all organisms and assemblies in scope.
The reorganized genomes FTP site supports download needs such as:
- Retrieve the unmasked or soft-masked genome sequence for a specific genome assembly
- Retrieve GenBank or RefSeq Gene, RNA and protein annotation for a specific organism and a specific assembly, or a specific RefSeq annotation release
- Retrieve annotation in either GenBank flat-file or GFF format
- Use matching sequence identifiers in FASTA and GFF files to facilitate RNA-Seq and other analyses
- Confirm downloaded content is complete using the MD5 checksums provided in md5checksums.txt
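The checksum step in the list above can be sketched as a small shell function. This is a minimal example, not an official NCBI tool: it assumes the md5sum utility is installed and that md5checksums.txt holds one "checksum  ./filename" entry per line. The function name verify_assembly_dir is hypothetical.

```shell
# Check the downloaded files in one assembly directory against md5checksums.txt.
# Assumption: md5checksums.txt lines look like "<md5>  ./<file name>".
verify_assembly_dir() {
    (
        cd "$1" || exit 1
        # Keep only the entries whose file was actually downloaded,
        # then let md5sum compare the recorded and computed checksums.
        while read -r sum path; do
            [ -f "$path" ] && printf '%s  %s\n' "$sum" "$path"
        done < md5checksums.txt | md5sum -c -
    )
}
```

For example, `verify_assembly_dir GCA_000005845.2_ASM584v2` prints one OK or FAILED line per file found and returns a non-zero status if any checksum does not match.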
-
Will the content of the old FTP site go away? What is the timeline for transitioning to the new FTP site?
The initial release of the redesigned genomes FTP site in August 2014 adds three new directories, namely ‘genbank’, ‘refseq’, and ‘all’ to the existing ftp area – ftp://ftp.ncbi.nlm.nih.gov/genomes/. These directories provide a core set of files representing both sequence and annotation content in several formats (see below). Additional file formats will be added in future updates as described below.
We plan to maintain the older content and structure of the preexisting /genomes/ site, as well as that at ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/, in parallel with the new structure until March 2015. Ultimately, older content in the /genomes/ area will be fully replaced by or moved to archival subdirectories within the /genomes/refseq/ area and older content in the /genbank/genomes/ area will be fully replaced by or moved to archival subdirectories within the /genomes/genbank/ area.
-
Are all genomes available in NCBI nucleotide available on the new FTP site?
Genome sequence and annotation data are provided for organisms in scope for NCBI’s Assembly resource (which excludes viruses at this time). Data are provided for both GenBank and RefSeq assembly versions. There are FTP directories for the latest version in each assembly chain, plus directories for a few older assembly versions.
-
Are files on the new FTP site updated following annotation updates?
Yes, the FTP files for the latest version of an assembly are updated after the annotation on any of the sequences in the assembly changes.
The FTP files for the latest version of an assembly may also be updated:
- to make the files conform to the latest specifications for a particular data format
- to correct errors in conversion of the primary data from the NCBI databases into the various FTP file formats
Files for old versions of assemblies will not usually be updated.
-
My organism of interest is available in both GenBank and RefSeq. Is the genome the same? Which one should I use?
GenBank content includes genome assemblies that are submitted to members of the International Nucleotide Sequence Database Collaboration. GenBank submissions may or may not include annotation information which, when provided, was generated by different groups using different methods. Note that for prokaryotes, GenBank annotation may have been generated using NCBI’s prokaryotic genome annotation service. In contrast, RefSeq genomes are selected from, and are a subset of, the available GenBank genomes and annotation data is available for all RefSeq genomes. RefSeq annotation content originates from NCBI’s prokaryotic or eukaryotic genome annotation pipelines, or is propagated from the GenBank submission.
For some assemblies, both GenBank and RefSeq content may be available. RefSeq genomes are a copy of the submitted GenBank assembly. In some cases the assemblies are not completely identical as RefSeq has chosen to add a non-nuclear organelle unit to the assembly or to drop very small contigs or reported contaminants. Equivalent RefSeq and GenBank assemblies, whether or not they are identical, and RefSeq to GenBank sequence ID mapping, can be found in the assembly report files available on the FTP site or by download from the Assembly resource.
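The RefSeq to GenBank sequence ID mapping can be pulled out of an assembly report with a short awk command. The sketch below assumes that the column header line of the report starts with "#" and names the two columns "GenBank-Accn" and "RefSeq-Accn"; check the header of your downloaded file, since those names are an assumption here. The function name refseq_to_genbank is hypothetical.

```shell
# Print "RefSeq accession <TAB> GenBank accession" for each sequence in an
# assembly report. Columns are located by header name, not by fixed position.
# Assumption: the header names are "GenBank-Accn" and "RefSeq-Accn".
refseq_to_genbank() {
    awk -F '\t' '
        /^#/ {
            # Header and meta-data lines start with "#"; remember the column
            # positions when the two header names are seen.
            for (i = 1; i <= NF; i++) {
                if ($i == "GenBank-Accn") gb = i
                if ($i == "RefSeq-Accn")  rs = i
            }
            next
        }
        gb && rs { print $rs "\t" $gb }
    ' "$1"
}
```

Usage: `refseq_to_genbank GCF_000001405.26_GRCh38_assembly_report.txt > id_map.tsv`.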
-
How are the new FTP directories structured?
The base structure of the revised genomes ftp site includes several main directory areas that provide sequence and annotation content, or report files. Sequence and annotation content is further organized by major taxonomic groupings, then by species, then by assembly. Sequence content is defined by the Assembly resource. The revised genomes FTP site provides directories for:
- GenBank content organized by taxonomic group, then by species and assembly
- RefSeq content organized by taxonomic group, then by species and assembly
- all (union of GenBank and RefSeq) organized by individual assembly [This directory is not suitable for browsing because it holds many thousands of entries.]
- Assembly reports
- Genome reports
Within the GenBank and RefSeq directories, the directory hierarchy is:
- Taxonomic group
  - Genus_species
    - All assemblies
      - Individual assemblies
    - Latest assembly versions
      - Individual assemblies
    - RefSeq representative genomes (if any)
      - Individual assemblies
    - RefSeq reference genomes (if any)
      - Individual assemblies
    - Future: additional groupings will be added in the future. For example:
      - Annotation release data sets from NCBI’s eukaryotic annotation pipeline
  - Future: the design supports adding other groupings of interest in the future
The first layer of organization consists of the following directories:
- genbank: content includes primary submissions of assembled genome sequence and associated annotation data, if any, as exchanged among members of the International Nucleotide Sequence Database Collaboration, of which NCBI’s GenBank database is a member. The GenBank directory area includes genome sequence data for a larger number of organisms than the RefSeq directory area; however, some assemblies are unannotated. The subdirectory structure includes:
  - archaea
  - bacteria
  - fungi
  - invertebrate
  - other – this directory is only provided for GenBank and includes submissions of synthetic genomes.
  - plant
  - protozoa
  - vertebrate_mammalian
  - vertebrate_other
- refseq: content includes assembled genome sequence and RefSeq annotation data. All RefSeq genomes have annotation. RefSeq annotation data may be calculated by NCBI annotation pipelines or propagated from the GenBank submission. The RefSeq directory area includes fewer organisms than the GenBank directory area because not all genome assemblies are selected for the RefSeq project. Subdirectories include:
  - archaea
  - bacteria
  - fungi
  - invertebrate
  - plant
  - protozoa
  - vertebrate_mammalian
  - vertebrate_other
- all: content is the union of GenBank and RefSeq assemblies. Subdirectories are provided per assembly accession and version. This directory holds many thousands of entries and should only be accessed as a path to a known assembly.
- ASSEMBLY_REPORTS: content consists of two summary reports as well as a separate report for each assembly, conveying meta-data for GenBank and RefSeq genome assemblies and associated sequence and other identifiers. The two summary report files include meta-data details of either all the GenBank assemblies or all the RefSeq assemblies. These summary files provide an FTP path that can be used to retrieve the sequence and annotation data.
- GENOME_REPORTS: content consists of summary reports of genome sequencing projects, associated annotation statistics, and some defined reference datasets within the RefSeq project. Reports are provided by the Genomes resource.
Example directory hierarchy:
The directory hierarchy for the GenBank Escherichia coli K-12 substr. MG1655 genome, which has the assembly accession GCA_000005845.2 and the default assembly name ‘ASM584v2’, looks like this:
- genomes
  - genbank
    - bacteria
      - Escherichia_coli
        - all_assembly_versions
          - GCA_000005845.2_ASM584v2 – this directory layer is named using the pattern: [Assembly accession.version]_[assembly name]
The directory hierarchy for the annotated human reference genome looks like this:
- genomes
  - refseq
    - vertebrate_mammalian
      - Homo_sapiens
        - all_assembly_versions
        - latest_assembly_versions
        - reference
          - GCF_000001405.26_GRCh38
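The naming pattern in the examples above can be turned into a download path with two small shell functions. This is a sketch under the assumption that the assembly sits in the all_assembly_versions directory for its species, as in the Escherichia coli example, and that files follow the [assembly accession.version]_[assembly name]_content.[format] pattern. The function names are hypothetical.

```shell
# Directory name for one assembly: [assembly accession.version]_[assembly name]
assembly_dir_name() {
    # $1 = assembly accession.version, $2 = assembly name
    printf '%s_%s\n' "$1" "$2"
}

# Full path to one file of an assembly under all_assembly_versions.
assembly_file_url() {
    # $1 = genbank or refseq        $2 = taxonomic group directory
    # $3 = Genus_species directory  $4 = assembly accession.version
    # $5 = assembly name            $6 = content.format, e.g. genomic.fna.gz
    base="ftp://ftp.ncbi.nlm.nih.gov/genomes"
    name=$(assembly_dir_name "$4" "$5")
    printf '%s/%s/%s/%s/all_assembly_versions/%s/%s_%s\n' \
        "$base" "$1" "$2" "$3" "$name" "$name" "$6"
}
```

For example, `assembly_file_url genbank bacteria Escherichia_coli GCA_000005845.2 ASM584v2 genomic.fna.gz` prints a path that can be handed to a download tool such as wget or curl.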
-
What is the file content within each specific assembly directory?
The initial release includes the following core files and formats for all assemblies, as relevant. All files are named according to the pattern:
[assembly accession.version]_[assembly name]_content.[format]
*_assembly_report.txt
Tab-delimited text file reporting the name, role and sequence accession.version for objects in the assembly. The file header contains meta-data for the assembly including: assembly name, assembly accession.version, scientific name of the organism and its taxonomy ID, assembly submitter, and sequence release date.
*_assembly_stats.txt
Tab-delimited text file reporting statistics for the assembly including: total length, ungapped length, contig & scaffold counts, contig-N50, scaffold-L50, scaffold-N50, scaffold-N75 & scaffold-N90.
*_assembly_regions.txt
Provided for assemblies that include alternate or patch assembly units. Tab-delimited text file reporting the location of genomic regions and the alt/patch scaffolds placed within those regions.
*_assembly_structure directory
Contains AGP files that define how component sequences are organized into scaffolds and/or chromosomes. Other files define how scaffolds and chromosomes are organized into non-nuclear and other assembly-units, and how any alternate or patch scaffolds are placed relative to the chromosomes. Only present if the assembly has internal structure.
*_genomic.fna.gz
FASTA format of the genomic sequence(s) in the assembly. Repetitive sequences in eukaryotes are masked to lower-case. The genomic.fna.gz file includes all top-level sequences in the assembly (chromosomes, plasmids, organelles, unlocalized scaffolds, unplaced scaffolds, and any alternate loci or patch scaffolds). Scaffolds that are part of the chromosomes are not included because they are redundant with the chromosome sequences; sequences for these placed scaffolds are provided under the assembly_structure directory.
*_genomic.gbff.gz
GenBank flat file format of the genomic sequence(s) in the assembly.
*_genomic.gff.gz
Annotation of the genomic sequence(s) in Generic Feature Format Version 3 (GFF3).
*_protein.faa.gz
FASTA format of the accessioned protein products annotated on the genome assembly.
*_protein.gpff.gz
GenPept format of the accessioned protein products annotated on the genome assembly.
*_rm.out.gz
RepeatMasker output; Provided for Eukaryotes.
*_rm.run
Documentation of the RepeatMasker version, parameters, and library; Provided for Eukaryotes.
*_rna.fna.gz
FASTA format of accessioned RNA products annotated on the genome assembly; Provided for RefSeq assemblies as relevant (Note: RNA and mRNA products are not instantiated as separate accessioned records in GenBank and are provided for some RefSeq genomes, most notably the eukaryotes).
*_rna.gbff.gz
GenBank flat file format of RNA products annotated on the genome assembly; Provided for RefSeq assemblies as relevant.
md5checksums.txt
File checksums are provided for all data files in the directory.
-
How can I find the sequence and annotation of my genome of interest?
Genome assemblies of interest can be found using one of two methods.
Using the NCBI Assembly resource
Genome assemblies of interest can be found using the search bar, the advanced search page, or the browse-by-organism table provided by the Assembly resource.
GenBank or RefSeq data for the assembly can be obtained by following the links to the FTP site from the "Access the data" section of the right-hand sidebar.
Using the assembly summary report files
Download the relevant assembly summary files that report assembly meta-data.
- Either the two master assembly summary files:
  ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
  ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
- Or an assembly summary file for a species from the appropriate directory under genbank or refseq, e.g.
  ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Salmonella_enterica/assembly_summary.txt
Search the meta-data fields, or filter the files, to find assemblies of interest.
The field named "ftp_path" provides the path to the FTP directory containing the data for each assembly.
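Filtering an assembly summary file for its "ftp_path" values can be sketched with awk. The example assumes the file is tab-delimited and that the line naming the columns starts with "#" and contains a field called exactly "ftp_path". The function name ftp_paths_matching is hypothetical.

```shell
# Print the ftp_path of every assembly whose row matches a pattern.
# The ftp_path column is found by its header name rather than a fixed position.
ftp_paths_matching() {
    # $1 = awk regular expression to search for, $2 = assembly summary file
    awk -F '\t' -v pat="$1" '
        /^#/ {
            # Comment and header lines: note which column is ftp_path.
            for (i = 1; i <= NF; i++) if ($i == "ftp_path") col = i
            next
        }
        col && $0 ~ pat { print $col }
    ' "$2"
}
```

Usage: `ftp_paths_matching 'MG1655' assembly_summary_genbank.txt` lists the FTP directories of all rows that mention MG1655 in any field.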
-
Where can I find information to help me choose between the many different assemblies for a species?
There can be many different genome assemblies available for species with medical, agricultural or scientific relevance. The Genus_species directories under the "genbank" and "refseq" directory trees each contain an assembly_summary.txt file that provides general information on all assembly versions included in the directory, such as release date, submitter organization, assembly level and status. See for example ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Sulfolobus_islandicus/assembly_summary.txt
After assemblies of interest have been identified using the data from the species-specific assembly_summary.txt file, they can be accessed via the "all_assembly_versions" directory for that species.
Alternatively, any assemblies that the NCBI Reference Sequence (RefSeq) group has selected to be reference or representative genomes can be readily accessed via the directories named "reference" or "representative" in the Genus_species directories under the "genbank" and "refseq" directory trees.
-
Why was the sequence identifier format in the FASTA files changed?
We changed the sequence identifier format in the FASTA files to make our datasets more usable by the community.
NCBI has traditionally used a compound FASTA sequence identifier string in which multiple IDs were separated by '|' characters. This format provides more information but requires that the individual sequence identifiers be parsed out of the compound string. The FASTA files on the redesigned genomes FTP site have a simple sequence identifier string that is just the sequence accession.version, for example:
>U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome
>NC_000001.11 Homo sapiens chromosome 1, GRCh38 Primary Assembly
This sequence identifier is identical to that used in the GFF annotation files on the genomes FTP site. Providing sequence and annotation files with matching sequence identifiers supports their use in commonly used RNA-Seq analysis packages and in other analysis pipelines that rely on simple string comparison to match sequence identifiers.
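One way to confirm that a FASTA file and a GFF file use matching identifiers is to compare the two sets of IDs by simple string comparison. The following is a minimal sketch for uncompressed files (decompress with gzip -dc first); it assumes the FASTA identifier is the text between ">" and the first whitespace and that GFF feature lines have nine tab-separated columns. The function names are hypothetical.

```shell
# List the accession.version identifiers from FASTA definition lines.
fasta_ids() {
    awk '/^>/ { print substr($1, 2) }' "$1" | sort -u
}

# List the sequence identifiers (column 1) of the feature lines in a GFF file.
gff_seqids() {
    # Lines starting with "#" are comments or directives, not features.
    awk -F '\t' '!/^#/ && NF >= 9 { print $1 }' "$1" | sort -u
}

# Print identifiers used in the GFF file that are absent from the FASTA file.
gff_ids_missing_from_fasta() {
    # $1 = FASTA file, $2 = GFF file
    ids="${TMPDIR:-/tmp}/fasta_ids.$$"
    fasta_ids "$1" > "$ids"
    # comm -13 keeps only the lines unique to the second input (the GFF IDs).
    gff_seqids "$2" | comm -13 "$ids" -
    rm -f "$ids"
}
```

An empty result from `gff_ids_missing_from_fasta genomic.fna genomic.gff` means every annotated sequence has a FASTA record with the same identifier.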
-
Do you provide assembly data formatted for use by sequence read alignment pipelines?
Genomic FASTA with modified sequence identifiers and index files convenient for analysis with Next Generation Sequencing tools are currently provided for the Genome Reference Consortium's human and mouse assemblies: GRCh38 & GRCm38.p3. RefSeq annotation in GFF3 format with sequence identifiers matching those in the FASTA files is also provided to facilitate use in RNA-Seq analysis pipelines.
The three analysis sets provided for GRCh38 (no_alt_analysis_set, full_analysis_set, full_plus_hs38d1_analysis_set) and the two analysis sets provided for GRCm38.p3 (no_alt_analysis_set, full_analysis_set) differ from the corresponding full assemblies by one or more of the following:
- omission of alternate locus and patch scaffolds that cause complications for sequence read alignment programs that are not alt-aware
- hard masking of duplicate copies of the pseudo-autosomal regions and centromeric arrays
- addition of "decoy" sequences
-
Are you planning to add more content or formats?
Yes, we plan to add the following additional files to the genomes FTP site:
- Product GFF reports
  - Feature annotation of annotated, accessioned RNA and mRNA products in GFF format
  - Feature annotation of annotated, accessioned protein products in GFF format
- Feature count reports
- Sequence ID mapping table
- Nucleotide FASTA of annotated CDS features
- Conceptual translations of CDSs in fasta format
- Location of gaps within the genomic sequences
- Submitted quality score data
- Sequence and annotation in binary ASN.1 format
- Packages containing data from multiple assemblies for popular groupings, such as:
  - All complete bacterial genome assemblies
  - All RefSeq representative bacterial genomes
- Sequences formatted for alignment pipelines for more organisms
- Data for older versions of assemblies
-
Are repetitive sequences in eukaryotic genomes masked?
Repetitive sequences in eukaryotic genome assembly sequence files, as identified by WindowMasker, have been masked to lower-case.
The location and identity of repeats found by RepeatMasker are also provided in a separate file. These spans could be used to mask the genomic sequences if desired. Be aware, however, that many less studied organisms do not have good repeat libraries available for RepeatMasker to use.
-
How do alignment programs treat the lower-case masking in genomic fasta files?
Alignment programs typically have parameters that control whether the program will ignore lower-case masking, treat it as soft-masking (i.e. only for finding initial matches) or treat it as hard-masking. The program's documentation should indicate the default behavior.
By default NCBI BLAST will ignore lower-case masking but this can be changed by adding options to the blastn command-line.
To have blastn treat lower-case masking in the query sequence as soft-masking add:
-lcase_masking
To have blastn treat lower-case masking in the query sequence as hard-masking add:
-lcase_masking -soft_masking false
-
How can sequence with lower-case masking be converted to unmasked sequence?
Here are two examples of commands that will remove lower-case masking:
perl -pe '/^[^>]/ and $_=uc' genomic.fna > genomic.unmasked.fna
-or-
awk '{if(/^[^>]/)$0=toupper($0);print $0}' genomic.fna > genomic.unmasked.fna
-
How can sequence with lower-case masking be converted to sequence masked with Ns?
Here are two examples of commands that will convert lower-case masking to masking with Ns (hard-masked):
perl -pe '/^[^>]/ and $_=~ s/[a-z]/N/g' genomic.fna > genomic.N-masked.fna
-or-
awk '{if(/^[^>]/)gsub(/[a-z]/,"N");print $0}' genomic.fna > genomic.N-masked.fna
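The commands above read an uncompressed genomic.fna file, while the files on the FTP site are gzip-compressed (*_genomic.fna.gz). The sketch below wraps the same awk logic so that it reads and writes gzip directly. It assumes gzip and awk are available; setting LC_ALL=C is a precaution so that the range [a-z] matches only lower-case ASCII letters in every locale. The function names are hypothetical.

```shell
# Remove lower-case masking: convert sequence lines to upper case.
unmask_gz() {
    # $1 = input *_genomic.fna.gz, $2 = output file (gzip-compressed)
    gzip -dc "$1" |
        LC_ALL=C awk '{ if (/^[^>]/) $0 = toupper($0); print }' |
        gzip -c > "$2"
}

# Hard-mask: replace lower-case (soft-masked) bases with N.
hardmask_gz() {
    # $1 = input *_genomic.fna.gz, $2 = output file (gzip-compressed)
    gzip -dc "$1" |
        LC_ALL=C awk '{ if (/^[^>]/) gsub(/[a-z]/, "N"); print }' |
        gzip -c > "$2"
}
```

Usage: `hardmask_gz GCF_000001405.26_GRCh38_genomic.fna.gz genomic.N-masked.fna.gz`. Definition lines starting with ">" pass through unchanged in both functions.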