File formats (3/3: Getting familiar) | TeselaGen Biotechnology Help Center


A little bit more about file formats

Finally, becoming familiar with bioinformatic file types can be useful for scientific work. A sequence in plain format contains only IUPAC characters; although that is valuable biological knowledge, it can be considered raw material. Scientists usually need to store more information about a given sequence (RNA/DNA or protein), in particular information that relates the sequence to its function. This is especially relevant to synthetic biology, and for that reason different file types have been developed that can hold more information about a sequence: for example, several sequences in the same file, lines of annotation, or indicators such as quality values, an ID or LOCUS name, length, etc. Here we’ll talk about the most common file types supported by our platform.

GenBank (.gb, .gbk)

The GenBank file format is commonly used because it stores extra information in addition to the DNA/protein sequence. If you want to take a deeper look at its structure (data elements or fields), you can see an example in this article from NCBI.

A typical GenBank record contains this information:

Locus (Locus name, Sequence length, Molecule type, GenBank division, Modification date)
Definition
Accession
Version (GI)
Keywords
Source (Organism)
Reference (Authors, Title, Journal, Pubmed)
Features (Source, Taxon, CDS, GI or Translation, Gene)
Origin
Sequence.
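
To make the layout of these fields more concrete, here is a minimal Python sketch that pulls the locus name, length, molecule type, definition and sequence out of GenBank text. The sample record and the parse_genbank helper are made up for illustration, and the parsing assumes a simple, well-formed record; for real files a dedicated library such as Biopython is usually a safer choice.

```python
# Minimal sketch: pull a few fields out of GenBank flat-file text.
# The record below is a shortened, made-up example, not a real entry.
SAMPLE_GENBANK = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2020
DEFINITION  Hypothetical example construct.
ACCESSION   EXAMPLE01
FEATURES             Location/Qualifiers
     source          1..24
                     /organism="synthetic construct"
ORIGIN
        1 atggctagct agctagctag ctaa
//
"""


def parse_genbank(text):
    """Return a dict with locus name, length, molecule type, definition and sequence."""
    record = {"definition": "", "sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # Assumes whitespace-separated LOCUS fields: name, length, "bp", type, ...
            fields = line.split()
            record["name"] = fields[1]
            record["length"] = int(fields[2])
            record["molecule_type"] = fields[4]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Sequence lines start with a position number; drop it and the spaces.
            record["sequence"] += "".join(line.split()[1:])
    return record


record = parse_genbank(SAMPLE_GENBANK)
print(record["name"], record["length"], record["molecule_type"])
print(record["sequence"])
```

Note that the sequence itself lives under the ORIGIN keyword and ends at the "//" line, which is why the sketch only collects bases between those two markers.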

FASTA (.fasta)

FASTA is a text file format for representing raw biological sequences. Each record starts with a single line (the defline) that holds a name and an optional description and is distinguished from the sequence by a greater-than (">") symbol at the beginning. This line is followed by one or more lines that contain the letters of the sequence (IUPAC/IUB codes). For BLAST (Basic Local Alignment Search Tool), NCBI recommends that all lines of text be shorter than 80 characters. You can refer to this article from NCBI for more information about FASTA files.
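
The defline-plus-sequence structure is simple enough to read with a few lines of code. The following Python sketch is only an illustration: the parse_fasta function and the two sample sequences are hypothetical, and it assumes plain FASTA text with no extra comment lines.

```python
def parse_fasta(text):
    """Return a list of (defline, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new defline closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are joined, since one sequence may span several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example content with two records.
SAMPLE_FASTA = """\
>seq1 hypothetical example sequence
ATGGCTAGCTAGCTAGCTAG
CTAA
>seq2 second hypothetical sequence
GGGCCCAAATTT
"""

fasta_records = parse_fasta(SAMPLE_FASTA)
for defline, sequence in fasta_records:
    print(defline, len(sequence))
```

Because the sequence of one record can be wrapped over several lines, the sketch joins them back together before returning each record.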

ZIP (.zip)

As most are aware, the ZIP format is a type of archive file that supports lossless data compression. A ZIP file may contain one or more files or directories, which may have been compressed. In this context, it allows you to download and upload your data in bulk.
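
As a rough illustration of bundling several sequence files into one archive, the Python sketch below uses the standard zipfile module. The file names and contents are made up, and the archive is built in memory only to keep the example self-contained; it does not describe how the platform itself creates or reads ZIP files.

```python
import io
import zipfile

# Hypothetical files to bundle: name -> text content.
files = {
    "seq1.fasta": ">seq1\nATGGCTAGCTAA\n",
    "seq2.fasta": ">seq2\nGGGCCCAAATTT\n",
}

# Write both files into a compressed archive held in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, content in files.items():
        archive.writestr(name, content)

# Read the archive back and restore the original text of each file.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    restored = {name: archive.read(name).decode("utf-8") for name in names}

print(names)
```

Since the compression is lossless, the restored contents are identical to what was written.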

Tabular information (.csv and .xlsx)

Both formats typically store tabular data, but there are some differences that might interest you. CSV stands for Comma-Separated Values; it is a plain-text format that stores your data but not the operations on it (such as formulas). CSV is a common data exchange format that, to a certain extent, makes the data “raw” again, and it is widely used even though it is not fully standardized. XLSX, on the other hand, is a zipped package of XML files (the format used by Microsoft Excel), so it is not plain text; it holds the same data and also the operations on it. Exporting your data in this format creates a spreadsheet that is viewable and editable in Excel, which makes the data easy to regroup, combine, and reformat. CSV files can also be opened and edited in Microsoft Excel and even in plain text editors (XLSX files cannot be edited that way). So, in our particular context, the choice is mostly a matter of which format you are used to working with.
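
The "raw" nature of CSV can be seen in a short Python sketch using the standard csv module. The column names and rows are hypothetical and are not an actual export layout from the platform.

```python
import csv
import io

# Hypothetical table: one row per sequence.
rows = [
    {"name": "seq1", "length": 24, "type": "DNA"},
    {"name": "seq2", "length": 12, "type": "DNA"},
]

# Write the table as CSV text (in memory, to keep the example self-contained).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "length", "type"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back: every value returns as plain text, including the numbers.
loaded = list(csv.DictReader(io.StringIO(csv_text)))
lengths = [int(row["length"]) for row in loaded]  # convert back explicitly
print(lengths)
```

Only the values survive the round trip; types, formulas and formatting are not stored, so numbers have to be converted back explicitly after loading.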

JSON (.json)

A JSON file is a text file written in JavaScript Object Notation, a syntax for storing structured data. This format allows software to store computational objects as text, which makes it very useful for storing complex biological designs in a format that is compatible with almost any operating system.
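
To illustrate how an object becomes text and back again, here is a small Python sketch using the standard json module. The design structure, with its name, parts and circular fields, is invented for this example and is not the platform's actual schema.

```python
import json

# Hypothetical design object; the field names are illustrative only.
design = {
    "name": "example_design",
    "circular": True,
    "parts": [
        {"id": "promoter_1", "type": "promoter", "sequence": "TTGACA"},
        {"id": "cds_1", "type": "CDS", "sequence": "ATGGCTTAA"},
    ],
}

# Serialize the object to JSON text, then parse the text back into an object.
json_text = json.dumps(design, indent=2)
restored_design = json.loads(json_text)

print(json_text)
print(restored_design == design)
```

Nested lists and objects are preserved in the round trip, which is what makes the format convenient for designs with many related parts.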