SGN Documentation


0. Breedbase Database Description

0.1. Breeding Program

In Breedbase, data belonging to distinct breeding programs is compartmentalized using the concept of a ‘breeding program’. A breeding program is defined simply by a unique name and a description. Many of the following concepts, such as locations, field trials, genotyping plates, genotyping data experiments, crossing experiments, and seedlots, can only be created in association with a breeding program, allowing clear ownership of data. Only a curator on the system can create a new breeding program. A breeding program is stored as an entry in the project table in the database.

0.2. Locations

A location is defined in Breedbase using a unique name and abbreviation. It can have one of the following types: Farm, Field, Greenhouse, Screenhouse, Lab, Storage, or Other. The location is saved with longitude and latitude coordinates and an altitude in meters, and is associated with a country and breeding program. An interactive map can be used to add, edit, delete, and view locations. Apart from the interactive map, an Excel file can be uploaded to add locations in bulk. Locations are linked to many of the following concepts, such as field trials, genotyping plates, genotyping data experiments, crossing experiments, and seedlots. A location is created as an entry in the nd_geolocation table with associated properties in the nd_geolocationprop table, following an entity-attribute-value (EAV) model. Property entries include the country and location type, and optionally, the address, continent, and administrative regions; they are stored in nd_geolocationprop using terms from the ‘geolocation_property’ controlled vocabulary. Controlled vocabularies are a fundamental part of the Chado database schema and of the Natural Diversity module on which Breedbase is built. Only a submitter or curator on the system can create new locations in the database.
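
The EAV layout described above can be illustrated with a minimal sketch. It uses an in-memory SQLite database rather than PostgreSQL, keeps only a few columns, and stores the property type as plain text in a hypothetical type_name column, whereas the real Chado tables reference a controlled vocabulary term. The property names 'country_code' and 'location_type' and the example location are assumptions made for illustration.

```python
import sqlite3

# Simplified stand-ins for nd_geolocation and nd_geolocationprop.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nd_geolocation (
    nd_geolocation_id INTEGER PRIMARY KEY,
    description TEXT UNIQUE,          -- the location name
    latitude REAL,
    longitude REAL,
    altitude REAL
);
CREATE TABLE nd_geolocationprop (
    nd_geolocation_id INTEGER REFERENCES nd_geolocation(nd_geolocation_id),
    type_name TEXT,                   -- assumption: real schema stores a cvterm id
    value TEXT
);
""")

# One entity row, plus one attribute-value row per property.
conn.execute(
    "INSERT INTO nd_geolocation VALUES (1, 'Example Farm', 42.45, -76.47, 120.0)"
)
conn.executemany(
    "INSERT INTO nd_geolocationprop VALUES (1, ?, ?)",
    [("country_code", "USA"), ("location_type", "Farm")],
)


def location_properties(conn, location_id):
    """Return all properties of one location as a {type_name: value} dict."""
    rows = conn.execute(
        "SELECT type_name, value FROM nd_geolocationprop "
        "WHERE nd_geolocation_id = ?",
        (location_id,),
    )
    return dict(rows.fetchall())


print(location_properties(conn, 1))
```

New property types can be added as rows without changing the table structure, which is the main reason for the EAV design.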

0.3. Germplasm

To describe how germplasm, or accessions, are managed in Breedbase, it is first important to understand how organisms, or species, are managed. The Breedbase database is pre-populated with the complete NCBI taxonomy record, defining all known species with their associated genus, abbreviation, common name, and GenBank taxon identifier. Researchers using Breedbase can find their crops of interest within the 100,000+ organisms available. Germplasm, or accessions, are always created in association with an organism. A single Breedbase instance can be used for a variety of crop organisms; however, for logistical reasons, it is recommended to use separate instances for individual crops.

The required information to create an accession is only a unique name and the organism species name; however, germplasm can be optionally annotated with the following properties: variety, donor, donor institute, donor PUI, country of origin, state, institute code, institute name, biological status of accession code, notes, accession number, PUI, seed source, type of germplasm storage code, acquisition date, organization, location code, ploidy level, genome structure, ncbi taxonomy id, transgenic, introgression parent, introgression backcross parent, introgression map version, introgression chromosome, introgression start position in base pairs, and introgression end position in base pairs. Many of these germplasm properties are derived from the Breeding API (BrAPI) specification (BrAPI, 2019). Germplasm can be added to the database using an interactive list tool or using an Excel file upload; the Excel file upload also allows for storing and updating of all attributes listed above.

One of the most critical issues in germplasm management is the creation of duplicate germplasm records for a single unique entity; often the duplication occurs because of typographical errors, such as adding white space between characters, or transcription errors when manually writing germplasm names. To address this issue, Breedbase does a fuzzy search across all incoming germplasm names before they can be stored in the system. The fuzzy search detects any germplasm names existing in the system that resemble the incoming germplasm names; the uploader can then choose either to add their germplasm name as a new synonym of the existing germplasm entry in the database or to simply adopt the existing germplasm unique name. Adding a synonym to the existing germplasm entry is generally the most convenient option, given that Breedbase will recognize the synonym in all downstream cases. Even so, all synonym names must be unique; non-unique synonyms cannot be added.
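
The idea behind the fuzzy check can be sketched as follows. This is not Breedbase's own implementation: it substitutes the similarity ratio from Python's standard difflib module, and the function name, the cutoff of 0.8, and the example names are all assumptions chosen for illustration.

```python
from difflib import get_close_matches


def find_similar_names(incoming, existing, cutoff=0.8):
    """Report incoming names that closely resemble, but do not equal, stored names.

    Returns {incoming_name: [similar existing names]}. Exact matches and
    names with no close match are left out of the report.
    """
    existing_set = set(existing)
    # Compare case-insensitively, but report the names as they are stored.
    lowered = {name.lower(): name for name in existing}
    report = {}
    for name in incoming:
        if name in existing_set:
            continue  # already in the system under exactly this name
        hits = get_close_matches(name.lower(), list(lowered), n=5, cutoff=cutoff)
        if hits:
            report[name] = [lowered[hit] for hit in hits]
    return report


stored = ["TMS-IBA30572", "Kaleso"]
uploaded = ["TMS IBA30572", "Kaleso", "NewLine1"]
print(find_similar_names(uploaded, stored))
```

In this example only 'TMS IBA30572' is flagged, because it differs from a stored name by a single character; the uploader would then decide between adding it as a synonym and adopting the stored name.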

Germplasm are the foundation for many of the following concepts, such as seedlots, field trials, genotyping plates, genotyping data projects, and crossing experiments. They are stored in the stock table with associated properties in the stockprop table, following an EAV model. The optional properties listed above are stored in the stockprop table using terms from the ‘stock_property’ controlled vocabulary. The stock table, as will be described in the following sections, is used to store a variety of stock-like entities, including plot and plant entries, and seedlots; in the case of germplasm, the stock table entry has a type named ‘accession’ from the ‘stock_type’ controlled vocabulary.

In Breedbase, germplasm can be grouped into populations. A population is defined with a unique name and a list of germplasm names. Populations are useful in downstream analysis for clustering and demarcating groups of germplasm. A population is stored as an entry in the stock table with a type named ‘population’ from the ‘stock_type’ controlled vocabulary. Entries in the stock_relationship table link germplasm entries and population entries in the stock table using the type name ‘member_of’ from the ‘stock_relationship’ controlled vocabulary.

For query performance and versatility, a PostgreSQL materialized view is generated to collapse all information from the stock and stockprop table EAV model into a simple row and column table structure called materialized_stockprop. The materialized view is regenerated whenever new stock entries are added. The germplasm search, and more generally the stock search, construct complex and efficient queries using the materialized view.
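
The reshaping performed by the materialized view can be illustrated outside the database. The sketch below is plain Python, not the PostgreSQL view definition, and only shows how EAV rows collapse into one record per stock; the tuple layout and property names are assumptions made for illustration.

```python
from collections import defaultdict


def pivot_stockprops(stockprop_rows):
    """Collapse (stock_id, type_name, value) rows into one dict per stock.

    Values are collected in lists because a stock may carry several
    properties of the same type.
    """
    wide = defaultdict(dict)
    for stock_id, type_name, value in stockprop_rows:
        wide[stock_id].setdefault(type_name, []).append(value)
    return dict(wide)


rows = [
    (1, "variety", "A"),
    (1, "ploidy_level", "2"),
    (2, "variety", "B"),
    (1, "variety", "C"),
]
print(pivot_stockprops(rows))
```

Once the data is in this row-and-column shape, a search on several properties becomes a filter on one record instead of one join per property.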

0.4. Seedlots

As described above, germplasm are abstract in the sense that they do not have a physical storage location or a physical inventory. To address this issue, Breedbase has a seedlot concept. A seedlot is defined with a unique name, a breeding program, a storage location and box name, a seedlot content of either a germplasm name or a cross name, and a seedlot inventory count and/or weight in grams. Optionally, the seedlot can be tagged to an organization and can have a description. The contents of the seedlot can be either an existing germplasm name or cross name, allowing flexibility for performing inventories at various stages of the breeding program, such as during crossing nurseries or seed increases.

A seedlot must have a seed count, a weight in grams, or both, to represent its current inventory state. This inventory is tracked in Breedbase using a transaction system, allowing accountability for when seed was added to or withdrawn from a seedlot. Transactions can occur between seedlots for tracking seed transfers across breeding programs; more generally, transactions can be recorded for any addition or subtraction of seed by a submitter on Breedbase. Transactions are created automatically if a field trial has indicated that seedlots were planted in specific plots of the experiment; more information on this can be found in the following field trials section.

Seedlots are stored as entries in the stock table using the type name ‘seedlot’ from the ‘stock_type’ controlled vocabulary. The relationship between the seedlot and its contents is stored as a stock_relationship entry using the type name ‘collection_of’ from the ‘stock_relationship’ controlled vocabulary. The stock entry is linked to stockprop entries in an EAV model for the following properties: current count, current weight in grams, storage box location, and optionally, organization; all of these terms are part of the ‘stock_property’ controlled vocabulary. The current count and current weight in grams values are only for quick querying and are updated whenever a new transaction occurs; these values are updated by summing over all past transactions. The seedlot stock entry is linked relationally to its breeding program and location via the nd_experiment linking tables; the nd_experiment table has the type name ‘seedlot_experiment’ from the ‘experiment_type’ controlled vocabulary.

Transactions are stored in the stock_relationship table using the type name ‘seed transaction’ from the ‘stock_relationship’ controlled vocabulary. The value field of this stock_relationship entry contains a JSON encoded string with the following information: amount, weight, operator, timestamp, and description. The stock_relationship table can link any stock entry to any other stock entry, allowing a seedlot to be transacted with any seedlot or plot entry.
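
As a rough illustration of how a current count can be derived by summing over transactions, consider the sketch below. The tuple layout, the direction convention (seed moves from the first stock to the second), and the exact JSON key 'amount' are assumptions; only the list of fields stored in the JSON value comes from the description above.

```python
import json


def current_count(seedlot, transactions):
    """Sum seed amounts over every transaction that touches `seedlot`.

    Each transaction is (from_stock, to_stock, value_json). Seed received
    is added and seed given away is subtracted (assumed convention).
    """
    count = 0
    for from_stock, to_stock, value_json in transactions:
        amount = json.loads(value_json)["amount"]
        if to_stock == seedlot:
            count += amount
        if from_stock == seedlot:
            count -= amount
    return count


# Hypothetical transaction history for two seedlots and one plot.
example_transactions = [
    ("AccessionA", "Seedlot1", '{"amount": 100, "operator": "jdoe"}'),
    ("Seedlot1", "Plot101", '{"amount": 20, "operator": "jdoe"}'),
    ("Seedlot1", "Seedlot2", '{"amount": 30, "operator": "asmith"}'),
]
print(current_count("Seedlot1", example_transactions))
```

Because every change is kept as its own entry, the stored current count can always be recomputed and audited from the transaction history.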

Seedlots can be created one at a time using an interactive interface on Breedbase, or they can be created in bulk by uploading an Excel spreadsheet. Uploading of this Excel spreadsheet also has the effect of updating any existing inventory to match current counts and weights. Breedbase performs this update by creating a new ‘adjustment transaction’ to increment or decrement the current count and current weight to match the values uploaded in the Excel spreadsheet.

To aid researchers in collecting inventory information, Breedbase supports the Inventory Android application (http://wheatgenetics.org/bioinformatics/35-inventory). Using the Inventory application, a researcher needs only to scan a barcode with the name of the seedlot, then collect the weight of the seedlot automatically using a smartphone-connected scale. The output from this application can be directly uploaded into Breedbase and the current inventory will be adjusted by creating a new ‘adjustment transaction’ as described above.

0.5. Crossing Experiments

Critical to plant breeding programs is the generation of new material through crossing experiments. To organize collection of data from crosses performed, Breedbase requires creation of a top-level crossing experiment; the crossing experiment is defined with a unique name, a breeding program, a location, a year, and a description. The individual crosses performed are then stored under the crossing experiment; individual crosses are defined by a unique cross name, a location, a crossing experiment, and a cross type. The cross type can be one of the following: biparental, self, open-pollinated, bulk, bulk selfed, bulk and open-pollinated, doubled haploid, polycross, reciprocal, or multicross. Depending on the type of cross performed, different requirements are needed; for example, in a biparental cross, information from both the male and female parent is required, whereas in an open-pollinated cross, information on only the female is required. In the case of an open-pollinated cross, a population name representing a group of male germplasm can be given as the male parent.

Breedbase tracks parental information from crosses in two ways. The first way is through the germplasm names of the female and male parents, allowing for simple ancestry tracking of AxB pedigrees for the progeny from a cross. Pedigrees will be further described in the following section; however, it is important to note that when a cross is created in Breedbase, the pedigree between progeny germplasm and parental germplasm is automatically created as well. This first form of parental tracking is applied in all cases when a cross is created in Breedbase. The second way is through the plot or plant names of the male and female parents. The plot or plant names of the parents are related to the field trial in which they are planted, as will be described in the following field trial section. This second form of tracking is more detailed; however, it is completely optional because of the difficulty in recording this information in many cases.

Recording information on the parental plots is possible with mobile data collection platforms. Of note are customized Open Data Kit (ODK) Android applications, such as BTract. With BTract, after scanning barcodes to track the precise male and female plots or plants involved in the pollination, the application assigns and prints a unique cross barcode label; through ODK data synchronization, the cross information can be uploaded into Breedbase. Another mobile Android application is the Intercross App, which can be used to scan parental barcodes and associate a unique cross name to the performed cross. The output from Intercross can be uploaded directly into Breedbase. In crossing experiments where evaluation of crosses is performed, Breedbase can store the following optional cross properties: pollination date, tag number, number of flowers, number of bags, number of fruits, and number of seeds. These cross properties are arbitrary and are not found in any controlled vocabulary for practical reasons. The cross properties are stored in a JSON encoded string where the key is one of the listed cross properties, or an entirely new property, and the value is the observed quantity; the JSON encoded string is stored in the stockprop table using the type name ‘crossing_metadata_json’ from the ‘stock_property’ controlled vocabulary. These properties are set in the configuration file for the Breedbase instance, allowing researchers flexibility in defining these terms.
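
A small sketch of how such a JSON string might be updated when new cross property values arrive is given below. The function name and the property keys used in the example are hypothetical; only the idea of a key-value JSON object stored under ‘crossing_metadata_json’ comes from the text.

```python
import json


def update_cross_properties(existing_json, new_values):
    """Merge newly observed cross properties into the stored JSON string.

    `existing_json` may be None or empty for a cross without properties.
    Keys already present are overwritten by the new values.
    """
    props = json.loads(existing_json) if existing_json else {}
    props.update(new_values)
    return json.dumps(props, sort_keys=True)


stored = '{"Number of Flowers": 12}'
print(update_cross_properties(stored, {"Number of Seeds": 40}))
```

Storing the properties as free-form keys is what allows an instance to define entirely new properties without a schema or vocabulary change.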

Crosses can be created one at a time using an interactive interface on Breedbase or they can be uploaded in bulk using an Excel spreadsheet. The uploaded Excel file contains the unique cross name, the cross type, the parents involved, and optionally, values for the evaluated cross properties. Uploading of the spreadsheet has the added action of updating the evaluated cross properties for any existing crosses in the database. After a cross is saved in the database, progeny of the cross can be saved as new germplasm in the database, automatically creating pedigrees for the new germplasm. Progeny germplasm can be interactively created in Breedbase and have unique names templated from the cross name.

Crossing experiments are stored as an entry in the project table and are linked to their breeding program through a project_relationship entry with the type name ‘breeding_program_trial_relationship’ from the ‘project_relationship’ controlled vocabulary; the project entry is linked to entries in the projectprop table in an EAV model to store values for the year and location using type names from the ‘project_property’ controlled vocabulary. Individual cross entries are stored as entries in the stock table using the type name ‘cross’ from the ‘stock_type’ controlled vocabulary; this stock entry is linked to the crossing experiment project via entries in the nd_experiment_stock and nd_experiment_project tables, ultimately linked by an nd_experiment entry with the type name ‘cross_experiment’ from the ‘experiment_type’ controlled vocabulary. A cross is linked to the involved parental germplasm through entries in the stock_relationship table using the type names ‘female_parent’ and ‘male_parent’ from the ‘stock_relationship’ controlled vocabulary for the female and male parents, respectively. The stock_relationship entry linking the cross to the female germplasm parent also contains the cross type in the value field. In the case where plot or plant parental information is known, the cross is linked to the female and male plots using stock_relationship entries with the type names ‘female_plot_of’ and ‘male_plot_of’ from the ‘stock_relationship’ controlled vocabulary, respectively, and the cross is linked to the female and male plants using stock_relationship entries with the type names ‘female_plant_of’ and ‘male_plant_of’ from the ‘stock_relationship’ controlled vocabulary, respectively. Progeny germplasm are created as entries in the stock table using the type name ‘accession’ from the ‘stock_type’ controlled vocabulary. 
Progeny germplasm of the cross are linked to the female and male germplasm using stock_relationship entries with the type names ‘female_parent’ and ‘male_parent’ from the ‘stock_relationship’ controlled vocabulary; these two relationships constitute the pedigree as is described in the next section. Progeny germplasm are also linked to their cross using a stock_relationship entry with the type name ‘offspring_of’ from the ‘stock_relationship’ controlled vocabulary.

0.6. Pedigree

A pedigree in Breedbase is represented by relationships between germplasm stock table entries via the stock_relationship table. It is important to remember that all germplasm, whether they are progeny or parents in a pedigree, are stored in the same way as entries in the stock table with type name of ‘accession’ from the ‘stock_type’ controlled vocabulary. The pedigree stems from the relationships, such that a progeny germplasm is linked to its female and male germplasm parents through entries in the stock_relationship table with type names of ‘female_parent’ and ‘male_parent’ from the ‘stock_relationship’ controlled vocabulary, respectively.
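
The way a pedigree emerges from these relationship entries can be sketched with a short recursive lookup. The tuple order (parent, type name, progeny) is an assumption about how a query result might be arranged, and the nested '(female x male)' output is a simple illustrative format, not Purdy notation.

```python
def parents_of(progeny, relationships):
    """Return (female_parent, male_parent) names for `progeny`, or None if unknown."""
    found = {"female_parent": None, "male_parent": None}
    for parent, type_name, child in relationships:
        if child == progeny and type_name in found:
            found[type_name] = parent
    return found["female_parent"], found["male_parent"]


def pedigree_string(name, relationships, depth=2):
    """Expand `name` into nested '(female x male)' form, up to `depth` generations."""
    female, male = parents_of(name, relationships)
    if depth == 0 or female is None:
        return name  # founder, or expansion limit reached
    female_str = pedigree_string(female, relationships, depth - 1)
    # An unknown male (for example an open-pollinated cross) is shown as NA.
    male_str = pedigree_string(male, relationships, depth - 1) if male else "NA"
    return f"({female_str} x {male_str})"


example_relationships = [
    ("A", "female_parent", "C"),
    ("B", "male_parent", "C"),
    ("P", "female_parent", "A"),
    ("Q", "male_parent", "A"),
]
print(pedigree_string("C", example_relationships))
```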

The pedigree in Breedbase can be queried and transformed into a pedigree string of any format, such as Purdy. More interestingly, the pedigree can be displayed visually and interactively using the BrAPI-Pedigree-Viewer (https://github.com/solgenomics/BrAPI-Pedigree-Viewer) implemented in Breedbase. More information on Breeding API applications (BrAPPs) can be found in the section below.

Pedigrees can be created in Breedbase in three ways. The first way is by uploading a simple Excel file that contains the progeny germplasm name, the female parent germplasm name, the cross type, and optionally, the male parent germplasm name. The cross type here can be bi-parental, open, or self, and the male parent can be a population name representing a group of male germplasm. This method is ideal for filling in pedigrees for large amounts of historical material. The second way is to add the male and female parent to a single germplasm progeny through an interactive interface on Breedbase; this method is ideal for correcting small issues in pedigrees and germplasm management. The third way is to create progeny from crosses, as was described in the previous crossing experiments section. The third method is preferred given that it provides the most descriptive information for the cross underlying the pedigree and potentially provides evaluation metrics for the cross through cross properties.

0.7. Field Trials

Field trials are a central concept in Breedbase and provide a structure for linking phenotypic observations to the experimental layout and to the germplasm information. A field layout represents the germplasm tested in the experiment under a specific experimental design. There are two ways to add a field layout to Breedbase. The first is by using an interactive interface on Breedbase for designing a new field layout, while the second is by uploading an Excel spreadsheet of the field layout.

The first method ensures a path of least resistance for the researcher by performing the design calculation and randomization on Breedbase. To begin designing a new field trial in Breedbase, a breeding program, location, unique trial name, trial type, year, description, and design type must be specified; optionally, the plot width, plot length, and field size can be specified. The trial type is simply a tag for the kind of trial being performed and must be selected from the following terms: phenotyping trial, seedling nursery, clonal evaluation, advanced yield trial, preliminary yield trial, uniform yield trial, variety release trial, regional trial, seed multiplication, screenhouse, or crossing block trial. The design type specifies the calculation and randomization to be performed on the distribution of germplasm in the field experiment and must be one of the following: completely randomized, complete block, alpha lattice, lattice, augmented, modified augmented design, nursery/greenhouse, split plot, partially replicated, or Westcott design. Next, design type parameters must be provided depending on the design type specified. For the completely randomized design, only a list of germplasm and the number of replicates is required. For the complete block design, only a list of germplasm and the number of blocks is required. The alpha lattice design requires a list of germplasm, the number of replicates, and the block size, while the lattice design requires only a list of germplasm and the number of replicates. The modified augmented design requires a list of germplasm, a second list of germplasm to use as checks, and the maximum block size. The nursery/greenhouse design is not plot replicated or randomized and only requires a list of germplasm and input for the number of plants for each germplasm. 
The split plot design requires a list of germplasm, the number of blocks, the number of plants per split-plot factor, and names for all the split-plot factors being tested; the split-plot design is unique in that it will create plots, sub-plots, and plant entries in a nested design and will assign the given split-plot factors as field management factors to the generated sub-plots and plant entries. More on field management factors is found in the section below. The partially replicated design requires a list of unreplicated germplasm, a list of replicated germplasm, the number of rows and columns in the design, a replicate number, a block sequence, and a sub-block sequence. The Westcott design requires a list of germplasm, two germplasm names to use as checks, the number of columns, and the number of columns between the two checks. After providing the design parameters, Breedbase will calculate the design using R libraries, including ‘agricolae’. A Perl library called R::YapRI (https://metacpan.org/pod/R::YapRI) is used to interface with R, providing in-memory translation of data objects between Perl and R. The calculated layout is then interactively drawn on Breedbase and the researcher can review the layout before saving it into the database; it is possible to redo randomizations in this step. During this review process, it is possible, though optional, to add field management factors to the field trial; more on field management factors is in the following section. The review process will also warn whether the designed layout is larger than the specified field size if the plot length, plot width, and field size were provided.
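
To make the simplest case concrete, the sketch below generates a completely randomized design from a list of germplasm and a number of replicates. It is a Python stand-in for illustration only: Breedbase computes designs in R as described above, and the function name, the seed parameter, and the dictionary keys are assumptions.

```python
import random


def completely_randomized_design(germplasm, n_reps, seed=None):
    """Assign every germplasm `n_reps` times to randomly ordered plots.

    Returns a list of plot records numbered from 1 in field order.
    Passing a seed makes the randomization reproducible.
    """
    rng = random.Random(seed)
    entries = [(name, rep) for name in germplasm for rep in range(1, n_reps + 1)]
    rng.shuffle(entries)  # unrestricted randomization across the whole trial
    return [
        {"plot_number": number, "accession": name, "rep_number": rep}
        for number, (name, rep) in enumerate(entries, start=1)
    ]


for plot in completely_randomized_design(["A", "B", "C"], n_reps=2, seed=1):
    print(plot)
```

Calling the function again with a different seed corresponds to redoing the randomization during the review step.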

The second method for creating field trials in Breedbase is by uploading an Excel spreadsheet. This method is useful for uploading experiments that have previously been performed and for uploading experiments for which Breedbase cannot calculate the design. Both methods produce the same result. During upload of the Excel spreadsheet it is also possible to include field management factors as additional columns in the spreadsheet. Also, during upload of the Excel spreadsheet it is possible to indicate the seedlots that were planted in specific plots; when seedlots are provided, a transaction between the seedlot and plot is created to subtract the specified number or weight of seed from the seedlot.

The end result of both methods is a field trial created in Breedbase with defined plots arranged over a number of blocks and replicates. Every plot has a plot number that is unique only to the field trial and increments linearly across all plots; the plot number can be defined in a serpentine or zig-zag fashion. Breedbase offers an interactive field layout for viewing the field trial spatially; this same interactive interface works as a heatmap to overlay phenotypic measurements for each plot in the field trial.
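
The two numbering patterns can be sketched as follows, under the assumption that zig-zag means every row is numbered in the same direction while serpentine means the direction alternates from row to row. The function name and the grid representation are hypothetical.

```python
def plot_number_grid(n_rows, n_cols, serpentine=True):
    """Return plot numbers arranged as a list of field rows.

    Numbers increase by one across all plots. With serpentine=True every
    second row runs in the opposite direction; otherwise each row starts
    again from the same side (zig-zag, as assumed here).
    """
    grid = []
    for row in range(n_rows):
        numbers = list(range(row * n_cols + 1, (row + 1) * n_cols + 1))
        if serpentine and row % 2 == 1:
            numbers.reverse()
        grid.append(numbers)
    return grid


print(plot_number_grid(2, 3))                    # serpentine
print(plot_number_grid(2, 3, serpentine=False))  # zig-zag
```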

If it is desired to phenotype individual plants alongside individual plots, Breedbase can link plant entries to plot entries for the field trial. Plant entries can be added by either simply indicating the number of plants to create for each plot in the field trial or by uploading an Excel file. The uploaded Excel file can be of three forms. The first is a spreadsheet with a column specifying an existing plot name and a column defining the name of the new plant entry; in this way, the uploader can control the number of plants to create for each plot and their new plant name. The second way is using a spreadsheet with a column specifying an existing plot name and a column defining a plant index number; this way will create new plant entries with names that are a concatenation of the plot name and the specified index number, allowing the uploader control of how many plant entries to create per plot and the index numbers to use. The third way is using a spreadsheet where the first column specifies an existing plot name and the second column defines how many plants to create for that plot; in this case, new plant entries are created with names that are concatenations of the plot name and an index number, allowing the uploader control of how many plant entries to create for each plot.
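
The third upload form, where each plot is given a number of plants, can be sketched as below. The '_plant_' separator and the function name are assumptions; the text above only states that the names concatenate the plot name and an index number.

```python
def plant_names(plants_per_plot, separator="_plant_"):
    """Build plant entry names for each plot from a {plot_name: count} mapping."""
    return {
        plot: [f"{plot}{separator}{index}" for index in range(1, count + 1)]
        for plot, count in plants_per_plot.items()
    }


print(plant_names({"trial1_plot101": 2, "trial1_plot102": 1}))
```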

Similarly, if it is desired to collect tissue samples from plants in the field trial, Breedbase can link tissue sample entries to plant entries for the field trial. Tissue samples can be created only after plant entries have been created. Tissue samples can be created by specifying how many tissue samples to create for each plant in the field trial. The user can then define a tissue type for each of the tissue samples they will collect, for example leaf and stem. Breedbase will then create tissue sample entries for every plant entry in the field trial; the new tissue sample names are a concatenation of the plant name, the provided tissue type, and an incrementing tissue sample index number.

By creating plot entries, and optionally plant entries and tissue sample entries, for the field trial, it is then possible to print barcodes from Breedbase and it is possible to associate phenotype values to any of these entities. An interactive barcode label design tool is available on Breedbase, allowing researchers to design reusable custom label formats on an easy drag-and-drop templating interface.

In the case where the field trial is used for crossing experiments, for example if the field trial is a crossing block, it is possible to utilize the plot names, or optionally plant names, to track parentage; this is an additional level of parentage tracking on top of tracking germplasm names, as was discussed in detail in the crossing experiments section.

A field trial is stored as an entry in the project table with entries in the projectprop table following an EAV model; entries in projectprop include the year, location, field size, plot width, plot length, design type, and trial type, with all type names coming from the ‘project_property’ controlled vocabulary. The field trial project entry is connected to the breeding program project entry via the project_relationship table using the type name ‘breeding_program_trial_relationship’ from the ‘project_relationship’ controlled vocabulary. The field trial is created with a link to nd_experiment via the nd_experiment_project linking table; the nd_experiment entry has the type name ‘field_layout’ from the ‘experiment_type’ controlled vocabulary. All plot entries are saved individually in the stock table using the type name ‘plot’ from the ‘stock_type’ controlled vocabulary; the plot entries are all linked to the same field trial nd_experiment entry via the nd_experiment_stock linking table. The plot stock entry is linked to its germplasm stock entry via the stock_relationship table using the type name ‘plot_of’ from the ‘stock_relationship’ controlled vocabulary. Each plot’s stock entry has stockprop entries in an EAV model describing the plot’s plot number, row number, column number, block number, replicate number, and whether the plot contains a check germplasm; all of these stockprop terms are from the ‘stock_property’ controlled vocabulary. Optionally, when plant entries are saved in a field trial they are saved as stock entries with the type name ‘plant’ from the ‘stock_type’ controlled vocabulary; the plant entries have a stockprop entry in an EAV model with the type name ‘plant_index_number’ from the ‘stock_property’ controlled vocabulary. The plant entries also inherit all design related stockprop values from their related plot, such as plot number and block number. 
The plant entries are linked to their germplasm and plot via stock_relationship entries using the type name ‘plant_of’ from the ‘stock_relationship’ controlled vocabulary. Optionally, when tissue sample entries are saved in a field trial they are saved as stock entries with the type name ‘tissue_sample’ from the ‘stock_type’ controlled vocabulary; the tissue sample stock entries are linked to a stockprop in an EAV model using the type name ‘tissue_sample_index_number’ from the ‘stock_property’ controlled vocabulary. The tissue sample entries also inherit all design related stockprop values from their related plot. Field trial tissue sample entries are linked to their related germplasm, plot, and plant entries via stock_relationship entries using the type name ‘tissue_sample_of’ from the ‘stock_relationship’ controlled vocabulary. For ease of determination, additional projectprop entries are added to the field trial indicating whether the field trial contains plant or tissue sample entries, with type names coming from the ‘project_property’ controlled vocabulary.

For performance reasons, a cached representation of the field trial is generated for every field trial layout; the cached layout is stored as a JSON encoded string in a projectprop entry using the type name ‘trial_layout_json’ from the ‘project_property’ controlled vocabulary. The cached layout is a JSON object of objects, where the top-level key is the plot number and the corresponding objects contain all design information, such as block number and replicate, as well as all related plant and tissue sample information for each plot. The cached layout is regenerated when there are changes to the underlying field layout and is used throughout Breedbase, such as when drawing the field map and printing barcode labels.
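
The shape of such a cached layout can be sketched as follows. The key names inside each plot object are assumptions made for illustration; only the overall structure, a JSON object of objects keyed by plot number, comes from the description above.

```python
import json


def build_trial_layout_json(plots):
    """Serialize plot records into a JSON object keyed by plot number."""
    layout = {}
    for plot in plots:
        # JSON object keys are always strings, so the plot number is converted.
        layout[str(plot["plot_number"])] = {
            "plot_name": plot["plot_name"],
            "accession_name": plot["accession_name"],
            "block_number": plot["block_number"],
            "rep_number": plot["rep_number"],
            "plant_names": plot.get("plant_names", []),
        }
    return json.dumps(layout, sort_keys=True)


example_plots = [
    {"plot_number": 101, "plot_name": "trial1_plot101", "accession_name": "A",
     "block_number": 1, "rep_number": 1},
    {"plot_number": 102, "plot_name": "trial1_plot102", "accession_name": "B",
     "block_number": 1, "rep_number": 1,
     "plant_names": ["trial1_plot102_plant_1"]},
]
cached = build_trial_layout_json(example_plots)
print(json.loads(cached)["102"]["plant_names"])
```

Reading one projectprop value and decoding it replaces the many joins across stock, stockprop, and stock_relationship that would otherwise be needed to draw a field map.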

0.8. Field Management Factors

Field management factors are used to track management practices applied to specific plots in a field trial. Currently, field management factors are user defined, though a controlled vocabulary would be ideal; for example, the researcher can specify a fertilizer application named ‘01/01/2018 AX 4kg 10N20P50K’ or can define a weeding treatment called ‘01/02/2018 Hoe Weeding’ and then assign that treatment to specific plots in their field trial. There are three ways to add field management factors to a field trial in Breedbase.

The first way to add a field management factor is during the trial design process in Breedbase, as was mentioned above in the field trial section. During this process, during the review phase prior to saving the field trial into Breedbase, there is an interactive interface for naming field management factors and assigning plot entries to the field management factors.

The second way is during upload of a field trial into Breedbase, as was mentioned above in the field trial section. The Excel spreadsheet can include additional columns naming the field management factors; for each plot that a factor is applied to, a 1 is placed in the cell corresponding to the plot and the field management factor. If the cell is empty, then the field management factor was not applied to that plot.

The third way is by going to the trial detail page of a field trial in Breedbase. There is an interactive interface for naming field management factors and applying them to plots in the field trial. This way can be used at any time on any field trial in Breedbase.

A field management factor is stored as an entry in the project table with connection to the field trial project entry via the project_relationship table using the type name ‘trial_treatment_relationship’ from the ‘project_relationship’ controlled vocabulary. The field management factor is created with an nd_experiment entry linked via the nd_experiment_project table; the nd_experiment entry has the type name ‘treatment_experiment’ from the ‘experiment_type’ controlled vocabulary. The corresponding plot stock entries are linked to this nd_experiment entry via the nd_experiment_stock table.

0.9. Phenotyping Ontologies and Post-Composing

In order to begin collecting phenotypic values for plot entries, or optionally plant and tissue sample entries, a phenotyping ontology must first be created in Breedbase. A phenotyping ontology defines terms that have a unique name, a definition, a unique ontology name and ontology identifier, and a unique ontology-specific term identifier. In Breedbase, an example observation variable from an ontology has the form ‘Plant Height|CO:0000002’ where the words preceding the ‘|’ are the observation variable name, the letters after the ‘|’ constitute the unique ontology identifier, and the numbers after the ‘:’ are the unique ontology-specific identifier. This observation variable name format must be used to upload phenotypes.
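
A minimal sketch of splitting this format into its three parts is shown below. The function and field names are hypothetical and only illustrate the ‘name|ontology:identifier’ convention described above.

```python
from typing import NamedTuple


class ObservationVariable(NamedTuple):
    name: str         # words preceding the '|'
    ontology_id: str  # letters between the '|' and the ':'
    term_id: str      # ontology-specific identifier after the ':'


def parse_observation_variable(text):
    """Split 'Plant Height|CO:0000002' into name, ontology id, and term id."""
    # Split on the last '|' so the name itself is kept whole.
    name, sep, accession = text.rpartition("|")
    if not sep or not name:
        raise ValueError(f"missing '|' or name in {text!r}")
    ontology_id, sep, term_id = accession.partition(":")
    if not sep or not ontology_id or not term_id:
        raise ValueError(f"missing ontology or term identifier in {text!r}")
    return ObservationVariable(name, ontology_id, term_id)


print(parse_observation_variable("Plant Height|CO:0000002"))
```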

Ontologies can be highly disputed in terms of exact naming and definitions, and because of this, Breedbase recommends using ontologies curated by Crop Ontology (http://www.cropontology.org/). These ontologies are often structured into hierarchical categories to group observation variables into semantic clusters, such as observation variables for morphological traits versus observation variables for disease related traits. A common representation of an ontology capturing the relationships between hierarchical terms is the obo format. The obo format defines the term names, definitions, term types, ontology identifiers, and relationships between terms. Of critical importance to Breedbase is that the term type relating an observation variable to its higher level term must be ‘VARIABLE_OF’; this is how actual observation variables are separated from hierarchical categorical terms in the ontology. The obo file can be loaded into Breedbase and the observation variable terms can then be used to phenotype plots, plants, and/or tissue samples in a field trial.
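
To make the role of ‘VARIABLE_OF’ concrete, the sketch below scans obo text and returns only the terms that carry that relationship. It is a simplified reader written for illustration, not the loader Breedbase uses, and the ‘EX’ identifiers in the sample are placeholders.

```python
def variable_term_ids(obo_text):
    """Return ids of [Term] stanzas that declare a VARIABLE_OF relationship."""
    variables = []
    current_id = None
    in_term = False
    for raw_line in obo_text.splitlines():
        # Drop trailing obo comments, which start with '!'.
        line = raw_line.split("!", 1)[0].strip()
        if line.startswith("["):
            in_term = line == "[Term]"
            current_id = None
        elif in_term and line.startswith("id:"):
            current_id = line[len("id:"):].strip()
        elif in_term and line.startswith("relationship:"):
            parts = line[len("relationship:"):].split()
            is_variable = bool(parts) and parts[0] == "VARIABLE_OF"
            if is_variable and current_id and current_id not in variables:
                variables.append(current_id)
    return variables


# Placeholder ontology fragment: one category term and one observation variable.
example_obo = """
[Term]
id: EX:0000001
name: Morphological traits

[Term]
id: EX:0000002
name: Plant Height
relationship: VARIABLE_OF EX:0000001 ! Morphological traits

[Typedef]
id: VARIABLE_OF
name: VARIABLE_OF
"""

print(variable_term_ids(example_obo))
```

Only the second term is reported; the category term has no ‘VARIABLE_OF’ relationship and the typedef stanza is not a term.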

The loading of obo files can only be done through the backend by a curator; this is done in order to restrict the number of terms that are added to an ontology. The reason for this is to strictly avoid fragmentation of phenotypic data under duplicated observation variables. To ease interaction between breeders and data curators, a trait submission portal (https://submit.rtbbase.org/) has been developed, including a queuing system using GitHub; this trait submission system connects directly to Crop Ontology, allowing synchronization of observation variables between Breedbase and the broader ontology community.

An observation variable is ideally defined by a trait indicating the attribute being measured, such as ‘plant height’, a method indicating the process of observation, such as ‘using a ruler’, and a scale indicating the units such as ‘meters’. The ontologies available on Crop Ontology may or may not follow this structural definition, but again, the critical aspect for Breedbase is that the obo ontology indicates observation variables using the ‘VARIABLE_OF’ relationship. The formulation of an observation variable in this fashion allows for reusable name spaces for traits, methods, and scales across ontologies and enables deep phenotypic querying in Breedbase.

As an alternative to loading obo files strictly via the backend, a frontend interactive interface is available on Breedbase for adding observation variables one at a time into the system. This interface must be activated in the Breedbase configuration. The interface first asks the user to select the ontology to which the new observation variable belongs, such as ‘Cassava Observation Variables’; then they define a unique name, such as ‘Plant Height using a ruler in meters’, and a definition for the new observation variable. The next step is to select the trait to which the observation variable belongs, such as ‘Plant height’, from the available trait ontologies in Breedbase; if the trait is not found in the trait ontology selected, then it can be added by defining a unique name and definition. Next, the method, such as ‘Measuring Ruler’, for the observation variable is selected; again, if the method is not found in the method ontology selected, then it can be added by defining a unique name and definition. Finally, the scale, such as ‘Meters’, for the observation variable is selected; if the scale is not found in the scale ontology selected, then it can be added by defining a unique name and definition, and optionally, the minimum, maximum, default, possible enumerated terms, and scale type.

In Breedbase it is often the case that very specific observation variables are required, such that, if the ontology were to contain all observation variables of interest, the ontology would be too large and unwieldy. The primary example of this is the case of collecting metabolic data for tissue samples collected from plant entries in the field trial; the metabolites are measured for tissue samples collected under varying environmental conditions and varying collection times. In this case, the total number of combinations between the possible terms exceeded several million observation variables. Instead of pre-defining all possible terms and saving them into Breedbase, it is possible to post-compose observation variables from defined ontologies; the result in the metabolite example was that small ontologies for metabolites, methods, and scales could be created, and then post-composed into observation variables to record phenotypic data.

To assist in categorizing ontologies, Breedbase uses entries in cvprop to annotate cv entries using type names from the ‘composable_cvtypes’ controlled vocabulary; in this way we can assign an ontology to be a trait ontology indicating it is generally used for observation variables, while another ontology can be tagged as a time ontology indicating it only has time related terms such as ‘month 1’ or ‘month 2’. Regarding time and unit ontologies which are important for most post-composing use-cases, Breedbase recommends using the SGN time and unit ontologies (https://github.com/solgenomics/sgn/tree/master/ontology).

Phenotyping ontology terms are stored in the cvterm table with links to cvterm_relationship entries to represent their place in the ontology hierarchy. When an observation variable is used to annotate a phenotypic record, for example if a plot in a field trial was measured for an observation variable called plant height, then the cvterm entry of the observation variable is linked to the phenotype table entry, as will be discussed in the next section.

0.10. Phenotyping Data

Phenotypic records can be stored in Breedbase as measurements on existing plot, plant, or tissue sample entries in a field trial. Once stored, these phenotypic records power all phenotypic analyses on Breedbase. Prior to storing phenotypes into Breedbase, a researcher can collect phenotypes using a simple Excel or CSV spreadsheet, using the Android Fieldbook App (http://www.wheatgenetics.org/research/software/22-android-field-book), or using a BrAPI enabled application. When using an Excel spreadsheet to collect phenotypes, the researcher should first use Breedbase to download a template for data collection; this template ensures that the observation variables being measured and the plot, plant, or tissue sample observation units already exist in the database. As described in the previous section, observation variables must exist in an ontology in Breedbase, and as described in the field trial section above, plot, plant, or tissue sample entries must exist as part of a field trial in Breedbase. When using the Fieldbook App, the researcher should use Breedbase to download templates to load into Fieldbook; the templates consist of the field trial layout being observed and the observation variables being measured. When using a BrAPI enabled application, such as HIDAP (https://research.cip.cgiar.org/gtdms/hidap/), the system allows users to directly add phenotypic records into Breedbase, while simultaneously using BrAPI to retrieve field trial and observation unit information from Breedbase.

Each phenotypic record is stored as an entry in the phenotype table with its own nd_experiment entry. The nd_experiment entry has the type name ‘phenotyping_experiment’ from the ‘experiment_type’ controlled vocabulary; this entry is linked to the stock entry of the plot, plant, or tissue sample that was observed and to the project entry for the field trial in which the phenotype was measured, using the nd_experiment_stock and nd_experiment_project linking tables, respectively.

For performance reasons, a PostgreSQL materialized view is used when querying for phenotypes in Breedbase; the materialized view is updated whenever new field trials and phenotypic records are added to Breedbase. The materialized view has all observation units, whether they are plots, plants, or tissue samples, listed in rows with columns containing all field trial related information and a column named ‘observations’ containing all phenotypic records for the observation unit; the ‘observations’ column is a PostgreSQL JSONB field structured as an array of objects, where each object contains the observation variable name, the phenotypic value, the timestamp when the phenotype was collected, the timestamp when the phenotype was stored in Breedbase, and the person responsible. PostgreSQL can directly query keys from arrays of objects in JSON or JSONB formatted fields, allowing fast querying of millions of phenotypic records. Almost all phenotyping related queries utilize the materialized view for very fast data retrieval.
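
The structure of the ‘observations’ column can be pictured with the sketch below, which decodes one such array and filters it by observation variable. The key names (`trait_name`, `value`, `collect_date`, `create_date`, `operator`) and all sample values are assumptions; the text above lists what each object contains but not how the keys are spelled, and in Breedbase the filtering happens inside PostgreSQL rather than in application code.

```python
import json

# Hypothetical 'observations' value for one observation unit (one row of the view).
observations_json = """
[
  {"trait_name": "Plant Height|CO:0000002", "value": "1.8",
   "collect_date": "2018-06-01", "create_date": "2018-06-03", "operator": "jdoe"},
  {"trait_name": "Leaf Count|CO:0000003", "value": "14",
   "collect_date": "2018-06-01", "create_date": "2018-06-03", "operator": "jdoe"},
  {"trait_name": "Plant Height|CO:0000002", "value": "2.1",
   "collect_date": "2018-07-01", "create_date": "2018-07-02", "operator": "jdoe"}
]
"""


def values_for(observations_text, variable_name):
    """Return the phenotypic values recorded for one observation variable."""
    observations = json.loads(observations_text)
    return [
        obs["value"] for obs in observations if obs["trait_name"] == variable_name
    ]


print(values_for(observations_json, "Plant Height|CO:0000002"))
```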

0.11. Genotyping Plates

Genotyping plates in Breedbase represent 96 or 384 well plates. A genotyping plate is defined with a unique plate name, a genotyping facility, a genotyping project name, a plate format, a sample type, a breeding program, a location, a year, and a description. The genotyping facility can be Cornell IGD, Intertek, or BGI; the plate format can be 96 or 384 wells; and the sample type can be DNA, RNA, or Tissue. Each well in the plate has its own unique name, a row and column position, and a source observation unit name; the source observation unit name is the plot, plant, or tissue sample name of the field trial entity from which the sample in the well originated. Optionally, each well can have its own NCBI taxonomy ID, operator name, notes, tissue type, extraction method, concentration, volume, and whether the well is blank.
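
For reference, the row and column positions of the two plate formats can be enumerated as below, using the standard 8 by 12 layout for 96 wells and 16 by 24 layout for 384 wells. Zero-padding the column number (‘A01’ rather than ‘A1’) is an assumption of this sketch.

```python
import string

# Standard microplate dimensions as (rows, columns).
PLATE_DIMENSIONS = {96: (8, 12), 384: (16, 24)}


def well_positions(plate_format):
    """List well positions row by row, e.g. 'A01' ... 'H12' for a 96 well plate."""
    rows, columns = PLATE_DIMENSIONS[plate_format]
    return [
        f"{string.ascii_uppercase[row]}{column:02d}"  # zero-padding is assumed
        for row in range(rows)
        for column in range(1, columns + 1)
    ]


positions = well_positions(96)
print(len(positions), positions[0], positions[-1])
```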

There are three methods for adding genotyping plates into Breedbase. The first method is by using an Excel spreadsheet; each row in the spreadsheet represents a well and defines its unique name, row and column position, and the source observation unit name. Optionally, the properties listed above can be included. The second method is using the Android Coordinate Application (http://www.wheatgenetics.org/research/software/89-coordinate); using this app, a researcher can scan the barcode of the source observation unit name and indicate the position of the well on the plate. The Coordinate App allows for interactively defining the layout of the genotyping plate and automatically assigns a unique name for each well; its output can be directly uploaded into Breedbase. The third method is using Breedbase to design the layout of the plate. In this method, the researcher provides a list of source observation unit names, whether they are plots, plants, or tissue samples in a field trial. Then they can optionally define if any well must be blank, and they can define all other optional well properties mentioned above; Breedbase will then design and save a plate layout for the new genotyping plate, which can be downloaded and assembled later by the researcher.

Once the genotyping plate is saved in Breedbase, it can be downloaded on demand in two genotyping vendor specific formats: Intertek and Dart. These vendors are not working on becoming BrAPI compliant at the time of writing; however, a genotyping plate layout downloaded from Breedbase can be e-mailed or ground-mailed to these genotyping facilities along with the physical genotyping plate. Alternatively, the stored genotyping plate can be transferred to the genotyping facility from Breedbase via BrAPI. This requires that the genotyping facility has a BrAPI enabled information system that can accept genotyping plates for submission through the BrAPI specification. Cornell IGD has been actively implementing this connection to Breedbase, allowing a streamlined and seamless interface for sending genotyping plates for processing.

When Breedbase exports a genotyping plate layout, either through a file download or via automated protocols, the sample names are exported using the following convention: ‘well sample name|||germplasm name’. This is the identifier with which the genotyping vendor associates the generated genotyping data, and under which the genotyping data is returned to the researcher. The reason for this convention, rather than using only the well sample name, is that researchers often like to perform analyses on the genotyping data prior to uploading it back into Breedbase; when researchers perform analyses on the genotyping data directly, they prefer to use germplasm names because they are easily identifiable. However, when the genotyping data is loaded into Breedbase, the genotyping data is associated to the well sample name; in this way Breedbase maintains maximum data integrity. Exporting the sample names using the described convention allows both use cases to occur on the genotyping data when it is returned from the genotyping vendor.
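
The export convention can be sketched with a pair of helper functions. The function names and sample names are hypothetical; only the ‘|||’ separator comes from the text above.

```python
SEPARATOR = "|||"


def export_sample_name(well_sample_name, germplasm_name):
    """Build the exported identifier 'well sample name|||germplasm name'."""
    return f"{well_sample_name}{SEPARATOR}{germplasm_name}"


def split_exported_name(exported):
    """Return (well sample name, germplasm name); germplasm is None if absent."""
    well, sep, germplasm = exported.partition(SEPARATOR)
    if not sep:
        # A plain name with no separator, e.g. a file that uses only one name.
        return exported, None
    return well, germplasm


name = export_sample_name("plate1_A12", "accessionX")
print(name)
print(split_exported_name(name))
```

An analysis script can take the second element to label samples by germplasm, while an upload step can take the first to attach data to the well.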

The genotyping plate is stored as an entry in the project table with EAV entries in the projectprop table for year, location, genotyping facility, plate format, sample type, and an indicator for whether the genotyping plate was submitted to the genotyping facility; the projectprop entries are saved using type names from the ‘project_property’ controlled vocabulary. The genotyping plate’s project entry is linked to the breeding program via the project_relationship table using the type name ‘breeding_program_trial_relationship’ from the ‘project_relationship’ controlled vocabulary. The genotyping plate’s project entry is linked to a single nd_experiment entry via the nd_experiment_project linking table; the type name of the nd_experiment entry is ‘genotyping_layout’ from the ‘experiment_type’ controlled vocabulary. The genotyping project name is stored in an entry of the nd_experimentprop table using the type name ‘genotyping_project_name’ from the ‘nd_experiment_property’ controlled vocabulary. The wells in the genotyping plate are stored as entries of the stock table using the type name ‘tissue_sample’ from the ‘stock_type’ controlled vocabulary; in an EAV model, entries in the stockprop table record the well row e.g. ‘A’, the well column e.g. ‘12’, the well position e.g. ‘A12’, concentration, volume, extraction method, the extraction person, the tissue type, notes, the acquisition date, and whether the well is blank, all under type names from the ‘stock_property’ controlled vocabulary. Each well stock entry is linked to the same nd_experiment entry described previously via the nd_experiment_stock table. The well stock entries are linked to their source observation stock entry via the stock_relationship table using the type name ‘tissue_sample_of’ from the ‘stock_relationship’ controlled vocabulary; the tissue sample stock entry is always linked to its associated germplasm stock entry, and is further linked to its associated source field trial plot, plant, and tissue sample entry using additional stock_relationship entries.

For performance reasons, in a manner similar to what was described for field trials above, the genotyping plate is cached as a JSON formatted string via an entry in the projectprop table. The JSON formatted string is an object of objects, where the top-level key in the object is the well position in the genotyping plate, e.g. ‘A12’, and the value object contains all layout information for that specific well, e.g. all the aforementioned stockprop and stock_relationship elements. The cached layout is regenerated whenever an element of the genotyping plate is changed and allows fast display and download of the genotyping plate layout.

0.12. Genotyping Data Projects and Protocols

Breedbase can store high density genotyping data; to describe how, it is useful to start with the genotyping data project. For clarity, the genotyping plate described above is used for managing the 96 or 384 well plates which are then sent to the genotyping vendor, and once the genotyping data is generated and returned, the genotyping data can be stored in Breedbase. The genotyping data project is an entry stored in the project table and serves the purpose of grouping genotyping data into an easily queried structure. The genotyping data project is defined simply with a unique name, a description, a year, a genotyping facility, a location, and a breeding program; the year, genotyping facility, and location are stored as entries in the projectprop table in an EAV model using type names from the ‘project_property’ controlled vocabulary. The genotyping data project is linked to its breeding program via the project_relationship table using the type name ‘breeding_program_trial_relationship’ from the ‘project_relationship’ controlled vocabulary.

A genotyping protocol must be defined before the genotyping data can be stored in Breedbase. To define a genotyping protocol the researcher provides a unique name, a description, the reference genome name, the species of the samples, and a location where the data was generated. The genotyping protocol encompasses information about all of the markers which were genotyped on a set of samples, including their name, base pair position, chromosome number, reference allele, alternate alleles, quality score, filter information, additional information, and their scoring format; these information fields are identical to those found in the VCF specification, on which the Breedbase genotyping storage is modeled. The genotyping protocol is filled directly from the uploaded VCF or Intertek genotyping data results file; thereby, the uploader only needs to provide the unique protocol name, description, reference genome name, species of the sample, a location, and the VCF or Intertek formatted genotyping data file.

The genotyping protocol is stored in the nd_protocol table with a single entry in the nd_protocolprop table via an EAV model. The entry in the nd_protocolprop table uses the type name ‘vcf_map_details’ from the ‘protocol_property’ controlled vocabulary; this entry has a JSON encoded string containing all information about the genotyping protocol, including all information about the markers in the genotyping protocol. In the next section, this JSON encoded string is described in detail.

Once the high density genotyping data is returned from the genotyping vendor in either a VCF format or the custom Intertek format, it can be loaded into Breedbase. The researcher needs only to specify the genotyping data project and the genotyping protocol name under which to store the genotyping data results in Breedbase, as described above. How the high density genotyping data is then stored in the database is described in the following section.

0.13. Genotyping Data

To store high density genotyping data, Breedbase relies on the JSON features of PostgreSQL. The genotype data is uploaded into Breedbase using the VCF format or a custom Intertek format, and is stored under a genotyping data project and a genotyping protocol as described above. The preferred method for handling genotyping data in Breedbase is: 1) store the genotyping plate in Breedbase; 2) send the genotyping plate layout along with the physical genotyping plate to the genotyping vendor; 3) upload the returned genotyping data into Breedbase. Because of historic methods where the genotyping plate was not loaded into Breedbase prior to genotyping, Breedbase still allows genotyping data to be uploaded and associated simply to germplasm names instead of genotyping plate wells. In other words, the genotyping data is linked directly to the stock table entry of either the associated genotyping plate well sample or a simple germplasm name; which one depends on the sample names in the returned genotyping data file, that is, whether the file uses identifiers for individual wells in the genotyping plate, as discussed above in the genotyping plate section, or simple germplasm names.

Each of the stock table entries that were genotyped, whether they are genotyping plate well samples or germplasm names, is linked to an entry in the genotype table via the nd_experiment_stock and nd_experiment_genotype tables, ultimately linked by an entry in the nd_experiment table using the type name ‘genotyping_experiment’ from the ‘experiment_type’ controlled vocabulary. Each stock table entry is linked to its own nd_experiment entry, in a manner similar to how the phenotyping data is saved, as described above; the nd_experiment entry is also linked to the relevant nd_protocol and project entries via the nd_experiment_protocol and nd_experiment_project tables, respectively, linking the relevant genotyping protocol and genotyping data project. The entry in the genotype table is linked to an entry in the genotypeprop table via an EAV model; the entry in genotypeprop is a JSON formatted string stored using the type name ‘vcf_snp_genotyping’ from the ‘genotype_property’ controlled vocabulary.

The two key JSON formatted objects for storing high-density genotyping data in Breedbase are an entry for the entire genotyping protocol in the nd_protocolprop table and an entry in the genotypeprop table for each of the genotyped samples. The entry in the nd_protocolprop table is a complex object with the following top-level keys: ‘reference_genome_name’, ‘species_name’, ‘header_information_lines’, ‘sample_observation_unit_type_name’, ‘marker_names’, ‘markers’, ‘markers_array’. The ‘reference_genome_name’ key stores a string value of the user defined reference genome e.g. ‘Manihotv6.1’. The ‘species_name’ key stores a string value of the organism species name that the samples belong to; this species name must already exist in the organism table of the database and is stored only for convenience. The ‘header_information_lines’ key stores an array value of the commented header information lines from the uploaded VCF or Intertek formatted file. The ‘sample_observation_unit_type_name’ key stores a string value that is either ‘accession’ or ‘tissue_sample’ and is used only to distinguish whether the genotyping protocol was used to sample germplasm names or genotyping plate tissue sample wells, as was discussed above. The ‘marker_names’ key stores an array value of all the marker names involved in the genotyping protocol. The ‘markers’ key stores a value that is an object of objects; the top-level key is the marker name and the corresponding object contains key value pairs for the chromosome, base pair position, comma separated alternate alleles, reference allele, quality, filter information, marker summary information, and marker score format information. These marker information fields are taken directly from the VCF format model. The ‘markers_array’ key stores the same information as the ‘markers’ key, but in a format suited for certain JSON formatted queries in PostgreSQL; the value is an array of objects where each object contains information about a single marker e.g. the chromosome, position, and other fields mentioned previously.
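
The protocol object described above can be sketched by assembling it from a tiny VCF. The seven top-level keys are the ones named in the text; the per-marker key names (`name`, `chrom`, `pos`, `ref`, `alt`, `qual`, `filter`, `info`, `format`), the function name, and the example data are assumptions for illustration, and this is not the Breedbase upload code.

```python
import json


def build_protocol_object(vcf_text, reference_genome_name, species_name, unit_type):
    """Assemble a protocol object from VCF text (per-marker key names assumed)."""
    header_lines, markers, markers_array = [], {}, []
    for line in vcf_text.splitlines():
        if line.startswith("##"):
            header_lines.append(line)  # commented header information lines
        elif line.startswith("#") or not line.strip():
            continue  # skip the column header line and blank lines
        else:
            chrom, pos, name, ref, alt, qual, flt, info, fmt = line.split("\t")[:9]
            marker = {"name": name, "chrom": chrom, "pos": pos, "ref": ref,
                      "alt": alt, "qual": qual, "filter": flt, "info": info,
                      "format": fmt}
            markers[name] = marker        # keyed lookup by marker name
            markers_array.append(marker)  # same data as an array of objects
    return {
        "reference_genome_name": reference_genome_name,
        "species_name": species_name,
        "header_information_lines": header_lines,
        "sample_observation_unit_type_name": unit_type,
        "marker_names": list(markers),
        "markers": markers,
        "markers_array": markers_array,
    }


EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "sampleA", "sampleB"]),
    "\t".join(["1", "100", "S1_100", "A", "G", "50", "PASS", ".", "GT:DP",
               "0/1:12", "1/1:8"]),
    "\t".join(["1", "200", "S1_200", "C", "T,G", "40", "PASS", ".", "GT:DP",
               "0/0:10", "0/2:9"]),
])

protocol = build_protocol_object(EXAMPLE_VCF, "Manihotv6.1", "Manihot esculenta",
                                 "tissue_sample")
print(json.dumps(protocol["marker_names"]))
```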

For each genotyped sample there is an entry in the genotypeprop table and the value is a JSON formatted object of objects. The top-level key is the marker name and the corresponding object value contains all genotype score information for that marker and sample; the genotype score information is stored as simple key value pairs in the corresponding object where keys come directly from the format field in the VCF specification e.g. GT, DP, GQ, etc. During genotype upload of a VCF into Breedbase, the genotypeprop table value entry is constructed by encoding all genotype score information for a sample into the JSON object of objects. During genotype upload of an Intertek custom genotyping file, only the GT key is populated for all markers in the genotypeprop value JSON object of objects.
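
A minimal sketch of this encoding is given below: for each sample column of a VCF, the format keys are paired with the colon separated values to form one object per marker. The function name and example data are hypothetical, and the real upload handles many details omitted here.

```python
def genotype_objects_by_sample(vcf_text):
    """Return {sample name: {marker name: {format key: value}}} from VCF text."""
    samples = []
    genotypes = {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip header information lines and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            genotypes = {sample: {} for sample in samples}
            continue
        marker_name = fields[2]
        format_keys = fields[8].split(":")  # e.g. ['GT', 'DP']
        for sample, call in zip(samples, fields[9:]):
            # Pair each format key with this sample's value for the marker.
            genotypes[sample][marker_name] = dict(zip(format_keys, call.split(":")))
    return genotypes


EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "sampleA", "sampleB"]),
    "\t".join(["1", "100", "S1_100", "A", "G", "50", "PASS", ".", "GT:DP",
               "0/1:12", "1/1:8"]),
    "\t".join(["1", "200", "S1_200", "C", "T", "40", "PASS", ".", "GT:DP",
               "0/0:10", "0/1:9"]),
])

by_sample = genotype_objects_by_sample(EXAMPLE_VCF)
print(by_sample["sampleA"])
```

Each value of `by_sample` corresponds to what would be stored for one sample; for an Intertek file only the GT key would be present in each marker object.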

0.14. BrAPPs/BrAPI Interfacing

The Breeding Application Programming Interface (BrAPI) defines standard data objects and methods for exchanging data. This specification lends itself to a wide range of use cases, including the transfer of phenotypic information from mobile data collection applications into Breedbase and the transfer of genotypic data from Breedbase into visualization software such as Flapjack. BrAPI enables a streamlined experience for a researcher by cutting out any manual data transfer steps and increases the efficiency and integrity of analyses.

The aforementioned Android applications of Fieldbook, Intercross, and Inventory should all interface directly with Breedbase via BrAPI, allowing, for example, a researcher to query field layout and observation variables from Breedbase into Fieldbook, collect phenotypic measurements using Fieldbook, and then upload the phenotypic measurements into Breedbase.

Desktop applications for analysis such as Flapjack or Helium should be able to query genotypic and pedigree information from Breedbase via BrAPI. Web-applications are the most obvious application for interfacing with Breedbase via BrAPI; for example, Breedbase could interface with other database systems to compare and enrich the data within Breedbase. To accelerate adoption of BrAPI, several BrAPI applications (BrAPPs) have been written; these BrAPPs are standalone JavaScript applications that can be embedded in any web service and they provide a user interface for accessing, analyzing, and displaying information from BrAPI enabled databases such as Breedbase.