User Manual for
T A S S E L
-Trait Analysis by aSSociation, Evolution and Linkage
Version 3
The Buckler Lab at Cornell University
(December 22, 2011)
www.maizegenetics.net/tassel
Disclaimer: While the Buckler Lab at Cornell University has performed extensive testing and results are,
in general, reliable, correct or appropriate results are not guaranteed for any specific set of data. It is
strongly recommended that users validate TASSEL results with other software.
Further help: Additional help is available beyond this document. Users are welcome to report bugs and
request new features through the TASSEL website. Questions may also be sent to our current team
members. For quicker and more precise answers, please address your questions to the most pertinent
person:
Tassel User Group
(recommended)
General Information
http://groups.google.com/group/tassel
tassel@googlegroups.com
Ed Buckler (Project leader)
Data import, GDPC, Pipeline
Statistical analysis
esb33@cornell.edu
Terry Casstevens
tmc46@cornell.edu
Peter Bradbury
pjb39@cornell.edu
Zhiwu Zhang
zz19@cornell.edu
Contributors: Yogesh Ramdoss, Michael E. Oak, Karin J. Holmberg, N. Stevens, and Yang Zhang.
The TASSEL project is supported by the National Science Foundation and the USDA-ARS.
Main Web Site: http://www.maizegenetics.net/tassel
Open source code: http://sourceforge.net/projects/tassel
A modified version of the PAL library is used: http://www.cebl.auckland.ac.nz/pal-project
Database access is provided by the GDPC middleware: http://www.maizegenetics.net/gdpc
Table of Contents
INTRODUCTION
1 GETTING STARTED
  1.1 INSTALLATION
    1.1.1 WEB START
    1.1.2 STAND-ALONE
    1.1.3 OPEN SOURCE CODE
  1.2 PANELS
2 DATA MODE
  2.1 GDPC
  2.2 LOAD
    2.2.1 BLOB
    2.2.2 HAPMAP
    2.2.3 PLINK
    2.2.4 FLAPJACK
    2.2.5 POLYMORPHISM
    2.2.6 PHYLIP
    2.2.7 NUMERICAL DATA
    2.2.8 SQUARE NUMERICAL MATRIX
    2.2.9 GENETIC MAP
  2.3 EXPORT
  2.4 SITES
  2.5 SITE NAMES
  2.6 TAXA
  2.7 TRAITS
  2.8 IMPUTE SNPS
  2.9 TRANSFORM
    2.9.1 GENOTYPE NUMERICALIZATION
    2.9.2 TRANSFORM AND/OR STANDARDIZE DATA
    2.9.3 IMPUTE PHENOTYPE
    2.9.4 PCA
  2.10 SYNONYMIZE TAXA NAMES
  2.11 UNION JOIN
  2.12 INTERSECTION JOIN
3 ANALYSIS MODE
  3.1 DIVERSITY
  3.2 LINKAGE DISEQUILIBRIUM
  3.3 CLADOGRAM
  3.4 SNP EXTRACT
  3.5 KINSHIP
  3.6 GENERAL LINEAR MODEL
  3.7 MIXED LINEAR MODEL
  3.8 RIDGE REGRESSION
4 RESULT MODE
  4.1 TABLE
  4.2 TREE PLOT
  4.3 2D PLOT
  4.4 LD PLOT
  4.5 CHART
5 MENUS
  5.1 FILE MENU
    5.1.1 SAVE DATA TREE
    5.1.2 OPEN DATA TREE
    5.1.3 SAVE DATA TREE AS…
    5.1.4 OPEN DATA TREE…
    5.1.5 SAVE SELECTED AS…
  5.2 CONTINGENCY TEST
  5.3 PREFERENCES
6 TUTORIAL
  6.1 MISSING PHENOTYPE IMPUTATION
  6.2 PRINCIPAL COMPONENT ANALYSIS
  6.3 ESTIMATION OF KINSHIP USING GENETIC MARKERS
  6.4 ASSOCIATION ANALYSIS USING GLM
  6.5 ASSOCIATION ANALYSIS USING MLM
  6.6 IMPORTING DATA FROM A DATABASE (VIA GDPC)
    6.6.1 CONNECTING WITH A DATABASE
    6.6.2 DATA QUERY
    6.6.3 IMPORTING GDPC DATA INTO TASSEL
    6.6.4 SAVING GDPC QUERY RESULTS
7 APPENDIX
  7.1 NUCLEOTIDE CODES (DERIVED FROM IUPAC)
  7.2 TASSEL TUTORIAL DATA SETS
  7.3 BIOGRAPHY OF TASSEL
  7.4 FREQUENTLY ASKED QUESTIONS
    1. WHAT DO I DO IF TASSEL MISBEHAVES?
    2. WHERE DO I TURN FOR MORE INFORMATION?
    3. HOW DO I JOIN THE FUN: TASSEL ON SOURCEFORGE?
    4. HOW DO I CHANGE THE AMOUNT OF MEMORY USED? WHAT DO I DO WHEN THE “EXCEPTION JAVA.LANG.OUTOFMEMORYERROR” APPEARS?
    5. WHEN I CLICK ON THE MOST CURRENT VERSION OF TASSEL WEB START, A PREVIOUS VERSION APPEARS. WHAT SHOULD I DO?
    6. WHAT SHOULD I SUBSTITUTE FOR MISSING VALUES IN TASSEL?
    7. IS IT POSSIBLE TO CHANGE DATA NAMES IN THE DATA TREE?
    8. HOW CAN I CREATE A TASSEL ICON ON DESKTOP?
    9. WHY DO I GET EMPTY SQUARES IN MLM ASSOCIATION ANALYSIS?
    10. WHY SHOULD I EXCLUDE ONE COLUMN OF THE POPULATION STRUCTURE?
    11. CAN KINSHIP REPLACE POPULATION STRUCTURE?
    12. WHY DO TASSEL AND SPAGEDI GIVE DIFFERENT KINSHIP ESTIMATES?
    13. CAN I GET MARKER R SQUARE USING SAS PROC MIXED OR TASSEL MLM?
    14. DOES MLM FIND MORE ASSOCIATIONS THAN GLM?
    15. DO I NEED MULTIPLE TEST CORRECTION FOR THE P VALUE FROM TASSEL?
    16. CAN TASSEL HANDLE DIPLOID GENOTYPE DATA?
    17. HOW TO CITE TASSEL?
REFERENCES
INDEX
INTRODUCTION
While TASSEL has changed considerably since its initial public release in 2001, its primary function
continues to be providing tools to investigate the relationship between phenotypes and genotypes [1]. As
indicated by its title – Trait Analysis by aSSociation, Evolution and Linkage – TASSEL has multiple
functions, including association studies, evaluation of evolutionary relationships, analysis of linkage
disequilibrium, principal component analysis, cluster analysis, missing data imputation and data
visualization.
One of the design elements driving TASSEL development has been the need to analyze ever larger sets of
data [2]. For example, the MLM (mixed linear model) function for association analysis originally used an
EM (expectation-maximization) algorithm, which is a common method for solving mixed models but is
relatively slow. Subsequently, developers implemented the EMMA algorithm to increase computing
speed [3]. Model compression was then added to improve both speed and statistical power for association
studies [4]. Another technique, which estimates the variance components once and then reuses those
estimates to test each marker, makes it practical to screen the large numbers of markers used in
genome-wide association studies (GWAS). The method was described independently in 2010 by Zhang et al.,
who named it P3D [4], and by Kang et al., who named it EMMAX [5].
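The idea of estimating the variance components once and reusing them for every marker can be written compactly. The sketch below uses a generic mixed-model notation that is assumed here for illustration (it is not taken from this manual): y is the phenotype vector, X the fixed-effect covariates, m_j the genotype of marker j, Z an incidence matrix and K the kinship matrix.

```latex
% Notation is assumed for illustration; it is not taken from this manual.
\begin{align*}
\text{Model:}\quad
  & y = X\beta + m_j\alpha_j + Zu + e, \qquad
    \operatorname{Var}(u) = \sigma_g^2 K, \quad
    \operatorname{Var}(e) = \sigma_e^2 I \\
\text{Step 1 (once, without a marker term):}\quad
  & \hat{V} = \hat{\sigma}_g^2\, Z K Z^{\top} + \hat{\sigma}_e^2\, I \\
\text{Step 2 (for each marker } j\text{):}\quad
  & D_j = [\, X \;\; m_j \,], \qquad
    \begin{pmatrix} \hat{\beta} \\ \hat{\alpha}_j \end{pmatrix}
    = \left( D_j^{\top} \hat{V}^{-1} D_j \right)^{-1} D_j^{\top} \hat{V}^{-1} y
\end{align*}
```

Because the estimated covariance matrix is fixed after Step 1, each marker test requires only a generalized least squares solve rather than a new iterative fit of the variance components.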
TASSEL was designed for a wide range of users, including those not expert in statistics or computer
science. A GWAS using the mixed linear model method to incorporate information about population
structure [6-8] and cryptic relationships [9] can be performed in a few steps by “clicking” on the proper
choices in the graphical interface. All the processes necessary for the analysis are performed
automatically, including importing phenotypic and genotype data, imputing missing data (phenotype or
genotype), filtering markers on minor allele frequency, generating principal components and a kinship
matrix to represent population structure and cryptic relationships, optimizing compression level and
performing GWAS.
The command-line version of TASSEL, called the Pipeline, provides users the ability to program tasks
using a script instead of the graphic user interface (GUI). This feature allows researchers to define tasks
using a few lines of code and provides the ability to use TASSEL as part of an analysis pipeline or to
perform simulation studies.
Due to the increasing availability of open data sources, TASSEL utilizes a data browser from the
Genomic Diversity and Phenotype Connection (GDPC) project [10] to provide an interface to relational
databases. As a result, TASSEL users can access any data source that provides a GDPC service. Using
this middleware, which provides a common graphical interface, TASSEL users can avoid writing SQL
queries to access data. Currently, GDPC provides connections to Panzea, Gramene, Germinate, and GRIN
(USDA’s Germplasm Resources Information Network).
TASSEL is written in Java, thereby enabling its use with virtually any operating system. It can be
installed using Java Web Start technology by simply clicking on a link at www.maizegenetics.net/tassel.
A stand-alone version of TASSEL can also be downloaded to use in pipeline mode or in any situation
where the user wishes to start the software from a command line.
1 Getting Started
A quick way to get started using TASSEL is to load the tutorial data and try performing analyses.
However, because some of the necessary steps may not be intuitive, we recommend that new users follow
the tutorial at the end of this manual. The objective of this section is to provide the information necessary to
install and start TASSEL software and to provide a brief overview of the interface.
Most functions are organized into three modes (Data, Analysis and Results) which correspond to the first
three buttons on the TASSEL interface as shown below. Clicking one of these buttons changes the
functions represented by the second row of buttons. Those three modes are described in detail in the
subsequent sections of this manual. The screen shot shows TASSEL after the tutorial files have been
loaded.
1.1 Installation
The graphical version of TASSEL can be installed in one of three ways: using Java Web Start, as a
stand-alone application, or from the source code.
1.1.1 Web start
TASSEL can be installed using Java Web Start technology, which automatically checks for the most
recent version of TASSEL each time the application is executed. In addition, Java Web Start will ensure
that the correct version of the Java Runtime Environment is running, thus avoiding complicated
installation and upgrade procedures. Users should use Web Start unless they have a specific reason to use
one of the other installation methods.
To begin, Java Web Start (JWS) must be installed before TASSEL. JWS is included as part of the
Java Runtime Environment (JRE) 5.0 and above. Most PCs and Macs will already have JWS
installed. If you need to install Java, the most recent version is available at http://www.java.com.
The easiest way to tell if it is installed on your computer is to try running TASSEL from the following
link:
http://www.maizegenetics.net/tassel
If you will be using TASSEL frequently and would prefer to launch the application from your desktop
rather than by revisiting the website, Java Web Start can be used to launch TASSEL manually each time
and/or to create a shortcut. On Windows, open the Java Application Cache Viewer by going to Start >
Settings > Control Panel > Java. From the General tab, click on Settings in the Temporary Internet Files
section and then click on View Applications…; the Java Application Cache Viewer will appear.
(Alternatively, go to Start > Run and type in javaws.) The TASSEL icon should now
be visible and can be used to launch the application. Shortcuts can be created from the menu of the Java
Application Cache Viewer: Application > Install Shortcuts.
1.1.2 Stand-alone
Downloading a “stand-alone” version is recommended for anyone who has a slow Internet connection.
While Java Web Start is a very good way of deploying software, it does not ask the user before attempting
to download updates. Thus, a slow Internet connection may start a download process that requires an
unreasonable amount of time to complete. If you are not interested in disabling your network connection
each time before starting TASSEL, we recommend downloading the stand-alone version which does not
attempt to update the program. However, given that TASSEL is a Java application, a Java Runtime
Environment (version 1.6.0 or greater) is still required. To get the stand-alone version, download
tassel3.0_standalone.zip from the TASSEL web site and unzip it. To run the stand-alone version, double-click
on the JAR file (sTASSEL.jar). Alternatively, from a command prompt (in Windows go to Start > Run and type
in “cmd” or “command”), change into the tassel3.0_standalone directory and execute the command for your platform:
start_tassel.bat (For Windows)
start_tassel.pl (For UNIX)
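Before launching the stand-alone version, it can help to confirm that the installed Java meets the 1.6.0 requirement stated above. The following Python sketch is one way to do that; it is not part of the TASSEL distribution, the function names are hypothetical, and it assumes that a `java` executable is on the search path and reports its version in the usual `version "x.y..."` form.

```python
import re
import subprocess


def parse_java_version(text):
    """Return (major, minor) parsed from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', text)
    if match is None:
        return None
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return (major, minor)


def meets_requirement(version, required=(1, 6)):
    """True if the parsed version is at least the required (major, minor)."""
    return version is not None and version >= required


def installed_java_version(java="java"):
    """Run `java -version` and parse its output; None if Java cannot be run."""
    try:
        result = subprocess.run([java, "-version"],
                                capture_output=True, text=True)
    except OSError:
        return None
    # The version banner is normally written to stderr, so check both streams.
    return parse_java_version(result.stderr + result.stdout)


if __name__ == "__main__":
    found = installed_java_version()
    if meets_requirement(found):
        print("Java %d.%d found; the 1.6.0 requirement is met." % found)
    else:
        print("Java 1.6.0 or greater was not found; see http://www.java.com.")
```

Tuple comparison handles both the older numbering scheme (1.6, 1.7, 1.8) and the newer one (9, 11, 17), since any major version of 2 or more compares greater than (1, 6).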
1.1.3 Open source code
Open source code for the TASSEL software package is available at: http://sourceforge.net/projects/tassel.
The package uses a number of other libraries that are included in the TASSEL distribution. These include
a modified version of the PAL library (http://www.cebl.auckland.ac.nz/pal-project/), the COLT library
(http://dsd.lbl.gov/~hoschek/colt/), and jFreeChart (http://www.jfree.org/jfreechart/). GDPC middleware
(http://www.maizegenetics.net/gdpc) provides database access.
1.2 Panels
TASSEL is organized into five main panels. (1) The Control Panel at the top contains menus and buttons
to control functions. (2) The Data Tree Panel is located beneath the Control Panel on the left side. This
panel organizes data sets and results. Data set(s) displayed in the Data Tree Panel must first be selected
before a desired function or analysis can be performed. To select multiple data sets, press the CTRL key
while selecting the data sets. (3) The Report Panel is located below the Data Tree Panel. It displays