NeuroLINCS Project
The NeuroLINCS Project is part of the NIH Common Fund’s Library of Integrated Network-based Cellular Signatures (LINCS) program, which aims to characterize how a variety of human cells, tissues and entire organisms respond to perturbations by drugs and other molecular factors. As part of the LINCS program, the NeuroLINCS study concentrates on human brain cells, which are far less understood than other cells in the body. Our initial focus is to produce diseased motor neurons from high-quality induced pluripotent stem cell (iPSC) lines derived from Amyotrophic Lateral Sclerosis (ALS) and Spinal Muscular Atrophy (SMA) patients, in addition to unaffected healthy controls. Using state-of-the-art OMICS methods (genomics, epigenomics, transcriptomics, and proteomics), we intend to create a wealth of patient-specific cellular data, both in the context of baseline genetic perturbations and in the presence of other genetic and environmental perturbagens (e.g. endoplasmic reticulum stress). The primary data will be used to build cell signatures that convey the key features that distinguish the state of a cell and determine its behavior. Ultimately, the analysis of these datasets will lead to the identification of a network of unique signatures relevant to each of these motor neuron diseases.
Getting You Started
The information and videos below will help you get started with Galaxy and the NeuroLINCS pipelines. You can use these workflows/pipelines with your own data or rerun them on the NeuroLINCS data hosted on dbGaP and LINCSproject.org. Links to the data are provided below.
Galaxy Resources
Introduction To Galaxy | Learn what Galaxy is and how you can use it |
Learning Resources | A collection of videos to help you learn how |
Get Data: Upload File | How to upload your Data |
Upload Data From SRA | How to upload data from SRA |
NeuroLINCS Data Links
NeuroLINCS Home Page | NeuroLINCS website contains information on the project, including technologies, data, and tools developed and used by the team |
NeuroLINCS Data Summary | Data page on NeuroLINCS website showing summary of experiments and links to data |
NeuroLINCS Raw Data | The NCBI database of Genotypes and Phenotypes (dbGaP) study that hosts the NeuroLINCS raw data files for ATAC-Seq, RNA-Seq and whole genome sequences |
NeuroLINCS Raw Protein Data | Chorus Project site that hosts the raw data files for the SWATH proteomic assay. Note: you need to sign in to Chorus to access the files |
NeuroLINCS Processed Data | LINCSproject.org datasets for NeuroLINCS |
- Experiment 1: ATAC-Seq, RNA-Seq and proteomics were carried out on samples obtained from induced pluripotent stem cell (iPSC) lines. These lines were derived from ALS, SMA and Control (unaffected) individuals (three of each).
- ATAC-Seq
- RNA-Seq
- Proteomics - NOTE: We do not have a proteomic workflow in Galaxy, but you can still access the data and analyze it with your own tools.
- Experiment 2: ATAC-Seq, RNA-Seq and proteomics were carried out on samples obtained from motor neuron lines generated from subject induced pluripotent stem cell (iPSC) lines.
- ATAC-Seq
- RNA-Seq
- Proteomics - NOTE: We do not have a proteomic workflow in Galaxy, but you can still access the data and analyze it with your own tools.
Analyzing data using the pipelines/workflow
RNA-Seq Workflow
- Use the RNA-Seq Step 1 ‘Secondary Analysis’ workflow below to generate the count matrix (level 3 data) for all samples from the raw FASTQ files.
- If technical or growth replicates are present, use the R code to generate the differentially expressed genes (level 4 data). If not, use the RNA-Seq Step 2 ‘Statistical Analysis of Gene Expression’ workflow below to generate the differentially expressed gene list.
ATAC-Seq Workflow
The ATAC pipeline on Galaxy generates BAM files from Bowtie2 alignment and narrowPeak files from MACS2 peak calling.
- Download the BAM and narrowPeak files to your computer or server. BAM files are very large so if possible, use an FTP client to transfer them to their destination.
- Create a sample sheet for use with DiffBind using this example as a template. For each sample, provide the path to the BAM file in the “bamReads” column of the sample sheet and the path to the narrowPeak file in the “Peaks” column.
- Fill out the “PeakCaller” section with “macs” for all samples.
- The “SampleID”, “Tissue”, “Factor”, “Condition”, “Treatment”, and “Replicate” columns should be filled out with appropriate values for each sample. Only the “SampleID” column is required; the other columns provide information that makes comparisons in DiffBind more convenient.
- Once the sample sheet has been filled out, you can start using the DiffBind ATAC R vignette here to analyze your data.
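The sample sheet described in the steps above can also be generated programmatically. The following is a minimal Python sketch, assuming a CSV layout with the column names quoted in the list; the function name `build_sample_sheet`, the sample IDs, and the file paths are hypothetical and should be replaced with your own.

```python
import csv
import io

# Column names taken from the sample sheet fields listed above.
COLUMNS = ["SampleID", "Tissue", "Factor", "Condition", "Treatment",
           "Replicate", "bamReads", "Peaks", "PeakCaller"]


def build_sample_sheet(samples):
    """Return CSV text for a DiffBind-style sample sheet.

    `samples` is a list of dicts keyed by column name. Missing optional
    fields are left blank and PeakCaller defaults to "macs".
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for sample in samples:
        row = {column: sample.get(column, "") for column in COLUMNS}
        row["PeakCaller"] = sample.get("PeakCaller", "macs")
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical sample names and file paths, for illustration only.
example_samples = [
    {"SampleID": "ctrl_1", "Condition": "Control", "Replicate": 1,
     "bamReads": "bam/ctrl_1.bam", "Peaks": "peaks/ctrl_1.narrowPeak"},
    {"SampleID": "als_1", "Condition": "ALS", "Replicate": 1,
     "bamReads": "bam/als_1.bam", "Peaks": "peaks/als_1.narrowPeak"},
]
sheet = build_sample_sheet(example_samples)
print(sheet)
```

Writing `sheet` to a `.csv` file gives a starting point that can then be checked against the example template before loading it in DiffBind.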
Workflows
The workflows described below are used for primary analysis of NeuroLINCS cell line data.
Web-based Pipeline For Differential Gene Expression Analysis (RNA-Seq)
NeuroLINCS Transcriptomics Center, UC Irvine
The workflows describe a standard analysis of bulk RNA-seq data. For a schematic of the pipeline, click here.
Step 1. Secondary Analysis
Galaxy Workflow '(1 -> 2) - RNAseq_batch'
Step 1: Input dataset collection
input
select at runtime
Step 2: FastQC
Short read data from your current history
Output dataset 'output' from step 1
Contaminant list
select at runtime
Submodule and Limit specifing file
select at runtime
Step 3: Trimmomatic
Single-end or paired-end reads?
Paired-end (as collection)
Select FASTQ dataset collection with R1/R2 pair
Output dataset 'output' from step 1
Perform initial ILLUMINACLIP step?
False
Trimmomatic Operations
Trimmomatic Operation 1
Select Trimmomatic operation to perform
Cut bases off the end of a read, if below a threshold quality (TRAILING)
Minimum quality required to keep a base
10
Trimmomatic Operation 2
Select Trimmomatic operation to perform
Drop reads below a specified length (MINLEN)
Minimum length of reads to be kept
20
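To make the two Trimmomatic operations above concrete, here is a minimal Python sketch of what TRAILING (quality 10) followed by MINLEN (length 20) does to a single read. It is an illustration, not Trimmomatic's implementation: the function names are hypothetical, Phred+33 quality encoding is assumed, and the handling of a surviving mate in paired-end mode is not modeled.

```python
def trim_trailing(sequence, qualities, min_quality=10, phred_offset=33):
    """Remove bases from the 3' end while their quality is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(qualities[end - 1]) - phred_offset < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def trim_read(sequence, qualities, min_quality=10, min_length=20):
    """Apply TRAILING, then MINLEN. Return None when the read is dropped."""
    sequence, qualities = trim_trailing(sequence, qualities, min_quality)
    if len(sequence) < min_length:
        return None
    return sequence, qualities


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
read = "ACGT" * 6
kept = trim_read(read, "I" * 22 + "#" * 2)      # two low-quality bases removed
dropped = trim_read(read, "I" * 18 + "#" * 6)   # 18 bases remain, below MINLEN
```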
Step 4: HISAT2
Source for the reference genome
Use a built-in genome
Select a reference genome
hg38
Single-end or paired-end reads?
Paired-end Collection
Paired Collection
Output dataset 'fastq_out_paired' from step 3
Specify strand information
Unstranded
Paired-end options
Use default values
Summary Options:
Output alignment summary in a more machine-friendly style.
False
Print alignment summary to a file.
False
Advanced Options:
Input options
Use default values
Alignment options
Use default values
Scoring options
Use default values
Spliced alignment options
Use default values
Reporting options
Use default values
Output options
Use default values
Other options
Use default values
Step 2. Statistical Analysis of gene expression
This step uses the standard DESeq2 workflow to test for differential expression between two groups, e.g. control vs. ALS.
Galaxy Workflow '(2 -> 3) - RNAseq_batch'
Step 1: Input dataset collection
input
select at runtime
Step 2: Input dataset
GTF/GFF file
select at runtime
Step 3: featureCounts
Alignment file
Output dataset 'output' from step 1
Gene annotation file
in your history
Gene annotation file
Output dataset 'output' from step 2
Output format
Gene-ID "\t" read-count (DESeq2 IUC wrapper compatible)
Create gene-length file
False
Options for paired-end reads:
Count fragments instead of reads
Disabled; all reads/mates will be counted individually
Only allow fragments with both reads aligned
False
Exclude chimeric fragments
True
Advanced options:
GFF feature type filter
exon
GFF gene identifier
gene_name
On feature level
False
Allow read to contribute to multiple features
False
Strand specificity of the protocol
Unstranded
Count multi-mapping reads/fragments
Disabled; multi-mapping reads are excluded (default)
Minimum mapping quality per read
12
Exon-exon junctions
False
Long reads
False
Count reads by read group
False
Largest overlap
False
Minimum bases of overlap
1
Minimum fraction (of read) overlapping a feature
0
Minimum fraction (of feature) overlapping a read
0
Read 5' extension
0
Read 3' extension
0
Reduce read to single position
Leave the read as it is
Only count primary alignments
False
Ignore reads marked as duplicate
False
Ignore unspliced alignments
False
For more information regarding DESeq2, please visit this page.
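With the output format set to Gene-ID "\t" read-count, featureCounts yields one two-column table per sample. The following Python sketch shows one way such per-sample tables could be combined into a gene-by-sample count matrix before statistical testing. The function name `merge_counts`, the gene names, and the sample names are hypothetical, and the assumption that genes missing from a sample get a count of 0 is ours.

```python
def merge_counts(per_sample_text):
    """Merge 'gene<TAB>count' tables into one gene-by-sample matrix.

    `per_sample_text` maps a sample name to the text of its count table.
    Lines without an integer count (e.g. a header row) are skipped, and
    genes absent from a sample are given a count of 0.
    """
    counts = {}
    for sample, text in per_sample_text.items():
        for line in text.splitlines():
            fields = line.split("\t")
            if len(fields) < 2 or not fields[1].strip().isdigit():
                continue
            counts.setdefault(fields[0], {})[sample] = int(fields[1])
    samples = list(per_sample_text)
    genes = sorted(counts)
    matrix = [[counts[gene].get(sample, 0) for sample in samples]
              for gene in genes]
    return genes, samples, matrix


# Hypothetical gene and sample names, for illustration only.
tables = {
    "ctrl_1": "Geneid\tctrl_1\nGENE_A\t120\nGENE_B\t45\n",
    "als_1": "Geneid\tals_1\nGENE_A\t98\nGENE_C\t30\n",
}
genes, samples, matrix = merge_counts(tables)
```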
Web-based Pipeline For Assay for Transposase-Accessible Chromatin followed by sequencing (ATAC-Seq)
NeuroLINCS Epigenomics Center, MIT
Assay overview
The ATAC-seq experiment provides genome-wide profiles of chromatin accessibility. Briefly, the ATAC-seq method works as follows: loaded transposase inserts sequencing primers into open chromatin sites across the genome, and reads are then sequenced. The ends of the reads mark open chromatin sites. The ATAC-seq pipeline is used for statistical signal processing of short-read sequencing data and quality control, producing alignments and measures of enrichment. In its current form, it is a prototype and will likely undergo substantial change within the next year.
ATAC Pipeline
Our ATAC pipeline takes in BAM files containing aligned reads and outputs peaks and peak annotations.
The first step in the pipeline is to remove all reads mapped to mitochondrial DNA from the BAM file. Because we observe 30-60% mitochondrial contamination in NeuroLINCS samples, removing these reads eliminates considerable noise from downstream analysis. Peak calling is then performed on the filtered BAM file using MACS2 with the following parameters: --format BAM --gsize hs --qvalue .05. We have prepared a background BAM file for MACS2 peak calling by extracting naked genomic DNA from iMNs, performing ATAC on the genomic DNA, and sequencing the resulting library. Peak annotation is performed using the script “map_peaks_to_known_genes.py” from the ChipSeqUtil package; we map genes to peaks within a window of +/- 10 kb. BigWig files for the reads and BigBed files for the peaks are generated for visualization on a genome browser.
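The idea behind the +/- 10 kb peak annotation can be illustrated with a short Python sketch. This is not the logic of “map_peaks_to_known_genes.py”, whose exact rules are not described here; it assumes, for illustration, that a gene is assigned to a peak when the gene's transcription start site (TSS) lies within 10 kb of the peak interval. The function name, gene names, and coordinates are hypothetical.

```python
def map_peaks_to_genes(peaks, genes, window=10_000):
    """Pair each peak with genes whose TSS lies within `window` bp of it.

    peaks: list of (chrom, start, end) tuples.
    genes: list of (name, chrom, tss) tuples.
    """
    assignments = []
    for chrom, start, end in peaks:
        for name, gene_chrom, tss in genes:
            if gene_chrom != chrom:
                continue
            if start - window <= tss <= end + window:
                assignments.append(((chrom, start, end), name))
    return assignments


# Hypothetical peak and gene coordinates, for illustration only.
peaks = [("chr1", 50_000, 50_500)]
genes = [
    ("GENE_A", "chr1", 45_000),   # 5 kb upstream of the peak: assigned
    ("GENE_B", "chr1", 70_000),   # about 20 kb away: not assigned
    ("GENE_C", "chr2", 50_100),   # different chromosome: not assigned
]
assignments = map_peaks_to_genes(peaks, genes)
```

The nested loop is quadratic and only suitable for small inputs; for genome-wide peak sets, an interval index or a sorted sweep would be the usual choice.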
Galaxy Workflow | imported: fraenkel_ATAC_batch_experimental_paired (for in house usage)
Step 1: Input dataset collection
input
select at runtime
Step 2: Input dataset
encode blacklist regions
select at runtime
Step 3: Trimmomatic
Single-end or paired-end reads?
Paired-end (as collection)
Select FASTQ dataset collection with R1/R2 pair
Output dataset 'output' from step 1
Perform initial ILLUMINACLIP step?
False
Trimmomatic Operations
Trimmomatic Operation 1
Select Trimmomatic operation to perform
Cut bases off the start of a read, if below a threshold quality (LEADING)
Minimum quality required to keep a base
15
Trimmomatic Operation 2
Select Trimmomatic operation to perform
Cut bases off the end of a read, if below a threshold quality (TRAILING)
Minimum quality required to keep a base
15
Step 4: FastQC
Short read data from your current history
Output dataset 'fastq_out_paired' from step 3
Contaminant list
select at runtime
Submodule and Limit specifing file
select at runtime
Step 5: Bowtie2
Is this single or paired library
Paired-end Dataset Collection
FASTQ Paired Dataset
Output dataset 'fastq_out_paired' from step 3
Write unaligned reads (in fastq format) to separate file(s)
False
Write aligned reads (in fastq format) to separate file(s)
False
Do you want to set paired-end options?
No
Will you select a reference genome from your history or use a built-in index?
Use a built-in genome index
Select reference genome
hg19
Set read groups information?
Do not set
Select analysis mode
1: Default setting only
Do you want to use presets?
No, just use defaults
Save the bowtie2 mapping statistics to the history
True
Step 6: BAM filter
Select BAM dataset
Output dataset 'output' from step 5
Remove reads that are smaller than
Not available.
Remove reads that are larger than
Not available.
Keep only mapped reads
True
Keep only unmapped reads
False
Keep only properly paired reads
True
Discard properly paired reads
False
Remove reads that match the mask
Empty.
Remove reads that have the same sequence
-1
Remove reads that start at the same position
False
Remove reads with that many mismatches
Not available.
Remove secondary alignment reads
True
Remove reads that do not pass the quality control
False
Remove reads that are marked as PCR dupicates
False
Remove reads that are in any of the regions
select at runtime
Remove reads that are NOT any of the regions
select at runtime
Strand information from BED file is ignored
False
Exclude reads NOT mapped to a reference
Empty.
Exclude reads mapped to a particular reference
chrM
Filter by maximum mismatch ratio
Not available.
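The active settings in the BAM filter step (keep mapped reads, keep properly paired reads, remove secondary alignments, exclude chrM) correspond to simple tests on the SAM FLAG and reference name fields. The Python sketch below applies those tests to a single SAM text line. It is an illustration of the filtering criteria, not the Galaxy tool's implementation; the function name and the example records are hypothetical. The flag bit values come from the SAM format specification.

```python
# FLAG bits defined by the SAM format specification.
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100


def keep_alignment(sam_line, excluded_reference="chrM"):
    """Return True if a SAM record passes the filters described above."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])       # column 2: bitwise FLAG
    reference = fields[2]       # column 3: reference sequence name
    if flag & FLAG_UNMAPPED:
        return False            # keep only mapped reads
    if not flag & FLAG_PROPER_PAIR:
        return False            # keep only properly paired reads
    if flag & FLAG_SECONDARY:
        return False            # remove secondary alignments
    return reference != excluded_reference


# Hypothetical SAM records, for illustration only.
template = "read1\t{flag}\t{ref}\t100\t42\t50M\t=\t300\t250\t*\t*"
nuclear = template.format(flag=99, ref="chr1")
mitochondrial = template.format(flag=99, ref="chrM")
unmapped = template.format(flag=77, ref="*")
secondary = template.format(flag=355, ref="chr1")
```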
Step 7: Sort
BAM File
Output dataset 'outfile' from step 6
Sort by
Chromosomal coordinates
Step 8: MarkDuplicates
Select SAM/BAM dataset or dataset collection
Output dataset 'output1' from step 7
Comments
If true do not write duplicates to the output file instead of writing them with appropriate flags set
True
Assume the input file is already sorted
True
The scoring strategy for choosing the non-duplicate among candidates
SUM_OF_BASE_QUALITIES
Regular expression that can be used in unusual situations to parse non-standard read names in the incoming SAM/BAM dataset
[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*.
The maximum offset between two duplicte clusters in order to consider them optical duplicates
100
Barcode Tag
Empty.
Select validation stringency
Lenient
Step 9: bamCoverage
BAM/CRAM file
Output dataset 'outFile' from step 8
Bin size in bases
50
Scaling/Normalization method
Normalize to reads per kilobase per million (RPKM)
Coverage file format
bigwig
Region of the genome to limit the operation to
Empty.
Show advanced options
no
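For reference, the RPKM normalization selected above scales each bin's read count by the sequencing depth (in millions of mapped reads) and the bin length (in kilobases). The Python sketch below works through that arithmetic for the 50-base bin size used in this step. The function name and the read counts are hypothetical, and the formula is our reading of the standard RPKM definition rather than a copy of the tool's code.

```python
def rpkm_per_bin(reads_in_bin, total_mapped_reads, bin_size_bp=50):
    """Reads per kilobase per million mapped reads for a single bin."""
    millions_of_reads = total_mapped_reads / 1_000_000
    bin_length_kb = bin_size_bp / 1_000
    return reads_in_bin / (millions_of_reads * bin_length_kb)


# Hypothetical numbers: 25 reads in a 50 bp bin, 20 million mapped reads.
value = rpkm_per_bin(reads_in_bin=25, total_mapped_reads=20_000_000)
print(round(value, 6))
```

Because the denominator depends only on depth and bin size, two samples with different sequencing depths become directly comparable bin by bin.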
Step 10: MACS2 callpeak
Are you pooling Treatment Files?
No
ChIP-Seq Treatment File
select at runtime
Do you have a Control File?
No
Format of Input Files
BAM
Effective genome size
H. sapiens (2.7e9)
Build Model
Build the shifting model
Set lower mfold bound
5
Set upper mfold bound
50
Band width for picking regions to compute fragment size
300
Peak detection based on
q-value
Minimum FDR (q-value) cutoff for peak detection
0.05
Additional Outputs
Peaks as tabular file (compatible wih MultiQC)
Advanced Options:
When set, scale the small sample up to the bigger sample
False
Use fixed background lambda as local lambda for every peak region
False
Save signal per million reads for fragment pileup profiles
False
When set, use a custom scaling ratio of ChIP/control (e.g. calculated using NCIS) for linear scaling
1.0
The small nearby region in basepairs to calculate dynamic lambda
1000
The large nearby region in basepairs to calculate dynamic lambda
10000
Composite broad regions
No broad regions
Use a more sophisticated signal processing approach to find subpeak summits in each enriched peak region
False
How many duplicate tags at the exact same location are allowed?
1
Step 11: multiBigwigSummary
Sample order matters
No
Bigwig files
Output dataset 'outFileName' from step 9
Choose computation mode
Bins
Bin size in bp
10000
Distance between bins
0
Region of the genome to limit the operation to
Empty.
Save raw counts (scores) to file
True
Show advanced options
no
Step 12: Intersect intervals
File A to intersect with B
Output dataset 'output_narrowpeaks' from step 10
Combined or separate output files
One output file per 'input B' file
File(s) B to intersect with A
select at runtime
Calculation based on strandedness?
Overlaps on either strand
What should be written to the output file?
Write the original entry in A for each overlap (-wa)
Treat split/spliced BAM or BED12 entries as distinct BED intervals when computing coverage.
False
Minimum overlap required as a fraction of the BAM alignment
Empty.
Require that the fraction of overlap be reciprocal for A and B
False
Report only those alignments that **do not** overlap with file(s) B
True
Write the original A entry _once_ if _any_ overlaps found in B.
False
For each entry in A, report the number of overlaps with B.
False
Print the header from the A file prior to results
False
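With "Report only those alignments that do not overlap with file(s) B" set to True, this step keeps only the peaks that fall outside the regions supplied as file B (in this workflow, presumably the blacklist regions given as input). A minimal Python sketch of that behavior follows, assuming BED-style half-open intervals. The function name and coordinates are hypothetical, and this is an illustration rather than the bedtools implementation.

```python
def remove_blacklisted(peaks, blacklist):
    """Keep only peaks with no overlap against any blacklist interval.

    Intervals are (chrom, start, end) tuples using BED-style half-open
    coordinates, so two intervals that merely touch do not overlap.
    """
    kept = []
    for chrom, start, end in peaks:
        overlaps = any(
            b_chrom == chrom and start < b_end and b_start < end
            for b_chrom, b_start, b_end in blacklist
        )
        if not overlaps:
            kept.append((chrom, start, end))
    return kept


# Hypothetical coordinates, for illustration only.
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
blacklist = [("chr1", 150, 300), ("chr1", 600, 700)]
filtered = remove_blacklisted(peaks, blacklist)
```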
Step 13: plotPCA
Matrix file from the multiBamSummary or multiBigwigSummary tools
Output dataset 'outFile' from step 11
Image file format
Title of the plot
Empty.
Save the matrix of PCA and eigenvalues underlying the plot.
False
Show advanced options
no
Step 14: plotCorrelation
Matrix file from the multiBamSummary tool
Output dataset 'outFile' from step 11
Correlation method
Spearman
Plotting type
Heatmap
Minimum value for the heatmap intensities
Empty.
Maximum value for the heatmap intensities
Empty.
Color map to use for the heatmap
RdYlBu
Title of the plot
Empty.
Plot the correlation value
True
Plot height
9.5
Plot width
11.0
Skip zeroes
False
Image file format
Remove regions with very large counts
True
Save the matrix of values underlying the heatmap
False
Step 15: BED-to-bigBed
Convert
Output dataset 'output' from step 12
Converter settings to use
Full parameter list
Items to bundle in r-tree
256
Data points bundled at lowest level
512
Do not use compression
False
Step 16: computeMatrix
Select regions
Select regions 1
Regions to plot
Output dataset 'output' from step 12
Sample order matters
Yes
Score files
Score files 1
Score file
Output dataset 'outFileName' from step 9
computeMatrix has two main output options
reference-point
The reference point for the plotting
center of region
Discard any values after the region end
False
Distance upstream of the start site of the regions defined in the region file
1000
Distance downstream of the end site of the given regions
1000
Show advanced output settings
no
Show advanced options
yes
Length, in bases, of non-overlapping bins used for averaging the score over the regions length
50
Sort regions
maintain the same ordering as the input files
Method used for sorting
mean
Define the type of statistic that should be displayed.
mean
Convert missing values to 0?
False
Skip zeros
False
Minimum threshold
Not available.
Maximum threshold
Not available.
Scaling factor
Not available.
Labels for the samples (each bigwig)
Empty.
Use a metagene model
False
trascript designator
transcript
exon designator
exon
transcriptID key designator
transcript_id
Blacklisted regions in BED/GTF format
select at runtime
Step 17: plotHeatmap
Matrix file from the computeMatrix tool
Output dataset 'outFileName' from step 16
Show advanced output settings
no
Show advanced options
no