Defining the three cell lineages of the human blastocyst by single-cell RNA-seq

Here, we provide fundamental insights into early human development by single-cell RNA-sequencing of human and mouse preimplantation embryos. We elucidate conserved transcriptional programs along with those that are human specific. Importantly, we validate our RNA-sequencing findings at the protein level, which further reveals differences in human and mouse embryo gene expression. For example, we identify several genes exclusively expressed in the human pluripotent epiblast, including the transcription factor KLF17. Key components of the TGF-β signalling pathway, including NODAL, GDF3, TGFBR1/ALK5, LEFTY1, SMAD2, SMAD4 and TDGF1, are also enriched in the human epiblast. Intriguingly, inhibition of TGF-β signalling abrogates NANOG expression in human epiblast cells, consistent with a requirement for this pathway in pluripotency. Although the key trophectoderm factors Id2, Elf5 and Eomes are exclusively localized to this lineage in the mouse, the human orthologues are either absent or expressed in alternative lineages. Importantly, we also identify genes with conserved expression dynamics, including Foxa2/FOXA2, which we show is restricted to the primitive endoderm in both human and mouse embryos. Comparison of the human epiblast to existing embryonic stem cells (hESCs) reveals conservation of pluripotency but also additional pathways more enriched in hESCs. Our analysis highlights significant differences in human preimplantation development compared with mouse and provides a molecular blueprint to understand human embryogenesis and its relationship to stem cells.


Data acquisition and processing
We integrated previously published datasets with our own blastocyst sequencing data using a consistent read alignment method. SRA files were obtained via ftp from the Gene Expression Omnibus, under the accession numbers GSE36552 and GSE45719.
The SRA files were converted into FASTQ format using the fastq-dump program from the SRA toolkit (http://www.ncbi.nlm.nih.gov/Traces/sra/). The reference human genome sequence was obtained from Ensembl, along with the gene annotation (GTF) file. The reference sequence was indexed using the bowtie2-build command.

Read mapping and counting
Reads were aligned to the reference human genome sequence using Tophat2 (Kim et al., 2013), with gene annotations to obtain BAM files for each of the single-cell samples. BAM files were then sorted by read coordinates and converted into SAM files using SAMtools. The process of mapping and processing BAM files was automated using a custom Perl script. The number of reads mapping to each gene were counted using the program htseq-count (Anders et al., 2015). The resulting count files for each sample were used as input for differential expression analysis using DESeq using the hg19 human or mm9 mouse genome reference sequence.

Expression analysis
To investigate differences in global gene expression, a PCA of the top 8000 genes with the most variable expression was performed on the human and mouse RPKM data separately. The R package prcomp was used to generate the PCA, using both the scaling and centering options. The R package NOISeq (Tarazona et al., 2011) was used to identify genes differentially expressed between the TE and EPI cells in human, and between TE and ICM cells in mouse. To increase sensitivity, genes with an RPKM > 5 in four or more samples were retained for NOISeq analysis.
Differentially expressed genes were identified after applying a 95% probability threshold. Ensembl Biomart was used to find human-mouse orthologous pairs within the list of differentially expressed human and mouse gene.
We used a second independent method to detect differentially expressed genes between human EPI and TE using DESeq (Anders and Huber, 2010). Firstly, the function 'estimateSizeFactors' and 'estimateDispersions' were used to estimate biological variability and calculate normalised relative expression values across the different blastocyst samples. Initially, this was performed without sample labels (option: method='blind') to allow unsupervised clustering of the blastocyst samples using principal components analysis and hierarchical clustering. The dispersion estimates were recalculated with the sample labels included and with the option: method='pooled'. The function 'nbinomTest' was then used to calculate p-values to identify genes that show significant differences in expression between different cell types.

Development • Supplementary information
A k-means clustering analysis was performed to find clusters of genes co-expressed during pre-implantation development. The mean RPKM value for each developmental time point was calculated for subsequent k-means clustering analysis (Figs S1, S2).
Genes with a fold change of greater than two between any two stages were retained.
The R package 'MFuzz' was used to generate the k-means clusters using the kmeans2 function, with the number of clusters set to 50. A custom R script was used to generate plots for the k-means clusters and trendlines were drawn based on the kmeans centroids. The k-means clusters were clustered further using the R function 'hclust' and heatmaps were generated using R package 'pheatmap'.