The user interface of DiMmer is designed as a simple workflow comprising 8 intuitive steps. To ease use of the software, each step and option is accompanied by an informative description. Here, you find a more detailed and exhaustive description of the steps.
java -Xmx4096M -jar dimmer.jar
The -Xmx option sets the maximum amount of memory DiMmer is allowed to use. If DiMmer still aborts with an out-of-memory message, increase this value. The amount of memory required depends on the size of your experiment. For an Infinium HumanMethylation450 BeadChip experiment with 60 patients, a value of -Xmx3024M is sufficient.
To perform an analysis with DiMmer, you need a valid sample annotation file (an example file is available) that reflects your experiment. The format is very simple: it is a comma-separated file with at least two columns:
Sentrix_ID
Sentrix_Position
These columns allow DiMmer to link each sample to the corresponding IDAT file (that is, the actual methylation data). Each Sentrix_ID corresponds to a directory, and the combination of Sentrix_ID and Sentrix_Position gives the IDAT filename.
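The mapping from annotation rows to IDAT files can be sketched as follows. This is only an illustration, not DiMmer's own code: the function name `idat_paths`, the `root` parameter and the sample values are hypothetical, and the `_Grn.idat` / `_Red.idat` suffixes follow the usual Illumina naming convention, which is assumed here rather than taken from this manual.

```python
import csv
import io
from pathlib import PurePosixPath

# The two mandatory columns named in the manual.
REQUIRED = ("Sentrix_ID", "Sentrix_Position")


def idat_paths(annotation_csv, root="."):
    """Return one (green, red) pair of IDAT paths per annotation row.

    Assumption: files are named <Sentrix_ID>_<Sentrix_Position>_Grn.idat and
    ..._Red.idat inside a directory named after the Sentrix_ID.
    """
    reader = csv.DictReader(io.StringIO(annotation_csv))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    paths = []
    for row in reader:
        sid = row["Sentrix_ID"].strip()
        pos = row["Sentrix_Position"].strip()
        base = PurePosixPath(root) / sid / f"{sid}_{pos}"
        paths.append((f"{base}_Grn.idat", f"{base}_Red.idat"))
    return paths


# Hypothetical two-sample annotation file.
example = "Sentrix_ID,Sentrix_Position\n6264509100,R01C01\n6264509100,R02C01\n"
for green, red in idat_paths(example, root="data"):
    print(green, red)
```

A check like this before starting DiMmer can reveal a misspelled column name or a missing IDAT directory early.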
Optionally, you can provide three additional standard columns:
Group_ID
Pair_ID
Gender_ID
Group_ID splits your data into a test and a control group. Pair_ID identifies pairs of connected samples; in a twin study, for example, both twins of a pair share the same number. Gender_ID encodes the sex of each sample.
If these columns are omitted, DiMmer assumes a specific order of the samples in the annotation file: the test and control groups are separated in the middle of the dataset, and in a paired study each sample is followed directly by its paired sample. If the Gender_ID column is missing, DiMmer performs an automatic gender detection.
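The two ordering conventions can be illustrated with a short sketch. It is an interpretation of the description above, not DiMmer's implementation: the function names are hypothetical, the labels 0 and 1 are arbitrary (the manual does not say which half is the test group), and the handling of an odd number of samples is an assumption.

```python
def default_groups(n_samples):
    """Split the samples in the middle of the dataset.

    Assumption: with an odd sample count, the extra sample goes to the
    second half. Which half is 'test' and which is 'control' is not stated
    in the manual, so the labels 0 and 1 are arbitrary.
    """
    half = n_samples // 2
    return [0] * half + [1] * (n_samples - half)


def default_pairs(n_samples):
    """Pair each sample with the one that directly follows it."""
    if n_samples % 2:
        raise ValueError("a paired study needs an even number of samples")
    return [i // 2 for i in range(n_samples)]


print(default_groups(6))  # first three samples in one group, last three in the other
print(default_pairs(6))   # samples (0,1), (2,3), (4,5) form the pairs
```

Because these defaults depend entirely on row order, providing explicit Group_ID and Pair_ID columns is the safer choice whenever the order of the file is not fully under your control.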
Additionally, you may add an arbitrary number of columns describing your experiment, for example one column for each phenotype of interest and one for each known co-factor.
In this step, you select whether your data is paired (e.g. twin data) or unpaired. This decision directly influences the subsequent permutation tests. For paired data, each permutation randomly swaps samples that share the same Pair_ID (if provided), whereas for unpaired data it shuffles all patient/control labels without repetition.
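The difference between the two permutation schemes can be sketched as follows. This is a minimal illustration under stated assumptions, not DiMmer's code: the function names are hypothetical, each pair is assumed to contain exactly two samples, and a swap within a pair is assumed to happen with probability 0.5.

```python
import random


def permute_unpaired(labels, rng):
    """Shuffle all case/control labels across the samples."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def permute_paired(labels, pair_ids, rng):
    """Swap the labels of the two samples in a pair, independently per pair."""
    permuted = list(labels)
    members_by_pair = {}
    for index, pair_id in enumerate(pair_ids):
        members_by_pair.setdefault(pair_id, []).append(index)
    for members in members_by_pair.values():
        if len(members) != 2:
            raise ValueError("each pair must contain exactly two samples")
        if rng.random() < 0.5:  # assumption: fair coin flip per pair
            a, b = members
            permuted[a], permuted[b] = permuted[b], permuted[a]
    return permuted


rng = random.Random(42)
labels = ["case", "control", "case", "control", "case", "control"]
pair_ids = [1, 1, 2, 2, 3, 3]
print(permute_unpaired(labels, rng))
print(permute_paired(labels, pair_ids, rng))
```

In the paired scheme, every pair still contains one case and one control after the permutation, which preserves the pairing structure that the unpaired shuffle would destroy.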
Two different methods can be applied to compute the statistical significance of the CpGs. For simple case-control studies, the t-test is the preferred method, as it is considerably faster than the more complex regression. You can choose a left-sided, a right-sided, or a two-sided t-test to investigate under-methylated regions, over-methylated regions, or both, respectively. The linear regression (LR) can also be applied to binary data; however, it is intended for computing the statistical significance of continuous data with multiple labels. Linear regression also allows for cell composition estimation, which can be used to correct the p-values of your phenotype of interest.
No conclusion can be drawn solely from the p-value that the aforementioned t-test or regression yields for a CpG. For this reason, we perform permutation tests to estimate the statistical significance of the CpGs: we randomly reassign the case-control labels to the samples and perform the same statistical test as above. This results in a distribution of permuted p-values against which the significance of the original p-values can be judged. Again, permutations using the t-test are considerably faster than those using LR. A larger number of permutations leads to a more precise estimate of the significance. As this might take some time, you should allow DiMmer to use several CPUs at once by setting the number of threads to a value between 1 and the number of CPUs of your computer. Furthermore, DiMmer allows for correcting for multiple testing. In total, DiMmer automatically provides four different p-value variants:
Empirical p-values
False Discovery Rate (FDR)
Family-Wise Error Rate (FWER)
Step-Down minP.
These corrected p-values are all available after the permutation test. When you start searching for differentially methylated regions (DMRs), you can choose any one of these p-values, but we recommend using the least stringent one (the empirical p-value), as the DMR search does not require a correction for multiple testing.
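To make the idea of an empirical p-value concrete, the following sketch compares one original p-value with the p-values obtained from the permutations of the same CpG. The function name is hypothetical, and the "+1" in numerator and denominator is a common convention that avoids an empirical p-value of exactly zero; whether DiMmer uses this exact formula is an assumption.

```python
def empirical_p_value(observed_p, permuted_ps):
    """Fraction of permutations that are at least as extreme as the observation.

    Assumption: the observation is counted as one additional permutation
    (the '+1' terms), so the result is never exactly 0.
    """
    hits = sum(1 for p in permuted_ps if p <= observed_p)
    return (hits + 1) / (len(permuted_ps) + 1)


# Hypothetical values: one of four permutations beats the original p-value.
print(empirical_p_value(0.01, [0.5, 0.2, 0.005, 0.3]))  # (1 + 1) / (4 + 1) = 0.4
```

With this formula the smallest value that N permutations can produce is 1 / (N + 1), which is one way to see why a larger number of permutations gives a more precise estimate.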
A differentially methylated region (DMR) is a group of differentially methylated CpGs occurring in close proximity to each other. DiMmer defines a DMR as a sequence of consecutive differentially methylated CpGs separated by a genomic distance of less than 1000 base pairs by default. You can change the maximum acceptable distance between CpGs in the field Max. CpG distance. A CpG is regarded as differentially methylated when the chosen p-value is below a given threshold, which can be set in the p-value cutoff field. You have to select which p-values should be used (original p-values, empirical p-values, FDR, FWER, or step-down minP).
In the field window size, you can specify the minimum number of consecutive differentially methylated CpGs you require for a region. The default value is 5.
Furthermore, you can define how many exceptions (i.e., CpGs that are not differentially methylated) you allow within the search window defined above. The default value is 2 and can be changed in the field number of exceptions.
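How the four DMR parameters interact can be sketched with a simple sliding-window search. This is one possible reading of the description above, not DiMmer's actual algorithm: the function name `find_dmrs` is hypothetical, the CpGs are assumed to lie on one chromosome and to be sorted by position, the cutoff of 0.05 is an assumed value (the manual states no default), and overlapping windows are merged into one region.

```python
def find_dmrs(positions, p_values, cutoff=0.05, max_distance=1000,
              window_size=5, exceptions=2):
    """Return (start, end) genomic positions of candidate DMRs.

    A window of `window_size` consecutive CpGs qualifies when all neighbouring
    CpGs are closer than `max_distance` and at most `exceptions` of them have
    a p-value that is not below `cutoff`. Overlapping windows are merged.
    """
    significant = [p < cutoff for p in p_values]
    regions = []  # inclusive (first_index, last_index) pairs
    for start in range(len(positions) - window_size + 1):
        last = start + window_size - 1
        gaps_ok = all(positions[i + 1] - positions[i] < max_distance
                      for i in range(start, last))
        misses = window_size - sum(significant[start:last + 1])
        if not gaps_ok or misses > exceptions:
            continue
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], last)  # extend the previous region
        else:
            regions.append((start, last))
    return [(positions[a], positions[b]) for a, b in regions]


# Hypothetical data: five close CpGs (one exception), then two distant ones.
positions = [100, 300, 500, 700, 900, 5000, 5200]
p_values = [0.001, 0.2, 0.002, 0.003, 0.001, 0.5, 0.6]
print(find_dmrs(positions, p_values))  # [(100, 900)]
```

In this sketch a region may begin or end with a CpG that is not differentially methylated; trimming such edges would be a natural refinement, but the manual does not say how DiMmer treats them.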