Integrated siRNA design based on surveying of features associated with high RNAi effectiveness

Wuming Gong, Yongliang Ren, Qiqi Xu, Yejun Wang, Dong Lin, Haiyan Zhou and Tongbin Li*

BMC Bioinformatics 7(1):516. DOI: 10.1186/1471-2105-7-516. BioMed Central, Open Access, Research article. Available via license: Creative Commons Attribution 2.0 Generic (CC BY 2.0).

Address: Department of Neuroscience, University of Minnesota, Minneapolis, MN 55455, USA

Email: Wuming Gong - wuming@biocompute.umn.edu; Yongliang Ren - yongliang@biocompute.umn.edu; Qiqi Xu - qiqi@biocompute.umn.edu; Yejun Wang - yejun@biocompute.umn.edu; Dong Lin - lindong@biocompute.umn.edu; Haiyan Zhou - haiyan@biocompute.umn.edu; Tongbin Li* - toli@biocompute.umn.edu

* Corresponding author

Abstract

Background: Short interfering RNAs have allowed the development of clean and easily regulated methods for disruption of gene expression. However, while these methods continue to grow in popularity, designing effective siRNA experiments can be challenging.
The various existing siRNA design guidelines suffer from two problems: they differ considerably from each other, and they produce high levels of false-positive predictions when tested on data of independent origins.

Results: Using a distinctly large set of siRNA efficacy data assembled from a vast diversity of origins (the siRecords data, containing records of 3,277 siRNA experiments targeting 1,518 genes, derived from 1,417 independent studies), we conducted extensive analyses of all known features that have been implicated in increasing RNAi effectiveness. A number of features having positive impacts on siRNA efficacy were identified. By performing quantitative analyses on cooperative effects among these features, then applying a disjunctive rule merging (DRM) algorithm, we developed a bundle of siRNA design rule sets with the false positive problem well curbed. A comparison with 15 online siRNA design tools indicated that some of the rule sets we developed surpassed all of these design tools commonly used in siRNA design practice in positive predictive values (PPVs).

Conclusion: The availability of the large and diverse siRNA dataset from siRecords and the approach we describe in this report have allowed the development of highly effective and generally applicable siRNA design rule sets. Together with ever improving RNAi lab techniques, these design rule sets are expected to make siRNAs a more useful tool for molecular genetics, functional genomics, and drug discovery studies.

Background

Short interfering RNAs (siRNAs) are double-stranded RNAs typically of length between 19 and 25 nucleotides with 2-nucleotide overhangs on the 3′ ends, and they are capable of inducing sequence-specific, post-transcriptional deletion of gene products, leading to the silencing of the gene activity. Naturally occurring siRNAs are cleavage products from long double-stranded RNAs (dsRNAs) by Dicer, a ribonuclease III enzyme [1,2].
The siRNA-induced mRNA degradation is a complicated process involving multiple steps, initiated by the binding of siRNA with RISC (RNA-induced silencing complex), followed by RISC's activation, resulting in the recognition of the target mRNA and the degradation of the latter [1,3,4]. As a gene knock-down tool used in labs, siRNAs can also be chemically synthesized and introduced into the cells by direct transfection [5,6] or delivered into the cells in the form of hairpin precursors through plasmid or viral vectors [7,8]. The siRNA-based gene knock-down techniques are preferred by many because of their ability to disrupt an individual gene's function without affecting related genes [9]. These techniques are particularly attractive for gene silencing studies in mammalian cells because, unlike longer double-stranded RNAs, siRNAs are not likely to trigger interferon responses, which lead to non-specific mRNA degradation [5].

Published: 27 November 2006. Received: 10 June 2006. Accepted: 27 November 2006. BMC Bioinformatics 2006, 7:516, doi:10.1186/1471-2105-7-516. This article is available from: http://www.biomedcentral.com/1471-2105/7/516

© 2006 Gong et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The efficacy issue represents a major challenge in siRNA design. This issue concerns the question of how to choose, from the large number of candidate siRNAs, the ones that give rise to the highest levels of knock-down activity. It is well known that only a fraction of these candidate siRNAs are highly effective in silencing the target genes.
Two siRNAs targeting mRNA sites that are separated by only a few nucleotides can exhibit very different knock-down efficacies [10,11]. What are the properties some siRNAs possess that render them more effective in knocking down the target genes than others? This is an issue of heated debate. Several sets of rules for designing high-efficacy siRNAs have been proposed (e.g., [11-14]). In addition, a long list of factors has been claimed to influence siRNA knock-down efficacy, and these factors should thus be considered in siRNA design [15-26].

There are significant disagreements among these design rules and considerable controversies over these claims. This situation has been discussed extensively in several recent review articles [27,28]; therefore we list only some examples of these disagreements here. [20] suggested that the sequence information alone was sufficient in determining the efficacy of a siRNA; however, [15,22,24] advocated the need to incorporate thermodynamic properties (calculated using tools such as Mfold [29]) in assisting siRNA design, while [17,25] emphasized the importance of the accessibility of the mRNA sites to the siRNAs, and endorsed methods of filtering candidate siRNAs based on mRNA secondary structure properties. On factors determined by siRNA sequences, [12,30] recommended choosing sequences of intermediate G/C content (around 50%) for effective siRNAs, while [11,18,24,31,32] endorsed choosing sequences of lower G/C content (< 60%) to increase the chance of making high-efficacy siRNAs. On position-specific properties, [11] suggested that the nucleotides at positions 3, 10, 13 and 19 on the sense strand played a critical role in determining the knock-down efficacy, while [14] claimed that positions 19 and 11, and perhaps 6, 13 and 16 on the sense strand were important in determining the knock-down efficacy of the siRNAs.

The debates over siRNA efficacy go beyond the disagreements among these design rules.
In fact, the effectiveness of these rules per se is in question. [17] showed that most published siRNA design tools output large numbers of ineffective siRNAs, and performed similarly to (or even worse than) a random selector when tested on data of an independent origin. [20] made similar observations, and alleged that several published efficacy-predicting algorithms gave close to random classification on unseen data. At least two groups of researchers pointed out that many existing studies on siRNA design criteria suffered from the overfitting problem [20,24]. This term describes scenarios where rules are extracted from datasets that have small sample sizes, low signal-to-noise ratios, and unique experimental settings. Rules obtained under these conditions are prone to spurious effects caused by noise in the data samples, by specific aspects of the experimental settings, or by both; such rules are likely to perform unsatisfactorily when used on data obtained under different experimental settings.

The key to countering the overfitting problem and developing truly effective and generally applicable siRNA design rules is the availability of a large collection of siRNA efficacy data from diverse origins. We recently undertook the effort to document all siRNA experiments in published studies and provide sensible efficacy ratings of these experiments. This effort resulted in siRecords, the largest known curated database of mammalian siRNA experiments with consistent efficacy ratings [33]. The availability of the siRecords data makes it possible to better analyze factors responsible for achieving effective RNAi experiments.

In this study, we first conducted a survey on the siRecords data of all known features previously implicated in influencing siRNA knock-down efficacy. This survey resulted in a list of features that significantly boosted the chance of achieving higher siRNA efficacies.
Then, we examined quantitatively how these significant features interact with one another in their joint effects on achieving higher efficacies. The combinations of features that give rise to the highest levels of boosting to siRNA efficacies were picked and reorganized using a disjunctive rule merging (DRM) procedure, which led to a bundle of non-redundant rule sets with controlled stringency level. The performance of these rule sets (termed the DRM rule sets) was then assessed using a reserved dataset and compared with existing design tools commonly used in current siRNA design practice.

An implementation of the DRM rule sets developed in this study is available for testing as an online siRNA design server [34].

Results

Overview of siRecords data

siRecords is a continuing effort aimed at documenting all mammalian siRNA experiments reported in the literature, and providing systematically rated efficacies for these experiments [33]. Currently, about 9000 records of siRNA experiments targeting more than 3000 genes are hosted in the siRecords database. For each siRNA experiment, we document the siRNA sequence, the target gene, key information about experimental conditions (cell line used; the method of producing the siRNA, chemically synthesized or vector-based; the method of testing the siRNA efficacy, western blot or real-time PCR or others), and an efficacy rating (elaborated below).

For this investigation, we picked all complete records of 19-mer siRNA experiments (21-mers if the two overhanging nucleotides on the 3′ ends are counted) from the siRecords collection (dated 12/12/2005). The distribution of the number of records per study is highly skewed: about 17.5% of the records (657 siRNA experiments) originated from 0.4% of the studies (6 studies, each reporting ≥ 30 siRNA experiments, Figure 1).
To prevent our analyses from being biased by this small number of studies, we limited the number of siRNA experiments originating from a single study to ≤ 30. For studies where more than 30 siRNA experiments were reported, we randomly picked 30 to include in our analyses and discarded the rest. The resulting dataset includes the records of 3277 siRNA experiments targeting 1518 genes, originating from 1417 independent studies. We randomly divided the dataset into two subsets at a 2:1 ratio. The larger subset, termed Set A, included 2184 records, and was used to survey features significantly associated with high efficacies and analyze the combinatorial effects of these features. The other subset (termed Set T, 1093 records) was reserved to test the conclusions obtained through the analyses of Set A.

Figure 1. The distribution of the number of siRNA experiments per study is highly skewed in the siRecords collection. A. Studies were categorized based on the number of siRNA experiments reported. Only 6 out of the 1,417 studies (0.4%) reported ≥ 30 siRNA experiments per study. B. The distribution of the total number of records in each category. Six hundred and fifty-seven records (representing 17.5% of the entire dataset) originated from the 6 studies with ≥ 30 records per study.

Survey of features significantly boosting siRNA efficacy

We set out to determine, using the Set A data, what features of the siRNA experiments are associated with elevated RNAi efficacies. A "feature" is a binary property of a siRNA experiment concerning a factor potentially relevant to siRNA efficacy, for example, the 6th nucleotide of the siRNA sequence = A. Each feature has a "complementary feature". A feature and its complementary feature constitute a "feature pair". More discussion about the definition of feature and related terms can be found in Methods.

In siRecords, the effectiveness of any siRNA experiment is rated on a four-level scale: very high (if the gene product was reduced by ≥ 90%), high (if the gene product was reduced by 70–90%), medium (if 50–70% knock-down was achieved), and low (if < 50% knock-down was observed). In Set A, the percentages of records receiving very high, high, medium and low efficacy ratings are 34.1%, 34.6%, 16.3% and 14.9% respectively (Figure 2). The decision to use this four-level rating scheme was made based on balanced considerations about the usefulness and the reliability of the ratings [33]. One consequence of this decision is that the conventional t-test type of analysis [11] cannot be performed on this dataset, because the dependent variable (efficacy rating) is not a continuous variable, but rather a categorical, ordinal variable. Proper categorical analysis techniques need to be adopted to analyze this type of data [35].

We chose to use the Wald test of monotone trend to assess the evidence that the presence of a feature is associated with a significant up-shift (or down-shift) of the efficacy distribution. In addition, we conducted odds ratio permutation tests for two efficacy levels, ≥ 90% and ≥ 70% efficacy, because in siRNA design practice we are interested in assessing whether a feature leads to increased chances of achieving higher efficacies (see Methods).
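The four-level rating and the idea behind an odds ratio permutation test can be sketched as follows. This is a minimal illustration and not the procedure from Methods: the function names, the treatment of the exact 70% and 50% boundaries, the 0.5 continuity correction, the one-sided comparison, and the number of permutations are all assumptions made for the example.

```python
import random


def efficacy_rating(knockdown_percent):
    """Map a knock-down percentage to the four-level scale.

    Assumption: boundary values (exactly 70% or 50%) go to the higher level.
    """
    if knockdown_percent >= 90:
        return "very high"
    if knockdown_percent >= 70:
        return "high"
    if knockdown_percent >= 50:
        return "medium"
    return "low"


def odds_ratio(has_feature, success):
    """Odds ratio of success with vs. without the feature.

    Each cell starts at 0.5 (a continuity correction, assumed here only
    to avoid division by zero in small tables).
    """
    a = b = c = d = 0.5
    for feature, hit in zip(has_feature, success):
        if feature and hit:
            a += 1
        elif feature:
            b += 1
        elif hit:
            c += 1
        else:
            d += 1
    return (a * d) / (b * c)


def permutation_p_value(has_feature, success, n_permutations=2000, seed=0):
    """One-sided P value: how often shuffled outcomes reach the observed odds ratio."""
    rng = random.Random(seed)
    observed = odds_ratio(has_feature, success)
    shuffled = list(success)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if odds_ratio(has_feature, shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

Here `success` would mark, per record, whether the rating reached the chosen level (for example "very high" for the ≥ 90% setting), and `has_feature` whether the record carries the feature under test.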
For instance, a Wald test of monotone trend indicated that the presence of the feature "the 6th nucleotide of the siRNA sequence = A" is associated with a significant up-shift of the efficacy distribution (P = 0.0058); odds ratio permutation tests showed that the presence of this feature led to significant increases in the probabilities of achieving both ≥ 90% (P = 0.043) and ≥ 70% (P = 0.0024) efficacies (see Supplementary Figure 1 in Additional file 1).

We examined 276 features (constituting 138 feature pairs) for their association with higher RNAi efficacies, using the Wald test of monotone trend and the odds ratio permutation tests. The features we examined include, to our knowledge, all that have been implicated in previous studies to improve siRNA effectiveness. Each of these features can be placed into one of five categories. The first category is based on nucleotide identities at specific positions on the 19-mer siRNA sequence, e.g. the 6th nucleotide = A; there are 76 feature pairs in this category. The second category includes 19 feature pairs that are either composite sequence features, e.g. there are at least three (A/U)s in the seven nucleotides at the 3′ end of the siRNA, or features that are defined based on the G/C content of the siRNA. The third category consists of 13 feature pairs that are based on the thermodynamics of the siRNAs as measured by the melting temperature or binding energy. The fourth category, consisting of 16 feature pairs, includes features based on target mRNA sites, such as the relative positions of the target sites on the mRNA, and the local secondary structures of the target regions. Finally, the fifth category includes 14 feature pairs that are based on experimental settings, such as the cell lines used in the experiments (HeLa cells, HEK293 cells, and others), the methods used for making and delivering the siRNAs, and the methods used to evaluate the efficacy of the siRNA (Western blot, PCR-based, and others). The complete list of these tested features, and references to the studies that implicated them in enhancing siRNA efficacies, are provided in Supplementary Tables 1-5 in Additional file 1.

Figure 2. Survey of features associated with the achievement of higher efficacies. The efficacy of a siRNA experiment is rated on a four-level scale. In Set A, the percentages of records achieving these ratings are 34.1%, 34.6%, 16.3% and 14.9%, respectively. The distribution of the efficacy ratings across the four levels changes when a certain feature is present in the siRNA experiments. For 14 selected features (constituting 7 pairs of complementary features), the efficacy rating distributions of the subpopulations of siRNA experiments carrying these features are presented. Dotted vertical lines extend from the distribution of the general population.

Of the features examined, we found 34 that were associated with a significant improvement in the efficacy distribution (P < 0.01, Wald test of monotone trend; FDR controlled at 0.056 by the q-value technique [36]); among these, 26 significantly elevated the chance of achieving ≥ 90% efficacies (P < 0.01, odds ratio permutation test, FDR controlled at 0.038), and 27 significantly enhanced the probability of achieving ≥ 70% efficacies (P < 0.01, odds ratio permutation tests, FDR controlled at 0.044; see Supplementary Tables 1-5 in Additional file 1). There are several cases of sub-feature – super-feature relationships among these significant features.
For example, the features the 6th nucleotide = A and the 6th nucleotide ≠ C were both significant; however, the former is a sub-feature of the latter, since when the former is present, the latter must also be present. In each occurrence of a sub-feature – super-feature relationship, we eliminated all but the one feature determined to be the most significant by the Wald test. The feature the 6th nucleotide = A was thus eliminated because its Wald test P value was higher than that of the feature the 6th nucleotide ≠ C.

G/C content related features were treated as a special case. Several different G/C content ranges were suggested in previous studies as being possibly associated with high RNAi effectiveness (32–79%, 30–70%, 30–52%, 35–60%, 20–50% and 31.6–57.9%) [11,12,18,24,30-32]. All these features were tested. Although they do not constitute sub-feature – super-feature relationships, we treated these features as redundant, and retained only one of them (G/C content is between 35 and 60%) because it yielded the lowest P value (0.00018) in the Wald test. The resulting list of non-redundant significant features is shown in Table 1.
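To make the notion of a binary sequence feature concrete, the sketch below evaluates a handful of the sequence-based features discussed above on a single 19-mer. It is an illustration only: the function and dictionary names are hypothetical, positions are assumed to be 1-based along the given strand from 5′ to 3′, the G/C range is assumed to include its endpoints, and the example sequence is made up.

```python
def gc_content(seq):
    """Fraction of G or C nucleotides in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def has_run(seq, length=4):
    """True if any nucleotide occurs `length` or more times in a row."""
    run = 1
    for previous, current in zip(seq, seq[1:]):
        run = run + 1 if current == previous else 1
        if run >= length:
            return True
    return False


# Each feature is a yes/no test on an upper-case RNA 19-mer.
FEATURES = {
    "6th nucleotide = A": lambda s: s[5] == "A",
    "6th nucleotide != C": lambda s: s[5] != "C",
    "19th nucleotide = (A/U)": lambda s: s[18] in "AU",
    "at least three (A/U)s in the 3' seven": lambda s: sum(b in "AU" for b in s[-7:]) >= 3,
    "no four identical nucleotides in a row": lambda s: not has_run(s, 4),
    "G/C content between 35% and 60%": lambda s: 0.35 <= gc_content(s) <= 0.60,
}


def evaluate(seq):
    """Return {feature name: bool} for a 19-mer given as RNA or DNA letters."""
    seq = seq.upper().replace("T", "U")
    if len(seq) != 19:
        raise ValueError("expected a 19-mer")
    return {name: test(seq) for name, test in FEATURES.items()}


result = evaluate("GCAUGACUGAAGCUAGUUA")  # made-up sequence
```

The sub-feature relation is visible directly in the predicates: any sequence for which "6th nucleotide = A" holds also satisfies "6th nucleotide != C", which is why only one of the two needs to be kept.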
Detailed discussions about these significant features, and comparisons of our analyses with previous findings, can be found in Additional file 1.

Table 1: Non-redundant significant features meeting the criteria (Pwald < 0.01) and (P70 < 0.01 or P90 < 0.01)

Feature name | % Low | % Medium | % High | % Very high | P70 | P90 | Pwald
2nd nucleotide = A | 12.1 | 16.0 | 33.8 | 38.1 | 0.01 | 0.0026 | 0.0019
4th nucleotide = C | 14.1 | 15.4 | 31.5 | 39.0 | 0.098 | 0.00036 | 0.0075
6th nucleotide ≠ C | 14.3 | 15.6 | 35.0 | 35.1 | 0.00066 | 0.0089 | 0.0052
7th nucleotide ≠ U | 14.4 | 15.9 | 34.5 | 35.2 | 0.01 | 0.0043 | 0.0091
9th nucleotide = C | 11.1 | 16.6 | 32.6 | 39.6 | 0.008 | 0.00021 | 0.00053
17th nucleotide = A | 11.4 | 15.5 | 37.1 | 35.9 | 0.00049 | 0.1 | 0.0049
18th nucleotide ≠ C | 14.4 | 15.9 | 34.3 | 35.4 | 0.01 | 0.00071 | 0.0048
19th nucleotide = (A/U) | 12.0 | 16.0 | 35.3 | 36.7 | 0.00029 | 0.0043 | 0.000058
At least three (A/U)s in the seven nucleotides at the 3′ end | 13.4 | 16.4 | 33.7 | 36.5 | 0.00001 | 0.00001 | 2.5E-09
No occurrences of four or more identical nucleotides in a row | 14.2 | 15.9 | 35.4 | 34.5 | 0.00001 | 0.012 | 0.0014
No occurrences of G/C stretches of length 7 or longer | 14.3 | 16.4 | 34.9 | 34.4 | 0.00001 | 0.00001 | 0.000015
G/C content is between 35% and 60% | 13.3 | 16.7 | 35.1 | 35.0 | 0.00001 | 0.0019 | 0.00018
Tm is between 20 and 60°C | 13.2 | 16.5 | 35.0 | 35.3 | 0.0045 | 0.023 | 0.003
Binding energy of N16–N19 -9 KCal/Mol | 11.8 | 17.1 | 34.0 | 37.1 | 0.01 | 0.0026 | 0.00025
Binding energy of N16–N19 – binding energy of N1–N4 is between 0 and 1 KCal/Mol | 12.6 | 14.9 | 32.5 | 39.9 | 0.01 | 0.00036 | 0.0078
Local folding potential (mean) ≥ -22.72 KCal/Mol | 12.0 | 14.6 | 34.7 | 38.7 | 0.00001 | 0.00001 | 9.3E-09
Target site is on CDS | 14.4 | 16.2 | 34.3 | 35.2 | 0.00001 | 0.00001 | 0.000055
Cell line = HeLa | 7.9 | 10.6 | 41.2 | 40.3 | 0.00001 | 0.00016 | 4.0E-09
Test method = Western blot | 10.8 | 15.5 | 34.8 | 36.9 | 0.00001 | 0.00001 | 3.8E-14
Test object ≠ mRNA | 13.1 | 14.5 | 34.8 | 37.6 | 0.00001 | 0.00001 | 9.3E-10

At P < 0.01, the FDR for the three tests (Wald test of monotone trend, permutation test of odds ratios (≥ 70%) and permutation test of odds ratios (≥ 90%)) was controlled at the levels of 0.056, 0.044 and 0.038 respectively.

Combined effects of multiple significant features

The presence of any single significant feature was not sufficient to improve the efficacy distribution substantially. When present alone, the significant features listed in Table 1 increased the probability of achieving ≥ 90% efficacies by an average of only 2.5% (from 34.1% to 36.6%), and they increased the chance of achieving ≥ 70% efficacies by an average of merely 2.2% (from 68.7% to 70.9%). To achieve substantially improved efficacies, the concurrent presence of several significant features is required.

When multiple features are co-present, we cannot assume that their contributions to the effectiveness of the RNAi experiments are additive, since features are not always independent of one another. For instance, the presence of the feature the 19th nucleotide = (A/U) clearly increases the probability that the feature there are at least three (A/U)s in the seven nucleotides at the 3′ end of the siRNA is true. Indeed, these two features exhibited negative cooperativity: when present alone, they increased the chances of achieving ≥ 90% efficacies by 2.6% and 2.4%, respectively; when co-present, these two features resulted in merely a 2.7% increase in the chance of achieving ≥ 90% efficacy, much smaller than the sum of the effects of the two features (see Additional file 1 for discussions about cooperativity and additive effects of multiple features).

In seeking effective siRNA design rules, we should try to identify combinations of features that exhibit positive cooperativity. The large size and diverse origins of the records in the siRecords dataset allowed us to systematically analyze how features jointly influence siRNA efficacies.
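The negative-cooperativity example above reduces to a small piece of arithmetic, worked through below. The comparison against a plain sum of individual gains is a simplifying assumption for illustration; the formal treatment of cooperativity is the one in Additional file 1, and the function name is hypothetical.

```python
def cooperativity_gap(gain_a, gain_b, joint_gain):
    """Joint gain minus the sum of the individual gains, in percentage points.

    A negative value means the pair does worse than the additive benchmark.
    """
    return joint_gain - (gain_a + gain_b)


# Gains in the chance of reaching >= 90% efficacy, as quoted in the text:
# 19th nucleotide = (A/U) alone: 2.6; three (A/U)s at the 3' end alone: 2.4;
# both present together: 2.7.
gap = cooperativity_gap(2.6, 2.4, 2.7)
kind = "negative" if gap < 0 else "positive" if gap > 0 else "additive"
```

With these numbers the additive benchmark would be 5.0 percentage points, so the observed joint gain of 2.7 falls about 2.3 points short of it.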
Three significant features (Cell line = HeLa, Test method = Western blot and Test object ≠ mRNA) were excluded from joint effect analyses because they are based on experimental settings, which are typically chosen independently of siRNA design. For the remaining 17 significant features, we looked at all possible combinations of a fixed number (l = 2, 3, 4, 5 and 6) of features. For each combination of l features, we examined the number of records in Set A that concurrently carry all l features, and the percentages of these records that achieved ≥ 90% and ≥ 70% efficacies. For every given l, we focused on the top-10 feature combinations, i.e., the 10 combinations that exhibited the highest percentage of records achieving ≥ 90% or ≥ 70% efficacies. When more than 10 feature combinations were tied, all tied combinations were considered. As we expected, as l (the number of features in the combinations) increased, the number of records concurrently carrying all l features declined sharply (Figure 3C). Meanwhile, the percentage of experiments achieving ≥ 90% and ≥ 70% efficacies increased steadily as l increased (Figure 3A and 3B).

Figure 3. Highly effective siRNA design rules were obtained by selecting the top l-feature combinations, i.e., the combination of l non-redundant significant features that exhibited the highest percentages of records achieving ≥ 70% or ≥ 90% efficacies on Set A. A. For l = 2 through 6, the subpopulations of Set A records that carry all combinations of l features were examined, and the 10 feature combinations (FCs) that resulted in the highest percentages of records achieving ≥ 70% efficacies were selected. When there was a tie of more than 10 FCs, all of them were considered (marked in the graph). The mean percentages of the top FCs are presented in black filled circles. These FCs were used to select siRNA experiments in Set T, and the results are shown in grey filled circles. Error bars indicate standard errors. The first two data points in the graphs represent the baseline levels (the percentage of records achieving ≥ 70% efficacies for the entire Set A or Set T), and the mean levels for top-10 individual features (the 10 individual features that led to the highest percentages of records achieving ≥ 70% efficacies), respectively. B. Similarly to A, the top FCs selected with ≥ 90% efficacies are plotted, together with the baseline levels and the mean levels for top individual features. C. The numbers of records selected in the top l-feature combinations dropped sharply as l increased. The mean numbers of selected records for Set A (with error bars indicating standard errors) are presented in black filled circles and black open circles for ≥ 70% and ≥ 90% efficacies, respectively. The numbers of selected records for Set T are presented in corresponding grey symbols. Again, the first two data points represent the baseline levels (numbers of records in entire Set A and Set T), and the numbers of records selected with the top-10 individual features, respectively.

[Figure 3 panels: A, % records achieving ≥ 70% efficacy (High + Very high), annotated 24 FCs and 188 FCs; B, % records achieving ≥ 90% efficacy (Very high), annotated 14 FCs and 94 FCs; C, number of records. The horizontal axis of each panel runs over Baseline, Individual features, and l = 2 through l = 6, for Set A and Set T.]

The sigmoid shape of the two ascending curves is an indication of positive cooperativity (see discussion in Additional file 1). This suggests that by simply retaining the feature combinations that led to the highest percentages of records achieving efficacies of ≥ 90% or ≥ 70%, we were, in effect, exploiting the positive cooperativity, or favorable interaction, among these features. At l = 5, 24 feature combinations had a 100% chance of having efficacies ≥ 70%; that is, every experiment in which the siRNA used had all features contained in any one of the 24 feature combinations exhibited an efficacy of ≥ 70%. Similarly, 14 feature combinations had 100% probabilities of having efficacies ≥ 90% at l = 5, meaning that all siRNA experiments having these feature combinations demonstrated efficacies ≥ 90%. At l = 6, 188 feature combinations had 100% probabilities of having efficacies of ≥ 70%, and 94 feature combinations had 100% probabilities of achieving efficacies of ≥ 90%.

Integrated rule sets for effective siRNA design

A disjunction of the top feature combinations described above (across l = 2 through 6; a feature combination is also called a rule hereafter) defines a rule set for designing effective siRNA experiments.
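The selection of top l-feature combinations and their use as a disjunctive rule set can be sketched roughly as below. This is a toy illustration under stated assumptions, not the authors' implementation: the record representation (a set of feature names plus a success flag), the function names, and the choice to skip combinations carried by no record are all hypothetical.

```python
from itertools import combinations


def top_combinations(records, features, l, top_n=10):
    """Rank all l-feature combinations by the success rate of their carriers.

    records: list of (set of features present, bool success) pairs.
    Returns (success rate, number of carriers, combination) tuples for the
    top_n combinations, keeping every combination tied with the last one.
    """
    scored = []
    for combo in combinations(sorted(features), l):
        carriers = [hit for present, hit in records if present.issuperset(combo)]
        if carriers:  # skip combinations that no record carries
            scored.append((sum(carriers) / len(carriers), len(carriers), combo))
    scored.sort(key=lambda item: item[0], reverse=True)
    if len(scored) <= top_n:
        return scored
    cutoff = scored[top_n - 1][0]
    return [item for item in scored if item[0] >= cutoff]


def rule_set_matches(rule_set, present):
    """A disjunctive rule set accepts a siRNA if any single rule is fully satisfied."""
    return any(present.issuperset(rule) for rule in rule_set)


# Toy records with three hypothetical features.
toy_records = [
    ({"f1", "f2"}, True),
    ({"f1", "f2", "f3"}, True),
    ({"f1"}, False),
    ({"f2", "f3"}, False),
    ({"f3"}, False),
    ({"f1", "f3"}, True),
]
best = top_combinations(toy_records, {"f1", "f2", "f3"}, l=2, top_n=1)
rules = [combo for _, _, combo in best]
```

In the toy data two pairs tie at a 100% success rate, so both are kept even though `top_n` is 1, mirroring the tie handling described above. Note how quickly the number of carriers shrinks as l grows, which is the trade-off shown in Figure 3C.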
Rule sets defined in this way are likely to contain redundancies, because if a rule consisting of l features {f1, f2, ..., fl} is one of the best l-feature combinations, then a rule consisting of (l+1) features {f1, f2, ..., fl, f0}, where f0 is any other feature, is likely to be one of the best (l+1)-feature combinations and is thus also selected into the rule set. A disjunctive rule merging (DRM) algorithm can be applied to remove redundancies from the rule sets while allowing control over the stringency of the resulting rule sets (see Methods). This algorithm takes a user-provided stringency parameter α (with a range of [0, 1]) and produces a non-redundant set of disjunctive rules, each rule in the set resulting in a proportion ≥ α of the records in Set A reaching efficacies ≥90%. The rule set rendered for the highest α level (α = 0.951, denoted RS0.951) contains seven rules (Table 2). Generally speaking, the lower the α level, the larger the number of rules included in the rule set (see Supplementary Table 6 in Additional file 1).

Performance comparison between DRM rule sets and existing design tools

We assessed the performance of the DRM rule sets and compared it with that of 15 existing online design tools commonly used in siRNA design practice, using the Set T data reserved for this purpose (Table 3 and Figure 4). Set T includes the records of 1,093 siRNA experiments, representing 1,014 unique target sites on 744 genes. How do we assess the performance of a siRNA design program? A good siRNA design program should (a) provide a sufficient number of candidate siRNAs for a given gene; and (b) offer a high PPV (positive predictive value), or a low false positive rate (see Methods).

On the number of candidate siRNAs predicted, the DRM rule set with the highest stringency level (RS0.951) produced on average 18.9 predicted effective siRNAs per gene.
This indicates that this rule set offers sufficient candidate siRNAs in an ordinary siRNA design task for a gene of average length. However, the smallest number of predicted effective siRNAs for a gene is 1. This suggests that for the shortest genes, the number of candidate siRNAs offered by this rule set may not be enough. There are considerations other than achieving high efficacy (e.g., avoiding cross-reactivity with other genes) in the design of siRNA experiments, so it is desirable to have multiple candidate siRNAs designed for every gene. For the shortest genes, we resort to DRM rule sets of lower stringency levels. For example, RS0.845 produced at least 3 potentially effective siRNAs for each gene, and an average of 38.1 potentially effective siRNAs per gene (see Supplementary Figure 3 in Additional file 1).

The online design tools varied greatly in the numbers of candidate siRNAs they provided. The highest number of predicted effective siRNAs was offered by EMBOSS sirna by Institute Pasteur (639.4 siRNAs per gene). IDT RNAi Design by IDT, Inc. produced the lowest number of predicted effective siRNAs (5.8 siRNAs per gene). Among the 15 online design tools, 10 offered larger numbers of candidate siRNAs than DRM RS0.951, and 4 provided larger numbers of candidate siRNAs than DRM RS0.845.

Given that a sufficient number of candidate siRNAs are provided, the most important parameter that measures the performance of a design tool is the PPV. Only a small proportion of possible siRNA sites have been experimentally tested for effectiveness (1,014 sites among 2,453,510 possible 19-mer sites on the 744 genes). Based on these experimentally tested siRNA sites, we compared the PPVs of the DRM rule sets to those of 15 existing online design tools. For the ≥90% efficacy setting and the ≥70% efficacy setting, DRM RS0.951 showed PPVs of 72.7% and 90.9%, respectively.
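The three performance measures used in this comparison can be written out from the contingency table notation of Table 5 (NA, NB, NC, ND). The sketch below uses the standard definitions of sensitivity, specificity and PPV, which are assumed here to match those given in Methods; the function name prediction_metrics is hypothetical, and the four counts in the usage line are back-calculated from the IDT row of Table 4 (they are not reported directly in the paper).

```python
def prediction_metrics(n_a, n_b, n_c, n_d):
    """Sensitivity, specificity and PPV from a 2x2 contingency table.

    n_a: predicted effective,   truly effective   (N_A in Table 5)
    n_b: predicted effective,   truly ineffective (N_B)
    n_c: predicted ineffective, truly effective   (N_C)
    n_d: predicted ineffective, truly ineffective (N_D)
    """
    nan = float("nan")
    sensitivity = n_a / (n_a + n_c) if (n_a + n_c) else nan
    specificity = n_d / (n_b + n_d) if (n_b + n_d) else nan
    # PPV is undefined when a tool predicts no effective siRNA at all.
    ppv = n_a / (n_a + n_b) if (n_a + n_b) else nan
    return sensitivity, specificity, ppv


# Assumed counts (back-calculated, not reported): 4 predicted effective siRNAs
# of which 2 are truly effective, out of 166 truly effective and 58 truly
# ineffective records in the 224-record independent subset.
sens, spec, ppv = prediction_metrics(n_a=2, n_b=2, n_c=164, n_d=56)
print(round(sens, 3), round(spec, 3), round(100 * ppv, 1))  # 0.012 0.966 50.0
```

Because PPV depends only on the records a tool actually selects, a rule set that selects few records can reach a high PPV while its sensitivity stays low, which is the trade-off visible across the rows of Tables 3 and 4.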
In other words, 72.7% of the siRNAs predicted to be effective by DRM RS0.951 had ≥90% efficacy, and 90.9% of them showed ≥70% efficacy. This rule set and two others with lower α levels, RS0.895 and RS0.845, surpassed all online design tools in PPVs on both settings. Among the 15 online design tools, the three that offered the highest PPVs for the ≥90% efficacy setting were WI siRNA Selection Program by Whitehead Institute (53.8%), siDESIGN Center by Dharmacon, Inc. (48.5%) and BLOCK-iT RNAi Designer by Invitrogen Corp. (47.6%); and the four that offered the highest PPVs for the ≥70% efficacy setting were siRNA Target Finder by GenScript Corp. (85.2%), WI siRNA Selection Program by Whitehead Institute (84.6%), siDESIGN Center by Dharmacon, Inc. (81.8%) and siRNA Target Designer by Promega Corp. (81.8%).

Set T is a fair dataset for comparing the performance of the DRM rule sets and the online design tools, because it contains no records overlapping with Set A, on which the DRM rule sets were derived. However, Set T might not be considered a completely independent dataset, because (a) there are records in Set T that originated from the same studies as some records in Set A; and (b) there are records of siRNA experiments in Set T that target the same genes as some experiments in Set A. To rule out the possibility that these two factors might contribute to better performance of the DRM rule sets for unforeseen reasons and unfairly favor the DRM rule sets in the performance comparison, we compiled an independent subset of Set T, eliminating all records that share the same origins as any records in Set A, and all records that target genes that are also targeted by any records in Set A.
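The construction of the independent subset amounts to a two-condition filter. The following sketch shows one way to express it; the Record fields (record_id, study_id, gene), the function name independent_subset and the toy records are hypothetical and are not taken from the siRecords schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str
    study_id: str  # hypothetical identifier of the source publication
    gene: str      # hypothetical identifier of the targeted gene


def independent_subset(set_t, set_a):
    """Keep Set T records that share neither a study nor a target gene with Set A."""
    studies_a = {r.study_id for r in set_a}
    genes_a = {r.gene for r in set_a}
    return [
        r for r in set_t
        if r.study_id not in studies_a and r.gene not in genes_a
    ]


set_a = [Record("a1", "study1", "TP53"), Record("a2", "study2", "KRAS")]
set_t = [
    Record("t1", "study1", "MYC"),   # same study as a1: dropped
    Record("t2", "study3", "KRAS"),  # same gene as a2: dropped
    Record("t3", "study4", "EGFR"),  # shares neither: kept
]
print([r.record_id for r in independent_subset(set_t, set_a)])  # ['t3']
```

Both conditions must hold for a record to be kept, which is why such a filter can shrink the test set substantially.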
We compared the performance of the DRM rule sets with that of the 15 online design tools using this independent subset (224 siRNAs targeting 197 different genes, see Table 4). Because of the reduced size of the dataset (by nearly 80%), the sensitivity, specificity and PPVs for all tools and rule sets showed higher levels of variability. The three DRM rule sets with the highest α levels (RS0.951, RS0.895 and RS0.845) achieved 100% PPV. Two online design tools, BLOCK-iT by Invitrogen Corp. and WI siRNA Selection Program by Whitehead Institute, also achieved 100% PPV, but the other online design tools achieved lower PPVs ranging between 50.0% and 86.4%. Although the small size of the independent subset prevented this analysis from being completely conclusive, it is fair to state that the comparison based on the independent subset is generally in agreement with the comparison based on the entire Set T.

Table 2: Non-redundant DRM rule set for the highest α level: RS0.951.

Feature: F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17
Rule 1: √√ √ √
Rule 2: √√√ √√
Rule 3: √√ √ √ √ √
Rule 4: √√√√ √√
Rule 5: √ √ √√√√
Rule 6: √ √√√√ √
Rule 7: √√√√ √√

List of features:

| Feature Index | Feature Name |
|---|---|
| F1 | 2nd nucleotide = A |
| F2 | 4th nucleotide = C |
| F3 | 6th nucleotide ≠ C |
| F4 | 7th nucleotide ≠ U |
| F5 | 9th nucleotide = C |
| F6 | 17th nucleotide = A |
| F7 | 18th nucleotide ≠ C |
| F8 | 19th nucleotide = (A/U) |
| F9 | At least three (A/U)s in the seven nucleotides at the 3′ end |
| F10 | No occurrences of four or more identical nucleotides in a row |
| F11 | No occurrences of G/C stretches of length 7 or longer |
| F12 | G/C content is between 35 and 60% |
| F13 | Tm is between 20 and 60°C |
| F14 | Binding energy of N16–N19 -9 KCal/Mol |
| F15 | Binding energy of N16–N19 – binding energy of N1–N4 is between 0 and 1 KCal/Mol |
| F16 | Local folding potential (mean) ≥ -22.72 KCal/Mol |
| F17 | Target site is on CDS |

Table 3: Comparison in performance between 15 online siRNA design tools and DRM rule sets with four different stringency levels (α = 0.951, 0.895, 0.845 and 0.827).

| Design Program | Institution/Company | URL | Avg. # Effective siRNAs Predicted per Gene | Sensitivity (≥90%) | Specificity (≥90%) | PPV (≥90%) | Sensitivity (≥70%) | Specificity (≥70%) | PPV (≥70%) |
|---|---|---|---|---|---|---|---|---|---|
| Ambion siRNA Target Finder | Ambion, Inc. | [64] | 190.0 | 0.603 | 0.456 | 37.0% | 0.574 | 0.457 | 70.3% |
| Jack Lin's siRNA Sequence Finder | Cold Spring Harbor Laboratory | [65] | 207.5 | 0.204 | 0.759 | 30.9% | 0.221 | 0.757 | 67.1% |
| siDESIGN Center | Dharmacon, Inc. | [66] | 9.8 | 0.042 | 0.976 | 48.5% | 0.036 | 0.982 | 81.8% |
| siRNA Target Finder | GenScript Corp. | [67] | 22.4 | 0.032 | 0.979 | 44.4% | 0.030 | 0.988 | 85.2% |
| Imgenex sirna Designer | Imgenex Corp. | [68] | 22.8 | 0.116 | 0.913 | 41.5% | 0.108 | 0.929 | 77.4% |
| EMBOSS sirna | Institute Pasteur | [69] | 639.4 | 0.778 | 0.250 | 35.4% | 0.767 | 0.258 | 69.9% |
| IDT RNAi Design (SciTools) | Integrated DNA Technologies, Inc. | [70] | 5.8 | 0.032 | 0.975 | 40.0% | 0.030 | 0.979 | 76.7% |
| BLOCK-iT RNAi Designer | Invitrogen Corp. | [71] | 11.4 | 0.026 | 0.985 | 47.6% | 0.020 | 0.982 | 71.4% |
| siSearch | Karolinska Institutet | [72] | 19.6 | 0.016 | 0.986 | 37.5% | 0.017 | 0.991 | 81.3% |
| SiMAX | MWG-Biotech, Inc. | [73] | 35.1 | 0.161 | 0.843 | 35.3% | 0.172 | 0.872 | 75.1% |
| BIOPREDsi | Novartis Institutes for BioMedical Research | [74] | 10.0 | 0.794 | 0.908 | 31.3% | 0.820 | 0.899 | 64.6% |
| Promega siRNA Target Designer | Promega Corp. | [75] | 38.0 | 0.093 | 0.941 | 45.5% | 0.083 | 0.958 | 81.8% |
| QIAGEN siRNA Design Tool | QIAGEN, Inc. | [76] | 29.6 | 0.167 | 0.862 | 38.9% | 0.161 | 0.881 | 75.3% |
| SDS/MPI | University of Hong Kong | [77] | 432.8 | 0.656 | 0.380 | 35.9% | 0.632 | 0.368 | 69.2% |
| WI siRNA Selection Program | Whitehead Institute | [78] | 9.5 | 0.019 | 0.992 | 53.8% | 0.015 | 0.994 | 84.6% |
| DRM RS0.951 | | | 18.9 | 0.021 | 0.996 | 72.7% | 0.013 | 0.997 | 90.9% |
| DRM RS0.895 | | | 20.7 | 0.032 | 0.992 | 66.7% | 0.021 | 0.994 | 88.9% |
| DRM RS0.845 | | | 38.1 | 0.032 | 0.986 | 54.5% | 0.026 | 0.984 | 90.9% |
| DRM RS0.827 | | | 51.8 | 0.037 | 0.973 | 42.4% | 0.038 | 0.988 | 87.9% |

Comparison made based on Set T (1,093 siRNA experiments targeting 744 genes). Default settings were used for the 15 online predicting tools. Two sets of parameters (sensitivity, specificity and PPV) were calculated for each predicting tool or rule set. One was for ≥90% efficacy (that is, a siRNA experiment was considered truly effective if it achieved ≥90% efficacy); the other was for ≥70% efficacy (considered truly effective if ≥70% efficacy was reached in the experiment).

Discussion

It has been recognized that many existing siRNA design criteria (and the design tools in which they are implemented) failed to provide the promised levels of performance when tested with unseen data, largely due to the overfitting problem in their development [20,24]. Practically, the key to countering this problem is to make use of a large siRNA efficacy dataset from diverse origins when developing siRNA design rules. In this study, we took advantage of the recent siRecords collection in our development of the DRM rule sets. First, we conducted a survey on the siRecords dataset of all known features previously implicated in influencing siRNA knock-down efficacy. This survey resulted in a list of features that significantly boosted the chance of achieving higher siRNA efficacies. Then, we examined quantitatively how these significant features interact with one another in their joint effects on achieving higher efficacies. The combinations of features that give rise to the highest levels of boosting to siRNA efficacies were picked and reorganized using the DRM algorithm, producing the rule sets.
Finally, the performance of these rule sets was verified on a reserved dataset (Set T, also from siRecords) and was compared with that of 15 online siRNA design tools commonly used in current siRNA design practice.

The survey of features influencing RNAi effectiveness conducted in this study is by far the largest survey of this type reported to date (276 features were examined on a siRNA efficacy dataset consisting of 2,184 records of experiments originating from 1,141 independent studies). Among the significant features identified in the survey (Table 1) are several that have been implicated in multiple previous studies as influencing the siRNA efficacy. They include a few features related to weaker binding on the 3′ end (the 17th nucleotide = A, the 18th nucleotide ≠ C, the 19th nucleotide = (A/U), At least three (A/U)s in the seven nucleotides at the 3′ end, Binding energy of N16–N19 -9 KCal/Mol, and Binding energy of N16–N19 – binding energy of N1–N4 is between 0 and 1 KCal/Mol), one feature about a lower G/C range (G/C content is between 35% and 60%), two features related to unusual sequence patterns (No occurrences of four or more identical nucleotides in a row and No occurrences of G/C stretches of length 7 or longer), one feature related to melting temperature (Tm is between 20 and 60°C), and one feature related to the target location (Target site is on CDS). However, there are also a small number of features that were not reported to be significant in any previous studies, e.g., the 4th nucleotide = C and the 9th nucleotide = C. It appears that there are higher levels of disagreement between our survey results and previous findings for sequence-related features (Categories 1 and 2) than for features defined based on the thermodynamics of the siRNAs and on target mRNA sites (Categories 3 and 4), with the exception of the 3-nucleotide segment on the 3′ end (N17–N19; the lower G/C content in this segment is correlated with lower binding energy on the 3′ end).
Notably, three Category 5 features (defined based on experimental settings), Cell line = HeLa, Test method = Western blot and Test object ≠ mRNA, were among those found to be most significant. Although there have been reports of siRNA efficacy being influenced by cell lines and test methods [37-40], this is the first quantitative analysis of how strong these influences are. More details about the significant features found in the survey, and comparisons of our analyses with previous findings, are presented in Additional file 1.

Figure 4: The ROC graph shows the performance of the DRM rule sets at several α levels (filled circles) and that of several existing online predicting tools (open diamonds; Dharmacon denotes Dharmacon Inc.'s siDESIGN Center, GenScript denotes GenScript Corp.'s siRNA Target Finder, IDT denotes Integrated DNA Technologies Inc.'s RNAi Design (SciTools), Invitrogen denotes Invitrogen Corp.'s BLOCK-iT RNAi Designer, and siSearch stands for the siSearch tool by CGB, Karolinska Institutet). A siRNA experiment was considered effective if it achieved ≥70% efficacy (was rated high or very high efficacy). The dotted line denotes the diagonal of the ROC. Unlike the diagonal line in a ROC of a common training task, which represents the performance of a random guesser, the diagonal line shown in this graph represents the general siRNA design practice, because this is where the siRecords data were obtained. Symbols to the upper left of the diagonal line represent design rules that perform better than the general design practice. The farther a symbol is from the dotted line, the better the performance of the corresponding design tool.

[Figure 4 plot: x-axis 1 − Specificity, y-axis Sensitivity, both from 0.00 to 0.05; labelled points are Invitrogen, Dharmacon, IDT, GenScript, siSearch, and the DRM rule sets at α = 0.784, 0.796, 0.827, 0.845, 0.895 and 0.951.]

Table 4: Comparison in performance between 15 online siRNA design tools and 4 DRM rule sets based on the independent subset of Set T.

| Design Program | Institution/Company | # Predicted effective siRNAs | # Predicted ineffective siRNAs | Sensitivity | Specificity | PPV (%) |
|---|---|---|---|---|---|---|
| Ambion siRNA Target Finder | Ambion, Inc. | 144 | 80 | 0.645 | 0.362 | 74.3 |
| Jack Lin's siRNA Sequence Finder | Cold Spring Harbor Laboratory | 44 | 180 | 0.229 | 0.897 | 86.4 |
| siDESIGN Center | Dharmacon, Inc. | 7 | 217 | 0.036 | 0.983 | 85.7 |
| siRNA Target Finder | GenScript Corp. | 6 | 218 | 0.024 | 0.966 | 66.7 |
| Imgenex sirna Designer | Imgenex Corp. | 24 | 200 | 0.114 | 0.914 | 79.2 |
| EMBOSS sirna | Institute Pasteur | 180 | 44 | 0.801 | 0.190 | 73.8 |
| IDT RNAi Design (SciTools) | Integrated DNA Technologies, Inc. | 4 | 220 | 0.012 | 0.966 | 50.0 |
| BLOCK-iT RNAi Designer | Invitrogen Corp. | 2 | 222 | 0.012 | 1.000 | 100 |
| siSearch | Karolinska Institutet | 0 | 224 | N/A | N/A | N/A |
| SiMAX | MWG-Biotech, Inc. | 48 | 176 | 0.235 | 0.845 | 81.3 |
| BIOPREDsi | Novartis Institutes for BioMedical Research | 4 | 220 | 0.018 | 0.983 | 75.0 |
| Promega siRNA Target Designer | Promega Corp. | 26 | 198 | 0.127 | 0.914 | 80.8 |
| QIAGEN siRNA Design Tool | QIAGEN, Inc. | 33 | 191 | 0.151 | 0.862 | 75.8 |
| SDS/MPI | University of Hong Kong | 151 | 73 | 0.663 | 0.293 | 72.8 |
| WI siRNA Selection Program | Whitehead Institute | 12 | 212 | 0.072 | 1.000 | 100 |
| DRM RS0.951 | | 1 | 223 | 0.006 | 1 | 100 |
| DRM RS0.895 | | 4 | 220 | 0.024 | 1 | 100 |
| DRM RS0.845 | | 5 | 219 | 0.030 | 1 | 100 |
| DRM RS0.827 | | 9 | 215 | 0.048 | 0.983 | 88.9 |

Comparison made based on the independent subset of Set T (224 siRNA experiments targeting 197 genes). Default settings were used for the 15 online predicting tools. A siRNA experiment was considered effective if it achieved ≥70% efficacy (was rated high or very high efficacy).

In a recently published review article, several considerations for selecting effective siRNAs were proposed, resulting from the summarization and integration of major recent findings in the field of siRNA design [41]. Comparison of these considerations with the survey results obtained in this study indicates that they generally agree with each other (see Supplementary Table 8 in Additional file 1). Of the 34 features pertinent to the considerations proposed by Pei and Tuschl, 29 were found to be significant in boosting the siRNA efficacy. Among the remaining 5 features, the feature G/C content is between 30 and 52% was found to be associated with a commensurate, though not significant, improvement in the efficacy distribution (P70 = 0.082 and Pwald = 0.056). Two related features, G/C content is between 35 and 60% and G/C content is between 31.6 and 57.9%, however, were found to be highly significant in boosting the siRNA efficacy, agreeing with the common understanding that effective siRNAs prefer a low-to-medium G/C content. Two features pertinent to the considerations proposed by Pei and Tuschl that are related to target accessibility, siRNA passes the repelling loop filter and Anti-sense siRNA binding energy ≤ -10 KCal/Mol, were not found to be significant in our survey. Yet, two other features closely related to them, H-b index < 28.8 and Local free energy of the most stable structure ≥ -20.9 KCal/Mol, were found to be significant.
The remaining two features pertinent to the considerations proposed by Pei and Tuschl, Binding energy of N6–N11 ≥ -13 KCal/Mol and 10th nucleotide = (A/U), were not found to be significant in our survey.

Since the siRecords collection is compiled from published siRNA studies, there is the concern that it may be biased towards higher efficacy siRNAs, because researchers are probably less inclined to report lower efficacy experiments in their research articles. We can assess the size of this bias by comparing the efficacy distribution of the siRecords collection with that of published randomly designed siRNAs. In two published studies [11,22], moderately large numbers (180) of randomly designed siRNAs were tested for knock-down efficacies. The percentages of siRNAs resulting in <50% efficacies in these two studies were 22.2% and 23.3%, respectively. In the siRecords data used in this study, the percentage of records receiving the low efficacy rating (i.e., produced <50% knock-down efficacies) is 14.3%. In one of these previous studies [22], the percentage of siRNAs resulting in ≥90% efficacies was reported to be 29.4%. In the siRecords collection, the percentage of records receiving the very high efficacy rating (i.e., produced ≥90% efficacies) is 34.3%. Therefore, the siRecords collection is indeed biased towards the higher efficacy experiments, likely because researchers are less ready to report lower efficacy experiments. However, this bias is not severe, because nearly 2/3 of the low efficacy siRNA experiments are still included in siRecords. Furthermore, the analyses conducted in this study, in particular the results of the survey of features influencing the siRNA efficacy, are not influenced by the reduced number of low efficacy siRNAs in the dataset.
These analyses are reliable as long as the dataset includes a sufficiently large number of low efficacy records (the number of low efficacy records used in this study is 467).

Another concern over the use of siRNA data compiled from published siRNA studies is that the design of siRNA experiments in these published studies might be dominated by one or two of the design tools used in the performance comparison (Table 3), compromising the objectivity of this comparison. An analysis of the relative utility of the 15 online siRNA design tools (see Supplementary Table 7 in Additional file 1) suggested that these design tools had varied levels of utility, yet none of them had dominated the current siRNA design practice (see discussion in Additional file 1).

It is desirable to validate the DRM rule sets obtained in this study using a dataset independent of siRecords. However, it is difficult to find a separate siRNA efficacy dataset that is as large and diverse as the siRecords collection. In a recent report by Huesken et al., a genome-wide human siRNA library was constructed, in which 2,431 randomly selected siRNAs targeting 34 fusion mRNAs were tested for efficacy [42]. There were concerns about using this library of siRNAs as a validation dataset for the DRM rule sets, because, firstly, this dataset is of a single origin; and secondly, the siRNA efficacies were tested against fusion mRNAs. This is considered a somewhat questionable practice because the native secondary structures may not be well preserved in the fusion mRNAs.
Although Huesken et al. performed control experiments which suggested that fusion mRNAs and endogenous mRNAs produced similar efficacy estimates in the setting they adopted, and argued that sequence features, rather than secondary structure related features, were the main determinants of the siRNA efficacy, there have been multiple recent reports of secondary structures playing important roles in determining the siRNA efficacy [17,25]. These reports are backed up by the finding in our survey that at least one secondary structure related feature (Local folding potential (mean) ≥ -22.72 KCal/Mol) significantly boosts the chance of achieving higher siRNA efficacy.

Table 5: Contingency table for the outcome of prediction tasks.

| | Truly Effective | Truly Ineffective |
|---|---|---|
| Predicted Effective | NA | NB |
| Predicted Ineffective | NC | ND |

Nevertheless, we examined the performance of the DRM rule sets using the 2,431-siRNA dataset provided by Huesken et al. The three DRM rule sets with the highest stringency (RS0.951, RS0.895 and RS0.845) identified 23, 32 and 48 effective siRNAs, respectively, in this dataset. These selected siRNAs had average normalized inhibitory activities of 0.80, 0.78 and 0.76, respectively. When tested using the 249-siRNA test dataset specified in that study, the same three DRM rule sets identified 3, 4 and 6 effective siRNAs, respectively, and the average normalized inhibitory activities of these siRNAs were 0.96, 0.80 and 0.78, respectively. In Huesken et al., the average normalized inhibitory activity of the entire dataset was 0.69, and the authors recommended using 0.75 or 0.80 as cut-offs for selecting effective siRNAs.
These results suggest that, generally speaking, the DRM rule sets were capable of identifying effective siRNAs in this completely independent siRNA efficacy dataset.

As more data become available in siRecords, we will perform updated analyses on this data collection with the aim of obtaining more accurate and more reliable siRNA design rules. In addition, as there is indication that the DRM rule sets behave differently for subpopulations of siRNAs tested under different experimental settings (e.g., for those validated with the Western blot technique and those validated with PCR and other techniques, see Supplementary Figure 4 in Additional file 1), we will refine our analyses and develop separate rule sets for these different subpopulations of siRNAs.

Conclusion

In this study, we identified a bundle of highly effective and generally applicable rule sets for siRNA design. This was accomplished by applying a simple strategy in which we analyzed a large number of candidate features for association with increased siRNA efficacies, then used quantitative analyses of the joint effects of these significant features to identify positive cooperativity among them. The key to our approach was the use of the large set of siRNA efficacy data available in siRecords. The availability of this dataset not only made the execution of this strategy possible, but also curbed the overfitting problem that many rules generated by other design protocols suffer from. We expect that the design rules revealed in this study, together with improving RNAi lab techniques, will make siRNAs a more useful tool for molecular genetics, functional genomics, and drug discovery studies.

Methods

Data preparation

All records of 19-mer siRNAs (not counting the overhanging nucleotides on the 3′ end) were retrieved from the siRecords database.
The records that failed to meet the following criteria were excluded from further analyses: (1) had complete annotations of cell line types, test methods, transfection methods and efficacy classification; (2) had target mRNA lengths ≤ 16,000 nucleotides (this is a limit set by the Mfold program for the calculation of thermodynamics features, see below); (3) the siRNA sequence had no mismatches with the targeted site by pair-wise BLAST (NCBI bl2seq v.2.2.9, parameters -p blastn -W 7 -q -1 -F F). For studies where more than 30 siRNA experiments were reported, we randomly chose 30 to include in our analyses. The cell line types and test methods were grouped based on ATCC (American Type Culture Collection) [43] and Protocol Online [44], respectively.

Features

We define a feature as a binary property of a siRNA experiment concerning a factor potentially influencing the efficacy of the experiment. For a given siRNA experiment, any defined feature is either present or absent. Some example features are listed below:

(1) The 6th nucleotide of the siRNA sequence (counting from the 5′ end on the sense strand) is an adenine (A).
(2) The 17th nucleotide of the siRNA sequence is not a guanine (G).
(3) There are at least three (A/U)s in the seven nucleotides on the 3′ end.
(4) The G/C content of the siRNA sequence is between 30 and 52%.

For Features (1) and (2), the factors concerned are the identities of the 6th and the 17th nucleotides of the siRNA sequence, respectively. For Feature (3), the factor concerned is the seven nucleotides as a whole on the 3′ end of the siRNA sequence. For Feature (4), the factor concerned is the G/C content of the siRNA sequence.

Each feature has a complementary feature, that is, the alternative property concerning the same factor.
For instance, the complementary feature of Feature (1) is the 6th nucleotide of the siRNA sequence ≠ A; and the complementary feature of Feature (3) is there are at most 2 (A/U)s in the seven nucleotides on the 3′ end. For any given siRNA experiment and any given feature, either the feature holds true for the experiment, or the complementary feature must hold true. A feature and its complementary feature constitute a feature pair.

For a given factor, there are multiple ways of formulating features. In some cases, so-called sub-feature – super-feature relationships can result. For example, the following four features are all concerned with the same factor, the identity of the 6th nucleotide of the siRNA sequence:

(5) The 6th nucleotide = A.
(6) The 6th nucleotide ≠ A.
(7) The 6th nucleotide = C.
(8) The 6th nucleotide ≠ C.

Wherever Feature (5) is present, Feature (8) must also be present. Thus, Feature (5) is a sub-feature of Feature (8), and Feature (8) is a super-feature of Feature (5). Similarly, Features (7) and (6) also constitute a sub-feature – super-feature pair.

Feature definitions

We surveyed 276 features (constituting 138 feature pairs) in this study. These features can be classified into the following five categories:

Category 1: Direct sequence features

We defined 152 direct sequence features (76 pairs) based on the position-specific nucleotide identity in the siRNA sequence (on the sense strand). For each position in the 19-mer siRNA sequence, 8 features were defined based on whether or not the nucleotide at the position is an adenine (A), a cytosine (C), a guanine (G), or a uracil (U), respectively.
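The feature definitions above translate directly into boolean functions of the sense-strand sequence. The sketch below enumerates the 152 direct sequence features and evaluates example Features (3) and (4); it is an illustration only. The feature-name strings (such as N6=A), the function names, the example 19-mer, and the choice to treat the G/C range bounds as inclusive are all assumptions, not details taken from the paper.

```python
NUCLEOTIDES = "ACGU"


def direct_sequence_features(sense):
    """Return all 152 direct sequence features (19 positions x 4 nucleotides x 2)."""
    if len(sense) != 19 or set(sense) - set(NUCLEOTIDES):
        raise ValueError("expected a 19-mer over A, C, G, U")
    feats = {}
    for pos, nt in enumerate(sense, start=1):  # positions count from the 5' end
        for x in NUCLEOTIDES:
            feats[f"N{pos}={x}"] = (nt == x)
            feats[f"N{pos}!={x}"] = (nt != x)  # the complementary feature
    return feats


def au_rich_3prime(sense, window=7, minimum=3):
    """Feature (3): at least `minimum` A/U among the last `window` nucleotides."""
    return sum(nt in "AU" for nt in sense[-window:]) >= minimum


def gc_content_between(sense, low=30.0, high=52.0):
    """Feature (4): G/C content in percent within [low, high] (bounds assumed inclusive)."""
    gc = 100.0 * sum(nt in "GC" for nt in sense) / len(sense)
    return low <= gc <= high


sirna = "GCAGCACGACUUCUUCAAG"  # hypothetical sense-strand 19-mer
feats = direct_sequence_features(sirna)
print(len(feats))                        # 152
print(feats["N6=A"], feats["N6!=C"])     # True True
print(au_rich_3prime(sirna))             # True
print(gc_content_between(sirna))         # False
```

The pair N6=A and N6!=C illustrates the sub-feature – super-feature relationship: whenever the first is true the second is necessarily true, while the reverse does not hold.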
Among these features, 24 were previously claimed to favorably influence the siRNA efficacy (see Supplementary Table 1 in Additional file 1).

Category 2: Sequence-derived features

We defined 38 sequence-derived features (19 pairs) that are related to either the sequence composition or the G/C content of the siRNA (see Supplementary Table 2 in Additional file 1). All these features have been previously claimed to have an impact on the siRNA efficacy. Among them, 24 features were defined based on, respectively:

(a) whether or not the 1st nucleotide is a G/C [13,31,45];
(b) whether or not the 10th nucleotide is an A/U [45];
(c) whether the 11th nucleotide is a G/C [14];
(d) whether the 19th nucleotide is an A/U [13,31,45];
(e) whether there are at least 5 (A/U)s in the last 7 nucleotides at the 3′ end [13];
(f) whether there are at least 3 (A/U)s in the last 7 nucleotides at the 3′ end [45];
(g) whether there are at least 3 (A/U)s in the 5 nucleotides at the 3′ end [11];
(h) whether the siRNA contains G/C stretches longer than 9 [13,46];
(i) whether the siRNA contains G/C stretches longer than 7 [18,19];
(j) whether there are occurrences of 3 or more identical nucleotides in a row [18,32];
(k) whether there are occurrences of 4 or more identical nucleotides in a row [16,18,19,47]; and
(l) whether there are at least 3 (A/U)s in the 5 nucleotides at the 5′ end [11].

In addition, 14 features (7 pairs) were defined based on whether the G/C content of the siRNA falls into the following reported optimal G/C ranges: (a) 30–52% [11], (b) 32–79% [12], (c) 30–70% [30], (d) 35–60% [18], (e) 20–50% [32], (f) 31.6–57.9% [31] and (g) 30–79% [16].

Category 3: Features defined based on thermodynamics of the siRNA

Features on Tm, folding energy of the sense strand and total hairpin energy. Ten features (5 pairs) were defined that are related to the melting temperature (Tm) of the siRNA, the folding energy of the sense strand, or the total hairpin energy of the siRNA.
Among them, 6 features were defined based on whether or not the Tm falls into the following three ranges: above 60°C, below 20°C, and between 20 and 60°C [11]. Two features were defined based on whether or not the folding energy of the sense strand is equal to or greater than -5 KCal/Mol [18]. Two features were defined based on whether the absolute value of the total hairpin energy is less than 1 KCal/Mol [24]. The DINAMelt server [48] was used in the calculation of Tm and hairpin energy [29,49]. The total hairpin energy was calculated as the absolute value of the sum of the hairpin energies of the siRNA sense and anti-sense strands in units of KCal/Mol [24] (Chalk, A., personal communication).

Features on binding energy. Sixteen features (8 pairs) related to the binding energy of siRNA sequences were defined. On the 5′ end binding energy, we defined the feature 5′ binding energy is between -9 and -5 KCal/Mol and its complementary feature [24]. On mid-sequence binding energy, we defined 6 features associated with three nucleotide ranges: N6–N11 [22], N7–N11 [15] and N7–N12 [24]. For the nucleotide range N7–N12, we used the reported threshold of -13 KCal/Mol in the feature definition [24]. For the nucleotide range N7–N11, we defined the feature based on whether or not the average free energy profiles fall into the reported optimized range between -1.97 and -1.65 KCal/Mol [15]. For the binding energy of the range N6–N11, for which no threshold was explicitly reported, we took the median value (-13 KCal/Mol) of all siRNAs in the dataset as the threshold. On the 3′ end binding energy, we defined a feature binding energy of N16–N19 -9 KCal/Mol and its complementary feature [24]. In addition, 6 features (3 pairs) were defined that are associated with the difference between the 5′ binding energy and the 3′ binding energy.
They are defined based on: (a) whether or not the difference between the binding energies of N1–N4 and N16–N19 is greater than 0 [22,24], (b) whether or not the difference between the binding energies of N1–N4 and N16–N19 is between 0 and 1 KCal/Mol [24], and (c) whether or not the difference between the binding energies of N1–N5 and N15–N19 is greater than 0 [15], respectively (see Supplementary Table 3 in Additional file 1). The nearest neighbor model parameters described by Xia et al. [50] were used for the binding energy calculation [29]. The binding energies of N1–N4 and N16–N19 were computed as the sum of free energies for 4 base-pair stacks starting from position 1 in the sense strand and one single base stacking energy [21,51] (Chalk, A., personal communication). Calculations of binding energies for N1–N5 and N15–N19 were performed similarly to those for N1–N4 and N16–N19, except that 5 base-pair stacks were used. Binding energies for N6–N11 and N7–N12 were computed as the sum of free energies for 6 base-pair stacks within positions 6–11 and positions 7–12 in the sense strand. The average free energy profile of N7–N11 was computed as the average base pair energy of five consecutive pentamer subsequences starting from positions 7 to 11 in the sense strand (Poliseno, L., personal communication).

Category 4: Features defined based on target mRNA sites
Features on the mRNA target location. Sixteen features (8 pairs) related to the siRNA target location on the mRNA were defined, based on whether or not the target region is within (a) the 5′ UTR [12,16], (b) the 3′ UTR [16], (c) the CDS [18], (d) the first 100 nucleotides of the CDS [12,16], (e) the first quartile of the CDS, (f) the second quartile of the CDS, (g) the third quartile of the CDS [14], and (h) the fourth quartile of the CDS, respectively. The mRNA sequences were obtained from NCBI GenBank.
The target region was determined by using a BLAST search (NCBI bl2seq v.2.2.9 with parameters -W 7 -q -1 -F F). The targeted site was assigned to a sub-region if the entire target site lay within that sub-region.

Features on the secondary structures of the target mRNA. Fourteen features (7 pairs) that are associated with the secondary structures of the target mRNA were defined, based on (a) whether or not the calculated hydrogen bond (H-b) index is less than 28.8 [25], (b) whether or not the siRNA target region is filtered by the repelling loop filter [52], (c) whether or not the local free energy of the most stable structure (LFE_mss) is equal to or greater than -20.9 KCal/Mol [53], (d) whether the average local free energy of the ten most stable structures (LFE_average) is equal to or greater than -20.85 KCal/Mol [53], (e) whether or not the mean local folding potential (LFP) is equal to or greater than -22.72 KCal/Mol, (f) whether or not a non-zero accessibility score was obtained for the siRNA target site [54], and (g) whether or not the anti-sense siRNA binding energy is equal to or less than -10 KCal/Mol [47], respectively (see Supplementary Table 4 in Additional file 1).

The hydrogen bond (H-b) index measures the average number of hydrogen bonds formed between nucleotides in the target region and the rest of the mRNA, and it was calculated according to Luo et al. [25]. We used the median value of all siRNAs in the dataset (28.8) as the threshold since no threshold was explicitly given in the original report. The repelling loop filter was proposed by Yiu et al. for determining the accessibility of the mRNA target region [52]. If, in at least three of the five most stable structures of the whole-length mRNA (calculated with Mfold), the 19-mer target site was contained by at least one "big repelling loop", or by at least two "repelling loops", the target region was identified as invalid by the repelling loop filter.
The LFE (local free energy) was calculated according to Schubert et al. [53], with predicted mRNA secondary structures calculated using Mfold 3.2 [29,55]. The free energy contribution of each sequence local element in a structure was extracted from the .det output files of Mfold; local elements include helices, bulges, and loops, among others. The LFE of the targeted site was computed as the sum of the free energy contributions of all sequence local elements containing one or more nucleotides in the siRNA target site (Schubert, S., personal communication). The ten most stable secondary structures of the mRNA sequence were also used in our calculations. For each siRNA target site, we calculated the LFE for the lowest free energy structure of the site (LFE_mss) and the average LFE of the ten most stable secondary structures (LFE_average). Since no thresholds were explicitly provided in the original report, the medians of all LFE values in the dataset (-20.9 KCal/Mol for LFE_mss and -20.85 KCal/Mol for LFE_average) were used as thresholds in the feature definitions.

The local folding potential (LFP) is a measurement of the RNA local thermodynamic stability [56-58]. We postulated that the thermodynamic stability of the siRNA target site may influence the RNAi effectiveness. We calculated the structure with the lowest free energy for the 100-nucleotide region on the mRNA centered on each of the 19 nucleotides in the siRNA target site. The LFP was calculated as the mean of the 19 free energy values obtained. In cases where the target site was close to either end of the mRNA, so that the 100-nucleotide regions could not be obtained for certain nucleotides in the 19-mer target site, a shorter mRNA segment was used that was truncated at the end of the mRNA. The median value calculated for the entire dataset (-22.72 KCal/Mol) was used as the threshold in the feature definition.

The accessibility of the siRNA target region was recently raised as an important factor influencing siRNA efficacy [59].
We conducted the iterative computational analysis (ICA) with a window size of 800 nucleotides and a step size of 100 nucleotides [59,60]. To generate the largest number of windows that overlap the siRNA target region, the central base of the siRNA target region was used as the central point of the first window; subsequent windows were extended in both directions to cover the entire mRNA sequence. For each window, the five most stable structures predicted by Mfold were used. It turned out, however, that the ICA routine produced a filter that is too stringent for practical use. Of the 2,600 siRNA target regions in our dataset, only 6 were determined to be accessible by this routine. We then took an alternative approach and conducted the accessibility score analysis [54], which produced a similar but less stringent filter. In calculating the accessibility score, a region receives a non-zero score as long as the most stable structure in each window covering the siRNA target region contains a single-strand segment of length ≥ 10 nucleotides. Of the 2,600 siRNA target regions in our dataset, 456 received non-zero accessibility scores. Two features were defined based on this accessibility score filter.

The anti-sense siRNA binding energy was proposed as a measurement of mRNA accessibility [47]. We used the Sirna module of the Sfold server to calculate the anti-sense siRNA binding energy [47]. For each siRNA target sequence, a 200-nucleotide mRNA segment centered on the 19-nucleotide target site was extracted. In cases where the target site was close to either end of the mRNA sequence, so that a 200-nucleotide region centered on the target site could not be obtained, a shorter mRNA segment (truncated at the near end of the mRNA) was used.
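The truncated windows used for the LFP and for the anti-sense binding energy segments can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the code used in the study: the function names, the 0-based indexing, the exact centering convention, and the caller-supplied `fold_energy` function (standing in for an Mfold-style minimum free energy calculation) are all hypothetical.

```python
def centered_window(seq_len, center, width):
    """Return (start, end) of a window of `width` nt centered on index `center`.

    Coordinates are 0-based and end-exclusive. The window is truncated where
    it would run past either end of the sequence (assumed convention).
    """
    left = center - width // 2
    return max(0, left), min(seq_len, left + width)


def target_segment(mrna, site_start, site_len=19, width=200):
    """mRNA segment of up to `width` nt centered on the 19-nt target site."""
    start, end = centered_window(len(mrna), site_start + site_len // 2, width)
    return mrna[start:end]


def local_folding_potential(mrna, site_start, fold_energy, site_len=19, width=100):
    """Mean of the free energies of the windows centered on each site nucleotide.

    `fold_energy` is a hypothetical callable that maps a sequence to the free
    energy of its most stable predicted structure.
    """
    energies = []
    for pos in range(site_start, site_start + site_len):
        start, end = centered_window(len(mrna), pos, width)
        energies.append(fold_energy(mrna[start:end]))
    return sum(energies) / len(energies)
```

With a real folding routine plugged in as `fold_energy`, the value returned by `local_folding_potential` would be compared against the dataset median to assign the LFP feature.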
These segments were sent to the Sirna server for calculation [61]. The results were parsed and the anti-sense siRNA binding energies were extracted.

Category 5: Features defined based on experimental conditions
The experimental conditions considered in our analysis include cell line types, test methods, transfection methods and test objects. Twelve features (6 pairs) were defined based on whether or not the RNAi experiment was conducted in any of the 6 most frequently used cell lines: (a) HeLa, (b) HEK293, (c) MCF7, (d) CV-1 and derivatives, (e) 3T3 and (f) T24. Twelve features (6 pairs) were defined based on whether or not the test method is one of the six most frequently used test methods: (a) Western blot, (b) PCR (including RT-PCR, real-time PCR etc.), (c) bDNA, (d) Northern blot, (e) Luciferase assay, and (f) Flow cytometry. Two features were defined based on whether the transfection method is synthetic siRNAs or transcription of hairpin precursors. Four features (2 pairs) were defined based on whether or not the tested object is (a) mRNA or (b) protein (see Supplementary Table 5 in Additional file 1).

Statistical tests of features influencing siRNA efficacy
Because the efficacy of siRNA experiments is rated on a four-level scheme, proper categorical analysis techniques were needed to analyze these data. For any given feature, we calculated the efficacy distribution (among the four levels: very high, high, medium and low) of all siRNA experiments carrying this feature, and compared it with the efficacy distribution of all siRNA experiments carrying the complementary feature. The chi-square (χ2) test of independence is a commonly used test that finds evidence of a difference between two discrete distributions. However, this test treats the dependent variable (efficacy rating) as a nominal variable rather than an ordinal variable, and thus cannot tell us whether the presence of a feature results in higher or lower efficacy.
A more appropriate test will find evidence of a monotone trend, that is, whether the presence of a feature is associated with a significant up-shift or down-shift of the efficacy distribution among the four levels. Consider the joint probability distribution {π_ij} between the presence/absence of a particular feature (which defines i: i = 1 if the feature is present, and i = 0 if the feature is absent), and the four-level efficacy ratings (which define j: j = 3 if efficacy rating = "very high", j = 2 if efficacy rating = "high", j = 1 if efficacy rating = "medium", and j = 0 if efficacy rating = "low"). We calculate the probabilities of concordance and discordance:

$$\Pi_c = 2\sum_i\sum_j \pi_{ij}\Big(\sum_{h>i}\sum_{k>j}\pi_{hk}\Big), \qquad \Pi_d = 2\sum_i\sum_j \pi_{ij}\Big(\sum_{h>i}\sum_{k<j}\pi_{hk}\Big).$$

Then we calculate γ from the difference between these two probabilities:

$$\gamma = \frac{\Pi_c - \Pi_d}{\Pi_c + \Pi_d}.$$

The sample γ has approximately a normal distribution, with standard error calculated using the Delta method:

$$\sigma^2 = \frac{16}{(\Pi_c + \Pi_d)^4}\sum_i\sum_j \pi_{ij}\left[\Pi_d\,\pi_{ij}^{(c)} - \Pi_c\,\pi_{ij}^{(d)}\right]^2,$$

where

$$\pi_{ij}^{(c)} = \sum_{h<i}\sum_{k<j}\pi_{hk} + \sum_{h>i}\sum_{k>j}\pi_{hk}, \qquad \pi_{ij}^{(d)} = \sum_{h<i}\sum_{k>j}\pi_{hk} + \sum_{h>i}\sum_{k<j}\pi_{hk}.$$

Let

$$z^2 = \frac{\gamma^2}{\sigma^2},$$

then z^2 is a Wald statistic that has a chi-squared null distribution with 1 degree of freedom, based on which a Wald test can be conducted to find a significant monotone trend [35].

The monotone trend test finds evidence about whether the presence of a particular feature is associated with a significant up-shift of the four-level efficacy distribution. If evidence of such an association is found, however, this test alone cannot tell us where the up-shift takes place. In RNAi experiments, we are most concerned with the chances of achieving higher efficacies. Thus, we also conducted permutation tests of odds ratios for achieving 90% and 70% efficacies. In the siRecords data, the chance of achieving 90% (or 70%) efficacies can be approximated by the proportion of records bearing "very high" (or "high"/"very high") efficacy ratings.
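As an illustration of the monotone trend test described in this section, the following sketch computes γ, the Wald statistic and its P value from a 2 × 4 table of record counts. It is a minimal example under stated assumptions, not the authors' implementation: the function name and table layout are hypothetical, and the sketch follows the standard large-sample result in which the delta-method variance is divided by the number of records n before forming z^2, which is an assumption about how the standard error is scaled.

```python
from math import erfc, sqrt


def gamma_trend_test(counts):
    """Monotone trend test on a feature-by-efficacy contingency table.

    counts[i][j] = number of records with feature status i (0 = absent,
    1 = present) and efficacy level j (0 = low ... 3 = very high).
    Returns (gamma, z2, p_value).
    """
    n = sum(map(sum, counts))
    rows, cols = len(counts), len(counts[0])
    pi = [[c / n for c in row] for row in counts]

    def quadrant(i, j, dh, dk):
        # Sum of pi[h][k] over cells with h on the dh side of i and k on the dk side of j.
        return sum(pi[h][k] for h in range(rows) for k in range(cols)
                   if (h - i) * dh > 0 and (k - j) * dk > 0)

    cells = [(i, j) for i in range(rows) for j in range(cols)]
    conc = {(i, j): quadrant(i, j, -1, -1) + quadrant(i, j, 1, 1) for i, j in cells}
    disc = {(i, j): quadrant(i, j, -1, 1) + quadrant(i, j, 1, -1) for i, j in cells}

    # Summing pi_ij * pi_ij^(c) over all cells equals the doubled one-sided sum for Pi_c.
    pi_c = sum(pi[i][j] * conc[i, j] for i, j in cells)
    pi_d = sum(pi[i][j] * disc[i, j] for i, j in cells)
    gamma = (pi_c - pi_d) / (pi_c + pi_d)

    sigma2 = 16.0 / (pi_c + pi_d) ** 4 * sum(
        pi[i][j] * (pi_d * conc[i, j] - pi_c * disc[i, j]) ** 2 for i, j in cells)
    variance = sigma2 / n  # assumed scaling of the asymptotic variance
    if variance == 0.0:
        raise ValueError("degenerate table: estimated variance of gamma is zero")

    z2 = gamma ** 2 / variance
    p_value = erfc(sqrt(z2 / 2.0))  # upper tail of chi-squared with 1 d.f.
    return gamma, z2, p_value


# Hypothetical counts: the feature-present row is shifted toward higher efficacy.
gamma, z2, p = gamma_trend_test([[40, 30, 20, 10], [10, 20, 30, 40]])
```

A positive γ with a small P value indicates an up-shift of the efficacy distribution when the feature is present; a negative γ indicates a down-shift.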
For a given feature, the odds ratio for 90% efficacies, θ_90, is defined as

$$\theta_{90} = \frac{\pi_{1,90}/(1-\pi_{1,90})}{\pi_{0,90}/(1-\pi_{0,90})},$$

where π_{1,90} is the proportion of records bearing the "very high" efficacy rating (i.e., with 90% efficacies) in the subset of the experiments carrying the feature, and π_{0,90} is the proportion of records bearing "very high" efficacy ratings in the subset of the experiments carrying the complementary feature of the feature concerned. To generate a null distribution of the odds ratio, Set A was randomly split into two subsets, one of which was arbitrarily marked with "feature present", the other marked with "complementary feature present", and an odds ratio was calculated accordingly. This process was repeated 100000 times, and the 100000 resampled odds ratios constituted the null distribution. Given any feature to be tested, the P value was calculated as

$$P_{90} = \#\{\, i \mid \theta_{90}^{i} > \theta_{90} \,\}/100000,$$

where θ_90^i is the ith resampled odds ratio, and θ_90 is the true odds ratio of the feature. The odds ratio permutation test for 70% efficacies was conducted similarly, with the proportion of records bearing "very high" or "high" efficacy ratings substituted for that of records bearing "very high" ratings in the above description.

Meaningful statistical tests require the use of sufficiently large datasets. All features were subject to a dataset size filter using an arbitrarily set threshold of 30 records: if a given feature was carried by fewer than 30 records in Set A, then this feature and its complementary feature were excluded from the statistical tests and the following analyses. Four features ("GC stretches of length ≥ 9", "G/C content is not between 30 and 79%", "Cell line = T24" and "Test method = Flow cytometry"), as well as their complementary features, were excluded for this reason.

Control of false discovery rate (FDR)
The simultaneous testing of a large number of hypotheses requires curbing the type I error rate with consideration of the multiple testing problem.
We chose to control the FDR by taking the q-value approach [36], because of its ability to adapt to the true distribution of the input p-values. We used the bootstrap method, rather than the default smoother method (which is equivalent to Benjamini and Hochberg's FDR controlling method [62]), in estimating the FDR, because U-shaped distributions were observed for the input p-values of both the Wald test and the odds ratio permutation tests, likely introduced by the fact that one-sided tests were conducted when two-sided signals were present [63].

Rules, rule sets and the disjunctive rule merging (DRM) algorithm
We define a rule as a conjunction of (l) features. An l-feature rule is also called an l-feature combination. A rule set is defined as a disjunction of (m) rules. Generally speaking, the larger m is, the higher the sensitivity the rule set achieves and, at the same time, the lower the specificity the rule set has to offer.

The disjunctive rule merging (DRM) algorithm was developed to remove the redundancy in the rule sets resulting from the combined effect analysis of multiple features, while at the same time exerting control over the stringency of the rule sets.
The listing of the DRM algorithm is as follows.

Input: Θ: a set of disjunctive rules that contains redundancy; each rule, r_i, is a conjunction of m_i features: r_i = {f_1, f_2, ..., f_{m_i}}, and is labeled with P_i = the proportion of records reaching 90% efficacy in the subset of Set A records satisfying r_i.
α: stringency factor, with a range of [0, 1].
Initialization: Create rule set RS = ∅.
Step 1: For every r_i ∈ Θ satisfying P_i ≥ α, add r_i into RS.
Step 2:
For j = 2, 3, ..., 5
    For any rule r_p ∈ RS where m_p = j
        For any rule r_q ∈ RS where m_q > j
            if r_p ⊂ r_q, then remove r_q from RS.
        End For
    End For
End For
Output: RS: non-redundant set of disjunctive rules with stringency α.

It is easy to see that, given any α, the rule set resulting from the DRM algorithm (thus called a DRM rule set) is fixed. The reverse, however, is not true. A DRM rule set does not correspond to a single α value, but rather to a range of different α values. For example, the DRM rule sets for any α between 0.901 and 1 are exactly the same (containing 7 rules). We denote this rule set as RS_0.951, where 0.951 is the mid-point of the range of α for which this rule set is produced.

Naturally, the higher the α level, the higher the specificity the DRM rule set possesses and the lower the sensitivity it has to offer. Therefore, the DRM algorithm with variable α values allows us to choose the combination of sensitivity and specificity that suits our needs.
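The DRM listing above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the siDRM implementation: rules are represented as frozensets of feature names, the mapping from rule to P_i is passed in as a dictionary, and the function names, feature names and proportions in the example are hypothetical.

```python
def drm(rules, alpha, max_features=5):
    """Disjunctive rule merging.

    rules: dict mapping each rule (a frozenset of feature names) to P_i, the
           proportion of matching records that reach 90% efficacy.
    alpha: stringency factor in [0, 1].
    Returns the non-redundant rule set as a set of frozensets.
    """
    # Step 1: keep every rule whose proportion reaches the stringency level.
    rule_set = {rule for rule, p in rules.items() if p >= alpha}

    # Step 2: a rule that strictly contains a kept shorter rule is redundant,
    # because any siRNA satisfying it already satisfies the shorter rule.
    for j in range(2, max_features + 1):
        for r_p in [r for r in rule_set if len(r) == j]:
            rule_set -= {r_q for r_q in rule_set if len(r_q) > j and r_p < r_q}
    return rule_set


def satisfies(features, rule_set):
    """A rule set is a disjunction of rules; a rule is a conjunction of features."""
    return any(rule <= features for rule in rule_set)


# Hypothetical rules and proportions, for illustration only.
example_rules = {
    frozenset({"a", "b"}): 0.95,
    frozenset({"a", "b", "c"}): 0.97,
    frozenset({"c", "d"}): 0.80,
    frozenset({"c", "d", "e"}): 0.92,
}
strict = drm(example_rules, alpha=0.9)
```

In this example the rule {a, b, c} is dropped at α = 0.9 because {a, b} already covers it, whereas {c, d, e} survives because its sub-rule {c, d} does not reach the stringency level. Lowering α admits {c, d}, which then makes {c, d, e} redundant.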
In a typical siRNA design setting, we are most concerned with achieving high specificity, and can often tolerate lower sensitivity, since there is a large pool of possible target sites to choose from: for an mRNA of length w, in theory there are (w-19+1) target sites to pick from. Therefore, we are most concerned with the behavior of the rule sets with high (close to 1) α values.

Performance comparison between DRM rule sets and existing online design tools
Design tasks were performed for the 744 genes in Set T using the following 15 online siRNA design tools with the default settings.

Ambion siRNA Target Finder (Ambion, Inc.) [64]. We used the mRNA sequence as the input. By default, no restriction of the ending dinucleotides was specified, and no restriction of the G/C content was specified. Occurrences of 4 or more identical nucleotides in a row were allowed.

Jack Lin's siRNA Sequence Finder (Cold Spring Harbor Laboratory) [65]. We used the full-length mRNA sequence as the input. The spacer length was set to 19.

siDESIGN Center (Dharmacon, Inc.) [66]. We used the mRNA sequence as the input. No restriction of the leading sequences was specified. The target region was limited to the ORF (open reading frame), the G/C content range was set to 30–52%, and the patterns "GGG" and "CCC" were excluded. The BLAST filtering option was turned on by default.

siRNA Target Finder (GenScript Corp.) [67]. We provided the GenBank accession of the mRNA as the input. The length of the siRNA was set to 19. By default, the G/C content range was set to between 30% and 60%, and the sequence selection region was restricted to the ORF.

Imgenex sirna Designer (Imgenex Corp.) [68]. The target mRNA was specified using the GenBank accession. The siRNA length was set to 19. The parameter "nucleotide target" was set to 50 by default. The parameter "first nucleotide target for siRNA" was set to "AA". The G/C content range was set to between 45% and 51%.
Occurrences of 4 identical A's or T's in a row, or 3 identical (C/G)s in a row were not allowed. By default, the BLAST search was not performed.

EMBOSS siRNA (Institut Pasteur) [69]. We used the full-length mRNA sequence as the input. By default, no restriction of the leading or ending dinucleotides was specified. Occurrences of 4 identical nucleotides in a row were allowed.

IDT RNAi Design (SciTools) (Integrated DNA Technologies, Inc.) [70]. The mRNA sequence was provided as the input, and the "21mer" option was selected. The "Unified RNAi Rule Set" was used in the design. The G/C content range was set to between 30% and 70%. The asymmetrical end stability base pair length was set to 5. The 5′ antisense asymmetrical end stability weight was set to 0.5, and the 3′ overhang was set to "TT" by default. The default setting was also used for all motif preferences.

BLOCK-iT RNAi Designer (Invitrogen Corp.) [71]. We provided the mRNA sequence as the input. By default, the search in the target region was limited to the ORF. The minimum and maximum allowed G/C contents were set to 35% and 55%, respectively. The BLAST search option was turned on by default.

siSearch (Karolinska Institutet) [72]. We provided the mRNA sequence as the input. By default, the G/C content range was set to between 30% and 60%. The candidate sites with scores of 6 or above were obtained. The minimum energy difference between the two ends of the siRNA was set to 0. Occurrences of 4 (A/U)s in a row were not allowed, and the siRNAs containing immunostimulatory motifs were removed. The repeat masking was turned on by default.

SiMAX (MWG-Biotech, Inc.) [73]. We used the GenBank accession to specify the target. By default, occurrences of 3 identical nucleotides in a row in the siRNA sequences, or U's at the 3′ end were not allowed. The G/C content range was set to between 30% and 53%.
The search range was restricted to the region between the 100th nucleotide downstream of the start codon and the 100th nucleotide upstream of the end codon. By default, neither BLAST filtering nor secondary structure analysis was performed.

BIOPREDsi (Novartis Institutes for BioMedical Research) [74]. We used the mRNA sequence as the input. The number of predicted siRNAs was set to 10.

Promega siRNA Target Designer (Promega Corp.) [75]. We used the mRNA sequence as the input. The RNAi system was set to the "T7 RiboMAX Express RNAi system". By default, the target length was set to 19, and the search region was set to the whole input sequence.

QIAGEN siRNA Design Tool (QIAGEN, Inc.) [76]. We specified the mRNA sequence as the input. The option "Start siRNA sequence with AA" was turned on by default. The BLAST search was not performed.

SDS/MPI (University of Hong Kong) [77]. We used the full-length mRNA sequence as the input. The option "MPI Principles" was selected. The filtering of ineffective siRNAs based on secondary structures was not performed. By default, the G/C content range was set to between 30% and 70%, and the search region was restricted to ≥ 100 nucleotides downstream of the CDS.

Whitehead WI siRNA Selection Program (Whitehead Institute for Biomedical Research) [78]. We used the mRNA sequence as the input. By default, the sequence pattern "AAN19TT" was searched for. The G/C content range was set to between 30% and 70%. Occurrences of 4 or more identical T's, A's or G's in a row were not allowed. Occurrences of 7 or more consecutive (G/C)s were also not allowed. By default, the checking with BLAST was not performed.

The performance of an siRNA design rule set, or an online siRNA design tool, can be assessed by several parameters. Two of the most often used ones are specificity and sensitivity, as illustrated in Table 5.
Specificity is defined as N_D/(N_B + N_D), and sensitivity is defined as N_A/(N_A + N_C). An ROC (Receiver Operating Characteristic) curve can be used to visually depict the overall performance of a rule set. The ROC curve is the plot of sensitivity vs. (1 - specificity). Another parameter is the positive predictive value (PPV), defined as N_A/(N_A + N_B). The PPV is a very important parameter in siRNA design practice, because it describes what proportion of the siRNAs predicted to be effective turn out to be truly effective. The value (1 - PPV) is sometimes called the "false positive rate".

Authors' contributions
WG carried out most of the analyses, drafted some portions of the manuscript and the supplementary text, and helped YR with the construction of the siDRM server. YR worked together with WG to design and implement the siDRM server, and participated in the data compilation and pre-processing work. QX, YW, DL and HZ participated in the data compilation and pre-processing work. TL designed the project, carried out some analyses, drafted some portions of the manuscript and supplementary text, and improved and finalized the writing. All authors read and approved the final manuscript.

Additional material

Acknowledgements
X. Zheng, S. Li and W. Liu provided technical assistance. We thank B. Wu for very insightful discussions, and the Supercomputing Institute, University of Minnesota for computational resources. This work was supported by the Department of Neuroscience and the Graduate School, University of Minnesota, as well as the Minnesota Medical Foundation.

References
1. Elbashir SM, Lendeckel W, Tuschl T: RNA interference is mediated by 21- and 22-nucleotide RNAs. Genes Dev 2001, 15(2):188-200.
2. Zamore PD, Tuschl T, Sharp PA, Bartel DP: RNAi: double-stranded RNA directs the ATP-dependent cleavage of mRNA at 21 to 23 nucleotide intervals. Cell 2000, 101(1):25-33.
3. Bernstein E, Caudy AA, Hammond SM, Hannon GJ: Role for a bidentate ribonuclease in the initiation step of RNA interference.
Nature 2001, 409(6818):363-366.
4. Hammond SM, Bernstein E, Beach D, Hannon GJ: An RNA-directed nuclease mediates post-transcriptional gene silencing in Drosophila cells. Nature 2000, 404(6775):293-296.
5. Elbashir SM, Harborth J, Lendeckel W, Yalcin A, Weber K, Tuschl T: Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature 2001, 411(6836):494-498.
6. Caplen NJ, Parrish S, Imani F, Fire A, Morgan RA: Specific inhibition of gene expression by small double-stranded RNAs in invertebrate and vertebrate systems. Proc Natl Acad Sci U S A 2001, 98(17):9742-9747.
7. Brummelkamp TR, Bernards R, Agami R: A system for stable expression of short interfering RNAs in mammalian cells. Science 2002, 296(5567):550-553.
8. Rubinson DA, Dillon CP, Kwiatkowski AV, Sievers C, Yang L, Kopinja J, Rooney DL, Ihrig MM, McManus MT, Gertler FB, Scott ML, Van Parijs L: A lentivirus-based system to functionally silence genes in primary mammalian cells, stem cells and transgenic mice by RNA interference. Nat Genet 2003, 33(3):401-406.
9. McManus MT, Sharp PA: Gene silencing in mammals by small interfering RNAs. Nat Rev Genet 2002, 3(10):737-747.
10. Holen T, Amarzguioui M, Wiiger MT, Babaie E, Prydz H: Positional effects of short interfering RNAs targeting the human coagulation trigger Tissue Factor. Nucleic Acids Res 2002, 30(8):1757-1766.

Additional File 1
Supplementary results and discussions. Discussions of cooperativity between features in their joint effects, performance of DRM rule sets in subsets divided by confounding factors, utility of online siRNA design tools, rationale of the DRM procedure, and survey results of features significantly associated with high siRNA efficacy.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-516-S1.pdf]

11.
Reynolds A, Leake D, Boese Q, Scaringe S, Marshall WS, Khvorova A: Rational siRNA design for RNA interference. Nat Biotechnol 2004, 22(3):326-330.
12. Elbashir SM, Harborth J, Weber K, Tuschl T: Analysis of gene function in somatic mammalian cells using small interfering RNAs. Methods 2002, 26:199-213.
13. Ui-Tei K, Naito Y, Takahashi F, Haraguchi T, Ohki-Hamazaki H, Juni A, Ueda R, Saigo K: Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference. Nucleic Acids Res 2004, 32(3):936-948.
14. Hsieh AC, Bo R, Manola J, Vazquez F, Bare O, Khvorova A, Scaringe S, Sellers WR: A library of siRNA duplexes targeting the phosphoinositide 3-kinase pathway: determinants of gene silencing for use in cell-based screens. Nucleic Acids Res 2004, 32(3):893-901.
15. Poliseno L, Evangelista M, Mercatanti A, Mariani L, Citti L, Rainaldi G: The energy profiling of short interfering RNAs is highly predictive of their activity. Oligonucleotides 2004, 14(3):227-232.
16. Cui W, Ning J, Naik UP, Duncan MK: OptiRNAi, an RNAi design tool. Comput Methods Programs Biomed 2004, 75(1):67-73.
17. Yiu SM, Wong PW, Lam TW, Mui YC, Kung HF, Lin M, Cheung YT: Filtering of ineffective siRNAs and improved siRNA design tool. Bioinformatics 2005, 21(2):144-151.
18. Wang L, Mu FY: A Web-based design center for vector-based siRNA and siRNA cassette. Bioinformatics 2004, 20(11):1818-1820.
19. Yuan B, Latek R, Hossbach M, Tuschl T, Lewitter F: siRNA Selection Server: an automated siRNA oligonucleotide prediction server. Nucleic Acids Res 2004, 32(Web Server issue):W130-4.
20. Saetrom P, Snove O: A comparison of siRNA efficacy predictors. Biochem Biophys Res Commun 2004, 321(1):247-253.
21. Schwarz DS, Hutvagner G, Du T, Xu Z, Aronin N, Zamore PD: Asymmetry in the assembly of the RNAi enzyme complex. Cell 2003, 115(2):199-208.
22. Khvorova A, Reynolds A, Jayasena SD: Functional siRNAs and miRNAs exhibit strand bias. Cell 2003, 115(2):209-216.
23.
Gong D, Ferrell JE Jr.: Picking a winner: new mechanistic insights into the design of effective siRNAs. Trends Biotechnol 2004, 22(9):451-454.
24. Chalk AM, Wahlestedt C, Sonnhammer EL: Improved and automated prediction of effective siRNA. Biochem Biophys Res Commun 2004, 319(1):264-274.
25. Luo KQ, Chang DC: The gene-silencing efficiency of siRNA is strongly dependent on the local structure of mRNA at the targeted region. Biochem Biophys Res Commun 2004, 318(1):303-310.
26. Chiu YL, Rana TM: siRNA function in RNAi: a chemical modification analysis. Rna 2003, 9(9):1034-1048.
27. Swarup G: How to design a highly effective siRNA. J Biosci 2004, 29(2):129-131.
28. Mittal V: Improving the efficiency of RNA interference in mammals. Nat Rev Genet 2004, 5(5):355-365.
29. Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 2003, 31(13):3406-3415.
30. Kumar R, Conklin DS, Mittal V: High-throughput selection of effective RNAi probes for gene silencing. Genome Res 2003, 13(10):2333-2340.
31. Amarzguioui M, Prydz H: An algorithm for selection of functional siRNA sequences. Biochem Biophys Res Commun 2004, 316(4):1050-1058.
32. Henschel A, Buchholz F, Habermann B: DEQOR: a web-based tool for the design and quality control of siRNAs. Nucleic Acids Res 2004, 32(Web Server issue):W113-20.
33. Ren Y, Gong W, Xu Q, Zheng X, Lin D, Wang Y, Li T: siRecords: an extensive database of mammalian siRNAs with efficacy ratings. Bioinformatics 2006, 22(8):1027-1028.
34. siDRM [http://siRecords.umn.edu/siDRM/].
35. Agresti A: Categorical Data Analysis. Hoboken, New Jersey, John Wiley & Sons; 2002.
36. Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci USA 2003, 100:9440-9445.
37. Elmaagacli AH, Koldehoff M, Peceny R, Klein-Hitpass L, Ottinger H, Beelen DW, Opalka B: WT1 and BCR-ABL specific small interfering RNA have additive effects in the induction of apoptosis in leukemic cells. Haematologica 2005, 90(3):326-334.
38.
Nicholson LJ, Philippe M, Paine AJ, Mann DA, Dolphin CT: RNA interference mediated in human primary cells via recombinant baculoviral vectors. Mol Ther 2005, 11(4):638-644.
39. Guan R, Tapang P, Leverson JD, Albert D, Giranda VL, Luo Y: Small interfering RNA-mediated Polo-like kinase 1 depletion preferentially reduces the survival of p53-defective, oncogenic transformed cells and inhibits tumor growth in animals. Cancer Res 2005, 65(7):2698-2704.
40. Atkinson PJ, Young KW, Ennion SJ, Kew JN, Nahorski SR, Challiss RA: Altered Expression of Gq/11{alpha} Protein Shapes mGlu1 and mGlu5 Receptor-mediated Single Cell Inositol 1,4,5-trisphosphate and Ca2+ Signaling. Mol Pharmacol 2005.
41. Pei Y, Tuschl T: On the art of identifying effective and specific siRNAs. Nat Methods 2006, 3(9):670-676.
42. Huesken D, Lange J, Mickanin C, Weiler J, Asselbergs F, Warner J, Meloon B, Engel S, Rosenberg A, Cohen D, Labow M, Reinhardt M, Natt F, Hall J: Design of a genome-wide siRNA library using an artificial neural network. Nat Biotechnol 2005, 23(8):995-1001.
43. ATCC (American Type Culture Collection) [http://www.atcc.org/].
44. Protocol Online [http://www.protocol-online.org/].
45. Jagla B, Aulner N, Kelly PD, Song D, Volchuk A, Zatorski A, Shum D, Mayer T, De Angelis DA, Ouerfelli O, Rutishauser U, Rothman JE: Sequence characteristics of functional siRNAs. Rna 2005, 11(6):864-872.
46. Naito Y, Yamada T, Ui-Tei K, Morishita S, Saigo K: siDirect: highly effective, target-specific siRNA design software for mammalian RNA interference. Nucleic Acids Res 2004, 32(Web Server issue):W124-9.
47. Ding Y, Chan CY, Lawrence CE: Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res 2004, 32(Web Server issue):W135-41.
48. DINAMelt server [http://www.bioinfo.rpi.edu/applications/hybrid/twostate-fold.php].
49. Markham NR, Zuker M: DINAMelt web server for nucleic acid melting prediction. Nucleic Acids Res 2005, 33(Web Server issue):W577-81.
50.
Xia T, SantaLucia J Jr., Burkard ME, Kierzek R, Schroeder SJ, Jiao X,Cox C, Turner DH: Thermodynamic parameters for anexpanded nearest-neighbor model for formation of RNAduplexes with Watson-Crick base pairs. Biochemistry 1998,37(42):14719-14735.51. siRNA end energy calculation [http://sirna.cgb.ki.se/symme-try/energy_calculation_zamore.pdf]. .52. Yiu SM, Wong PW, Lam TW, Mui YC, Kung HF, Lin M, Cheung YT:Filtering of ineffective siRNAs and improved siRNA designtool. Bioinformatics 2005, 21(2):144-151.53. Schubert S, Grunweller A, Erdmann VA, Kurreck J: Local RNA tar-get structure influences siRNA efficacy: systematic analysisof intentionally designed binding regions. J Mol Biol 2005,348(4):883-893.54. Scherr M, Rossi JJ, Sczakiel G, Patzel V: RNA accessibility predic-tion: a theoretical approach is consistent with experimentalstudies in cell extracts. Nucleic Acids Res 2000, 28(13):2455-2461.55. Mfold 3.2 [http://www.bioinfo.rpi.edu/~zukerm/export/mfold-3.2.tar.gz]. .56. Sczakiel G, Homann M, Rittner K: Computer-aided search foreffective antisense RNA target sequences of the humanimmunodeficiency virus type 1. Antisense Res Dev 1993,3(1):45-52.57. Le SY, Chen JH, Braun MJ, Gonda MA, Maizel JV: Stability of RNAstem-loop structure and distribution of non-random struc-ture in the human immunodeficiency virus (HIV-I). NucleicAcids Res 1988, 16(11):5153-5168.58. Le SY, Chen JH, Maizel JV: Thermodynamic stability and statis-tical significance of potential stem-loop structures situatedat the frameshift sites of retroviruses. Nucleic Acids Res 1989,17(15):6143-6152.59. Overhoff M, Alken M, Far RK, Lemaitre M, Lebleu B, Sczakiel G, Rob-bins I: Local RNA target structure influences siRNA efficacy:a systematic global analysis. J Mol Biol 2005, 348(4):871-881.60. Patzel V, Steidl U, Kronenwett R, Haas R, Sczakiel G: A theoreticalapproach to select effective antisense oligodeoxyribonucle-otides at high statistical probability. Nucleic Acids Res 1999,27(22):4328-4334. 
Publish with Bio Med Central and every scientist can read your work free of charge BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime. Sir Paul Nurse, Cancer Research UKYour research papers will be:available free of charge to the entire biomedical communitypeer reviewed and published immediately upon acceptancecited in PubMed and archived on PubMed Central yours — you keep the copyrightSubmit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.aspBioMedcentralBMC Bioinformatics 2006, 7:516 http://www.biomedcentral.com/1471-2105/7/516Page 21 of 21(page number not for citation purposes)61. Sirna server [http://www.bioinfo.rpi.edu/applications/sfold/sirna.pl]. .62. Benjamini Y, Hochberg Y: Controlling the false discovery rate: apractical and powerful approach to multiple testing. J R StatistSoc B 1995, 57:289-300.63. Dabney A, Storey JD: Qvalue: the manual (http://faculty.wash-ington.edu/jstorey/qvalue/manual.pdf). 2003.64. Ambion siRNA Target Finder [http://www.ambion.com/techlib/misc/siRNA_finder.html]. .65. Jack Lin s siRNA Sequence Finder [http://www.sinc.sunysb.edu/Stu/shilin/rnai.html]. .66. siDESIGN Center [http://www.dharmacon.com/sidesign]. .67. siRNA Target Finder [https://www.genscript.com/ssl-bin/app/rnai]. .68. Imgenex sirna Designer [http://imgenex.com/sirna_tool.php]. .69. EMBOSS siRNA [http://bioweb.pasteur.fr/seqanal/inter-faces/sirna.html]. .70. IDT RNAi Design [http://www.idtdna.com/Scitools/Applica-tions/RNAi/RNAi.aspx]. .71. BLOCK-iT RNAi Designer [https://rnaidesigner.invitro-gen.com/rnaiexpress]. .72. siSearch [http://sonnhammer.cgb.ki.se/siSearch/siSearch_1.7.html]. .73. SiMAX [http://www.mwg-biotech.com/html/s_synthetic_acids/s_sirna_design.shtml]. .74. BIOPREDsi [http://www.biopredsi.org/]. .75. Promega siRNA Target Designer [http://www.promega.com/siRNADesigner/program/]. .76. 
QIAGEN siRNA Design Tool [http://www1.qiagen.com/Products/GeneSilencing/CustomSiRna/SiRnaDe-signer.aspx]. .77. SDS/MPI [http://i.cs.hku.hk/~sirna/software/sirna.php]. .78. Whitehead WI siRNA Selection Program [http://jura.wi.mit.edu/bioc/siRNAext/]. .Supplementary resource (1)Additional File 1DataNovember 2006Wuming Gong · Yongliang Ren · Qiqi Xu · Yejun Wang · Justin LiCitations (53)References (74)... The transfection reagents oligofectamine and lipofectamine covers 18% and 35% of the database, respectively. G + C content is the crucial character for functional siRNAs 29 . The G + C content profile was used for visualisation and analysing the variation of GC content in genomic sequences 30 . ...... These four nitrogenous bases made different number of hydrogen bonds with each other. Due to three H-bonds between G and C, this base pair is stronger than that of A and T. This makes high G + C containing DNA thermally more stable than AT containing DNA 29 . It has been reported that sequences of intermediate G + C contents (around 50%) were more effective siRNAs, and our dataset contained 86% of siRNA in the intermediate range (30-65%) G + C content 31 . ...BOSSDataFull-text availableSep 2017 Atul Tyagi Manoj Semwal Ashok SharmaView... The transfection reagents oligofectamine and lipofectamine covers 18% and 35% of the database, respectively. G + C content is the crucial character for functional siRNAs 29 . The G + C content profile was used for visualisation and analysing the variation of GC content in genomic sequences 30 . ...... These four nitrogenous bases made different number of hydrogen bonds with each other. Due to three H-bonds between G and C, this base pair is stronger than that of A and T. This makes high G + C containing DNA thermally more stable than AT containing DNA 29 . 