1.3.1 Identifying research evidence for systematic reviews

This section describes how to undertake a systematic search using a range of methods to identify studies, manage the references retrieved by the searches, obtain documents and write up the search process. Practical examples of constructing search strategies are given in Appendix 2, and Appendix 3 provides examples of how the search should be documented. Issues around the identification of research evidence that are specific to the type of review, such as reviews of adverse effects or clinical tests, are discussed in the relevant chapters.

Conducting a thorough search to identify relevant studies is a key factor in minimizing bias in the review process. The search process should be as transparent as possible and documented in a way that enables it to be evaluated and reproduced.

Studies can be located using a combination of the approaches described below.

Minimizing publication and language biases

Decisions about where and how to search could unintentionally introduce bias into the review, so the team needs to consider, and try to minimize, the possible impact of search limitations. For example, restricting the search to electronic databases, which consist mainly of references to published journal articles, could result in the review being subject to publication bias, as this approach is unlikely to identify studies that have not been published in peer-reviewed journals. Wider searching is needed to identify research results circulated as reports or discussion papers. The identification of grey literature, such as unpublished papers, is difficult, but some is included in databases such as NTIS (National Technical Information Service) and HMIC (Health Management Information Consortium). Libraries of specialist research organisations and professional societies may also provide access to collections of grey literature.

Searching databases and registers that include unpublished studies, such as records of ongoing research, conference proceedings and theses, can reduce the impact of publication bias. Conference proceedings provide information on both research in progress and completed research. Conference abstracts are recorded in some major bibliographic databases such as BIOSIS Previews, as well as in dedicated databases such as Index to Scientific and Technical Proceedings, ZETOC, and the Conference Papers Index.31, 32, 33, 34 It is also worth consulting catalogues from major libraries, for example the British Library and the US National Library of Medicine. The abstracts in conference proceedings may give only limited information, and there can be differences between the data presented in an abstract and those included in a final report.35, 36 For these reasons, researchers should try to acquire the full report, if there is one, before considering whether to include the results in a systematic review.

As already discussed, limiting searches to English language papers can introduce language bias. Large bibliographic databases, such as MEDLINE and EMBASE, do include a small number of non-English language journals.37 Using additional databases such as LILACS (Latin American and Caribbean Health Sciences Literature) that contain collections of non-English language research can minimize potential language bias.

Searching electronic databases

The selection of electronic databases to search will depend upon the review topic. Lists of databases are available from libraries and from database providers, such as Dialog and Wolters Kluwer, while subject experts will be familiar with the bibliographic databases in their field.

For reviews of health care interventions, MEDLINE and EMBASE are the databases most commonly used to identify studies. The Cochrane Central Register of Controlled Trials (CENTRAL) includes details of published articles taken from bibliographic databases and other published and unpublished sources.38 There are other databases with a narrower focus that could be equally appropriate. These include PsycINFO (psychology and psychiatry), AMED (complementary medicine), MANTIS (osteopathy and chiropractic) and CINAHL (nursing and allied health professions). If the topic includes social care, a range of databases is available, including ASSIA (Applied Social Sciences Index and Abstracts), CSA Sociological Abstracts and CSA Social Services Abstracts. The databases referred to above are all subject-based, but others, such as AgeInfo, Ageline and ChildData, focus on a specific population group that could be relevant to the review topic.

Due to the diversity of questions addressed by systematic reviews, there can be no agreed standard for what constitutes an acceptable search in terms of the number of databases searched. For example, if the review is on a cross-cutting public health topic such as housing and health, it is advisable to search a wider range of databases than if the review is of a pharmaceutical intervention for a known health condition (see Chapter 3, Section 3.3 Identifying research evidence).

Searching other sources

In addition to searching electronic databases, published and unpublished research may also be obtained by using one or more of the following methods.

Scanning reference lists of relevant studies

Browsing the reference lists of papers (both primary studies and reviews) that have been identified by the database searches may identify further studies of interest.

Handsearching key journals

Handsearching involves scanning the content of journals, conference proceedings and abstracts, page by page. It is an important way of identifying very recent publications that have not yet been included and indexed by electronic databases or of including articles from journals that are not indexed by electronic databases.39 Handsearching can also ensure complete coverage of journal issues, including letters or commentaries, which may not be indexed by databases. It can also compensate for poor or inaccurate database indexing that can result in even the most carefully constructed strategy failing to identify relevant studies. Selecting which journals to handsearch can be done by analysing the results of the database searches to identify the journals that contain the largest number of relevant studies.

Searching trials registers

Trials can be identified by searching one or more of the many trials registers that exist. It can be a particularly useful approach to identifying unpublished or ongoing trials. Many of the registers are available on the Internet and some of the larger ones, such as www.ClinicalTrials.gov and www.who.int/trialsearch/, include the facility to search by drug name or by condition. While some registers are disease specific, others collect together trials from a specific country or region. Pharmaceutical companies may also make information about trials they have conducted available from their websites.

Contacting experts and manufacturers

Research groups and other experts as well as manufacturers may be useful sources of research not identified by the electronic searches, and may also be able to supply information about unpublished or ongoing research. Contacting relevant research centres or specialist libraries is another way of identifying potential studies. While these methods can all be useful, they are also time consuming and offer no guarantee of obtaining relevant information.

After a thorough and systematic search has been conducted, and relevant studies have been identified, topic experts can be asked to check the list to identify any known missing studies.

Searching relevant Internet resources

Internet searching can be a useful means of retrieving grey literature, such as unpublished papers, reports and conference abstracts. Identifying and scanning specific relevant websites will usually be more practical than using a general search engine such as ‘Google’.

Reviews of transport and ‘welfare to work’ programmes have reported how Internet searching of potentially relevant websites was effective in identifying additional studies to those retrieved from databases.40, 41 It is worth considering using the Internet when investigating a topic area where it is likely that studies have been published informally rather than in a journal indexed in a bibliographic database.

Internet searching should be carried out in as structured a way as possible and the procedure documented (see Appendix 3).

Citation searching

Citation searching involves selecting a number of key papers already identified for inclusion in the review and then searching for articles that have cited these papers. This approach should identify a cluster of related, and therefore highly relevant, papers. As this is in effect a search forward through time, citation searching is not suitable for identifying recent papers as they cannot have been referenced by other older papers.

Citation searching used to be limited to the indexes Science Citation Index Expanded, Social Sciences Citation Index, and Arts & Humanities Citation Index, but other resources (including CINAHL, PsycINFO and Google Scholar) now include cited references in their records, so these are also available for citation searching. Using similar services offered by journals such as the BMJ can also be helpful.

Using a project Internet site to canvass for studies

Where it has been agreed that a dedicated website should be set up for the review, for example as part of the overall dissemination strategy, this can be used to canvass for unpublished data/grey literature. Inclusion of an email contact address allows interested parties to submit information about relevant research. Posting the inclusion and exclusion criteria on the website may help to ensure submissions are appropriate. Throughout the review process the website should be continually updated with information about the studies identified. Personal responses should be sent to all respondents and, where appropriate, submitted material should be included in the library of references. Further details about dedicated project websites can be found in Section 1.3.8 Disseminating the findings of systematic reviews.

This approach should probably only be considered for ‘high profile’ reviews and then it should be as an adjunct to active canvassing for unpublished/grey literature.

Constructing the search strategy for electronic databases

Search strategies are explicitly designed to be highly sensitive so as many potentially relevant studies as possible are retrieved. Consequently the searches tend to retrieve a large number of records that do not meet the inclusion criteria. While it is possible to increase the precision of a search strategy, and so reduce the number of irrelevant papers retrieved, this may lead to relevant studies being missed.42
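The trade-off between sensitivity and precision can be illustrated with a small calculation. The sketch below assumes a hypothetical test set in which the relevant records are already known, which is never the case in a real review; the function name and the record numbers are invented for illustration only.

```python
def search_performance(retrieved, relevant):
    """Return (sensitivity, precision) for a set of retrieved record IDs.

    sensitivity = relevant records retrieved / all relevant records
    precision   = relevant records retrieved / all records retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    sensitivity = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return sensitivity, precision


# Hypothetical gold-standard set of relevant records (assumed known here).
relevant = [5, 17, 42, 150]

# A broad strategy retrieves 200 records, including every relevant one.
broad = search_performance(range(1, 201), relevant)

# A narrower strategy retrieves only 4 records but misses record 150.
narrow = search_performance([5, 17, 42, 300], relevant)

print(broad)   # (1.0, 0.02)
print(narrow)  # (0.75, 0.75)
```

In this made-up example the broad strategy finds every relevant record at the cost of screening 200 records, whereas the narrow strategy is far more precise but misses one relevant study.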

Constructing an effective combination of search terms involves breaking down the review question into ‘concepts’. Using the Population, Intervention, Comparator, and Outcomes elements from PICOS can help to structure the search, but it is not essential that every element is used. For example it may be better not to use terms for the outcomes since inclusion might mean that the database being searched fails to show relevant studies simply because the outcome is not mentioned prominently enough in the record, even though the study measured it. For each of the elements used, it is important to consider all the possible alternative terms. For example a drug intervention may be known by a generic name and one or more proprietary names. Advice should be sought from the topic experts on the review team and advisory group.
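As a rough illustration of this approach, the sketch below combines alternative terms within a concept using OR and then combines the concepts using AND. The function names, the example terms and the generic Boolean syntax are assumptions made for illustration; real database interfaces each have their own syntax, field tags and truncation symbols, and the worked strategies in Appendix 2 should be preferred.

```python
def or_block(terms):
    """Join the alternative terms for one concept with OR, quoting phrases."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(concepts):
    """Combine the concepts with AND, skipping any concept left empty."""
    return " AND ".join(or_block(terms) for terms in concepts.values() if terms)


# Hypothetical review question: a drug intervention in young people.
concepts = {
    "population": ["child*", "adolescen*", "young people"],
    # Generic name plus proprietary names (illustrative only).
    "intervention": ["ibuprofen", "Nurofen", "Brufen"],
    # Outcome terms deliberately omitted so that studies are not missed
    # simply because the outcome is not mentioned in the record.
    "outcome": [],
}

query = build_query(concepts)
print(query)
# (child* OR adolescen* OR "young people") AND (ibuprofen OR Nurofen OR Brufen)
```

Leaving the outcome concept empty reflects the point made above that not every PICOS element needs to be used.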

For a detailed discussion of how to structure a search from a review question, including the use of search filters for study design, see Appendix 2.

Text mining

Text mining is a rapidly developing approach to utilizing the large amount of published text now available. Its potential use in systematic reviews is currently being explored and it may in future be an additional useful way of identifying relevant studies.43, 44 The aim of text mining is to identify connections between seemingly unrelated facts to generate new ideas or hypotheses. A number of processes are involved in the technique: a) Information Retrieval identifies documents that match a user's query; b) Natural Language Processing provides the linguistic data needed to perform Information Extraction; c) Information Extraction is the process of automatically obtaining structured data from an unstructured natural language document; and d) Data Mining is the process of identifying patterns in large sets of data.45, 46 In future this approach may be helpful in automatically screening and ranking large numbers of potentially eligible studies prior to assessment by the researchers.

There are a variety of text mining tools available. For example, TerMine and Acromine47 are tools dealing with term extraction and variation. Also of interest are KLEIO,48 which provides advanced searching facilities across MEDLINE, and FACTA, which finds associated concepts using text analysis.49 Further information about text mining and the use of these tools can be found on the National Centre for Text Mining website (www.nactem.ac.uk/).

Updating literature searches

Depending on the scope and timescale of the review, an update of the literature searches towards the end of the project may be required. If the initial searches were carried out some time before the final analysis is undertaken (e.g. six months) it may be necessary to re-run the searches to ensure that no recent papers are missed. To do this successfully the date the original search was conducted and the years covered by the search must have been recorded.

When doing update searches the update date field should be used rather than the actual date. This ensures that anything added to the database since the original search was conducted will be identified. If the database has added a lot of older material (e.g. from 1967) this will be removed by using the original date limits (e.g. 1990-2008) in combination with the update date field. For databases that do not include an update date field it may be better to run the whole search again and then use reference management software to remove those records that have already been identified and assessed.

Current awareness

If a review is covering an area where there is rapid change or if a major study is expected to report its findings in the near future, setting up current awareness alerts can ensure that new papers are identified as soon as they become available. Options for current awareness include e-mail alerts from journals and RSS feeds from databases or websites.

Managing references

To ensure the retrieved records are managed efficiently the team should agree working practices. For example, the team should agree who will screen the references and record decisions about which documents to obtain, and how these decisions will be coded; whether decisions about rejecting or obtaining documents should be made blind to others’ decisions; and how documents received will be stored. In addition, one member of the team should be responsible for identifying and removing duplicate references, ordering inter-library loans, recording the receipt of documents, and following up non-arrivals.

Using bibliographic software such as EndNote, Reference Manager or ProCite to record and manage references will help in documenting the process, streamline document management and make the production of reference lists for reports and journal papers easier. EPPI-Reviewer, a web-based review management programme, also incorporates reference management functions.4, 50 Alternatively it is possible to construct a database of references using a database package such as Microsoft Access or a word processing package. By creating a 'library' (database) of references, information can be shared by the whole review team, duplicated references can be identified and deleted more easily, and customised fields can be created where ordering decisions can be recorded.42 Specialised bibliographic management software packages have the facility to import references from electronic databases into the library and interact with word processing packages so bibliographies can be created in a variety of styles.
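Bibliographic software normally provides its own duplicate detection, but the underlying idea can be sketched in a few lines. The example below assumes each reference is a simple record with a title and year, and treats two records as duplicates when their normalised titles and years match; the field names, the matching rule and the sample records are all hypothetical, and real packages use more elaborate comparisons.

```python
import re


def dedup_key(ref):
    """Build a comparison key: lower-cased title without punctuation, plus year."""
    title = re.sub(r"[^a-z0-9]+", " ", ref["title"].lower()).strip()
    return (title, ref["year"])


def remove_duplicates(references):
    """Keep the first record seen for each key; return (unique, duplicates)."""
    seen = {}
    duplicates = []
    for ref in references:
        key = dedup_key(ref)
        if key in seen:
            duplicates.append(ref)
        else:
            seen[key] = ref
    return list(seen.values()), duplicates


# Made-up records as they might arrive from three different databases.
records = [
    {"id": "medline-1", "title": "Aspirin for headache: a trial.", "year": 2005},
    {"id": "embase-7", "title": "Aspirin for Headache - A Trial", "year": 2005},
    {"id": "cinahl-3", "title": "Nursing care in stroke units", "year": 2007},
]

unique, dupes = remove_duplicates(records)
print([r["id"] for r in unique])  # ['medline-1', 'cinahl-3']
print([r["id"] for r in dupes])   # ['embase-7']
```

Records flagged as duplicates should still be checked by the person responsible for the library, since matching on title and year alone can wrongly merge distinct papers.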

When an electronic library of references is used, it is important to establish in advance clear rules about which team members can add or amend records in the library, and to ensure that consistent terminology is used to record decisions. It is usually preferable to have one person from the team responsible for the library of references.

Obtaining documents

Obtaining a large number of papers in a short space of time can be very labour intensive. The procedure for acquiring documents will vary according to organisational arrangements and will depend on issues such as cost, what resources are available, and whether access to an inter-library loan network is available. Most libraries in the United Kingdom will be able to obtain articles from the British Library Document Supply Centre’s collection although membership is required and there is a charge per article. Many journals are available in full text on the Internet, although a subscription may be required before articles can be downloaded. It may be cost-effective to travel to a particular library to obtain material if a large number of references are required and are available. The information specialist on the team is likely to know about networks of associated libraries and electronic resources that can be used for obtaining documents.51

Documenting the search

The search process should be reported in sufficient detail so that it could be re-run at a later date. The easiest way to document the search is to record the process and the results contemporaneously. The decisions reached during development and any changes or amendments made should be recorded and explained. It is important to record all searches, including Internet searches, handsearching and contact with experts.

Providing the full detail of searches helps future researchers to re-run or update the searches and enables readers to evaluate the thoroughness of searching. The write up of the search should include information about the databases and interfaces searched (including the dates covered), full detailed search strategies (including any justifications for date or language restrictions) and the number of records retrieved.

When systematic reviews are reported in journal articles, limits on the word count may make it impossible to provide full details of the searches. In these circumstances as much information as possible should be provided within the available space. For example, ‘We searched MEDLINE, EMBASE and CINAHL’ is more helpful to the reader than ‘We conducted computer searches’. Many journals now have an electronic version of the publication where the full search details can be provided. Alternatively, the published report can include the review team’s contact details so full details of the search strategies can be requested. If a detailed report is being written for the commissioners of the review, the full search details should be included.

The use of flow charts to demonstrate how relevant papers are identified is detailed in Section 1.3.2 Study selection. Guidance on documenting the different aspects of the searching process is given in Appendix 3.

Summary: Identifying research evidence for systematic reviews

  • The search for studies should be comprehensive.

  • The extent of searching is determined by the research question and the resources available to the research team.

  • Thorough searching is best achieved by using a variety of search methods (electronic and manual) and by searching multiple, possibly overlapping resources.

  • Most of the searching is likely to take place at the beginning of the review with an update search towards the end.

  • Using bibliographic software to record and manage references will help in documenting the process, streamline document management and make the production of reference lists for reports and journal papers easier.

  • The search process should be documented in full or details provided of where the strategy can be obtained.

1.3.2 Study selection

Literature searching may result in a large number of potentially eligible records that need to be assessed for inclusion against predetermined criteria, only a small proportion of which may eventually be included in the review. The process for selecting studies should be explicit and conducted in such a way as to minimize the risk of errors and bias. This section explains the steps involved and the issues to be considered when planning and conducting study selection.

Process for study selection

The process by which decisions on the selection of studies will be made should be specified in the protocol, including who will carry out each stage and how it will be performed. The aim of selection is to ensure that only relevant studies are included in the review.

It is important that the selection process should minimize biases, which can occur when the decision to include or exclude certain studies is affected by pre-formed opinions.52, 53, 54, 55, 56 The process for study selection therefore needs to be explicit and objective, and to minimize the potential for errors of judgement. It should be documented clearly to ensure it is reproducible (see Figure 1.1). The selection of studies from electronic databases is usually conducted in two stages:

Stage 1: a first decision is made based on titles and, where available, abstracts. These should be assessed against the predetermined inclusion criteria. If it can be determined that an article does not meet the inclusion criteria then it can be rejected straightaway. It is important to err on the side of over-inclusion during this first stage. The review question and the subsequent specification of the inclusion and exclusion criteria are likely to determine ease of rejection in this first stage. Where the question and criteria are tightly focused, it is usually easier to be confident that the rejected studies are not relevant. Rejected citations fall into two main categories: those that are clearly not relevant and those that address the topic of interest but fail on one or more criteria such as population. For those in the first category it is usually adequate to record the study as irrelevant, without giving a reason. For those in the second category it is useful to record why the study failed to meet the inclusion criteria, as this increases the transparency of the selection process. Where abstracts are available, the amount and usefulness of the information they provide for decision-making often vary according to database and journal. Structured abstracts such as those produced by the BMJ are particularly useful at this stage of the review process.

Stage 2: for studies that appear to meet the inclusion criteria, or in cases when a definite decision cannot be made based on the title and/or abstract alone, the full paper should be obtained for detailed assessment against the inclusion criteria.

Some searching methods provide access to full papers directly, for example handsearching journals and contact with research groups, in which case assessment for inclusion is a one stage process.

Even when explicit inclusion criteria are specified, decisions concerning the inclusion of individual studies can remain subjective. Familiarity with the topic area and an understanding of the definitions being used are usually important.

The reliability of the decision process is increased if all papers are independently assessed by more than one researcher, and the decisions shown to be reproducible. One study found that on average a single researcher is likely to miss 8% of eligible studies, whereas a pair of researchers working independently would capture all eligible studies.57 Assessment of agreement is particularly important during the pilot phase (described later in this section), when evidence of poor agreement should lead to a revision of the selection criteria or an improvement of their coding. Agreement between assessors (inter-assessor reliability) may be formally assessed mathematically using a Kappa statistic (a measure of chance-corrected agreement).58
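For readers unfamiliar with the calculation, the sketch below computes a Kappa statistic for two assessors. It assumes the two-rater form usually known as Cohen's kappa, (observed agreement − expected agreement) / (1 − expected agreement); the text above does not specify which variant is intended, and the screening decisions shown are invented.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two assessors' decisions."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both assessors must rate the same non-empty set of papers")
    n = len(rater_a)
    # Proportion of papers on which the two assessors agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each assessor's marginal totals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # both assessors used a single identical category
    return (observed - expected) / (1 - expected)


# Invented decisions for ten papers ("I" = include, "E" = exclude).
assessor_1 = ["I", "I", "E", "E", "E", "I", "E", "E", "E", "E"]
assessor_2 = ["I", "E", "E", "E", "E", "I", "E", "E", "I", "E"]

kappa = cohen_kappa(assessor_1, assessor_2)
print(round(kappa, 2))  # 0.52
```

Here the assessors agree on 8 of 10 papers (0.80), chance agreement is 0.58, and so kappa is 0.22 / 0.42, or about 0.52, which is noticeably lower than the raw agreement.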

The process for resolving disagreements between assessors should be specified in the protocol. Many disagreements may be simple oversights, whilst others may be matters of interpretation. These disagreements should be discussed and, where possible, resolved by consensus after referring to the protocol; if necessary a third person may be consulted.

If resources and time allow, the lists of included and excluded studies may be discussed with the advisory group. In addition, these lists can be posted on a dedicated website with a request for feedback on any missing studies, an approach used in a review of water fluoridation.59 For further information see Section 1.3.8 Disseminating the findings of systematic reviews.

Piloting the study selection process

The selection process should be piloted by applying the inclusion criteria to a sample of papers in order to check that they can be reliably interpreted and that they classify the studies appropriately. The pilot phase can be used to refine and clarify the inclusion criteria and ensure that the criteria can be applied consistently by more than one person. Piloting may also give an indication of the likely time needed for the full selection process.


Judgements about inclusion may be affected by knowledge of the authorship, institutions, journal titles and year of publication, or the results and conclusions of articles.60 Blind assessment may be possible by removing such identifying information, but the gain should be weighed against the time and effort required to disguise the source of each article. Several studies have found that masking author, institution, journal name and study results is of limited value in study selection.61, 62 Therefore, the general opinion is that unmasked assessment by two independent researchers is acceptable.

Dealing with lack of information

Sometimes the amount of information reported about a study is insufficient to make a decision about inclusion, and it can be helpful to contact study authors to ask for more details. However, this requires time and resources, and the authors may not reply, particularly if the study is old. If authors are to be contacted it may be advisable to decide in advance how much time will be given to allow them to reply. If contacting authors is not practical then the studies in question could be excluded and listed as ‘potentially relevant studies’. If a decision is made to include such studies, the influence on the results of the review can be checked in a sensitivity analysis.

Dealing with duplication

It is important to look for duplicate publications of research results to ensure they are not treated as separate studies in the review. Multiple papers may be published for a number of reasons, including translations, results at different follow-up periods, or reporting of different outcomes. However, it is not always easy to identify duplicates as they are often covert (i.e. not cross-referenced to one another) and neither authorship nor sample size is a reliable criterion for identifying duplication.63 Estimates of the prevalence of duplicate publication range from 1.4% to 28%,64 and studies have been found to have up to five duplicate reports.63 Multiple reports from the same study may include identical samples with different outcomes reported or increasing samples with the same outcomes reported.

Multiple reporting can lead to biased results, as studies with significant results are more likely to be published or presented more frequently, leading to an overestimation of treatment effects when findings are combined.65 When multiple reports of a study are identified these should be treated as a single study but reference made to all the publications. It may be worthwhile comparing multiple publications for any discrepancies, which could be highlighted and the study authors contacted for clarification.

Documenting decisions

It is important to have a record of decisions made for each article. This may be in paper form, attached to paper copies of the articles, or the selection process may be partially or wholly computerised. If the search results are provided in electronic format, they can be imported into a reference management program such as EndNote, Reference Manager or ProCite which stores, displays and enables organisation of the records, and allows basic inclusion decisions to be made and recorded (in custom fields). For more complex selection procedures, where several decisions and comments need to be recorded, a database program such as Microsoft Access may be of use. There are also programs specifically designed for carrying out systematic reviews which include aids for the selection process, such as TrialStat SRS and EPPI-Reviewer.

Reporting study selection

A flow chart showing the number of studies/papers remaining at each stage is a simple and useful way of documenting the study selection process. Recommendations for reporting and presentation of a flow chart when reporting systematic reviews with or without a meta-analysis have been developed by the PRISMA group, formerly the QUOROM group. Publication of these guidelines is forthcoming.66, 67 In the meantime, the existing QUOROM guidelines for the reporting of meta-analyses of RCTs9 provide guidance that is equally applicable to all systematic reviews. Figure 1.1 is an example of a flow chart from a systematic review of treatments for childhood retinoblastoma.14

A list of studies excluded from the review should also be reported where possible, giving the reasons for exclusion. This list may be included in the report of the review as an appendix. In general, this list is most informative if it is restricted to ‘near misses’ (i.e. those studies that only narrowly failed to meet inclusion criteria and that readers might have expected to see included) rather than all the research evidence identified. Decisions to exclude studies may be reached at the title and abstract stage or at the full paper stage.
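Where selection decisions are recorded electronically, the numbers needed for a flow chart and for the list of excluded studies can be tallied directly from those records. The sketch below is one possible way of doing this; the record structure, field names and exclusion reasons are hypothetical and would need to match however the review team actually codes its decisions.

```python
from collections import Counter


def flow_counts(records):
    """Tally the numbers a study selection flow chart would report."""
    # Records not rejected on title/abstract go forward to full paper assessment.
    assessed = [r for r in records if r["stage1"] == "retrieve"]
    reasons = Counter(
        r.get("reason", "reason not recorded")
        for r in assessed
        if r.get("stage2") == "exclude"
    )
    return {
        "records_identified": len(records),
        "excluded_on_title_abstract": len(records) - len(assessed),
        "full_papers_assessed": len(assessed),
        "excluded_on_full_paper": dict(reasons),
        "included": sum(r.get("stage2") == "include" for r in assessed),
    }


# Invented decision records for six citations.
decisions = [
    {"id": 1, "stage1": "exclude"},
    {"id": 2, "stage1": "exclude"},
    {"id": 3, "stage1": "retrieve", "stage2": "exclude", "reason": "wrong population"},
    {"id": 4, "stage1": "retrieve", "stage2": "exclude", "reason": "wrong population"},
    {"id": 5, "stage1": "retrieve", "stage2": "exclude", "reason": "not a comparative study"},
    {"id": 6, "stage1": "retrieve", "stage2": "include"},
]

counts = flow_counts(decisions)
print(counts["records_identified"], counts["full_papers_assessed"], counts["included"])
# 6 4 1
```

Grouping the full paper exclusions by reason gives the breakdown that is usually shown in the flow chart and supports the list of ‘near misses’.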

Figure 1.1: Flow chart of study selection process14

Summary: Study selection

  • In order to minimize bias, studies should be assessed for inclusion using selection criteria that flow directly from the review question and that have been piloted to check that they can be reliably applied.

  • Study selection is a staged process involving sifting through the citations located by the search, retrieving full reports of potentially relevant citations and, from their assessment, identifying those studies that fulfil the inclusion criteria.

  • Parallel independent assessments should be conducted to minimize the risk of errors. If disagreements occur between assessors, they should be resolved according to a predefined strategy using consensus and arbitration as appropriate.

  • The study selection process should be documented, detailing reasons for exclusion of studies that are ‘near-misses’.

1.3.3 Data extraction

Data extraction is the process by which researchers obtain the necessary information about study characteristics and findings from the included studies. Data extraction requirements will vary from review to review, and the extraction forms should be tailored to the review question. The first stage of any data extraction is to plan the type of analyses and list the tables that will be included in the report. This will help to identify which data should be extracted. General guidance on the process is given here, but the specific details will clearly depend on the individual review topic.

A sample data extraction form and details of the data extraction process should be included in the review protocol. A common problem at the protocol stage is that there may be limited familiarity with the topic area. This can lead to uncertainties, for example, about comparators and outcome measures. As a result, time can be wasted extracting unnecessary data and difficulties can arise when attempting to utilise and synthesise the data. Sufficient time early in the project should therefore be allocated to developing, piloting and refining the data extraction form.

The extraction of data is linked to assessment of study quality in that both processes are often undertaken at the same time.

Standardised data extraction forms can provide consistency in a systematic review, whilst reducing bias and improving validity and reliability.68 Use of an electronic form has the added advantage of being able to combine data extraction and data entry into one step, and to facilitate data analysis and the production of results tables for the final report.

Design

Integral to the design of the form is the category of data to be extracted. It may be numerical, fixed text such as yes/no, a ‘pick list’, or free text. However, the number of free text fields should be limited as much as possible to simplify the analysis of data. The form should be unambiguous and easy to use in order to minimize discrepancies. Instructions for completion should be provided and each field should have decision rules about coding data in order to avoid ambiguity and to aid consistent completion. Piloting the form is essential. Paper forms should only be used where access to direct completion of electronic forms is impossible, to reduce risks of error in data transcription.

Content

The nature of the data extracted will depend on the type of question being addressed and the types of study available. Box 1.4 gives an example of some of the information that might be extracted for a comparative study.

Box 1.4 Example information requirements for data extraction

General information

Researcher performing data extraction


Date of data extraction


Identification features of the study:

Record number (to uniquely identify study)


Article title


Type of publication (e.g. journal article, conference abstract)

Country of origin

Source of funding


Study characteristics

Aim/objectives of the study

Study design

Study inclusion and exclusion criteria

Recruitment procedures used (e.g. details of randomisation, blinding)

Unit of allocation (e.g. participant, GP practice etc.)


Participant characteristics

Characteristics of participants at the beginning of the study e.g.




Socio-economic status

Disease characteristics



Number of participants in each characteristic category for intervention and control group(s) or mean/median characteristic values (record whether it is the number eligible, enrolled, or randomised that is reported in the study)


Intervention and setting

Setting in which the intervention is delivered


Description of the intervention(s) and control(s) (e.g. dose, route of administration, number of cycles, duration of cycle, care provider, how the intervention was developed, theoretical basis (where relevant))


Description of co-interventions


Outcome data/results

Unit of assessment/analysis


Statistical techniques used


For each pre-specified outcome:

Whether reported

Definition used in study

Measurement tool or method used

Unit of measurement (if appropriate)

Length of follow-up, number and/or times of follow-up measurements


For all intervention group(s) and control group(s):

Number of participants enrolled

Number of participants included in analysis

Number of withdrawals, exclusions, lost to follow-up

Summary outcome data e.g.

Dichotomous: number of events, number of participants

Continuous: mean and standard deviation


Type of analysis used in study (e.g. intention to treat, per protocol)


Results of study analysis e.g.

Dichotomous: odds ratio, risk ratio and confidence intervals, p-value

Continuous: mean difference, confidence intervals


If subgroup analysis is planned the above information on outcome data or results will need to be extracted for each patient subgroup


Additional outcomes


Record details of any additional relevant outcomes reported




Resource use


Adverse events


NB: Notes fields can be useful for occasional pieces of additional information or important comments that do not easily fit into the format of other fields.
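Box 1.4 asks for study results such as odds ratios or mean differences together with their confidence intervals. Where a study reports an interval but no standard error, the standard error can usually be recovered from the interval limits. The sketch below is illustrative only: it assumes a 95% interval based on a normal approximation (z = 1.96), the function name `se_from_ci` is hypothetical, and the example figures are invented rather than taken from any study.

```python
import math


def se_from_ci(lower, upper, z=1.96, ratio_measure=False):
    """Recover a standard error from a reported confidence interval.

    Ratio measures (odds ratio, risk ratio) have intervals that are
    symmetric on the log scale, so the limits are log-transformed first
    and the result is the standard error of the log ratio.
    z=1.96 assumes a 95% interval from a normal approximation.
    """
    if ratio_measure:
        lower, upper = math.log(lower), math.log(upper)
    return (upper - lower) / (2 * z)


# Invented example values, for illustration only.
se_md = se_from_ci(-4.0, -0.08)                          # mean difference
se_log_or = se_from_ci(0.50, 0.98, ratio_measure=True)   # odds ratio
```

Intervals derived from small samples using a t-distribution would need a different multiplier, so the method used in the original study should be checked before applying a conversion of this kind.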


The results to be extracted from each individual study may be reported in a variety of ways, and it is often necessary for a researcher to manipulate the available data into a common format. Manipulations of the reported findings are discussed in further detail in Section 1.3.5 Data synthesis, but can include using confidence intervals to determine standard errors or estimating the hazard ratio from a survival curve. Data can be categorised at this stage; however, it is advisable to extract as much of the reported data as is likely to be needed, and categorise at a later stage, so that detailed information is not lost during data extraction.

Software

EPPI-Reviewer is a web application that enables researchers to manage all stages of a review in a single location. RevMan and TrialStat SRS are other software packages that can be used in data extraction for systematic reviews. Other tools commonly used include general word processing packages, spreadsheets and databases.

Whichever software package is used, ideally it should have the ability to provide different types of question coding. Some software will also allow researchers to develop quality control mechanisms for minimising data entry errors, for example, by specifying ranges of valid values.

Piloting data extraction

Data extraction forms should be piloted on a sample of included studies to ensure that all the relevant information is captured and that resources are not wasted on extracting data not required. The consistency of the data extracted should be assessed to make sure that those extracting the data are interpreting the forms, and the draft instructions and decision rules about coding data, in the same way. This will help to reduce data extraction errors. The exporting, analysis and outputs of the data extraction forms should also be pilot tested where appropriate, on a small sample of included studies. This will ensure that the exporting of data works correctly and the outputs provide the information required for data analysis and synthesis.
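One check that can be exercised during piloting is field-level validation, such as the ranges of valid values mentioned earlier. The following is a minimal sketch of how such rules might be expressed where a general-purpose tool is used instead of a dedicated package. The field names, limits and pick list in `FIELD_RULES` are hypothetical and would need to be replaced by the decision rules agreed for the review.

```python
# Hypothetical field rules; names and limits are illustrative only.
FIELD_RULES = {
    "n_enrolled": {"type": int, "min": 1, "max": 100000},
    "mean_age": {"type": float, "min": 0.0, "max": 120.0},
    "randomised": {"type": str, "choices": {"yes", "no", "unclear"}},
}


def validate_record(record, rules=FIELD_RULES):
    """Return a list of problems found in one extracted record."""
    problems = []
    for field, rule in rules.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
            continue
        # Pick-list fields must use one of the agreed codes.
        if "choices" in rule and value not in rule["choices"]:
            problems.append(f"{field}: '{value}' not in pick list")
        # Numerical fields must fall inside the plausible range.
        if "min" in rule and value < rule["min"]:
            problems.append(f"{field}: {value} below {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{field}: {value} above {rule['max']}")
    return problems


# A likely transcription error: 43.0 entered as 430.0.
record = {"n_enrolled": 120, "mean_age": 430.0, "randomised": "yes"}
problems = validate_record(record)
```

Running the pilot records through checks of this kind can show whether the limits are too strict or too lenient before the full extraction begins.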

When using databases, piloting is particularly important as it becomes increasingly difficult to make changes once the template has been created and information has been entered into the database. Early production of the expected output is also the best way to check that the correct data structure has been set up.

Process of data extraction

Data extraction needs to be as unbiased and reliable as possible; however, it is prone to human error and often requires subjective decisions. The number of researchers who will perform data extraction is likely to be influenced by constraints on time and resources. Ideally two researchers should independently perform the data extraction (the level of inter-rater agreement is often measured using a Kappa statistic58). As an accepted minimum, one researcher can extract the data with a second researcher independently checking the data extraction forms for accuracy and completeness. This method may result in significantly more errors than two researchers independently performing data extraction but may also take significantly less time.69 Any disagreements should be noted and resolved by consensus among researchers or by arbitration by an additional independent researcher. A record of corrections or amendments to data extraction forms should be kept for future reference, particularly where there is genuine ambiguity (internal inconsistency) which cannot be resolved after discussion with the study authors. If using an electronic data extraction form that does not keep a record of amendments, completed forms can be printed and amendments recorded manually, before correcting the electronic version.
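As an illustration of the Kappa statistic mentioned above, the sketch below computes Cohen's kappa for two researchers who have independently coded the same categorical field. Cohen's kappa is one common form of the statistic and is assumed here; the function name and the ratings are invented for illustration.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters coding the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Invented codings of one yes/no field across ten studies.
reviewer_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
reviewer_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

In this invented example the researchers agree on 8 of 10 studies, but about half of that agreement would be expected by chance, so kappa is noticeably lower than the raw agreement of 0.8.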

As with screening studies for inclusion, blinding researchers to the journal and author details has been recommended.70, 71 However, this is a time-consuming operation, may not alter the results of a review and is likely to be of limited value.61

Reviews that include only published studies may be at risk of overestimating the treatment effect. Including data from unpublished studies (or unpublished outcomes) is therefore important in minimising bias. However, this can be time-consuming and the original data may no longer be available. Although those performing individual patient data (IPD) meta-analyses72 have generally been successful in obtaining data from the authors of unpublished studies, the same may not be true of other types of review. The practical difficulties of locating and obtaining information from unpublished studies may, for example, make the ideal of including relevant unpublished studies unachievable in the timescales available for many commissioned reviews. When information from unpublished studies is obtained, the published and unpublished material should be subjected to the same methodological evaluation.

Summary: Data extraction

  • Standardised data extraction forms provide consistency in a systematic review, thereby potentially reducing bias and improving validity and reliability.

  • Data extraction forms should be designed and developed with both the review question and subsequent analysis in mind. Sufficient time should be allocated early in the project for developing and piloting the data extraction forms.

  • The data extraction forms should contain only information required for descriptive purposes or for analyses later in the systematic review. Information on study characteristics should be sufficiently detailed to allow readers to assess the applicability of the findings to their area of interest.

  • Data extraction needs to be unbiased and reliable; however, it is prone to human error and often requires subjective decisions. Clear instructions and decision rules about coding data should be used.

  • As a minimum, one researcher should extract the data with a second researcher independently checking the data extraction forms for accuracy and detail. If disagreements occur between assessors, they should be resolved according to a predefined strategy using consensus and arbitration as appropriate.

1.3.4 Quality assessment

Introduction

Research can vary considerably in methodological rigour. Flaws in the design or conduct of a study can result in bias, and in some cases this can have as much influence on observed effects as that of treatment. Important intervention effects, or lack of effect, can therefore be obscured by bias.

Recording the strengths and weaknesses of included studies provides an indication of whether the results have been unduly influenced by aspects of study design or conduct (essentially the extent to which the study results can be ‘believed’). Assessment of study quality gives an indication of the strength of evidence provided by the review and can also inform the standards required for future research. Ultimately, quality assessment helps answer the question of whether the studies are robust enough to guide treatment, prevention, diagnostic or policy decisions.

Many useful books discuss the sources of bias in different study designs in detail, or provide an in-depth guide to critical appraisal.73, 74, 75 No single approach to assessing methodological quality is appropriate to all systematic reviews. The best approach will be determined by contextual, pragmatic and methodological considerations. However, the following sections describe the underlying principles of quality assessment and the key issues to consider.

Defining quality

Quality is a complex concept and the term is used in different ways. For example, a project using the Delphi consensus method with experts in the field of quality assessment of RCTs was unable to generate a definition of quality acceptable to all participants.76

Taking a broad view, the aim of assessing study quality is to establish how near the ‘truth’ its findings are likely to be and whether the findings are of relevance in the particular setting or patient group of interest. Quality assessment of any study is likely to consider the following:

The importance of each of these aspects of quality will depend on the focus and nature of the review. For example, issues around statistical analysis are less important if the study data are to be re-analysed in a meta-analysis, and the quality of reporting is irrelevant where data (either individual patient or aggregate) and information are obtained directly from those responsible for the study.

Appropriateness of study design

As discussed previously, types of study used to assess the effects of interventions can be arranged into a hierarchy, based broadly on their susceptibility to bias (Box 1.3). Although the RCT is considered the best study design to evaluate the effect of an intervention, in cases where it is unworkable or unethical to randomise participants (e.g. when evaluating the effects of smoking on health), researchers may instead have to use a quasi-experimental or an observational design. Simply grading studies using this hierarchy does not provide an adequate assessment of study quality, because it does not take into account variations in quality among studies of the same design. Even RCTs can be implemented in such a way that findings are likely to be seriously biased and therefore of little value in decision-making.

It should be noted that the terminology used to describe study designs (e.g. cohort, prospective, retrospective, historical controls, etc.) can be ambiguous and used in different ways by different researchers. Therefore it is important to consider the individual aspects of the study design that may introduce bias rather than focussing on the descriptive label used. This is particularly important for the description of non-randomised studies.

Risk of bias

Bias refers to systematic deviations from the true underlying effect brought about by poor study design or conduct in the collection, analysis, interpretation, publication or review of data. Bias can easily obscure intervention effects, and differences in the risk of bias between studies can help explain differences in findings.

Internal validity is the extent to which an observed effect can be truly attributed to the intervention being evaluated, rather than to flaws in the design or conduct of the study. Any such flaws can increase the risk of bias.

The types of bias, and the ways in which they can be minimised by each type of study design, are described below.

Randomised controlled trials

The RCT is generally considered to be the most appropriate study design for evaluating the effects of an intervention. This is because, when properly conducted, it limits the risk of bias. The simplest form of RCT is known as the parallel group trial which randomises eligible participants to two or more groups, treats according to assignment, and compares the groups with respect to outcomes of interest.

Participants are allocated to groups using both randomisation (allocation involves the play of chance) and concealment (ensures that the intervention that will be allocated cannot be known in advance of assignment). When appropriately implemented, these aspects of design should ensure that the groups being compared are similar in all respects other than the intervention. The groups should be balanced for both known and unknown factors that might influence outcome, such that any observed differences should be attributable to the effect of the intervention rather than to intrinsic differences between the groups.

Allocation in this way avoids the influence of confounding, where an additional factor is associated both with receiving the intervention and with the outcome of interest. For example, babies who are breast fed are less likely to have gastrointestinal illnesses than those who are bottle fed. Though this might suggest evidence for the protective effect of breastfeeding, mothers who breast feed also tend to be of higher socio-economic status, which in itself is associated with a range of health benefits to the baby. Therefore, when evaluating any possible protective effects of breastfeeding socio-economic status should be considered as a potential confounding factor. In some cases, the possible confounding factor(s) may not be known or measurable. In an RCT, so long as a sufficient number of participants are assigned then the groups should be balanced with respect to both known and unknown potential confounding factors.

Selection bias or allocation bias occurs where there are systematic differences between comparison groups in terms of prognosis or responsiveness to treatment. Concealed assignment prevents investigators being able to predict which intervention will be allocated next and using that information to select which participant receives which treatment. For example, clinicians may want to ‘try out’ the new intervention in patients with a poorer prognosis. If they succeed in doing this by knowing or correctly ‘guessing’ the order of allocation, the intervention group will eventually contain more seriously ill participants than the comparison group, such that the intervention will probably appear less effective than if the two groups had been properly balanced.

The most robust method for concealing the sequence of treatment allocation is a central telephone randomisation service, in which the care provider calls an independent trial service, registers the participant’s details and then discovers which intervention they are to be given. Similarly, an on-site computer-based randomisation system that is not readable until the time of allocation might be used. Envelope methods of randomisation, where allocation details are stored in pre-prepared envelopes, are less robust and more easily subverted than centralised methods. Where envelopes are used, they should be sealed, opaque and sequentially numbered, and opened only in front of the participant being randomised. Unfortunately, the methods used to ensure that the randomisation sequence remains concealed during implementation (frequently referred to as concealment of allocation) are often poorly reported, making it difficult to discern whether the methods were susceptible to bias.

Some studies, which may describe themselves as randomised, may allocate participants to groups on an alternating basis, or based on whether their date of birth is an odd or even number. Allocation in these studies is neither random nor concealed.

Performance bias refers to systematic differences (apart from the intervention of interest) in the treatment or care given to comparison groups during the study and detection bias refers to systematic differences between groups in the way that outcomes are ascertained. The risk of these biases can be minimized by ensuring that people involved in the study are unaware of which groups participants have been assigned to (i.e. they are blinded or masked). Ideally, the participants, those administering the intervention, those assessing outcomes and those analysing the data should all be blinded. If not, the knowledge of which comparison group is which may consciously or unconsciously influence the behaviour of any of these people. The feasibility and/or success of blinding will partly depend on the intervention in question. There are situations where blinding is not possible owing to the nature of the intervention, for example where a particular intervention has an obvious physiological effect whereas the comparator does not, and others where it may be unethical (e.g. sham surgery carries risks with no intended benefit). Methods of blinding for studies of drugs involve the use of pills and containers of identical size, shape and number (placebos). Sham devices can be used for many device interventions and for some procedural interventions sham procedures can be used (e.g. sham acupuncture). Blinding of outcome assessors is particularly important for more subjective outcome measures such as pain, but less important for objective measures such as mortality. Implementation of a blinding process does not however guarantee successful blinding in practice. In study reports, terms such as double-blind, triple-blind or single-blind can be used inconsistently77 and explicit reporting of blinding is often missing.78 It is important to clarify the exact details of the blinding process.

A well-conducted RCT should have processes in place to achieve complete and good quality data,79 in order to avoid attrition bias. Attrition bias refers to systematic differences between the comparison groups in terms of participants withdrawing or being excluded from the study. Participants may withdraw or drop-out from a study because the treatment has intolerable adverse effects, or on the other hand, they may recover and leave for that reason. They may simply be lost to follow-up, or they may be withdrawn due to a lack of data on outcome measures. Other reasons that participants may be excluded include mistaken randomisation of participants who, on review, did not meet the study inclusion criteria, and participants receiving the wrong intervention due to protocol violation. The likely impact of such withdrawals and exclusions needs to be considered carefully; if the exclusion is related to the intervention and outcome then it can bias the results (for example, not accounting for high numbers of withdrawals due to adverse effects in one intervention arm will unduly favour that intervention). Serious bias can arise as a result of participants being withdrawn for apparently ad hoc reasons that are related to the success or failure of an intervention. There is evidence from the field of cancer research that exclusion of patients from the analysis may bias results,80 though how this may apply to other fields is unclear. An intention to treat (ITT) analysis is generally recommended in order to reduce the risk of bias.

An ITT analysis includes outcome data on all trial participants and analyses them according to the intervention to which they were randomised, regardless of the intervention(s) they actually received. Complete outcome data are often unavailable for participants who drop-out before the end of the trial, so in order to include all participants, assumptions need to be made about their missing outcome data (for example by imputation of missing values). ITT analysis generally provides a more conservative, and potentially less biased, estimate in trials of effectiveness (see Section Quantitative synthesis of comparative studies). However, ITT analyses are often poorly described and applied81 and if assessing methodological quality associated with statistical analysis, care needs to be taken in judging whether the use of ITT analysis has minimized the risk of attrition bias and whether it was appropriately applied. If an ITT analysis is not used, then the study should at least report the proportion of participants excluded from the analysis to allow a researcher to judge whether this is likely to have led to bias.
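The effect of excluding withdrawals can be seen in a small worked example. The sketch below compares a completers-only analysis with an ITT analysis that uses one simple and deliberately conservative assumption, namely that every participant who withdrew is counted as a non-responder. The trial figures are invented, and other imputation approaches would give different results.

```python
# Invented trial data, for illustration only.
arms = {
    "intervention": {"randomised": 100, "completed": 70, "responders": 42},
    "control": {"randomised": 100, "completed": 95, "responders": 38},
}

for arm in arms.values():
    # Completers-only analysis: participants who withdrew are ignored.
    arm["completers_rate"] = arm["responders"] / arm["completed"]
    # ITT with a simple imputation: withdrawals counted as non-responders,
    # and everyone analysed in the arm to which they were randomised.
    arm["itt_rate"] = arm["responders"] / arm["randomised"]

completers_diff = (
    arms["intervention"]["completers_rate"] - arms["control"]["completers_rate"]
)
itt_diff = arms["intervention"]["itt_rate"] - arms["control"]["itt_rate"]
```

With these invented figures the completers-only difference in response rates is about 0.20 in favour of the intervention, whereas the ITT difference is about 0.04. The gap arises because 30 participants withdrew from the intervention arm compared with 5 from the control arm.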

The minimum criteria for assessment of risk of bias in RCTs are set out in Box 1.5. While all these criteria are relevant to assessing risk of bias, their relative importance can be context specific. For example, the importance of blinded outcome assessment will vary depending on whether the outcomes involve subjective judgement (this may vary between different outcomes measured within the same trial). Therefore, when planning which criteria to use it is important to think carefully about what characteristics would realistically be considered ideal. The Cochrane handbook provides a detailed assessment tool for use when assessing risk of bias in an RCT.82

Box 1.5: Criteria for assessment of risk of bias in RCTs

  • Was the method used to generate random allocations adequate?

  • Was the allocation adequately concealed?

  • Were the groups similar at the outset of the study in terms of prognostic factors, e.g. severity of disease?

  • Were the care providers, participants and outcome assessors blind to treatment allocation?  If any of these people were not blinded, what might be the likely impact on the risk of bias (for each outcome)?

  • Were there any unexpected imbalances in drop-outs between groups? If so, were they explained or adjusted for?

  • Is there any evidence to suggest that the authors measured more outcomes than they reported?

  • Did the analysis include an intention to treat analysis?  If so, was this appropriate and were appropriate methods used to account for missing data?

Other randomised study designs

In addition to parallel group RCTs, there are other randomised designs where further quality criteria may need to be considered. These are described below.

Randomised cross-over trials

In randomised cross-over trials all participants receive all the interventions. For example, in a two arm cross-over trial, one group receives intervention A before intervention B, and the other group receives intervention B before intervention A. It is the sequence of interventions that is randomised. The advantage of cross-over trials is that they are potentially more efficient than parallel trials of a similar size, in which each participant receives only one of the interventions. The criteria for assessing risk of bias in RCTs also apply to cross-over trials, but there are some additional factors that need to be taken into consideration.

The cross-over design is inappropriate for conditions where the intervention may provide a cure or remission, where there is a risk of spontaneous improvement or resolution of the condition, where there is a risk of deterioration over the period of the trial (e.g. degenerative conditions) or where there is a risk that patients may die.83 This is because these outcomes lead either to the participant being unable to enter the second period or to the participant’s condition on entering the second period being systematically different from that in the first period.

The possibility of a ‘carryover’ of the effect of the intervention provided in the first period into the second intervention period is an important concern in this study design.83 This risk is dealt with by building in a treatment-free or placebo ‘washout period’ between the intervention periods.83 The adequacy of the washout period length will need to be considered as part of the assessment of risk of bias.

The statistical analyses appropriate to cross-over trials are discussed in the synthesis section and statistical advice is likely to be required (see Section 1.3.5 Data synthesis).

Cluster randomised trials

A cluster randomised trial is a trial where clusters of people rather than single individuals are randomised to different interventions.84 For example, whole clinics or geographical locations may be randomised to receive particular interventions, rather than individuals.

The distinctive feature of cluster trials is that the outcome for each participant within a cluster may not be independent, since individuals within the cluster are likely to respond in a similar way to the intervention. Underlying reasons for this intra-cluster correlation include individuals in a cluster being affected in a similar manner due to shared exposure to a common environment such as specific hospital policies on discharge times; or personal interactions between cluster members and sharing of attitudes, behaviours and norms that may lead to similar responses.84 This has implications for estimating the sample size required (i.e. the sample needs to be larger than for an individually randomised trial) and the statistical analysis.
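The increase in sample size needed to allow for intra-cluster correlation is often approximated with a ‘design effect’. The sketch below uses the widely cited approximation 1 + (m - 1) × ICC, where m is the number of participants per cluster. This formula is not given in the text above and assumes clusters of equal size; the function names and input values are hypothetical.

```python
import math


def design_effect(cluster_size, icc):
    """Approximate variance inflation for equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


def inflated_sample_size(n_individual, cluster_size, icc):
    """Participants needed once clustering is taken into account."""
    n = n_individual * design_effect(cluster_size, icc)
    # Round away floating-point noise before rounding up to whole people.
    return math.ceil(round(n, 6))


# Invented inputs: 400 participants would suffice under individual
# randomisation; clusters have 20 participants and the ICC is 0.05.
deff = design_effect(cluster_size=20, icc=0.05)
n_needed = inflated_sample_size(400, 20, 0.05)
```

Even a small intra-cluster correlation can have a large effect when clusters are big: here the required sample size almost doubles, from 400 to 780. An ICC of zero gives a design effect of 1, which is the individually randomised case.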

When assessing the risk of selection bias in cluster randomised trials there are two factors that need to be considered: the randomisation of the clusters and how participants within clusters are selected into the study.85 The first can be dealt with by using an appropriate randomisation method with concealed allocation (clusters are often allocated at the outset). However, where the trial design then requires selection of participants from within a cluster, the risk of selection bias should also be assessed. There is a clear risk of selection bias when the person recruiting participants knows in advance the clinical characteristics of a participant and which intervention they will receive. Also, potential participants may know in advance which intervention their cluster will receive, leading to different participation rates in the comparison groups.85 Two key methods for reducing bias in the selection of individuals within clusters have been identified: recruitment of individuals prior to the random allocation of clusters and, where this is not possible, use of an impartial individual to recruit participants following randomisation of the clusters.86

The statistical analyses appropriate to cluster randomised trials are discussed in Section 1.3.5 Data synthesis and statistical advice is likely to be required.

Wider reading is recommended prior to conducting a quality assessment of cluster randomised trials. Several texts discuss the design, analysis and reporting of this trial design.75, 84, 87, 88

Quasi-experimental studies

The main distinction between randomised and quasi-experimental studies is the way in which participants are allocated to the intervention and control groups; quasi-experimental studies do not use random assignment to create the comparison groups.

In non-randomised controlled studies, individuals are allocated to concurrent comparison groups, using methods other than randomisation. The lack of concealed randomised allocation increases the risk of selection bias.

Before-and-after studies evaluate participants before and after the introduction of an intervention. The comparison is usually made in the same group of participants, thus avoiding selection bias, although a different group can be used. In this type of design, however, it can be difficult to account for confounding factors, secular trends, regression to the mean, and differences in the care of the participants apart from the intervention of interest.

An alternative to this is a ‘time series’ design. Interrupted time series studies involve multiple observations over time that are ‘interrupted’, usually by an intervention or treatment, and thus permit real intervention effects to be separated from other long-term trends. This design is used where others, such as RCTs, are not feasible, for example in the evaluation of a screening service or a mass media campaign. It is also frequently used in policy evaluation, for example to measure the effect of a smoking ban.

The circumstances in which, and extent to which, studies without randomisation are at risk of bias are not fully understood.89 A key influencing factor may be the extent to which prognosis influences selection for a particular intervention as well as eventual outcome.89 Because of the risk of bias, careful consideration should be given to the inclusion of quasi-experimental studies in a review to assess the effectiveness of an intervention. If included, researchers should think carefully about the strength of this evidence and how it should be interpreted.

A review of quality assessment tools designed for or used to assess studies without randomisation identified key aspects of quality as being particularly pertinent:89

Other quality issues identified were similar to those for assessing performance, detection and attrition bias in RCTs: blinding of participants and investigators; the level of confidence that the participants received the intervention to which they were assigned and experienced the reported outcome as a result of that intervention; the adequacy of the follow-up; and appropriateness of the analysis.

Observational studies

In observational studies the intervention(s) that individuals receive are determined by usual practice or ‘real-world’ choices, as opposed to being actively allocated as part of the study protocol.

Observational studies are usually more susceptible to bias than experimental studies, and the conclusions that can be drawn from them are necessarily more tentative and are often hypothesis generating, highlighting areas for further research.

Observational designs such as cohort studies, case-control studies and case series are often considered to form a hierarchy of increasing risk of bias. However, such a hierarchy is not always helpful because, as noted before, the same label can be used to describe studies with different design features and there is not always agreement on the definitions of such studies. Attention should focus on specific features of the studies (e.g. participant allocation, outcome assessment) and the extent to which they are susceptible to bias.

In a cohort study design, a defined group of participants is followed over time and comparison is made between those who did and did not receive an intervention (e.g. a study may follow a cohort of women who choose to use oral contraceptives and compare them over time with women who choose other forms of contraception). Prospective cohort studies are planned in advance and define their participants before the intervention of interest and follow them into the future. These are less likely to be susceptible to bias than retrospective cohort studies, which identify participants from past records and follow them from the time of that record.

Case-control studies compare groups from the same population with (cases) and without (controls) a specific outcome of interest, to evaluate the association between exposure to an intervention and the outcome. The risk of selection bias in such studies will be dependent on how the control group was selected. Groups may be matched to make them comparable for potential confounding factors. However, since analysis cannot be performed on matched variables, the matching criteria must be selected carefully, as this can give rise to ‘over-matching’ when the factors are related to allocation to the intervention.

Case series are observations made on a number of individuals (with no control group) and are not comparative. They can, however, provide useful information, for example about the unintentional effects of an intervention (see Chapter 4) and in such situations it is important to assess their quality.

Other issues related to study quality

Choice of outcome measure

As well as using blinding to minimise bias when assessing outcomes, it is usually necessary to consider the reliability or validity of the actual outcome measure being used (e.g. several different scales can be used to measure quality of life or psychological outcomes). It is important that the scales are fully understood to enable comparison (e.g. a high score implies a favourable outcome in some scales and an unfavourable one in others).

The outcome should also be relevant and meaningful to both the intervention and the evaluation (i.e. a treatment intended to reduce mortality should measure mortality, not merely a range of biochemical indicators).

Statistical issues

Although issues around statistical analysis are less important if the study data are to be combined in a meta-analysis, when studies are not being quantitatively pooled it is important to assess statistical issues around design and analysis, for example whether a study is adequately powered to detect an effect of the intervention.90 The assessment of statistical power may involve relying on the sample size calculation in the primary study, where reported. However, defining population parameters for sample size calculations is a subjective judgement which may vary between investigators;91 for some review topics it may be appropriate to define a priori an adequate sample size for the purposes of the review.
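As a rough illustration of what 'adequately powered' means, the sketch below applies the standard normal-approximation formula for comparing two means with equal group sizes. The function name and the example values (a difference of 5 points, a standard deviation of 10, two-sided alpha of 0.05 and 80% power) are assumptions chosen for illustration, not values taken from any study discussed here.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate number of participants needed per arm.

    Normal-approximation formula for two equal-sized groups:
        n = 2 * (z_(1 - alpha/2) + z_power)^2 * sd^2 / delta^2
    """
    if delta == 0:
        raise ValueError("delta must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired power
    n = 2 * (z_alpha + z_power) ** 2 * sd ** 2 / delta ** 2
    return math.ceil(n)


# Hypothetical scenario: detect a 5-point difference when the SD is 10.
required = n_per_group(delta=5, sd=10)
```

Under these assumed values the formula gives 63 participants per group. Halving the difference to be detected roughly quadruples the requirement, which shows how sensitive the judgement of 'adequate' is to the parameters chosen.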

Quality of reporting

Inadequate reporting of important aspects of methodological quality such as allocation concealment, blinding and statistical analysis is common,92 as is failure to report detail about the intervention and its implementation. Quality of reporting does not necessarily reflect the quality of the underlying methods or data, but when planning quality assessment it is important to decide how to deal with poor reporting. One approach is to assume that if an item is not reported then the criterion has not been met. While this may often be justifiable,93, 94 there is evidence to suggest that failure to report a method does not necessarily mean it has not been used.95, 96, 97 Therefore it is important to be accurate and distinguish between failure to report a criterion and failure to meet a criterion. For example, a criterion can be described as being met, not met, or unclear due to inadequate reporting.
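One way to keep 'not reported' separate from 'not met' when recording judgements is sketched below. The class, function and criterion names are hypothetical and simply mirror the three categories described above; they are not taken from any particular quality assessment tool.

```python
from enum import Enum
from typing import Optional


class Judgement(Enum):
    MET = "met"
    NOT_MET = "not met"
    UNCLEAR = "unclear"


def judge(reported_value: Optional[bool]) -> Judgement:
    """Map what a study report says about a criterion to a judgement.

    True  -> the report describes an adequate method
    False -> the report describes an inadequate method
    None  -> the report does not mention the method at all
    """
    if reported_value is None:
        return Judgement.UNCLEAR  # failure to report is not failure to meet
    return Judgement.MET if reported_value else Judgement.NOT_MET


# Hypothetical extraction for one trial.
trial = {
    "allocation concealment": True,
    "blinding": None,
    "intention-to-treat analysis": False,
}
assessment = {criterion: judge(value).value for criterion, value in trial.items()}
```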

There have been a number of initiatives aimed at improving the quality of reporting of primary research. The CONSORT statement contains a set of recommendations for the reporting of RCTs,98 the TREND statement provides guidelines for the reporting of non-randomised evaluations of behavioural and public health interventions,99 and the STROBE statement is an initiative to improve reporting of observational studies.100 The EQUATOR network promotes the transparent and accurate reporting of health research in a number of ways, including the use of these consensus reporting guidelines.101 It is anticipated that implementation of these guidelines will help improve the standard of reporting, which should make quality assessment more straightforward.

Quality of the intervention

In addition to study design, it is often helpful to assess the quality of the intervention and its implementation. At its most simplistic, the quality of an intervention refers to whether it has been used appropriately. This is a fairly straightforward assessment where, for example, drug titration studies have been conducted. It is more problematic where there is no preliminary research suggesting that an intervention should be administered in a particular way,102 or where the intervention requires a technical skill such as surgery or physiotherapy.103 It is important to establish to what extent these are standardised, as this will affect how the results should be interpreted.

The quality of the intervention is particularly relevant to complex interventions made up from a number of components, which act independently and inter-dependently.104 These include clinical interventions such as physiotherapy as well as public health interventions such as community-based programmes. The quality of an intervention can be conceptualised as having two main aspects: (i) whether the intervention has been appropriately defined and (ii) whether it has been delivered as planned (the integrity or fidelity of the intervention).

If the quality of the intervention is relevant, the review should assess whether the intervention was implemented as planned in the individual studies (i.e. how many participants received the intervention as planned, whether consistency of implementation was measured, and whether it is likely that participants received an unintended intervention/contamination of the intervention that may influence the results). In some topic areas, for example when a sham device or procedure is being used, it may also be relevant to assess the quality of the comparator. When an intervention relies on the skill of the care provider it may be useful to assess whether the performance of those providing the intervention was measured. For more detailed information on complex interventions see Chapter 3.


Generalisability, also known as applicability or external validity, is not considered in detail in this section. In addition to assessing the risk of bias (internal validity), researchers may also consider how closely a study reflects routine practice or the usual setting where the intervention would be implemented. However, this is not an inherent characteristic of a study as the extent to which a study is ‘generalisable’ depends also on the situation to which the findings are being applied.105 Therefore the issue of generalisability is also raised in Section 1.3.3 Data extraction; in the context of defining inclusion criteria for the review, in Section 1.2 The review protocol; and in Section 1.3.6 Report writing.

The impact of study quality on the estimate of effect

Several empirical studies have explored how quality can influence the results of clinical trials (and therefore the results of reviews of trials). Trials with double-blinding and adequate concealment of allocation have been found to indicate less beneficial treatment effects than trials without these features.106 Similarly, exclusion of lower quality studies has led to less beneficial effects in meta-analyses.106 In meta-analyses of subjectively assessed outcomes (e.g. patient reported outcomes), inadequate allocation concealment and lack of blinding have been associated with substantially more beneficial treatment effects, whereas for objective outcomes (e.g. mortality) there was a modest effect of inadequate allocation concealment and no effect of lack of blinding.107 There is some evidence about the relationship between study quality and the estimate of effect that is contradictory to the above,108, 109 though this may be due to the data sets used and how specific quality criteria were defined.

The process of quality assessment in systematic reviews

There are two main approaches towards assessing quality. One involves the use of checklists of quality items and the other of scales which provide an overall numerical quality score for each study.110

Tools for assessing quality

Checklists can be a reliable means of ensuring that all the studies assessed are critically appraised in a standardised way. There are many different checklists and scales readily available,75, 111, 112, 113, 114, 115, 116 which can be modified to meet the requirements of the review, or a new detailed checklist, specific to the review, may be developed.

Because some items included may require a degree of subjective judgement, it is important to pilot the use of the checklist and to ensure that the quality assessment is undertaken independently by two researchers.

The use of scales with summary scores to distinguish high and low quality studies is questionable and not recommended.117, 118 Very few scales have been developed using standard techniques to establish their validity and reliability.113 The weighting assigned to methodological items varies considerably between scales,117 and does not usually take into account the direction of bias.119 An investigation comparing low-molecular-weight heparin (LMWH) with standard heparin for thromboprophylaxis in general surgery found that trials identified as ‘high quality’ by some of the 25 scales investigated indicated that LMWH was not superior to standard heparin, whereas trials identified as ‘high quality’ by other scales led to the opposite conclusion, that LMWH was beneficial.117 It is therefore preferable that aspects of quality such as blinding and treatment allocation (and their potential impact on study results) should be considered individually.117

Checklists by type of study design

In general, checklists tend to be specific to particular study designs, and where reviews include more than one type of study design, separate lists can be used or a combined list selected or developed. Checklists have also been developed for use with both randomised and non-randomised studies, such as that by Downs and Black.111

There are multiple systems available for the evaluation of RCTs,112, 113 in addition to the Cochrane handbook assessment tool for assessing risk of bias.82 In a review of checklists for the assessment of non-randomised studies, nearly 200 tools were identified. From these, six were recommended as being suitable for use in systematic reviews including non-randomised studies.89 The Cochrane Effective Practice and Organisation of Care Group (EPOC) have developed guidelines to assist researchers in making decisions about when to include studies that use interrupted time series designs and how to assess their methodological quality.115, 116 A useful checklist for observational studies was published as part of the US Agency for Healthcare Research and Quality’s (AHRQ) ‘Systems to Rate the Strength of Scientific Evidence’.112 The most recent version of the Cochrane Handbook also contains guidance on dealing with non-randomised studies in systematic reviews of interventions, from the protocol to synthesis stages.75

How will the quality assessment information be used?

Simply reporting which quality criteria were met by studies included in a systematic review is not sufficient. The implications of the quality assessment for interpreting results need to be explicitly considered.

Study quality can be incorporated into the synthesis either quantitatively through subgroup or sensitivity analyses (see Section Quantitative synthesis), or in a narrative synthesis. In the latter, the quality assessment can be used to help interpret and explain differences in results across studies (e.g. unblinded studies with subjective outcomes may have consistently larger effects than blinded studies) and inform a qualitative interpretation of the risk of bias (see Section Narrative synthesis).
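A sensitivity analysis of this kind can be sketched as follows. The pooling method (inverse-variance fixed effect), the effect estimates and the blinding flags are all assumptions made for illustration; the guidance above does not prescribe a particular model, and the numbers are invented.

```python
import math


def pool_fixed_effect(studies):
    """Inverse-variance fixed-effect pooled estimate and its standard error."""
    weights = [1 / s["se"] ** 2 for s in studies]
    pooled = sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))


# Hypothetical log odds ratios (negative values favour the intervention).
studies = [
    {"id": "A", "effect": -0.10, "se": 0.10, "blinded": True},
    {"id": "B", "effect": -0.20, "se": 0.20, "blinded": True},
    {"id": "C", "effect": -0.60, "se": 0.20, "blinded": False},
    {"id": "D", "effect": -0.80, "se": 0.25, "blinded": False},
]

# Main analysis: all studies.
all_pooled, all_se = pool_fixed_effect(studies)

# Sensitivity analysis: restrict to studies judged adequately blinded.
blinded_pooled, blinded_se = pool_fixed_effect([s for s in studies if s["blinded"]])
```

In this invented data set the pooled effect shrinks towards zero when the unblinded studies are removed, which is the kind of difference that should then be discussed when interpreting the results.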

Summary: Quality assessment

  • An important part of the systematic review process is to assess the risk of bias in included studies caused by inadequacies in study design, conduct or analysis that may have led to the treatment effect being over or underestimated.

  • Various tools are available but there is no single tool that is suitable for use in all reviews. Choice should be guided by:

    • Study design

    • The level of detail required in the assessment

    • The ability to distinguish between internal validity (risk of bias) and external validity (generalisability)

  • Using quality scores is problematic; it is preferable to consider individual aspects of methodological quality in the quality assessment and synthesis.

  • Where appropriate, the potential impact that methodological quality had on the findings of the included studies should be considered.

  • Detailed quality assessment can be time consuming if a review includes a large number of studies and may require considerable expertise in critical appraisal. If resources are limited, priority should be given to assessment of the key sources of bias.

1.3.5 Data synthesis

Synthesis involves the collation, combination and summary of the findings of individual studies included in the systematic review. Synthesis can be done quantitatively using formal statistical techniques such as meta-analysis, or if formal pooling of results is inappropriate, through a narrative approach. As well as drawing results together, synthesis should consider the strength of evidence, explore whether any observed effects are consistent across studies, and investigate possible reasons for any inconsistencies. This enables reliable conclusions to be drawn from the assembled body of evidence.

Deciding what type of synthesis is appropriate

Many systematic reviews evaluating the effects of health interventions focus on evidence from RCTs, the results of which, generally, can be combined quantitatively. However, not all health care questions can be addressed by RCTs, and systematic reviews do not automatically involve statistical pooling. Meta-analysis is not always possible or sensible. For example, pooling results obtained from diverse non-randomised study types is not recommended.120 Similarly, meta-analysis of poor quality studies could be seriously misleading as errors or biases in individual studies would be compounded and the very act of synthesis may give credence to poor quality studies. However, when used appropriately, meta-analysis has the advantage of being explicit in the way that data from individual studies are combined, and is a powerful tool for combining study findings, helping avoid misinterpretation and allowing meaningful conclusions to be drawn across studies.

The planned approach should be decided at the outset of the review, depending on the type of question posed and the type of studies that are likely to be available. There may be topics where it can be decided a priori that a narrative approach is appropriate. For example, in a systematic review of interventions for people bereaved by suicide, it was anticipated there would be such diversity in the included studies, in terms of settings, interventions and outcome measures, that a narrative synthesis alone was proposed in the protocol.121

Narrative and quantitative approaches are not mutually exclusive. Components of narrative synthesis can be usefully incorporated into a review that is primarily quantitative in focus and those that take a primarily narrative approach can incorporate some statistical analyses such as calculating a common outcome statistic for each study.

Initial descriptive synthesis

Both quantitative and narrative synthesis should begin by constructing a clear descriptive summary of the included studies. This is usually done by tabulating details about study type, interventions, numbers of participants, a summary of participant characteristics, outcomes and outcome measures. An indication of study quality or risk of bias may also be given in this or a separate table (see Section 1.3.2 Study selection and Section 1.3.4 Quality assessment). An example is given in Table 1.1. If the review will not involve re-calculating summary statistics, but will rather rely on the reported results of the author’s analyses, these may also be included in the table. The descriptive process should be both explicit and rigorous and decisions about how to group and tabulate data should be based on the review question and what has been planned in the protocol. This initial phase will also be helpful in confirming that studies are similar and reliable enough to synthesise, and that it is appropriate to pool results.

Table 1.1: Example table describing studies included in a systematic review of the effectiveness of drug treatments for attention deficit hyperactivity disorder in children and adolescents.122



| Study | Design | Intervention – N | Age (years) | Duration (weeks) | Core outcomes |
|---|---|---|---|---|---|
| Administered once daily | | | | | |
| Rapport, 1989 | C (5x) | MPH (5 mg/day, o.d.) – 45; MPH (10 mg/day, o.d.) – 45; MPH (15 mg/day, o.d.) – 45 | | | Core: no hyp; Abbreviated CTRS: total score. QoL: not reported. AE: not reported |
| DuPaul, 1993 | C (5x) | MPH (5 mg/day, o.d.) – 31; MPH (10 mg/day, o.d.) – 31; MPH (15 mg/day, o.d.) – 31 | | | Core: no hyp; Abbreviated CTRS: total score. QoL: not reported. AE: not reported |
| Werry, 1980 | C (3x) | MPH (0.40 mg/kg, o.d.) – 30 | | | Core: Conners’ Teacher Questionnaire: hyperactivity; Conners’ Parent Questionnaire: hyperactivity. QoL: CGI (physician). AE: weight |
| Administered two or more times daily | | | | | |
| Brown, 1988 | C (4x) | MPH (8.76 mg/day, b.d.) – 11 | | | Core: CPRS: Hyperactivity Index; Conners’ Teacher Hyperactivity Index; ACTeRS: hyperactivity. QoL: not reported. AE: SERS (parents); weight |
| Fischer, 1991 | C (3x) | MPH (0.40 mg/kg/day, b.d.) – 161 | | | Core: CPRS-R: Hyperactivity Index; CTRS-R: hyperactivity index; CTRS-R: hyperactivity. QoL: not reported. AE: CPRS-R: psychosomatic; SERS (parents, teachers): number of side-effects, mean severity rating |
| Fitzpatrick, 1992 | C (4x) | MPH (10–15 mg/day, b.d.) – 19 | | | Core: Conners’ Hyperactivity Index (parents and teacher); TOTS: hyperactivity (parents and teachers). QoL: no CGI; comments ratings (parent/teacher). AE: STESS (parents); weight |
| Fine, 1993 | C (3x) | MPH [0.30 mg/kg/day (unclear), b.d.] – 12 | | | Core: not reported. QoL: not reported. AE: side-effects questionnaire |
| Hoeppner, 1997 | C (3x) | MPH (0.30 mg/kg/day, b.d.) – 50 | | | Core: CPRS: Hyperactivity Index; CTRS: Hyperactivity Index. QoL: not reported. AE: not reported |
| Handen, 1999 | C (3x) | MPH (12–15 mg/day, max. 3x) – 11 | | | Core: CTRS: Hyperactivity Index; CTRS: hyperactivity. QoL: not reported. AE: Side Effects Checklist (teachers, parents); mean severity rating 0–6 |
| Manos, 1999 | C (4x) | MPH (10 mg/day, b.d.) – 42 | | | Core: no hyp; ASQ (parents and teachers); ARS (parent). QoL: no CGI; composite ratings (clinician). AE: Side Effects Behaviour Monitoring Scale (parents) |
| Barkley, 2000 | C (5x) | MPH (10 mg/day, b.d.) – 38 | | | Core: no hyp; ADHD Total Parent/Teacher rating. QoL: not reported. AE: number and severity of side-effects (teachers, parents, self) |
| Tervo, 2002 | C (3x) | MPH (0.10 mg/kg/day, b.d.) – 41 | M=9.9 (2.9) | | Core: no hyp; CBCL (parent). QoL: not reported. AE: not reported |

ACTeRS, ADD-H Comprehensive Teachers’ Rating Scale; AE, adverse effects; ARS, ADHD Rating Scale; ASQ, Abbreviated Symptoms Questionnaire; b.d., twice daily; C, cross-over trial (number of cross-overs); CBCL, Child Behaviour Checklist; CGI, Clinical Global Impression; CPRS, Conners’ Parent Rating Scale; CTRS, Conners’ Teacher Rating Scale; MPH, methylphenidate hydrochloride; N, number of participants; o.d., once daily; P, parallel trial; hyp, hyperactivity; PACS, Parental Account of Childhood Symptoms; SERS, Side Effects Rating Scale.

Narrative synthesis

All systematic reviews should contain text and tables to provide an initial descriptive summary and explanation of the characteristics and findings of the included studies. However, simply describing the studies is not sufficient for a synthesis. The defining characteristic of narrative synthesis is the adoption of a textual approach that provides an analysis of the relationships within and between studies and an overall assessment of the robustness of the evidence.

A narrative synthesis of studies may be undertaken where studies are too diverse (either clinically or methodologically) to combine in a meta-analysis, but even where a meta-analysis is possible, aspects of narrative synthesis will usually be required in order to fully interpret the collected evidence.

Narrative synthesis is inherently a more subjective process than meta-analysis; therefore, the approach used should be rigorous and transparent to reduce the potential for bias. The idea of narrative synthesis within a systematic review should not be confused with broader terms like 'narrative review', which are sometimes used to describe reviews that are not systematic.

A general framework for narrative synthesis

How narrative syntheses are carried out varies widely, and historically there has been a lack of consensus as to the constituent elements of the approach or the conditions for establishing credibility. A project for the Economic and Social Research Council (ESRC) Methods Programme has developed guidance on the conduct of narrative synthesis in systematic reviews.123, 124, 125, 126 The guidance offers both a general framework and specific tools and techniques that help to increase the transparency and trustworthiness of narrative synthesis.

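The general framework consists of four elements:

  • Developing a theory of how the intervention works, why and for whom

  • Developing a preliminary synthesis of findings of included studies

  • Exploring relationships within and between studies

  • Assessing the robustness of the synthesis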

Though the framework is divided into these four elements, the elements themselves do not have to be undertaken in a strictly sequential manner, nor are they totally independent of one another. A researcher is likely to move iteratively among the activities that make up these four elements.

For each element of the framework, this guidance presents a range of practical tools and techniques. It is not mandatory (or indeed appropriate) to employ each one of these for every narrative synthesis, but the appropriate tools/techniques should be selected depending upon the nature of the evidence being synthesised. The reason for the choice of tool or technique should be specified in the methods section of the review.

A fuller description of these tools and techniques and narrative synthesis in general can be found in the ESRC guidance report.125, 126 It should be noted that the list given here is not comprehensive and other tools and techniques may be appropriate in certain circumstances.

The four elements of the narrative synthesis framework (and some of their related tools and techniques) are described below (Figure 1.2).





Figure 1.2: Example of applying the narrative synthesis framework

Developing a theory of how the intervention works, why and for whom

The extent to which theory will play a role will partly depend upon the type of intervention(s) being evaluated. For example, theory may only play a minor role in a systematic review looking at the effects of a single therapeutic drug on patient outcomes because many aspects of the ‘mechanism of action’ will have been established in early studies investigating pharmacodynamics, dose-finding etc. Alternatively, in a systematic review evaluating the effects of a psychosocial or educational programme, theories about the causal chain linking the intervention to the outcomes of interest will be of crucial importance and might be presented descriptively or in diagrammatic form, as displayed in Figure 1.3.

Figure 1.3: Interventions to increase use and function of smoke alarms: implicit theory of change model

Developing a preliminary synthesis of findings of included studies

Once the relevant studies have been data extracted, the first step is to bring together, organise and describe their findings. The direction and size of the reported effects may be the starting point. Alternatively, a collection of studies evaluating one kind of intervention might be divided into subgroups of studies with distinct populations, such as children and adults. It is important to remember that this is only the first step of the synthesis. The remaining elements of the framework need to be taken into account before it can be considered adequate as a narrative synthesis.

Table 1.2 describes a range of tools and techniques that might be employed at this stage of the synthesis.

Table 1.2: Developing a preliminary synthesis of findings of included studies

Textual descriptions of studies

A descriptive paragraph on each included study. These descriptions should be produced in a systematic way, including the same type of information for all studies if possible and in the same order. It may be useful for recording purposes to do this for all excluded studies as well.

Groupings and clusters

The included studies might be grouped at an early stage of the review, though it may be necessary to refine these initial groups as the synthesis develops. This can also be a useful way of aiding the process of description and analysis and looking for patterns within and across groups. It is important to use the review question(s) to inform decisions about how to group the included studies.


Tabulation

A common approach, used to represent data visually. The way in which data are tabulated may affect readers’ impressions of the relationships between studies, emphasising the importance of a narrative interpretation to supplement the tabulated data.

Transforming data into a common measure

In both narrative and quantitative synthesis it is important to ensure that data are presented in a common measure to allow an accurate description of the range of effects.

Vote-counting as a descriptive tool

Simple vote-counting might involve the tabulation of findings according to direction of effect. More complex approaches can be developed both in terms of the categories used and by assigning different weights or scores to different categories. However, vote-counting can disregard sample size and be misleading. So, the interpretation of the results must be approached with caution and subjected to further scrutiny.

Translating data: thematic analysis

A technique used in the analysis of qualitative data in primary research can be used to systematically identify the main, recurrent and/or most important (based on the review question) themes and/or concepts across multiple studies.127

Translating data: content analysis

A technique for compressing many words of text into fewer content categories based on explicit rules of coding.128 Unlike thematic analysis, it is essentially a quantitative method, since all the data are eventually converted into frequencies.
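Two of the techniques in Table 1.2, transforming data into a common measure and vote-counting, can be illustrated together. The sketch below converts results reported on different scales into standardised mean differences (Cohen's d, chosen here as one possible common measure) and then counts votes by direction of effect. The study labels and all numbers are invented, and higher scores are assumed to be better.

```python
import math
from collections import Counter


def standardised_mean_difference(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)


# Hypothetical studies: (intervention mean, SD, n, control mean, SD, n).
studies = {
    "X": (60.0, 20.0, 50, 50.0, 20.0, 50),    # 0-100 scale
    "Y": (6.2, 2.0, 20, 6.0, 2.0, 20),        # 0-10 scale
    "Z": (4.0, 2.5, 400, 5.0, 2.5, 400),      # 0-10 scale, largest study
}

smds = {name: standardised_mean_difference(*values) for name, values in studies.items()}

# Vote-counting by direction only; sample size plays no part.
# A difference of exactly zero is not handled separately in this sketch.
votes = Counter(
    "favours intervention" if d > 0 else "favours control" for d in smds.values()
)
```

Here two of the three votes favour the intervention, yet the single study favouring control is far larger than the other two combined. This is the sense in which vote-counting can disregard sample size and mislead.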

Exploring relationships within and between studies

Patterns emerging from the data during the preliminary synthesis need to be rigorously scrutinised in order to identify factors that might explain variations in the size/direction of effects. At this stage there is a clear attempt to explore relationships between: (a) characteristics of individual studies and their reported findings; and (b) the findings of different studies.

However, when exploring heterogeneity in this way, it is necessary to be wary of uncovering associations between characteristics and results that are based on comparisons of many subgroups – some of these may simply have occurred by chance. Subgroup comparisons which are specified in advance (i.e. as part of the review protocol) are more likely to be plausible than those which are not.129, 130
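The risk of chance findings grows quickly with the number of comparisons. The sketch below assumes the comparisons are independent and each is tested at the same significance level; both are simplifying assumptions, and the function name is hypothetical.

```python
def chance_of_spurious_finding(n_comparisons, alpha=0.05):
    """Probability that at least one of n independent tests is 'significant'
    at level alpha when no true subgroup differences exist."""
    return 1 - (1 - alpha) ** n_comparisons


# Risk of at least one false positive for 1, 5, 10 and 20 subgroup comparisons.
risks = {k: round(chance_of_spurious_finding(k), 2) for k in (1, 5, 10, 20)}
```

Under these assumptions the chance of at least one spurious association rises from 5% for a single comparison to about 40% for ten and about 64% for twenty.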

The extent to which these factors can be explored in the review depends on how clearly they are reported in the primary research studies. The amount of detail may depend on the type of publication and the nature of the intervention being reviewed (e.g. highly standardised interventions may not be described as fully as more unusual ones).

Tools and techniques that might be employed at this stage of the synthesis are described in Table 1.3.

Table 1.3: Exploring relationships within and between studies

Graphs, frequency distributions, funnel plots, forest plots and L’Abbe plots

There are several visual or graphical tools that can help reviewers explore relationships within and between studies. These include: presenting results in graphical form; plotting findings (e.g. effect size) against study quality; plotting confidence intervals; and/or plotting outcome measures.

Moderator variables and subgroup analyses

This refers to the analysis of variables which can be expected to moderate the main effects being examined in the review. This can be done at the study level, by examining characteristics that vary between studies (such as study quality, study design or study setting) or by analysing characteristics of the sample (such as subgroups of participants).

Idea webbing and conceptual mapping

Involves using visual methods to help to construct groupings and relationships. The basic idea underpinning these approaches is (i) to group findings that are empirically and/or conceptually similar and (ii) to identify (again on the basis of empirical evidence and/or conceptual/theoretical arguments) relationships between these groupings.

Qualitative case descriptions

Any process in which descriptive data from studies included in the systematic review are used to try to explain differences in statistical findings. For example why one intervention outperforms another apparently similar intervention or why some studies are statistical outliers.

Investigator/methodological/conceptual triangulation

Triangulation makes use of a combination of different perspectives and/or assessment methods to study a particular phenomenon. This could apply to the methodological and theoretical approaches adopted by the researchers undertaking primary studies included in a systematic review, e.g. investigator triangulation explores the extent to which heterogeneity in study results may be attributable to the diverse approaches taken by different researchers. Triangulation involves analysing the data in relation to the context in which they were produced, notably the disciplinary perspectives and expertise of the researchers producing the data.

Assessing the robustness of the synthesis

Towards the end of the synthesis process, the analysis of relationships as described above should lead into an overall assessment of the strength of the evidence. This is essential when drawing conclusions based on the narrative synthesis.

Robustness can relate to the methodological quality of the included studies (such as risk of bias), and/or the credibility of the product of the synthesis process. Obviously, these are related. The credibility of a synthesis will depend on both the quality and the quantity of the evidence base it is built on, and the method of synthesis and the clarity/transparency of its description. If primary studies of poor methodological quality are included in the review in an uncritical manner then this will affect the integrity of the synthesis. Attempts to minimize the introduction of bias might include ‘weighting’ the findings of studies according to technical quality (i.e. giving greater credence to the findings of more methodologically sound studies) and providing a clear justification for this. Similarly, a clear description of the potential sources of bias within the synthesis process itself helps establish credibility with the reader.

Table 1.4 describes the tools and techniques that might be employed at this stage of the synthesis.

Table 1.4: Assessing the robustness of the synthesis

Use of validity assessment

Use of specific rules to define weak, moderate or good evidence. An example is the approach used by the US Centers for Disease Control and Prevention,131 although there are many other evidence grading systems available. Decisions about the strength of evidence are explicit, although the criteria used are often debated.

Reflecting critically on the synthesis process

Use of a critical discussion to address methodology of the synthesis used132 (especially focusing on its limitations and their potential influence on the results); evidence used (quality, validity, generalisability) – with emphasis on the possible sources of bias and their potential influence on results of the synthesis; assumptions made; discrepancies and uncertainties identified; expected changes in technology or evidence (e.g. identified ongoing studies); aspects that may have an influence on implementation and effectiveness in real settings. Such a discussion would provide information on both the robustness and generalisability of the synthesis.

Checking the synthesis with authors of primary studies

It is possible to consult with the authors of included primary studies in order to test the validity of the interpretations developed during the synthesis and the extent to which they are supported by the primary data.133 The authors of the primary studies may have useful insights into the possible accuracy and generalisability of the synthesis; this is most likely to be useful when the number of primary studies is small. This is a technique that has been used with qualitative evidence.

Quantitative synthesis of comparative studies

As with narrative synthesis, quantitative synthesis should be embedded in a review framework that is based on a clear hypothesis, should consider the direction and size of any observed intervention effects in relation to the strength of evidence, and should explore relationships within and between studies. The requirements for a careful and thoughtful approach, the need to assess the robustness of syntheses, and to reflect critically on the synthesis process, apply equally but are not repeated here.

This section aims to outline the rationale for quantitative synthesis of comparative studies and to focus on describing commonly used methods of combining study results and exploring heterogeneity. A more detailed overview of quantitative synthesis for systematic review is given in The Cochrane Handbook.75 Comprehensive accounts are also given by Whitehead134 and Cooper and Hedges,135 and a discussion of recent developments and more experimental approaches is given in a paper by Sutton and Higgins.136

Decisions about which comparisons to make, and which outcomes and summary effect measures to use, should have been addressed as part of the protocol development. However, as synthesis depends partly on what results are actually reported, some planned analyses may not be possible, and others may have to be adapted or developed. Any departures from the analyses planned in the protocol should be clearly justified and reported.

Decisions about what studies should and should not be combined are inevitably subjective and require careful discussion and judgement. As far as possible, these decisions should be considered a priori, at the time of writing the protocol. There will always be differences between studies that address a common question. Reserving meta-analyses for only those studies that evaluate exactly the same interventions in near identical participant populations would be severely limiting and seldom achievable in practice. For example, whilst it may not be sensible to average the results of studies using different classes of experimental drugs or comparators, it may be reasonable to combine results of studies that use analogues or drugs with similar mechanisms of action. Likewise, it will often be reasonable to combine results of studies that have used similar but not identical comparators (e.g. placebo and no treatment). Where there are substantial differences between studies addressing a broadly similar question, although combining their results to give an estimate of an average effect may be meaningless, a test of whether an overall effect is present might be informative. It can be useful to calculate summary statistics for each individual study to show the variability in results across studies. It may also be helpful to use meta-analysis methods to quantify this heterogeneity, even when combined estimates of effect are not produced.

Reasons for meta-analysis  

Combining the results of individual studies in a meta-analysis increases power and precision in estimating intervention effects. In most areas of health care, ‘breakthroughs’ are rare and we may reasonably expect that new interventions will lead to only modest improvements in outcomes; such improvements can of course be extremely important to individuals and of significant benefit in terms of population health. Large numbers of events are required to detect modest effects, which are easily obscured by the play of chance, and studies are often too small to do so reliably. Thus, in any group of small trials addressing similar questions, although a few may have demonstrated statistically significant results by chance alone, most are likely to be inconclusive. However, combining the results of studies in a meta-analysis provides increased numbers of participants, reduces random error, narrows confidence intervals, and provides a greater chance of detecting a real effect as statistically significant (i.e. increases statistical power). Meta-analysis also allows observation and statistical exploration of the pattern of results across studies and quantification and exploration of any differences.

Combining comparative study results in a meta-analysis 

Most meta-analyses take a two-step approach in that they first analyse the outcome of interest and calculate summary statistics for each individual study. In the second stage, these individual study statistics are combined to give an overall summary estimate. This is usually calculated as a weighted average of the individual study estimates. The greater the weight awarded to a study, the more it influences the overall estimate. Studies are usually, at least in part, weighted in inverse proportion to their variance (or standard error squared), a method which essentially gives more weight to larger studies and less weight to smaller studies. It is also possible to weight studies according to other factors such as trial quality, but such methods are very seldom implemented and not recommended.

Two main statistical models are used. Fixed-effect models weight the contribution of each study in proportion to the amount of information observed in the study. This considers only variability in results within studies and no allowance is made for variation between studies. Random-effects models allow for between-study variability in results by weighting studies using a combination of their own variance and the between-study variance. Where there is little between-study variability, the within-study variance will dominate and the random-effects weighting will tend towards that of the fixed-effect weighting. If there is substantial between-study variability, this dominates the weighting factor and within-study variability contributes little to the analysis. In this way, all trials will tend towards contributing equally towards the overall estimate and it can be argued that small studies will unduly influence the estimate. Those in favour of random-effects models argue that they formally allow for between-study variability and that the fixed-effect approach unrealistically assumes a single effect across trials and gives over-precise estimates. In practice, with well-defined questions, the results of both approaches are often very similar and it is common to run both to test robustness of the choice of statistical model.

Generic inverse variance method of combining study results  

The generic inverse variance method is a widely used and easy-to-implement method of combining study results that underlies many of the approaches that are described later. It is very flexible and can be used to combine any type of effect measure provided that an effect estimate and its standard error are available from each study. Effect estimates may include adjusted estimates, estimates corrected for clustering and repeat measurements, or other summaries derived from more complex statistical methods.

A fixed-effect meta-analysis using the generic inverse variance method calculates a weighted average of study effect estimates (EEIV) by summing individual effect estimates (EEi), for example, the log odds ratio or the mean difference, and weighting these by the reciprocal of their squared standard errors (SEi) as follows:137

$$EE_{IV} = \frac{\sum_i EE_i / SE_i^2}{\sum_i 1 / SE_i^2}$$

A random-effects approach involves adjusting the study specific standard errors to incorporate between-study variation, which can be estimated from the effects and standard errors associated with the included studies.138
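The two calculations described above can be sketched in Python. This is a minimal illustration rather than code from the guidance: the function names and the example data are hypothetical, and the between-study variance is estimated with the DerSimonian and Laird moment estimator, which is one common choice and an assumption here because the text does not name an estimator.

```python
import math

def fixed_effect(estimates, std_errors):
    """Inverse variance weighted average of study estimates and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * ee for w, ee in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)

def between_study_variance(estimates, std_errors):
    """Moment estimate of the between-study variance (DerSimonian-Laird, assumed)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled, _ = fixed_effect(estimates, std_errors)
    # Weighted sum of squared deviations from the fixed-effect estimate.
    q = sum(w * (ee - pooled) ** 2 for w, ee in zip(weights, estimates))
    scale = total - sum(w ** 2 for w in weights) / total
    if len(estimates) < 2 or scale <= 0:
        return 0.0
    return max(0.0, (q - (len(estimates) - 1)) / scale)

def random_effects(estimates, std_errors):
    """Pool after adding the between-study variance to each squared standard error."""
    tau2 = between_study_variance(estimates, std_errors)
    adjusted = [math.sqrt(se ** 2 + tau2) for se in std_errors]
    pooled, pooled_se = fixed_effect(estimates, adjusted)
    return pooled, pooled_se, tau2

# Hypothetical log odds ratios and standard errors from four studies.
log_or = [-0.40, -0.10, -0.65, 0.05]
se = [0.20, 0.15, 0.35, 0.25]
print("fixed effect:  ", fixed_effect(log_or, se))
print("random effects:", random_effects(log_or, se))
```

When the estimated between-study variance is zero, the adjusted standard errors equal the original ones and the two models give the same result.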

Types of data

Other ways to combine studies of effectiveness are available, some of which are specific to the nature of the data that have been collected, analysed and presented in the included studies.

Box 1.6: Illustration of how to calculate risk ratio, relative and absolute risk reduction, and odds ratios and their standard errors

2x2 table: rows are Experimental group and Control group; columns are Individuals with event and Individuals without event.

Measures illustrated: Risk ratio; Relative risk reduction; Odds ratio; Peto odds ratio.

Dichotomous/binary outcomes

Dichotomous outcomes are those that either happen or do not happen and an individual can be in one of only two states, for example having an acute myocardial infarction or not having an infarction. Dichotomous outcomes are most commonly expressed in terms of risks or odds. Although, in everyday use, the terms risk and odds are often used to mean the same thing, in the context of statistical evaluation they have quite specific meanings.

Risk describes the probability with which a health outcome will occur and is often expressed as a decimal number between 0.0 and 1.0, where 0.0 indicates that there is no risk of the event occurring, and 1.0 indicates certainty that the event will take place. A risk of 0.4 indicates that about four in ten people will experience the event. Odds describe the ratio of the probability that an event will happen to the probability that it will not happen and can take any value between zero and infinity. Odds are sometimes expressed as the ratio of two integers such that 0.001 can be written 1:1000, indicating that for every one individual who will experience the event, one thousand will not.
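The relationship between the two quantities can be shown with a short conversion sketch. The function names are hypothetical and chosen only for illustration.

```python
def risk_to_odds(risk):
    """Odds = probability of the event / probability of no event."""
    return risk / (1.0 - risk)

def odds_to_risk(odds):
    """Inverse conversion: risk = odds / (1 + odds)."""
    return odds / (1.0 + odds)

print(risk_to_odds(0.4))    # a risk of 0.4 corresponds to odds of about 0.67 (2:3)
print(odds_to_risk(0.001))  # odds of 1:1000 correspond to a risk just under 0.001
```

For rare events the two values are almost identical; they diverge as the event becomes more common.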

Risk ratios (RR), also known as relative risks, indicate the change in risk brought about by an intervention and are calculated as the probability of an event in the intervention group divided by the probability of an event in the control group (where the probability of an event is estimated by the total number of events observed in the group divided by the total number of individuals in that group). A risk ratio of 2.0 indicates that the intervention leads to the risk becoming twice that of the comparator. A risk ratio of 0.75 indicates that the risk has been reduced to three quarters of that of the comparator. This can also be expressed in terms of a reduction in risk, whereby the relative risk reduction (RRR) is given as one minus the risk ratio, multiplied by 100 to give a percentage. For example, a risk ratio of 2.0 corresponds to a relative risk reduction of -100% (a 100% increase), while a risk ratio of 0.75 corresponds to a relative risk reduction of 25%. Box 1.6 illustrates the calculation of these measures and further details of the formulae can be found elsewhere.137

Risk ratios can be combined using the generic inverse variance method applied to the log risk ratio and its standard error (either in a fixed-effect or a random-effects model). Odds ratios (OR) describe the ratio of the odds of events occurring on treatment to the odds of events occurring on control, and therefore describe the multiplication of the odds of the outcome that occurs with use of the intervention. Box 1.6 illustrates how to calculate the odds ratio for a single study. Odds ratios can be combined using the generic inverse variance method applied to the log odds ratio and its standard error as described above.
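A sketch of the single-study calculations follows. The cell labels a, b, c and d (experimental with/without event, control with/without event) are an assumed notation, and the standard errors use the usual large-sample formulae for the log risk ratio and log odds ratio; the counts are made up.

```python
import math

def risk_ratio(a, b, c, d):
    """Risk ratio and the standard error of its log for a 2x2 table."""
    rr = (a / (a + b)) / (c / (c + d))
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return rr, se_log_rr

def relative_risk_reduction(rr):
    """Relative risk reduction as a percentage: (1 - RR) x 100."""
    return (1 - rr) * 100

def odds_ratio(a, b, c, d):
    """Odds ratio and the standard error of its log for a 2x2 table."""
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return or_, se_log_or

# Hypothetical trial: 15 of 100 events on experimental, 20 of 100 on control.
rr, se_log_rr = risk_ratio(15, 85, 20, 80)
or_, se_log_or = odds_ratio(15, 85, 20, 80)
print(f"RR = {rr:.2f}, RRR = {relative_risk_reduction(rr):.0f}%, SE(log RR) = {se_log_rr:.3f}")
print(f"OR = {or_:.2f}, SE(log OR) = {se_log_or:.3f}")
```

The log estimates and their standard errors are the quantities that would be passed to a generic inverse variance calculation.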

The Mantel-Haenszel method for combining risk ratios or odds ratios, which uses a different weighting scheme, is more robust when data are sparse, but assumes a fixed-effect model.137

The Peto odds ratio139 (ORPeto) is an alternative estimate of a combined odds ratio in a fixed-effect model, and is based on the difference between the observed number of events and the number of events that would be expected (O-E) if there was no difference between experimental and control interventions (see Box 1.6). Combining studies using the Peto method is straightforward, and it may be particularly useful for meta-analysis of dichotomous data when event rates are very low, and where other methods fail.
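The observed minus expected calculation can be sketched as below. The cell labels a, b, c, d and the function names are assumptions for illustration, and the variance is the standard hypergeometric variance used with the Peto method; the counts are made up.

```python
import math

def peto_components(a, b, c, d):
    """Observed minus expected events (O-E) in the experimental group, and its variance V."""
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    variance = (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    return a - expected, variance

def peto_odds_ratio(tables):
    """Pooled Peto odds ratio and the standard error of its log."""
    sum_oe = 0.0
    sum_v = 0.0
    for a, b, c, d in tables:
        oe, v = peto_components(a, b, c, d)
        sum_oe += oe
        sum_v += v
    return math.exp(sum_oe / sum_v), 1 / math.sqrt(sum_v)

# Hypothetical trials given as (a, b, c, d); the second has a zero cell.
trials = [(15, 85, 20, 80), (0, 50, 3, 47)]
print(peto_odds_ratio(trials))
```

Note that the second trial contributes without any correction for its zero cell, because O-E and V remain defined.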

This approach works well when the effect is small (that is when the odds ratio is close to 1.0), events are relatively uncommon, and there are similar numbers in the experimental and control groups. The approach is commonly used to combine data from cancer trials which generally conform to these expectations. Correction for zero cells is not necessary (see below) and the method appears to perform better than alternative approaches when events are very rare. It can also be used to combine time-to-event data by pooling log rank observed minus expected (O-E) events and associated variance. However, the Peto method does give biased answers in some circumstances, especially when treatment effects are very large, or where there is a lack of balance in treatment allocation within the individual studies.140 Such conditions will not usually apply to RCTs but may be particularly important when combining the results of observational studies which are often unbalanced.

Although both risk ratios and odds ratios are perfectly valid ways of describing a treatment effect, it is important to note that they are not the same measure, cannot be used interchangeably and should not be confused. When events are relatively rare, say less than 10%,141 differences between the two will be small, but where the event rate is high, differences will be large. For treatments that increase the chance of events, the odds ratio will be larger than the risk ratio and for interventions that reduce the chance of events, the odds ratio will be smaller than the risk ratio. Thus if an odds ratio is misinterpreted as a risk ratio it will lead to an overestimation of the effect of intervention. Unfortunately, this error in interpretation is quite common in published reports of individual studies and systematic reviews. Although some statisticians prefer odds ratios owing to their mathematical properties (they do not have inherent range limitations associated with high baseline rates and naturally arise as the antilog of coefficients in mathematical modelling, making them more suitable for statistical manipulation), they have been criticised for not being well understood by clinicians and patients.142, 143 It may therefore be preferable, even when calculations have been based on odds ratios, to transform the findings to describe results as changes in the more intuitively understandable concept of risk.
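The divergence between the two measures, and the conversion back to a risk ratio, can be illustrated as follows. The function names are hypothetical, and the conversion assumes a known control group risk.

```python
def or_from_rr(rr, control_risk):
    """Odds ratio implied by a risk ratio at a given control group risk."""
    treated_risk = rr * control_risk
    treated_odds = treated_risk / (1 - treated_risk)
    control_odds = control_risk / (1 - control_risk)
    return treated_odds / control_odds

def rr_from_or(odds_ratio, control_risk):
    """Risk ratio implied by an odds ratio at a given control group risk."""
    return odds_ratio / (1 - control_risk + control_risk * odds_ratio)

# The same risk ratio of 0.8 at a low and a high control group risk.
for control_risk in (0.05, 0.50):
    print(control_risk, round(or_from_rr(0.8, control_risk), 3))
# prints 0.05 0.792, then 0.5 0.667
```

At a control risk of 5% the odds ratio is close to the risk ratio, whereas at 50% reading the odds ratio of 0.667 as a risk ratio would overstate the benefit.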

Neither the risk ratio nor the odds ratio can be calculated for a trial if there are no events in the control group (as calculation would involve division by zero), and so in this situation it is customary to add 0.5 to each cell of the 2x2 table.137 If there are no events (or all participants experience the event) in both groups, then the trial provides no information about relative probability and so it is omitted from the meta-analysis. These situations are likely to occur when the event of interest is rare, and in such situations the choice of effect measure requires careful thought. A simulation study has shown that when events are rare, most meta-analysis methods give biased estimates of effect,144 and that the Peto odds ratio (which does not require a 0.5 correction) may be the least biased.
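The handling of zero cells might be sketched as below. The function name and cell labels are hypothetical, and applying the correction whenever any cell is zero (not only when the control group has no events) is an assumption that reflects common practice rather than a rule stated in the text.

```python
def prepare_table(a, b, c, d, correction=0.5):
    """Return cells ready for a ratio calculation, or None if the trial is uninformative.

    a, b = experimental group with / without event
    c, d = control group with / without event
    """
    # No events in either group, or events in every participant: omit the trial.
    if (a == 0 and c == 0) or (b == 0 and d == 0):
        return None
    # Assumed rule: add the correction to every cell if any cell is zero.
    if 0 in (a, b, c, d):
        return tuple(x + correction for x in (a, b, c, d))
    return (a, b, c, d)

print(prepare_table(4, 46, 0, 50))   # corrected: (4.5, 46.5, 0.5, 50.5)
print(prepare_table(0, 50, 0, 50))   # None: omitted from the meta-analysis
```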

Continuous outcomes

Continuous outcomes are those that take any value in a specified range and can theoretically be measured to many decimal places of accuracy, for example, blood pressure or weight. Many other quantitative outcomes are typically treated as continuous data in meta-analysis, including measurement scales. Continuous data are usually summarized as means and presented with an indication of the variation around the mean using the standard deviation (SD) or standard error (SE). The effect of an intervention on a continuous outcome is measured by the absolute difference between the mean outcome observed for the experimental intervention and control, termed the mean difference (MD). This estimates the amount by which the treatment changes the outcome on average and is expressed:

$$MD = \bar{x}_{experimental} - \bar{x}_{control}$$

where $\bar{x}$ denotes the mean outcome observed in each group.

Study mean differences and their associated standard errors can be combined using the generic inverse variance method.

Where studies assess the same outcome but measure it using different scales (for example, different quality of life scales), the individual study results must be standardised before they can be combined. This is done using the standardised mean difference (SMD), which considers the effect size in each study relative to the variability in the study and is calculated as the mean difference divided by the standard deviation among all participants. Where scales differ in direction of effect (i.e. some increase with increasing severity of outcome whilst others decrease with increasing severity), this needs to be accounted for by assigning negative values to the mean of one set of studies thereby giving all scales the same direction of measurement. There are three commonly used methods of recording the effect size in the standardised mean difference method, Cohen’s d,145 Hedges adjusted g,145 and Glass’ delta.146 The first two differ in whether the standard deviation is adjusted for small sample bias. The third differs from the other two by standardizing by the control group standard deviation rather than an average standard deviation across both groups. The standardised mean difference assumes that differences in the standard deviation between studies reflect differences in the measurement scale and not differences between the study populations. The summary intervention effect can be difficult to interpret as it is presented in abstract units of standard deviation rather than any particular scale.
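The three effect size measures named above can be compared in a short sketch. The function names and example values are hypothetical; the pooled standard deviation and the small-sample correction factor 1 - 3/(4N - 9) are the commonly used approximations and are assumptions here, since the text does not give the formulae.

```python
import math

def pooled_sd(sd_e, n_e, sd_c, n_c):
    """Pooled standard deviation across the experimental and control groups."""
    return math.sqrt(((n_e - 1) * sd_e ** 2 + (n_c - 1) * sd_c ** 2) / (n_e + n_c - 2))

def cohens_d(mean_e, sd_e, n_e, mean_c, sd_c, n_c):
    return (mean_e - mean_c) / pooled_sd(sd_e, n_e, sd_c, n_c)

def hedges_g(mean_e, sd_e, n_e, mean_c, sd_c, n_c):
    """Cohen's d multiplied by an approximate small-sample correction factor."""
    correction = 1 - 3 / (4 * (n_e + n_c) - 9)
    return cohens_d(mean_e, sd_e, n_e, mean_c, sd_c, n_c) * correction

def glass_delta(mean_e, mean_c, sd_c):
    """Standardises by the control group standard deviation only."""
    return (mean_e - mean_c) / sd_c

# Hypothetical study: means 10 and 8, SD 4 in both groups, 20 participants per group.
print(cohens_d(10, 4, 20, 8, 4, 20))
print(hedges_g(10, 4, 20, 8, 4, 20))
print(glass_delta(10, 8, 4))
```

With equal standard deviations in both groups, Cohen's d and Glass' delta coincide, and Hedges' adjusted g is slightly smaller in absolute size.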

Note that in social science meta-analyses, the term ‘effect size’ usually refers to versions of the standardised mean difference.

Time-to-event outcomes

Time-to-event analysis takes account not only of whether an event happens but when it happens. This is especially important in chronic diseases where even though we may not be able to ultimately stop an event from happening, slowing its occurrence can be beneficial. For example, in cancer studies in adult patients we rarely anticipate cure, but hope that we can significantly prolong survival. Time-to-event data are often referred to as ‘survival’ data since death is often the event of interest, but can be used for many different types of event such as time free of seizures, time to healing or time to conception. Each study participant has data capturing the event status and the time of that status. An individual may be recorded with a particular elapsed time-to-event, or they may be recorded as not having experienced the event by a particular elapsed time or period of follow-up. When the event has not (yet) been observed, the individual is described as censored, and their event-free time contributes information to the analysis up until the point of censoring.

The most appropriate way to analyse time-to-event data is usually to use Kaplan-Meier analysis and express results as a hazard ratio (HR). The HR summarises the entire survival experience and describes the overall likelihood of a participant experiencing an event on the experimental intervention compared to control. Meta-analyses that collect individual participant data are able to carry out such analysis for each included study and then pool these using a variant of the Peto method described above. Alternatively a modelling approach can be used.

Meta-analyses of aggregate data often treat time-to-event data as dichotomous and carry out analyses using the numbers of individuals who did or did not experience an event by a particular point in time. However, using such dichotomous measures in a meta-analysis of time-to-event outcomes is discarding information and can pose additional problems. If the total number of events reported for each study is used to calculate an odds ratio or risk ratio, this can involve combining studies reported at different stages of maturity, with variable follow-up, resulting in an estimate that is both unreliable and difficult to interpret. This approach is not recommended. Alternatively, ORs or RRs can be calculated at specific points in time. Although this makes estimates comparable, interpretation can still be difficult, particularly if individual studies contribute data at different time points. In this case it is unclear whether any observed difference in effect between time points is attributable to the timing or to the analyses being based on different sets of contributing studies. Furthermore, bias could arise if the time points are subjectively chosen by the researcher or selectively reported by the study author at times of maximal or minimal difference between intervention groups.

A preferable approach is to estimate HRs by using and manipulating published or other summary statistical data or survival curves.147, 148 This approach has also been described in non-technical step-by-step terms.149 Currently, such methods are under-used in meta-analyses,149 which may reflect both unfamiliarity with the methods and the fact that study reports do not always include the statistical information150, 151 needed to apply them.
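Two of the simpler cases can be sketched as follows: recovering the log hazard ratio and its standard error from a reported hazard ratio with its confidence interval, or from reported log rank observed minus expected events and variance. This is an illustrative sketch under those assumptions, not the full set of methods in the cited papers, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def log_hr_from_ci(hr, lower, upper, level=0.95):
    """Log hazard ratio and its standard error from a reported HR and confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return math.log(hr), se

def log_hr_from_logrank(o_minus_e, variance):
    """Log hazard ratio and its standard error from log rank O-E and its variance."""
    return o_minus_e / variance, 1 / math.sqrt(variance)

# A reported HR of 0.84 with 95% CI 0.78 to 0.92.
log_hr, se = log_hr_from_ci(0.84, 0.78, 0.92)
print(round(log_hr, 3), round(se, 3))
```

The resulting log hazard ratios and standard errors can then be combined across studies using the generic inverse variance method.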

Ordinal outcomes

Outcomes may be presented as ordinal scales, such as pain scales (where individuals rate their pain as none, mild, moderate or severe). These are sometimes analysed as continuous data, with each category being assigned a numerical value (for example, 0 for none, 1 for mild, 2 for moderate and 3 for severe). This is usual when there are many categories, as is the case for many psychometric scales such as the Hamilton depression scale or the Mini-Mental State Examination for measuring cognition. However, a mean value may not be meaningful. Thus, an alternative way to analyse ordinal data is to dichotomise them (e.g. none or mild versus moderate or severe) to produce a standard 2x2 table. Methods are available for analysing ordinal data directly, but these typically require expert input.

Counts and rates

When outcomes can be experienced repeatedly they are usually expressed as event counts, for example, the number of asthma attacks. When these represent common events, they are often treated and analysed as continuous data (for example, number of days in hospital) and where they represent uncommon events they are often dichotomised (for example, whether or not each individual had at least one stroke).

When events are rare, analyses usually focus on rates expressed at the group level, such as the number of asthma attacks per person, per month. Although these can be combined as rate ratios using the generic inverse variance method, this is not always appropriate as it assumes a constant risk over time and over individuals, and is not often done in practice. It is important not to treat rate data as dichotomous data because more than one event may have arisen from the same individual.
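A rate ratio and the standard error of its log, suitable for the generic inverse variance method, might be calculated as in this sketch. The function name and the counts are hypothetical, and the standard error formula is the usual approximation that assumes a constant event rate, which is the assumption the text cautions about.

```python
import math

def rate_ratio(events_e, time_e, events_c, time_c):
    """Rate ratio and standard error of its log from event counts and person-time."""
    ratio = (events_e / time_e) / (events_c / time_c)
    # Depends only on the event counts; assumes a constant rate over time and individuals.
    se_log = math.sqrt(1 / events_e + 1 / events_c)
    return ratio, se_log

# Hypothetical data: 18 attacks in 300 person-months versus 30 in 310 person-months.
ratio, se_log = rate_ratio(18, 300, 30, 310)
print(round(ratio, 2), round(se_log, 3))
```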

Presentation of quantitative results

Results should be expressed in formats that are easily understood, and in both relative and absolute terms.

Where possible, results should be shown graphically. The most commonly used graphic is the forest plot (see Box 1.7), which illustrates the effect estimates from individual studies, and the overall summary estimate. It also gives a good visual summary of the review findings, allowing researchers and readers to get a sense of the data. Forest plots provide a simple representation of the precision of individual and overall results and of the variation between study results. They give an ‘at a glance’ identification of any studies with outlying or unusual results and can also indicate whether particular studies are driving the overall results. Forest plots can be used to illustrate results for dichotomous, continuous and time-to-event outcomes.152

Individual study results are shown as boxes centred on their estimate of effect, with extending horizontal lines indicating their confidence intervals. The confidence interval expresses the uncertainty around the point estimate, describing a range of values within which it is reasonably certain that the true effect lies; wider confidence intervals reflect greater uncertainty. Although intervals can be reported for any level of confidence, in most systematic reviews of health interventions, the 95% confidence interval is used. Thus, on the forest plot, studies with wide horizontal lines represent studies with more uncertain results. Different sized boxes may be plotted for each of the individual studies, the area of the box representing the weight that the study takes in the analysis providing a visual representation of the relative contribution that each study makes to the overall effect.

The plot shows a vertical line of equivalence indicating the value where there is no difference between groups. For odds ratios, risk ratios or hazard ratios this line will be drawn at an odds ratio/risk ratio/hazard ratio value of 1.0, while for risk difference and mean difference it will be drawn through zero. Studies reach conventional levels of statistical significance where their confidence intervals do not cross the vertical line. Summary (meta-analytic) results are usually presented as diamonds whose extremities show the confidence interval for the summary estimate. A summary estimate reaches conventional levels of statistical significance if these extremities do not cross the line of no effect. If individual studies are too dissimilar to calculate an overall summary estimate of effect, a forest plot that omits the summary value and diamond can be produced.
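The numbers behind each row of a forest plot for a ratio measure can be derived as in the sketch below. The function and field names are hypothetical, the weights are fixed-effect inverse variance weights, and 1.96 is used for the 95% interval.

```python
import math

def forest_rows(labels, log_estimates, std_errors, z=1.96):
    """Point estimate, confidence interval and percentage weight for each study."""
    weights = [1 / se ** 2 for se in std_errors]
    total = sum(weights)
    rows = []
    for label, est, se, w in zip(labels, log_estimates, std_errors, weights):
        lower = math.exp(est - z * se)
        upper = math.exp(est + z * se)
        rows.append({
            "study": label,
            "estimate": math.exp(est),
            "lower": lower,
            "upper": upper,
            "weight_pct": 100 * w / total,               # drives the area of the box
            "crosses_no_effect": lower <= 1.0 <= upper,  # line of no effect at 1.0
        })
    return rows

# Hypothetical log risk ratios and standard errors.
for row in forest_rows(["Trial A", "Trial B"], [-0.5, 0.0], [0.1, 0.3]):
    print(row)
```

A study whose interval does not cross the line of no effect reaches the conventional level of statistical significance.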

Odds ratios, risk ratios and hazard ratios can be plotted on a log-scale to introduce symmetry to the plot. The plot should also incorporate the extracted numerical data for the groups for each study, e.g. the number of events and number of individuals for odds ratios, the mean and standard deviation for continuous outcomes. Other forms of graphical displays have also been proposed.153

Box 1.7: Effects of four trials included in a systematic review

a) Presented without meta-analysis

b) Presented with meta-analysis (fixed effect model)

c) Presented with meta-analysis (random-effects model)

Example forest plots taken from a systematic review of endovascular stents for abdominal aortic aneurysm (EVAR).154

Relative and absolute effects

Risk ratios, odds ratios and hazard ratios describe relative effects of one intervention versus another, providing a measure of the overall chance of the event occurring on the experimental intervention compared to control. These relative effects do not provide information on what this comparison means in absolute terms. Although there may be a large relative effect of an intervention, if the absolute risk is small, it may not be clinically significant because the change in absolute terms is minimal (a big percentage of a small amount may still be a small amount). For example, a risk ratio of 0.8 may represent a 20% relative reduction in events from 50% to 40% or it could represent a 20% relative reduction from 5% to 4% corresponding to absolute differences of 10% and 1% respectively. There may be situations where the former is judged to be clinically significant whilst the latter is not. Meta-analysis should use ratio measures; for example, dichotomous data should be combined as risk ratios or odds ratios and pooling risk differences should be avoided. However, when reporting results it is generally useful to convert relative effects to absolute effects. This can be expressed as either an absolute difference or as a number needed to treat (NNT). Absolute change is usually expressed as an absolute risk reduction which can be calculated from the underlying risk of experiencing an event if no intervention were given and the observed relative effect as shown in Box 1.8.

Box 1.8: Calculation of absolute risk reduction and number needed to treat from relative risks, odds ratios and hazard ratios

Absolute risk reduction from relative risk

Absolute risk reduction from odds ratio155

Absolute risk reduction from hazard ratio156

$ARR = S_{control}^{HR} - S_{control}$ at chosen time point

Number needed to treat

RR = relative risk
Scontrol = proportion event free on control treatment
ARR = absolute risk reduction
HR = hazard ratio
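The conversions listed in Box 1.8 can be sketched in Python. The hazard ratio formula is the one given in the box; the relative risk and odds ratio formulae are standard algebraic conversions supplied here as assumptions, and the function names and the symbol for underlying risk are hypothetical.

```python
def arr_from_rr(underlying_risk, rr):
    """Absolute risk reduction = underlying risk x (1 - RR)."""
    return underlying_risk * (1 - rr)

def arr_from_or(underlying_risk, odds_ratio):
    """Underlying risk minus the risk implied by applying the odds ratio to it."""
    treated_risk = (odds_ratio * underlying_risk
                    / (1 - underlying_risk + odds_ratio * underlying_risk))
    return underlying_risk - treated_risk

def arr_from_hr(s_control, hr):
    """Change in the proportion event free at the chosen time point (as in Box 1.8)."""
    return s_control ** hr - s_control

# A risk ratio of 0.8 applied to underlying risks of 50% and 5%.
print(round(arr_from_rr(0.50, 0.8), 3))  # 0.1, i.e. from 50% to 40%
print(round(arr_from_rr(0.05, 0.8), 3))  # 0.01, i.e. from 5% to 4%
```

The same relative effect therefore gives very different absolute effects depending on the underlying risk.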

Consideration of absolute effects is particularly important when considering how results apply to different types of individuals who may have different underlying prognoses and associated risks. Even if there is no evidence that the relative effects of an intervention vary across different types of individual (see Subgroup analyses, Meta-regression below), if the underlying risks for different categories of individual differ, then the effect of intervention in absolute terms will be different. It is therefore important when reporting results to consider how the absolute effect of an intervention varies for different types of individual and a table expressing results in this way, as shown in Table 1.5, can be useful. The underlying risk for different types of individual can be estimated from the studies included in the meta-analysis, or generally accepted standard estimates can be used. Confidence intervals should be calculated around absolute effects.

Table 1.5: Example table expressing relative effects as absolute effects for individuals with differing underlying prognoses

HR = 0.84, 95% CI (0.78–0.92)

Absolute increase (95% CI)     2 year survival rate
5% (3%–8%)                     From 50 to 55%
5% (2%–8%)                     From 14 to 19%
2% (1%–4%)                     From 4 to 6%
6% (3%–9%)                     From 31 to 37%
4% (2%–6%)                     From 9 to 13%
5% (3%–8%)                     From 52 to 57%
6% (3%–9%)                     From 22 to 28%
4% (2%–6%)                     From 9 to 13%

Baseline survival and equivalent absolute increases in survival calculated from a meta-analysis of chemotherapy in high-grade glioma.157
AA = anaplastic astrocytoma, GBM = glioblastoma multiforme.

The NNT, which is derived from the absolute risk reduction as shown in Box 1.8, also depends on both relative effect and the underlying risk. The NNT represents the number of individuals who need to be treated to prevent one event that would be experienced on the control intervention. The lower the number needed to treat, the fewer the patients that need to be treated to prevent one event, and the greater the efficacy of the treatment. For example, a meta-analysis of antiplatelet agents for the prevention of pre-eclampsia found an RR of 0.90 (0.84 – 0.97) for pre-eclampsia.158 Plausible underlying risks of 2%, 6% and 18% had associated NNTs of 500 (313-1667), 167 (104-556) and 56 (35-185) respectively.
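The pre-eclampsia figures can be reproduced with a short calculation, taking the NNT as the reciprocal of the absolute risk reduction and the absolute risk reduction as the underlying risk multiplied by one minus the risk ratio. The function name is hypothetical; decimal arithmetic with half-up rounding is an implementation choice made so that values such as 312.5 round as in the text.

```python
from decimal import Decimal, ROUND_HALF_UP

def nnt(underlying_risk, rr):
    """Number needed to treat = 1 / (underlying risk x (1 - RR)), rounded half up."""
    arr = Decimal(underlying_risk) * (1 - Decimal(rr))
    return int((1 / arr).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

# RR 0.90 with 95% CI 0.84 to 0.97, at underlying risks of 2%, 6% and 18%.
for risk in ("0.02", "0.06", "0.18"):
    print(risk, nnt(risk, "0.90"), nnt(risk, "0.84"), nnt(risk, "0.97"))
# prints:
# 0.02 500 313 1667
# 0.06 167 104 556
# 0.18 56 35 185
```

Risks and ratios are passed as strings so that they are represented exactly in decimal arithmetic.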

Sensitivity analyses

Sensitivity analyses explore the robustness of the main meta-analysis results by repeating the analyses having made some changes to the data or methods.159 Analyses run with and without the inclusion of certain trials will assess the degree to which particular studies (perhaps those with poorer methodology) affect the results. For example, analyses might be carried out on all eligible trials and a sensitivity analysis restricted to only those that used a placebo in the control group. If results differ substantially, the final results will require careful interpretation. However, care must be taken in attributing reasons for differences, especially when a single trial or a small number of trials is included or excluded in the sensitivity analysis, as a study may differ in ways other than the issue being explored in the sensitivity analysis. Some sensitivity analyses should be proposed in the protocol, but as many issues suitable for exploration in sensitivity analyses only come to light whilst the review is being done, and in response to decisions made or difficulties encountered, these may have to change and/or be supplemented.
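As an illustration of the placebo example, the sketch below pools all trials and then only the placebo-controlled ones. It assumes a fixed-effect inverse variance method on the log risk ratio scale; the trial data and the function name are hypothetical and chosen only to show the mechanics.

```python
import math


def pool_fixed(effects, standard_errors):
    """Fixed-effect inverse variance pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


# Hypothetical trials: (log risk ratio, standard error, placebo controlled?)
trials = [
    (-0.20, 0.10, True),
    (-0.35, 0.15, False),
    (-0.10, 0.08, True),
    (-0.50, 0.25, False),
]

main = pool_fixed([t[0] for t in trials], [t[1] for t in trials])
placebo_only = [t for t in trials if t[2]]
sens = pool_fixed([t[0] for t in placebo_only], [t[1] for t in placebo_only])

for label, (estimate, se) in (("all trials", main), ("placebo only", sens)):
    low, high = estimate - 1.96 * se, estimate + 1.96 * se
    print(f"{label}: RR {math.exp(estimate):.2f} "
          f"({math.exp(low):.2f} to {math.exp(high):.2f})")
```

Comparing the two pooled estimates side by side shows how far the conclusion depends on the excluded trials, though any difference may have explanations other than the use of placebo.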

Exploring heterogeneity

There will inevitably be variation in the observed estimates of effect from the studies included in a meta-analysis. Some of this variation arises by chance alone, reflecting the fact that no study is so large that random error can be removed entirely. Statistical heterogeneity refers to variation other than that which arises by chance. It reflects methodological or clinical differences between studies. Exploring statistical heterogeneity in a meta-analysis aims to tease out the factors contributing to differences, such that sources of heterogeneity can be accounted for and taken into consideration when interpreting results and drawing conclusions.

There is inevitably a degree of clinical diversity between the studies included in a review,160 for example because of differing patient characteristics and differences in interventions. If these factors influence the estimated intervention effect then there will be some statistical heterogeneity between studies. Methodological differences that influence the observed intervention effect will also lead to statistical heterogeneity. For example, combining results from blinded and unblinded studies may lead to statistical heterogeneity, indicating that they might best be analysed separately rather than in combination. Although it manifests itself in the same way, heterogeneity arising from clinical differences is likely to be because of differences in the true intervention effect, whereas heterogeneity arising from differences in methodology is more likely to be because of bias.


An idea of heterogeneity can be obtained straightforwardly by visually examining forest plots for variations in effects. If there is poor overlap between the study confidence intervals, then this generally indicates statistical heterogeneity.

More formally, a chi-squared test (see Box 1.9), often also referred to as the Q-statistic, can assess whether differences between results are compatible with chance alone. However, care must be taken in interpreting the chi-squared test as it has low power; consequently a higher significance threshold (P<0.1) is sometimes used to designate statistical significance. Although a statistically significant test result may point to a problem with heterogeneity, a nonsignificant test result does not preclude important between-study differences, and cannot be taken as evidence of no heterogeneity. Conversely, if there are many studies in a meta-analysis, the test has high power to detect a small amount of heterogeneity that, although statistically significant, may not be clinically important.


Accepting that diversity is likely to be inherent in any review, methods have also been developed to quantify the degree of inconsistency across studies, shifting the focus from significance testing to quantifying heterogeneity. The I² statistic160, 161 describes the percentage of variability in the effect estimates that can be attributed to heterogeneity rather than chance (see Box 1.9).


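Box 1.9: Chi-squared test (or Q-statistic) and test for interaction

Chi-squared test:

$$Q = \sum_{i} w_i \left(\hat{\theta}_i - \hat{\theta}\right)^2$$

Where $\hat{\theta}_i$ is the effect estimate from study $i$, $w_i$ is its weight (the inverse of its variance) and $\hat{\theta}$ is the pooled estimate. Q is compared with a chi-squared distribution with df degrees of freedom (the number of studies minus one).

I² statistic:

$$I^2 = \frac{Q - df}{Q} \times 100\%$$

Where Q is the chi-squared statistic, and df its degrees of freedom. Negative values of I² are set to zero.

To examine differences across subgroups, either Q or I² can be applied to meta-analytic results from each subgroup rather than to individual studies (i.e. the sum in Q is across subgroups rather than across studies).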


Although the I² statistic often has wide confidence intervals and it is difficult to provide hard and fast rules on what level of inconsistency is reasonable in a meta-analysis, as a rough guide it has been suggested that I² values of up to 40% might be unimportant, 30% to 60% might be moderate, 50% to 90% may be substantial and 75% to 100% considerable.75
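A minimal sketch of how Q and I² might be computed from study-level estimates is given below. It assumes inverse variance weights and the usual definition of I² as (Q − df)/Q expressed as a percentage and truncated at zero; the function name and the example values are hypothetical.

```python
def heterogeneity(effects, standard_errors):
    """Return (Q, df, I2 in percent) for a set of study estimates."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q is the weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I2 is truncated at zero when Q is smaller than its degrees of freedom.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2


# Hypothetical log odds ratios and their standard errors.
q, df, i2 = heterogeneity([-0.4, -0.1, -0.6, 0.2], [0.15, 0.20, 0.25, 0.18])
print(f"Q = {q:.2f} on {df} df, I2 = {i2:.0f}%")
```

The P value for Q would come from a chi-squared distribution with df degrees of freedom, which is left out here to keep the sketch free of external libraries.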

If statistical heterogeneity is observed, then the possible reasons for differences should be explored162 and a decision made about if and how it is appropriate to combine studies. A systematic review does not always need to include a meta-analysis and, if there are substantial differences between study estimates of effect, particularly if they are in opposing directions, combining results in a meta-analysis can be misleading. One way of addressing this is to split studies into less heterogeneous groups according to particular study level characteristics (e.g. by type of drug), and perform separate analyses for each group. Forest plots can be produced to show subsets of studies on the same plot. Each subset of studies can have its own summary estimate, and if appropriate an overall estimate combined across all studies can also be shown. Showing these groupings alongside each other in this way provides a good visual summary of how they compare. This approach allows the consistency and inconsistency between subsets of studies to be examined. Differences can be summarised narratively, but where possible they should also be evaluated formally. A chi-squared test for differences across subgroups can be carried out (see Box 1.9).

The influence of patient-level characteristics (e.g. age, gender) or issues related to equity (e.g. ethnicity, socioeconomic group) can also be explored through subgroup analyses, meta-regression or other modelling approaches. However, there is generally insufficient information in published study reports to allow full exploration of heterogeneity in this way and this can usually only be addressed satisfactorily when IPD are available. Such exploration of heterogeneity may enable additional questions to be addressed, such as which particular treatments perform best or which types of patient will benefit most, but is unlikely to be helpful when there are few studies. Wherever possible, potential sources of heterogeneity should be considered when writing the review protocol and possible subgroup analyses pre-specified rather than trying to explain statistical heterogeneity after the fact.

Subgroup analyses

Subgroup analyses divide studies (for study level characteristics) or participant data (for participant level characteristics) into subgroups and make indirect comparisons between them. These analyses may be carried out to explore heterogeneity (see above) as well as to try to answer particular questions about patient or study factors. For example a subgroup analysis for study level characteristics might examine whether the results of trials carried out in primary health care settings are the same as trials carried out in a hospital setting. A participant level subgroup analysis might examine whether the effect of the intervention is the same in men as in women.

In individual studies it is unusual to have sufficient numbers and statistical power to permit reliable subgroup analyses of patient characteristics. However, provided that such data have been collected uniformly across studies, a meta-analysis may achieve sufficient power in each subgroup to permit a more reliable exploration of whether the effect of an intervention is larger (or smaller) for any particular type of individual. Although, owing to the multiplicity of testing, these analyses are still potentially misleading, subgroup analysis within the context of a large meta-analysis may be the only reasonable way of performing such exploratory investigations. Not only do the greater numbers give increased statistical power, but consistency across trials can be investigated. Indeed, the possibility of undertaking such analyses is a major attraction of IPD meta-analyses as dividing participant data into groups for subgroup analysis is seldom possible in standard reviews of aggregate data.163 Subgroup analyses in most (non IPD) systematic reviews focus on grouping according to trial attributes.

The interpretation of the results of subgroup analyses must be treated with some caution. Even where the original data have come from RCTs, the investigation of between-study differences is indirect and equivalent to an observational study.164, 165 There may be explanations for the observed differences between groups, other than the attributes chosen to categorise groupings. Comparisons which are planned in advance on the basis of a plausible hypothesis and written into the protocol are more credible than findings that are found through post hoc exploratory analyses. Furthermore, the likelihood of finding false negative and false positive significance tests rises rapidly as more subgroup analyses are done. Subgroups should therefore be restricted to a few potentially important characteristics where it is reasonable to suspect that the characteristic will interact with or modify the effect of the intervention. Note that there is often confusion between prognostic factors and potential effect modifiers; just because a characteristic is prognostic does not mean that it will modify the effect of an intervention. For example, whilst gender is prognostic for survival (women live longer than men) it does not necessarily mean that women will benefit more than men will from a drug to treat lung cancer.

Meta-regression


Meta-regression can be used to investigate the effects of differences in study characteristics on the estimates of the treatment effect,140 and can explore continuous as well as categorical characteristics. In principle it can allow for the simultaneous exploration of several characteristics and their interactions, though in practice this is seldom possible because of small numbers of studies.166 As in any simple regression analysis, meta-regression aims to predict outcome according to explanatory variables or covariates of interest. The covariates may be constant for the entire trial, for example, the protocol dose of a drug, or a summary measure of attributes describing the patient population, for example, mean age or percentage of males. The regression is weighted by precision of study estimates such that larger studies have more influence than smaller studies. The regression coefficient is tested to establish whether there is an association between the intervention effect and the covariate of interest. Provided that enough data are available (at least 10 studies),82 the technique may be a useful exploratory tool. However, there are limitations. Not all publications will report on all the covariates of interest (and there could be potential bias associated with selective presentation of data that have shown a positive association within a primary study). If a study is missing a covariate it drops out of the regression, limiting the power and usefulness of the analysis, which is already likely to be based on relatively few data points.

Meta-regression is not a good way to explore differences in treatment effects between different types of individual as summary data may misrepresent individual participants.167 What is true of a study with a median participant age of 60 may not necessarily be true for a 60-year-old patient. Potentially all the benefit could have been shown in the 50-year-olds and none in the 60- and 70-year-olds. Comparison of treatment effects between different types of individual, for example between men and women, should be done using subgroup analyses and not by using meta-regression incorporating the proportion of women in each trial. It should always be borne in mind that finding a significant association in a meta-regression does not prove causality and should rather be regarded as hypothesis generating.
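The precision-weighted regression described above can be sketched for a single study-level covariate as follows. This is a simplified fixed-effect version that assumes the within-study variances are known and ignores any residual between-study heterogeneity, which a full analysis would usually model; the function name and the dose data are hypothetical.

```python
import math


def meta_regression(covariate, effects, standard_errors):
    """Weighted least squares fit of effect = intercept + slope * covariate."""
    w = [1.0 / se ** 2 for se in standard_errors]
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, covariate))
    swy = sum(wi * y for wi, y in zip(w, effects))
    swxx = sum(wi * x * x for wi, x in zip(w, covariate))
    swxy = sum(wi * x * y for wi, x, y in zip(w, covariate, effects))
    det = sw * swxx - swx ** 2
    slope = (sw * swxy - swx * swy) / det
    intercept = (swy - slope * swx) / sw
    se_slope = math.sqrt(sw / det)          # from the inverse of X'WX
    z = slope / se_slope
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal test
    return intercept, slope, se_slope, p_value


# Hypothetical trials: protocol dose (mg), log risk ratio, standard error.
dose = [10, 20, 20, 40, 40, 60, 80, 80, 100, 120]
log_rr = [-0.05, -0.10, -0.02, -0.20, -0.15, -0.22, -0.35, -0.30, -0.41, -0.45]
se = [0.12, 0.10, 0.15, 0.09, 0.11, 0.14, 0.10, 0.13, 0.12, 0.16]

intercept, slope, se_slope, p = meta_regression(dose, log_rr, se)
print(f"slope per mg = {slope:.4f} (SE {se_slope:.4f}), p = {p:.4f}")
```

Each trial contributes a single point, so with ten trials the fit rests on ten observations however many participants were enrolled.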

Assessing the possibility of publication bias

Although thorough searches should ensure that a systematic review captures as many relevant studies as possible, they cannot eliminate the risk of publication bias. As publication and associated biases can potentially influence profoundly the findings of a review, the risk of such bias should be considered in the review’s conclusions and inferences.24 The book by Rothstein et al provides a comprehensive discussion of publication bias and associated issues.168

The obvious way to test for publication bias is to compare formally the results of published and unpublished studies. However, more often than not unpublished studies are hidden from the reviewer, and more ad hoc methods are required. A common technique to help assess potential publication bias is the funnel plot.

This is a scatter plot based on the fact that precision in estimating effect increases with increasing sample size. Effect size is plotted against some measure of study precision – of which standard error is likely to be the best choice.169 A wide scatter in results of small studies, with the spread narrowing as the trial size increases, is expected. If there is no difference between the results of small and large studies, the shape of the plot should resemble an inverted funnel (see Box 1.10). If there are differences, the plot will be skewed and a gap where the small unfavourable studies ought to be is often cited as evidence of publication bias. However, the shape of a funnel plot can also depend on the measures selected for estimating effect and precision169, 170 and could be attributable to differences between small and large studies other than publication bias. These differences could be a result of other types of methodological bias, or genuine clinical differences. For example, small studies may have a more selected participant population where a larger treatment effect might be expected. Funnel plots are therefore more accurately described as a tool for investigating small study effects.

Box 1.10: Example funnel plots from a systematic review of dressings and topical agents used in the healing of chronic wounds183



□ traditional vs. dressing/topical agent other than hydrocolloid; ■ traditional vs. hydrocolloid dressing only

This funnel plot, of all the studies that compared traditional treatments with modern dressing or topical agents for the treatment of leg ulcers and pressure sores, showed little evidence of asymmetry.

This funnel plot, of trials that compared traditional treatments with hydrocolloid dressings for the treatment of leg ulcers and pressure sores, showed clear asymmetry. This was considered likely to be the result of publication bias.

Although visual inspection of funnel plots has been shown to be unreliable,170, 171 this might be improved if contour zones illustrating conventional levels of significance are overlaid on the plot to show whether ‘missing’ studies are from zones of statistical significance or not. If the ‘missing’ studies are from nonsignificant zones, this may support publication bias as the explanation. On the other hand, if ‘missing’ studies are from statistically significant zones, the asymmetry may be more likely to be attributable to other causes.172 Over time, a range of statistical and modelling methods has been developed to test for asymmetry, the most frequently used of which are those based on rank correlation173 or linear regression174, 175 and complex modelling176 methods. Some methods (for example, the trim and fill method177, 178) attempt to adjust for any publication bias detected.176 However, all methods are by nature indirect and the appropriateness of many methods is based on some strict assumptions that can be difficult to justify in practice.

Although frequently used to help assess possible publication bias, funnel plots and associated statistical tests are often used and interpreted inappropriately,179, 180 potentially giving false assurance where a symmetrical plot overlooks important bias or undermining important valid evidence because of an asymmetric plot.179 The methods are inappropriate where there is statistical heterogeneity; have low power and are of little use where there are few studies; and are meaningless where studies are of similar size. Consequently, situations where they are helpful are few and their use is not generally a good way of dealing with publication bias.181 Therefore use of these methods to identify or adjust for publication bias in a meta-analysis should be considered carefully and generally be restricted to sensitivity analyses. Results should be interpreted with caution. Statistical tests will not resolve bias and avoidance of publication bias is preferable. In time this may become easier with more widespread registration of clinical trials and other studies at inception.182

Dealing with special study designs and analysis issues

Intention to treat analyses

ITT is usually the preferred type of analysis as it is less likely to introduce bias than alternative approaches. True intention to treat analysis meets two criteria: (i) participants should be analysed irrespective of whether or not they received their allocated intervention and irrespective of what occurred subsequently, for example, participants with protocol violations or those subsequently judged ineligible should be included in the analysis; (ii) all participants should be included irrespective of whether outcomes were collected. Although the first criterion is generally accepted, there is no clear consensus on the second81 as it involves including participants in the analyses whose outcomes are unknown, and therefore requires imputation of data. Many authors describe their analyses as ITT when only the first criterion has been met. Alternative analysis of all participants for whom outcome data are available is termed available case analysis. Some studies present analysis of all participants who completed their allocated treatment; this is per protocol or treatment received analysis, which may be seriously biased.

Imputing missing data

Although statistical techniques are available to impute missing data, this cannot reliably compensate for missing data184 and in most situations imputation of data is not recommended. It is reasonable for most systematic reviews to aim for an available case analysis and include data from only those participants whose outcome is known. Achieving this may require making contact with the study author if individuals for whom outcome data were recorded have been excluded from the published analyses. The extent and implications of missing data should always be reported and discussed in the review. If the number of participants missing from the final analysis is large it will be helpful to detail the reasons for their exclusion. 

In some circumstances, it might be informative to impute data in sensitivity analyses to explore the impact of missing data.185 For missing dichotomous data the analysis can assume that either all participants with missing data experienced the event, or that none of them experienced the event. This generates the theoretical extremes of possible effect. Data could also be imputed using the rate of events observed in the control group; however, this does not add information, gives inflated precision and is not recommended. Where missing data are few, imputation will have little impact on the results. Where missing data are substantial, analysis of worst/best case scenarios will give a wide range of possible effect sizes and may not be particularly helpful. Approaches to imputing missing continuous data have received little attention. In some cases it may be possible to use last observation carried forward, or to assume that no change took place. However, this cannot be done from aggregate data and the value of such analysis is unclear. Any researcher contemplating imputing missing data should consult with an experienced statistician.
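The two extreme assumptions for missing dichotomous data can be sketched for a single two-arm trial as below. The counts and function name are hypothetical, and the risk ratio is used purely as an example of an effect measure.

```python
def risk_ratio(events_t, total_t, events_c, total_c):
    """Risk ratio of the treatment arm relative to the control arm."""
    return (events_t / total_t) / (events_c / total_c)


def missing_data_scenarios(events_t, observed_t, missing_t,
                           events_c, observed_c, missing_c):
    """Risk ratios under available case analysis and two extreme imputations."""
    randomised_t = observed_t + missing_t
    randomised_c = observed_c + missing_c
    return {
        # Only participants whose outcome is known.
        "available case": risk_ratio(events_t, observed_t,
                                     events_c, observed_c),
        # Every participant with a missing outcome is assumed to have the event.
        "all missing had event": risk_ratio(events_t + missing_t, randomised_t,
                                            events_c + missing_c, randomised_c),
        # Every participant with a missing outcome is assumed event free.
        "no missing had event": risk_ratio(events_t, randomised_t,
                                           events_c, randomised_c),
    }


# Hypothetical trial: 100 randomised per arm, 10 and 5 missing outcomes.
results = missing_data_scenarios(10, 90, 10, 20, 95, 5)
for label, rr in results.items():
    print(f"{label}: RR = {rr:.2f}")
```

With these counts the risk ratio ranges from 0.50 to 0.80 depending on the assumption, which illustrates how quickly the plausible range widens as the amount of missing data grows.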

Cluster randomised trials 

In cluster randomised trials, groups rather than individuals are randomised, for example clinical practices or geographical areas. Reasons for allocating interventions in this way include evaluating policy interventions or group effects such as in immunisation programmes, and avoiding cross-contamination, for example, health promotion information may be easily shared by members of the same clinic or community. In many instances clustering will be obvious, for example where primary care practices are allocated to receive a particular intervention. In other situations the clustering may be less obvious, for example where multiple body parts on the same individual are allocated treatments or where a pregnant woman has more than one fetus. It is important that any cluster randomised trials are identified as such in the review.

As participants within any one cluster are likely to respond in a manner more similar to each other than to other individuals (owing to shared environmental exposure or personal interactions), their data cannot be assumed to be independent. It is therefore inappropriate to ignore the clustering and analyse as though allocation had been at the individual level. This unit of analysis error would result in overly narrow confidence intervals, and straightforward inclusion of trials analysed in this way would give undue weight to those trials in a meta-analysis. Unfortunately, many primary studies have ignored clustering and analysed results as though from an individual randomised trial.186,187 One way to avoid the problem of inappropriately analysed cluster trials is to carry out meta-analyses using a summary measure for each cluster as a single observation. The sample size becomes the number of clusters (not the number of individuals) and the analysis then proceeds as normal. However, depending on the size and number of clusters, this will reduce the statistical power of the analysis considerably and unnecessarily. Indeed, the information required to do this is unlikely to be available in many study publications.

A better approach is to adjust the results of inappropriately analysed primary studies to take account of the clustering, by increasing the standard error of the estimate of effect.75 This may be achieved by multiplying the original standard error by the square root of the ‘design effect’. The design effect can be calculated from the intracluster correlation coefficient. This is seldom reported, but external values from similar studies can be used, such as those available from the University of Aberdeen Health Services Research Unit (www.abdn.ac.uk/hsru/epp/iccs-web.xls). A common design effect is usually adopted across the intervention groups.

$$DE = 1 + (M - 1) \times ICC$$

$$SE_{adjusted} = SE \times \sqrt{DE}$$

DE = design effect
M = mean cluster size
ICC = intracluster correlation coefficient
SE = standard error of the effect estimate

These values can then be used in a generic inverse variance meta-analysis alongside unadjusted values from appropriately analysed trials.
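The adjustment can be sketched in a few lines, assuming the standard design effect formula DE = 1 + (M − 1) × ICC and inflation of the standard error by the square root of DE. The function names and the example values (including the ICC) are hypothetical.

```python
import math


def design_effect(mean_cluster_size, icc):
    """Design effect from mean cluster size and intracluster correlation."""
    return 1.0 + (mean_cluster_size - 1.0) * icc


def adjust_standard_error(se, mean_cluster_size, icc):
    """Inflate a standard error that ignored clustering."""
    return se * math.sqrt(design_effect(mean_cluster_size, icc))


# Hypothetical trial analysed as if individually randomised:
# log odds ratio -0.30 with SE 0.10, mean cluster size 21, assumed ICC 0.05.
log_or, naive_se = -0.30, 0.10
adjusted_se = adjust_standard_error(naive_se, 21, 0.05)
print(f"design effect = {design_effect(21, 0.05):.2f}")
print(f"SE {naive_se:.3f} becomes {adjusted_se:.3f}")
```

Because inverse variance weights are proportional to 1/SE², a design effect of 2 halves the weight the trial would otherwise have received.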

Cross-over trials

Cross-over trials allocate each individual to a sequence of interventions, for example one group may be allocated to receive treatment A followed by treatment B, and the other group allocated to receive B followed by A. This type of trial has the advantage that each participant acts as their own control, eliminating between participant variability such that fewer participants are required to obtain the same statistical power. They are suitable for evaluating interventions that have temporary effects in treating stable conditions. They are not appropriate where an intervention can have a lasting effect that compromises treatment in subsequent periods of the trial, or where a condition has rapid evolution, or the primary outcome is irreversible. The first task of the researcher is to decide whether the cross-over design is appropriate in assessing the review question.

Appropriate analysis of cross-over trials involves paired analysis, for example using a paired t-test to analyse a study with two interventions and two periods (using experimental measurement – control measurement) for each participant, with standard errors calculated for these paired measurements. These values can then be combined in a generic inverse variance meta-analysis. Unfortunately, cross-over trials are frequently inappropriately analysed and reported.

A common naive analysis of cross-over data is to treat all measurements on experimental and control interventions as if they were from a standard parallel group trial. This results in confidence intervals that are too wide and the trial receives too little weight in the meta-analysis. However, as this is a conservative approach, it might not be unreasonable in some circumstances. Where the effect of the first intervention is thought to have influenced the outcome in subsequent periods (carry-over), a common approach is to use only the data from the first period for each individual. However, this will be biased if the decision to analyse in this way is based on a test of carry-over, and studies analysed in this way may differ from those using paired analyses. One approach to combining studies with differing types of reported analyses is to carry out an analysis grouped by type of study (i.e. cross-over trials with paired analysis, cross-over trials with first period analysis, and parallel group trials) and explore whether their results differ (see Subgroup analyses above).

Alternatively, the researcher can carry out their own paired analysis for each trial if: (i) the mean and standard deviation or standard error of participant differences are available; (ii) the mean difference plus a t-statistic, p-value or confidence interval from a paired analysis is available; (iii) a graph is available from which individual matched measurements can be extracted; or (iv) individual participant data are available.188 Another approach is to attempt to approximate a paired analysis by imputing missing standard errors by ‘borrowing’ from other studies that have used the same measurement scale or by using a correlation coefficient obtained from other studies or external sources.75 Researchers will need to decide whether excluding trials is preferable to inferring data. If imputation is thought to be reasonable, advice should be sought from an experienced statistician. Authors should state explicitly where studies have used a cross-over design and how this has been handled in the meta-analysis.
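Two of these routes to a paired standard error can be sketched as follows: recovering it from a reported paired t-statistic, and approximating it from the standard deviations of the two periods with an assumed within-participant correlation. The function names, the example values and the correlation of 0.5 are all hypothetical.

```python
import math


def se_from_paired_t(mean_difference, t_statistic):
    """Standard error of the mean difference from a paired t-statistic."""
    return abs(mean_difference / t_statistic)


def se_from_correlation(sd_experimental, sd_control, correlation, n):
    """Approximate paired standard error using an imputed correlation."""
    # Variance of within-participant differences for correlated measurements.
    variance_diff = (sd_experimental ** 2 + sd_control ** 2
                     - 2.0 * correlation * sd_experimental * sd_control)
    return math.sqrt(variance_diff) / math.sqrt(n)


# Route (ii): mean difference 1.0 with a paired t-statistic of 2.5.
print(f"SE from t-statistic: {se_from_paired_t(1.0, 2.5):.2f}")

# Imputation route: SDs of 2.0 in each period, 16 participants, assumed r = 0.5.
print(f"SE from imputed correlation: {se_from_correlation(2.0, 2.0, 0.5, 16):.2f}")

# Treating the periods as independent (r = 0) gives a larger standard error.
print(f"SE if correlation ignored: {se_from_correlation(2.0, 2.0, 0.0, 16):.2f}")
```

The resulting mean difference and standard error can then be entered into a generic inverse variance meta-analysis; repeating the calculation over a range of plausible correlations shows how sensitive the result is to the imputed value.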

Mixed treatment comparisons

Mixed treatment comparisons (MTC), or network meta-analyses, are used to analyse studies with multiple intervention groups and to synthesise evidence across a series of studies in which different interventions were compared. These are used to rank or identify the optimal intervention. They build a network of evidence that includes both direct evidence from head to head studies and indirect comparisons whereby interventions that have not been compared directly are linked through common comparators. A framework has been described that outlines some of the circumstances in which such syntheses might be considered.189 Methods for conducting indirect comparisons190, 191 and more complex mixed treatment methods192, 193 require expert advice. Researchers wishing to undertake such analyses should consult with an appropriately experienced statistician.

Bayesian methods

Unlike standard analysis techniques, Bayesian analyses allow for the combination of existing information with new evidence using established rules of probability.194 A simple Bayesian analysis model includes three key elements:

  1. Existing knowledge on the effect of an intervention can be retrieved from a variety of sources and summarised as a prior distribution

  2. The data from the studies are used to form the likelihood function

  3. The prior distribution and the likelihood function are formally combined to provide a posterior distribution which represents the updated knowledge about the effect of the intervention

Bayesian approaches to meta-analysis may be useful when evidence comes from a diverse range of sources particularly when few data from RCTs exist.195,196 They can also be used to account for the uncertainty introduced by estimating the between-study variance in the random-effects model, which can lead to reliable estimates and predictions of treatment effects.197 While there are several good texts available,198, 199, 200 if a Bayesian approach is to be used, the advice of a statistical expert is strongly recommended.
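The three elements listed above can be illustrated with the simplest possible case, in which both the prior and the likelihood are treated as normal distributions on the log odds ratio scale. This conjugate normal model is an assumption made for illustration only and is far simpler than the models used in practice; the function name and all numbers are hypothetical.

```python
import math


def normal_posterior(prior_mean, prior_sd, estimate, standard_error):
    """Combine a normal prior with a normal likelihood (known variance)."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / standard_error ** 2
    posterior_precision = prior_precision + data_precision
    # The posterior mean is a precision-weighted average of prior and data.
    posterior_mean = (prior_precision * prior_mean
                      + data_precision * estimate) / posterior_precision
    return posterior_mean, math.sqrt(1.0 / posterior_precision)


# 1. Prior: existing knowledge suggests no effect, with wide uncertainty.
prior_mean, prior_sd = 0.0, 0.5
# 2. Likelihood: pooled log odds ratio from the new studies.
estimate, standard_error = -0.40, 0.20
# 3. Posterior: updated knowledge about the intervention effect.
mean, sd = normal_posterior(prior_mean, prior_sd, estimate, standard_error)
print(f"posterior log OR = {mean:.3f} (SD {sd:.3f})")
```

The posterior sits between the prior and the data, closer to whichever is more precise, and is always more precise than either alone under this model.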

Describing results

When describing review findings, the results of all analyses should be considered as a whole, and overall coherence discussed. Consistency across studies should be considered and results interpreted in relation to biological and clinical plausibility. Where there have been many analyses and tests, care should be taken in interpreting unexpected or implausible findings as among a large number of tests the play of chance alone is likely to generate spurious statistically significant results.

Quantitative results of meta-analyses should be expressed as point estimates together with associated confidence intervals and exact p-values. They should not be presented or discussed only in terms of statistical significance. This is particularly important where results are not statistically significant, as non-significance can arise both when estimates are close to no effect with narrow confidence intervals and when estimates of effect are large with wide confidence intervals. Whilst in the former we can be confident that there is little difference between the interventions compared, in the latter there is insufficient evidence to draw conclusions. Researchers should be aware that describing a result as ‘there is no statistical (or statistically significant) difference between the two interventions’ can be (mis)read as there being no difference between interventions.
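The contrast between the two kinds of non-significant result can be made concrete with a short sketch that reports a point estimate, a 95% confidence interval and an exact p-value from a pooled log risk ratio and its standard error. It assumes a normal approximation; the function name and both example results are hypothetical.

```python
import math
from statistics import NormalDist


def describe_ratio(log_estimate, standard_error, level=0.95):
    """Point estimate, confidence interval and two-sided p-value for a ratio."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    low = math.exp(log_estimate - z_crit * standard_error)
    high = math.exp(log_estimate + z_crit * standard_error)
    z = log_estimate / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return math.exp(log_estimate), low, high, p_value


# Two hypothetical results, neither statistically significant at the 5% level.
examples = {
    "close to no effect, precise": (math.log(0.98), 0.03),
    "large effect, imprecise": (math.log(0.60), 0.35),
}
for label, (log_rr, se) in examples.items():
    rr, low, high, p = describe_ratio(log_rr, se)
    print(f"{label}: RR {rr:.2f} (95% CI {low:.2f} to {high:.2f}), p = {p:.2f}")
```

Reporting only "not significant" would hide the fact that the first interval rules out all but small effects while the second is compatible with both substantial benefit and some harm.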

It is important that inconclusive results are not interpreted as indicating that an intervention is ineffective and estimates with wide confidence intervals that span no effect should be described as showing no clear evidence of a benefit or harm rather than as there being no difference between interventions. Demonstrating lack of sufficient evidence to reach a clear conclusion is an important finding in its own right.

Similarly, care should be taken not to overplay results that are statistically significant, as with large enough numbers, even very modest differences between interventions can be statistically significant. The size of the estimated effect, and its confidence intervals, should be considered in view of how this relates to current or future practice (see Section 1.3.6 Report writing).

It is usually helpful to present findings in both relative and absolute terms and in particular to consider how relative effects may translate into different absolute effects for people with differing underlying prognoses (see Relative and absolute effects section above). Where a number of outcomes or subgroup analyses are included in a review it can be helpful to tabulate the main findings in terms of effect, confidence intervals and inconsistency or heterogeneity statistics.

Summary: Data synthesis

  • Synthesis involves bringing the results of individual studies together and summarising their findings.

  • This may be done quantitatively or, if formal pooling of results is inappropriate, through a narrative approach.

  • Synthesis should also explore whether observed intervention effects are consistent across studies, and investigate possible reasons for any inconsistencies.

Initial descriptive synthesis

All syntheses should begin by constructing a clear descriptive summary of the included studies.

Narrative synthesis is frequently an essential part of a systematic review, and as with every other stage of the process, bias must be minimised.

Narrative synthesis has typically not followed a strict set of rules. However, a general framework can be applied in order to help maintain transparency and add credibility to the process. The four elements of this framework are:

  • Developing a theory of how the intervention works, why and for whom

  • Developing a preliminary synthesis of findings of included studies

  • Exploring relationships within and between studies

  • Assessing the robustness of the synthesis

Each element contains a range of tools and techniques that can be applied. A researcher is likely to move iteratively among the four elements, choosing those tools and techniques that are appropriate to the data being synthesised and providing justifications for these choices.

Quantitative synthesis

  • Meta-analysis increases power and precision in estimating intervention effects.

  • Results of individual studies are combined statistically to give a pooled estimate of the ‘average’ intervention effect.

  • Most meta-analysis methods are based on calculating a weighted average of the effect estimates from each study.

  • The methods used to combine results will depend on the type of outcome assessed.

  • Quantitative results should be expressed as point estimates together with associated confidence intervals and exact p-values.

  • Variation in results across studies should be investigated.

  • Sensitivity analyses give an indication of the robustness of results to the type of study included and the methods used.
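The weighted average described in the summary above can be sketched in Python as follows. This is a minimal illustration of one common approach, an inverse-variance fixed-effect model, and is not necessarily the method appropriate for a given review. The study estimates (log risk ratios) and standard errors are hypothetical, and the function name pool_fixed_effect is an assumption. Cochran's Q and the I-squared statistic are included as simple measures of variation across studies.

```python
import math
from statistics import NormalDist


def pool_fixed_effect(estimates, standard_errors, level=0.95):
    """Inverse-variance fixed-effect pooling of study estimates.

    'estimates' are effect estimates on an additive scale (for example,
    log risk ratios) and 'standard_errors' their standard errors.
    """
    # Each study is weighted by the inverse of its variance.
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_weight
    se_pooled = math.sqrt(1 / total_weight)

    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (pooled - z_crit * se_pooled, pooled + z_crit * se_pooled)
    p_value = 2 * (1 - NormalDist().cdf(abs(pooled / se_pooled)))

    # Cochran's Q and I-squared describe variation between study results.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0

    return {
        "pooled": pooled,
        "se": se_pooled,
        "ci": ci,
        "p_value": p_value,
        "q": q,
        "i_squared": i_squared,
    }


# Hypothetical log risk ratios and standard errors from three studies.
result = pool_fixed_effect([-0.2, -0.1, -0.3], [0.1, 0.2, 0.1])

print(f"Pooled log RR: {result['pooled']:.3f} (SE {result['se']:.3f})")
print(f"95% CI: {result['ci'][0]:.3f} to {result['ci'][1]:.3f}")
print(f"Exact p-value: {result['p_value']:.4f}")
print(f"Q = {result['q']:.2f}, I-squared = {result['i_squared']:.0%}")
```

More precise studies (those with smaller standard errors) receive more weight, so the pooled estimate lies closer to their results. A random-effects model would use different weights and is not shown here.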


1.3.6 Report writing

Report writing is an integral part of the systematic review process. This section deals with the primary scientific report of the review, which often takes the form of a comprehensive report to the commissioning body. Many commissioners have their own guidance for production and submission of the report. Alternatively, the primary report may take the form of a journal article, where space limitations may mean that important details of the review methods have to be omitted. These can be made available through the journal’s or the review team’s website. Whatever the format, it is important to take as much care over report preparation as over the review itself. The report should describe the review methods clearly and in sufficient detail that others could, if they wished, repeat them. There is evidence that the quality of reporting in reports of primary studies may affect the readers’ interpretation of the results, and the same is likely to be true of systematic reviews.201 It has also been argued that trials and reviews often give incomplete ‘how to’ details about interventions, or omit them, limiting clinicians’ ability to implement findings in practice.202, 203, 204

The QUOROM statement9 has set standards for how reviews incorporating meta-analysis should be reported, and many journals require articles submitted to adhere to these standards. The QUOROM checklist and flow chart are useful resources for all authors of systematic review reports. However, recognising that the quality of reporting of many systematic reviews is disappointing,205 the QUOROM group have broadened their remit, been renamed PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses),206 and developed a flow chart and checklist for the reporting of systematic reviews with or without a meta-analysis.66, 67

General considerations

Resources for writers

There are many resources for writers available in both printed and electronic form. These include guides to technical writing and publishing,207, 208, 209 style manuals210, 211 and guides to use of English.212 The EQUATOR Network is an initiative that seeks to improve the quality of scientific publications by promoting transparent and accurate reporting of health research.101 It provides an introduction to reporting guidelines, and information for authors of research reports, editors and peer reviewers as well as those developing reporting guidelines.

Style and structure

Commissioning bodies and journals usually have specific requirements regarding presentation and layout that should be followed when preparing a report or article. Some organisations offer detailed guidance while others are less specific. In the absence of guidance, a layered approach such as a one page summary of the research ‘actionable messages’, three-page executive summary and a 25-page report is advocated as the optimal way to present research evidence to health service managers and policy-makers.213 Box 1.11 presents a suggested outline structure for a typical report of a systematic review.

Many journals publish papers electronically ahead of print publication and electronic publishing often allows additional material, such as large tables, or search strategies to be made available through the journal’s website. There is no specific word limit for reports published in electronic format only, for example in the Cochrane Library, although Cochrane reviews ‘should be as succinct as possible’.75

Box 1.11: Suggested structure of a systematic review report



Contents list

Executive summary or structured abstract

          Methods (data sources, study selection, data extraction, quality assessment, data synthesis)

Main text

          Review question(s)

          Review methods

                    Identification of studies

                    Study selection (inclusion and exclusion criteria; methods)

                    Data extraction

                    Quality assessment

                    Data synthesis

          Results of the review

                    Details of included and excluded studies

                    Findings of the review

                    Secondary analyses (sensitivity analyses etc.)

          Discussion (interpretation of the results)

          Recommendations/implications for practice/policy

          Recommendations/implications for further research

Acknowledgements or list of contributors and contributions

Conflicts of interest



Researchers should familiarise themselves with the conventions favoured by their commissioning body or ‘target’ journal. Many journals now prefer a clear and active style that is understandable to a general audience. Weaknesses in the use of grammar and spelling constitute obstacles to clear communication and should be eliminated as far as possible. The field of scientific and technical communication predominantly uses English as its common language, so those who are unsure of their ability in written English may find it helpful to have their report checked by an accomplished speaker/writer who is familiar with the subject matter before submission.

Contents lists, headings and indexes are essential for guiding the reader through longer documents. The numbering of sections can also be helpful. It is particularly important to adopt a consistent style (e.g. font, point size, font style) for different levels of main headings and sub-headings.


Time spent preparing a brief outline covering the main points to be included in the report can save time overall. The outline should focus on who the intended audience is and what they need to know. The review team will need to agree the outline and, if the report is to be written by multiple authors, allocate writers for each section. Dividing the work amongst a number of people reduces the burden on each individual but there is a risk of loss of consistency in style and terminology. In addition, completion of the report relies on all the team members working to the agreed schedule. It is essential for the lead author (corresponding author for journal articles) to monitor progress and take responsibility for accuracy and consistency.

Authorship and contributorship

The report of a systematic review will usually have a number of authors. According to the International Committee of Medical Journal Editors (ICMJE),214 authorship credit should be based on:

  1. Substantial contributions to conception and design, or acquisition of data, or analysis and interpretation of data

  2. Drafting the article or revising it critically for important intellectual content; and

  3. Final approval of the version to be published

All authors should meet all of these conditions. The review team should agree amongst themselves who will be authors and the order of authorship. Order of authorship is often taken to reflect an individual’s contribution to the report and methods are available for scoring contributions to determine authorship.215 Alternatively authors can simply be listed alphabetically. Contributions that do not meet the criteria for authorship (for example, data extraction or membership of an advisory group) should be included in the acknowledgements.

Some journals, for example the BMJ, favour a system of contributorship.216 In addition to the standard list of authors, there is a list of all those who contributed to the paper with details of their contributions. One contributor (occasionally more than one) is listed as guarantor and accepts overall responsibility for the work. This system gives some credit to those who do not meet the ICMJE criteria for authorship and provides accountability for each stage of the review.

Peer review and feedback

Most systematic reviews have an expert advisory group assembled at the beginning of the project and members of this group should be asked to review the draft report and comment on its scientific quality and completeness. The commissioning body may also organise its own independent peer review of the draft report before publication.

Medical journals almost invariably seek external peer review of manuscripts submitted for publication. Draft manuscripts may also be posted on institutional websites or electronic preprint servers, allowing an opportunity for feedback from a wide range of interested parties, although for reports intended for journals it is important to ensure that such posting will not be considered as prior publication.

In addition to scientific peer review, end users may also be asked to assess the relevance and potential usefulness of the review. They may recommend changes that would help in identifying the main messages for dissemination and important target audiences as well as possible formats and approaches.

When feedback from external reviewers has been received, a final report can be prepared. A record of the comments and the way in which they were dealt with should be kept with the archive of the review.

Conflict of interests

The ICMJE state that a conflict of interests exists if ‘an author (or the author’s institution), reviewer, or editor has financial or personal relationships that inappropriately influence (bias) his or her actions’.214 Relationships that might constitute a conflict of interests are common and there is nothing wrong with having such relationships. However, it is important that they are declared so that readers are aware of the possibility that authors’ judgements may have been influenced by other factors. Review authors need to be explicit about any potential conflict of interests because such transparency is important in maintaining the readers’ confidence.

Executive summary or abstract

The executive summary (for full-length reports) or abstract (for journal articles) is the most important part of the report because potentially it is the only section that many readers will actually read (perhaps in conjunction with the discussion and conclusions). It should present the findings of the review clearly and concisely and allow readers to quickly judge the quality of the review and the generalisability of its findings. Providing a good balance between detail of the intervention and how the review was conducted, and the results and conclusions, is always a challenge, and may require several iterations across the whole review team. The summary is usually the last section to be written so that full consideration can be given to all relevant aspects of the project. However, the process of summary writing may help in the further development of the recommendations by forcing review teams to identify the one or two most important findings and the conclusions which flow from them. It should be remembered that revisions to the report or article following peer review may also need to be reflected in the summary. Assistance from outside parties and medical writers may be helpful in developing a good summary.

Formulating the discussion

The purpose of the discussion section of a report is to help readers to interpret the results of the review. This should be done by presenting an analysis of the findings and outlining the strengths and weaknesses of the review. The discussion should also place the findings in the context of the existing evidence base, particularly in relation to any existing relevant reviews. It has been suggested that more could and should be done in discussion sections to contextualise both the nature of the research and the findings to the existing evidence base.217 There should be a balance between objectively describing the results, and subjectively speculating on their meaning.218 It is important to present a clear and logical train of thought and reasoning, supported by the findings of the review and other existing knowledge. For example, although statistically significant results and clear evidence of effectiveness may have been demonstrated, without an exploration of the impact on clinical practice it may not be clear whether they are clinically significant. Information on the interpretation of the analysis is given throughout Section 1.3.5 Data synthesis.

Some commissioners and most journals have a set format or structure for the report. This may require the discussion section to incorporate the conclusions and any implications or recommendations, or may require these as separate sections. In the absence of a structured format for the discussion section, the framework given in Box 1.12 may be helpful.

Box 1.12: Framework for the discussion section of a review

Statement of principal findings


Strengths and weaknesses of the review

          Appraisal of methodological quality of the review

          Relation to other reviews, in particular considering any differences

Meaning of the review’s findings

          Strengths and weaknesses of the evidence included in the review

          Direction and magnitude of effects observed in the included studies

          Applicability of the findings of the review


          Practical implications for clinicians and policy-makers

          Unanswered questions and implications for further research


Based on Docherty and Smith (1999)219

Conclusions, implications, recommendations

Faced with the need to make decisions and limited time to read the whole report, many readers may go directly to the conclusions. Therefore, whether incorporated in the discussion section or presented separately, it is essential that the conclusions be clearly worded and based solely on the evidence reviewed. The conclusions should summarise the evidence and draw out the implications for health care, and preferably be worded to show how they have been derived from the evidence.

Conclusions are generally a standard requirement; however, many commissioners and journals have their own conventions about implications and recommendations. For example, the NIHR HTA programme require the conclusions section of reports to include the implications for health care and specify recommendations for future research, in order of priority. They specifically exclude making recommendations for policy or clinical practice.220 Authors’ conclusions from Cochrane reviews are presented as the implications for practice and research; recommendations are not made.130

In the absence of guidance from the commissioner, it is generally advisable to avoid making recommendations about policy or practice, unless this is the focus of the review. The nature of the review question should therefore guide whether it is appropriate to include recommendations or focus on the implications for policy, practice and/or further research, and how these are best presented. Whether recommendations are made or implications drawn, it is important to ensure that these are supported by the evidence and to avoid making any statements that are outside the defined scope of the review. The way in which a recommendation or implication is phrased can considerably influence the way in which it is interpreted and implemented (or ignored). Hence, it is important to make all statements as precise as possible.221, 222, 223

Recommendations for practice are usually only made in guidelines, and are formulated from a variety of sources of information in addition to review findings. There are a number of schemes available for grading practice recommendations according to the strength of the evidence that supports them.224, 225, 226, 227, 228, 229, 230 Systematic review reports should aim to provide the information required to implement any of these systems if used. It should be noted that not all the schemes take into account the generalisability of the findings of the review to routine clinical practice. This should always be a consideration when drawing up the implications or if making recommendations.

A clear statement of the implications or recommendations for future research should be made; vague statements along the lines of ‘more research is needed’ are not helpful and should be avoided. Specific gaps in the evidence should be highlighted to identify the research questions that need answering. Where methodological issues have been identified in existing studies, suggestions for future approaches may be made. Where possible, research recommendations should be listed in order of priority, and an indication of how rapidly the knowledge base in the area is developing should be included. This can assist in planning an update of the review and help guide commissioners when allocating funding.

The DUETs initiative (Database of Uncertainties about the Effects of Treatments; www.duets.nhs.uk) recommends the presentation of research recommendations in a structured format represented by the acronym EPICOT (Evidence, Population(s), Intervention(s), Comparison(s), Outcome(s), Time stamp). Timeliness (duration of intervention/follow-up), disease burden and suggested study design are considered as optional additional elements of a structured research recommendation. Further details and an example of how to formulate research recommendations using the EPICOT format can be found in an article published by the DUETs Working Group.231 It is worth noting that there is some debate about the applicability of the EPICOT format for some reviews, particularly those of complex interventions.232
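For review teams that record research recommendations electronically, the EPICOT elements could be held in a simple structured record. The Python sketch below is only one possible representation: the class name ResearchRecommendation, the field names and the angle-bracket placeholder values are all hypothetical and are not taken from the DUETs guidance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResearchRecommendation:
    """One research recommendation structured around the EPICOT elements."""

    # Core EPICOT elements.
    evidence: str
    population: str
    intervention: str
    comparison: str
    outcome: str
    time_stamp: str
    # Optional additional elements.
    disease_burden: Optional[str] = None
    timeliness: Optional[str] = None
    study_design: Optional[str] = None

    def as_lines(self):
        """Return labelled lines, omitting optional elements that are not set."""
        elements = [
            ("Evidence", self.evidence),
            ("Population", self.population),
            ("Intervention", self.intervention),
            ("Comparison", self.comparison),
            ("Outcome", self.outcome),
            ("Time stamp", self.time_stamp),
            ("Disease burden", self.disease_burden),
            ("Timeliness", self.timeliness),
            ("Study design", self.study_design),
        ]
        return [f"{label}: {value}" for label, value in elements if value is not None]


# Placeholder values only; a real recommendation would state each element explicitly.
recommendation = ResearchRecommendation(
    evidence="<what the current evidence shows and its source>",
    population="<who the research should be conducted in>",
    intervention="<intervention to be evaluated>",
    comparison="<comparator>",
    outcome="<outcomes to be measured>",
    time_stamp="<date of the search or recommendation>",
    study_design="<suggested study design>",
)

lines = recommendation.as_lines()
print("\n".join(lines))
```

Keeping each element as a separate field makes it harder to fall back on a vague statement such as ‘more research is needed’, because every core element has to be filled in.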

Summary: Report writing

  • Report writing is an integral part of the systematic review process.

  • Reviews may be published as a report for the commissioner, as a journal article or both. Researchers should be aware of the requirements of commissioning bodies and journals and adhere to them.

  • Readability is a key aspect of reporting; a review’s findings will not be acted on if they are not clearly presented and understood.

  • The executive summary (for full-length reports) or abstract (for journal articles) is the most important part of the report, because it is potentially the only section that many readers will actually read (perhaps in conjunction with the discussion and conclusions).

  • A structured framework can be helpful for preparing the discussion section of the report.

  • Implications for practice or policy and recommendations for further research should be based solely on the evidence contained in the review.

  • The findings from systematic reviews are frequently used to inform guideline development. Guideline recommendations are often formulated using a grading scheme. Systematic review reports should therefore aim to provide the information required for such grading schemes.

  • A structured format for the presentation of research recommendations has been developed as a result of the DUETs initiative.

1.3.7 Archiving the review

There are published guidelines relating to the retention of primary research data.233 While these do not currently relate to systematic reviews, they do represent appropriate good practice. Where policies on retention, storage and protection are not specified by a commissioner, researchers might consider including this information in research proposals so that it is clear from the outset what will be kept and for how long.

Decisions need to be made about which documents are vital to keep and which can be safely disposed of. Extracted data and quality assessment information should be preserved. In addition, records of decisions made during protocol development, inclusion screening and data extraction are unique and should be kept. Minutes of meetings and correspondence, as well as peer review comments and responses, might also be held for a specific period of time as further records of the decision-making process. It is always advisable to permanently store a copy of the final report, particularly if the only other copy in existence is the one submitted to the commissioners.

Some information used in the review such as conference abstracts, additional information from authors, and unpublished material may be particularly difficult to obtain at a later stage so hard copies should be archived. This also applies to material retrieved from the Internet, which should be printed for the archive, as links to web pages are not permanent.

Whilst it may be easy and space saving to archive material electronically, paper records are often preferable as the equipment used to access documents stored in electronic formats can become obsolete after a relatively short period of time.

1.3.8 Disseminating the findings of systematic reviews

In recent years, there has been substantial investment in the commissioning of systematic reviews assessing the effects of a range of different health care interventions. To improve the quality of health care, and ultimately health outcomes, the review findings need to be effectively communicated to practitioners and policy-makers. The transfer of knowledge obtained through research into practice has long been acknowledged as a complex process234, 235, 236, 237, 238 that is highly dependent on context and the interaction of a multitude of interconnected factors operating at the level of the individual, group, organisation and wider health system.

A number of conceptual frameworks have attempted to represent the complexity of knowledge translation processes.234, 236, 238, 239, 240, 241, 242, 243, 244 One recent framework,244 whilst recognising the importance of non-linear diffusion, highlights a pivotal role for the direct or planned dissemination of contextualised, actionable messages derived from systematic reviews to inform practice and policy decision-making processes.

CRD’s experience of direct dissemination has led to the development of a framework, which is supported by both theoretical and empirical research into the ways by which different audiences become aware of, receive, access, read and use research findings (Figure 1.4). This involves targeting the right people with a clear and relevant message, communicating via appropriate and often multiple channels (any medium used to convey a message to an audience or audiences), whilst taking account of the environment in which the message will be received.

Detailed information about this framework is provided here; case studies showing the framework in use can be found on the CRD website (www.york.ac.uk/inst/crd). The framework provides a basic structure that enables researchers to consider carefully the appropriateness of their plans for dissemination, simple or complex, and could be used by anyone seeking to promote the findings of a review.

What is dissemination?

As interest in enhancing the impact of health research has increased, so too has the terminology used to describe the approaches employed.241, 245 Terms like dissemination, diffusion, implementation, knowledge transfer, knowledge mobilisation, linkage and exchange and research into practice are all being used to describe overlapping and interrelated concepts and practices. Given this, it is helpful to explain how the term dissemination is used here.

Dissemination is a planned and active process that seeks to ensure that those who need to know about a piece of research get to know about it and can make sense of the findings. As such it involves more than making research accessible through the traditional mediums of academic journals and conference presentations. It requires forethought about the groups who need to know the answer to the question a review is addressing, the best way of getting the message directly to that audience, and doing so by design rather than chance. Hence it is an active rather than a passive process.

The term dissemination is often used interchangeably with implementation but it is more appropriate to see the terms as complementary. Dissemination and implementation are part of a continuum.239, 246, 247 At one end are activities that focus on making research accessible, raising awareness of new findings and encouraging consideration of practice alternatives and policy options. At the other end of the continuum are activities that seek to increase the adoption of research findings into practice and policy and that facilitate, reinforce and maintain changes in practice.

CRD’s primary focus is very much at the awareness raising end of the continuum, though there is no clear cut off point, and there is evidence for the positive effects of planned dissemination on the implementation of research evidence in practice.237 For example, there is some evidence that the centre’s own Effective Health Care and Effectiveness Matters series of bulletins had a positive impact on the quality of health care delivered. Empirical studies have suggested that the dissemination of these bulletins contributed to reductions in the prophylactic extraction of wisdom teeth,248, 249 in the use of surgical interventions for glue ear,250, 251 and impacted on the prescribing of selective serotonin reuptake inhibitors for depression.252, 253

Dissemination should not be viewed as an adjunct to the review process or as something to be considered at the end when thoughts turn to publication. Nor should it be seen as separate from the wider social context in which the review findings are expected to be used. It is an integral part of the review process and should be considered from an early stage to allow adequate time for planning and development, for the allocation of responsibilities and to ensure that the proposed activities are properly resourced. The CRD framework (Figure 1.4) offers a sequential approach to considering, developing and implementing appropriate dissemination strategies for individual systematic reviews. The framework has been utilised for a wide range of topics and audiences for over a decade and the example below highlights the key elements of the framework in practice.



Figure 1.4: CRD Dissemination framework

CRD approach to dissemination

Traditionally, research on dissemination and implementation has tended to focus on the use of research knowledge, rather than on the effects of dissemination activities. However, a number of conceptual frameworks have been put forward which consistently suggest that the effectiveness of dissemination activities is determined by careful consideration of a number of key attributes.234, 237, 254, 255, 256, 257, 258 These are:

Assuming that all research has an audience (but not that all research should be widely disseminated), whether the message provides an unequivocal answer or simply highlights the need for further research, our approach is structured around six key attributes which are interlinked and difficult to consider in isolation (see Figure 1.4). The key messages from the review are the starting point for determining the audience to be targeted.

Characteristics of the research message and the setting in which it will be received

The literature on communication259 and diffusion239 (i.e. how, why, and at what rate ideas/innovations spread through social systems) highlights three types of messages that can impact on the knowledge and attitudes of target audiences: awareness, instruction (‘how to’) and persuasion (information that reduces uncertainty about expected consequences). Message characteristics to consider include the nature of the intervention, the strength of the evidence, its transferability, the degree of uncertainty and whether the findings confirm or reject existing predispositions or practices. Messages also have to be perceived as relevant and meaningful by the audiences being targeted. Knowledge about both the wider setting (economic, social, organisational and political environments) within which a target audience resides and the context (hostile or receptive) in which a message is to be received, should be used to inform the development of appropriate dissemination strategies.

Characteristics of the target audience(s)

Deciding who to target usually involves an element of prioritisation (segmentation) as resource constraints can make it difficult to reach all possible audiences. In prioritising, relevance (who needs to know about this research) and receptivity (who is most likely to be influenced and to influence others) need to be considered. The question of how best to reach target audiences can in part be answered by drawing upon the theoretical literature on research utilisation (the ways by which different audiences become aware of, access, read and make use of research findings).260, 261 This literature helps to inform the selection of the most appropriate or feasible communication channels for the audiences being targeted. Channels frequently used to promote review findings include paper and electronic publishing, email alerting services, direct and relationship marketing, mass media campaigns as well as engaging directly with target audiences.

Presentation of the research message and communication channel(s) used

The literature on diffusion239 makes a distinction between mass media channels and interpersonal (face to face) channels. The former are generally regarded as being more important for dissemination purposes whereas interpersonal channels are more important for activity at the implementation end of the continuum. CRD’s experience is that a combination of communication channels is helpful in increasing the likelihood that target audiences will encounter the review messages being promoted.

The selection of communication channels may also inform the presentation (tailoring) of the research message itself. When tailoring messages, consideration is given to the target audience, language used, the format, structure and style of presentation, the types of appeal and the amount of repetition. It is generally appropriate to try to write for an educated but non-research specialist health professional or decision-maker. Lay terms are used rather than technical language and statistics are presented in as simple a form as possible. The aim is to make information accessible to a broad range of readers, and anyone who would like more details can access the full report. It has been advocated that a layered structure such as the ‘1:3:25’ format (i.e. one page of the research ‘bottom lines’ or ‘actionable messages’, three-page executive summary and a 25-page report) is the optimal way to present research evidence to health service managers and policy-makers.213 This type of structuring involving a front page of key messages has become commonplace and reflects documented audience preferences for the ‘bottom line’ up front. There is some evidence that this order of presentation can increase overall understanding of the research findings but may also in some instances alienate those who are less receptive to or in disagreement with the conclusions presented.262, 263

Source of the research message

How the source (i.e. the research team or organisation) is perceived by a target audience in terms of its credibility (trustworthiness), attractiveness (likemindedness) or power is an important consideration. For example, where the evidence base is contested (clinically or politically), and/or where audiences are less familiar with systematic review methods generally, promoting source credibility can be crucial from the outset. An approach CRD has used when encountering these issues has been to create dedicated, publicly accessible websites that provide information about all aspects of the review. These websites enable external scrutiny of the review process, and include feedback facilities for interested parties to comment, ask questions or submit evidence for consideration. Our experience suggests it is important to make it clear that contributions of existing research evidence, including published/grey literature, are welcome, but that personal experience and anecdote, whilst important, do not usually form part of a systematic review. An example of a review-dedicated website can be found at www.york.ac.uk/inst/crd/fluorid.htm. Considerable effort is required to set up, monitor and maintain a dedicated website, and our experience of the benefit is varied. It is therefore important to consider the likely benefit to the review and the target audience before setting up a site.

Dissemination strategies

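It has been proposed that there are four dissemination models that can be employed to link ‘research to action’.262, 263 These are push, pull and exchange strategies, and an integrated approach that combines elements of the other three.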

In reality, push, pull and exchange strategies are not mutually exclusive; facilitating user pull often requires the application of a promotional push strategy (e.g. utilising email alerting services or RSS feeds) to inform and remind target audiences about review findings that are forthcoming or have been made available online. CRD favours an integrated approach that incorporates elements of all three strategies, with the emphasis shifting according to the topic and the audiences to be targeted.

Evaluation of impact

There is an increasing requirement, particularly from funders, for the impact of research to be predicted in advance of the work and then assessed after completion.265, 266 There are a number of specialised research impact assessment approaches, but these usually require specialist skills and additional resources.267, 268 On the question of whether the academic quality of research or its practical use and impact is most important, a pragmatic framework has been proposed that addresses both points.269 The framework is based on the assessment criteria used in UK universities. It provides a structure for a narrative description of the impact of the findings, from why the research question was first posed and funded to where the results were sent, discussed, and put into policy and/or practice.

Summary: Dissemination

  • Simply making research available does not ensure that those who need to know about it get to know about it or can make sense of the findings.

  • Dissemination is a planned and active process that can aid the transfer of research into practice.

  • Dissemination should not be viewed as an adjunct but rather as an integral part of the review process and should be considered from the outset.

  • CRD employs a topic-driven approach that involves targeting the right people with understandable and relevant messages, communicating via appropriate (often multiple) channels, whilst taking account of the environment in which the message will be received.

  • The CRD framework provides a basic structure for developing appropriate dissemination strategies and could be used by anyone seeking to promote the findings of a review.