Shortened Positive and Negative Symptom Scale as an Alternate Clinical Endpoint for Acute Schizophrenia Trials: Analysis from the US Food & Drug Administration
Abstract
Objective
To evaluate the performance of the individual Positive and Negative Syndrome Scale (PANSS) items, and to assess the feasibility of using a shortened version of the PANSS as an alternative regulatory endpoint for evaluating the efficacy of drugs to treat schizophrenia.
Design
Data from 32 randomized, placebo‐controlled, multiregional trials from eight atypical antipsychotic programs (N=14,219) submitted to the US Food and Drug Administration were used in the analyses. Item response theory analysis on baseline PANSS item scores was used to identify the best performing items of the PANSS to derive the shortened, or modified, PANSS (mPANSS). Concordance rates of mPANSS total with the PANSS total trial results at week 6 were examined, and implications of using mPANSS on trial sample size evaluated.
Results
Five of the positive items, six of the negative items, and eight of the general items were assessed as sensitive to the underlying symptom severity and comprise the mPANSS. The overall concordance rate between mPANSS and total PANSS results at week 6 was 97.7%. Using mPANSS resulted in a 32% reduction in sample size relative to using total PANSS.
Conclusions
Based on this research, mPANSS may be considered a potential alternative clinical endpoint for acute schizophrenia trials. However, it will need psychometric validation before it can be fully implemented in clinical trials in place of total PANSS. If such implementation occurs, the development of new drugs for schizophrenia, a public health imperative, may be considerably improved.
HIGHLIGHTS
The current research is a comprehensive analysis of one of the largest databases of randomized, placebo‐controlled trials of drugs indicated for schizophrenia, conducted to evaluate the feasibility of a shortened version of PANSS (mPANSS) as a regulatory clinical endpoint.
The mPANSS consisting of 19 out of 30 PANSS items was identified to be sensitive to assess the schizophrenia symptom severity by item response analysis.
Based on this research, mPANSS may be considered a potential alternative clinical endpoint for acute schizophrenia trials.
Schizophrenia is a chronic, disabling mental disorder affecting approximately 2.4 million adults in the United States, with a prevalence rate of approximately 1.5% (1). Schizophrenia is one of the top 25 leading causes of disability worldwide, with an economic burden of more than $155 billion per year in the United States alone. Despite the growing economic and societal burden, the global pharmaceutical industry has significantly decreased its investment in new treatments for schizophrenia in recent years (2, 3). An internal review of new drug applications (NDAs) submitted to the US Food and Drug Administration (FDA) between 2001 and 2015 revealed that the failure rate in acute schizophrenia registration trials was 38%, with a decreasing trend in treatment effect, increased placebo response, and higher percentage (∼50%) of dropouts (4). The lack of clear understanding of disease pathophysiology has contributed to many failed trials, and there is therefore a need to increase the efficiency of schizophrenia drug development. One way to achieve this is through reevaluating clinical and regulatory endpoints used to assess the efficacy of schizophrenia drugs.
Registration trials for drugs indicated for acute schizophrenia are randomized, double‐blind, placebo‐controlled studies, typically 6–8 weeks in duration, using the mean change from baseline (CFB) in the Positive and Negative Syndrome Scale (PANSS) as the primary endpoint. The PANSS was developed by combining items from the Brief Psychiatric Rating Scale and the Psychopathology Rating Schedule and has been the most widely used endpoint in acute schizophrenia trials. The 30‐item PANSS instrument measures the severity of positive symptoms (7 items), negative symptoms (7 items), and general psychopathology symptoms (16 items). PANSS has high internal validity and reliability, and excellent sensitivity to change in both short‐ and long‐term trials (5, 6, 7). PANSS is administered by trained clinicians who rate the current severity of each symptom (item) on a scale of 1 through 7, with increasing numbers corresponding to increasing severity. It is designed to take 30–60 minutes to administer, depending on the patient's level of cooperation and symptom severity (5, 7). Although the 30‐item PANSS is considered a comprehensive instrument for the assessment of schizophrenia symptom severity, several researchers (8, 9, 10) using item response theory (IRT) methodology have asked (1) how individual items in PANSS differ in their usefulness for assessing symptom severity and (2) whether a shorter version of PANSS can be derived by selecting the best performing items. They suggested modifications and improvements to the current version of the PANSS.
The objectives of this work are (1) to evaluate the performance of individual PANSS items in assessing symptom severity of schizophrenia using data from several NDAs in patients with schizophrenia and (2) to assess the feasibility of using a shortened, or modified, version of the PANSS (mPANSS) as an alternative regulatory endpoint for evaluating drug efficacy.
Methods
Data Collection
Data from 32 randomized, placebo‐controlled, multiregional registration trials of the oral formulations of eight atypical antipsychotics are included in these analyses. The created database includes data from a total of 14,219 patients in acute schizophrenia trials. The applications including these data were submitted to FDA between 2001 and 2015. Most trials (28/32: 87.5%) were of 6 weeks duration, and the remaining four (12.5%) were of 4 weeks duration. A summary of database characteristics is provided in Table 1. Individual PANSS item scores available at baseline were used for derivation of mPANSS items.
Table 1. Summary of database characteristics.

Characteristic | Number | % |
---|---|---|
Number of trials | 32 | |
6‐week trials | 28 | 87.5 |
4‐week trials | 4 | 12.5 |
Number of treatment arms (including active control) versus placebo | 86 | |
Total number of subjects | 14,219 | |
Subjects randomized to placebo | 3533 | 24.8 |
Subjects randomized to drug treatment (including active control) | 10,686 | 75.2 |
Demographics | Mean | (SD) or (%) |
Age, years | 39.0 | 11.0 |
Females | 4414 | 31.0% |
Caucasian | 7183 | 51.5% |
African Americans | 4346 | 30.5% |
Asians + others | 2690 | 18.0% |
Baseline total PANSS | 94.4 | 13.6 |
Derivation of mPANSS
IRT analysis on baseline PANSS item scores was used to identify the best performing items to derive mPANSS. IRT methodology relates the item responses directly to the symptom severity, quantifying how the performance of individual items and options (severity levels of 1 to 7; 1=absent, 2=minimal, 3=mild, 4=moderate, 5=moderate severe, 6=severe, and 7=extreme) changes as a function of the overall underlying (latent) standardized symptom severity. As schizophrenia is a multidimensional disorder consisting of various symptom clusters, IRT methodology was used to test each unidimensional subscale of PANSS, that is, the positive, negative, and general psychopathology subscales. A graded response IRT model (11) was used to express the probability of an item response as a function of the item parameters (the discriminative ability and the difficulty of the item) and the underlying symptom severity, as described below.
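The general form of a graded response IRT model specifies the cumulative probability of selecting each of the ordinal responses or higher and is shown below (Equation 1):
$$P(Y_{ij} \ge m + 1 \mid \theta_j) = \frac{\exp[a_i(\theta_j - b_{im})]}{1 + \exp[a_i(\theta_j - b_{im})]}, \quad m = 1, \dots, k - 1 \qquad (1)$$
where $Y_{ij}$ is the response of subject $j$ to item $i$, and $k$ is the number of response options of the item.
$a_i$ = item discrimination parameter for item $i$. As the name suggests, the discrimination parameter is a measure of the differential capability of an item. A high discrimination parameter value would suggest that the item has a high ability to discriminate the subscale symptom severity.
$b_{im}$ = item difficulty parameter, a measure of whether the item options cover the entire range of the underlying subscale symptom severity. It is also referred to as the threshold (or intercept) parameter for item $i$, and the number of threshold parameters for a given item is $k - 1$.
$\theta_j$ = subject‐specific underlying symptom severity.
The probability of selecting response option $m$ of the item is obtained by taking the difference of the cumulative probabilities of the adjacent response options, that is, $P(Y_{ij} = m \mid \theta_j) = P(Y_{ij} \ge m \mid \theta_j) - P(Y_{ij} \ge m + 1 \mid \theta_j)$, with $P(Y_{ij} \ge 1 \mid \theta_j) = 1$ and $P(Y_{ij} \ge k + 1 \mid \theta_j) = 0$.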
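As an illustration of how option probabilities follow from the cumulative probabilities, the sketch below assumes the common logistic form of the graded response model, in which the cumulative probability depends on the discrimination multiplied by the distance between severity and threshold. The function names and the example parameter values are hypothetical and are not estimates from this analysis.

```python
import math


def cumulative_prob(theta, discrimination, threshold):
    """Probability of responding above a threshold, assuming a logistic form."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - threshold)))


def category_probs(theta, discrimination, thresholds):
    """Probability of each ordinal option given ordered thresholds.

    With k options there are k - 1 thresholds. The cumulative probability of
    the lowest option "or higher" is 1, and of anything above the top option is 0.
    """
    cumulative = (
        [1.0]
        + [cumulative_prob(theta, discrimination, b) for b in thresholds]
        + [0.0]
    )
    # Each option probability is the difference of adjacent cumulative values.
    return [cumulative[m] - cumulative[m + 1] for m in range(len(thresholds) + 1)]


# Hypothetical item with seven options (six thresholds); values are illustrative only.
example_thresholds = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
probs = category_probs(theta=0.0, discrimination=1.5, thresholds=example_thresholds)
```

Evaluating `category_probs` over a grid of severity values (for example, from −4 to +4) gives the points needed to draw one curve per response option for an item.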
The probability of a response to an item is visualized using an item characteristic curve (ICC) obtained based on the graded response model. ICCs are generated for each item, in which the probability of choosing a particular response is plotted against a range of symptom severity. For example, for an item in the positive subscale such as delusion, an ICC will be generated with the probability of choosing the different options for delusion and is then plotted against the underlying positive subscale symptom severity. An ideal ICC has the characteristic shown in Figure S1 of the online supplement.
To investigate the usefulness of each item, an ICC based on the graded response model parameters was used. The ICCs (generated for each of the 30 items) are graphical representations of the probability of rating the different options for a given item across the range of symptom severity. Five criteria (8, 10) based on ICC were used to categorize whether a PANSS item was “Very Good,” “Good,” or “Poor.” The criteria are provided in Table S1 of the online supplement. mPANSS items were derived by considering only the items categorized as “Very Good” and “Good.”
Concordance Analysis
The longitudinal mPANSS total score for each subject was then calculated as the sum of the mPANSS items. The mean CFB in mPANSS total score by week and treatment was estimated using mixed‐model repeated measure (MMRM) analysis for each trial in the database. The MMRM model included baseline mPANSS total score, treatment, time, and time‐by‐treatment interaction. Based on the model‐estimated mean CFB in mPANSS total, the treatment outcome at week 6 was compared with the original pre‐specified analysis (6‐week duration) using PANSS total. Concordance and discordance rates of mPANSS total with PANSS total trial results at week 6 were examined. Concordance rate was defined as shown below (Equation 2):
$$\text{Concordance rate (\%)} = \frac{\text{number of treatment arms with the same outcome using mPANSS total and PANSS total}}{\text{total number of treatment arms}} \times 100 \qquad (2)$$
Discordance rate was calculated as:
$$\text{Discordance rate (\%)} = 100 - \text{Concordance rate (\%)}$$
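The calculation can be sketched as follows, assuming that concordance is the percentage of treatment arms for which both scales lead to the same conclusion about superiority over placebo. The function name and the example outcome flags are hypothetical; only the totals (86 arms, one false negative, one false positive) mirror the results reported in this paper.

```python
def concordance_summary(mpanss_positive, panss_positive):
    """Compare per-arm outcomes (True = significantly better than placebo)."""
    if len(mpanss_positive) != len(panss_positive) or not panss_positive:
        raise ValueError("need two non-empty outcome lists of equal length")
    pairs = list(zip(mpanss_positive, panss_positive))
    concordant = sum(m == p for m, p in pairs)
    # False negative: mPANSS misses an effect that total PANSS detects.
    false_negatives = sum((not m) and p for m, p in pairs)
    # False positive: mPANSS shows an effect that total PANSS does not.
    false_positives = sum(m and (not p) for m, p in pairs)
    rate = 100.0 * concordant / len(pairs)
    return {
        "concordance_rate": rate,
        "discordance_rate": 100.0 - rate,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
    }


# Hypothetical flags for 86 arms; the 60/26 split is invented for illustration.
panss_flags = [True] * 60 + [False] * 26
mpanss_flags = list(panss_flags)
mpanss_flags[0] = False   # one false negative
mpanss_flags[-1] = True   # one false positive
summary = concordance_summary(mpanss_flags, panss_flags)
```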
Sample Size Estimation
Based on the magnitude of the treatment effect and the calculated standard deviation using mPANSS total, the implication of using the mPANSS total score for sample size requirements was assessed and compared with a typical 6‐week trial using the PANSS total score. The IRT analysis was carried out using PROC IRT and the MMRM analysis using PROC MIXED, both in SAS version 9.4.
Results
Derivation of mPANSS
Summary of baseline PANSS item scores
The distribution of the item responses at baseline for the positive, negative, and general psychopathology subscales across the eight NDAs is shown in Figure 1. Of the 14,219 subjects in the efficacy database, item responses at baseline were not available for 102 subjects. Mild to moderately severe ratings were most common for positive items P1, P2, P3, and P6. For P4 and P5, the most common ratings were absent to mild. Severe ratings were less common. Similarly, mild to moderate ratings were most common for negative items N1 to N7. The general psychopathology items also showed mostly mild to moderate ratings, and severe and extreme ratings were minimal or absent.
Item response analysis to derive modified PANSS
Item parameter estimates obtained from the graded response IRT model of the positive, negative, and general psychopathology subscales are shown in Table S2 of the online supplement. The item parameters are the difficulty, or the threshold, parameter, and the discrimination parameter. The threshold parameter acknowledges the disparity between the item rating/responses (e.g., moderate) and the underlying symptom severity. The k−1 threshold parameters (k=7; number of item options) signify the probability of endorsing a higher severity rating than the previous one. For example, threshold 3 signifies the probability of endorsing moderate option as compared to the mild option. The discrimination parameter is a measure of an item's differentiating capability. In practice, a high discrimination value means that the probability of a response increases rapidly with a small increase in the underlying severity. Except for item P2, discrimination parameters for the positive subscale were greater than 0.5, one of the criteria for assessing the sensitivity of the item. All items in the negative subscale had a discrimination parameter greater than 0.5, as did all general psychopathology subscale items, except for G1, G2, G3, G6, and G7.
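To make the link between discrimination and steepness concrete, the following worked step assumes the logistic form commonly used for the graded response model, with $a$ the discrimination, $b$ a single threshold, and $\theta$ the underlying severity (symbols chosen here for illustration):

$$P(\theta) = \frac{1}{1 + e^{-a(\theta - b)}}, \qquad \frac{dP}{d\theta} = a\,P(\theta)\,[1 - P(\theta)], \qquad \left.\frac{dP}{d\theta}\right|_{\theta = b} = \frac{a}{4}$$

Under this assumption, the cumulative curve is steepest at the threshold, where its slope is $a/4$. An item with $a = 0.5$ therefore gains at most 0.125 in cumulative probability per unit of standardized severity, whereas an item with $a = 2$ gains up to 0.5 per unit, which is why low discrimination produces flat, overlapping option curves.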
The ideal characteristics of an ICC are shown in Figure S1 of the online supplement. The ICCs of representative items (P1: Delusion, P5: Grandiosity, N2: Emotional withdrawal, N7: Stereotyped Thinking, G3: Guilt feelings, and G11: Poor attention), on which criteria 1–5 were assessed, are shown in Figure 2, and the supplementary figures in the online supplement present the ICCs for all items. In general, if an item is sensitive, the corresponding ICC shows clear delineation of each of the item's rating options (1–7) and a relatively steep rise and fall in the probability of each option. The x‐axis of the ICC indicates the underlying overall symptom severity (positive, negative, or general psychopathology) on a standardized normal scale. The values can range from –∞ to +∞; however, valid inferences are drawn between –4 and +4. Negative values indicate less severe symptoms, whereas positive values indicate more severe symptoms.
Each of the 30 PANSS items was assessed using the five criteria specified in the methods section and graded as “Very Good,” “Good,” or “Poor.” The evaluation and grading of the items are shown in Table 2. Using this system, five of the positive items, six of the negative items, and eight of the general items were assessed as either “Very Good” or “Good,” indicating that these items are sensitive and that the responses obtained for the items can be closely related to the underlying symptom severity. For example, item 1 from the positive subscale, delusions, assesses “beliefs that are unfounded, unrealistic, and idiosyncratic.” The probability of rating 1 (Option 1), indicating the absence of symptoms, decreases rapidly as the positive psychotic symptoms begin to increase. The probability of choosing more severe ratings of delusion increases rapidly with increases in the severity of the positive symptoms. Rating 4, indicating moderate symptoms, is most frequently chosen when the standardized symptom severity is around −1 (corresponding to an expected positive total score of 20 and an expected total score of 80), and the probability of choosing rating 4 decreases rapidly as the positive symptom severity increases further. Other ratings for the item are similarly well defined for the corresponding severity; hence, the item can be deemed one of the best performing items in the PANSS questionnaire.
Table 2. Evaluation of individual PANSS items against the five ICC‐based criteria.

Item | Criterion 1 | Criterion 2 | Criterion 3 | Criterion 4 | Criterion 5 | Evaluation |
---|---|---|---|---|---|---|
Positive subscale items | ||||||
P1: Delusions | Yes | Yes | Yes | Yes | Yes | Very good |
P2: Conceptual disorganization | Some | No | Some | Yes | No | Poor |
P3: Hallucination | Yes | Some | Yes | Yes | Yes | Good |
P4: Excitement | Some | Some | Yes | Yes | Yes | Good |
P5: Grandiosity | Some | No | Some | Yes | Yes | Poor |
P6: Suspiciousness | Yes | Yes | Yes | Yes | Yes | Very good |
P7: Hostility | Yes | Some | Yes | Yes | Yes | Good |
Negative subscale items | ||||||
N1: Blunted affect | Yes | Yes | Yes | Yes | Yes | Very good |
N2: Emotional withdrawal | Yes | Yes | Yes | Yes | Yes | Very good |
N3: Poor rapport | Yes | Yes | Yes | Yes | Yes | Very good |
N4: Passive social withdrawal | Yes | Yes | Yes | Yes | Yes | Very good |
N5: Difficulty in abstract thinking | Yes | Some | Yes | Yes | Yes | Good |
N6: Lack of spontaneity | Yes | Yes | Yes | Yes | Yes | Very good |
N7: Stereotyped thinking | Some | Some | Some | Yes | Yes | Poor |
General psychopathology subscale items | ||||||
G1: Somatic concern | Some | No | Some | No | No | Poor |
G2: Anxiety | Some | No | Some | No | No | Poor |
G3: Guilt feelings | No | No | No | No | No | Poor |
G4: Tension | Yes | Some | Yes | Yes | Yes | Good |
G5: Mannerisms & posturing | Some | Some | Some | Yes | Yes | Poor |
G6: Depression | Some | No | Some | No | No | Poor |
G7: Motor retardation | Some | Some | Some | Yes | No | Poor |
G8: Uncooperative | Yes | Yes | Yes | Yes | Yes | Very good |
G9: Unusual thought content | Yes | Some | Yes | Yes | Yes | Good |
G10: Disorientation | Some | Some | Some | Yes | Yes | Poor |
G11: Poor attention | Yes | Yes | Yes | Yes | Yes | Very good |
G12: Lack of judgment | Yes | Yes | Yes | Yes | Yes | Very good |
G13: Disturbance of volition | Yes | Yes | Yes | Yes | Yes | Very good |
G14: Poor impulse control | Some | Some | Some | Yes | Yes | Poor |
G15: Pre‐occupation | Yes | Yes | Yes | Yes | Yes | Very good |
G16: Social avoidance | Yes | Some | Yes | Yes | Yes | Good |
Similar trends are seen with respect to P6 (suspiciousness), with all criteria satisfied, and P3, P4, and P7 satisfying most of the evaluation criteria. On the other hand, for P5 (grandiosity), the probability curves for each rating are not well defined for the symptom severity, as each of the option curves covers the entire symptom severity (x‐axis) indicating that the item is not sensitive for characterizing the underlying positive symptom. For all the negative items except N7, the probability curves for the various options define the underlying negative symptom severity well. On the other hand, eight of 16 general subscale items were evaluated as poor: these were G1 (somatic concern), G2 (anxiety), G3 (guilt feelings), G5 (mannerisms and posturing), G6 (depression), G7 (motor retardation), G10 (disorientation), and G14 (poor impulse control). Thus, based on the IRT analysis of the baseline PANSS item scores across the eight drugs, 19 of the 30 PANSS items were evaluated as “Very Good” or “Good,” and these items comprise the mPANSS.
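A minimal sketch of the resulting score is given below, assuming the mPANSS total is the unweighted sum of the 19 items graded “Very Good” or “Good” in Table 2. The function name and the handling of missing or out‐of‐range ratings are illustrative assumptions, not part of the original analysis.

```python
# Items graded "Very Good" or "Good" in Table 2.
MPANSS_ITEMS = (
    "P1", "P3", "P4", "P6", "P7",
    "N1", "N2", "N3", "N4", "N5", "N6",
    "G4", "G8", "G9", "G11", "G12", "G13", "G15", "G16",
)


def mpanss_total(item_scores):
    """Sum the retained items from a mapping of PANSS item code to rating (1-7)."""
    missing = [item for item in MPANSS_ITEMS if item not in item_scores]
    if missing:
        # Assumption: a total is only computed when all retained items are rated.
        raise ValueError(f"missing ratings for: {', '.join(missing)}")
    for item in MPANSS_ITEMS:
        if not 1 <= item_scores[item] <= 7:
            raise ValueError(f"rating for {item} must be between 1 and 7")
    return sum(item_scores[item] for item in MPANSS_ITEMS)


# Items outside the retained set (here P2) are simply ignored.
lowest = mpanss_total({item: 1 for item in MPANSS_ITEMS + ("P2",)})
highest = mpanss_total({item: 7 for item in MPANSS_ITEMS})
```

With every retained item rated 1 the total is 19, and with every item rated 7 it is 133, so the score range follows directly from the item count.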
Concordance Analysis
The mPANSS total score, which included the best performing items shown in Table 2, was calculated longitudinally (week 1–week 6) for all subjects. Based on the 19 PANSS items, the maximum possible score was 133 points. The model‐estimated mean CFB in mPANSS total obtained by the MMRM analysis was used to calculate the placebo‐corrected mean change in mPANSS total (double delta) for each of the treatment arms in each study used to support approval of a given drug, and this was compared with the total PANSS endpoint at week 6 from the original study (Table 3). The overall treatment arm concordance rate for 6‐week mPANSS versus 6‐week total PANSS was 97.7%, with the discordance due to one false negative result and one false positive result. A false negative occurs when the treatment arm does not demonstrate a statistically significant difference from placebo at week 6 using mPANSS, whereas it demonstrates superiority over placebo using the week 6 total PANSS. A false positive is a positive result with mPANSS at week 6 when the week 6 total PANSS result was negative.
Concordance rate (6‐week mPANSS vs. 6‐week total PANSS) | |
---|---|
Overall | 97.7% (84/86) [false negative: 1, false positive: 1] |
By trial design | |
Fixed | 97.1% |
Flexible | 100% |
By drug | |
1 | 100% |
2 | 100% |
3 | 100% |
4 | 100% |
5 | 100% |
6 | 100% |
7 | 90% [false positive: 1] |
8 | 92.3% [false negative: 1] |
Sample Size Considerations
The model‐estimated placebo‐corrected mean CFB in mPANSS (double delta) at week 6 for the active treatment, considering all the 32 trials, was 4.9 units. The pooled standard deviation was 12.6 units at week 6. Hence, assuming a standard deviation of 13 units, the effect size using mPANSS total at week 6 was 0.38, compared with an effect size of 0.33 for PANSS total at week 6. Considering the effect size as described, the sample size estimates to detect a difference in CFB at week 6 with 90% power and a two‐sided alpha of 0.05 using overall mean double delta from 32 trials are 296 for mPANSS total and 380 for PANSS total, respectively. A reduction in variability of responses and resultant increase in effect size could allow for a 32% reduction in sample size in a trial using mPANSS compared with one using total PANSS.
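The dependence of sample size on effect size can be sketched with the standard normal approximation for comparing two equal‐sized arms. This is an assumption for illustration: the source does not state which formula or software was used for its estimates, so the output is not expected to match the reported figures exactly (the reported values may rest on unrounded inputs or other design allowances). The function name is hypothetical.

```python
import math
from statistics import NormalDist


def per_arm_sample_size(effect_size, power=0.90, alpha=0.05):
    """Subjects per arm for a two-sided, two-sample comparison of means.

    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is the standardized effect size (mean difference / SD).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Effect sizes quoted in the text: 0.38 (mPANSS total) and 0.33 (PANSS total).
n_mpanss = per_arm_sample_size(0.38)
n_panss = per_arm_sample_size(0.33)
```

Because the required sample size scales with the inverse square of the effect size, even a modest gain in effect size from reduced variability translates into a noticeably smaller trial.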
Discussion
Improving the efficiency of a drug development program saves time and money and, more importantly, can give patients faster access to effective drugs. The main objective of the current research was to evaluate alternate clinical and regulatory endpoints for acute schizophrenia trials to improve trial design elements.
Recent research (8, 9, 10) and a review (12) have suggested that, despite the established status of PANSS, shortening the scale while retaining the most sensitive items can result in a more reliable assessment of symptom severity, lessen administration and training time, and potentially reduce the sample size needed in clinical trials (due to decreased variability). Previous applications of IRT to PANSS have identified slightly different items as having the best performance based on specific schizophrenia clinical trial data; the current study, however, presents data derived from an FDA database comprising 32 schizophrenia clinical trials from eight drug development programs. As can be seen in Table S3 of the online supplement, there are some differences between the items deemed sensitive using the FDA database and those in the two literature reports. The IRT analysis using the FDA database identified 19 of the 30 items as sensitive, that is, as items whose observed responses accurately reflect the underlying symptom severity. Most of the negative items (6/7) and positive items (5/7) were assessed to be informative. In the general psychopathology subscale, only eight of the 16 items were characterized as sensitive. It is worth noting that the data used by Santor et al. (10) came from observational studies along with randomized trials of olanzapine. On the other hand, the Khan et al. (8) database consisted only of randomized controlled trials of risperidone and paliperidone (including the depot formulation), whereas the FDA database included randomized controlled trials of oral formulations only. Therefore, the differences in derived items can be attributed to differences in the patient populations considered. Although only baseline PANSS items were used in the IRT analysis, the inclusion and exclusion criteria for each of the trials might have affected the results.
PANSS items not included in the mPANSS measure primary symptoms of diseases other than schizophrenia (e.g., P5: Grandiosity, G2: Anxiety, G6: Depression, and G14: Poor Impulse Control), or advanced forms of motor illness seen in some patients with schizophrenia not likely to be enrolled in clinical trials (e.g., G5: Mannerisms and Posturing and G7: Motor Retardation). Therefore, mPANSS is useful in measuring symptom change in patients with acute symptoms of schizophrenia who are eligible to enroll in drug treatment trials.
The use of the FDA database to identify the best performing items of PANSS allows the use of data from the largest number of subjects available from randomized clinical trials (N=14,219), which is one of the main strengths of the research. Another important aspect of the study is that it demonstrates the implications of deriving mPANSS for the design of subsequent acute schizophrenia clinical trials. The MMRM analysis of the 19‐item mPANSS total across the eight drugs showed that the trial outcomes were similar to those obtained using total PANSS. The 6‐week mPANSS outcomes were in 97.7% agreement with the 6‐week total PANSS analysis across the 86 treatment arms. The discordant results were due to one false negative result and one false positive result. The high concordance between total PANSS and mPANSS in the trial outcomes indicates that the mPANSS and the total PANSS are well correlated and that the shortened scale does not lead to any loss of information. Moreover, the use of mPANSS reduces sample size requirements by 32%. Inclusion of only the most informative of the 30 PANSS items leads to a decrease in the variability of the data, indicating that the items that do not adequately assess the underlying symptom severity might not add any information with respect to the disease status. However, the analysis is subject to certain limitations. Due to the strict inclusion and exclusion criteria in randomized controlled trials, the data may not accurately represent all patients encountered in clinical practice. Given the long time span of the studies included in the analysis (∼15 years) and the large number of sites and investigators, interrater reliability may not be consistent throughout. Furthermore, subjects with very severe symptoms of schizophrenia were few or absent for almost all subscales. This may imply that the derived mPANSS is useful in some settings and not in others.
In addition, individual PANSS items identified as informative were considered as a composite metric (mPANSS) to assess drug effects post‐baseline, and individual ranking of items to detect treatment effects post‐baseline was not considered and could potentially be an interesting research question.
Considering the above discussed aspects, mPANSS may be considered a potential alternative clinical endpoint for acute schizophrenia trials. mPANSS will need psychometric validation before it can be fully implemented in clinical trials in place of the full PANSS. In the meantime, investigators may consider conducting the full PANSS interview; however, the mPANSS (derived from the full PANSS) can be used in the statistical analysis to determine sample size requirements and trial outcomes.