Not enough information in the public use databases to know that a claim is fraudulent.
“In God we trust; all others bring data.”
That is what the sign above my desk says, right next to my six-foot slide rule.
I was pondering the truth behind this tongue-in-cheek saying when I first saw the whistleblower complaint filed against Providence Health Services. In reading the complaint, I will admit that I was quite impressed with the degree to which the relator made use of both comparative statistics and benchmarking – yet even with all of that, what is missing is the proof needed to accuse someone of fraud, and that burden of proof is a heavy one for sure.
As most folks know, I am an obsessive-compulsive mathematician and an evangelist in the area of analytics. I have spent the majority of my adult life advocating for the use of analytics in healthcare – and while it is a useful tool, it is not a cure-all, and like any other tool, one needs to know how to use it to make it effective.
Even in this case, where it appears that the crux of the complaint is based on statistics and analytics, it is important for all involved to understand the limitations of analytics in this type of case. I have been involved as an expert in other cases in which indictments were issued and the government’s entire argument depended on outside analytics, and in each case, the defendant was acquitted – because math isn’t always enough.
As many will recall, when the Centers for Medicare & Medicaid Services (CMS) started publishing physician utilization public use files in 2014, media outlets and whistleblower wannabes used the data to call out physicians who, they said, were abusing Medicare. And in most of those cases, they were wrong.
But that didn’t stop them.
Looking at the data, we saw high-volume physicians with a 100 percent Medicare case mix alongside low-volume physicians with a very small Medicare mix. But with an assumption of guilt – which is what happens way too often – the benchmarking more often than not produced accusations of overbilling against the high-volume providers, while no one seemed to question why others were barely a blip on the radar. From the whistleblower’s perspective, just like that of a government auditor, underbilling doesn’t put food on the table, and most understand that these types of cases aren’t about fairness; they are about money.
As stated above, I have to admit that I was quite impressed with the density of the analytical work done by the relator. I take issue, however, with the overuse of the word “proprietary” when it came to her methods. This begins in paragraph one, where the complaint seems to be based on a “proprietary analysis of all claims submitted to Medicare nationwide since 2011.”
I have been doing this since 1987, and there isn’t a lot of “proprietary” analysis that can be done with limited data sets publicly published by CMS. Maybe the most egregious use of this term occurs in Paragraph 49, and I quote: “relator uncovered Providence and JATA’s (J.A. Thomas & Associates’) fraud by employing unique algorithms and statistical processes to analyze inpatient claims data for short-term acute-care hospitals from 2011 through June 2017, obtained from the CMS.” It goes on to say that “these proprietary methods have allowed relator to identify with specificity the false claims made by Providence to fraudulently inflate revenue on Medicare claims.”
I don’t care how good an analyst you are; there simply isn’t enough information available in the public use databases, such as the Medicare Provider Analysis and Review (MEDPAR) file or the Program for Evaluating Payment Patterns Electronic Report (PEPPER), to be able to know for certain that a claim is fraudulent.
While this is not the venue to get into depth regarding the process of coding for diagnosis-related groups (DRGs), interested folks should take the time to learn more about them. The basic idea is that a DRG can be modified based on whether there are complications and comorbidities (CCs) or major complications and comorbidities (MCCs).
In general, MCCs are paid the highest, CCs are paid in the middle, and those without any complications are paid the least, so it isn’t a stretch to suggest that some institutions might feel incentivized to upcode to take advantage of those higher payments. But looking only at these complication rates is not enough. It is also important to look at length of stay and discharge status.
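To put rough numbers on that incentive: Medicare’s inpatient payment for a DRG is essentially a hospital-specific base rate multiplied by the DRG’s relative weight, so the gap between the with-MCC and without-MCC versions of the same condition can be substantial. The base rate and weights in the little sketch below are purely illustrative – they are not actual CMS figures for any year or any hospital.

```python
# Illustrative only: made-up base rate and relative weights, not actual CMS figures.
base_rate = 6000.00            # hypothetical hospital base payment rate, in dollars
relative_weight = {
    "DRG 469 (w/ MCC)":  3.0,  # hypothetical weight for the with-MCC DRG
    "DRG 470 (w/o MCC)": 1.9,  # hypothetical weight for the without-MCC DRG
}

for drg, weight in relative_weight.items():
    print(f"{drg}: approx. payment = ${base_rate * weight:,.2f}")
```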
For example, let’s say a patient gets a hip replacement. If there were MCCs, it would be reported with DRG 469. If there weren’t any complications, it would be reported with DRG 470. If this were billed with DRG 469 and the patient was discharged two days later, it might raise some red flags, as one would expect a patient with major complications following a total hip replacement to be in the hospital longer. It might well be a concern if the patient were discharged home, but if he or she were discharged to another facility, such as a rehab hospital, then it might not be an issue.
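To make that concrete, here is a minimal sketch in Python of the kind of screen I am describing. The column names and the two-day threshold are hypothetical, and all the screen can produce is a list of claims worth pulling the chart on; it cannot tell you whether the coding was right or wrong.

```python
import pandas as pd

# Hypothetical MEDPAR-style extract: one row per inpatient claim.
claims = pd.DataFrame({
    "claim_id":         [101, 102, 103],
    "drg":              [469, 469, 470],   # 469 = joint replacement w/ MCC, 470 = w/o MCC
    "length_of_stay":   [2, 9, 3],         # days
    "discharge_status": ["home", "snf", "home"],
})

# Flag DRG 469 (with MCC) claims where the stay was very short AND the
# patient went straight home -- a pattern worth reviewing, nothing more.
short_stay_days = 2  # illustrative threshold, not a clinical standard
flags = claims[
    (claims["drg"] == 469)
    & (claims["length_of_stay"] <= short_stay_days)
    & (claims["discharge_status"] == "home")
]

print(flags[["claim_id", "drg", "length_of_stay", "discharge_status"]])
```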
From a benchmarking perspective, it is not difficult, using MEDPAR or PEPPER data, to analyze the utilization of different levels of complications for different conditions at different facilities for any given set of regions or locations. I have done this in a number of engagements, and while uncovering anomalous patterns is not uncommon, it takes a whole lot more than that to infer incorrect coding or billing, much less fraud. Even after adding in average length of stay and discharge status, it becomes easier to see where problems may occur, but an assumption of fraud requires a much deeper dive, including a review of the medical records and even a conversation with the patients and their families.
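For those who have not worked with these files, the benchmarking step itself is not complicated. Here is a rough sketch, using invented facility counts and an illustrative cutoff, of what comparing MCC rates across facilities for a single DRG family might look like. Anything it surfaces is an anomaly to investigate, not a conclusion about coding, billing, or fraud.

```python
import pandas as pd

# Hypothetical facility-level summary built from MEDPAR/PEPPER-style data:
# counts of discharges coded with an MCC vs. all discharges, for one DRG family.
facilities = pd.DataFrame({
    "facility":  ["A", "B", "C", "D"],
    "mcc_cases": [120, 45, 320, 60],
    "all_cases": [400, 380, 520, 410],
})

facilities["mcc_rate"] = facilities["mcc_cases"] / facilities["all_cases"]
peer_rate = facilities["mcc_rate"].mean()

# Flag facilities coding MCCs at more than twice the overall average rate --
# an illustrative cutoff, not a regulatory or statistical standard.
outliers = facilities[facilities["mcc_rate"] > 2 * peer_rate]

print(facilities)
print("Facilities to review further:", list(outliers["facility"]))
```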
In paragraph 83 of the suit, it states that “the relator developed a proprietary linear regression model to control for the possibility that there are certain patient characteristics which might indicate a higher likelihood a patient would have a MCC, allowing relator to isolate and calculate the specific impact defendants had on the abuse of a misstated MCC code after controlling for other characteristics.”
I found this particularly interesting because a) I have never heard of a “proprietary” linear regression model, since linear regression is about the most basic of statistical techniques; and b) even with the use of length of stay and discharge status, the relator would have had to know, through a large set of probe samples, how to apply the results of the regression to illustrate fraud. And I may be wrong, but I doubt that there was enough data available from claims for which fraud has actually been determined to create a database sufficient to support that assumption.
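For what it’s worth, a garden-variety version of such a model is a few lines of standard statistical code. The sketch below uses synthetic data and made-up covariates (age and a comorbidity score stand-in) purely to show the shape of the approach; it is not the relator’s model, and whatever coefficient it produces measures an unexplained difference in coding rates, not fraud.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic claim-level data, purely for illustration.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "has_mcc":     rng.integers(0, 2, n),    # 1 if the claim carried an MCC
    "age":         rng.normal(72, 8, n),     # hypothetical patient characteristic
    "charlson":    rng.integers(0, 6, n),    # hypothetical comorbidity score
    "target_hosp": rng.integers(0, 2, n),    # 1 if billed by the hospital under study
})

# Linear probability model: does the target hospital code MCCs more often
# after controlling for the patient characteristics we happen to have?
model = smf.ols("has_mcc ~ age + charlson + target_hosp", data=df).fit()
print(model.summary().tables[1])

# Even a positive, significant coefficient on target_hosp would only show an
# unexplained difference in coding rates -- it says nothing about whether any
# single claim was coded incorrectly, let alone fraudulently.
```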
Now, had the relator discussed the use of, for example, neural networks rather than linear regression, and had a substantive data set for training, I would be more likely to believe that there was a chance of being able to predict the likelihood of fraud. I have spent the last six years building, managing, and supporting a model that uses predictive analytical algorithms to identify potential audit risk, and even with the tens of thousands of claims that I use to train those algorithms, it would still be impossible to go beyond a prediction of risk. Unless and until the documentation for the suspected claim were reviewed, there would be no way to assume an error, much less fraud.
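To be clear about what such predictive models can and cannot do, here is a toy example – synthetic data and a small off-the-shelf neural network, nothing like my production model – that scores claims for audit risk. The score tells you which charts to pull; it does not, and cannot, tell you that a claim is false.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic training data: features stand in for claim attributes, and labels
# stand in for claims whose documentation was actually reviewed (the expensive part).
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 6))   # e.g., LOS, MCC flag, discharge code, etc. (hypothetical)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=5000) > 1.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=1)
clf.fit(X_train, y_train)

# The output is a probability of audit risk -- a triage score that tells you
# which charts to review, never a determination that a claim is false.
risk = clf.predict_proba(X_test)[:, 1]
print("Claims flagged for chart review:", int((risk > 0.8).sum()))
```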
The complaint said that the relator filtered her findings to only include those instances where “Providence used a misstated MCC code at two times the rate of comparable hospitals, or at least three percentage points of its entire patient population higher than other hospitals…” I was also curious about the use of the word “misstated,” as it assumes that there was some inappropriate upcoding. This word appears first in paragraph 51, where it states that “relator identified 271 combinations of principal diagnosis codes and misstated MCCs in which Providence excessively upcodes.” I am not a coder, but coders I know tell me that you cannot make anything close to this type of assumption without first reviewing the chart in its entirety.
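As best I can tell, the screen described in the complaint reduces to something like the sketch below (hospital names and rates are invented). Even when a hospital trips the filter, all that means is that its MCC rate is out of line with its peers – not that any specific claim was misstated.

```python
import pandas as pd

# Invented MCC-coding rates for one DRG family, expressed as a share of cases.
rates = pd.DataFrame({
    "hospital": ["Hospital X", "Peer 1", "Peer 2", "Peer 3"],
    "mcc_rate": [0.18, 0.07, 0.09, 0.08],
})

peer_rate = rates.loc[rates["hospital"] != "Hospital X", "mcc_rate"].mean()
target = rates.loc[rates["hospital"] == "Hospital X", "mcc_rate"].iloc[0]

# The complaint's stated screen: at least twice the comparable-hospital rate,
# OR at least three percentage points above it.
flagged = (target >= 2 * peer_rate) or (target >= peer_rate + 0.03)
print(f"Target rate {target:.0%} vs. peer rate {peer_rate:.0%} -> flagged: {flagged}")
```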
If I were Providence, I would surely like to see the logic the relator used to conclude that exactly twice the average rate, or a rate “three percentage points higher than in the other hospitals,” marks a false claim. I was probably more surprised by this statement than by any of the others. I have been involved in a lot of false claims cases, and there is a lot of time and energy that goes into reaching such a conclusion. Granted, one can identify anomalous or suspicious events and flag them for review, but as stated above, until the medical record is opened, one cannot even begin to apply the word “fraud” or infer a false claim, which is what appears to have happened here. While there are discussions in the complaint concerning human intelligence gained by interviewing former employees, my best guess is that this “proprietary” method had to do with the relator gaining access to and reviewing the actual medical records. Maybe that person had legal access to those records and maybe they didn’t, but in my opinion, such a review would have been necessary to come to the conclusions stated in the complaint.
The complaint also states that “benchmarking has the advantage of allowing for very specific and comparative groupings. This avoids imposing specific linearity on the data, which in turn gives relator’s methodology more statistical power and precision.”
What?
First of all, linearity is characterized by an ordered and predictable system, a benefit for this type of research, so I can only believe that in the absence of survival analysis, true linearity would have been an advantage, not a disadvantage. Second, the relator talks about doing “linear regression,” which depends on … wait for it … linear data! And it would have nothing to do with statistical power and precision unless it were associated with a sample – and if it were, linearity would still be a benefit. Plus, benchmarking against unvalidated databases is useful for only one thing, and that is identifying potential anomalies. This part sounds like a bunch of double-speak.
I have worked on dozens of fraud and false claims cases and testified as a statistical expert in more than a few, and one thing I know for certain is that one cannot extrapolate or infer fraud. And that comes right from the judge. If you have ever heard the detailed explanations a judge gives a jury on the highly specific requirements for a finding of fraud on each individual claim, you will understand why. In a recent case, the judge told the jury that in order to find the provider guilty of fraud, each of the claims in question had to meet five criteria. And when it was all done, the government attempted to extrapolate those findings to the universe of claims, and the judge agreed with my argument that fraud cannot be inferred or extrapolated, since the burden of proof is so high and so individualized.
In the end, I can’t comment as to Providence’s guilt or innocence. What I can say is that, in my expert opinion, the use of analytics alone cannot be used to determine fraud for a given claim or set of claims. I am not surprised to see this happening, because access to the public data can bring out both the best and the worst in an analyst. As for me, I choose to use my powers for good.
And that’s the world according to Frank.