Big Data Hasn’t Changed Everything


    Technology has a long way to go in mapping the variables of human life.



    The more I hear the term “big data,” the more suspicious I become. Not in an Edward Snowden, the evil government’s spying on us sort of way. If the curious of Fort Meade, Md., the National Security Agency’s home, wish to poke through my electronic sock drawers for signs of terror, they are more than welcome. Happy to do my bit for national security.

    No, the problem comes when the term becomes ubiquitous. It’s one thing for the NSA’s quants or scientists at the Large Hadron Collider or genome sequencers to talk about big data. Big nails need big hammers, and the phenomenon of big data is certainly real. The three Vs of volume, velocity and variety, coined by the techies at the Gartner Group, have created a gusher of data that clever minds can use to great effect.

    But it’s quite another thing when you start to hear how big data is going to upend everything. In their essential new guide to the subject, “Big Data: A Revolution That Will Transform How We Live, Work and Think,” Viktor Mayer-Schönberger and Kenneth Neil Cukier write: “Society will need to shed some of its obsession for causality in exchange for simple correlations: not knowing why but only what. This overturns centuries of established practices and challenges our most basic understanding of how to make decisions and comprehend reality.” We need to learn to trust what the data is telling us before we fully understand why.

    They add that such a drastic change to how we weigh evidence and decide will take a lot of fine tuning. Ethics, morality, civil liberties, everything risks being thrown under the big-data bus, unless we are exceedingly careful.


    Fortunately we have one giant data set around which to pivot this discussion. In the 1980s, the financial industry was transformed by the arrival of what at the time seemed like very big data indeed. Brokers were removed from trading floors and replaced by digital exchanges. Money which once bumped into borders now started to flow unimpeded. Myriad frictions were removed and the industry boomed. Computer scientists started popping up inside banks and hedge funds to identify trading opportunities within the great torrent of data. Many were and are truly brilliant. In a war, you would want them on your side breaking enemy codes.

    Things went awry when everyone started to think they could do it, when the decidedly adult pursuits facilitated by the new torrent of data fell into the hands of intellectual children who had no real idea of what they were doing. When the likes of Citibank and Societe Generale started to think they could play with the best of the hedge funds, it was time to start stuffing your cash in a mattress. When all you do is watch data models for instructions, for whats, with no idea of the whys, you become easy prey for the monsters of risk.

    What seems blindingly obvious in retrospect, that there was no alchemy capable of turning subprime loans into sure-thing derivatives, was either missed or intentionally glossed over by Wall Street’s big-data thinking.

    Managers are constantly being told that they are one hardware or software installation away from business nirvana. But they should bear in mind the lessons of the financial crisis the next time a consultant waltzes into their office declaring big data the next big thing, or even worse, a “paradigm shift.”

    Because big data as a technological opportunity and big data as a management theory are two separate things. However much big data can yield, information will never be perfect. As efficient as these data models become, managers will still have to make decisions with limited certainty about the outcomes. Data helps and has since the scouts of ancient armies returned with reliable numbers. Eisenhower at D-Day had more data than Hannibal at Cannae, but waging war remained a beast of a task. The challenge for managers has always been the human mind and heart, which seems punier than ever in the shadow of the terabyte.

    Consultants are already whipping up a flurry of winking big-data dashboards for managers, with every organizational activity reduced to a few key numbers. But still there will be rogue traders, rats in restaurant kitchens, and drunk machine operators. If big data is to escape the graveyard of managerial fads—knowledge management anyone?—it has a lot to prove.

    Data has been big for a while now and is getting exponentially bigger. But we shouldn’t feel inadequate because we rely on our animal traits like gut, intuition and bias. Technology has a long way to go in mapping the variables of human life. And the moment it starts to feel like tyranny, we have one lethal weapon in our arsenal. It is called the off switch.

    Mr. Broughton is the author of “The Art of the Sale: Learning from the Masters about the Business of Life” (Penguin Press, 2012).

    • 20 hours ago

    If you torch the inputs by bribing the rating agencies, Big Data will give you garbage.

    1 Recommendation

    • 19 hours ago

    Big data creates the need for more powerful analytic methods able to crack increasing complexity. It’s a never-ending chase in which mere humans lag far behind the very few schooled in the new tools. The end result is collapse.


    • 19 hours ago

    <<Managers are constantly being told that they are one hardware or software installation away from business nirvana. >>

    Yep. I’ve been hearing that for decades. It works wonders for computer vendors’ hardware and software sales, though.

    For the record, “big data” is the industry’s latest buzzword for data analysis techniques that have been used for decades. What’s happened recently is that computer hardware has become fast enough to crunch millions and billions of data records in a reasonable amount of time, so IT sales reps have spawned the illusion that “crunching big data” is the latest magic bullet that will fix whatever ails dysfunctional businesses without their managements having to do any real work.

    The problem, of course, is that there is a limit to the “resolution” of the data. Just crunching numbers faster does not always, or even usually, yield meaningful results. An analogy: the usefulness of a telescope is limited by the size of its mirror. You can only add powers of magnification up to the limit of the mirror’s ability to resolve distant objects. It’s useless to add a magnification power of 1000x to a 3-inch mirror because all you’ll see is a big, meaningless blur.

    “Big data” is like that too. You can “drill down” into the data all you want, but if the data isn’t useful enough to “resolve” anything, you’re not going to get meaningful results. You can ask Mitt Romney all about that one. He hired the best wonder boys money could buy to program his computers to “crunch big data” in order to turn out the vote, and, if anything, they caused him to lose by a larger margin than he would have lost without them.

    Not picking on Mitt Romney, but I am sure that his campaign people took false comfort in relying on computers to do what their candidate COULDN’T do, which was to turn out the voters. They should have spent zero dollars on computer systems and a couple thou on a P.R. firm to hone the image.

    Business does that all the time too. They expect that a computer system is going to sort out their bogus procedures that are due to sloppy management. No, it doesn’t work that way. You fix your procedures FIRST, then invest in computer systems. Even “big data” won’t bail out a dysfunctional management.

    6 Recommendations

      • 17 hours ago

      Actually, there are a lot of new techniques out there, too. This company has something different, and I’ve been using their tools.

      But in general I agree. You have to actually have data containing the signal you want to find in order to find anything. And looking for signals in the same old ways isn’t likely to help – if the signal wasn’t there when you had 1 million samples, why will it be there when you have a quadrillion?

      1 Recommendation

      • 2 hours ago

      I’m not sure your example of Mitt Romney in the most recent presidential election fits or enhances your argument. I recently heard a discussion involving a strategist for the Obama campaign, describing how their use of sophisticated data analysis techniques allowed them to identify the “persuadable” voters in the districts that could sway the election and concentrate their efforts where they would get the biggest payoff. This was of course using computing power to sift through huge databases to isolate specific instances, not general trends. But it was “Big Data” and modern information technology that made it possible.


    • 18 hours ago

    I’ve worked with small, medium, large, and big data my whole career. There are many problems with it, the biggest being that correlation doesn’t mean causation, not to mention non-normally distributed samples and universes.

    I gave a very simple lecture at a So Cal university to explain this, using cars in one example. Take a 500-person lecture hall and ask all the people who drive a Honda to stand up. Then ask each their demographics and why they chose a Honda. A few match Honda’s positioning (value, reliability, mileage), but the vast majority of those standing correlate neither with one another nor with Honda’s positioning.

    While you can detect interesting things in the data (there have actually been successes in NSA identification of terrorists by looking at “outlier” data), Big Data has a long way to go, and understanding what the data means once you see the results is an art as much as a science. And I haven’t even talked about the meaning of missing or no data and its effect. Then, of course, there’s the issue of getting your organization to act or change based on it. If you don’t get people trained in critical thinking to manage the “art,” the science and math behind it aren’t of much value.

    2 Recommendations

      • 18 hours ago

      The classic example of this for those of us who grew up in the ’60s is the “Input/Output Analysis” that LBJ’s “Whiz Kid” Secretary of Defense Robert McNamara tried to apply to his computerized models of the Vietnam War. McNamara was confident, based on his computer analysis, that the U.S. would prevail in Vietnam if we killed off “x” number of NVA and Viet Cong.

      That theory didn’t survive contact with reality. McNamara was discredited after he and his boss LBJ got a lot of our best young men killed in indecisive combat operations that killed all the requisite NVA/VC that McNamara’s model said needed to be killed, but without causing the enemy to call off the war.

      McNamara learned too late that if you’re going to rely on data analysis to make your decisions, you’d better make sure that you have ALL the inputs and outputs in your model. Human behavior is notoriously difficult to model in a computer, as McNamara and many who followed him have learned to their chagrin.

      2 Recommendations

        • 7 hours ago

        And yet even today McNamara is viewed as a visionary… by the left, at least.


        • 3 hours ago

        Having grown up in the Vietnam era of body counts on the news every night, I found Sorley’s 2011 “Westmoreland: The General Who Lost Vietnam” very interesting. Westmoreland was the perfect enabler of LBJ’s and McNamara’s schemes and biases.


    • 14 hours ago

    Read Nate Silver’s “The Signal and the Noise”. Big data means more signals and more noise.

    1 Recommendation

    • 5 hours ago

    The problem isn’t big data. The problem is management fads. Big data, of course, retains its significant (though limited) usefulness. And if your management loves fads, that’s the real problem…


    • 5 hours ago

    The article seemed interesting until it got to the part about banks trying to use big data to make sub-prime mortgages work by creating CMOs. He implies that they were driven to this by something they saw in “Big Data.” The truth is that sub-prime mortgages were jammed down their throats by Congress as a result of changes made to the Community Reinvestment Act, and the push accelerated after President Clinton signed the Gramm-Leach-Bliley bill in 1999. Andrew Cuomo was Secretary of HUD at the time and he put the sub-prime lending program into warp drive once that bill was passed. All that the banks tried to do was find ways to lay off their risk using CMOs and Credit Default Swaps. Guess what? That didn’t work!


    • 5 hours ago

    Can you trust the data when the desired meta-analysis is all skewed up:

    Tarnished Gold: The Sickness of Evidence-Based Medicine

    When the data generate statistics which become the average patient, then one can treat the average, not the individual. This is anathema to P4 medicine, which is Personalized, Predictive, Preventive and Participatory.

    See also, 50 studies every doctor should know:

    Has anyone met the average human? That is, someone with one breast and one testicle?


    • 2 hours ago

    Great piece. All of business wants to embrace big data while forgetting that data does not equal insight. Data can point you to correlations that may or may not be worth looking into, but it will not yield any insight into why those things are correlated, which is what you really need to make it actionable.

    I’ve also seen teams and companies try to be “data driven” only to run smack into a wall when they realize that the data will not literally tell them what to do. Data can inform strategy, illuminate possible paths and outcomes, but it cannot tell you definitively which path to follow.

    One could make an argument that American business executives have unprecedented amounts of data at their fingertips, yet are making worse decisions than ever before, because any notion of judgment, of management decisions based on core principles, or of old-fashioned leadership is rapidly being lost.

For a Statistically Savvy 2013

My print column offers tips, shared by statistics professionals and readers responding to my blog post, for how to make 2013 a more numerically savvy year. Not all the great suggestions could fit in the column, so here are some more, starting with those from readers:

Michael Dean, a senior marketing analyst in Minneapolis, wanted more clarity in weather news and forecasts. “How accurate are the five-, seven-, and 10-day forecasts?” Dean asked. “Can’t someone collect data on the predicted temperature various days in advance, and then see what temperature it ends up being? What is the range of error by the number of days out for the forecast? What times of the year or what regions of the country does this range vary the most? I am thinking I should ignore anything longer than a five-day forecast, but those may be off a lot, too.” (Some of Dean’s questions are answered by the website Forecast Advisor.)

Harvey Bale, a retired economist in Washington, D.C., wants to see monthly labor data news reports to include information about the chronically unemployed and the labor participation rate. “The narrow unemployment rate highlighted each month is relatively unimportant,” Bale said. “It masks the serious harm being suffered” by discouraged workers and involuntary part-time workers.

Dave Fitzpatrick, who works in marketing analytics in New York, wants to see more context around other numbers: Percentage changes from the year before, for instance, instead of just presenting raw statistics. “Too often we see aggregate statistics such as simple percentages cited without any context as to their direction and composition,” Fitzpatrick said. “A much more insightful way of communicating statistical results is to cite the percentage change or, better yet, predictive modeling results that can tell us the impact of one variable on another.”

Jeremy Schneider, another marketing-analytics professional, doesn’t want to see averages falsely smoothed out to create arresting statistics. “My pet peeve is when ads or articles cite murder rates or death rates by saying ‘That’s one murder every 10 minutes,’ or, ‘Someone is dying from starvation every five seconds,’ ” Schneider said. “That certainly might be the average murder or death rate but it’s not like every 10 minutes on the dot someone is dropping dead.” These stats may be used for a good cause, Schneider said, “but the impact is marred in my eyes by making that claim.”

Kelly Jackson, who teaches at Camden County College in Blackwood, N.J., would like to see better charting practices, with the Y-axis starting at 0 whenever possible. “One of the problems my students have is interpreting data and graphs that don’t use 0 as the starting point,” Jackson said. “Imagine a graph that starts vertically at 500 and shows bars of height 550 and 600.”
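Jackson's example can be checked with a couple of lines of arithmetic (the 500/550/600 numbers are the hypothetical ones from the quote):

```python
# Hypothetical chart from the quote: bars of 550 and 600 drawn on a
# Y-axis that starts at 500 rather than 0.
axis_start = 500
low_bar, high_bar = 550, 600

true_ratio = high_bar / low_bar  # what the data says: about a 9% difference
apparent_ratio = (high_bar - axis_start) / (low_bar - axis_start)  # what the chart shows

print(f"true ratio:     {true_ratio:.2f}")
print(f"apparent ratio: {apparent_ratio:.2f}")
```

The truncated axis turns a 9% difference into what looks like a doubling.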

Richard Hoffbeck, a research data analyst in Minneapolis, would like to see more mention of study design in reports about medical research. “I think a small population case-control study done 20 years ago should be weighted differently than a large-scale experimental trial,” Hoffbeck said.

Judea Pearl, director of the Cognitive Systems Laboratory at the University of California, Los Angeles, cites as a statistical pet peeve “the century-old confusion between correlation and causation,” a point he elaborated on in a recent interview with American Statistician News. (Pearl is the father of Daniel Pearl, the Wall Street Journal reporter who was murdered in Pakistan in 2002.)

Brad Carlin, professor and head of biostatistics at the University of Minnesota, mentioned a lesson from the success of Nate Silver, election forecaster for the New York Times: “Never believe in just one poll; always take some sort of average of all the polls you respect.” Other forecasters who also aggregated polls had success in this election cycle.

New York University mathematics professor Sylvain Cappell offered a few tips. Among them: “There’s a widespread disinclination to recognize how often choosing between alternative courses involves making a judicious balance between quantities, and thus making specific numerical formulations to be able to compute the advantageous tradeoff point,” Cappell said. “Recognizing that there are tradeoffs involves qualitative thinking but after that there’s just no short-cut to computing to see where the tradeoff point actually lies.”

Cappell added that sometimes simple computations, not complex ones, can suffice to aid in decision-making. “It’s amazing, even in our complex modern world, how many assertions fail simple ‘back of the envelope’ reasonable estimates with elementary computations,” he said.


        • 3:04 am December 29, 2012
        • Jonathan Seder wrote :

        Two more peeves:

        Relative risk needs to be framed with absolute risk – if a drug cuts a mortality rate “in half,” the improvement is not very interesting to a general audience if the absolute rate falls from 2 in fifty million to 1.
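Seder's point is easy to make concrete; here is a minimal sketch using his hypothetical numbers (mortality falling from 2 in 50 million to 1 in 50 million):

```python
# Hypothetical numbers from the comment: mortality falls from
# 2 in 50 million to 1 in 50 million.
population = 50_000_000
deaths_before, deaths_after = 2, 1

# The headline: risk "cut in half."
relative_reduction = (deaths_before - deaths_after) / deaths_before

# The context a general audience needs: the absolute change is tiny.
absolute_reduction = (deaths_before - deaths_after) / population

print(f"relative risk reduction: {relative_reduction:.0%}")
print(f"absolute risk reduction: {absolute_reduction:.1e} per person")
print(f"people treated to avert one death: {round(1 / absolute_reduction):,}")
```

Both framings are arithmetically true; only the second tells a reader whether the improvement matters.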

        Statistical significance should be distinguished from ordinary significance. One painkiller might be better than another with a high level of significance, but that statistically significant improvement might be minuscule, undetectable by consumers.

        • 10:43 am December 29, 2012
        • dqk wrote :

        Time and time again, journalists misuse the word “percent”. They write, for example, “The market fell 11 percent” when the market fell 11 percentage points.

        • 2:21 pm December 29, 2012
        • Prof Luis Pericchi wrote :

        Hello Carl,
        Interesting article! Insightful.
        The tide is turning: from “damned lies and statistics” to “statistics, the only way to decipher reality, to disentangle its tricks, to separate signal from the ocean of noise.”
        Congratulations on the new tide in favor of statistical thinking.

        • 11:52 am December 30, 2012
        • SW16 wrote :

        We often read nonsense such as “A earns ten times less than B,” when what is usually meant is “B earns ten times as much as A.” If the first were true, A would be paying his/her employer for the privilege of going to work.

        Can anyone give me a real-world example of where “X times less than” is likely to be correct? If not, and if journalists can’t be numerate, just ban the phrase “X times less than.”

        • 1:56 pm December 30, 2012
        • Philip B. Stark wrote :

        Thank you for bringing a spectrum of perspectives about numeracy.

        Here are some basic concepts I see butchered all too frequently:

        1) The sample is not the population.

        2) The margin of error is supposed to measure how far the sample-based result is likely to be from the results for the whole population, due to the luck of the draw in selecting the sample. The reported margin of error typically doesn’t take into account a variety of other sources of error, such as nonresponse and other biases. Such “non-sampling” errors can be much larger than the margin of error.

        3) “Random” is not the same as “haphazard” or “arbitrary.” It is a term of art. Generally, you have to work quite hard to make things random–it doesn’t happen “accidentally.” In most situations where people talk about probabilities, the probabilities are fictions: there really isn’t anything random.

        4) There is no such thing as “a statistically significant sampling” or “statistically significant sample size.” (I see this in legal documents frequently.)

        5) Don’t confuse “the chance of observing what was actually seen, assuming the hypothesis is true” with “the chance the hypothesis is true, given the observations.” (This is a common misinterpretation of p-values.) A related garble is saying “there’s only an X% chance that this result could be due to chance” in place of “on the assumption that a specific chance mechanism generated the data, the chance of observing those data would be small.”
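Point 5 can be made concrete with a toy screening calculation (all the numbers below are my own assumptions, not Stark's): even with p < 0.05, the chance that a "significant" result is a false alarm depends on the prior and the power, and can be far above 5%.

```python
# Assumed numbers for illustration: 1,000 hypotheses tested, 10% of which
# describe a real effect; each test run at alpha = 0.05 with 80% power.
hypotheses = 1_000
real_effects = 100
nulls = hypotheses - real_effects

alpha, power = 0.05, 0.80

true_positives = real_effects * power   # real effects that reach p < 0.05
false_positives = nulls * alpha         # true nulls that reach p < 0.05 anyway

significant = true_positives + false_positives

# Fraction of "significant" findings that are actually false alarms.
false_discovery = false_positives / significant
print(f"P(null | p < 0.05) = {false_discovery:.0%}")  # 36%, not 5%
```

The 5% figure answers "how often would chance alone produce this?", not "how likely is this finding to be wrong?" — the two only coincide under assumptions that rarely hold.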

        Here are some rules of thumb I’ve compiled for graduate students studying applied statistics:

        * Consider the underlying science. The interesting scientific questions are not always questions statistics can answer.
        * Think about where the data come from and how they happened to become your sample.
        * Think before you calculate. Will the answer mean anything? What?
        * The data, the formula, and the algorithm all can be right, and the answer still can be wrong: Assumptions matter.
        * Enumerate the assumptions. Check those you can; flag those you can’t. Which are plausible? Which are plainly false? How much might it matter?
        * A statistician’s most powerful tool is randomness—real, not supposed.
        * Errors never have a normal distribution. The consequence of pretending that they do depends on the situation, the science, and the goal.
        * Worry about systematic error. Constantly.
        * There’s always a bug, even after you find the last bug.
        * Association is not necessarily causation, even if it’s Really Strong association.
        * Significance is not importance. Insignificance is not unimportance.
        * Life is full of Type III errors.
        * Order of operations: Get it right. Then get it published.
        * The most important work is usually not the hardest nor the most interesting technically, but it often requires the most patience: a technical tour-de-force is usually worth less than persistence and shoe leather.

        • 10:14 am December 31, 2012
        • Steve D wrote :

        No more reporting of rates of change of rates of change. When someone’s tax rate changes from 4% to 5%, that should not be reported as a 25% increase. Some anti-environmentalists like to say that some fishery stocks have doubled recently. They don’t tell you the stocks went from 1% to 2% of what they were 50 years ago.
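The tax-rate example, spelled out (a sketch using the comment's own numbers):

```python
# The comment's example: a tax rate moving from 4% to 5%.
old_rate, new_rate = 4.0, 5.0

point_change = new_rate - old_rate                  # 1 percentage point
relative_change = (new_rate - old_rate) / old_rate  # a 25% relative increase

print(f"percentage-point change: {point_change:.0f} point")
print(f"relative change:         {relative_change:.0%}")
```

Both numbers are arithmetically correct; reporting only the 25% figure, without saying it is a relative change, is what the comment objects to.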

        More raw data. Don’t tell me someone pays only 10% of his income in taxes. Tell me how many dollars.

        When you tell me it will cost billions to address climate change, show me the economic forecasting model you used and its accuracy rate.

      • 3:02 pm January 2, 2013
      • Michael Dey wrote :

      Dear Mr. Bialik,

      I enjoyed your important article of 12/29-30/2012, entitled “Statistical Habits to Add, or Subtract, in 2013.” Mr. Rodriguez, President of the American Statistical Association, makes an excellent point on using experimentation to establish cause and effect. Beyond the randomized control trial (RCT) noted (also known as an A-B split in other fields), there are actually more powerful variations on the same theme for the same sample size as employed in testing a single change.

      The power of evaluating 20-40 changes to status quo, in a live business environment, is enormous. Especially when applied to complex issues faced by healthcare and education. Large, orthogonal statistical design starts with unlocking the creative energy of organizations and ends with putting sacred cows and folklore to the test. End results are almost always surprising and leap to solutions re-proven in subsequent implementation. Advantages across industry are largely latent due to the greater popularity of small designs (such as A-B splits) with smaller return.

      Performance tends to increase during a study as a result of standardization, without which applications in healthcare and education would be more problematic. Statistical design in fact stops any roulette that might accidentally occur.

      It’s often thought that problems can be solved by data analysis. However, many of the great inventions used no data. Statistical design structures innovation, then proves or disproves it while providing data. To this extent it is simply the scientific method (induction-deduction), whereas much data analysis is heavy on deduction. For example, “root cause” analysis, even where successful, still leaves unanswered what the solution(s) are. Statistical design leaps to those solutions, among which about half of expert ideas will be found to work.
      It is generally accepted that the first RCT appeared in the literature in 1948 (though there was earlier work). That was an important step in the mainstream use of the scientific method and remains important in medical research as well as generally. With the ubiquity of computing power, more sophisticated study designs become even more attractive (while remaining easy for users to act on once designed well).


      Michael Dey
      President & CEO
      Nobigroup, Inc.

Using Big Data for Recruiting

This Gild algorithm does not sound very impressive. It includes 300 variables, but some of those are the skills a person lists for him/herself on LinkedIn and the ranking of the person’s school in US News & WR, which is a crap ranking. Also, there is no discussion of how they determined that these 300 variables are drivers. What are the outcomes they are measuring against? An employer’s rating of an employee over time, followed by a regression to see which variables a highly rated individual touts? The approach is nice in theory, but the article doesn’t give me confidence in the accuracy of the output.