Nasdaq’s Alternative Data Insights

Vetting, curating, and testing data

Hamlin Lovell
Originally published in the November 2018 issue

Nasdaq Analytics Hub launched in May 2017 and soon after released a white paper, “Intelligent Alpha”. This highlights how harnessing certain alternative datasets – Twitter sentiment, retail investor sentiment-inspired shorts, parsed central bank communiqués, and selected corporate filings – could have historically generated substantial outperformance over multi-year periods. The analysis, based on back-tests, applies to both long-only and hedge fund investors.

The costs of using alternative data include both vendor fees and the internal resources that asset managers might devote to cleaning, managing, analysing, storing, and testing the data. “Analytics Hub could be cheaper, more convenient and more efficient, as its structure, APIs, formats, identifiers, assumptions, legal support, and service level agreements are all standardised. This means that operations and administration are more efficient, and the offering is at least as good as asset managers would get by dealing direct with a provider,” argues Bill Dague, Nasdaq’s Head of Alternative Data. Under the common delivery umbrella, clients can pick which datasets to subscribe to from a growing menu. The curated datasets alert asset managers to potential predictive power in data, and Nasdaq can provide generic case studies showing how the data might be used. But it is for the investors to determine how to use the data for signal generation, portfolio construction, risk management and so on, in the context of their own strategies.

A growing team  

The mindset and culture of Nasdaq’s Analytics Hub have been shaped by Dague, who joined Nasdaq as a software engineer in 2014, and his team. “Having initially used third party consultants, we built our own team because we expected to do a better job. The team of 14 is growing and includes data scientists and data engineers, mainly located in a centre of excellence in Boston, where we have established strong partnerships with local universities. The team will grow as Analytics Hub grows, to work on crafting individual products, and building out automation for every dataset,” explains Dague.  

Dague’s academic background, studying physics at the University of Chicago, has brought three key qualities to his current role. “First, it taught me how to frame a problem in simple terms, and then in rigorous quantitative terms, contributing to problem-solving rigour. Second, how to program and work with large datasets, applying statistical measures to them. Third, scientific rigour, which involves figuring out what you can know, and what is needed for a high degree of confidence,” he recalls.

“When hiring, we look for a similar scientific mentality that demonstrates problem-solving rigour; curiosity, which is more of a personal trait; and domain or subject-matter expertise – so that people know what the objectives are and what questions to ask. Candidates with the right skillset will usually have studied maths, science or technology, or financial engineering, but we are open minded about those who have studied linguistics, which analyses the structure of language and builds models,” he explains. Since fewer women than men study STEM (Science, Technology, Engineering and Maths) subjects to the highest levels, Dague counts himself lucky to have interviewed some fantastic female candidates, of whom three or four have been hired. Among Dague’s co-workers in the early days of the project was Lisa Schirf, who later became COO, Data Strategies Group and AI Research at Citadel LLC, and featured in The Hedge Fund Journal’s 2017 50 Leading Women in Hedge Funds survey in association with EY.

Sourcing, vetting and storing data 

“Essentially any non-financial data is potentially in scope, so Analytics Hub casts the net as wide as possible,” says Dague. The group has so far seen 450 different providers of data, and researched a wide array of data sources. “We take care to see that data is sourced legally, is likely to have some longevity, and that the company has an incentive structure that makes sense. These criteria screen out a lot of companies!” says Dague.

Personally Identifiable Information (PII) is absolutely off limits. “We do not even want to see data that contains PII and nor do our clients. It is not worth the compliance risk, no matter what the edge is,” says Dague.  

Analytics Hub might obtain exclusive access to some datasets, but would sell them to multiple customers. The firm has in fact rarely sought to bind a data supplier to exclusivity, as Dague thinks that the information edge comes more from how the dataset is used, interpreted or combined with other data. Additionally, Dague has noticed clients are showing markedly less interest in exclusive datasets.

At the end of the pipeline, Analytics Hub delivers data to clients in a digestible and usable format, potentially in Excel. But the raw data is generally unstructured and messy, and datasets can be gargantuan: for instance, satellite and GPS location data can add up to terabytes. Therefore, cheap and scalable cloud computing and storage are essential for using alternative data. “As building a private cloud would be cost-prohibitive, data is stored in the public cloud with major providers. We were early to invest in infrastructure and security techniques, such as encryption, to get comfortable with accessing public cloud in a secure way,” explains Dague.

Data hygiene 

After acquiring raw data, “we take a rigorous and scientific approach to check the data for errors, missing fields, changing schemas, and structural changes in the data such as retraining and changes in treatment,” says Dague. 
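
As an illustration only, not Nasdaq’s actual pipeline, a minimal pandas sketch of the kind of hygiene checks described might look like the following; the column names and thresholds are assumptions.

```python
import pandas as pd

# Hypothetical illustration: basic hygiene checks on a raw vendor file.
# Column names and thresholds are assumptions, not Nasdaq's actual schema.
EXPECTED_COLUMNS = {"date", "ticker", "value"}

def check_dataset(df: pd.DataFrame) -> dict:
    """Return a small report of common data-quality problems."""
    report = {}
    # Schema drift: columns that disappeared or appeared versus expectations.
    report["missing_columns"] = sorted(EXPECTED_COLUMNS - set(df.columns))
    report["unexpected_columns"] = sorted(set(df.columns) - EXPECTED_COLUMNS)
    # Missing fields: share of null cells per expected column that is present.
    present = [c for c in EXPECTED_COLUMNS if c in df.columns]
    report["null_share"] = df[present].isna().mean().to_dict()
    # Duplicate records on the natural key.
    if {"date", "ticker"}.issubset(df.columns):
        report["duplicate_rows"] = int(df.duplicated(subset=["date", "ticker"]).sum())
    # Structural breaks: a crude flag for a sudden jump in daily record counts.
    if "date" in df.columns:
        counts = df.groupby("date").size()
        report["count_jump"] = bool((counts.pct_change().abs() > 5).any())
    return report

if __name__ == "__main__":
    raw = pd.DataFrame({
        "date": ["2018-01-02", "2018-01-02", "2018-01-03"],
        "ticker": ["AAPL", "AAPL", "MSFT"],
        "value": [1.2, 1.2, None],
    })
    print(check_dataset(raw))
```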

Once the data has been cleaned through this process, “it needs to be standardised, with common time stamps and common symbology, such as open identifiers, mapped in a rigorous way that avoids historical biases. The structuring of the data is guided by its intended end use, so it will be manipulated to align with the investment thesis,” he continues.
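
A hedged sketch of that standardisation step, assuming a hypothetical vendor feed with local timestamps and its own symbols; the mapping table below is invented for illustration.

```python
import pandas as pd

# A minimal sketch of standardisation, assuming a hypothetical vendor feed.
# The point-in-time mapping table is invented for illustration.
symbol_map = pd.DataFrame({
    "vendor_symbol": ["APPLE INC", "MSFT US"],
    "identifier":    ["AAPL", "MSFT"],             # a common symbology of your choice
    "valid_from":    ["2015-01-01", "2015-01-01"],
    "valid_to":      ["2099-12-31", "2099-12-31"],
})
symbol_map[["valid_from", "valid_to"]] = symbol_map[["valid_from", "valid_to"]].apply(pd.to_datetime)

def standardise(feed: pd.DataFrame) -> pd.DataFrame:
    out = feed.copy()
    # Common time stamps: localise the vendor's local time, then convert to UTC.
    out["timestamp"] = (pd.to_datetime(out["local_time"])
                        .dt.tz_localize("America/New_York")
                        .dt.tz_convert("UTC"))
    # Point-in-time symbology mapping avoids applying today's identifier to
    # history, one source of the "historical biases" mentioned above.
    out = out.merge(symbol_map, on="vendor_symbol", how="left")
    ts_naive = out["timestamp"].dt.tz_localize(None)
    in_window = (ts_naive >= out["valid_from"]) & (ts_naive <= out["valid_to"])
    return out.loc[in_window, ["timestamp", "identifier", "value"]]

feed = pd.DataFrame({
    "local_time": ["2018-06-01 09:30", "2018-06-01 09:31"],
    "vendor_symbol": ["APPLE INC", "MSFT US"],
    "value": [0.8, -0.2],
})
print(standardise(feed))
```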

Much of this work is automated, with engineering running a battery of tests, but some human intervention is needed as no two datasets are the same. Measured in terms of time, the automated steps now run so quickly that the manual element accounts for 99% of the effort.

Testing and extracting insights 

After data has been validated and structured, Analytics Hub investigates its predictive power – either in raw format or reshaped in some way, aggregated, or disaggregated. 

Back-tests may identify strong performance, which is most prized by hedge fund managers if it is pure alpha, and therefore well suited to a market-neutral strategy that might be long the top decile and short the bottom decile. Back-tests allow for factors such as trading costs, slippage, borrowing costs, and liquidity. Various biases are automatically controlled for, including hindsight, survivorship, and look-ahead bias, as well as Simpson’s Paradox, whereby trends apparent in individual datasets disappear upon aggregation.
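
The mechanics can be sketched as follows; this is a simplified illustration rather than Nasdaq’s back-testing engine, and the cost figures are assumptions.

```python
import numpy as np
import pandas as pd

# A simplified sketch of a decile long/short back-test on a panel of stocks.
# Trading and borrow cost figures are illustrative assumptions only.
TRADING_COST_BPS = 10      # per side, per rebalance
BORROW_COST_BPS  = 25      # annualised, charged on the short book

def decile_long_short(panel: pd.DataFrame) -> pd.Series:
    """panel columns: 'date', 'ticker', 'signal', 'fwd_return' (next period)."""
    def one_period(g: pd.DataFrame) -> float:
        deciles = pd.qcut(g["signal"], 10, labels=False, duplicates="drop")
        long_ret = g.loc[deciles == deciles.max(), "fwd_return"].mean()
        short_ret = g.loc[deciles == deciles.min(), "fwd_return"].mean()
        gross = long_ret - short_ret
        # Crude haircut: two traded legs plus monthly borrow on the short book.
        costs = 2 * TRADING_COST_BPS / 1e4 + BORROW_COST_BPS / 1e4 / 12
        return gross - costs
    return panel.groupby("date").apply(one_period)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dates = pd.date_range("2018-01-31", periods=12, freq="M")
    panel = pd.DataFrame({
        "date": np.repeat(dates, 100),
        "ticker": np.tile([f"S{i}" for i in range(100)], 12),
        "signal": rng.normal(size=1200),
        "fwd_return": rng.normal(0, 0.05, size=1200),
    })
    monthly = decile_long_short(panel)
    print("Annualised mean:", 12 * monthly.mean())
```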

“When testing for statistical significance, there is no fixed bar, as it depends on the use case, and what else is in the market. For instance, we have found that many datasets related to or derived from market activity, can show a high correlation to momentum. If a data source is really unique, tests might be more lenient on statistical measures,” points out Dague. 
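
One such check can be illustrated in a few lines of Python on simulated data: measure a candidate signal’s correlation with a momentum proxy, then see what is left after residualising it away. Both series below are simulated for illustration.

```python
import numpy as np

# A hedged sketch of the momentum check mentioned here, on simulated inputs.
rng = np.random.default_rng(1)
n = 500
momentum = rng.normal(size=n)                      # e.g. a trailing-return rank proxy
new_signal = 0.6 * momentum + rng.normal(size=n)   # a candidate alternative-data signal

corr = np.corrcoef(new_signal, momentum)[0, 1]
# Residualise the signal against momentum to isolate the genuinely new part.
beta = np.polyfit(momentum, new_signal, 1)[0]
residual = new_signal - beta * momentum
print(f"Correlation with momentum: {corr:.2f}")
print(f"Residual signal std (the genuinely new part): {residual.std():.2f}")
```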

When hiring, we look for a similar scientific mentality that demonstrates problem-solving rigour; curiosity, which is more of a personal trait; and domain or subject-matter expertise – so that people know what the objectives are and what questions to ask.

Bill Dague, Head of Alternative Data, Nasdaq

A toolbox of testing techniques 

“A lot of good data science is about knowing when to apply which technique. The hardest thing is not necessarily applying techniques, but more about clearly defining the problem, knowing the objective, and picking the model, technique or tool that best matches the data and business outcome. We may use recurrent neural networks or convolutional neural networks for some tasks, but not others. There is no point using a chainsaw when you could use a butter knife. This is the real secret sauce,” reveals Dague. 

AI and machine learning techniques can be used to glean insights from the data, mainly using open-source languages and toolkits: Python, R, TensorFlow, Pandas, and Keras. Analytics Hub also has some patents pending for certain methods and algorithms.

“But we prefer the term ‘machine intelligence’ to ‘machine learning’ or ‘artificial intelligence’ because the latter two can be construed as meaning there is something inauthentic or artificial about the techniques, whereas we view them as being more about automating aspects of human cognition,” explains Dague. 

Most of the machine learning used is “supervised”, but occasionally Analytics Hub will keep a more open mind about what relationships might lie in the data and “let the data speak”, with the approach being to some degree “unsupervised”.
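
The contrast can be sketched on simulated data; the scikit-learn calls below are purely illustrative and not a description of Analytics Hub’s toolchain.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

# Illustrative contrast on simulated data: the same feature matrix used in a
# supervised way (predict a labelled target) and an unsupervised way (let the
# data suggest its own groupings, no label used).
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))                            # features per stock
y = X[:, 0] * 0.1 + rng.normal(scale=0.05, size=300)     # labelled outcome

supervised = Ridge(alpha=1.0).fit(X, y)                               # learns a mapping to the label
unsupervised = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X) # no label used

print("Supervised R^2:", round(supervised.score(X, y), 3))
print("Cluster sizes:", np.bincount(unsupervised.labels_))
```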

Dague is well aware of the risk of computers data-mining or over-fitting to identify relationships that do not prove to be repeatable. “If you torture the data long enough it will confess. The best weapon to fight ‘spurious correlations’ is common sense. The second line of defence is rigorous, structured processes, with clearly definable outcomes,” he expounds. 
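
A toy example of that second line of defence: when many candidate signals are tried on pure noise, the best in-sample result can look impressive, yet it evaporates on held-out data. Everything below is simulated.

```python
import numpy as np

# Simulated noise: any in-sample "edge" found among many tries should vanish
# out of sample, which is exactly the point of holding data back.
rng = np.random.default_rng(3)
returns = rng.normal(size=1000)
signals = rng.normal(size=(50, 1000))          # 50 candidate signals, all pure noise

split = 500
in_corr = np.array([np.corrcoef(s[:split], returns[:split])[0, 1] for s in signals])
best = int(np.argmax(np.abs(in_corr)))         # the signal that "worked" in sample
out_corr = np.corrcoef(signals[best, split:], returns[split:])[0, 1]

print(f"Best in-sample correlation (of 50 tries): {in_corr[best]:.2f}")
print(f"Same signal out of sample: {out_corr:.2f}")
```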

There are always trade-offs in a world of scarce resources. “The more empirical and rigorous you are, the fewer datasets you look at. We need to balance the explore versus exploit paradigm,” he says.

Percolating the funnel

Analytics Hub has researched datasets that include many well-known examples of alternative data: App store data; product prices; AIS-derived vessel data; corporate jets data; spatial information analytics; transaction data; social media sentiment; and communications data.

Geolocation data provides a case study of one sort of dataset with which Analytics Hub decided not to work. Dague explains why. “Geolocation data from cell phone pings is very hard to work with, especially in its most raw form. You would need to understand different devices, versions of software, geographic biases, demographic information, and have a places database with historically accurate data, such as the latitude and longitude location of every Starbucks store in 2012. You would then need to aggregate this into meaningful measures, by – say – merging geolocation data with credit card data. But ultimately, this is only a proxy for shopping, for which there are other better sources, such as email receipts and shopping emails. Overall, using geolocation data would involve a lot of work for something not as good as other datasets”.
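
Even the single step of joining pings to a historically accurate places database is non-trivial; the toy sketch below, with invented coordinates and a crude proximity cut-off, gives a flavour of the workload Dague describes.

```python
import pandas as pd

# A hypothetical sketch of one step only: joining raw pings to a point-in-time
# places database and aggregating into daily visit counts. Coordinates and the
# proximity threshold are invented; a real pipeline would use proper geofences.
pings = pd.DataFrame({
    "device_id": ["a", "a", "b"],
    "ts": pd.to_datetime(["2012-05-01 08:01", "2012-05-01 08:03", "2012-05-01 09:00"]),
    "lat": [40.7411, 40.7412, 40.7590],
    "lon": [-73.9897, -73.9896, -73.9845],
})
places_2012 = pd.DataFrame({          # historically accurate as of 2012
    "store_id": ["SBUX_001", "SBUX_002"],
    "lat": [40.7411, 40.7590],
    "lon": [-73.9897, -73.9845],
})

THRESHOLD_DEG = 0.0005                 # crude proximity cut-off for illustration

joined = pings.merge(places_2012, how="cross", suffixes=("", "_store"))
near = joined[(joined["lat"] - joined["lat_store"]).abs().lt(THRESHOLD_DEG)
              & (joined["lon"] - joined["lon_store"]).abs().lt(THRESHOLD_DEG)]
daily_visits = (near.assign(date=near["ts"].dt.date)
                    .groupby(["store_id", "date"])["device_id"].nunique())
print(daily_visits)
```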

One data provider that Analytics Hub has chosen to work with is Prattle, which is used for analysing earnings transcripts. “We first became familiar with Prattle when it used Natural Language Processing to gauge the hawkishness or dovishness of central bank statements. Then the same method was applied to corporate releases,” says Dague.
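
Prattle’s models are proprietary, but the general idea of scoring text on a hawkish/dovish axis can be caricatured with a toy lexicon; the word lists and weights below are invented for illustration.

```python
import re

# Toy lexicon-based tone scorer. Word lists and weights are invented; this is
# not Prattle's method, only a sketch of the hawkish/dovish scoring concept.
HAWKISH = {"tighten": 1.0, "inflationary": 0.8, "overheating": 0.9, "raise": 0.6}
DOVISH  = {"accommodative": -1.0, "easing": -0.8, "slack": -0.6, "cut": -0.7}

def score(text: str) -> float:
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    total = sum(HAWKISH.get(t, 0.0) + DOVISH.get(t, 0.0) for t in tokens)
    return total / len(tokens)          # normalise by document length

statement = ("The committee judges that an accommodative stance remains "
             "appropriate, but stands ready to raise rates if inflationary "
             "pressures build.")
print(f"Net tone: {score(statement):+.3f}")   # positive = hawkish, negative = dovish
```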

Nasdaq has its finger on the pulse of ESG trends, in terms of which stocks are owned by ESG funds. A recent project looked at a universe of 4,000 ESG funds in the eVestment database to gauge the overlap between their holdings, and how this has changed over time. There was a fair degree of overlap, with a long tail of non-overlapping holdings as well. With the ESG Holdings data, the objective is not only alpha generation, but also to provide a resource to help ESG investors benchmark their investment universe against peers. The ESG Holdings dataset combines these inputs to create something greater than the sum of its parts, displayed in the charts in Fig.1.
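
The overlap measurement itself is simple to sketch; the fund names and holdings below are made up, whereas the real study drew on roughly 4,000 funds in eVestment.

```python
from itertools import combinations

# A hedged sketch of measuring overlap between ESG fund holdings.
# Fund names and tickers are invented for illustration.
funds = {
    "FundA": {"AAPL", "MSFT", "NVDA", "ADBE"},
    "FundB": {"AAPL", "MSFT", "CRM"},
    "FundC": {"TSLA", "NEE", "ENPH"},
}

def jaccard(a: set, b: set) -> float:
    """Share of holdings two funds have in common, relative to their union."""
    return len(a & b) / len(a | b)

for (name_a, hold_a), (name_b, hold_b) in combinations(funds.items(), 2):
    print(f"{name_a} vs {name_b}: overlap {jaccard(hold_a, hold_b):.2f}")
```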

We prefer the term ‘machine intelligence’ to ‘machine learning’ or ‘artificial intelligence’ because the latter two can be construed as meaning there is something inauthentic or artificial about the techniques.

As of September 2018, Nasdaq offers 16 alternative datasets, listed below.

  • Financial Systemic Exposure
  • Global News Exposure
  • Supply Chain Network Exposure
  • Multi-Layered Exposure
  • Emerging Technology
  • Earnings Quality
  • ESG Holdings Intelligence
  • ETF Reference Data
  • Social Media Sentiment
  • Technical Analysis
  • G10 Communiqué (Central Banks)
  • Premium Alpha
  • Twitter Sentiment
  • Corporate Filings
  • Multi-Expert Long with Short Hedge
  • Corporate Earnings

For now, the main focus is on providing insights for equity investors, with a bias towards US equities, where public data is more prevalent.

The full spectrum of active buy-side firms, ranging from the most sophisticated quant funds to much smaller discretionary shops, comprises existing or potential users of alternative datasets. “Highly diversified quantitative funds typically want some incremental knowledge on 3,000 to 10,000 stocks. It is harder to add value for concentrated, sector specialist funds that already have immersive knowledge of each holding, though longer term this heightened challenge of seeking out unique insights could be more rewarding,” says Dague.

No offering specifically for retail investors has been rolled out, though Dague envisages a market niche in serving prosumers, who buy services of a quality somewhere between those aimed at consumers and those aimed at professionals.

Credit, fixed income and commodities could be added later, broadening out the offering to cater for other strategies, such as macro funds.

There are requests for customisation. “Though the service is not yet fully customisable, we take pride in being agile in responding to client feedback,” says Dague.

He does not speculate on specific targets for numbers of datasets, but as the number of suppliers is exploding, Analytics Hub will grow the offering as fast as possible. Dague anticipates, “Analytics Hub might eventually offer hundreds or thousands of datasets, but will be disciplined in balancing quality with quantity”. 

Where Analytics Hub fits into GIS and Nasdaq 

Nasdaq’s investment data and analytics unit, Global Information Services (GIS), includes Analytics Hub; eVestment, acquired in 2017 for $705 million; and strategic advisory company Nasdaq Dorsey Wright, acquired in 2015 for $225 million. “The objective is to be focused on the buy side, but removed from the exchange, and closer to the investment decision making,” says Dague.

GIS has two other pillars – market data products and an index licensing business that can eventually be used to launch ETFs. A synergy within these sub-groups is the first alternative data index based on disruptive technology, the Nasdaq Yewno Global Disruptive Tech Benchmark. Disruption is gauged by tallies of patents in 35 key disruptive technologies, including batteries, blockchain, AI and virtual reality. The Bloomberg ticker is NYDTB. There is no ETF (yet). Long term, Dague “sees a lot of opportunity to bring actionable data to the masses”.
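
A hypothetical sketch of that tally-based approach (the actual Yewno methodology is more involved): count each company’s patents across the technology categories and derive naive proportional weights. The company names and counts below are invented.

```python
import pandas as pd

# Hypothetical illustration of a patent-tally approach: count each company's
# patents across disruptive-technology categories and derive index weights.
# Companies, categories and the weighting scheme here are assumptions.
patents = pd.DataFrame({
    "company": ["CoA", "CoA", "CoB", "CoC", "CoC", "CoC"],
    "technology": ["AI", "batteries", "blockchain", "AI", "virtual reality", "AI"],
})

tallies = patents.groupby(["company", "technology"]).size().unstack(fill_value=0)
exposure = tallies.sum(axis=1)                 # total disruptive-patent count
weights = exposure / exposure.sum()            # naive proportional weighting
print(weights.round(3))
```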

GIS is one of four Nasdaq verticals. The other three are markets, enterprise software used to power 120 exchanges, and corporate services, helping companies to go public and advising them on investor relations and share price action. The natural big picture synergy amongst these is that healthy markets create more data and more demand for issuance.