The UK Data Service collection includes major UK government-sponsored surveys, cross-national surveys, longitudinal studies, UK census data, business data, and qualitative data. The Service provides access to these nationally and internationally significant data assets for research, teaching, skills development, and policymaking. The data reflects issues that affect human lives and experience on every continent, from birth through education, employment, social interaction, and old age. The Service has been designated a Place of Deposit by the UK National Archives since 2005, enabling the Service to curate public records.
UK Data Service’s Big Data Support Network—led by UK Data Service Associate Director Nathan Cunningham—provides users with access to big data for social and economic research, policymaking, and commercial use. “This data has significant value for the UK and internationally,” says Cunningham. “We want to broaden access as much as possible while meeting stringent privacy and security requirements. Some of our data is in the public domain, and some contains sensitive information. Controlling access is critical.”
UK Data Service sets access conditions based on the content of the data, keeping each collection as open as possible and as secure as necessary. An increasing number of collections are available as open data without the need to register. Many others are available under a standard end-user license with straightforward online registration.
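The tiered access rules described above can be pictured as a simple check. This is only an illustrative sketch: the tier names, the `can_access` function, and the rule for the most restricted tier (registration plus explicit approval) are assumptions, not the Service's actual policy engine.

```python
from enum import Enum


class AccessLevel(Enum):
    """Hypothetical labels for the access conditions described in the text."""
    OPEN = "open"                          # no registration needed
    END_USER_LICENSE = "end_user_license"  # online registration required
    RESTRICTED = "restricted"              # assumed tier for sensitive data


def can_access(level: AccessLevel, registered: bool = False,
               approved: bool = False) -> bool:
    """Return True if a user meeting the given conditions may read the data."""
    if level is AccessLevel.OPEN:
        return True
    if level is AccessLevel.END_USER_LICENSE:
        return registered
    # Assumption: sensitive data needs registration and explicit approval.
    return registered and approved
```

An anonymous visitor would pass the check only for open collections, while a registered user would also reach collections under the end-user license.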
To meet these needs, the Service chose a hybrid IT architecture, hosting some data and services in on-premises data centers and some in highly scalable, low-cost cloud technology. At the same time, the Service wanted to make identity authentication and access simple and transparent for end users regardless of data location.
As a publicly funded organization with its roots in academia, UK Data Service chose open-source technology in an effort to get the most value for its money. At the same time, Cunningham emphasizes the value of private-sector cloud services. “It’s not a good use of resources for us to try to replicate the millions of dollars in investments that vendors are making to stand up innovative, reliable, secure services,” he says. “I knew we could get much more for our money by working with a commercial cloud provider.”
The solution also needed to be highly scalable and elastic, supporting significant future growth in the most cost-effective manner possible. For example, the Service will eventually host data from the UK Smart Meter project, an ongoing initiative to deploy intelligent energy-metering devices in as many as 50 million UK homes.
Finally, the organization needed to accomplish all of this for a reasonable cost. “We have a pretty modest budget compared to a private enterprise,” says Cunningham. “Our goal was to respond to high-demand process calls with a predictable, sustainable cost, while delivering a high level of performance for users.”
“From the get go, I knew we needed to use a Hadoop Distributed File System, but I didn’t want to spend our money replicating high-quality services from commercial vendors,” says Cunningham. After evaluating a number of solutions, including Microsoft Azure, the UK Data Service chose Amazon Web Services (AWS) because of AWS’ highly scalable and cost-effective storage and compute services, as well as its ability to accommodate open-source software and hybrid architecture.
UK Data Service is implementing Amazon Elastic Compute Cloud (Amazon EC2) to run its data pipeline, Amazon Simple Storage Service (Amazon S3) to store the cloud-based portion of its data lake, and Amazon Relational Database Service (Amazon RDS) for SQL Server to handle database queries.
Designing the architecture has been a two-year journey because of the Service’s need to comply with complex security and privacy requirements, including ISO 27001 security audits. The design is the product of a close working relationship between the Service, AWS, Hadoop expert Hortonworks, and Cloudwick, an AWS Partner Network (APN) Advanced Consulting Partner focused on open source and advanced analytics.
The data-ingest pipeline is based on Kylo, an open-source platform for data-lake management. It standardizes the data according to Service schemas and pushes it to the appropriate location based on security needs.
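The two steps named here, standardizing against a schema and then routing by security need, can be sketched in a few lines. This is not Kylo code or the Service's configuration: the schema, the sensitivity labels, and the destination names are all hypothetical. In particular, sending sensitive data on premises is an assumption made for the example.

```python
# Hypothetical schema: field name -> type each value is coerced to.
METER_SCHEMA = {"meter_id": str, "reading_kwh": float}

# Hypothetical routing table. The source does not say which data goes where;
# placing sensitive data on premises is an assumption for illustration.
DESTINATIONS = {
    "public": "cloud",
    "sensitive": "on_premises",
}


def standardize(record: dict, schema: dict) -> dict:
    """Keep only the schema's fields and coerce each to the schema's type."""
    return {name: cast(record[name]) for name, cast in schema.items()}


def route(record: dict, schema: dict, sensitivity: str) -> tuple:
    """Return (destination, standardized record) for one incoming record."""
    return DESTINATIONS[sensitivity], standardize(record, schema)


destination, clean = route(
    {"meter_id": 17, "reading_kwh": "3.5", "note": "dropped"},
    METER_SCHEMA,
    "sensitive",
)
```

In this sketch, fields outside the schema are dropped and the remaining values are coerced before the record is handed to its destination, so every downstream store sees the same shape.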
A consistent governance model is being applied to ensure that service calls are pushed to the right infrastructure assets, whether on AWS or in the organization’s data center. The data lake is hosted in Amazon S3 using Apache HBase, and Apache Spark is used to push data into data-search and visualization services for end users.
The solution will offer seamless, powerful search and analytics to users, enabling cell-level queries of any concept held in the data lake. High-value or high-usage data is pushed into user-friendly visualization tools such as Kibana. “Regardless of where data is held, the hybrid architecture is invisible to end users,” says Cunningham. “The service will be able to be picked up at any point, providing the same experience to users wherever the data is located.”
The Service’s architecture will also support enrichment of data for better insights. “With the smart-meter data, for example, even though there are hundreds of millions of rows, you only have a few variables,” says Cunningham. “To do interesting things like figuring out how much energy users are saving with efficient appliances, you have to add more variables. With AWS, we will be able to scale to accommodate these increasingly massive data sets, supporting deeper insights.” The system is highly elastic, giving the Service the flexibility to accommodate unpredictable workloads.
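As a rough illustration of the enrichment Cunningham describes, the sketch below joins meter readings to a second table that adds a variable, then compares average consumption across that variable. The data, field names, and the `efficient_appliances` flag are invented for the example, and at real scale this join would run in a distributed engine rather than in plain Python.

```python
# Invented sample data: readings carry only a few variables.
readings = [
    {"meter_id": "m1", "kwh": 4.0},
    {"meter_id": "m2", "kwh": 2.5},
    {"meter_id": "m3", "kwh": 3.0},
]

# Hypothetical enrichment table keyed by meter, adding one more variable.
households = {
    "m1": {"efficient_appliances": False},
    "m2": {"efficient_appliances": True},
    "m3": {"efficient_appliances": True},
}


def enrich(rows: list, lookup: dict) -> list:
    """Join each reading to its household attributes, skipping unmatched meters."""
    return [{**row, **lookup[row["meter_id"]]}
            for row in rows if row["meter_id"] in lookup]


def mean_kwh_by(rows: list, flag: str) -> dict:
    """Average kWh for each distinct value of the given flag."""
    totals = {}
    for row in rows:
        total, count = totals.get(row[flag], (0.0, 0))
        totals[row[flag]] = (total + row["kwh"], count + 1)
    return {key: total / count for key, (total, count) in totals.items()}


averages = mean_kwh_by(enrich(readings, households), "efficient_appliances")
```

With only `meter_id` and `kwh`, the comparison is impossible; it becomes a one-line aggregation once the extra variable is joined in.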
In addition, the Service wants to focus on delivering engaging and accessible data on par with the best, most innovative digital experiences. “Using AWS means we can deliver greater value for public investment than if we tried to build infrastructure on our own,” says Cunningham. “AWS has demonstrated its commitment to helping us innovate using disruptive technologies, with a clear pathway to core operational infrastructure.”