Exploratory Datasets

In support of the Data-Driven Narrative assessment, here are a number of datasets that you may want to explore and visualise:

Tate Modern Collections Data

This dataset is freely available from the Tate Modern’s GitHub page. It contains a complete record of the metadata related to its collection. This page provides information about the contents, license and background relating to this rich dataset.

IMDB Datasets

The Internet Movie Database provides a number of explorable datasets that you can download and experiment with.

Primary Energy Production in Ireland

This dataset from the Sustainable Energy Authority of Ireland (SEAI) documents the sources from which Ireland has produced energy since 2004.

Education and Qualifications

This dataset from the Central Statistics Office (CSO) details the education level and qualifications of persons aged 15 years and over, inside and outside the labour force. It is classified by occupational group, industrial group, regional and other geographic areas, socio-economic group and social class, and by age group, sex, nationality, place of birth, age at which full-time education ceased, and branch of third-level qualification. Information is supplied in the Small Area Population Statistics (SAPS), classified by Census Enumeration Areas, Dáil Constituencies, Electoral Divisions, Gaeltacht Areas, Garda Regions, Divisions and Districts, Local Electoral Areas, Towns, and Urban and Rural areas of each county, so it may be fun to combine this with some spatial exploration.

WHO and Disease Monitoring

Most Member States submit monthly reports to WHO on suspected and confirmed measles and rubella cases identified through their national disease surveillance systems. In general, the number of reported cases reflects only a small proportion of the true number of cases occurring in the community. Many people with these diseases do not seek health care or, if diagnosed, are not reported. In addition, there is a one- to two-month lag in reporting. For these reasons, the data provided on this page under-represent the true number of cases, particularly those occurring in the last one to two months.

Video Game Sales

This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com.

Fields include:

  • Rank – Ranking of overall sales
  • Name – The game’s name
  • Platform – Platform of the game’s release (e.g. PC, PS4)
  • Year – Year of the game’s release
  • Genre – Genre of the game
  • Publisher – Publisher of the game
  • NA_Sales – Sales in North America (in millions)
  • EU_Sales – Sales in Europe (in millions)
  • JP_Sales – Sales in Japan (in millions)
  • Other_Sales – Sales in the rest of the world (in millions)
  • Global_Sales – Total worldwide sales (in millions)

The script used to scrape the data is available at https://github.com/GregorUT/vgchartzScrape. It is written in Python and based on BeautifulSoup. There are 16,598 records; two records were dropped due to incomplete information.
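
As a starting point for exploring the fields listed above, here is a minimal sketch that totals a sales column per genre or platform using only the Python standard library. The sample rows, the SAMPLE_CSV name and the sales_by helper are made up for illustration and are not taken from the real dataset. The sketch also assumes that the downloaded file is a CSV whose header matches the field names above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the column layout listed above. The values are
# invented for illustration and do not come from the real dataset.
SAMPLE_CSV = """Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,Game A,PC,2010,Action,Publisher X,1.50,1.00,0.25,0.25,3.00
2,Game B,PS4,2014,Sports,Publisher Y,1.00,0.50,0.25,0.25,2.00
3,Game C,PC,2012,Action,Publisher X,0.50,0.25,0.00,0.25,1.00
"""


def sales_by(rows, key, sales_field="Global_Sales"):
    """Sum one sales column (in millions) for each distinct value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += float(row[sales_field])
    return dict(totals)


# csv.DictReader maps each row to a dict keyed by the header names.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

genre_totals = sales_by(rows, "Genre")                 # global sales per genre
platform_na = sales_by(rows, "Platform", "NA_Sales")   # North American sales per platform

# Print genres from highest to lowest total sales.
for genre, total in sorted(genre_totals.items(), key=lambda kv: -kv[1]):
    print(f"{genre}: {total:.2f} million")
```

To try this on the real data, you could replace io.StringIO(SAMPLE_CSV) with an open file handle for the downloaded CSV. The same grouping works for Publisher or Year, which may be a useful first step towards a chart.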

World Bank Data Indicators

A rich set of aggregated global datasets, customisable for download, covering a wide range of demographic and economic indicators.

Kaggle

A fascinating repository of assorted, community-contributed datasets and case studies.
