Open Government Data: Fostering Innovation

The provision of public information enriches and enhances the data that government produces as part of its activities, and supports the transformation of heterogeneous data into information and knowledge. This opening process changes the operational mode of public administrations: it improves data management, encourages savings and, above all, promotes the development of services in subsidiary and collaborative form between public and private entities. The demand for new services also promotes renewed entrepreneurship centred on responding to new social and territorial needs through new technologies. In this sense we speak of Open Data as an enabling infrastructure for the development of innovation and as an instrument for the development and diffusion of Information and Communications Technology (ICT) in the public system, as well as a space for innovation for businesses, particularly SMEs, based on the exploitation of the information assets of the territory. The Open Data Trentino Project initiated and fosters the process of opening public information and, as a natural consequence of this process of openness, the creation of innovative services for and with citizens. In this paper we present how our project acts on the whole chain, from raw data to a reusable, meaningful and scalable knowledge base, leading to data reuse through the implementation of services that enhance and transform data into information capable of answering specific questions of efficiency and innovation.


Introduction
The process of opening data in Public Administrations (PA), as a process of enhancement of public information, implies a radical change in how data is approached and handled inside the PA. This takes time, an improvement of data management methodology, the creation of operational tools and the provision of a reliable space for sharing. Governments of various countries, and administrative divisions thereof, have started to release huge quantities of datasets in the context of the Open Government Data (OGD) movement (Ubaldi, 2013). The movement became a shared policy after the European Directive no. 2003/98/EC (Public Sector Information Directive), amended by Directive no. 2013/37/EU. The first Directive was transposed into the Italian regulatory system with Legislative Decree no. 36/2006, while the second still needs to be implemented. The main purpose of the PSI Directive is to enable so-called "data re-use", that is, "the use by persons or legal entities of documents held by public sector bodies, for commercial or non-commercial purposes other than the initial purpose within the public task for which the documents were produced" (Directive 2003/98/EC).
In the context of the Autonomous Province of Trento (PAT), a large, diverse and interesting collection of datasets is already published as OGD. New datasets are steadily becoming available and the existing ones are updated whenever needed, for purposes such as correcting mistakes and adding new data horizontally (as instances) or vertically (as properties). The data catalogue is also linked with the website of the Department of Innovation of the PA. This department governs the process of opening new data and the dissemination of the so-called data culture. This is an important result to reach both inside and outside the PAT. People have started to understand the value of publishing high-quality data and the power of reusing them. Linking data will further increase the value of sharing data. It will also lead to a new kind of data-centric public body that empowers citizens and generates innovative services.
Immense numbers of government datasets could open up new opportunities for application developers and trigger disruptive business models (Ferro & Osella, 2013; Manyika et al., 2013; Vickery, 2008). While the quantity of such datasets can be considered satisfactory, their quality (e.g., correctness and vertical completeness) is yet to be improved (Bohm et al., 2012). Moreover, the loosely coupled nature of the data poses a challenge to developing applications on top of it. Therefore, there is a pressing need to leverage this data before putting it into action. To overcome these issues and fulfil the demand, we make the following contributions in this paper: I. The description of the adopted opening procedure as an integral part of change management in a public administration.
II. The implementation of a methodology for generating entity types out of published datasets, modelling data as entities (Giunchiglia, 2012) to facilitate an integrated, combined and extensible representation.
III. The implementation of a procedure and the corresponding tool for dealing with unforeseen data (along with known ones) about an entity, taking its semantics into account.
IV. A description of our experience in handling Open Big Data for building applications that are unprecedented in the region and can change everyday life.
The paper is structured as follows: in Section 2 we present the process that has been put in place within the PAT and the choices made. In Section 3 we describe the entity type methodology that we have adopted, which helps create integrated entities. Section 4 shows the automatic creation of entities by matching dataset schemas to the entity types. Section 5 provides a brief description of the open big data approach we are building. In Section 6, we present some applications developed on top of entities and open big data. Section 7 concludes the paper.

Open Data Trentino Overview
The Open Data Trentino project was created under the push of the local government to open its public information, as expressed in the official guidelines for the reuse of public data (DGP, 2012). From the legal point of view, the process started with the adoption of Provincial Law no. 16/2012 (LP 16/2012) on the promotion of the information society and digital administration and on the distribution of free software and open data formats.
The process continued by adapting and improving, for the local administration context, the state of the art of existing European good practices in matters of Public Sector Information (PSI). The result was the drafting of the Open Data Guidelines (Resolution 2858/2012).
The data hunting and publication process follows a step-by-step, day-by-day, federated approach, involving the local authorities from the beginning by asking every provincial department to open at least one dataset. Although the direct engagement of each department is time consuming, it has proved successful and brought the desired side effect of creating awareness across the whole public administration and spreading the change of paradigm. Many aspects are involved in this process: from data cleaning to data modelling, from privacy issues to intellectual property rights, from dissemination aspects to process design. At the same time, we have focused on the creation of Data as a Culture, through several dissemination and educational actions internal to the institution and, with a broader scope, at the national and international level.
The Dati Trentino portal is built on CKAN, an open source data management system started by the Open Knowledge Foundation and maintained by the CKAN community. CKAN is specifically designed to allow programmatic access, finding and retrieval of dataset metadata through web APIs, to which we are currently contributing. For instance, we developed dedicated clients to access datasets in two different programming languages. The first one, Ckan Api client, is an open source Python library used to automatically add harvested datasets into a standard CKAN platform, version 2.2. The second one, called Jackan, is an open source Java lightweight library with built-in support for provenance tracking that makes it easy to access the catalogue data from Java.
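The programmatic access that these clients wrap follows the standard CKAN v3 action API, whose endpoints take the form `/api/3/action/<action_name>` and wrap every payload in a `success`/`result` envelope. A minimal Python sketch of a call against the portal (the helper names are ours, not part of either client library):

```python
import json
from urllib.parse import urlencode

# Base URL of the standard CKAN v3 action API exposed by the portal.
CKAN_BASE = "http://dati.trentino.it/api/3/action"

def action_url(action, **params):
    """Build the URL of a CKAN action call, e.g. package_show?id=..."""
    query = "?" + urlencode(sorted(params.items())) if params else ""
    return "{}/{}{}".format(CKAN_BASE, action, query)

def parse_response(raw):
    """Unwrap the {"success": ..., "result": ...} envelope CKAN returns."""
    body = json.loads(raw)
    if not body.get("success"):
        raise RuntimeError("CKAN action failed: {}".format(body.get("error")))
    return body["result"]
```

Fetching `action_url("package_list")` with any HTTP client and passing the response body through `parse_response` yields the list of dataset identifiers in the catalogue.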
As of October 2014, the government of the PAT has published about 860 datasets in a catalogue made available more than a year ago at http://dati.trentino.it, under an open license ensuring free and unlimited use and reuse of data and representing the engagement of about 60 provincial departments. The catalogue is clustered into 13 broad categories, each consisting of a number of datasets, which are represented as one or more resources that are easily accessible and downloadable, often in CSV and/or JSON format and occasionally in XML. Each dataset can be mapped to DCAT, a vocabulary designed for describing catalogues and their datasets to increase interoperability. To cite just a few of the published resources of high importance, there are the provincial budget and the cadastre.
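The mapping to DCAT can be illustrated with a small, hedged sketch: we pick a handful of DCAT and Dublin Core terms (`dct:title`, `dct:description`, `dcat:keyword`, `dcat:distribution`) and translate the corresponding fields of a CKAN package dict; the exact mapping used by the portal may differ and covers far more of the vocabulary.

```python
def ckan_to_dcat(pkg):
    """Translate a CKAN package dict into a flat dict keyed by DCAT terms.

    Illustrative only: covers four common terms, not the full vocabulary.
    """
    return {
        "dct:title": pkg.get("title"),
        "dct:description": pkg.get("notes"),
        "dcat:keyword": [tag["name"] for tag in pkg.get("tags", [])],
        "dcat:distribution": [
            {"dcat:accessURL": res.get("url"), "dct:format": res.get("format")}
            for res in pkg.get("resources", [])
        ],
    }
```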

Entity Type Generation
This section presents an entity-centric approach for modelling OGD. It describes the generation of entity types according to the data published in government data catalogues. An entity type (also called eType) is the class of an entity that has the right amount of attributes and relations to form the foundation for creating entities with the non-trivial details intended for an application (Giunchiglia, 2012; Farazi, 2008). Some examples of entity types are person, location, organization and facility. An entity is a real-world physical, abstract or digital thing that can be referred to with a name (Giunchiglia, 2012). For example, Dante Alighieri (person), Trento (location), University of Trento (organization) and Trento Railway Station (facility) are entities.
In the context of this work, we have been dealing with the catalogue published by the PAT. As shown in Figure 1, we divided the entity type development into three macro-phases -- datasets survey, attributes survey and producing entity types -- each with different macro-steps. Modelling starts with the dataset identification step, which relies on the scenario or task at hand. For example, our scenario involves points of interest, which include datasets representing, among others, refreshment facilities (such as restaurant, pizzeria, bar, etc.), recreational facilities (such as ski lift, sports ground, museum, etc.) and transportation facilities (such as bus stop, railway station, cable car stop, etc.). In the resource analysis step, a rigorous study of the structure and content of the corresponding file(s) is performed to understand and single out the relevant resources.
The attribute analysis step proceeds by examining the attributes of the selected resources. By attributes we mean column headers in CSV files, properties of objects in JSON files and sub-tags under repeated object tags in XML. Attribute values are also analyzed in terms of availability (e.g., always, frequently, sometimes or never present) and quality (e.g., complete data with no or occasional syntax errors, partial data with or without errors, and relational data with or without disambiguated references).
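The availability analysis can be automated with a simple profiler that computes, for each attribute, the fraction of records carrying a non-empty value. The sketch below is our own illustration, not the project's actual tooling:

```python
def attribute_availability(records):
    """Return, per attribute, the fraction of records with a non-empty value.

    1.0 corresponds to 'always present', values near 0.0 to rarely present.
    """
    counts = {}
    total = 0
    for record in records:
        total += 1
        for key, value in record.items():
            if value not in (None, ""):
                counts[key] = counts.get(key, 0) + 1
    return {key: count / total for key, count in counts.items()}
```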
In the attribute extraction step, we differentiate between kinds of attributes according to the data encoded in them. Some attributes are used for encoding data and others for managing data (e.g., identifiers internal to the resource used as primary keys). In this step, we also merge and split attributes, if necessary, according to the data. For example, nameEn (name in English) and nameIt (name in Italian) can be merged into name, while opening hours can be split into opening time and closing time. The attribute mapping step incorporates the disambiguation of the attributes and proceeds by linking them to the right concepts in the knowledge base (KB). We identify two kinds of mapping: direct and indirect. Finding the name of the attribute attached to the intended concept in the KB and linking to it is called direct mapping. A direct mapping is not always present for an attribute. Finding the right sense of a term through its synonyms is called indirect mapping. Note that synonyms can be suggested by the user or retrieved from a KB.
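The merge and split rules just described can be sketched in a few lines (the column names follow the example in the text; the function itself is illustrative, not part of the project's codebase):

```python
def normalize_attributes(record):
    """Apply the merge/split rules of the attribute extraction step:
    nameEn/nameIt are merged into a multilingual 'name', and an
    'opening hours' range is split into opening and closing time."""
    out = dict(record)
    # Merge the language-specific name columns into one attribute.
    name = {}
    for column, lang in (("nameEn", "en"), ("nameIt", "it")):
        if column in out:
            name[lang] = out.pop(column)
    if name:
        out["name"] = name
    # Split a range like "09:00 - 18:00" into two attributes.
    hours = out.pop("opening hours", None)
    if hours and "-" in hours:
        start, end = (part.strip() for part in hours.split("-", 1))
        out["opening time"], out["closing time"] = start, end
    return out
```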
The data modelling step leads to understanding an entity type from the attributes extracted in the previous step and finding it (if it exists), or a suitable parent (if it is newly created), in the already existing entity type lattice, a lightweight ontology (see (Giunchiglia, 2009)) formed from the concepts of the entity types. In the entity type development step, we produce a specification of an entity type defining all possible attributes, their data types (e.g. string, float) and meta attributes such as permanence (e.g. temporary, permanent), presence (e.g. mandatory, optional) and category (e.g. temporal, physical).
While producing entities of a given entity type, mandatory and optional attributes are filled in with data, which are semantified and disambiguated wherever applicable. In fact, data for an entity can come from multiple resources. Through semantification, we facilitate the integration of loosely coupled data. When a mandatory attribute is unavailable in the possible resources, we signal it to the data providers, acting as pro-sumers (see (Charalabidis, 2014)), and do not allow the creation of the corresponding entity until all necessary data are present. This is how we improve vertical completeness.
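The entity type specification and the mandatory-attribute check of the two steps above can be sketched as follows. The class names and the minimal attribute set are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttributeDef:
    """One attribute of an entity type, with its meta attributes."""
    name: str
    data_type: str                 # e.g. "string", "float"
    presence: str                  # "mandatory" or "optional"
    permanence: str = "permanent"  # or "temporary"
    category: str = "physical"     # e.g. "temporal", "physical"

@dataclass(frozen=True)
class EntityType:
    name: str
    attributes: tuple

    def missing_mandatory(self, values):
        """Mandatory attributes for which no value was provided."""
        return [a.name for a in self.attributes
                if a.presence == "mandatory" and a.name not in values]

def create_entity(etype, values):
    """Refuse to create an entity until all mandatory data are present."""
    missing = etype.missing_mandatory(values)
    if missing:
        # In the real pipeline this is signalled back to the data provider.
        raise ValueError("missing mandatory attributes: {}".format(missing))
    return {"etype": etype.name, **values}
```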

Open Data Rise
Open Data Rise (ODR) is an open source web application for the curation of OGD. It allows users to easily fix errors in source data and to enrich names by linking them to their precise meaning, a process called semantification. The framework employs the entity-centric model (Giunchiglia, 2012). Once the dataset schema has been determined, during the attribute value validation step the user can adapt the dataset to the schema, exploiting OpenRefine's data cleansing capabilities.
The subsequent attribute value disambiguation step employs Natural Language Processing techniques to enrich the dataset content by linking names to known entities (such as Dante Alighieri, Florence) and words to concepts (such as male, city). Figure 5 shows a screenshot of a long text that has been automatically enriched. OpenDataRise shows in red the elements that still require manual intervention from the user.
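A heavily simplified sketch of this kind of enrichment is a gazetteer lookup: scan the text for known surface forms and tag each match as an entity or a concept. The tiny in-memory KB below is purely illustrative; the actual framework uses NLP pipelines, not substring matching:

```python
# Toy knowledge base: surface form -> (kind, type).
KB = {
    "Dante Alighieri": ("entity", "person"),
    "Florence": ("entity", "city"),
    "museum": ("concept", "facility"),
}

def annotate(text):
    """Return (surface_form, kind, type, offset) for each KB match,
    ordered by position in the text."""
    matches = []
    for surface, (kind, etype) in KB.items():
        start = text.find(surface)
        while start != -1:
            matches.append((surface, kind, etype, start))
            start = text.find(surface, start + 1)
    return sorted(matches, key=lambda match: match[3])
```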
In the entity alignment step the framework considers rows in the dataset as entities, i.e. real instances. The goal of this step is to schedule changes to the entity storage to be committed in the next step. Such changes can be either updates of existing matching entities or the creation of new entities with values from the source dataset. In the last step, entity import, the user can indicate the license of the entities to import and other metadata to publish to CKAN. Updates and insertions are then committed to the entity storage and a new semantified resource is published on CKAN. The resource contains the provided metadata and a reference to the imported entities in the entity storage.
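The scheduling of updates versus insertions can be sketched as a simple planning pass over the dataset rows. Matching here is by a single key attribute for illustration; the real alignment is semantic:

```python
def plan_changes(rows, storage, key="name"):
    """Split dataset rows into updates (key already in storage) and inserts.

    Nothing is committed here: the plan is applied in the import step.
    """
    updates, inserts = [], []
    for row in rows:
        if row.get(key) in storage:
            updates.append(row)
        else:
            inserts.append(row)
    return updates, inserts
```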

Open Big Data
Nowadays big volumes of data are processed at an increasing rate, creating additional hidden information. This information represents a central aspect in the definition of Big Data, known as Value. In fact, according to some industry analysts, dealing with Big Data means facing the following aspects: Volume (the huge amount of data generated, or data intensity, that must be ingested, analyzed and managed to make decisions based on complete data analysis), Velocity (the speed at which data must be processed), Variety (the different types and sources of data that must be analyzed, and the complexity of each and of the whole), Variability (the inherent "fuzziness" of data, in terms of its meaning or context) and, last but not least, Value. Since the public sector is increasing the quantity of data available to the public through many open data initiatives, we expect that in the near future the data collected by these initiatives will also benefit from the adoption of Big Data technologies to extract useful information from published data.
As part of the OGD initiative of the PAT, we therefore focus on the problem of data explosion and the consequent need for fast and scalable solutions for storage and analysis. We estimate that growth will be up to a hundredfold per year, easily reaching the order of terabytes of data within a few years. For instance, the Trentino portal already hosts sensor-based datasets, such as weather, traffic sensors and real-time energy consumption, which alone would provide a few GB of data per day if collected. Due to this very nature, they pose a challenge to traditional relational database management systems and at the same time appear as a problem to be dealt with by Big Data technologies such as Apache Hadoop. Our interest here is focused on producing a Big Data platform directly integrated with the Open Data portal, able to provide useful analytics and historical analysis in reasonable time.
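The kind of analytics such a platform should provide can be illustrated with a map-reduce-style aggregation over sensor readings; in production this would run on Hadoop over the full history, but the logic is the same. The record layout below (sensor id, ISO timestamp, value) is an assumption for the sketch:

```python
from collections import defaultdict

def daily_averages(readings):
    """Average value per (sensor, day) from (sensor_id, iso_timestamp, value)
    triples, e.g. ("meteo-01", "2014-10-05T14:30", 18.2)."""
    acc = defaultdict(lambda: [0.0, 0])   # (sensor, day) -> [sum, count]
    for sensor_id, timestamp, value in readings:
        day = timestamp[:10]              # "YYYY-MM-DD" prefix of ISO 8601
        bucket = acc[(sensor_id, day)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in acc.items()}
```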

Applications
Entities generated from open data, as well as open big data exploitation, resulted in the development of applications that often appear as innovations to citizens. This is because they offer services that are either novel or produce better results than contemporary ones. Moreover, open data propels innovations that help devise novel applications and services (Chan, 2013).
In line with this consideration, as shown in Figure 6 and Figure 7, we developed an application running on top of entities that helps find points of interest: restaurants, pizzerias and bars with opening hours; bus stops, cable car stops and railway stations with timetables; and ski lifts, ski rentals and ski schools with timetables. Figure 7 shows semantic navigation. We call it semantic navigation because it exploits semantic relations such as part-of while exploration-based search proceeds. For example, when the user is at a museum, the application can show all kinds of points of interest located within a radius of, e.g., 500 m or 1 km of the museum itself. Finally, the bottom widget shows attributes and general information about the selected entity.
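The radius-based proximity query behind this navigation can be sketched as a great-circle (haversine) distance filter over entity coordinates. The POI record layout is an illustrative assumption:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_km * asin(sqrt(a))

def nearby(center, pois, radius_km=1.0):
    """Points of interest within radius_km of center = (lat, lon)."""
    lat, lon = center
    return [poi for poi in pois
            if haversine_km(lat, lon, poi["lat"], poi["lon"]) <= radius_km]
```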

Conclusion
In this paper we described the approach we implemented in the publication of Open Data in the Province of Trento. The results already obtained are promising, and we show that a data culture, complemented by direct technical and legal support to PA employees, allows faster diffusion and sustainability of the process of opening data within the public administration. Furthermore, we proposed an approach for generating entities from open government data, grounded on an entity-based infrastructure that by design facilitates consistency in data representation and management, thereby enabling the re-use of public sector information. We also address the forthcoming data explosion issue by investigating and integrating Big Data technologies from the beginning. The entity-centric data representation and the infrastructure as a whole can be considered as input to the W3C Data on the Web Best Practices Working Group, which provides guidance to data publishers. Our future work involves the creation of a broader data ecosystem: communities, universities and other actors will be able to link their data using Open Data Rise. Moreover, we will implement crowdsourcing techniques to improve the quality of the existing data.