Evaluation of Geospatial Features for Forecasting Parking Occupancy Using Social Media Data

Urbanization and growing individual mobility are globally active trends that intensify the need for transportation in cities. In this context, parking space has become a scarce resource. Drivers searching for open parking spots cause about one third of the total traffic in urban areas. This creates significant fuel consumption, greenhouse gas emissions and time loss. Intelligent Transportation Systems with a particular focus on parking are a promising approach to overcome the information asymmetry and lead drivers directly to available parking spots. This requires highly accurate occupancy data for parking areas on a geographically extended scale. The ultimate goal of this thesis is to improve the modeling of parking occupancy by extracting meaningful features from raw social media data. The research focus is set on points of interest and public events in urban areas. First, robust methodologies are developed for the acquisition and benchmarking of large-scale social media data. This includes exploratory data analysis and testing of Facebook as a leading platform against alternative online data sources. Here, a multi-stage approach for the identification of duplicates in heterogeneous data sources is applied. Secondly, a diverse set of feature extraction methodologies is developed that integrates a variety of secondary data sources and findings from the literature. This comprises the adjustment of online popularity attributes for social media objects based on external data and the extraction of parking-related attributes based on text mining. Additionally, historical parking events from Floating Car Data are cross-referenced with thematic similarities among objects and adequate feature sets are derived. This includes the category-specific transformation of historical parking patterns into characteristic time- and object-dependent features.
Also, text-based topic modeling using Latent Dirichlet Allocation is applied to social media data to extract thematic object similarities as probabilistic input features for parking demand modeling. In the final evaluation phase, ground truth occupancy data for a selection of off- and on-street locations is used to compare machine learning models trained with varying input feature sets. A baseline and an extended feature set are compared, where the latter includes the extracted social media features. These models predict parking occupancy over different timeframes. Random forest models that include social media features are found to outperform the tested baseline models for both off- and on-street parking demand modeling. Particularly event topic probabilities and category-specific parking events on an hourly basis are identified as valuable.

Million) [1]. Also in developed economies such as Germany, the total number of passenger cars still undergoes slight increases, even though the population is slightly declining [2]. An extension of the existing infrastructure would be required but is typically found infeasible due to budget regulations or space limitations [3].
In this context, parking space can be considered a precious resource in the urban environment. Empirical studies have shown that roughly one third of the total city traffic is caused by drivers searching for available parking [4]. This consumes resources, causes noise and increases air pollution. In fact, the quality of life in urban hotspot areas decreases remarkably with intensified traffic and a shortage of parking space. Another result is the increase of illegally parked vehicles, which cause macroeconomic costs estimated at $80 million per year in the city of Barcelona, Spain, alone [5]. The individual time for drivers to find urban parking varies among cities but typically ranges between 3.5 and 14 min, as stated by Shoup (2006) [6]. Therefore, parking must be considered an important factor when planning mobility and deciding on a certain travel mode [7].

AIPARK parking information platform
Artificial Intelligence Based Parking (AIPARK) is an Intelligent Transportation System (ITS) that provides comprehensive information related to the parking situation in cities. The system evolved from research activities at TU Braunschweig starting in late 2015 and currently provides data for more than 60 million parking spots in Germany. AIPARK's main purpose is to guide drivers to available parking space near their travel destination. An overview of the elements of AIPARK is provided in Figure 1. AIPARK is implemented as a scalable platform that includes modules for data acquisition, processing, modeling and user interaction. Moreover, a comprehensive database of static information is provided that contains the locations of parking areas and relevant metadata such as opening hours, pricing or parking restrictions.
This database was initially based on open-source map data derived from numerous contributions of volunteers. These sources undergo crowd-based review processes and can be used within the AIPARK system without major adjustments. Thus, they are referred to as 'direct data sources' in Figure 1. At a later point in time, modules for the automated generation of parking maps are used to refine the existing information and extend the coverage of the parking database. A core technology is the analysis of remote sensing imagery, such as satellite or aerial images for the purpose of map generation. In the course of this step, geographic locations are identified where on-street parking is possible. Significant research and development efforts are involved in extracting valuable information from the raw imagery. Therefore, this type of data is referred to as 'complex'.
The second mapping process is focused on the generation of parking area metadata based on the analysis of Floating Car Data (FCD). The latter refers to positional information generated by GPS or mobile devices that are placed within vehicles. By accumulating data from a number of sample vehicles over time, conclusions on traffic flow and driver behavior can be drawn [8]. AIPARK uses the concept of parking events, an approach that focuses on identifying when drivers leave or successfully find a parking spot based on FCD analysis. Also, negative parking events are considered, denoting the unsuccessful search of available parking spots indicated by certain driving patterns. Minor corrections of the parking database are conducted based on local scouting, the manual on-site collection of data using specific mobile applications.
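The parking-event concept can be illustrated with a minimal sketch in Python. Everything below (the FcdPoint record, the 1 km/h stationarity cutoff, the five-minute minimum stop) is an illustrative assumption, not AIPARK's actual FCD implementation: a parking event is recorded when a vehicle remains effectively stationary for at least a minimum duration.

```python
from dataclasses import dataclass

@dataclass
class FcdPoint:
    t: float      # unix timestamp in seconds
    lat: float
    lon: float
    speed: float  # km/h

def detect_parking_events(trace, min_stop_s=300.0):
    """Flag a parking event when a vehicle's reported speed drops to
    (near) zero and stays there for at least `min_stop_s` seconds."""
    events = []
    stop_start = None
    for p in trace:
        if p.speed < 1.0:            # vehicle effectively stationary
            if stop_start is None:
                stop_start = p
        else:
            if stop_start is not None and p.t - stop_start.t >= min_stop_s:
                events.append((stop_start.lat, stop_start.lon, stop_start.t))
            stop_start = None
    # handle a trace that ends while the vehicle is still stopped
    if stop_start is not None and trace and trace[-1].t - stop_start.t >= min_stop_s:
        events.append((stop_start.lat, stop_start.lon, stop_start.t))
    return events
```

Negative parking events, by contrast, would require recognizing slow circling patterns near a destination, which this sketch does not cover.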
Another core module of AIPARK consists of machine learning models that provide occupancy information for the urban parking areas in the static database. Diverse dynamic data sources are acquired, prepared and transformed into valuable input features for model generation. Besides parking events from FCD, occupancy information for off-street parking facilities and sensor-monitored on-street spots is considered. Also, diverse contextual factors are taken into consideration, such as socioeconomic indicators. The latter comprise area-specific statistical data for the factors car ownership, income level, business activity and age distribution. As a subproject, long-term optical observation of parking areas is conducted for strategically important parking spots in urban areas. This is based on the automated extraction of occupancy information from camera footage that monitors parking areas over several weeks. The costs and time requirements for maintaining this complex data source are comparatively high. This is why long-term optical observation is primarily conducted within the city of Braunschweig, which serves as a testing area for the AIPARK platform.

Research design

Primary research questions addressed
Social media platforms are among the most rapidly growing sources of multifaceted online data of our time. Users interact and share their personal data in different explicit and implicit ways on a regular basis. Social media is expected to reflect trends, opinions and behavior both in society and on a personal level. Significant research has addressed the question of how the available information can be used to derive new insights in the field of mass mobility. However, the aspect of parking demand has not been sufficiently addressed despite its relevance in light of globally increasing car usage.
This thesis aims at improving the understanding of geographically referenced social media data and its relevance as a data source for modeling urban parking demand. A special focus is set on two different types of objects: points of interest (POIs) and public events. The research design approaches the domain from a theoretical, literature-based perspective, as well as from a practical, implementation-related side. This ensures that relevant findings are also evaluated with regard to their real-life implications and scalable feasibility.

General methodological overview
Addressing the research questions previously introduced, the methodological framework developed in this thesis is subdivided into four subsequent phases. They are presented in Figure 2. In the first phase, different online data sources for POIs and events are preliminarily evaluated, and leading social media and alternative platforms are selected for further investigation. Subsequently, large-scale data is acquired from these sources using their publicly accessible application programming interfaces (APIs). This phase is referred to as Data Acquisition (Chapter 4).
As the AIPARK project is primarily active in Germany, scalable data collection procedures are developed to acquire the target information on a nationwide scale.
Dense data availability is a basic requirement of drivers using ITS. Likewise, social media data that is potentially relevant as input for parking demand modeling must be available with significant coverage. This phase is crucial to prove the general technical feasibility. The phase Feature engineering (Chapter 6) describes the information retrieval procedures used to transform raw social media data into a variety of potentially valuable sets of input features. The applied techniques are primarily based on data fusion, integrating findings from the literature, and text mining.
In the last phase, Feature evaluation (Chapter 7), a testing procedure is developed to evaluate the previously extracted features. This covers their implications for the occupancy of both off- and on-street parking. It comprises the training of multiple machine learning models while comparing the prediction performance of a baseline feature set to an extended version that includes the extracted social media features. The benefit of specific extraction procedures is also evaluated.

Contextual background
This chapter provides an introduction to established technical approaches used for solving the parking problem in urban environments. Also, basic concepts in data mining are briefly described and the characteristics of social media as a source for geospatial information are investigated. Moreover, certain mobility indicators for Germany are introduced as it is the geographic focus area of this study. This is expected to be valuable for contextual understanding of the modal split indicators observed and to make the derived findings more comparable to other focus areas in future studies.

Popular solutions to the urban parking problem
There are several alternatives that focus on improving the availability of parking in cities. The simplest option is governmental action that increases infrastructural capacity to the point where no parking shortage occurs.
Typically, this is seen as highly unrealistic for most scenarios, given the necessity of significant public investments and land use. Also, the inherent improvement potential is very limited with regard to dense historic city centers or highly frequented areas that would be affected by extensive construction work. For this reason, the usage efficiency of existing infrastructure must be increased.

Parking guidance systems
Information systems that guide drivers to available parking spots improve infrastructural utilization and are currently widespread. These approaches fundamentally rely on the public availability of detailed parking information for the destination area. This knowledge is shared among drivers by different means, such as public display boards or smartphone applications. Recent approaches also consider the distribution of parking information using vehicular ad hoc networks [9]. In consequence, the time-consuming search for open parking spots is minimized.
On the one hand, based on discrete event simulations for a single parking lot, Surpris, Liu, and Vincenzi [10] identified only insignificant time gains from the introduction of a parking information system. This is interpreted as a result of the limited scope of the study. On the other hand, Caicedo et al. (2006) [11] reported reductions in required search times of up to one third when trying to find parking in multilevel garages. Generally, driver acceptance of IT-supported applications that deliver parking information is very high [12]. These systems are found to trigger several positive effects for urban life, such as reduced traffic congestion and decreased search time [13]. Guidance systems are most likely to be used by drivers who are unfamiliar with the destination area [14].

Stationary sensors
As of today, existing systems mostly cover only parking garages or other paid areas, generally referred to as off-street parking. Here, occupancy data can be easily acquired, as digital entrance barriers and sensors are widely distributed. Parking operators are primarily interested in collecting occupancy data for management insights and often also share this information with the public. The vast majority of parking spots in cities, however, exists in the form of on-street parking. In this context, occupancy information is not as easily accessible and must be generated using specific sensors that have to be financed primarily from local governmental budgets. Many different stationary systems for parking surveillance have been developed. Popular concepts cover radar sensors installed on street lights [15], camera-based surveillance using large-scale image processing [16] and magnetic field sensors integrated into the ground [17]. All of these approaches entail high expenses for installation and maintenance.
Moreover, each of these systems can only cover a very limited number of parking spots. One of the largest pilot projects involving parking sensors was the SFpark project, in which 6,000 systems were installed at an estimated cost of approximately USD 1.5 million. However, the project covered only slightly more than two percent of the city's estimated 281,000 on-street parking spots [18]. In fact, full-scale coverage with stationary parking sensors is very unlikely due to limited public budgets.
Therefore, stationary sensor systems cannot be considered a general solution for the urban parking problem.

Crowdsensing
An alternative to stationary sensors is the implementation of crowdsensing systems that dynamically collect data and automatically extract certain geospatial features. Cruising vehicles have been used as mobile ultrasonic sensor nodes that generate dynamic maps of vehicles parked on-street. Mathur et al. (2010) [18] used these sensors to detect signal patterns that relate to cars parked along the street sides while the measurement vehicle was driving normally. Evaluating the collected data, an overall information accuracy of more than 90 percent was achieved. Moreover, the effect of attaching sensors to a population of taxicabs was simulated, and significant savings were predicted for crowdsensing compared to stationary parking sensors.
Several approaches exist that focus on smartphone data for obtaining occupancy-related information. Rinne and Törmä (2014) [19] combined geofencing and activity recognition to detect when drivers are located in a designated parking area and try to find an open spot. For instance, if designated parking lots are highly occupied in reality, users tend to leave without parking and continue searching for an alternative. If spots are sufficiently available, this is indicated by successful parking events. The system suffers from the fact that every status change of a parking lot requires at least one user who cannot find an open spot immediately. Also, the procedure is not applicable to small parking lots that drivers can easily survey without entering. In this case, no trace of unsuccessful searching for parking is found in the generated movement data.
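The geofencing idea can be illustrated with a simplified sketch. The circular geofence, the radius threshold and the function names are assumptions made for illustration; the cited system additionally uses activity recognition. A movement trace that ends inside the lot's geofence indicates successful parking, while entering and leaving again suggests an unsuccessful search:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates in metres."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def classify_lot_visit(trace, lot_center, radius_m=60.0):
    """trace: list of (lat, lon) fixes ordered in time.
    Returns 'parked' if the trace ends inside the geofence (the driver
    found a spot), 'rejected' if the vehicle entered and left again
    (an unsuccessful search), and 'no_visit' otherwise."""
    inside = [haversine_m(lat, lon, lot_center[0], lot_center[1]) <= radius_m
              for lat, lon in trace]
    if not any(inside):
        return "no_visit"
    return "parked" if inside[-1] else "rejected"
```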
Other researchers examined the potential of magnetic field sensors in smartphones to detect nearby vehicles [20] [21]. As cars typically contain a significant share of ferromagnetic material, they cause magnetic perturbations that can be measured as deflections of the geomagnetic field. The detection principle has been used successfully in stationary sensors [17]. Mobile systems face limitations due to dynamically changing environmental variables and the low sensitivity of the measurement devices. Even though the preliminary results are promising, no fully functional system for the identification of open parking spots has yet been published.
Mobile payment records, another source of information generated by smartphones, have also been used to derive parking occupancy information [22].

Introduction to data mining
This section provides a summarized overview of the state of the art in data mining and related sub-disciplines. It introduces typical workflows in the field and creates a simplified schema of the variety of different concepts. For detailed explanations of the machine learning algorithms used in the course of this thesis, the reader is referred to more in-depth literature.

Cross Industry Standard Process for Data Mining
The Cross Industry Standard Process for Data Mining (CRISP-DM) is the standard reference model of the data mining field. It was introduced in the mid-1990s by a consortium of industrial companies which, at the time, were leading in applying data analysis techniques. Nowadays, it serves as one of the basic concepts of the data mining field. It consists of six phases that are relevant for both commercial and scientific use cases, and is independent of the specific platforms and tools used. An overview is provided in Figure 3. The framework's main purpose is to facilitate communication among analysts, customers and other stakeholders. CRISP-DM helps to structure data analysis projects and provides general guidance. Each top-level phase consists of diverse lower-level tasks, checklists and recommendations [23]. The first phase, business understanding, focuses on defining the objectives of the analysis project and deriving specific data mining goals and success criteria. This includes assessing the given situation regarding resources, requirements, preliminary assumptions and potential risks involved. Also, the cost-benefit ratio of the data mining project must be considered. Subsequently, during data understanding, raw data is collected and its characteristics are analyzed. This leads to a verification that the available data quality is sufficient for further usage. Afterwards, in the course of data preparation, low-quality data is corrected or removed and meaningful data is selected for use in downstream procedures. In the subsequent modeling phase, suitable analysis techniques are selected and applied to the prepared data. During the evaluation phase, the model outcome is compared to the originally defined success criteria and the entire data analysis process is reviewed.
In case there are potential improvements, the process is restarted in any of the preceding phases to correct errors or extend the scope of actions. If the evaluation indicates successful completion of the project, deployment of the obtained results into productive systems can follow. However, in the context of CRISP-DM, this phase mainly comprises planning and monitoring activities of the deployment [24].

General concepts in machine learning
In machine learning, a general distinction is made between supervised and unsupervised analysis problems. In the supervised case, models are generated to predict a given target attribute based on a variety of input features. Classification is one major subgroup that focuses on predicting discrete target data based on a set of labeled training samples. Regression tasks denote settings where the target variable is continuous. In unsupervised learning, there is no corresponding target attribute for the given feature vectors. Here, the ultimate goal may be exploratory data analysis or the grouping of similar data, which is referred to as clustering.
Evaluation of the generated models is based on separating the available data into sets for training and testing. As the latter remains unseen during training, it represents an adequate basis for examining the generalization achieved by the model. Cross-validation is a common process of iteratively separating the available data into changing training and testing subsets to evaluate model accuracy while avoiding overfitting [25].
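The k-fold cross-validation principle can be sketched in plain Python. This is a toy illustration with hypothetical fit/predict callables, not a production setup: the data is partitioned into k folds, and each fold serves once as the unseen test set while the remaining folds are used for training.

```python
def k_fold_indices(n_samples, k=5):
    """Split sample indices into k roughly equal, contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(X, y, fit, predict, k=5):
    """Average accuracy over k train/test splits: each fold serves
    once as the unseen test set, the rest is used for training."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in test_set]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict(model, X[i]) for i in test_idx]
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / k
```

In practice, libraries such as scikit-learn also shuffle and stratify the folds, which this sketch omits for brevity.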

General concepts in feature engineering
Formulation and selection of relevant input parameters, typically referred to as features, is the most labor-intensive element of building machine-learning models.
The success of data analysis depends significantly on the input feature vectors [26]. Thus, preprocessing of data is considered the most important step in deploying data mining applications [27]. As an example, the winning contribution to the popular 2010 KDD Cup data mining competition credited data preparation as the key to its success [28].
The term feature engineering comprises both the construction and selection of valuable attributes. Feature construction increases the dimensionality of the problem. Based on a set of raw information, different strategies can be applied to obtain higher-order attributes. This process is typically manual and demands certain knowledge of the problem sphere. One frequently applied technique is the decomposition of categorical features. For example, if an attribute takes one of three classes, one of which represents the value 'unknown', the latter can instead be included as a separate binary feature that indicates whether sufficient data is available. This avoids treating a lack of data as a class of its own. Moreover, continuous variables can be separated into bins that comprise a certain value range to obtain a transformation into categorical attributes. This can improve the understanding of data. Also, changing units may have positive effects [29]. From a theoretical perspective, the number of attributes constructed can be infinite, and automated feature construction further drives the growth of data dimensionality [30]. However, in reality, computational resources limit the feasible model complexity. Also, the number of samples required to obtain adequate model accuracy grows exponentially with the dimensionality. This phenomenon is often referred to as the curse of dimensionality [31]. Especially distance-based models, for example nearest neighbor classifiers, perform badly in high-dimensional spaces [32]. Using a smaller number of attributes to train the model facilitates data visualization, storage and handling, and ultimately leads to better model performance [26].
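The two construction techniques mentioned, decomposing an 'unknown' category into a separate availability flag and binning a continuous variable, might be sketched as follows (the class labels and bin edges are illustrative assumptions):

```python
def decompose_unknown(value, classes=("residential", "commercial")):
    """Turn a categorical attribute with an 'unknown' level into
    one-hot features plus a separate data-availability flag, so that a
    lack of data is not treated as a class of its own."""
    features = {f"cat_{c}": int(value == c) for c in classes}
    features["cat_known"] = int(value != "unknown")
    return features

def bin_continuous(value, edges=(0, 25, 50, 100)):
    """Map a continuous variable (e.g. occupancy in percent) onto
    labelled bins covering fixed value ranges."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= value < hi:
            return f"{lo}-{hi}"
    return f">={edges[-1]}"
```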
One approach to reducing the present dimensionality is feature selection. Filter methods evaluate and compare feature subsets independently of a specific learning model, so no model-specific effect on the prediction performance is to be noted. Wrapper methods, in contrast, rate candidate feature subsets by the performance of models trained on them. As a third category, embedded methods for feature selection are considered. These are model-specific and integrate the evaluation of attributes into the model training process [26].
Both filter and wrapper approaches are based on a variety of search strategies.
A popular selection criterion within the filter category is the correlation index between input features and the objective variable. Attributes with stronger correlation are considered to be generally more relevant. However, selecting only the individually most important features is typically not optimal, as it promotes redundancy [26]. An alternative concept focuses on single-variable classifiers, which comprises training multiple models, each with only the single input parameter to be evaluated. The accuracy of the obtained predictions is used as the selection criterion for the input parameters.
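A correlation-based filter criterion can be sketched as follows. This is a minimal illustration using Pearson correlation; a real implementation would also guard against zero-variance columns:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features_by_correlation(feature_columns, target):
    """Filter-style ranking: score each input feature by the absolute
    value of its correlation with the objective variable, independent
    of any specific model."""
    scores = {name: abs(pearson(col, target))
              for name, col in feature_columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```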
One major disadvantage of this approach is that the observed model performance highly depends on the interaction of dataset and model. Thus, different modeling approaches used for the same input parameter can lead to divergent results [34]. A third alternative is feature selection according to information-theoretic criteria.
When applying an exhaustive search strategy for wrapper methods, all potential attribute subsets are evaluated separately. Especially for large datasets with extremely high dimensionality, this is not computable within reasonable time frames. Thus, heuristic search strategies are applied.
For example, forward selection begins with a single attribute and adds relevant features step by step. Alternatively, in backward elimination, all attributes are considered for the initial feature set, followed by stepwise attribute removal. To determine the ranking of features to be included or removed, information-theoretic criteria are typically used [35]. Guyon and Elisseeff (2003) [26] summarize the strategic procedure of feature construction and selection with a ten-point checklist. It is focused on the practical implementation of the respective techniques in the field and visualized in Figure 5. Here, case-specific solutions must be found [36]. It was found that artificial neural networks (ANNs) and support vector machines (SVMs) perform well on features that are calculated as differences and ratios of basic attributes. For random forests and gradient boosting machines, aggregated and count-based features are found useful instead. This is seen as an important reason why superior performance is frequently observed for ensemble learners that rely on individual models from both classes.
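Forward selection might be sketched as a greedy loop. The evaluate callable stands in for any wrapper-style scoring function, for example cross-validated model accuracy; all names here are illustrative:

```python
def forward_selection(features, evaluate, max_features=None):
    """Greedy wrapper strategy: start from an empty set and repeatedly
    add the single feature whose inclusion improves the evaluation
    score the most, stopping when no remaining candidate helps."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, f = max(scored)
        if score <= best_score:      # no candidate improves the score
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score
```

Backward elimination works analogously, starting from the full feature set and removing the least useful attribute per step.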

General concepts in text mining
Text mining is a subfield of data mining that focuses on the extraction of information from textual data. It is part of the research field of natural language processing and uses specific methodologies that apply to unstructured data. The latter makes data cleansing and feature preparation highly complex. Natural language processing must deal with ambiguous expressions and depends strongly on background knowledge about the analyzed data [37]. The following paragraphs provide a short summary of common techniques in the text analysis domain, organized chronologically along a typical workflow.
The first step in conducting a text mining project is usually the acquisition of a text corpus, a collection of documents from a specific source or thematic area. All further analysis is based on the information distributed in the corpus. Single documents within this collection are typically represented as sparse, high-dimensional matrices. Each word is used as a feature that may occur a certain number of times within a given document. This makes analysis operations computationally highly expensive.
A basic representation of text used for machine learning is called bag-of-words.
Here, single words are treated as a set of occurrences while their order and grammar are not taken into account. For documents where the order of words carries valuable information, n-grams can be used as features. Here, a set of n subsequent words is treated as a single unit to account for spatial relationships among words.
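Both representations can be sketched in a few lines. This is a toy illustration using Python's collections.Counter, not a full text-mining pipeline:

```python
from collections import Counter

def bag_of_words(tokens):
    """Represent a document as unordered term counts; word order and
    grammar are discarded."""
    return Counter(tokens)

def ngrams(tokens, n=2):
    """Treat each run of n subsequent words as one unit to retain some
    of the spatial relationships among words."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```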
Text strings are subdivided into bags-of-words using tokenizers that are based on a syntactic ruleset. Simple tokenizers separate entities at whitespace characters, while more complex algorithms may also account for known expressions, for example including punctuation. If beneficial for the problem domain, tokenization can also be applied to separate entire sentences within a corpus.
Instead of representing word occurrences in a document using binary attributes, the respective frequency of terms can be used to assign term weights.
As sparsity is a common characteristic of feature spaces derived from textual data, dimensionality reduction is frequently applied to reduce computational costs and improve model quality. One common method is the removal of stop words, frequently used words that do not contain valuable information. Stop words that are generally applicable for an entire language include articles and prepositions. In addition, customized terms may be irrelevant in a specific problem domain. It has to be noted that a phrase-based search often depends on terms that may generally be considered stop words. The n-gram 'flights to Berlin', for example, crucially changes its meaning if the stop word 'to' is left out. In fact, a problem-specific decision must be made on whether or not to remove stop words.
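The problem-specific handling of stop words might look as follows. The stop word list and the whitelist mechanism are illustrative assumptions:

```python
STOP_WORDS = {"a", "an", "the", "to", "of", "in"}

def remove_stop_words(tokens, keep=frozenset()):
    """Drop high-frequency function words that rarely carry meaning.
    `keep` whitelists terms that do matter in the problem domain, as
    'to' does in the phrase 'flights to Berlin'."""
    return [t for t in tokens if t not in STOP_WORDS or t in keep]
```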
Other methods focus on summarizing similar words into one feature. A common technique, called stemming, reduces inflectional word forms by removing word suffixes. It is based on a heuristic process that does not always yield actual word stems as results. Lemmatization, on the other hand, uses lexical information and a morphological analysis to return a human-readable base form of the word, referred to as the lemma. The set of terms 'am, are, is', for example, can be projected onto the verb infinitive 'be' [38]. Probabilistic text models are commonly evaluated using perplexity, a measure of how well a trained model predicts unseen documents. The lower the obtained perplexity value, the better the generalization capability of the tested model [39] [40].
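The contrast between heuristic stemming and dictionary-backed lemmatization can be sketched with a toy example. The suffix list and lemma dictionary below are illustrative; real systems use algorithms such as Porter stemming and lexical databases such as WordNet:

```python
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Heuristic stemming: strip a known suffix. The result is not
    guaranteed to be a real word stem (e.g. 'parking' -> 'park', but
    'houses' -> 'hous')."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

LEMMAS = {"am": "be", "are": "be", "is": "be", "went": "go"}

def lemmatize(word):
    """Dictionary-backed lemmatization returns a human-readable base
    form (the lemma) instead of a possibly truncated string."""
    return LEMMAS.get(word, crude_stem(word))
```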

Geospatial information in social media platforms
This section describes the different aspects of social media as a source of geographically referenced information. It discusses the general characteristics of crowdsourced data and its applications for mapping. Also, the representativeness of social media as a basis for mobility-related findings is evaluated and typical interaction patterns are discussed regarding popularity and location-based functionalities.

Characteristics of volunteered geographic information
Traditionally, geographic maps have been generated by professionals using high-end tools. This typically involves governmental action and significant investments. Simultaneously, crowdsourced volunteered geographic information (VGI) receives growing attention. As mapping is conducted by a large number of volunteers who contribute to an open database, the collection of data is very scalable and works with consistent data formats. However, the involvement of large numbers of volunteers makes VGI prone to special quality characteristics that are discussed in the following paragraphs:

Quality distribution Especially in highly populated areas, a better overall data accuracy is achieved [42]. It was found that the number of voluntary contributors increases disproportionately in urban areas [43].
Shifted quality assurance While traditional mapping procedures rely on quality control by experts who create the map, VGI data errors are generally only recognized by other contributors or the users of generated content [44].
Proximity focus The quality of mapped data depends on the local knowledge of the contributor, especially if the mapping process is highly manual. One example would be the biased creation of map data from aerial or satellite imagery if image errors are present or important sections are hidden by trees. Contributors tend to collect information for locations which are close to their usual habitation [43].
Limited training Geographic data collected by volunteers is generally less reliable than mapping completed by contracted professionals [44]. Depending on the intensity of training and the mapping experience, the quality of the results can vary.
Representativeness Characteristics and perceptions of contributors do not necessarily represent the society as a whole. Certain information might be exaggerated or neglected depending on which aspects are perceived to be important by the contributors. Additionally, only a small share of the total number of VGI platform users contributes information on a regular basis [45]. Budhathoki 2010 [46] analyzed the activity distribution of an open mapping platform and found only 0.01% of the registered users to be very active while 70% did not contribute.
Potential malicious use Motives of users to contribute can vary and individual actions taken to worsen the quality of collected data can potentially take place. Due to the shift towards quality assurance by other contributors or final users, malicious data can remain in the database for varying amounts of time [42]. VGI platforms, for this reason, typically introduce contributor rankings that grant more autonomy for experienced users than for recently registered ones.
These quality characteristics need to be taken into account when using VGI for further analysis. Recent research addresses the development of semantic frameworks to evaluate the quality of user-generated map content [47] [48].
Social Media Geographic Information is a subgroup of VGI which originates from social media platforms and is prone to special characteristics. In comparison to initiatives with the main purpose of collecting data, the information collected in social media mainly consists of byproducts of communication-focused activity. Place entities are generated mainly based on location check-ins or in the course of users expressing their thoughts towards place-related topics on the platform. Therefore, the collected information is a direct representation of the users' interests [49]. This increases the severity of the previously introduced issue of information representativeness. In the literature, no in-depth social media studies of this aspect have focused on the geographic data contained. Campagna 2016 [49] emphasizes that 'a novel analytics is to be formalized for the peculiar data models which make this type of information different from more traditional vector spatial datasets' (p.49). The authors recommend considering the spatial, temporal and contextual dimensions of the available data. Also, the value of multimedia contents is emphasized for analysis.

Mobility characteristics of social media users
In Germany, about 89% of all people have access to and regularly use the internet [50]. About 50% of all citizens use social media platforms [51] while Facebook alone counted 37.9 million users in 2017 [52]. This reflects about 41% of all internet users in the country [53]. Social media sites have shown steady growth over recent years of about 14% annual increase in user counts [50]. Figure 6 details the corresponding usage statistics for the population in Germany. On a global scale, 56% of all Facebook users are male while the strongest gender difference is found in the age range of 18 to 34 [59]. Females are found to use Facebook more intensively, spending more time on the platform [60]. The respective data sources do not account for genders other than male and female.
Furthermore, social media usage is observed to be independent from monetary income [61].

Interaction characteristics of social media users
To understand the offline mobility behavior of Facebook users, it is necessary to understand their motivation for online interaction with POIs and events.
Unfortunately, this particular connection has not been sufficiently researched. A comprehensive literature review in 2012 by Wilson, Gosling and Graham [62] found that no research has been conducted regarding the behavioral drivers for user likes in general and particularly related to POIs and events.
With particular focus on humanitarian causes for likes, Brandtzaeg and Haugstveit 2014 [63] found this feature to be used mainly in the context of socially responsible liking. This concept describes the general willingness to support humanitarian organizations. Likes as an immediate emotional reaction to the observed content were the second most frequent cause while future access to further information was the third most frequent motivation. However, using the like feature is mainly seen as a method for self-representation.
With regards to likes for company-representations on Facebook, access to information was found to be the main motivation for user likes. Moreover, access to special offers and other promotions were observed to drive user interaction. Showing support for the business to other users was also identified to be an important factor [64]. Further surveys confirmed these findings for brand representations [65].
Regarding location check-ins, Patil et al. 2012 [66] found users on Foursquare to share information mainly for self-presentation and access to certain social circles.
Check-ins are regarded as a symbol of acknowledgment for a POI that helps support the user in being part of his or her social group. Thus, the motivation for online check-ins is explicable primarily with social and personal objectives while sharing the actual location is only a subordinate purpose [67]. Besides, it was highlighted that users may check-in to receive special offers limited to a certain geographic area [68].

Modal split in Germany
In Germany, about 3.4 billion kilometers were traveled per day on average for passenger transport in 2016 [69]. 80.1% of these were conducted by means of motorized private transportation. The rail sector was responsible for 7.8%, public road transport for 6.8% and air travel for 5.3% [70]. Considering data for the years 2005 to 2010, car usage in Germany is about 4.6% higher than the European average of 75.6% (EU-27) [71]. 44% of all car owners use their vehicles on a daily basis while 32% use them at least several times per week. About 82% of all households have access to one or more cars while an average household has access to 1.4 cars [72].
The latter increases with the accessible income and only one percent of households with a monthly income of more than EUR 5,000 own no car [57]. On average across all regions of Germany in 2016, 668 cars existed per 1,000 citizens [69]. Leisure activities are responsible for the biggest share of traffic with regards to four out of five distinguished travel modes; only for work-related trips is driving a vehicle more popular. This travel mode makes up 58% of all traffic while traveling as a passenger accounts for 24% of the total distance traveled. Shopping trips are almost evenly popular with regards to all travel modes but public transport. The latter is especially popular for education-related travel.
The category 'private errands' sums up all activities that do not fit in any other group.
Among different age groups, there are also meaningful differences with regards to the means of travel chosen by individuals. Figure 8 illustrates these differences. Elderly people tend to reduce active driving and do most trips walking. This shift is supported by a decrease of the average trip length by about 50% between the age groups 40 to 59 years and over 75 years [57].

Literature review
In this chapter, the state of the art in parking demand modeling is evaluated with focus on potentially relevant influence factors. Particularly, the impact of POIs and public events in the surrounding areas of parking spots is analyzed. Also, a general summary of modeling activities with regards to travel mode choices, and specifically car usage, is provided.

Modal split modeling
Individuals who showed certain preferences in the past were found to be likely to continue making similar choices in the future [80].
Adverse weather conditions also have an empirically proven effect on traffic patterns. Drivers tend to cancel or postpone trips, change routes, run errands preferably nearby and choose public transport instead of car travel if road congestion is present [82]. Rainfall is found to decrease car traveling speeds and the overall traffic volume. The corresponding effects for snow are similar but intensified [83] [84]. Winds and low temperatures are found to increase car and public transport travel [85]. Besides, Cools, Moons and Wets 2010 [86] highlight that the influence of certain factor combinations varies among different locations. This is explained by varying travel motives, for example those related to roads used mainly for leisure or alternatively commuting. While leisure is a rather flexible travel context where weather is an important influence, work-related travel is rather inflexible [85]. With specific regards to parking, low temperatures and rainfall were also identified as relevant factors.
No research focusing on smaller events, especially populated with social media data, has been conducted. Moreover, no findings are available that try to quantify the relevancy of interesting place characteristics for travel mode choices. As described above, the geographic level of detail applied in current studies remains regional.

Modeling of driver behavior
Collura, Fisher and Holton 1998 [90] investigated the behavior of drivers when searching for parking using simulated driving scenarios. Common search strategies of participants were naturally focused on decreasing the total travel time. Other studies highlighted that drivers primarily tend to circle in increasing distance to the desired destination [91], choose off-street alternatives or illegal parking spots [7]. Generally, on-street spots are preferred by drivers due to a more convenient access [14].
Certain individual factors have been identified that influence drivers' behavior and the observed parking occupancy patterns. These include the individual price sensitivity with regards to the available parking choices [92], drivers' knowledge of the area [91], required walking time and distance to the actual destination [93] [94]. Personal preferences of certain parking spots can also play a role in decision making [95]. Price sensitivity was found to vary among different city areas [96] being closely interconnected with the driver's travel purpose [97] [91].
Moreover, the importance of contextual factors is highlighted. This includes the population density in the destination area [96], the total number of parking spots available nearby, their parking turnover rate and the type of destination [91].
So-called non-habitual influences, for example special events or traffic incidents, also need to be taken into account [98]. One part of the research community sees a strong interconnection between traffic volumes and parking occupancy [99]. On the other hand, several studies indicate that there are only minor correlations between parking and the observed traffic flow intensities in one area [22] [100]. Moreover, illegally parked vehicles can have an important impact [7]. Examples of published event detection approaches include [102] and the traffic event classifier of Candelieri and Archetti 2015 [103]. Additionally, several high-level event identifiers were developed that provide a generalized applicability [104] [105] [106]. All of these include functionalities to make large-scale data sources accessible, to prepare sets of important attributes and to conduct data analysis tasks such as clustering or classification.

Event influence on the parking situation

Estimation of event popularity based on social media data
In the literature, four different types of gatherings triggered by social media are distinguished: Typically recurrent, scheduled meetings using specific event functionalities, planned semi-scheduled gatherings, ad hoc meetings and bigger, rarer special events [107]. An online-survey among 55,000 participants revealed that 58% of users on Facebook agree that online interactions also drive event attendance in real-life [108]. At the same time, web-based communication was also found to trigger actions in the online domain rather than having an effect on the offline behavior of individuals [109].
Du et al. 2014 [110] distinguish between three groups of influence factors that determine the event attendance in the context of social media: Content preferences of the users, the spatiotemporal context of the event and certain social influences.
The authors developed a methodology to give users recommendations for future events based on the respective similarity to past events that have been attended.
Using text mining, the similarity of events on a content basis is derived from names and descriptions. Besides, the temporal similarity is calculated based on weekdays and time of day. Lastly, spatial similarity is calculated as the distance between historical and recommended event locations. Later research used similar data representations and features to determine event similarity [111].
Bogaert, Ballings and Van den Poel 2016 [112] introduce a typology for event popularity modeling that is defined in accordance with the respective input data used. It distinguishes between published models that consider complex network data and approaches that focus on user data. Network data refers to interactions among users that serve as an indicator for social relationships and group behavior.
User data comprises individual user information that reflects a specific platform usage behavior. The study highlights the benefits of incorporating network data for predicting the attendance of events based on a research population of about 950 users and about 2,500 events. An overall attendance rate of 78% was reported.
The authors explain the relevancy of network data with a phenomenon called endogenous group formation, often also referred to as homophily. It denotes the preference of users to follow the decisions of their peer connections on social media.
Research focusing on alternative social media platforms confirms the importance of network data for event popularity prediction [113]. Paris, Lee and Seery 2010 [116] studied the technological acceptance of social media, in particular related to event information provided. The developed model considers the ultimate intention to attend an event as a consequence of several preliminary factors that must be fulfilled. Among others, trust with regards to the information provider, enjoyment and perceived usefulness were critical influences that affected respondents' choices. Slightly adapted models were developed by later studies [117] and the importance of trust and acceptance for user decision making was confirmed [112].
Michalco and Navrat 2012 [118] also identified critical factors that influence individuals' decision to attend events which are advertised in social media. Time of the event, the event organizer and other guests attending as a reflection of social bonds are considered for an estimation tool that predicts the likelihood of individual attendance. Later studies confirmed the importance of these factors in the decision-making process [119]. The authors also studied the relationship between claimed online event attendance and the actual real-life actions of individuals. The designed estimation tool performed with an accuracy of about 70% correct classifications. This supports the general applicability of likelihood-based methodologies in the area. The study does not provide estimations regarding a potentially generalized attendance rate for events in the social media domain.
As part of a report created for Facebook, Deloitte 2015 [120] spreads the assumption that about 50% of the positive or uncertain responses to events lead to offline attendance. Actual statistics related to more than 10,000 professional meetings held in Japan are provided by the event platform Doorkeeper [121]. It was found that mid-week events are more likely to be skipped as attendees are potentially busier with other activities than on the weekends. In the presented data, events on Saturdays have the highest check-in rates of approximately 85% whereas Wednesdays correspond to rates of only 78%. Also, paid events are more likely to have high attendance rates as individuals value the appointment also on a monetary basis. Moreover, smaller events were found to have a higher attendance rate compared to large events. The authors interpret this to be a reflection of the participants' obligation to share actual attendance information with the organizer.
This effect is called social loafing and describes the belief that individual actions such as responding to event invites are not necessary for large invitee counts.
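The similarity-based event recommendation idea described above can be sketched in a few lines. The Jaccard token overlap, the decay functions and the weights below are illustrative assumptions, not the exact formulation of Du et al. [110]:

```python
import math

def content_similarity(text_a, text_b):
    """Jaccard overlap of lowercased name/description tokens
    (a simple stand-in for the text mining step)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def temporal_similarity(weekday_a, hour_a, weekday_b, hour_b):
    """1.0 for identical weekday/hour, decaying linearly with circular distance."""
    wd = min(abs(weekday_a - weekday_b), 7 - abs(weekday_a - weekday_b)) / 3.5
    hr = min(abs(hour_a - hour_b), 24 - abs(hour_a - hour_b)) / 12.0
    return 1.0 - (wd + hr) / 2.0

def spatial_similarity(lat_a, lon_a, lat_b, lon_b, scale_km=10.0):
    """Exponential decay with great-circle (haversine) distance."""
    phi1, phi2 = math.radians(lat_a), math.radians(lat_b)
    dphi, dlmb = math.radians(lat_b - lat_a), math.radians(lon_b - lon_a)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    dist = 2 * 6371.0 * math.asin(math.sqrt(h))
    return math.exp(-dist / scale_km)

def event_similarity(e1, e2, w=(0.4, 0.3, 0.3)):
    """Weighted combination of the three dimensions; weights are assumptions."""
    return (w[0] * content_similarity(e1["text"], e2["text"])
            + w[1] * temporal_similarity(e1["weekday"], e1["hour"], e2["weekday"], e2["hour"])
            + w[2] * spatial_similarity(e1["lat"], e1["lon"], e2["lat"], e2["lon"]))
```

An event identical to itself yields a similarity of 1.0, so ranking candidate events by this score against a user's attendance history mirrors the recommendation setting described above.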

Event influence on parking demand
Human mobility patterns are found to be highly recurrent and predictable when being analyzed on an aggregated scale [122]. The traffic created by special events, however, is more difficult to forecast. Most publications in the field focus on the estimation of event popularity and do not connect their findings to the implied traffic or parking demand. Typically, only extraordinarily big events receive special attention that may potentially lead to temporary traffic management by the responsible authorities. Here, past experiences are sufficiently available to serve as a basis for traffic planning. The popularity and mobility demand caused by smaller events, however, depends much more on contextual factors. These include aspects such as the social group addressed and the modal split present. Especially interactions of several simultaneous events in one destination area lead to complex traffic patterns that are hard to predict [123].

Points of interest influence on the parking situation
Businesses, attractions and other places are often referred to as points of interest (POIs). They offer different interaction opportunities and attract people in various ways. This leads to urban traffic and ultimately also a certain demand for parking in the respective areas. In fact, the interaction schemes between people and points of interest, as well as their popularity, are hypothesized to be relevant for predicting the location- and time-specific occupancy of nearby parking areas [99]. Most parking forecasting models presented in the literature account for this factor by implicitly considering a location-specific spatial context. These approaches distinguish between geographic cells and do not explicitly take into account places or land use [100]. POIs, in fact, are introduced only as an implicit side factor among multiple area characteristics.
Landry and Morin 2013 [130] found that the parking-related impact of interesting places follows a Gaussian distribution. Multiple aerial images of parking areas at certain times served as a source for the observation that drivers prefer to park as close as possible to the respective POI. The study focuses on the occupancy within large parking lots and does not consider urban environments on a larger scale.
To conclude, the relationship of interesting places and changes of the parking situation has not yet been analyzed in an isolated manner. No sufficient research exceeding POIs as part of larger geographic areas has been conducted with regards to the effects on urban parking demand.

Modeling parking occupancy
Parking availability modeling on a generalized level can be subdivided into statistical approaches and methodologies based on artificial intelligence [131] [132].
Here, another category is introduced that focuses on simulative approaches. For all groups, the most common application is short-term occupancy forecasting of off-street facilities such as parking garages or paid areas. As sufficient data is typically available for this group, time-series forecasting is widely applied with high accuracies. Modeling on-street parking occupancy, however, remains a widely unsolved problem. The published approaches are not able to derive generalized models with sufficiently high prediction quality. No ubiquitous application exists that adequately explains the observed variance in real-life occupancy data.
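A minimal sketch of the statistical category may illustrate the idea: a historical-average forecast that predicts occupancy per weekday/hour slot. The record layout and the fallback value are assumptions for demonstration, not a method from the literature reviewed above:

```python
from collections import defaultdict

def fit_hourly_baseline(records):
    """records: iterable of (weekday, hour, occupancy) observations.
    Returns the mean observed occupancy per (weekday, hour) slot."""
    sums, counts = defaultdict(float), defaultdict(int)
    for weekday, hour, occupancy in records:
        sums[(weekday, hour)] += occupancy
        counts[(weekday, hour)] += 1
    return {slot: sums[slot] / counts[slot] for slot in sums}

def predict(model, weekday, hour, fallback=0.5):
    """Look up the historical mean; fall back for unseen slots."""
    return model.get((weekday, hour), fallback)
```

Such a slot-average baseline is the kind of reference against which richer feature sets, for example the social media features developed later in this thesis, can be compared.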

CHAPTER 4 Data Acquisition
This chapter discusses the methodologies developed for collecting large-scale POI and event data from Facebook and a selection of reference data sources in the respective fields. A preliminary evaluation of these online platforms limits the scope of the investigation. Moreover, exploratory data analysis is conducted for the obtained datasets, revealing certain quality-related and thematic characteristics.

Preliminary evaluation of leading online data sources
Detailed evaluation of databases for subsequent processing and analysis requires full access to the provided information. This typically corresponds to significant efforts for the development of web interfaces and crawler modules. Therefore, a preliminary evaluation of potentially relevant data sources is conducted based on publicly accessible platform descriptions and reported indicators; Table 1 provides an overview. Foursquare, as an alternative social network offering location-based services, only had about 55 million active users in 2015 [151]. In the Alexa internet ranking, a popular information service focusing on website popularity, Foursquare only ranked near position 1,900 while Facebook was among the top three most visited sites as of April 2017 [146]. Therefore, Facebook is chosen as the primary data source for geographic information extracted from social media.
Other than social media, extensive open platforms exist that focus on creating freely accessible maps with global coverage. Among other information, these projects typically contain a broad collection of static information related to POIs.
One major source is OpenStreetMap (OSM) with 3.6 million users [152], worldwide coverage and about two billion contributions already in 2010 [153]. OSM contains millions of tag-based entries for points of interest all over the world in various categories. Users actively contribute to the collection of geographic information as this is the platform's main purpose. The actual OSM user count, in fact, is not as important due to higher contribution activity of individual users. Moreover, the comparably low Alexa rank (Table 1) does not genuinely reflect the platform popularity. As the accumulated map data is republished daily to be self-hosted by third-party application developers, not all web traffic accessing OSM data is recorded in this ranking.
An alternative mapping platform is Wikimapia with about 1.9 million users in 2013 [42] and a total of twelve million contributions in 2010 [153]. Contrary to OSM, geographic entities are partly cross-referenced with further information found online. According to the Alexa website ranking, the user activity on Wikimapia decreased in recent years significantly. As of April 2017, the website ranks near position 4,500, having lost about 1,500 positions since the previous year [146]. As the collected data is not as openly shared for third-party hosting as for OSM, the Alexa ranking accurately represents the actual platform popularity.
Therefore, the overall extent and quality of OSM data is expected to be higher and it is chosen as the primary data source for POIs.
With regards to public events, diverse platforms exist that commercialize the aspects of marketing and ticket sales. In this field, Eventbrite is one of the leading suppliers. The platform had about 20 million active users in 2012 [145], representing the most recently published user-related data. According to the Alexa traffic statistics, Eventbrite ranks among the top 1,000 websites (Table 1) while reporting gross ticket sales of USD 2.0 billion in 2013 [154]. In contrast to event entities in social media, Eventbrite focuses on events that require users to give attendance feedback (RSVP), particularly paid events. Facebook also uses an RSVP system for managing the event database but feedback is less binding and no sales are conducted directly over the platform [119].
One relevant competitor with wide-spread coverage is Ticketmaster, offering tickets for about 230,000 events in 83 countries as of 2017 [155]. Ticketmaster generated USD 7.2 billion in revenues for 2015 [156] and maintains a stronger focus on large-scale events than the previously introduced platforms. As Eventbrite and Ticketmaster cover different segments of the event platform market, both must be taken into account for further analysis.

Facebook POI data

Data acquisition
Facebook offers place-related data via location-specific calls of its Graph API (v.2.8). Using a valid access token, places in the surroundings of a given geographic location are provided as a web service response. The information is received in the JavaScript Object Notation (JSON) format. The API accepts longitude and latitude of the requested location, a 'distance' parameter specifying the size of the covered area and a 'limit' attribute that defines the maximum number of objects to be returned. Larger requests take longer to process, use a higher data volume and create significant utilization of the data source. As the API is mainly used within third-party mobile applications, the data volume used and the request response time are critical factors that affect the perceived service quality.
Thus, the limit parameter is used to reduce the amount of information requested. If large web service responses are not required for a good user experience within the target application, mobile data consumption and API load can be reduced in this manner. In the context of this study, the API is used for collecting nationwide geographic information. A parser module is necessary to retrieve the available place information in a systematic manner. Figure 11(a) provides a schematic overview of how request locations are distributed to cover a coherent rectangular area. The position of retrieved place objects follows a radial pattern defined by the distance parameter chosen.
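A request against this endpoint could be assembled as follows. The sketch only builds the URL and the parameter set described above; the access token and the coordinate formatting are placeholders and assumptions:

```python
def build_place_request(lat, lng, distance_m, limit, access_token,
                        api_version="v2.8"):
    """Assemble the endpoint URL and query parameters for a place
    search around a given coordinate. Parameter names follow the
    'distance' and 'limit' attributes described in the text."""
    url = "https://graph.facebook.com/{}/search".format(api_version)
    params = {
        "type": "place",
        "center": "{:.6f},{:.6f}".format(lat, lng),
        "distance": int(distance_m),   # radius of the covered area in meters
        "limit": int(limit),           # maximum number of objects returned
        "access_token": access_token,
    }
    return url, params

# The actual call could then be issued with any HTTP client, e.g.:
# response = requests.get(url, params=params).json()
```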
As a preparatory step, the API characteristics are examined by testing its capabilities for the city area of Berlin. Figure 11(b) shows the response characteristics for requests on the city center coordinates. The previously described request parameters 'limit' and 'distance', as well as the number of retrieved place objects, are taken into account as plot axes. It can be seen that an implicit response limitation is reached between 600 and 700 place objects, independently of the passed request parameters. The limit is reached already at a chosen distance of 0.5 km because the density of place objects for the test area is extremely high. In fact, greater distances passed do not lead to more POIs received. It has to be noted that about one percent of the objects received are located further from the request location than defined by the distance parameter. For further analysis, this phenomenon is neglected and a radial coverage is assumed.
In order to avoid loss of information, the request area size cannot remain static.
The API call locations passed must be dynamically adjusted in accordance with the degree of urbanization for the target area. Areas with higher POI density must be covered with smaller areas and a larger number of requests. As an indicator for the density of POIs, zip code area sizes are considered. It is observed that the smaller the zip code area, the higher the degree of urbanization. Tests are conducted for different areas in Germany (Figure 12) to define the optimal relationship between zip code area size and the API request distance chosen. For example, the contextual structure and degree of urbanization for the city of Braunschweig indicate an optimal distance between request points of about 2.5 km. Further tests in suburban settings allow request distances of 4.5 to 6.0 km before the API response limit is met. All zip code areas in Germany are clustered by size and are matched with request distances that are found to be optimal for the tested areas. This results in groups of equal zip code area size being covered by homogeneously spaced request points. In total, ten different clusters are distinguished.
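The matching of zip code area sizes to request distances, and the generation of homogeneously spaced request points, can be sketched as follows. Only the 2.5 km (urban Braunschweig) and 4.5 to 6.0 km (suburban) anchor values come from the text; all intermediate cluster thresholds are illustrative assumptions:

```python
import math

def request_spacing_km(zip_area_km2, bounds=None):
    """Map a zip code area size to a request-point spacing using
    ten size clusters. The intermediate thresholds are assumptions."""
    if bounds is None:
        # (max area in km^2, spacing in km)
        bounds = [(2, 0.5), (5, 1.0), (10, 1.5), (20, 2.0), (35, 2.5),
                  (55, 3.0), (80, 3.5), (110, 4.5), (150, 5.0),
                  (float("inf"), 6.0)]
    for max_area, spacing in bounds:
        if zip_area_km2 <= max_area:
            return spacing
    return bounds[-1][1]

def grid_points(lat_min, lat_max, lon_min, lon_max, spacing_km):
    """Homogeneously spaced request locations covering a rectangle.
    1 degree latitude is roughly 111 km; longitude is scaled by
    the cosine of the mid-latitude."""
    dlat = spacing_km / 111.0
    dlon = spacing_km / (111.0 * math.cos(math.radians((lat_min + lat_max) / 2)))
    points, lat = [], lat_min
    while lat <= lat_max:
        lon = lon_min
        while lon <= lon_max:
            points.append((round(lat, 6), round(lon, 6)))
            lon += dlon
        lat += dlat
    return points
```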

Data characteristics
Before the collected POI objects can be used for further analysis, an exploration of the obtained data and certain preprocessing steps are necessary. An overview of the distribution of objects in accordance with their respective popularity is presented in Figure 13. In total, the dataset comprises POIs connected to 1,124 different categories.
By manual identification, eleven of these are found to be structurally irrelevant for forecasting parking occupancy. These refer to rather large areas other than specific places.

Facebook event data

Data acquisition

For the first collection method, the total population of place identification numbers is divided into batches of 50 pieces, the maximum accepted by the API within one request. These keys are passed with the API call to obtain events that correspond to the respective places. One separate parameter determines the time frame covered by the request. Facebook also allows retrieval of information related to past events. For a certain focus area, if more data is available than can be passed in one response due to internal performance reasons of the API, the retrieved file contains links to further response pages. These are exhaustively called with the crawler script to obtain all information that fits the specified parameters. As identification numbers for places frequently change, the script also provides functionalities to obtain updated object keys passed with error responses.
This avoids mistakes in the crawling process. Data collection for all available places requires about 65,000 API calls, 57% of which are related to further response pages. In total, about 1.7 million event objects are collected by conducting this procedure.
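The batching of place identifiers and the exhaustive traversal of further response pages could be sketched as follows. The 'data'/'paging'/'next' keys follow the Graph API response layout; the injected `fetch` callable is an assumption that keeps the sketch testable offline:

```python
def chunk_ids(place_ids, batch_size=50):
    """Split place identifiers into batches of at most 50,
    the maximum accepted by the API within one request."""
    return [place_ids[i:i + batch_size]
            for i in range(0, len(place_ids), batch_size)]

def collect_events(fetch, first_request):
    """Exhaustively follow 'next' links in paginated responses.
    `fetch` is any callable mapping a request URL to a parsed JSON
    dict with 'data' and optional 'paging' keys, so an HTTP client
    can be plugged in without changing the crawl logic."""
    events, url = [], first_request
    while url:
        page = fetch(url)
        events.extend(page.get("data", []))
        url = page.get("paging", {}).get("next")
    return events
```

In the real crawler, `fetch` would wrap the HTTP client and the error handling for changed place identifiers described above.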
The keyword-based approach uses a different endpoint of the API that is not location-specific. Thus, primarily keywords are passed that refer to certain locations. First, a list of 109 cities in Germany is extracted from a publication of the Organisation for Economic Co-operation and Development (OECD) [158].
These are used as keywords for the event search and lead to 14,700 objects retrieved in total. Mainly due to duplicate city names on a global level, only about 90% of these are located within Germany. Due to the comparably small number of objects retrieved, this approach is neglected. As a second procedure, specific POI names are used as keywords to collect event objects that have no formal connection in the graph structure but potentially show a real-world relationship based on syntactic similarities found. Testing a sample of 2,000 random place names, in total 5,600 event objects are returned while only 1,100 of these have sufficient location information and are set within Germany. It has to be noted that the keyword-based search requires user-specific authentication and the response volume is strictly limited. Thus, for large-scale data acquisition on a nationwide scope, this approach cannot be realized.

Data characteristics
Users on Facebook have different options to indicate whether they will attend an event. They can specifically accept the invitation of another user by claiming attendance. They can also note that they are just interested in the event and unsure of the actual attendance. Finally, they can also formally decline or avoid replying to the event invitation. User counts for all of these options are provided as attributes for each event in the dataset. However, due to policy changes as of April 2017, the count of users who declined a specific event is no longer available using the Graph API. By default, response objects with this attribute equal to zero are returned. Thus, this factor is neglected for further analysis.

Eventbrite
Eventbrite offers information related to both paid and free events. Figure 19 shows information for the ten categories with the highest median capacity; across all events, the median capacity only amounts to 28.5 users per event.

Ticketmaster
Figure 20 shows the ten most represented subsegments with their corresponding object counts.

Summary of data collected
First, leading online data sources that generally provide information with regard to POIs and public events are preliminarily evaluated. Subsequently, algorithms are developed for scraping large amounts of data from the online platforms Facebook, OSM, Eventbrite and Ticketmaster. The feasibility of scalable data collection, a basic prerequisite for feature integration from these sources, is demonstrated.
Subsequently, the collected information undergoes an exploratory analysis to improve the understanding of data quality and its thematic representativeness.

CHAPTER 5 Data source benchmark
The presented data sources use various semantics and provide different sets of attributes. Facebook is the only considered source that provides public information that explicitly relates to the popularity of POIs and events. Other than that, the collected objects only contain indirect popularity information and typically only indicate the sole existence and themes of the event. This information is considered to be sufficient to derive certain feature subsets that have value for parking prediction models. For example, all data sources contain categorical information that allows drawing conclusions on thematically similar parts of the datasets. However, due to the heterogeneity of the data, merging of the acquired sources to a new, unified database is possible only to a limited extent. Missing popularity information for the benchmark sources cannot be reliably estimated from the available attributes.
A unified database would be limited to basic information supplied by all integrated data sources. For this reason, benchmarking is primarily conducted to specify the extent and value-added by social media data compared to the alternative sources.

Duplicate identification techniques
It is assumed that especially highly popular events are reflected in different data sources. Higher popularity increases the chance of observing multiple heterogeneous representations. If unified databases are constructed and duplicate entries from the sources are not removed, the parking model input is ultimately biased and lacks real-life representation. Thus, duplicate objects need to be identified as an important part of the data preprocessing phase. For benchmarking purposes, duplicate identification is also relevant to clarify the number of exclusively supplied objects by a certain data source.

Zhang 2015 [161] developed a procedure for the identification of both syntactic and semantic similarities among events from different data sources. In this study, a methodology is applied that focuses only on syntactic similarity but also considers categorical object matches with a specific focus on duplicate identification. As the crucial part of object names rarely has a specific meaning, including a procedure focused on semantic similarity is assumed not to be beneficial. The following list provides an overview of all techniques developed for duplicate identification purposes:

1. Context matching: geographic proximity, time similarity
2. Name matching: similarity of name strings (syntactic)
3. Categorical matching: similarity of object themes

Context matching
The first phase, context matching, is used to limit the scope of duplicate identification to a geographic focus area. This reduces the required computational resources by lowering the number of objects to be processed in further steps. The size of the considered geographic area is chosen as a compromise between computational cost and the accuracy of location information in the available data. A square area with an edge length of one kilometer is chosen to account for inaccuracies of geographic references among databases.
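This spatial prefiltering can be sketched as a simple grid hash, assuming plain latitude/longitude coordinates; the function names and the degree-to-kilometer approximation are illustrative, not taken from the thesis:

```python
import math
from collections import defaultdict

def grid_cell(lat, lon, edge_km=1.0):
    """Map a coordinate to a square grid cell of the given edge length.
    One degree of latitude is ~111 km; longitude is scaled by cos(lat)."""
    lat_step = edge_km / 111.0
    lon_step = edge_km / (111.0 * math.cos(math.radians(lat)))
    return (int(lat / lat_step), int(lon / lon_step))

def candidate_pairs(objects, edge_km=1.0):
    """Group objects by grid cell and yield only same-cell pairs,
    so distant objects are never compared in later matching steps."""
    cells = defaultdict(list)
    for obj in objects:
        cells[grid_cell(obj["lat"], obj["lon"], edge_km)].append(obj)
    for members in cells.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                yield members[i], members[j]
```

Note that objects close to a cell border can fall into adjacent cells; a production version would also compare neighbouring cells.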

Name matching
During the second step, name matching, the similarity of name strings from different objects is analyzed using a token-based approach following Recchia. The token sets A and B refer to the respective object names. The set C is generated by calculating the longest common substring ratio (sim) for all possible combinations of A and B. If the obtained value for a pair of tokens lies above a certain threshold, the combination is counted as a match. This allows a certain variance in the spelling of object names depending on the threshold chosen.
The ratio of identified matches divided by the average length of the name token sets represents the final similarity metric M. Compared to the original Jaccard index divisor |A ∪ B|, using the average token set length increases the metric's sensitivity in cases where object names show a significant difference in length.
For further accuracy improvements regarding the identification of duplicate object names, filtering of token sets is conducted before the similarity metric is calculated.
This includes removal of duplicate tokens and stop words within each of the compared name sets. The stop word sets are specifically defined in accordance with the respective datasets.
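The complete metric M, combining longest common substring ratios per token pair with the average token set length as divisor, might look as follows; the threshold value and helper names are assumptions for illustration:

```python
from difflib import SequenceMatcher

def lcs_ratio(a: str, b: str) -> float:
    """Longest common substring length relative to the longer token."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return m.size / max(len(a), len(b))

def name_similarity(name_a, name_b, stop_words=frozenset(), threshold=0.8):
    """Token-level match ratio: matched token pairs divided by the
    average token set length (instead of the Jaccard divisor |A ∪ B|)."""
    # Filtering: sets drop duplicate tokens, stop words are removed.
    A = {t for t in name_a.lower().split() if t not in stop_words}
    B = {t for t in name_b.lower().split() if t not in stop_words}
    if not A or not B:
        return 0.0
    matches = sum(1 for a in A if any(lcs_ratio(a, b) >= threshold for b in B))
    return matches / ((len(A) + len(B)) / 2)
```

A lower threshold tolerates more spelling variance between the compared names at the cost of more false matches.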

Categorical matching
Matching of categorical object information is a primarily manual process.
Each category of one data source has to be compared to all categories of the remaining data sources to identify one or more thematic matches. To decide whether candidate pairs are actual duplicates, a labeled dataset is manually generated. A decision tree classifier is trained on a balanced subset with 560 entries. It achieves an accuracy score of 95% using three-fold cross-validation.
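A sketch of such a duplicate-decision classifier with three-fold cross-validation; since the labeled dataset is not public, synthetic stand-in features are generated here purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the manually labeled pairs: three features
# (Jaccard index, modified similarity M, categorical match flag),
# labels 1 = duplicate pair, 0 = distinct pair.
rng = np.random.default_rng(0)
n = 560  # balanced subset size as stated in the text
y = np.repeat([0, 1], n // 2)
X = np.column_stack([
    rng.normal(0.2 + 0.5 * y, 0.15),   # Jaccard index
    rng.normal(0.3 + 0.4 * y, 0.15),   # similarity metric M
    rng.binomial(1, 0.3 + 0.5 * y),    # categorical match flag
])

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
scores = cross_val_score(clf, X, y, cv=3)  # three-fold cross-validation
```

The mean of `scores` is the cross-validated accuracy reported for such a model.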

Event data sources
SVMs and perceptron models achieve accuracies in the same range. Thus, the modeling approach with the simplest architecture, the decision tree, is selected.
The original Jaccard index is observed to be the most important feature. Figure 24(a) shows a balanced distribution of false positive and false negative predicted labels. Even though the scatterplot in Figure 24(b) shows a similar pattern, the class overlap with regard to the Jaccard index is much smaller. This is the main reason for the differing classification difficulty between the datasets.

Feature extraction roadmap
To understand the role of social media data for modeling parking demand in cities, it is necessary to analyze the underlying relationship between the available and predictable information. For better modeling results, extracted input features must be as closely connected to the target attribute as possible. Figure 26 visualizes the general relationship between data in social media and the target measure parking occupancy. For each modeling stage, corresponding attribute sets are extracted to be used as input features.
Deriving parking occupancy information from social media data is a multistage modeling process. Social media users represent only a specific sample of the entire population with car access. As described in the literature, individuals go through a complex decision making procedure before using social media platforms.
Also, strong differences among users are found regarding the general usage frequency and the utilization of specific functionalities. In fact, the online popularity of real-life objects is seen as the result of a multi-stage decision making process. Generally, all levels in the modeling chain represent subgroups of the preceding stage. However, the actual measure for each level is also influenced by an unknown external amount of attendees or drivers that is not reflected by the POI or event data. In fact, all extracted feature sets are naturally just indicators and cannot reflect the actual driver behavior in an isolated manner.

Adjusted popularity measures
The concept of adjusted popularity aims at correcting the bias that is introduced by the crowdsourced character of social media. As can be seen in the literature, the representativeness of social media is limited by the over- and underrepresentation of certain social groups. In fact, the user-generated contents offered on these platforms represent only the interests of certain parts of society.

Adjustment using a reference data source
As OSM is a general mapping resource, it is assumed that the thematic distribution of its place objects corresponds to the actual real-life occurrences. Based on a high number of contributors, especially in Germany, the information contained is expected to be unbiased and highly detailed. Facebook, on the other hand, maintains a widely uncontrolled POI database without public review or correction processes. Thus, the over- and underrepresentation of the Facebook dataset is examined based on the thematic differences between both datasets.
First, city center coordinates for the 70 largest cities in Germany are used as a basis for creating quadratic polygons that define the focus area. As parking is an issue mainly in urban contexts, the representativeness of this subset is assumed to be higher than when all of Germany is taken into account.
Moreover, while OSM data is available on a nationwide basis, the data acquisition applied for the Facebook dataset limits its availability to urban areas. A nationwide comparison would lead to biased results, which are avoided by focusing on city areas only. For the quadratic polygon size, an edge length of 25 km is chosen to achieve a sufficiently large subset of OSM and Facebook POIs for these areas. Taking into account the categorical matches defined in the course of the data source benchmarking (Appendix F3), the sum of thematically similar objects over all city polygons is calculated for both data sources. Subsequently, the relative difference between both object counts is calculated and used as a linear measure for adjusting the raw online popularity values. As displayed in the example in Figure 27(a), if the sum of Facebook objects is lower than the corresponding value on the OSM side, it is assumed that this POI category is underrepresented on Facebook. In this case, the given popularity attributes must be increased to correct the observed bias. Before the adjustment of online popularity can take place, categories with less than ten objects in either of the two data sources are removed from the analysis set. Small absolute differences among object counts for these samples would result in large popularity adjustment factors that do not reflect the general distribution.
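The category-wise adjustment could be sketched as follows. The exact functional form of the relative difference is not specified here in detail, so a simple count ratio is assumed as one plausible reading:

```python
def adjustment_factors(fb_counts, osm_counts, min_objects=10):
    """Per-category popularity adjustment factors.

    fb_counts / osm_counts map category -> object count summed over all
    city polygons. Categories with fewer than min_objects in either
    source are skipped, since small samples yield unstable factors.
    A category underrepresented on Facebook (fb < osm) receives a
    factor > 1, i.e. its raw popularity values are scaled up."""
    factors = {}
    for cat in fb_counts.keys() & osm_counts.keys():
        fb, osm = fb_counts[cat], osm_counts[cat]
        if fb < min_objects or osm < min_objects:
            continue
        factors[cat] = osm / fb
    return factors
```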
Taking into account this filter logic, only 69 categories remain for the popularity adjustment.

For events, no generalized data source exists that reflects objects for various genres and target groups in an equal manner. The benchmark data sources taken into account, Ticketmaster and Eventbrite, are both considered biased based on their commercial focus. These platforms generally have a low interest in promoting free events as there is no direct revenue potential in terms of their business models. In fact, popularity adjustment based on a reference data source, equal to the procedure applied on the POI side, is not possible.

Adjustment using domain knowledge
As there are no publications describing patterns or generalized influence factors for physical visits to POIs, findings from the literature cannot be used to define adjusted popularity features for POIs. For public events, however, domain knowledge published by the platform Doorkeeper [121] is used as a basis for adjustment. As Doorkeeper is a Japanese event platform with a comparably small number of offered events and a tendency towards professional themes, the data source describes a rather specific subgroup of all possible events. However, the published information provides interesting insights on a general rather than an individual level. Domain knowledge with regard to the influences of weekday and event size is provided and displayed in Figure 29. Also, experiences related to event pricing are provided but cannot be accounted for in the context of popularity adjustment as Facebook does not provide corresponding data for cross-referencing.
For each event in the Facebook dataset, the respective weekday is extracted as a separate attribute and used as a basis for matching the weekday-specific adjustment factors.

Target group attributes
Facebook does collect highly detailed target group demographics and behavioral information. However, for privacy reasons, this data is not publicly accessible.
Instead, the public event information must be analyzed to determine attributes that describe the social group targeted by the event. First, a labeled dataset with 500 random event objects is manually generated to be analyzed using supervised learning methods. As text input, event names and descriptions are taken into account while the names and categories for the event location are also included. The categorical information for the event itself is not considered as it is only available for less than 5% of the dataset.
A binary attribute is introduced to identify events that attract mostly elderly people. As this demographic is generally found to prefer car usage over other modes of travel, events focusing on this group are expected to create a disproportionately high parking demand compared to other events with similar popularity. Secondly, as higher income also correlates with car usage, a label is introduced to identify events that attract mostly wealthy people. Lastly, a label is added to identify events that attract environmentally-aware people as these tend to avoid car usage and prefer public transport. Even though male individuals are more likely to use cars than females, it is hard to identify event contents that are gender-typical.
Thus, this factor cannot be reflected in the analysis. Furthermore, it is expected that reflecting event pricing with separate labels is not beneficial. The online popularity of event objects is independent of formal registrations made in direct contact with the organizer. Thus, findings that indicate higher attendance rates for paid registration-only events cannot be directly applied to the available data.
It is assumed that attendance rates and travel mode choice are not influenced by variance in event pricing on a general level.
By calculating the share of objects that the above mentioned attributes apply to, it is found that each of them is relevant for less than 4.5% of the labeled dataset.
For supervised classification tasks, this results in highly unbalanced datasets and biased accuracy measures. When balancing the labeled dataset by randomly removing samples from the majority classes, only a comparably small subset of the generated data can be used for machine learning. Using word-based features for the classification, there is only a very limited number of multiple occurrences of identical words among documents. As event descriptions use highly diverse language containing slang and special characters, this effect is intensified. Additionally, about 1.4% of all labeled events have descriptions in languages other than German even though the events are held in Germany. This also increases the total number of distinct words in the dataset while multiple word occurrences are not affected. All in all, this leads to an extreme sparsity of the generated feature matrices for a small, balanced training set. If a learning machine is trained on this set, overfitting on the available data is observed. Thus, these target-group-related attributes cannot be used for classification. In fact, a high number of objects would have to be labeled to improve the model generalization.

Event content attributes
Labels with higher penetration must be chosen to increase the theoretical generalization potential. Weather conditions are assumed to have a direct influence on the offline attendance of events that are held outdoors. Also, under certain conditions, the share of car usage in terms of modal split is influenced by the local weather. Thus, a corresponding label is introduced; about 15% of the labeled dataset accounts for outdoor events. Moreover, a label is introduced that focuses on alcohol consumption during the event. As alcohol consumption is expected to decrease car usage among the attendees and promote car pooling, it is potentially relevant as a feature for modeling the target variable parking demand. By counting the number of positively labeled objects, it is found that 32% of the labeled events show alcohol involvement.
For about ten percent of the data, a human labeler cannot determine the outdoor and alcohol attributes solely based on the available text data. This can be caused by a lack of text in general or a lack of significance towards the underlying event themes. As the quality of the available text depends on the organizer input, a varying informative value is observed. Thus, for each binary attribute, a separate target class 'unknown' is introduced that stands for cases of unclear classification. If the labeled dataset is balanced as previously described, for both considered attributes, a random subset of about 150 samples is obtained that contains equal numbers for each of the three classes.
Tf-idf is used to generate feature matrices from the tokenized text collections.
For token filtering, a list of stop words is used that contains characteristic content for the German and English languages, city names, annual figures and other text regarded as non-valuable from a domain- and corpus-specific perspective. Stemming is applied to prepare the remaining tokens, leading to 11,500 distinct terms being considered. Additionally, the object-specific number of words in the available text data is added as a separate feature. As a lack of information is the main reason for labeling as 'unknown', introducing the amount of available text as a feature is expected to improve the identification of positive samples. To fit the value range of the tf-idf features and avoid model confusion, the text length feature is standardized.
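The feature preparation described above, tf-idf vectors plus a standardized text length column, can be sketched with scikit-learn; the stop word subset and example texts are illustrative, and stemming is omitted for brevity:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

docs = [
    "Open air festival im park mit live musik",
    "Konzert in der halle mit dj und bar",
    "Workshop zum thema programmierung",
]
german_stops = ["im", "mit", "in", "der", "und", "zum"]  # illustrative subset

vec = TfidfVectorizer(stop_words=german_stops, max_features=100)
X_text = vec.fit_transform(docs).toarray()

# Word count per document as an additional feature, standardized so its
# value range matches the tf-idf features.
lengths = np.array([[len(d.split())] for d in docs], dtype=float)
X_len = StandardScaler().fit_transform(lengths)

X = np.hstack([X_text, X_len])  # final feature matrix
```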
A set of nine different learning algorithms is tested in default configuration on the available data. As both the outdoor and the alcohol label show three distinct classes, a separate binary model is generated for each class: all samples of the respective class are regarded as positive while all other labels are considered negative. The predicted label is based on the highest obtained confidence among the generated models [165]. This approach is known as the one-vs-rest strategy [166]. The selection of tested models comprises two Naive Bayes classifiers with Gaussian and multinomial kernel functions, an SVM classifier with linear kernel, an ANN and a decision tree. In terms of ensemble learning algorithms, a random forest, a stochastic gradient descent classifier and a voting classifier using the linear SVM and the random forest model as a basis for majority voting decisions are included. Furthermore, a dummy classifier based on uniform guessing is implemented as a reference for the model performances. Figure 30 shows the obtained classification accuracies for the outdoor attribute, focusing on four models in relation to the number of tf-idf features considered.
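A minimal sketch of the one-vs-rest strategy with a uniform dummy baseline; synthetic stand-in data is used here instead of the actual tf-idf matrix:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in features for three classes
# ('positive', 'negative', 'unknown') encoded as 0/1/2.
rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 50)
X = rng.normal(loc=y[:, None] * np.array([1.5, -1.0]), scale=0.4,
               size=(150, 2))

# One binary model per class; prediction takes the highest confidence.
ovr = OneVsRestClassifier(LinearSVC())
dummy = DummyClassifier(strategy="uniform", random_state=0)

acc_ovr = cross_val_score(ovr, X, y, cv=3).mean()
acc_dummy = cross_val_score(dummy, X, y, cv=3).mean()
```

Comparing against the uniform dummy makes clear how much of the accuracy is actually learned rather than guessed.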
The Naive Bayes model with multinomial kernel shows performances of up to 60% accuracy for relatively small numbers of considered features. The Gaussian kernel leads to similar performances with 120 features while the multinomial kernel decreases in accuracy with increasing feature count. The decision tree does not exceed these performance levels either. Overall, the obtained accuracies are too low given the error implications. If preparatory models are used for feature extraction, the introduced error directly propagates to the actual target value, parking demand. As neither the text-mined attributes related to the event target group nor those regarding specific event contents can be realized in sufficient quality, the attribute-specific modeling is considered infeasible.

Estimation of sample size required
As labeling requires significant manual resources, it is beneficial to estimate the number of samples needed before the actual labeling is conducted. This improves the planning capability of machine learning tasks and helps to manage the necessary labeling effort. Figueroa et al. 2012 [167] introduce an estimation methodology that is based on fitting a generic learning curve to empirical model performances for varying sample sizes. Learning curves are generally found to follow inverse power law functions [168]. Equation 4 shows the detailed relationship between the curve parameters and the obtained prediction accuracy. As the classifier performance increases asymptotically, a determines the minimum achievable error. The parameter b defines the learning rate while c sets the decay rate of the function [167].
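One common parametrization consistent with this description is acc(x) = (1 − a) − b · x^c with c < 0, where x is the sample size. Such a curve can be fitted to empirical accuracies as follows; the sample accuracies used here are illustrative, not the thesis values:

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(x, a, b, c):
    """Inverse power law: accuracy approaches (1 - a) asymptotically,
    so a is the minimum achievable error; b and c control learning and
    decay rate (parametrization assumed after Figueroa et al.)."""
    return (1.0 - a) - b * np.power(x, c)

# Illustrative mean accuracies at five tested sample size levels.
sizes = np.array([30.0, 60.0, 90.0, 120.0, 150.0])
accs = np.array([0.38, 0.42, 0.44, 0.455, 0.46])

(a, b, c), _ = curve_fit(learning_curve, sizes, accs,
                         p0=(0.5, 1.0, -0.5), maxfev=10000)
projected = learning_curve(500.0, a, b, c)  # extrapolated accuracy
```

Extrapolating the fitted curve indicates how much additional labeling effort would plausibly pay off before the asymptote (1 − a) is reached.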
In the available literature, the number of features being taken into account is not specified. Thus, a flexible feature spectrum is introduced that represents about one percent of the distinct words in the sample set. A low number of samples interacting with a relatively high count of considered features leads to many irrelevant inputs being taken into account. Lowering the number of features in correspondence with the size of the corpus reduces classifier confusion.
The observed model accuracy highly depends on the chosen learning algorithm, which influences the sample size estimation. To balance this effect, all previously considered modeling approaches are applied on random subsets of the labeled dataset and the arithmetic mean of the obtained accuracy values is calculated to achieve partial independence from single classifiers. As the dummy model works independently of the sample size, it is excluded from the calculation. Figure 31 shows the mean accuracy in dependence of the considered number of labeled samples for both the outdoor and the alcohol attribute. Also, the number of features taken into account at each level is displayed with bars. Five sample size levels are defined for the analysis that lie at or below the maximum available number of 150 balanced samples for each of the two considered attributes. The resulting accuracies remain low overall, while the outdoor value can be classified with a maximum of 47% accuracy. In fact, even with high resource contributions for labeling data, no major classifier improvement is expected. In contrast to the idealized learning curves in the literature, increased sample counts do not necessarily lead to accuracy improvements.
The actual benefit of additional samples for classification is highly dependent on the sample contents. If textual data is added that describes other concepts than the already considered samples, classifier confusion may be the result. The rapidly increasing number of distinct words within the considered sample range indicates that diverging content is being added. For large sample sizes, the rate of new terms being added is expected to decrease as the extent of potential new content must remain within the event domain. In fact, the analyzed sample sizes potentially do not represent the full pattern adequately as they cover only a small value range.
Testing significantly higher sample counts may lead to different findings but this cannot be tested due to scope limitations of this study.
It has to be noted that randomness is introduced by the algorithms that balance and limit the labeled dataset to a certain sample size. For both steps, samples are chosen at random to build up equal class distributions at the desired sample size. Thus, iterative testing with identical sample sizes may lead to diverging performances. Based on the observed heterogeneity of the corpus, data selection is an important reason for high variances of the observed classifier accuracy. To balance this effect and to obtain a robust performance estimator, each constellation of sample size level and target attribute is covered three times and the mean accuracy of all iterations is reported (Figure 31).

Thematic modal split modeling

Parking events as direct modal split indicator
Real-life entities in similar thematic areas are expected to show a strong resemblance regarding their implied mobility demands. Urban bakeries, for instance, typically experience peak occupancy in the mornings when people have breakfast. Differences among individual businesses are assumed to result solely from entity-specific factors such as marketing, product quality and geographic location. It has to be noted that parking events are extracted from FCD using specific machine learning models; Figure 32 visualizes the cross-referencing of parking events to Facebook objects. However, even though these instances are expected to represent the real-world parking demand very closely, they are certainly no exact representation and cannot be used as ground truth. First of all, only a source-specific subset of all vehicles supplies FCD and it is unclear whether this sample adequately represents the actual local mobility demand being satisfied by car.
Secondly, as a model-based parking event extraction is applied, an identification error is introduced. No formal ground truth is available to estimate the relevancy of this type of error. Thus, parking events are seen as an indicator for the local parking situation rather than an explicit representation.

Category-specific parking demand for POIs
All obtained Facebook POIs contain category information, as this field is mandatory during data collection. With 1,124 different categories, the thematic separation is very detailed and parking events can be directly matched with the respective values. In total, two category-specific feature sets are extracted with regard to POIs and parking events. The first one focuses on parking demand observed during POI opening hours. The second one provides aggregated parking indicators on an hourly level.
Opening hours As most POIs are limited to certain opening hours, parking demand is only triggered by the object within these timeframes. Thus, opening hour information that corresponds to POIs in the analysis areas is used as a filter criterion for local parking events. Facebook provides opening hour fields for up to two shifts per day and parking events during these timeframes are summarized.
Individual object polygons with 500 m edge length are chosen to define the potential influence area for the POI. This represents the expected walkable distance for POI visitors after having parked their cars.
The city center test areas cover POIs from 842 different categories. As a certain number of samples is required to derive meaningful patterns, all categories with less than five corresponding objects are neglected for further analysis. Furthermore, on average, about 70% of all considered objects do not comprise opening hour information. Thus, only objects from 287 different categories can be taken into account to determine parking events specifically related to the POI opening status. Figure 33 shows the number of accumulated parking events per weekday within the focus polygons around shopping malls. It is distinguished between parking events during and after the regular opening hours. 29 objects serve as the basis for these findings while 80 further objects cannot be taken into account due to missing opening hour information.
The data indicates significant popularity of shopping malls on Saturdays while on Sundays, most objects are closed and the total parking demand is at its lowest point. Thursday also represents a weekday with comparably low general popularity.
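The opening-hour filter described above could be sketched as follows, assuming parking events are available as timestamps and opening hours as weekday-specific shift tuples; the function and field names are illustrative:

```python
from datetime import datetime

def split_parking_events(events, opening_hours):
    """Split parking event timestamps per weekday into events during
    and outside a POI's opening hours.

    opening_hours: dict weekday (0 = Monday) -> list of (open_hour,
    close_hour) tuples, supporting up to two shifts per day as provided
    by Facebook. Returns two dicts mapping weekday -> event count."""
    during = {d: 0 for d in range(7)}
    outside = {d: 0 for d in range(7)}
    for ts in events:
        day = ts.weekday()
        shifts = opening_hours.get(day, [])
        if any(o <= ts.hour < c for o, c in shifts):
            during[day] += 1
        else:
            outside[day] += 1
    return during, outside
```

Days without opening hour entries (e.g. Sundays for closed malls) automatically collect all their events in the 'outside' counts.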

Category-specific parking demand for events
As event objects also comprise a category attribute, this could in principle be used for a category-specific analysis analogous to the POI side. However, this information is available for only a small share of event objects, so a classifier is required to assign categories based on the available text. Using the set of learning algorithms from Chapter 6.3, tf-idf in combination with stemming and filtering for stop words is applied to the available text for feature preparation. Also, the word count of the respective object descriptions is added as standardized input. Figure 35 shows the highest precision among the tested learning machines in default configuration for varying numbers of tf-idf features considered.
Superior categorization As the obtained category models are not satisfactory with regard to their error implications, the chosen strategy has to be changed. As the reduced extent of the labeled dataset after balancing excludes the majority of the available data, learning potential is lost. Thus, the detailed original category schema is replaced by the summarized system used in Appendix F2. However, the resulting models still do not achieve sufficient accuracy, and substantially larger labeled datasets would be required. These cannot be obtained in the context of this thesis due to resource limitations.
Thus, category-specific parking event features are not taken into account.

Parking demand from topic models
It was found in the previous section that the category scheme supplied by Facebook cannot be used directly for event objects. Thus, LDA topic models with varying numbers of topics are trained on the available event texts and compared based on the perplexity measure. The obtained scores are shown in Figure 36.
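Perplexity-based model selection of this kind can be sketched with scikit-learn; an illustrative toy corpus stands in for the actual event texts, and the candidate topic counts are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative toy corpus standing in for the event descriptions.
docs = [
    "rock konzert live band buehne",
    "flohmarkt markt stand handel",
    "rock live festival band",
    "markt regional handel stand",
] * 10

counts = CountVectorizer().fit_transform(docs)  # LDA expects raw term counts

best_k, best_perplexity, best_model = None, float("inf"), None
for k in (2, 5, 10):  # candidate topic counts
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(counts)
    p = lda.perplexity(counts)  # lower perplexity = better fit
    if p < best_perplexity:
        best_k, best_perplexity, best_model = k, p, lda

# Per-document topic probabilities become the model input features.
topic_features = best_model.transform(counts)
```

Each row of `topic_features` is a probability distribution over the selected topics and can be appended directly to the feature matrix of a parking demand model.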
It turns out that the simplest model with only five differentiated topics achieves the lowest perplexity and is selected.

Only some of the directly available attributes can be used for both categories.
The adjusted popularity approach uses a reference data source for POIs and domain knowledge for events to increase the representativeness of the available Facebook data. Direct mining of text-based attributes is found infeasible due to a lack of textual data (POIs) or insufficient accuracy of preparatory classifiers (events).
The extraction of features based on thematic object similarity is conducted using the specifically supplied category attribute on the POI side. This information is matched with historical parking events identified in FCD, with the 70 largest cities in Germany as the analysis area. As features, category-specific parking demand indicators are derived both based on POI opening hours and as specific values for combinations of weekday and hour of the day. Conducting a similar approach is not possible for event objects as the majority of relevant category data is missing. Multi-class labeling models are found to achieve insufficient performance for being implemented as a corrective step. Thus, probabilistic category assignment of event objects is conducted using LDA models. Here, the best model is selected based on the perplexity measure. It differentiates between five topics.
The individual topic probabilities are computed for the analysis dataset, leading to five new input features being added. On the POI side, this procedure cannot be repeated due to the lack of textual information.

Feature evaluation
This chapter describes the procedures used for evaluation of the extracted input feature sets based on historical off- and on-street parking occupancy data.
Multiple machine learning models are generated based on different feature constellations. A set of baseline features is introduced and the added value of the social media features is assessed.

Feature evaluation workflow
The first part of the evaluation is based on historical occupancy information related to diverse off-street facilities in multiple German cities. The target for this stage is high prediction performance on the relative utilization of the paid facilities. In the second stage, on-street parking utilization from long-term camera surveillance within the Braunschweig city area is used as ground truth. The first target value in this case is also the parking area utilization. The second target is represented by the binary differentiation between the parking area states 'full' and 'available'. Different timeframes for the parking occupancy predictions are applied.
These comprise four levels ranging from short-term forecasts with a relative difference of one hour to long-term predictions limited by a 72-hour timeframe. For each combination of application and prediction timeframe, separate models are trained that include the social media feature sets to be evaluated. The obtained cross-validated model performances are subsequently compared to a baseline reference that considers only basic features. Figure 40 summarizes the described design of the feature evaluation workflow.
As dynamic features, the respective weekday, the hour of the day and past utilization are considered in the baseline model. The latter comprises a feature reflecting the most recently known utilization value. As the extracted social media features are all object-specific, an aggregation procedure is necessary that summarizes the effects of multiple objects on the occupancy of parking areas in the surrounding. This is realized by computing the respective features over all objects in a focus polygon around the considered parking area.
As the interaction of objects with regard to the target variables has not been sufficiently researched, potential effects in this area are neglected. In terms of polygon size, an edge length of 0.5 km is chosen in line with the initial assumptions regarding the potential walking distance of car drivers after having parked their cars.
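The polygon-based aggregation could be sketched as follows; the coordinate handling and the feature dictionary layout are assumptions for illustration:

```python
import math

def aggregate_features(parking_lat, parking_lon, objects, edge_km=0.5):
    """Sum object-specific feature values over all social media objects
    inside a square focus polygon centred on the parking area."""
    half_lat = edge_km / 2 / 111.0  # ~111 km per degree of latitude
    half_lon = edge_km / 2 / (111.0 * math.cos(math.radians(parking_lat)))
    totals = {}
    for obj in objects:
        if (abs(obj["lat"] - parking_lat) <= half_lat
                and abs(obj["lon"] - parking_lon) <= half_lon):
            for name, value in obj["features"].items():
                totals[name] = totals.get(name, 0.0) + value
    return totals
```

Interaction effects between objects are deliberately ignored here, matching the simplification stated above; summation over the polygon is the only aggregation applied.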

Off-street evaluation
In total, historical occupancy is available for 57 off-street facilities. It can be seen that the prediction timeframe of one hour leads to high accuracies among all tested models. Longer prediction timeframes tend to result in lower model performances. However, better accuracies are observed on the 24-hour timeframe than on the eight-hour timeframe for all tested models. Adding event features is found to improve model performance by a small degree in both cases. Figure 41 shows the random forest off-street prediction performance in different feature configurations, for the facility-specific models as well as for the combined models using data over all tested facilities. With regard to the analyzed model pair, the difference between baseline and extended model is larger for the combined model than for the mean over the single models.

Discussion
In the first phase, the research activities comprise the development of scalable methodologies for data acquisition and the exploratory analysis of the available data. Using a self-developed benchmarking methodology, the scope and quality of the social media data are evaluated against alternative data sources. In the second phase, diverse information retrieval techniques are applied to extract potentially relevant sets of input features for the ultimate goal of parking demand modeling. Finally, the value added by these features is evaluated against a ground truth of parking occupancy data for both off- and on-street parking facilities.

Data acquisition
In the course of this thesis, leading online data sources in the areas of social media, mapping and events are preliminarily benchmarked based on publicly accessible popularity indicators. Facebook, OSM, Eventbrite and Ticketmaster are chosen as target platforms and scalable approaches are developed for data acquisition from these sources. While the collection of data based on publicly accessible web APIs is feasible at scale for most of the considered sources, data acquisition using the Facebook API requires the development of a specific methodology. It is based on achieving complete areal coverage using a large number of API requests that each supply location-specific data for a circular area. A flexible, density-based parametrization is implemented, leading to 1.41 million POI objects and 1.7 million event objects being retrieved. The areal size of zip code areas is used as the basis for estimating the location-specific degree of urbanization. Only urban areas, which represent 60% of the total areal extent of Germany, are selected for data acquisition. The collected objects include a variety of metadata such as the object-specific online popularity and further attributes.
While it is likely that zip code areas do not indicate the respective object density in a highly accurate manner, no detailed demographic or alternative indicators are available that could serve as a ground truth or as a better indicator. Given the limited availability of alternatives, it is reasonable to use this data as a basis for the density estimation.
Generally, the developed collection algorithm is based on a large number of API calls that cause load on the provider side. It represents a workaround that specifically targets the proprietary structure of the API in order to retrieve geographically referenced data with sufficient coverage. Rate limitations prohibit the efficient parallelization of API calls by requiring waiting time after a certain data volume has been extracted. Thus, the time consumed by the data acquisition process has to be taken into account when integrating it into a productive system. The observed changes over time regarding the extent of provided data are considered manageable with moderate recollection cycles. Thus, the acquisition procedure is considered sufficiently scalable.
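The areal coverage idea can be sketched by generating centers for overlapping circular request areas over a bounding box. The spacing of radius times the square root of two guarantees full coverage because each circle then covers one tile of a square grid inscribed in it; the actual parametrization in the thesis is density-based and more elaborate, so the function below is only an illustrative simplification.

```python
import math

def circle_grid(lat_min, lat_max, lon_min, lon_max, radius_km):
    """Generate centers of circular API request areas covering a bounding box.

    Circles of the given radius placed on a square grid with spacing
    radius * sqrt(2) fully cover the plane (square-in-circle argument).
    """
    step_km = radius_km * math.sqrt(2)
    lat_step = step_km / 111.32  # km per degree of latitude
    centers = []
    lat = lat_min
    while lat <= lat_max + lat_step:
        # Longitude degrees shrink with latitude; clamp cos() near the poles.
        lon_step = step_km / (111.32 * math.cos(math.radians(min(abs(lat), 89.0))))
        lon = lon_min
        while lon <= lon_max + lon_step:
            centers.append((round(lat, 6), round(lon, 6)))
            lon += lon_step
        lat += lat_step
    return centers
```

Each center would correspond to one location-based API request; a density-based variant would shrink the radius in urban areas to stay below per-request result limits.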

Database benchmark
In the second phase, acquired reference data is compared to the Facebook dataset using a multi-stage procedure for the identification of duplicate objects. This includes contextual matching with focus on geographic and time-based proximity, syntactic matching of object names and thematic matching of objects based on their respective category. For the name matching phase, a combined methodology is developed that is based on the longest common substring method and the Jaccard index as key similarity indicators. Categorical matching is based on the manual assignment of congruent category labels among the data sources. This process is supported by syntactic label matching and dictionary-based retrieval of similar labels. Regarding the overlap between Ticketmaster and Facebook events, an accuracy of 77% is achieved for the trained model, leading to 550 identified duplicates.
Taking into account the overlap between Eventbrite and Facebook events, a classifier accuracy of 95% is observed, representing 635 duplicates. As the Facebook dataset for the same timeframe is multiple times larger than the benchmark sources and duplicates represent a large share of them, Facebook is considered to be fully superior. The availability of popularity data and textual object descriptions also supports this evaluation.
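The name matching stage can be sketched with the two similarity indicators named above. The equal weighting and the normalization of the longest common substring by the longer name are illustrative choices, not the exact combination used in the thesis.

```python
def longest_common_substring(a, b):
    """Length of the longest common substring, via dynamic programming."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def jaccard(a, b):
    """Jaccard index over whitespace-separated token sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def name_similarity(a, b, w_lcs=0.5, w_jac=0.5):
    """Weighted combination of both indicators; weights are hypothetical."""
    a, b = a.lower(), b.lower()
    lcs_norm = longest_common_substring(a, b) / max(len(a), len(b), 1)
    return w_lcs * lcs_norm + w_jac * jaccard(a, b)
```

Identical names score 1.0, while names sharing only a substring such as "halle" score low, so a threshold on this score can feed the duplicate classifier together with the contextual and categorical indicators.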

Feature engineering
Adjusted popularity measures
The literature review and exploratory data analysis indicates that the Facebook dataset is a skewed representation of the actual behavior observed in society.
Thus, in terms of feature extraction, directly available popularity attributes are adjusted. One approach is adjustment based on a reference data source that is assumed to be representatively distributed in terms of the themes covered and the user interaction observed. While OSM is used for this purpose on the POI side, no adequate reference source is available on the event side. The second adjustment-focused approach is based on the inclusion of domain knowledge from the literature and is applicable only on the event side due to the availability of relevant findings. As one of the few published sources of information in this area, publications by the Japanese event platform Doorkeeper are considered for feature extraction. As it covers mostly professionally-themed and paid events within a different cultural context than the target area Germany, the representativeness of the delivered information remains unclear. However, these findings represent the most detailed source of information available, as only very limited research has focused on this particular area.

Text mining for feature extraction
Another approach for feature extraction is developed that focuses on the explicit retrieval of attributes from textual contents of the Facebook objects. As the collected POIs do not contain a sufficient amount of text, only event objects are taken into account for this process. Text is transformed into term-based features using the tf-idf technique and modeling is conducted using multi-label machine learning classifiers. The target attributes to be extracted represent certain concepts indicated by the respective object text. One set of target labels focuses on demographics and behavioral characteristics of the event's potential attendees.
Certain influence factors are covered that are presented as relevant in the literature for the travel mode choices made, implicitly indicating the parking demand for certain event objects. In particular, attributes denoting events specifically for elderly people, individuals with relatively high income and environmentally-aware users are introduced. However, the low penetration of the dataset with events that fall into one of these categories makes it impossible to use these attributes in an automated classification context.
For this reason, alternative target attributes describing the actual event contents are added. The focus is set on events that are held outdoors and events that involve alcohol. Outdoor events are expected to imply weather-dependent mobility behavior, while alcohol is expected to shift the observed modal split away from car usage, decreasing the parking demand. Despite a more prominent penetration of the dataset with these attributes, the derived machine learning classifiers only achieve cross-validated accuracies of up to 60%. As the error introduced by feature extraction models directly influences the subsequent parking demand modeling, the text-mining-based attribute retrieval is considered infeasible with the available scope of manually labeled data. An estimation of the number of labeled samples required for higher classifier accuracy is conducted. The approach is based on fitting a learning curve over models trained on a variety of different sample sizes. It is found that adding further samples cannot meaningfully increase the achieved classifier accuracies. This behavior is interpreted as a result of strong data heterogeneity regarding the number of common term features in different objects. Adding new samples mainly leads to diverging content being added and a higher number of distinct words being taken into account. Compared to popular reference corpora in the natural language processing literature, the event-related data is in fact observed to be more heterogeneous and less easily generalizable.
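The sample-size estimation can be sketched as fitting an inverse power law acc(n) = a - b * n^(-c) to accuracy-versus-sample-size pairs, where the asymptote a indicates the accuracy reachable with unlimited labels. The coarse grid search below is an illustrative stand-in for whichever fitting routine was actually used, and all values are synthetic.

```python
def fit_learning_curve(sizes, accuracies):
    """Fit acc(n) = a - b * n**(-c) by a coarse grid search.

    Returns the tuple (a, b, c) minimizing the squared error; the
    parameter grids are illustrative assumptions.
    """
    best = None
    for a in [x / 100 for x in range(50, 101, 2)]:      # asymptote 0.50..1.00
        for b in [x / 10 for x in range(1, 31)]:        # scale 0.1..3.0
            for c in [x / 100 for x in range(5, 105, 5)]:  # decay 0.05..1.00
                err = sum((a - b * n ** (-c) - acc) ** 2
                          for n, acc in zip(sizes, accuracies))
                if best is None or err < best[0]:
                    best = (err, a, b, c)
    return best[1:]
```

If the fitted asymptote a barely exceeds the accuracy already observed, further labeling effort is unlikely to pay off, which matches the conclusion drawn above.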
In this phase, a fixed set of stop words and stemming as preparation of terms are used for all classifications. Lemmatization-based feature preparation was also tested and found to increase the heterogeneity of the training-ready feature set. This is explained by the characteristic differentiation of semantically similar terms based on their diverging suffixes. In fact, this alternative preparation technique is less favorable. Also, alternatives to tf-idf features like the binary consideration of term occurrences are tested but found to lead to lower classifier accuracies.
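The term-feature construction described above can be sketched in a few lines. The function follows the common smoothed tf-idf variant; smoothing conventions vary between libraries, and stop word removal and stemming are omitted here for brevity.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Compute tf-idf vectors for a small corpus.

    Returns (vocab, rows): a sorted vocabulary and one weight vector per
    document. Uses the smoothed idf log((1+n)/(1+df)) + 1.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter()                       # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    vocab = sorted(df)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[t] / len(toks) * idf[t] for t in vocab])
    return vocab, rows
```

Terms occurring in every document, such as a dominant word like "festival" in the toy corpus below, receive lower weights than rarer, more discriminative terms, which is exactly the property that makes tf-idf features useful for the classifiers discussed above.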

Thematic modal split modeling
Another feature extraction approach cross-references historical parking events from FCD with POI categories and opening hours. However, the approach introduces bias based on the observation that opening hours for many POI categories primarily cover active daytime hours with an independently greater traffic demand compared to nighttime hours. In fact, the parking-related influence of POIs cannot be fully separated from other factors such as the weekday- and hour-of-day-specific background influence. Moreover, opening hour information is missing for 70% of all objects, so this type of feature cannot be extracted for them, which strongly limits the amount of data for pattern recognition.
Also, due to computational limitations, only POI objects in the focus areas are considered. With over 1,100 different POI categories in the Facebook dataset, the number of objects per category in the focus areas is often not large enough. Smaller object counts are assumed to limit the generalization of potentially underlying patterns.
As a minimum number of category-specific samples is required, only about 26% of all POI categories can be taken into account for feature extraction. This coverage is generally considered insufficient for meaningful feature extraction, but the created set is still passed to the feature evaluation phase.
Furthermore, it has to be noted that the analyzed FCD is not evaluated for its capability to represent real-life traffic patterns with sufficient accuracy. The FCD is provided by external suppliers that potentially only cover user groups with certain behavioral patterns. It is possible that peaks are observed at locations where the source is more popular than at others. As no reasonable statement can be made with regard to these aspects, unbiased representativeness is assumed for feature generation in the given applications.
Additionally, the presented processes do not distinguish between different kinds of parking events. Individual patterns are observed for vehicles leaving their parking spot, successful parking and unsuccessful searching for parking, and these may have different implications for the availability of parking space. The main reason for neglecting this differentiation is the limited amount of available parking event data in the respective categories once they are assigned to specific Facebook objects. Only the summarized version allows object-specific patterns to be distinguished. Finally, only parking events from a timeframe of two months are taken into consideration for computational reasons. It remains unclear whether the observed patterns in this period adequately represent potential general observations on a yearly basis.
The category attribute is not available for most Facebook event objects. Thus, the feature extraction approaches on the POI side cannot be directly applied to events. First, a set of text-based predictors is developed that aims at replacing the missing categorical attribute. Different levels of thematic summarization are tested, but no sufficiently accurate classifier could be constructed based on the generated set of labeled data. Thus, unsupervised topic modeling based on LDA is applied to the event-related text data. This technique identifies thematic concepts in the entire text corpus in a probabilistic manner. Continuously evaluating the degree of model generalization, a configuration distinguishing five topics is selected.
Subsequently, each event in the focus dataset is assigned its respective topic probabilities and the observed number of parking events is analyzed in relation to the distinguished topics. Event objects with a high probability for the themes 'sports' and 'fitness' are found to correspond with higher parking intensities. The respective probability vector is directly used as feature input.

Feature evaluation
The extracted features are evaluated for their implications on the occupancy of parking space at selected locations in Germany. Various prediction timeframes are used that range from one to 72 hours in the future from the momentary situation. For each configuration, the performance of a baseline model using a set of basic features is compared to an extended model that has the extracted social media features added. Focus areas covering a walkable distance around the considered parking locations are defined and object-related features within these areas are summarized to reflect co-existence. Potentially important interaction patterns among objects of certain themes or categories are not investigated. In the literature, the mobility-related interaction of POI or event objects has hardly been discussed and simple feature summarization is applied based on the lack of more promising approaches. For each feature configuration, different learning algorithms are tested and random forest models are found to outperform the evaluated alternatives.
On the off-street side, a mean R² of 0.88 over all tested prediction timeframes is achieved with the extended feature set including social media features. The corresponding baseline model only reaches a mean R² of 0.85. Short-term predictions for one hour in the future strongly rely on utilization values from the past hour. In fact, there are local short-term trends that can be identified and modeled with this feature set. For longer prediction timeframes, the respective weekday and hour of the day also turn out to be particularly relevant. On average over all prediction timeframes, these basic feature sets remain important and explain about 86% of the observed variance. Direct and adjusted popularity measures, event topics and hourly parking events account for only about 9% of the observed variance on average. Even though the effort for feature extraction is high, valuable accuracy improvements are induced by certain social media feature groups.
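The baseline-versus-extended comparison can be reproduced in miniature with scikit-learn. The synthetic data, feature names and effect sizes below are purely illustrative stand-ins for the thesis datasets; the point is only the evaluation pattern of training the same learner on both feature sets and comparing cross-validated R².

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 600
weekday = rng.integers(0, 7, n)     # baseline feature: weekday index
hour = rng.integers(0, 24, n)       # baseline feature: hour of day
past_util = rng.random(n)           # baseline feature: past utilization
topic_prob = rng.random(n)          # extended feature: e.g. event topic probability

# Toy target in which the social media feature carries real signal.
target = 0.6 * past_util + 0.3 * topic_prob + 0.05 * rng.standard_normal(n)

X_base = np.column_stack([weekday, hour, past_util])
X_ext = np.column_stack([weekday, hour, past_util, topic_prob])

def rf():
    return RandomForestRegressor(n_estimators=100, random_state=0)

r2_base = cross_val_score(rf(), X_base, target, cv=5, scoring="r2").mean()
r2_ext = cross_val_score(rf(), X_ext, target, cv=5, scoring="r2").mean()
```

Because the extended set contains a feature that genuinely explains part of the target variance, the cross-validated R² of the extended model exceeds the baseline, mirroring the qualitative pattern reported above.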
Regarding on-street parking areas, camera-based occupancy data from strategically important locations in the Braunschweig city area is used for model training. The extended feature set is similarly used to forecast the share of utilized spots within the monitored parking areas. A mean R² of 0.46 is achieved, representing a significant increase over the mean R² of 0.26 for the baseline model. However, both models are not sufficiently accurate to derive valuable occupancy predictions.
Thus, the problem formulation is transformed into modeling the binary parking area states 'full' and 'available'. This is expected to reflect the actual information needs of drivers in a more user-centered way: for them, it is only relevant whether or not parking spots are available at a specific location and time. It has to be noted that the on-street findings are based on a rather small ground truth, as no extensive occupancy data for these areas is available. This increases the risk of model overfitting, as no broad set of parameter configurations can be used for training and testing. Also, all observation points are located within the Braunschweig city area, which prevents broader geographic and social variation from being included. It is possible that validating an on-street occupancy model with this data leads to geographic overfitting on the tested city. This would be the case if traffic and car usage patterns in other cities are fundamentally different. However, based on the homogeneity of parking-related findings in the literature across different geographical contexts, similarity among cities can be assumed.
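The reformulated binary target can be illustrated with a small helper. The 0.95 'full' threshold is an illustrative assumption, not a value taken from the thesis.

```python
def binarize(utilizations, full_threshold=0.95):
    """Map relative utilization values to the binary parking area states
    'full' and 'available'. The threshold is a hypothetical choice."""
    return ["full" if u >= full_threshold else "available" for u in utilizations]

def accuracy(pred, truth):
    """Share of correctly predicted binary states."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)
```

Model quality is then judged by classification accuracy on these states rather than by R² on the continuous utilization, matching the driver-centered framing described above.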

Conclusion and contributions
This thesis represents one of the first studies that focuses on large-scale feature extraction from social media to model urban parking demand. It draws one of the first connections between crowdsourced data and modal split using extracted data from FCD and various other sources with extensive coverage. Multiple approaches for scalable data acquisition and an accurate methodology for text-based identification of duplicates in heterogeneous online databases are introduced. Here, an extension of established procedures for syntactic similarity mining is developed and applied for benchmarking of social media against alternative online data sources.
Findings in the literature are used to identify potentially relevant modeling parameters that are covered with specifically extracted attributes from the raw data.
Among others, this phase covers the adjustment of directly available popularity attributes based on reference data sources and external publications. Also, text- and model-based identification of the targeted attendee group and further event characteristics is tested. This approach was found to be infeasible with comparably small labeled datasets due to the heterogeneity and divergence of the event text corpus. Finally, thematic similarities among POIs and events are used to draw category-based connections to historical parking events extracted from FCD. An extensive analysis area covering the 70 largest cities in Germany is used to derive thematic features. For events, the text-based reconstruction of missing attributes is tested and, finally, unsupervised topic modeling based on LDA is applied to derive probabilistic features focusing on thematic similarity.
The evaluation of the constructed features is based on historical data for multiple off-street facilities across Germany and an on-street ground truth for the city of Braunschweig. Separate models were generated using a baseline selection of influence features, as well as an extended feature set comprising the extracted social media attributes. Random forest models were found to perform best among the different tested learning algorithms, leading to a mean R² of 0.88 over different prediction timeframes for off-street facilities with the extended feature set. Here, the extracted social media features were found to explain a small, but still relevant part of the observed variance. For the tested on-street facilities, a mean model R² of 0.85 is achieved using PCA for feature preparation and a binary target attribute that distinguishes between available and unavailable parking areas. Here, event topic probabilities and aggregated parking events on an hourly basis are identified as particularly relevant input sets. In summary, it is recommended to include social media features in parking demand modeling, as their integration leads to comparably small, but valuable accuracy improvements of the underlying machine learning models.

Future work
For the future, it is recommended to extract and test further feature configurations from the raw social media data. The integration of further data sources and potential future findings from the literature may lead to further accuracy improvements. For instance, as the benchmarking showed that OSM is a comprehensive data source containing comparably few duplicates of the Facebook POI set, future research may focus on the integration of derived features using OSM as a data source. POI-related text data may be cross-referenced from secondary online sources to create a basis for text mining and popularity estimation.
New findings regarding the representativeness of social media in comparison to the physical attendance behavior in society may lead to improvements of the adjusted popularity approach. In this area, great potential is seen in more in-depth data covering the interaction behavior of social media users with POIs and events. The most promising approach is to build up a large focus group that voluntarily contributes information for behavioral analysis based on social media data. Besides opening up interesting research potential in the social sciences, these behavioral patterns are expected to be a valuable basis for understanding how interactions in social media affect individual mobility behavior in real life. This may include many parking-related influences such as age, income, car accessibility and particular interests. For example, this may make it possible to grasp the social sentiment towards particular events in order to derive more accurate estimates for the observed offline attendance. It is possible to include individual behavior as part of an agent-based simulation or a similar technique. Also, social media mining may be extended by analyzing photo and video data to recognize events that are only implicitly observable and to estimate their mobility implications. Having access to highly detailed data on individuals, their media usage and their travel mode choices provides the opportunity to derive a variety of new features for parking demand modeling. Closing the research gap between highly available social media data and individual mobility behavior is expected to have a significant impact on many areas such as city planning and digitalized mobility services.
Furthermore, a larger on-street ground truth for parking occupancy covering more data points and more diverse locations would provide a generally higher reliability of the derived findings. As the generalization of on-street occupancy models is directly dependent on this data, representativeness is particularly important.
Moreover, larger sample sizes for the text-based attribute extractors may lead to higher achieved accuracy of the generated models. This requires significant further labeling and data acquisition over longer timeframes. Given the required computational resources, larger amounts of FCD can be taken into account to derive revised feature sets. In this case, the applied focus areas for deriving features may also be extended to a nationwide scale. A larger dataset for topic modeling may also lead to different, potentially more valuable thematic structures being identified in the text corpora.
Finally, evaluating the representativeness of the available FCD sources in comparison to other traffic intensity indicators may increase the degree of data understanding and reliability of derived models. Also, further applications of parking events and other comparable features from FCD can be researched.