We have just learned some interesting aspects of geospatial big data, including the common attributes of big data in the geospatial context and the complexity of geospatial big data. Now, let's look at geospatial data science. What we are going to briefly address is, essentially, the science behind geospatial big data. You can think of geospatial data science as the intersection of three domains. The first domain is geospatial sciences and technologies, which provides the scientific foundation and technological basics for geospatial data science. The second domain is mathematical and statistical sciences. This is very much about what is at the core of data and how you reveal trends and patterns in data in general. The third domain is cyberinfrastructure and computational sciences. We have already learned the basics of cyberinfrastructure. Computational science is essentially the science of computational work and studies, meaning how you use computation as a major mechanism for scientific research. Cyberinfrastructure and computational science tend to be closely related. Now, geospatial data science essentially intersects all three of these domains, but it is very much centered on big data and CyberGIS, which we have reviewed and covered to a good degree so far. One of the major principles of geospatial data science is the notion of divide and conquer. Because we're dealing with geospatial big data, oftentimes we really cannot fit a single problem into our computational resources as a whole, all at once. We really have to break the big problem into parts and solve them using multiple, or even a large number of, computing resources together. This is a major difference between geospatial data science and other branches of data science.
In geospatial data science, we need to use geospatial principles, foundations, and knowledge to understand how to divide and conquer a big problem, and a significant big data exercise, into different parts that are still connected together to represent the holistic problem. Here is a divide-and-conquer example. In the illustration I'm showing here, you could think of these layers of maps, each representing a thematic attribute, as a big dataset. Now, if it is too big for our conventional GIS approaches to handle, we definitely have to break this dataset into parts, and the associated analytics into parts as well. Once you break them into parts, you need to figure out which parts should go to which computing resources. I have three very basic and intuitive representations of computing resources here, just from a capacity point of view. There is, for instance, a small-capacity computing resource, and there is a larger one. The middle one is the resource with the largest computing capacity. Now, the question is, how do you match the geospatial parts coming from the big dataset and the big problem to these different computing capacities? In fact, this is not trivial. The allocation is supposed to be guided by our understanding of the geospatial problems we need to solve and of the geospatial data models and structures we need to build, and then we optimize the allocation of the different parts to the computing resources of different capacity. These different computing capacities are provisioned, as we learned, through advanced cyberinfrastructure. On cyberinfrastructure, such computing resources with different capacities could be offered by an integrated supercomputer such as Virtual ROGER, a geospatial supercomputer, which is handy for our purpose of solving such a problem. But they could also be offered by a set of distributed computing resources provisioned by cyberinfrastructure.
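To make the matching question a bit more concrete, here is a minimal sketch of one possible way to allocate parts to resources of different capacity. It is not the method used by CyberGIS or Virtual ROGER. The tile names, the workload numbers, the relative capacities, and the greedy "heaviest part first, earliest finish time" rule are all assumptions chosen for illustration. The sketch also ignores spatial adjacency between parts, which a real geospatial decomposition would have to respect.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    capacity: float                      # hypothetical relative throughput (work units per second)
    tiles: list = field(default_factory=list)
    load: float = 0.0                    # total workload assigned so far

    def finish_time(self, extra=0.0):
        """Time to finish the assigned load plus an optional extra part."""
        return (self.load + extra) / self.capacity


def assign_tiles(workloads, resources):
    """Greedy allocation: take parts from heaviest to lightest and give each
    one to the resource that would finish it earliest."""
    for tile_id, work in sorted(workloads.items(), key=lambda kv: kv[1], reverse=True):
        best = min(resources, key=lambda r: r.finish_time(work))
        best.tiles.append(tile_id)
        best.load += work
    return resources


# Hypothetical workloads, e.g. the number of features falling in each spatial tile.
workloads = {"NW": 120, "NE": 900, "SW": 60, "SE": 300, "C": 1500}

# Three resources as in the illustration: small, largest in the middle, medium.
resources = [Resource("small", 1.0), Resource("large", 4.0), Resource("medium", 2.0)]

assign_tiles(workloads, resources)
makespan = max(r.finish_time() for r in resources)   # time until the last resource is done

for r in resources:
    print(r.name, r.tiles, r.finish_time())
print("makespan:", makespan)
```

With these made-up numbers, the dense central tile goes to the largest resource and the sparse tiles go to the small one, so all three resources finish at roughly similar times. The point of the sketch is only that the allocation depends on how the workload is spatially distributed across the parts.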
Now, the question is how to coordinate across computing resources of different capacity to solve our problem, respecting the nature of the different geospatial parts, which need to be consistently organized for an automated solution that connects the parts back into the holistic solution. This divide-and-conquer example is a good one to help us understand the geospatial aspects of data science. Generic data science is not necessarily concerned with how these different parts are organized, or how to optimize the match between the parts and computing resources of different capacity. Nor is it necessarily concerned with how you connect the parts, once they are solved, back into the holistic solution, or how you ensure that the quality of the solution is not compromised by the individual parts being solved on different computing elements provisioned by advanced cyberinfrastructure. Geospatial thinking is essentially pervasive, not only in the problem representation and in how you divide and conquer, but also in how you synthesize the partial solutions in the end into the holistic solution you hope to achieve. That is a good example for us to appreciate the geospatial sciences and technologies, and the geospatial knowledge, that we inherently need to use to guide problem-solving processes built on top of cyberinfrastructure, which eventually become CyberGIS approaches to coming up with solutions. The notion of harnessing geospatial big data with the support of geospatial data science has to be scalable. Scalable means that, on the geospatial side, we need to understand and best utilize the knowledge of spatial characteristics such as spatial distribution. For example, the distribution of population in the US and across the globe is very heterogeneous, depending on socioeconomic status and many other factors. This distribution is always changing, and that needs to be taken into account when we solve a problem or study a phenomenon.
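As a small illustration of connecting the parts back to the holistic solution without compromising quality, here is a sketch under simple assumptions. The analysis is a 3x3 neighborhood sum on a toy raster, the raster is divided into column strips, and each strip carries a one-cell "halo" of neighboring columns so that cells along a seam still see their true neighbors. The function names and the strip layout are hypothetical, and this is just one common way to handle seams.

```python
def focal_sum(grid):
    """3x3 neighborhood sum for every cell; neighbors outside the grid are ignored."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = sum(
                grid[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            )
    return out


def focal_sum_by_strips(grid, n_strips, halo=1):
    """Divide the grid into column strips, solve each strip together with a halo
    of neighboring columns, then stitch only the core columns back together."""
    rows, cols = len(grid), len(grid[0])
    bounds = [round(i * cols / n_strips) for i in range(n_strips + 1)]
    stitched = [[] for _ in range(rows)]
    for start, stop in zip(bounds, bounds[1:]):
        lo, hi = max(start - halo, 0), min(stop + halo, cols)
        part = [row[lo:hi] for row in grid]   # this part could be sent to a separate resource
        solved = focal_sum(part)
        for r in range(rows):
            stitched[r].extend(solved[r][start - lo:stop - lo])   # drop the halo columns
    return stitched


# Toy 4x6 raster with distinct positive values.
grid = [[r * 6 + c for c in range(6)] for r in range(4)]

whole = focal_sum(grid)
with_halo = focal_sum_by_strips(grid, n_strips=3, halo=1)
without_halo = focal_sum_by_strips(grid, n_strips=3, halo=0)

print(with_halo == whole)      # parts solved with a halo reproduce the holistic result
print(without_halo == whole)   # ignoring neighbors across seams changes the answer
```

The comparison at the end is the quality check: with the halo, the stitched result matches the result computed on the whole raster, and without it the cells along the seams come out wrong. Knowing that a one-cell halo is enough here comes from knowing the spatial footprint of the analysis, which is the kind of geospatial knowledge the divide-and-conquer step relies on.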
A heterogeneous, changing distribution like this is inherently spatial, and we need to address the uniqueness of such spatial attributes. On the computational side, we need to figure out a number of trade-offs. The first is computational complexity versus computational intensity. When we study an algorithm, we often study its complexity in terms of, for instance, the upper bound on the time taken to finish the computation. Computational intensity, in contrast, is how much computational resource you actually need to use to finish a computational task. Computational complexity is more on the theoretical side and computational intensity is more on the empirical side, but they are not exactly the same. Sometimes we need to trade off complexity against intensity. There is also computational uncertainty versus validity. Any computation on our mainstream computing architectures is an approximation, because the digitized representation of the real world is an approximation. So inherently, computing as an approach to scientific research carries uncertainty. At the same time, you want your results and solutions to be valid, which is always a very important priority, so there is a trade-off: how can you live with uncertainty while making sure your computing results and findings are valid? The third trade-off is performance versus reliability. We always want good performance in finishing our computational tasks, but at the same time we want to make sure our computational processes are reliable. Especially for certain critical applications, reliability is a hugely important concern. So performance versus reliability is a very important trade-off. Now, scalability means we need both the knowledge of spatial characteristics and an understanding of these computational trade-offs, put into the context of scalable problem-solving, scalable handling of geospatial big data, and scalable divide and conquer.
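To illustrate how complexity and intensity can differ, here is a rough sketch. It assumes that the number of pairwise distance comparisons is an acceptable proxy for computational intensity, and it uses a hypothetical grid-bucketed neighbor search as the algorithm. The worst-case complexity of this search is the same for both inputs, quadratic in the number of points, yet the work actually performed depends strongly on the spatial distribution of the points.

```python
from collections import defaultdict


def count_comparisons(points, cell_size=1.0):
    """Count the distance comparisons a grid-bucketed neighbor-pair search would
    perform: each point is compared with points in its own cell and in the
    adjacent cells, and each pair of points is counted once."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))

    comparisons = 0
    for (cx, cy), members in cells.items():
        n = len(members)
        comparisons += n * (n - 1) // 2                      # pairs inside the cell
        # Only "forward" neighbor cells, so each pair of cells is visited once.
        for dx, dy in ((1, 0), (1, 1), (0, 1), (-1, 1)):
            comparisons += n * len(cells.get((cx + dx, cy + dy), ()))
    return comparisons


# Same number of points (100), two very different spatial distributions.
uniform = [(i + 0.5, j + 0.5) for i in range(10) for j in range(10)]   # one point per cell
clustered = [(0.5 + 0.001 * k, 0.5) for k in range(100)]               # all points in one cell

print(count_comparisons(uniform))     # 342
print(count_comparisons(clustered))   # 4950
```

Both inputs have 100 points, so a complexity analysis treats them identically, but the clustered input needs more than ten times as many comparisons. A heterogeneous distribution, like population concentrated in cities, behaves more like the clustered case, which is one reason the spatial and the computational sides have to be considered together.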
This need for scalability is a major difference between a few decades ago, when we were using GIS to do geospatial data analysis, and today, when we need to combine and synthesize knowledge from the spatial and the computational sides to tackle geospatial big data problems. The science of geospatial data is very much centered on how to achieve this scalable synthesis and integration between spatial and computational science. That is a really important emphasis that we will continue to highlight throughout our course: the scalability aspects, from the geospatial side as well as from the computational side, and how you incorporate both together in our applications and in our problem-solving exercises and scenarios. The fact that in this course we're learning CyberGIS and geospatial data science using examples of geospatial big data is a major contrast to some well-understood examples with our common geospatial data, because it lets us address scalability as a desirable goal. With that, this is a good transition to scientific applications and drivers, which we'll learn about as the next topic.