1:47
Let's format the query and see what that actually does.
So you can see, this is very hard to read.
It's all in just one massive line.
It's actually formatted.
It's a little bit easier to read.
So we have, automatically,
your clauses are capitalized.
And one of the great things about this convention again,
even if you write perfect SQL,
your standards might not be another one of your team members standards.
So one of the things that you can enforce is basically say, "Hey,
let's all just align to the common formatter just so
my code is easily as readable as yours is as well."
So a lot of teams have done that.
So we look in the lower right and the query validators says,
"Okay, this is a valid SQL query."
Great. But just because you get
that green checkmark doesn't mean you've written a good code.
So what are some of the things that immediately jump out at
you guys as you're taking a look at this?
So if you've done the lab,
one of the things that you might look at is say, all right,
these aliases col2 is even more ambiguous than what the calculated field was before.
And yes, you can actually group by always as part of the order first.
So this is the first thing that you're selecting, ein,
so you can actually group by what's called
the index of the columns that you have in your select statement.
But if you're mixing the index,
like group by one,
you can either do group by one,
column three to have the ein and the name.
Or you can actually just say,
"Hey, I'm going to make things a little bit more readable.
I will just say, group by the ein and the name."
So let's continue on and see what this actual question that you want to answer is.
Okay. So it's going to be the find duplicate records queries.
Is the one that we already had written from the previous lab.
But this time, we're going to make it actually go from
a very ugly version of it to a functioning high quality coverage.
So find duplicate records.
So we have a nice comment.
And again, standard SQL,
really, you only need in the very first line of your code.
So here is the purpose of the query.
And let's see.
Perfect. That query looks exactly like that.
We need to clean up any ambiguous comments or aliases,
remove defaults, and reorder aggregations created.
Okay, cool. So a couple of things.
In previous labs, you heard me mention the order by default to ascending.
So if you ever see ASC in there,
that's just syntactical sugar that we don't need.
So it defaults as you saw here,
actually the query results,
from a to z or ascending.
And one of the things that is a good best practice,
you've mixing aggregations, your counts,
and then you just return in the ein and the name.
You actually want to have all your aggregations done in the front or done in the end,
just so it's a little bit cleaner for people that are reading it.
So you can have all of your non-aggregated fields show
first and then all of your aggregations are grouped together a little bit later.
So a lot of times in some others SQL code,
you'll see things like this works.
It will even say something like this:
aggregations or calculations if you have 20 things that you're selecting.
And this especially is very true when we start joining data from different tables.
You can actually break up your select statement with
comments for additional readability as well.
But for here, it's pretty small.
So we want the count of eins and we want instead of column two greater than one,
one of the things you can use is in the having,
in the the order by, you can actually use the alias.
But not in the where clause,
which we're going to cover a little bit later.
Okay. So it's looking a little bit better.
So this is almost good to go.
Let's make sure that query actually works.
I'm seeing an error.
So in case your count as has typing in it,
when you start moving columns around,
make sure your last column does not have a trailing comma.
That's one of the banes in existence for SQL developers.
There's actually a big argument or debate of whether or not you should actually
lead with the comma or have the comma be at the end.
So the query actually runs. That's fantastic.
Let's compare that with the solution that we have below.
Perfect. It's got some great comments in here.
Have the EIN-Counts in the counts ein order by name.
Okay. Great. That part is done.
All right. So we have some invalid queries that we need to fix.
And you can only have more than one query running.
Absolutely.
So let's copy and paste the below query.
So I'm going to just go ahead and comment this one out.
Paste in the next, what I'm going to solve.
I'm going to execute it.
And while it's a valid queries,
this is the solution from before.
There's two queries actually here in one.
So if you remember from the lab zero,
where you're actually a "Hello world!"
query, this is very similar.
This, you can actually run by itself,
and this, you can run by itself.
But you can't change both these two queries together
even if you end one of them with the semi-colon,
which is how you terminate that query.
So we need to do one at a time or we just need to comment out one of them.
So we want to block comments in the first query.
Boom, again to review,
that is your command and forward slash on a Mac and
control forward slash on your Linux boxes and Windows.
Great. Make sure that the validator looks good in the last query.
Boom, you have successfully run that query.
Okay, onto the next one.
Let's apply a block comment here again and keep fixing as we go.
All right. So we've got the EIN,
we've got the looks like number of employees from this table,
ordering and by that whereas most employees for us.
When we go and run this even though I know that it is
a validation error, and unexpected brackets.
So this is a very common error if you're inheriting older code from other coworkers,
and that's because in Legacy SQL,
this is how you actually escape the table names,
in a standard SQL, again you're again using that back ticks.
And instead of separating the project name with a colon,
you're doing it with just a dot now.
And you can see that actually turns to be a valid table name,
when the color changed from black to dark blue here.
And you can see let's make sure that actually still runs.
Great. These EIN's have the most employees.
Wow. One EIN is 787,000 employees.
This is incredible.
That's huge.
Almost a million employees,
let me look that one up here.
Well is that doing the work?
Okay.
So we fixed that standard SQL means that we need to actually use
those back ticks and that's exactly what the solution recommends.
All right. So what you do is solve the next one.
So before we do that again one created
time let's make sure we just block everything before this,
and you can either run the query and get the error,
or get the error before hand by clicking on the validator.
So we have a trailing comma before the from clause,
this is an error that you saw earlier in the recording this lab.
So go ahead and remove that and then let's see what we have. All right.
So we had a trailing comma but now we have potentially a wrong field name itself.
Says, "Unrecognized name total revenue,
did you mean the actual field name top revenue?"
Which exactly is what we did mean.
Now if you had something like that's a value right now that will run,
but if you had something strange and you can't
remember what it doesn't give you a hint of what the column is,
one of the neat things that you can actually do,
is you can go into that data set.
In this case it's the 2015 filing or if you don't have it open here,
one of the neat things you can do is hold down the control key or that command key,
which will then highlight all of the tables in your SQL query.
You can click on it, and you can see
the available fields that you can query as part of that table.
So often what I'll do is I'll control find,
I'll take a look at revenue,
scroll down everything's highlighted,
and I'm looking for something that's like total revenue, field for totrevenue.
It was a very long field name,
if you don't want to copy that directly,
one of the things that you can do is you can actually just click on it,
and then put that field directly into your query as well,
and see exactly what this query is doing.
So we're returning revenue expenses, employee count,
and the number of people making over 100,000 from
the 2015 filings and only those where it's a school,
in that order by the ones with the most employees first and limited to 10.
Let's see what we have.
Query returns zero results.
As you playing along at home will go.
That's a valid query,
but maybe there's something going on with your filtering.
So the filters inside of SQL to filter out and rows is done through your where clause.
So other things you might be asking is operates is as cool, code is why.
How do you know is why and not like yes or something like that.
Well one of the easiest ways that you can do this,
you can take my word for,
and the solution is actually it has to
be uppercase Y and then all that will get it to run.
But if you didn't know that then there's no results right there.
One of the great things that you can do,
is you can again hold down that command or control get back to the schema,
and you can click on preview.
And this will give you a preview of
the top few rows in that particular table, and then you can say,
"I'm curious to see operates as a school,
and see if we can find that one operates as a school,
capital N and you can see a lot of these as well.
These these operators capital Y capital N as well.
So you get a flavor for that data,
just by accessing the table right next to this schema,
get the main data about the table which again is the rows and the signs,
but also that preview is extremely useful.
And one of the common things that you see a lot of SQL developers do is that select star,
from a particular table,
which returns all the columns.
This is a faster way to do for the UI,
and a lot of times it's bad practice to select star just because that's
again you're potentially selecting more columns that were returned,
in some of these tables can have upwards of 200 columns,
you're paying for the amount of bytes that are being processed as well.
So be sure to click on that preview button because a lot of that is just
automatically cash behind the scenes by BigQuery.
You don't pay for cash retrievals.
All right so we fixed that.
There were three errors there and to recap,
there's a trailing comma there,
there was a field that wasn't found,
and last but not least,
the filter wasn't really
valid and there was nothing that actually did match on lowercase y.
So another thing that you can do just as additional bonus
is if you didn't know whether it was
uppercase or lowercase because a lot you might be saying,
"Hey you're looking at only a sample of year how do you know
your data is clean you keep saying that all that is dirty data."
What happens if some of them are lowercase y and some uppercase Y,
at which point I would say you're absolutely correct,
don't trust that at all.
One of the things that you can do is you can wrap this
in a string manipulation function and basically say,
"Whatever you are passing in the here,
capital Y or lowercase y,
force it to be a lowercase value of whatever that string is,
which will then match on that Y as well.
So you get additional insurance by using string manipulation functions.
All right query of that value icon let's see what we have next.
So we went over finding the data schema.
This is what the query looks like after the fact.
And it covered the fact that you returned zero,
and that's because again the case sensitivity is now uppercase,
and we're moving on to our next query.
Great. We're going to go in block coming out with this.
Next on the chopping block, what do we have?
Error number four: Missing Quotes from String Literals.
So you're going to want the string literal is you basically looking for something that
can't match the equivalency value without being in what we call escape.
So right now, literally if you put this in here,
this is looking for a state equals the column of New York, right?
Because actually, your column names here are just in no quotes or anything like that.
And you need to make sure that you're actually escaping
this much like you saw with what we did.
Select your first name.
It will basically treat those as columns unless you do what's called escape.
Okay, so plot twist again,
if you execute this because you got no results and thus,
you're following along, you may start to
pick up on the tricks that we're throwing at you.
This again, you can solve with
you're either extremely smart knowledge of the data and your pre-processing,
cleaning it up making sure it's awesome,
or you can do a little bit additional insurance by doing something like this.
Any value that's coming in the data path into this function,
the string nucleation function,
and then turn it to uppercase that will match that New York.
So we have, how many profits are in New York?
So according to this 102,000.
Excellent.
Next, one of my favorites plays the second most common error.
The first most common error is forgetting those commas.
The second most common error is this one: Non-aggregate fields found in the GROUP BY.
So you're really excited you want to count the number of
nonprofits of charities by state,
and then order by which state has the most.
And in this particular case,
you've put on your aggregation function here.
You execute your query and you're perplexed like, "Hey,
I've got navigation function here. What's going on?"
And keep in mind, an aggregation function is going to take all of
those rows and aggregate them or mash them all up into one row.
If you're mashing them all up,
and you have state as an additional column there,
and you're not doing any kind of mashing on state,
then you don't have a mismatch in the amount of rows.
You're going to have one row for the total lot of cherries that you just counted.
But then you're going to have a lot of values like New York and
California just on the other side
and not being able to aggregate state somehow, it's going to be a problem.
It's how you actually aggregate that is basically say, well,
whenever you aggregate that count of charities,
what I want you do is bucket them by the state value as well.
And the long and the short of this is when you
actually write an aggregation function like this,
you do immediately put whatever is not being
aggregated as part of your query and immediately into the group by.
And let's see if we get that
green check mark rates and let's see which states have the most charities.
Right here is our top results.
Again, you have 62 here.
So if you didn't want to go through the page name through all of these,
you could actually add a limit.
So best practices, if you're doing something with a massive order by,
be sure you actually be mindful of
the resources you're using and then just throw in the limit there as well.
So it limits the amount of data that you're processing and that's returned.
So California, Texas, New York,
Florida, Pennsylvania, and US. Great.
And that is what we have as our solution here.
Well, one of the last things we want to cover,
we have type mismatches in operators.
Let's see what we've got.
So, we have a bigger query here.
Let's take a look at what we're doing.
We have key statistics, we have employees that we're counting,
a bunch of really cool fields.
As we mentioned before, you can break them up with these comments.
That makes it very readable which is nice where you
want only the ones that have not ceased operations.
So this is not equals or the way to write that is not equals or use the bang equals.
And they had more than one employee,
or they had employees, period.
So if we actually try to execute that really quickly, boom fail.
Unrecognized name: Y. And again,
we see that unrecognized name.
That means, "Hey, this is a string value."
And well, being before we execute another area comes up:
No matching signature for operator for
argument type INT64 when we're passing it as string.
Look at this.
So we have zero which is actually a numerical,
an integer, and we basically need to
treat the integer as it should be not escaping it in his quotes.
It treats it as a string and then that'll run perfectly.
So strings in integers,
mathematical operations are fine.
So if you wanted to have more than 100 employees,
more than a thousand,
you just don't need quotes up on those.
And these are your terminated nonprofits.
Let's see what we have in 2015.
Oh, this is not terminated.
I was about to say that number is really high.
So it was 300,000.
So let's see the ones that did cease operations.
How many did we lose in 2015?
1,031.
And again, this is the glory of SQL,
it's just quick tweaks and changes to the code and then rerunning it.
This is again a rather small data set by BigQuery standards.
It's only 16 megabytes to process this particular query but the exact same process
that you're using now is what you're going to be applying on
your terabytes and petabytes of data as well.
We fixed the comparison operators' strings in quotes
and numbers are not fantastic.
And let's see, we
had 1,031 which is a different answer than we have here.
How many nonprofits have ceased operations in 2015?
Ceased operations and had employees.
So this was Had employees was greater than or equal to zero.
So zero actually matches in greater than or equal to,
so we need to get rid of that equal sign.
So tricky, tricky.
So if we actually didn't want to have, by definition,
at least one employee,
we could do greater than equal to one or just greater than zero.
So if you want Ceased operations and Had employees period,
we get that 399.
Tricky. So watch those operators,
use the greater than are equals If you want,
and then you make sure you're matching on what you think you should be matching on.
Fantastic. That's the end at the lab.
A quick recap of what you've learned.
You're using block comments as you iterate through and run more and more ad hoc queries.
You can access that table schema very,
very quickly as we cover by holding down command control,
going to the schema,
clicking on the details to see the number of rows,
previewing the data to get the columns.
Again, finding the fields that are available using that just simple find, finding place.
So if you want to look at expenses,
you could just look at expenses through here.
See what's highlighted. Pull those up.
Clicking on the fields that are available.
And overall, just get a familiarity with where things are located.
All right.
Now, we played with a little bit of SQL. Let's keep going.