At this point of the course,
I hope you feel like you've learned quite a lot.
In the project, we're going to give you
the opportunity to put that knowledge into practice.
What we're going to do is work with text files,
this is a common thing that you will
find yourself needing to do when you're writing scripts.
And in particular, we look for differences between two text files.
In the project, you're going to find the first difference between
two text files and display that difference in a nicely formatted way to the user.
Right now, you can imagine how you can easily
generalize this to finding all of the differences in the file.
Once you found the first one,
you can just keep searching from there to find the next one and so on, alright?
But we'll keep it simple here and just find the first one, all right?
But I want you to recognize that this is much more powerful
than just finding differences in text files.
This is about understanding how to process the text in files in arbitrary ways so that
you can do interesting things with data that
lasts beyond the lifetime of just a single execution of your program.
All right? Let's take a look at
the logistics of the project and what we're asking you to do.
Before I start talking about the actual task that you need to do,
let's talk about some preliminaries here,
all right? First, coding style.
It is not good enough just to quickly whip up something that works, right?
We want you to think more carefully about how you write your programs, okay?
And so in this class, you're required to follow a set of
coding style guidelines and these guidelines are graded by the machine grader, okay?
So you need to read these coding style guidelines.
If you took our previous class,
they are the same as they were in that class,
so you are already familiar with them.
If not, please do read them and make sure that you follow them,
they're just simple common sense things about naming
your variables meaningfully and indenting consistently and things like that.
When you do these kinds of things,
your code becomes much easier to read by others when they take a look at it,
and try to help you out or even by yourself when you go look at it later.
This becomes especially important when you start programming in groups,
when you have more than one person programming at the same time,
you have to agree on a consistent set of guidelines and ways that the code should work.
Otherwise, you have a bunch of code that looks
one way interspersed with another bunch of code that looks another way and
it becomes very difficult for someone to deal
with and read and maintain that code. All right?
So it's good to get in the habit of following
good coding style right from the start. Now, testing.
You also need to test your code and part of that is testing it as you go, all right?
There are going to be multiple parts to this assignment
and you want to do them one at a time and get each one working.
The problems build upon each other,
so you want to get the first one working and tested,
make sure that you know what it's doing before you move on to the next one. All right?
This will allow you to much more efficiently arrive at a final solution,
than if you just tried to write everything at once and then you
find complicated interrelated bugs that are difficult to track
down because you can't figure out if it's in the function that you're working
on right now or a function that you wrote a long time ago to do something else.
So, get in the habit right from the start of testing code as you write it, okay?
And write your own simple tests and make sure that
the functions are behaving the way you think they should behave.
We've also provided you a machine grader to
help you evaluate the validity of your programs as well.
And so, you can go to Owltest.
Here is the link right here to
the appropriate Owltest page for this assignment and you can
submit your code at any time and you will get a grade and feedback from Owltest.
And that feedback will help you to understand where your code might be going wrong, okay?
So we encourage you to use Owltest as much as you would like.
This does not submit any grades to Coursera,
this is simply for your own use.
Right? This is for your own feedback to get
an understanding of where you're at right now with the program.
So after you have written a function to solve one of the problems,
I suggest you test it on your own to make sure that you think it works.
Then, you submit it to Owltest,
get us to test it and then see how well you do there and
continue to fix things until everything works as expected,
and then move on to the next problem.
Once you have finished the assignment,
submitting to Owltest is not sufficient.
You do have to go back to Coursera and submit things directly to
the Coursera LTI assignments and then that will register your grade with Coursera.
Now, it's important to understand that you run it through
exactly the same machine grader that you get from Owltest and so,
it's the same machine grader, either way,
just if you run it through this link on Owltest,
it's feedback for your use.
If you go through the Coursera LTI submission,
then it's actually submitting your grade to Coursera,
so you do have a grade in the course.
Now, let's take a look at what you're going to actually have to do for the project.
As I've said, the overall objective is
to find the first difference between two text files.
We're going to break that problem down.
And the first thing you're going to do is find
the first difference between two single line strings, okay?
So you're going to write a function called singleline_diff. All right?
And this is the signature for that function.
It takes two inputs, two different lines,
each will be a string that represents a single line and you're going to return
the index where the first difference between those two lines occurs.
There's a bunch of different cases here, right?
Those lines might be identical.
If so, you're going to return the constant identical.
This is already defined for you to be
minus one in the template and I'll come back to that at the end,
I'll show you the template.
So, if they're the same, you are going to return that constant.
If they're not the same, what do you do?
And what does it mean not to be the same?
Well, if they're the same length,
you're going to simply find the first character,
the first index where there's a character that is different in line one and line two.
If they're different in lengths,
well, it might be okay.
You might still find the first difference before the end of the shorter string.
But if you don't, if the shorter string is a prefix of the longer string,
then we're going to define the first difference to be the index,
right after the end of the shorter string. And this makes sense, right?
There's a character in the longer string and there's no character in the shorter string,
so that is the index of the first difference. All right.
So, you should read this and familiarize yourself carefully
with what we mean by the first difference and how we've defined it.
Like I said, it's a little bit tricky,
when the strings are different lengths but hopefully
it's described well and you understood what I said anyway.
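To make those cases concrete, here is a minimal sketch of one way singleline_diff could be written. It is an illustration, not the reference solution; the spelling IDENTICAL for the constant is an assumption about how the template names it.

```python
IDENTICAL = -1  # assumed spelling of the constant defined in the template


def singleline_diff(line1, line2):
    """Return the index of the first difference, or IDENTICAL if none."""
    shorter_length = min(len(line1), len(line2))
    # Compare character by character over the part both lines share.
    for idx in range(shorter_length):
        if line1[idx] != line2[idx]:
            return idx
    # No difference in the shared part: either the lines are identical,
    # or the shorter line is a prefix of the longer one.
    if len(line1) == len(line2):
        return IDENTICAL
    return shorter_length


print(singleline_diff("abcd", "abef"))  # 2
print(singleline_diff("abc", "abcd"))   # 3 (index right after the shorter line)
print(singleline_diff("abc", "abc"))    # -1
```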
Okay. Next, we're going to present that difference in a nicely formatted way.
So, we're going to write a function called singleline_diff_format,
and what that takes is two strings and an index, right?
Presumably, that index was found by the singleline_diff function that you just wrote,
and you passed it in here and it formats it, as we can see here.
It prints the first line,
then a separator line,
and then the second line, all right?
And it doesn't actually print it,
even though, I just said that. All right.
It returns a string that looks like this and the separator line is the key here.
What it is, is a bunch of equal signs and then a caret,
and the caret points at the first character that is different, right?
So, if we have the line abcd
and the line abef,
well, index two is the problem.
And so, the separator should say equals, equals, caret (==^),
and that caret points at the c in line one,
which is different from the e in line two.
Hopefully, this makes sense. This is much easier
for you as a human to understand. When you see that, you're like, "Oh, yes.
Those strings are different right there," rather than if I told you,
"Hey, they're different at index two."
If that were a large number and these were long lines,
you'd have to start counting characters and could
get confused pretty easily, right?
So, here's the signature.
This function, like I said,
takes two strings and an index, and it returns a string that shows
those differences in a nicely formatted way, okay?
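As a rough sketch of that formatting, the function below builds the three-line string described above. Ending each line, including the last, with a newline is an assumption here, and the sketch does no validation of its inputs, so check the project description for the exact requirements.

```python
def singleline_diff_format(line1, line2, idx):
    """Return line1, a separator pointing at idx, and line2 as one string."""
    # idx equal signs, then a caret under the first differing character.
    separator = "=" * idx + "^"
    # Assumption: every line, including the last, ends with a newline.
    return line1 + "\n" + separator + "\n" + line2 + "\n"


print(singleline_diff_format("abcd", "abef", 2))
# abcd
# ==^
# abef
```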
Now, we're going to extend this in part three here, problem three,
where we're going to find the first difference across multiple lines,
so this is going to be called multiline_diff. All right.
And if we look at the signature for this function,
it takes two lists, all right?
And each item in each list is a single line string.
So, given two lists of single-line strings, we're going to find the first pair
of lines that have a difference and
the index in that pair of lines where the difference occurs, okay?
So, we're going to return a tuple this time that gives
the line number and the character offset of the difference.
And we're going to define everything to be the
same as we did with singleline_diff between the two lines,
the index is the same.
So, if one line is shorter than the other
and the shorter line is a prefix, then again,
the index of the first character
after the end of the shorter line
will be the difference. If they're the same length, that makes sense.
If they're identical, well,
I'm not going to report it from this function,
I'm going to go on and look at the next pair of lines and keep going.
And the line offsets start at zero,
we're computer scientists, that's how we work.
And if you get through the whole thing and
the two lists of strings are exactly identical,
we're going to return the tuple that has two values in it, identical, identical,
so that's going to be minus one, minus one.
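Here is a minimal sketch of how multiline_diff might be put together, with a compact singleline_diff included so the example runs on its own. How to handle lists with different numbers of lines is not covered above; reporting the first extra line at index 0 is an assumption made for this sketch, as is the spelling IDENTICAL.

```python
IDENTICAL = -1  # assumed spelling of the constant defined in the template


def singleline_diff(line1, line2):
    """Return the index of the first difference, or IDENTICAL if none."""
    shorter_length = min(len(line1), len(line2))
    for idx in range(shorter_length):
        if line1[idx] != line2[idx]:
            return idx
    return IDENTICAL if len(line1) == len(line2) else shorter_length


def multiline_diff(lines1, lines2):
    """Return (line number, index) of the first difference in two lists."""
    shorter_count = min(len(lines1), len(lines2))
    # Skip over identical pairs of lines; stop at the first pair that differs.
    for line_num in range(shorter_count):
        idx = singleline_diff(lines1[line_num], lines2[line_num])
        if idx != IDENTICAL:
            return (line_num, idx)
    if len(lines1) == len(lines2):
        return (IDENTICAL, IDENTICAL)
    # Assumption: an extra line in one list differs at its first character.
    return (shorter_count, 0)


print(multiline_diff(["abc", "def"], ["abc", "dxf"]))  # (1, 1)
print(multiline_diff(["abc", "def"], ["abc", "def"]))  # (-1, -1)
```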
Now we said we really want to do this from text files,
not from lists of strings,
so we've got to do some work on files here and so
the next problem is getting lines from a file.
So, you're going to write a function called
get_file_lines and that's going to take as an input
a file name and what it's going to return is a list of single-line strings.
Now, I don't want the newline characters to be in those single-line strings,
otherwise they wouldn't be a single line, right?
They'd be the line plus a carriage return.
That's going to actually screw up some of
the previous functions if I tried to reuse them because
they assume that those carriage returns and new lines are not in the strings.
So, this function does have to strip those off of the lines.
Python is not going to do that for you automatically.
So, I'm going to return a list of lines from the file.
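One possible shape for get_file_lines is sketched below. It is a minimal version that strips trailing newline and carriage return characters from each line and does nothing else.

```python
def get_file_lines(filename):
    """Return the lines of a file as a list of strings without line endings."""
    with open(filename, "rt") as infile:
        # Iterating over a file yields each line with its newline attached,
        # so strip the line ending off explicitly.
        return [line.rstrip("\r\n") for line in infile]
```

The with statement closes the file even if reading fails partway through.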
We are going to assume for this project that
the file definitely exists and you are allowed to read it.
So, if somebody passes you a file name here
that's not valid in the sense that that file doesn't exist or you don't have permissions,
this function can go down in a blaze of glory.
You do not have to write any code to check it.
Whatever happens, happens.
That's OK. And then finally pulling everything together,
problem five, we're going to write a function called file_diff_format.
Let's look down here at its signature.
It's going to take two file names, all right.
And what it does is return a string that is
a nicely formatted representation of where the first difference is.
So if we go back up here,
we see what that should look like.
It's a four-line string.
The second through fourth lines should look familiar at this point.
They're going to be formatted in basically the same way we did
before when we used singleline_diff_format.
That shows exactly where the difference between the two lines is.
But, that's not enough, right?
These are two files and so we don't know which line that's on,
so the first line here is going to tell you which line it's on.
It's going to say, for instance,
at line three, and that's going to end with a colon.
That allows us now to see very easily: if I give you two files,
I get this four-line output that tells me that on line three
there's a first difference, that it's at index two in
the line, and I have a nice pretty output that shows me where it is.
If it turns out that the files are exactly the same,
then instead of returning this four-line string,
we should return a string that just says no differences.
That's also nicely human-readable
and allows you to understand that there are no differences in these two files.
So here's the signature again of file_diff_format.
It takes two inputs, filename1, filename2,
which are the names of two files.
Just as before, you do not need to worry about
whether or not these files exist or you have permissions to read them.
You can assume that everything is fine here and if not,
whatever happens, happens. It does not matter.
You do not need to do any error checking,
making sure these files are there.
Now, once you have this function ready,
you actually have a very powerful program that can
find the first difference between two text files.
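To show how the pieces fit together, here is a self-contained sketch of all five functions. It is not the reference solution. The exact header text ("Line 3:"), the exact "No differences" wording, the trailing newlines, the spelling IDENTICAL, and the handling of files with different numbers of lines are all assumptions; the project description is the authority on those details.

```python
IDENTICAL = -1  # assumed spelling of the constant defined in the template


def singleline_diff(line1, line2):
    """Return the index of the first difference, or IDENTICAL if none."""
    shorter_length = min(len(line1), len(line2))
    for idx in range(shorter_length):
        if line1[idx] != line2[idx]:
            return idx
    return IDENTICAL if len(line1) == len(line2) else shorter_length


def singleline_diff_format(line1, line2, idx):
    """Return line1, a separator pointing at idx, and line2 as one string."""
    return line1 + "\n" + "=" * idx + "^" + "\n" + line2 + "\n"


def multiline_diff(lines1, lines2):
    """Return (line number, index) of the first difference in two lists."""
    shorter_count = min(len(lines1), len(lines2))
    for line_num in range(shorter_count):
        idx = singleline_diff(lines1[line_num], lines2[line_num])
        if idx != IDENTICAL:
            return (line_num, idx)
    if len(lines1) == len(lines2):
        return (IDENTICAL, IDENTICAL)
    # Assumption: an extra line in one list differs at its first character.
    return (shorter_count, 0)


def get_file_lines(filename):
    """Return the lines of a file as a list of strings without line endings."""
    with open(filename, "rt") as infile:
        return [line.rstrip("\r\n") for line in infile]


def file_diff_format(filename1, filename2):
    """Return a formatted string showing the first difference between files."""
    lines1 = get_file_lines(filename1)
    lines2 = get_file_lines(filename2)
    line_num, idx = multiline_diff(lines1, lines2)
    if line_num == IDENTICAL:
        return "No differences\n"  # assumed exact wording
    # If one file ran out of lines, treat its missing line as empty.
    line1 = lines1[line_num] if line_num < len(lines1) else ""
    line2 = lines2[line_num] if line_num < len(lines2) else ""
    header = "Line {}:\n".format(line_num)  # assumed header wording
    return header + singleline_diff_format(line1, line2, idx)
```

Each function only calls the ones defined before it, which is why working through the problems in order pays off.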
Now, I want to return to the template.
I don't want you just starting from an empty file.
Instead, we provided you a template that has a skeleton of this project,
and here is the link up here in the description of the project.
If you click on that, you're going to get this template file.
And, this is basically the skeleton of the code that you need to write.
So, when you come here,
you can see that the docstrings are already written for you,
the function signatures are already there,
you just have to write the code.
You also can see that the constant identical is
defined up here as minus one and you should be using that.
So make sure that you download this template and start from here.
At this point, I hope you're excited to start working on the project.
This is going to give you the opportunity to put
everything you've learned in the course into practice
and to put it all together to make a pretty significant program here.
Now remember, work on things in order.
Work on things slowly and methodically.
We put the functions in the order that we
did for a reason because they build upon each other and that's
useful to be able to use the previous functions that
you've written as you build up the more complicated functions.
Also, read.
Before you start,
read through the entire project description.
There's plenty of information here that's valuable
and important as you work through things.
And, as you work on each problem,
make sure that you read the hints.
The hints are there for you.
Make sure that you're thinking about these problems in a constructive way.
Don't panic.
Work slowly and methodically.
Make sure that you read everything and I'm
sure that you're going to be able to get this project.
And hopefully, the sense of accomplishment you feel when it's
over is well worth the time that you're going to spend on it.
All right. Good luck.