Lecture 6: Version control
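In an age of information overload, perhaps the best way to tell real specialists apart is how deeply they grasp and understand the concepts?
I'm reminded of a joke my teacher told in our data structures class:
In an interview, the interviewer asks: how would you rate your C++?
The candidate answers: I have mastered C++.
Then can you explain the concept of polymorphism?
…
Do you know how to use git?
Sure, it's just push/pull/fetch/clone/branch/checkout.
Is that really all there is to it?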
Overview
Setting aside the obvious benefit of keeping track of versions, version control has another benefit:
Why is version control useful?
Modern VCSs also let you easily (and often automatically) answer questions like:
- Who wrote this module?
- When was this particular line of this particular file edited? By whom? Why was it edited?
- Over the last 1000 revisions, when/why did a particular unit test stop working?
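I have studied git many times. As I wrote in another post of mine, 植树节记, this is my fifth time learning git, the command-line version control tool, but the first time I am learning it bottom-up.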
Git has an ugly, leaky, high-level interface.
Git records a series of snapshots and a bunch of metadata.
An ugly interface has to be memorized, but a beautiful design can be understood.
Git’s data model
Git's data structures closely resemble a tree, so they can be understood as a recursive data model.
In Git's terminology, a directory/folder is a tree, and a file is a blob.
For example, we might have a hierarchy like this:
<root> (tree)
What git then does is take a snapshot of this file hierarchy.
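Modeling history: relating snapshots
Since we are talking about one version succeeding another, it is natural to picture a linear history, but git's history is not simply linear.
After committing several times, i.e. recording several snapshots, we can visualize a commit history like this:
o <-- o <-- o <-- o
Without annotations, the diagram looks like this:
o <-- o <-- o <-- o
Note that adding an extra feature or fixing bugs can proceed in parallel with mainline development.
Once the extra feature (in this example) is finished, I naturally want to merge it, because future versions should contain both the current mainline functionality and the extra feature I developed.
So eventually we use git merge, which combines the changes from two parallel branches of development.
After git merge, when we look at our commit history (snapshot history) again, it may look like this:
o <-- o <-- o <-- o <---- o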
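The branching and merging described above can be sketched in a few lines of Python. This is only a toy model, not how git stores anything: the `Commit` class, the `ancestors` helper, and the commit names are all hypothetical. The point is that every commit points back at its parents, and a merge commit simply has two of them.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Commit:
    """Hypothetical stand-in for a snapshot; only the parent links matter here."""
    name: str
    parents: list[Commit] = field(default_factory=list)


def ancestors(commit: Commit) -> set[str]:
    """Collect the names of every commit reachable through parent links."""
    seen: set[str] = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current.name in seen:
            continue
        seen.add(current.name)
        stack.extend(current.parents)
    return seen


# Linear history: a <-- b
a = Commit("a")
b = Commit("b", [a])
# Two parallel lines of work that both start from b
feature = Commit("feature", [b])
bugfix = Commit("bugfix", [b])
# A merge commit has two parents, one per branch
merged = Commit("merged", [feature, bugfix])

print(sorted(ancestors(merged)))  # ['a', 'b', 'bugfix', 'feature', 'merged']
```

Because arrows only ever point from a commit to its parents, the history forms a directed acyclic graph rather than a straight line.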
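Git's underlying data structures in pseudocode
// a file is a bunch of bytes
You can think of blob, tree, and commit (snapshot) as objects in the object-oriented sense. Each commit therefore does not contain the actual entities; it holds pointers to them, which are looked up by ID.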
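To make the three kinds of objects concrete, here is a rough Python rendering of the data model. Treat it as a sketch under assumptions: the class names, the author `alice`, and the file contents are made up for illustration, and the objects are nested directly in memory, whereas git links them by ID.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union

# A file is a bunch of bytes.
Blob = bytes


@dataclass
class Tree:
    """A directory: maps names to files (blobs) or sub-directories (trees)."""
    entries: dict[str, Union[Tree, Blob]] = field(default_factory=dict)


@dataclass
class Commit:
    """A snapshot plus metadata and links to the commits it was built on."""
    parents: list[Commit]
    author: str
    message: str
    snapshot: Tree


# Illustrative hierarchy: a root tree holding one sub-tree and one file.
root = Tree({
    "foo": Tree({"bar.txt": b"hello world"}),
    "baz.txt": b"git is wonderful",
})
first = Commit(parents=[], author="alice", message="initial commit", snapshot=root)

print(first.snapshot.entries["foo"].entries["bar.txt"])  # b'hello world'
```

The recursion shows up in `Tree`: a tree can contain other trees, which is what lets one top-level tree describe a whole directory hierarchy.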
Objects and content-addressing
I know you're going to ask: what is an ID?
In git's storage model, every object (a.k.a. the blobs, trees, and commits above) is content-addressed by its SHA-1 hash.
What does this SHA-1 hash do?
It takes in a big bunch of data and returns a short string (40 hexadecimal characters), which can be interpreted as a name or address for that data.
As I said earlier, a git commit does not store the entities themselves but IDs produced by the hash function. These IDs act as pointers for fetching the objects; a commit's own ID is the commit ID, or informally the version number.
So what git maintains on disk is like this:
objects = map<string, object>
Blobs, trees, and commits are unified in this way: they are all objects.
When they reference other objects, they don’t actually contain them in their on-disk representation, but have a reference to them by their hash.
Git’s content data store is a data store where all objects are addressed by their hash.
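Git identifies each object by its SHA-1 hash.
What is a hash here? It simply gives our content a name, and once things have names, looking them up is much easier.
Say a roster has five names on it, and I change the roster so it has six. When I save the new roster, I only need to reference the first five people and add the sixth person and their name. I don't have to go to the trouble of storing those five again, because they were already stored for the earlier roster. That does not mean the five don't have to be stored on disk at all OwO!
In this analogy the rosters are snapshots, the people are objects, and the names are the references / commit IDs / names produced by SHA-1. The hash is just the helper that comes up with names for us.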
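The roster analogy can be played out with a tiny content-addressed store. This is a minimal sketch, assuming an in-memory dictionary in place of the disk and hashing the raw bytes directly; the names `store`, `load`, and the people on the roster are made up for illustration.

```python
import hashlib

# In-memory stand-in for what git keeps on disk: ID -> object bytes.
objects: dict[str, bytes] = {}


def store(data: bytes) -> str:
    """Save data under a name derived from its own content."""
    object_id = hashlib.sha1(data).hexdigest()
    objects[object_id] = data
    return object_id


def load(object_id: str) -> bytes:
    """Look an object up by its ID."""
    return objects[object_id]


# First roster: five people, each stored once, plus the roster listing their IDs.
five = [store(name.encode()) for name in ["Ann", "Bob", "Cat", "Dan", "Eve"]]
roster_v1 = store("\n".join(five).encode())

# Second roster: reuses the five existing IDs and adds only one new person.
sixth = store(b"Fay")
roster_v2 = store("\n".join(five + [sixth]).encode())

print(len(objects))  # 8: six people and two rosters
```

Storing `b"Ann"` a second time returns the same ID and adds nothing new, which is why unchanged files cost no extra space from one snapshot to the next.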
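Let's walk through an example to see this underlying implementation in git:
~$ git log
In this example structure, the tree contains 3 pointers to its contents (i.e. blobs).
We can look at the contents addressed by the hash corresponding to baz.txt with git cat-file -p.
That is, we open up what the tree points to and look inside:
~$ git cat-file -p 06cd65f026bb6579d65bd33bf425def948d10f6b
And if we open up the blob itself, we see exactly the content we wrote to the file!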
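Where does a blob's ID come from? As far as I know, in a SHA-1 repository git hashes a small header (`blob`, the content length, and a NUL byte) followed by the file content. The helper name `git_blob_id` below is my own; the sketch only covers blobs, not trees or commits.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID a SHA-1 git repository assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty file always gets this well-known ID.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The ID depends on the content alone, so renaming a file does not change its blob ID; the name lives in the tree that points at the blob.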
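References
Ha, the names that our helper SHA-1 comes up with are far too long. So we take those names and add one more layer on top, the reference, which makes them not just machine-friendly but human-readable and human-friendly too!
Earlier I loosely used the word "reference" for the names generated by SHA-1. From here on, to avoid confusion, I reserve "reference" for what git's terminology calls references, a.k.a. human-readable names for SHA-1 hashes.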
Unlike objects, which are immutable, references are mutable (can be updated to point to a new commit).
The pseudocode for the reference mechanism is as follows:
references = map<string, string>
Git can use human-readable names like “master” to refer to a particular snapshot in the history, instead of a long hexadecimal string.
There is one more detail to be noticed:
In Git, that “where we currently are” is a special reference called “HEAD”.
master
A reference to the main branch of development, usually the most up-to-date version of your project.
HEAD
A special reference to where you are currently working.
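The reference mechanism is small enough to sketch in Python. Everything here is illustrative: `update_reference` and `resolve` are my own names, `c1` and `c2` stand in for real 40-character commit IDs, and HEAD is stored as a plain entry, which simplifies how git tracks it.

```python
# name -> commit ID; unlike objects, these entries can be overwritten.
references: dict[str, str] = {}


def update_reference(name: str, commit_id: str) -> None:
    """Point a human-readable name at a commit."""
    references[name] = commit_id


def resolve(name_or_id: str) -> str:
    """Accept either a human-readable name or a raw ID and return the ID."""
    return references.get(name_or_id, name_or_id)


update_reference("master", "c1")
update_reference("HEAD", "c1")
# A new commit c2 is made: both names move forward.
update_reference("master", "c2")
update_reference("HEAD", "c2")
# Moving back in history only changes where HEAD points.
update_reference("HEAD", "c1")

print(resolve("master"), resolve("HEAD"), resolve("c1"))  # c2 c1 c1
```

The objects never change; only the small name-to-ID map does, which is what makes references mutable while the history stays immutable.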
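GitHub vs. Git
To be honest, it was not until I was about to start the final project for Introduction to Data Science and Engineering that I could more or less tell GitHub and Git apart.
Git is essentially like any other tool sitting in your command line: its version control happens on your local machine.
In other words, forget the add / commit / push three-step routine that made you think you knew Git Orz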
Repositories
On disk, all git stores is objects and references.
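So, in summary, the various git commands we type at the command line
map to manipulations of the commit DAG. These manipulations are usually carried out by adding objects (i.e. taking a new snapshot / making a new commit / deleting … all the changes in your current working directory) and by adding/updating references.
[^1]: Sometimes, rather than saving the entire working state as a snapshot, git gives users a lot of flexibility, such as choosing what changes to include in the next snapshot. This leads to the staging area, discussed next.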
Staging area
What kind of working scenario did git design the staging area for?
For example, imagine a scenario where you’ve implemented two separate features, and you want to create two separate commits, where the first introduces the first feature, and the next introduces the second feature. Or imagine a scenario where you have debugging print statements added all over your code, along with a bugfix; you want to commit the bugfix while discarding all the print statements.
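The passage above, taken from the lecture notes, is long enough to skip. In plain words: you have made many changes, but you don't want all of them captured in the snapshot. Surely you wouldn't take a photo every time you add a single character? That is also why accidental burst shots on an iPhone are always so annoying.
Don't those of us who "know" Git often reduce git to git add and git commit? The "three areas" concept sounds awkward in Chinese translation, but the git add step simply gets ready whatever you might want to put into the snapshot. It is the same idea as the "are you sure?" prompt that appears whenever you click something important. Otherwise, why not just commit directly?
So the git add step puts the changes you made into the staging area, and what is in the staging area is what actually gets stored in the snapshot.
The staging area is where you tell git which changes should be included in the next snapshot.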
Git allows you to specify which modifications should be included in the next snapshot through a mechanism called the “staging area”.
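The bugfix-versus-debug-prints scenario can be mimicked with a toy model of the staging area. This is a sketch under assumptions, not git's index format: `working_dir`, `staging`, `add`, and `commit` are hypothetical names, and a snapshot is reduced to a dictionary from path to blob ID.

```python
import hashlib

objects: dict[str, bytes] = {}      # content-addressed store
working_dir: dict[str, bytes] = {}  # files as they currently are on disk
staging: dict[str, str] = {}        # path -> blob ID chosen for the next snapshot


def store(data: bytes) -> str:
    """Save data under a name derived from its content."""
    object_id = hashlib.sha1(data).hexdigest()
    objects[object_id] = data
    return object_id


def add(path: str) -> None:
    """Roughly what git add does in this model: stage one file's current content."""
    staging[path] = store(working_dir[path])


def commit(message: str) -> dict:
    """Snapshot only what was staged; unstaged edits stay out."""
    return {"message": message, "snapshot": dict(staging)}


working_dir["fix.py"] = b"return a + b\n"
working_dir["debug.py"] = b"print('here')\n"

add("fix.py")  # stage the bugfix only
snapshot = commit("fix addition")

print(sorted(snapshot["snapshot"]))  # ['fix.py']
```

The debug file is still sitting in the working directory, but since it was never staged it does not appear in the snapshot.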
Git command-line interface
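The lecture notes tell us to go read Pro Git, but there are already too many books to read, haha.
So below I simply follow the order of my notes from the video.
Hands-on practice is better anyway, and Git cheatsheets are everywhere.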
Basics
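git log
git log shows the log, that is, your commit history.
Oh ho, now that there is history to look at, we can find out who committed, who changed what, who broke things, and who did what.
To make it easier to read:
git log --all --graph --decorate
Visualizes your commit history as a DAG (the kind of diagram drawn earlier in this post).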
git checkout
Move around in your version history.
For the rest, see Version control (2).