Tag: git

Using Git Submodules

Contents

Using Git Submodules

What Are Submodules?

Before You Start

Recursion

Adding Submodules

Performing Operations on All Submodules

Cloning a Repository That Contains Submodules

Making Changes to a Submodule

Pulling Changes Made to Submodules

Including Submodules in ADO Pipelines

Git Submodule Quick Reference Guide

What Are Submodules?

A submodule is a link – it is a pointer at a specific path in the parent repository’s directory structure that references a specific commit in another git repository. The actual git repository is no different than any other repository – what makes it a ‘submodule’ is someone using it inside a parent repository. If I create a really cool PHP function – an Excel parsing utility – I can publish it to a git repository to share it. This repository is not a submodule for me. Someone who wants to use my utility may add my repository as a submodule in their code repo. In their repo, my repository that houses the Excel parsing utility is a submodule. You cannot tell, by looking at a repository, if it is included elsewhere as a submodule.

The example provided above – one individual looking to include someone else’s code in their project – is the typical use case for submodules. But there are other scenarios where isolating different parts of code make sense. In our organization, we cannot deploy code until it passes a security scan. This means problems in one section of the code can hold up development on everything else. Additionally, our code scanning is charged per repository. While we may have a dozen independent tools that happen to reside in a parent folder, it is not cost effective to have dozens of repositories being scanned. Using submodules will allow us to roll up independent tools into a single scanned repository.

The parent repository does not actually track the changes in a submodule. Instead, the parent contains a pointer to a commit identifier in the submodule repository. All changes in the submodule are tracked in its repository.

Before You Start

Before working with submodules, ensure you have an up to date version of git. If, as you interact with remote repositories, you see an error indicating you are using an older version of git:

You are apt to see fatal errors like this:

Recursion

You can have submodules inside of submodules inside of submodules – because of this, most command that interact with submodules have a –recursive flag that will recursively apply the command to any submodules contained within submodules within the parent repository.

Adding Submodules

Begin with a git repository that will serve as the parent – in this example, I am starting with a blank repo that only contains a .gitignore and a readme.md file, but you can start with a repository that’s been under development for years.

We link in submodules using “git submodule add” followed by the repo URL. If you want the folder to have a different name, use “git submodule add URLHERE folderNameForSubmodule”

Once the Tool1 submodule has been added, files from Tool1 are on disk.

Check out a branch in the submodule, or use “git submodule foreach” to check each submodule out to the desired branch (main in this case)

If you use git diff with the –submodule switch, you can see that “Tool1” is being tracked as a submodule.

Add and commit the change – since this is both an entry in the .gitmodules file and the submodule reference to a commit hash, I am using “git add *”

Then push the changes to ADO

Looking in ADO, you will see each submodule folder is represented as a file. The file contents is the hash of the commit to which the submodule points.

Because the “folder” is reflected as a file in the repository, it will not sort with the other folders. You’ll need to look in the file listing to find the submodule file.

Performing Operations on All Submodules

The “git submodule foreach” command allows you to run commands on each submodule in a repository.

You can run any git command in this fromeach loop – as an example, I will checkout a specific branch in all submodules using

git submodule foreach –recursive git checkout main

Cloning a Repository That Contains Submodules

Cloning with –recurse-submodules flag

When cloning a repository with submodules, add –recurse-submodules to your git clone command to pull all of the submodules. To clone my ParentRepo, I use:

git clone –recurse-submodules https://Windstream@dev.azure.com/e0082643/e0082643/_git/ParentRepo

You will see that git checks out the parent project and all included submodules.

Use “git submodule foreach –recursive git checkout main” to check each submodule out to the desired branch (in this case, main).

Cloning without –recurse-submodules

If you forget to include –recurse-submodules in your clone command, or if the submodules have been added to a repository that you have already cloned, you will have empty folders instead of each submodule’s files. A parent repository only has pointers to commits for submodules – until the submodules are initialized, those pointers go nowhere.

Use git submodule update –init –recursive to pull files from the registered submodules.

Making Changes to a Submodule

Working within a submodule’s folder will perform operations in the submodule’s git repository. Working in other folders will perform operations in the parent’s git repository.

To make a change to Tool3, I first need to change into the Tool3 directory.

Make whatever changes are needed. Then add the changes, create a commit, and push the commit to the submodule’s repository. You can create branches, merge branches, and such all within the submodule’s folder to interact with the submodule’s git repository.

If you look at the parent repository, you will see that no changes have been made to it – committing changes to a submodule repository will not kick off the pipeline code scan.

To update the parent repository to point the submodule to your most recent commit, you will need to commit the submodule folder in the parent folder. Change directory into the parent repo – add, commit, and push the changes.

And push the references to the remote using “git submodule update –recursive –remote”

This commit updates the contents of the file representing the repo’s folder – the folder’s file used to reference a commit ending in 6d95 and now it references a commit ending in 6292

Which you can see in the git log of the submodule repository:

Because we have made changes to the parent repository, the pipeline that initiates our code scanning should execute.

Pulling Changes Made to Submodules

To sync changes made by others (or that you’ve made in other locations), you will need to pull the submodules, you need to pull the submodule as well as the parent repo.

To pull (fetch and merge) changes from all upstream submodules, use:

git submodule update –recursive –remote –merge

Using “git submodule status” to view the submodules, you can see the submodule now points to the commit hash from the change we made

Including Submodules in ADO Pipelines

Veracode scanning is initiated through a pipeline. Since our goal is to include files from all of the submodules when scanning the parent repository, we need to ensure those files get bundled into the zip file that is submitted for scanning.

To do so, we need to add a checkout step to the pipeline and set “submodules” to ‘true’

When the submodules are all part of the same ADO project, you do not need to supply additional credentials in the pipeline. Where a different set of credentials are required, you can check out the submodule by passing an extra header with an authorization token.

Git Submodule Quick Reference Guide

Clone repo and all submodules:

git clone –recurse-submodules REPO_URL

cd RepoFolder

git submodule update –init –recursive

git submodule foreach –recursive git checkout main

Add a submodule to a repo:

git submodule add REPO_URL /path/to/folderForSubmodule

git submodule update –init –recursive

git submodule foreach –recursive git checkout main

git add .

git commit -m “Adding submodules”

git push

git submodule update –recursive –remote

Check Out a Branch in All Submodules

git submodule foreach –recursive git checkout main

Committing Change to a Submodule

cd .\submodule_folder

# Make some changes!

git add .

git commit –author=”Lisa Rushworth <Lisa.Rushworth@windstream.com>” -m “Commit Message”

git push

cd ..

git add submodule_folder

git commit -m “Updated submodule”

git push

Pull Updates into All Submodules:

git submodule update –recursive –remote –merge

 

Updating git on Windows

There’s a convenient command to update the Windows git binary

git update-git-for-windows

Which is useful since ADO likes to complain about old git clients —

remote: Storing packfile... done (51 ms)
remote: Storing index... done (46 ms)
remote: We noticed you're using an older version of Git. For the best experience, upgrade to a newer version.

Git – Removing Confidential Info From History

The first cut of code may contain … not best practice code. Sometimes this is just hard coding something you’ll want to compute / look up in the future. Hard coding user input isn’t a problem if my first cut is always searching for ABC123. Hard coding the system creds? Not good. You sort that before you actually deploy the code. But some old iteration of the file has MyP2s5w0rD sitting right there in plain text. That’s bad in a system that maintains file history! The quick/easy way to clean up passwords stashed within the history is to download the BFG JAR file.

For this test, I created a new repository in .\source then created three clones of the repo (.\clone1, .\clone2, and .\clone3). In each cloneX copy, I created a tX folder that has a file named ldapAuthTest.py — a file that contains a statically assigned password as

strSystemAccountPass = "MyP2s5w0rD"

The first thing I did was to redact the password in the files — this means anyone looking at HEAD won’t see the password. Source, clone1, and clone2 are all current. The clone3 copy has pulled all changes but has a local change committed but not merged.

To clean the password from the git history, first create a backup of your repo (just in case!). Then mirror the repo to work on it

mkdir mirror
cd mirror
git clone --mirror d:\git\testFilterBranch\source

 

Create file .\replacements.txt with the string to be redacted — in this case:

strSystemAccountPass = "MyP2s5w0rD"

Formatting notes for replacements.txt

MyP2s5w0rD # Replaces string with default ***REMOVED***
MyP2s4w0rD==>REDACTED # Replaces string using custom string REDACTED
MyP2s3w0rD==> # Replaces string with null -- i.e. removes the string
regex:strSystemAccountPass\s?=\s?"MyP2s2w0rD""==>REDACTED # Uses a regex match -- in this case we may or may not have a space around the equal sign

So, in my mirror folder, I have the replacement.txt file which defines which strings are replaced. I have a folder that contains the mirror of my repo.

lisa@FLEX3 /cygdrive/d/git/testFilterBranch/mirror
$ ls
replacements.txt source.git

To replace my “stuff”, run bfg using the –replace-text option. Because I only want to replace the text in files named ldapAuthTest.py, I also added the -fi option

java -jar ../bfg-1.14.0.jar --replace-text ..\replacements.txt -fi ldapAuthTest.py source.git

 

lisa@FLEX3 /cygdrive/d/git/testFilterBranch/mirror
$ java -jar ../bfg-1.14.0.jar --replace-text replacements.txt -fi ldapAuthTest.py source.git

Using repo : D:\git\testFilterBranch\mirror\source.git

Found 3 objects to protect
Found 2 commit-pointing refs : HEAD, refs/heads/master

Protected commits
-----------------
These are your protected commits, and so their contents will NOT be altered:
* commit 87f1b398 (protected by 'HEAD')

Cleaning
--------
Found 5 commits
Cleaning commits: 100% (5/5)
Cleaning commits completed in 613 ms.

Updating 1 Ref
--------------

Ref Before After
---------------------------------------
refs/heads/master | 87f1b398 | 919c8f0f

Updating references: 100% (1/1)
...Ref update completed in 151 ms.

Commit Tree-Dirt History
------------------------

Earliest Latest
| |
. D D D m

D = dirty commits (file tree fixed)
m = modified commits (commit message or parents changed)
. = clean commits (no changes to file tree)

Before After
-------------------------------------------
First modified commit | dc2cd935 | 8764f6f1
Last dirty commit | 9665c4e0 | ccdf0359

Changed files
-------------

Filename Before & After
-------------------------------------
ldapAuthTest.py | 25e79fa6 ? 4d12fdad

In total, 8 object ids were changed. Full details are logged here:
D:\git\testFilterBranch\mirror\source.git.bfg-report\2021-06-23\12-50-00

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive

Check to make sure nothing looks abjectly wrong. Assuming the repo is sound, we’re ready to clean up and push these changes.

cd source.git

git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

 

Pulling the update from my source repo, I have merge conflicts

These are readily resolved and the source repo can be merged into my local copy.

And the change I had committed but not pushed is still there.

Pushing that change produces no errors

Now … pushing the bfg changes may not work. In my case, the real repo has a bunch of branchs and I am getting “non fast-forward merges”. To get the history changed, I need to do a force push. Not so good for the other developers! In that case, everyone should get their changes committed and pushed. The servers should be checked to ensure they are up to date. Then the force push can be done and everyone can pull the new “good” data (which, really, shouldn’t differ from the old data … it’s just the history that is being tweaked).

List Extensions Within Folder

It didn’t occur to me that Apache serves everything under a folder and the .git folder may well be under a folder (you can have your project up a level so there’s a single folder at the root of the project & that folder is DocumentRoot for the web site). Without knowing specific file names, you cannot get anything since directory browsing is disabled. But git has a well-known structure so browsing to /.git/index or really scary for someone who stuffs their password in the repo URL /.git/config is there and Apache happily serves it unless you’ve provided instructions otherwise.

A coworker brought up the intriguing idea of, instead of blocking the .git folder so things subordinate to .git are never served, having a specific list of known good extensions the web server was willing to serve. Which … ironically was one of the things I really didn’t like about IIS. Kind of like the extra frustration of driving behind someone who is going the speed limit. Frustrating because I want to go faster, extra frustrating because they aren’t actually wrong.

But configuring a list of good-to-serve extensions means you’ve got to get a handle on what extensions are on your server in the first place. This command provides a list of extensions and a count per extension (so you can easily identify one-offs that may not be needed):

find /path/to/search/ -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort | uniq -c

 

Tar Excluding Git Folders

You can, of course, use –exclude and avoid adding the .git folders to your tar archive, but I discovered a really cool option that excludes the folders created by a whole host of version control systems:

--exclude-vcs

Which, as of version 1.32, means excluding CVS, RCS, SCCS, git, SVN, Arch, Bazaar, Mercurial, and Darcs as follows:

  • CVS/ — recursive
  • RCS/ — recursive
  • SCCS/ — recursive
  • .git/ — recursive
  • .gitignore
  • .gitmodules
  • .gitattributes
  • .cvsignore
  • .svn/ — recursive
  • .arch-ids/ — recursive
  • {arch}/ — recursive
  • =RELEASE-ID
  • =meta-update
  • =update
  • .bzr
  • .bzrignore
  • .bzrtags
  • .hg
  • .hgignore
  • .hgrags
  • _darcs

Git — Show Dates For Branches

We wanted to see all of the branches in a repository with the dates of the latest commit:

git for-each-ref --sort=committerdate refs/heads/ --format='%(committerdate:rfc-local) %(refname:short)'

This outputs the full date/time (you can use any of the git log date formats [relative|local|default|iso|rfc|short|raw] after committerdate; short for the date without time)

 

Git: Using Soft Reset To Clean Up Un-pushed Commits

I missed a file when I was cleaning up debugging lines. I made the change and included it in a second commit, but I’d rather not have two commits for the same purpose. I hadn’t pushed my changes yet, so these commits only exist on my workstation … which means I can reset and bundle the changes into a single commit.

Find commit number that is one before the duplicate debug logging cleanup — this is the point to which you want to reset. In my case, it is the commit start with b443348c

Reset there with “–soft” — this doesn’t change anything on the file system (i.e. I don’t have to clean up those debug lines again) but puts the changes back into the staging area.

Now those files are staged again, so I can make a single commit for removing debug logging from my code.

Voila! I can push these changes and not clutter our history with my error.

 

Preventing erronious use of the master branch on development servers

One of the web servers at work uses a refspec in the “git pull” command to map the remote development branch to the local remote-tracking master branch. This is fairly confusing (and it looks like the dev server is using the master branch unless you dig into how the pull is performed), but I can see how this prevents someone from accidentally typing something like “git checkout master” and really messing up the development environment. I can also see a dozen ways someone can issue what is a completely reasonable git command 99% of the time and really mess up the development environment.

While it is simple enough to just checkout the development branch, doing so does open us up to the possibility that someone will erroneously  deliver the production code to the development server and halt all testing. While you cannot create shell aliases for multi-word commands (or, more accurately, alias expansion is performed for the first word of a simple command is checked to see if it has an alias … so you’ll never get the multi-word command), you can define a function to intercept git commands and avoid running unwanted commands:

function git() { 
     case $* in 
         "checkout master" ) command echo "This is a dev server, do not checkout the master branch!" ;; 
         "pull origin master" ) command echo "This is a dev server, do not pull the master branch" ;; 
         * ) command git "$@" ;; 
     esac
}

Or define the desired commands and avoid running any others:

function git(){
     if echo "$@" | grep -Eq '^checkout uat$'; then
          command git $@
     elif echo "$@" | grep -Eq '^pull .+ uat$'; then
          command git $@
     else
          echo "The command $@ needs to be whitelisted before it can be run"
     fi
}

Either approach mitigates the risk of someone incorrectly using the master branch on the development server.

Reverting a Single File with Git

Git revert is great for resetting the entire project to a particular state – I went down a bad path, really don’t want to do this, and resetting to the state I was in this morning is exactly what I want to do. Sometimes, though … that’s not the case. I added a couple of debugging lines to a file that I don’t really need. Or I’ve gone down a bad path here but have good work in a few other files too. In those cases, you can revert a single file to the latest committed version. Run “git status” and “git diff” to confirm that it is an uncommited change.

To revert a single file to its latest committed state, use “git checkout – filename” – you can see the added line has disappeared.