Relinking 30k GitLab commits to JIRA

22.11.2018 | Arne Babenhauserheide

We switched from Subversion to Git and could finally get rid of the svn plugin for JIRA. Thanks to that we could update Jira (without paying for the suddenly expensive plugin)!

\o/ \o/ \o/

But now our GitLab-based linking plugin could only link to new commits, not to old ones.

O_o o_O

So we decided to relink old commits manually. This post will be technical, because I want to share not only the solution but the relevant part of the working process. Also writing code is what I do. The resulting code is available on GitHub.

Preparation
Preparing a quick test runner
Get GitLab URL
Extract a task list with issues and commits to link
Test performance
Authenticate to Jira securely
Relink Gitlab and Jira
Check the speed again

Preparation

We need a tool which can get a mapping from issues to commits and their description out of Git and upload it into our JIRA. Time to revisit those Python skills and Mercurial knowledge!

.oO(???)Oo.

Why Mercurial? Let me explain. The Git commandline is famous for its caveats, and I needed a robust and simple way to access the Git history. Preferably from Python with its convenient string operations. So let’s run associative memory: Python, Git, Mercurial … hg-git? Too high-level. Monty Python, cocktail party with the Gits … Dulwich! The library used in hg-git for high performance Git operations from Python.

This can provide the information from Git. Next step: Connect to JIRA, surely there’s a library for that. We’re using Python, sure there is (popularity does have its perks)!

.oO(10⁶ flies might be wrong but they build the 10³ tools you need)Oo.

pip3 install --user dulwich
pip3 install --user jira
pip3 install --user pycrypto
pip3 install --user gpg

Problem solved, we’re done! … well, almost done.

Let’s put it all together.

Preparing a quick test runner

To work efficiently in a new Python project, I typically start with a test runner: When called with --test, the script runs its doctests. Also I add a shortcut in Emacs to run the tests.

The Python side:

parser = argparse.ArgumentParser()
parser.add_argument("--test", action="store_true",
            help="Run tests")
# …
def _test(args):
    from doctest import testmod
    tests = testmod()
    if not tests.failed:
        return "^_^ ({})".format(tests.attempted)
    else: return ":( "*tests.failed
# …
if __name__ == "__main__":
    args = parser.parse_args()
    if args.test:
        print(_test(args))

The Emacs side:

(defun test-this-python-file ()
    (interactive)
    (shell-command
        (concat (buffer-file-name (current-buffer)) " --test")))

Then use M-x local-set-key RET F9 test-this-python-file. That’s it: press F9 to run the tests, and you’re set to go.

Get GitLab URL

We get the target URL of GitLab directly from the remote.

import dulwich.repo

def assemble_repoinfo(path):
    """ Get the information needed to re-link issues and commits from a todo file.
    """
    R = dulwich.repo.Repo(path)
    C = R.get_config()
    url = C.get((b"remote", b"origin"), b"url")
    # drop the scheme including the "//", e.g. b"https://" or b"ssh://"
    url_stripped = url[url.index(b"//") + 2:]
    # drop credentials such as b"git@"
    if b"@" in url_stripped:
        url_stripped = url_stripped[url_stripped.index(b"@") + 1:]
    commit_uri_prefix = b"https://" + url_stripped + b"/commit/"
    return {'commit_uri_prefix': commit_uri_prefix.decode("utf-8")}

Hit F9 and smile at the successful doctest.
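The doctest itself is not part of the excerpt above. As a rough sketch of what it could look like, the URL handling can be pulled into a pure function that needs no repository. The function name commit_uri_prefix_from_remote and the host gitlab.example.com are made up for illustration, and the sketch assumes the remote URL contains a "//" after the scheme.

```python
def commit_uri_prefix_from_remote(url):
    """Turn a remote URL (bytes) into a commit link prefix (str).

    >>> commit_uri_prefix_from_remote(b"ssh://git@gitlab.example.com/group/repo")
    'https://gitlab.example.com/group/repo/commit/'
    >>> commit_uri_prefix_from_remote(b"https://gitlab.example.com/group/repo")
    'https://gitlab.example.com/group/repo/commit/'
    """
    # drop the scheme, e.g. b"ssh://" or b"https://"
    rest = url.split(b"//", 1)[-1]
    # drop credentials such as b"git@", if there are any
    rest = rest.split(b"@", 1)[-1]
    return (b"https://" + rest + b"/commit/").decode("utf-8")
```

With the --test runner from the previous section, both examples in the docstring run as doctests.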

The information is stored as a json file for later usage (not shown here, see the full source).
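For readers who do not want to look up the full source, storing and reloading that information could look roughly like this. The function names save_repoinfo and load_repoinfo are assumptions for this sketch, not the names used in the real script.

```python
import json

def save_repoinfo(repoinfo, filepath):
    """Write the dict returned by assemble_repoinfo to filepath as JSON."""
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(repoinfo, f, indent=2)

def load_repoinfo(filepath):
    """Read the repo info back; the relinking step needs 'commit_uri_prefix'."""
    with open(filepath, encoding="utf-8") as f:
        return json.load(f)
```

Because the prefix is decoded to a str before it is returned, the dict can go straight into json.dump, which cannot serialize bytes.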

Extract a task list with issues and commits to link

Now we need the commits mapped to issues. The full code also tracks the time to relink the most recent issues first. Here this is left out for brevity.

import re

# optional non-capital prefix, then an ID like ABC-123 (capital letters, dash, number)
JIRA_ID_MATCHER = re.compile(rb"(?:[^A-Z]*)([A-Z]+-[0-9]+)", flags=re.MULTILINE)

def get_jira_issues(commit_message):
    """retrieve the JIRA issue referenced in the commit message
    """
    start = 0
    match = JIRA_ID_MATCHER.search(commit_message[start:])
    issues = set()
    while match:
        issues.add(match.group(1))
        start += match.end(1)
        match = JIRA_ID_MATCHER.search(commit_message[start:])
    return issues

def get_commit_ids_and_messages(repo_path, limit=1):
    """Get up to limit commits in the given repo_path.

    Yields tuples of (commit_id, message, commit_time).
    """
    repo = dulwich.repo.Repo(repo_path)
    walker = repo.get_graph_walker()
    n = 0
    while limit is None or n < limit:
        c = walker.next()
        if c is None:
            logging.debug("No more commits from walker %s", walker)
            break
        commit = repo.get_object(c)  # look up the commit only once
        yield (c, commit.message, commit.commit_time)
        n += 1

def get_issue_references(repo_path, limit=1, withfiles=False, withsizes=False):
    """get all issue references and their commit IDs.
    """
    for commit_id, message, commit_time in get_commit_ids_and_messages(repo_path, limit):
        isodate = epochtime_to_isodate(commit_time).encode("utf-8")
        # a commit might reference multiple issues
        issues = get_jira_issues(message)
        for issue in issues:
            yield (commit_id, issue, isodate, message)

def format_todo_entry(commit_id, issue, isodate, message):
    return b' '.join([commit_id, issue, isodate, message.replace(b'\n', b'---')])
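To see what the issue matching does with a real-looking message, here is a compact variant. It swaps in re.findall with a simplified pattern instead of the search loop above; for typical messages it should find the same IDs. The project keys CAD and TOOLS and the function name find_jira_issues are made up for this sketch.

```python
import re

# simplified pattern: a project key in capital letters, a dash, a number
SIMPLE_JIRA_ID_MATCHER = re.compile(rb"[A-Z]+-[0-9]+")

def find_jira_issues(commit_message):
    """Return the set of JIRA-style IDs found in commit_message (bytes)."""
    return set(SIMPLE_JIRA_ID_MATCHER.findall(commit_message))

message = b"CAD-123: fix export, see also TOOLS-7\n\nRead the file as UTF-8."
print(sorted(find_jira_issues(message)))
# prints [b'CAD-123', b'TOOLS-7', b'UTF-8']
```

Note the false positive: UTF-8 looks like an issue ID to the pattern. That is tolerable here, because the relinking loop further down logs the JIRA error for an unknown issue and continues with the next task.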

That’s it. Run the code as

./retrieve_commits_and_issues.py [--output TODO_FILE.todo] [--previous OLD_TODO_FILE.todo] PATH_TO_GIT_REPO

to save a list of commits with their JIRA issues and commit date in the file TODO_FILE.todo. If you have a previous file, the new file only includes the commits missing in the OLD_TODO_FILE.todo.

That’s our internal data format: A tasklist of commits to add to JIRA.
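A single entry in that format can be sketched as follows. The helper epochtime_to_isodate is not shown in this post, so the version here is an assumption (a plain UTC date); the commit ID and issue key are made up as well.

```python
import datetime

def epochtime_to_isodate(epochtime):
    """Format seconds since the epoch as an ISO date in UTC (assumed format)."""
    moment = datetime.datetime.fromtimestamp(epochtime, tz=datetime.timezone.utc)
    return moment.strftime("%Y-%m-%d")

def format_todo_entry(commit_id, issue, isodate, message):
    # same as in the listing above: newlines in the message become '---'
    return b' '.join([commit_id, issue, isodate, message.replace(b'\n', b'---')])

entry = format_todo_entry(
    b"0123abcd",   # made-up, shortened commit ID
    b"CAD-42",     # made-up issue key
    epochtime_to_isodate(1542844800).encode("utf-8"),
    b"Fix crash\n\nNull check was missing")
print(entry)
# prints b'0123abcd CAD-42 2018-11-22 Fix crash------Null check was missing'
```

Whatever date format is used, it must not contain spaces: the first three fields are separated by single spaces, and only the message at the end may contain more of them.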

Test performance

Let’s keep in mind that we have to work with more than one million lines of code and 10 years of history (that’s the part we want to relink). And up to here I just tested the code on the history of the relinking project itself.

So let’s test extracting a full todo list for 60k commits.

$ time ./retrieve_commits_and_issues.py --output TODO_FILE.todo cadenza/
real    0m32,795s
user    0m31,273s
sys     0m1,277s

$ wc -l TODO_FILE.todo
62918 TODO_FILE.todo

Wow, Git is fast! And with Dulwich it is fast from Python!

No problem here, we can go forward.

Authenticate to Jira securely

JIRA login data should not be in a plain text file. Especially when the account might have to adjust JIRA issues that you as a developer do not have access to, and don’t want to have access to, because then you’d have to take care of the credentials. And the credentials should not lie around on a server unencrypted. Really not. But you want to be able to set up the system for testing and have only a single file to replace.

An encrypted file. So let’s start with encryption, because it’s critical.

I decided to blog about this, even though it’s not really part of Git or JIRA, because this requirement will come up in many projects. I hope it will be useful to some of you and improve the state of credentials.

The de-facto standard for general purpose encryption suited for scripting is gpg, so let’s use the gpg package to read the credentials file:

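import gpg

def user_and_password(netrc_path, parse_function):
    with gpg.Context() as c:
        # the encrypted file is binary data, so open it in binary mode
        with open(netrc_path, "rb") as f:
            # parse_function takes the decrypted file data and the host for which
            # you need the credentials (jira_server is not defined in this
            # excerpt; it has to be set elsewhere in the script)
            login_info = parse_function(c.decrypt(f)[0], jira_server)
    return login_info["user"], login_info["password"]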

With the most critical part out of the way, we can turn to how the data is actually encoded.

A standard for credentials in a file is the netrc file. It looks like this:

machine jira.HOST.TLD login USER password PASSWORD

And luckily there’s the netrc module in the Python standard library for that. But it only works with files, so we use tempfile to get from data to a file. If you do this, ensure that your temp directory is either encrypted or on tmpfs (with swap disabled).

import netrc, tempfile
def login_info_from_netrc(data, machine):
    """Retrieve username and password for the last entry matching the given machine from the netrc file

    :param data: bytes, i.e. b'machine FOO'

    >>> sorted(login_info_from_netrc(b"machine jira.HOST.TLD login USER password PASSWORD\\n", "jira.HOST.TLD").items())
    [('password', 'PASSWORD'), ('user', 'USER')]
    >>> sorted(login_info_from_netrc(b"machine jira.HOST.TLD login USER password PASSWORD\\nmachine jira.HOST.TLD login USER2 password PASSWORD2\\n", "jira.HOST.TLD").items())
    [('password', 'PASSWORD2'), ('user', 'USER2')]
    """
    temp = tempfile.NamedTemporaryFile()
    temp.write(data)
    temp.flush() # ensure that the data is written
    parsed = netrc.netrc(temp.name)
    res = parsed.authenticators(machine)
    temp.close() # the data is gone now
    return {"user": res[0], "password": res[2]}
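If writing the decrypted data to a temp file is not an option, a different technique is to parse the data in memory. The following is only a sketch under strong assumptions: it does not use the netrc module at all, it understands only entries of the form "machine HOST login USER password PASSWORD", and values must not contain whitespace or quoting. The function name login_info_from_netrc_data is made up.

```python
def login_info_from_netrc_data(data, machine):
    """Parse netrc-style bytes in memory; the last matching entry wins.

    Only handles 'machine HOST login USER password PASSWORD' entries
    whose values contain no whitespace.
    """
    tokens = data.decode("utf-8").split()
    entries = []
    # the supported format is a flat sequence of key/value pairs
    for key, value in zip(tokens[0::2], tokens[1::2]):
        if key == "machine":
            entries.append({"machine": value})
        elif key in ("login", "password") and entries:
            entries[-1][key] = value
    matching = [e for e in entries if e["machine"] == machine]
    if not matching:
        raise KeyError("no entry for machine %s" % machine)
    return {"user": matching[-1]["login"],
            "password": matching[-1]["password"]}
```

It takes the same arguments as login_info_from_netrc, so it could be passed as parse_function in the same way. The price is that netrc features such as default entries and macros are not supported.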

That solves getting the login data securely. So we can finally get to the core of this task.

Relink Gitlab and Jira

We have a list of commits and their JIRA issues. We have the GitLab URL. We have the JIRA server URL and credentials. Let’s use the JIRA API. But first, we need to read our intermediate format:

def read_todo(filepath):
    """ Read the todo file
    """
    with open(filepath) as f:
        for i in f:
            yield i

def process_taskline(taskline, commit_uri_prefix):
    """Retrieve issue_id, url, and title from a taskline in a todo file.
    """
    commit_id, issue_id, isodate, title = taskline.split(" ", 3)
    shorttitle = title.split('---')[0]
    # if the title is too short, include the second non-empty line
    if len(shorttitle) < 20:
        shorttitle = " - ".join([i for i in title.split('---')
                        if i.strip()][:2])
    return (issue_id,
        commit_uri_prefix + commit_id,
        " ".join((isodate, shorttitle.strip())))

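To make the parsing concrete, here is process_taskline applied to one made-up line of a todo file. The function body is copied from above so that the example runs on its own; the commit ID, issue key, and GitLab host are invented.

```python
def process_taskline(taskline, commit_uri_prefix):
    """Retrieve issue_id, url, and title from a taskline in a todo file."""
    commit_id, issue_id, isodate, title = taskline.split(" ", 3)
    shorttitle = title.split('---')[0]
    # if the title is too short, include the second non-empty line
    if len(shorttitle) < 20:
        shorttitle = " - ".join([i for i in title.split('---')
                                 if i.strip()][:2])
    return (issue_id,
            commit_uri_prefix + commit_id,
            " ".join((isodate, shorttitle.strip())))

taskline = "0123abcd CAD-42 2018-11-22 Fix crash------Null check was missing\n"
prefix = "https://gitlab.example.com/group/repo/commit/"
issue_id, url, title = process_taskline(taskline, prefix)
print(issue_id)  # prints CAD-42
print(url)       # prints https://gitlab.example.com/group/repo/commit/0123abcd
print(title)     # prints 2018-11-22 Fix crash - Null check was missing
```

The first line of the commit message is shorter than 20 characters here, so the second non-empty line gets appended to the link title.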
Then we need an authenticated JIRA object.

import json, logging, os
import jira
netrc_path = os.path.expanduser(args.netrc_gpg_path)
user, password = user_and_password(netrc_path, login_info_from_netrc)

with open(args.repo_info_file) as f:
    commit_uri_prefix = json.load(f)["commit_uri_prefix"]

try:
    authed_jira = jira.JIRA(args.jira_api_server, basic_auth=(user, password))
except jira.exceptions.JIRAError as e:
    if "CAPTCHA_CHALLENGE" in str(e):
        logging.error(
            ("JIRA requires a CAPTCHA, please log in to %s in the browser to solve it "
             "(log out first if you are already logged in). "
             "Please also check that the PASSWORD used is correct."),
            args.jira_api_server)
        def ask(msg, options, default="y"):
            def getletter(msg):
                return "".join(input(msg).lower().strip()[:1])
            reply = getletter(msg)
            # keep asking until we get a valid option (or empty input for the default)
            while (reply not in options
                   and not (default and reply == "")):
                reply = getletter(msg)
            return (reply if reply else default)
        reply = ask("Open JIRA in browser (Y, n)? ", ["y", "n"], default="y")
        if reply == "y":
            import webbrowser
            webbrowser.open(args.jira_api_server)
        else:
            logging.warning("Not opening browser, please log in to %s manually.",
                            args.jira_api_server)
    else:
        raise

This is horribly long because JIRA kept failing me with a CAPTCHA challenge, so I had to log in from the browser and solve a CAPTCHA to prove that it’s an actual human who uses the API. And while we’re starting a browser from Python, we might as well do more :-)

import antigravity

Try it! :-)

Back to work: We have authed JIRA, so we can tie it all together:

def create_a_link(authed_jira, issue_id, url, title, icon_url):
    """Actually create the links."""
    # shorten too long titles
    if title[255:]:
        title = title[:251] + " ..."
    linkobject = {"url": url, "title": title,
            "icon": {"url16x16": icon_url,
                "title": "Gitlab"}}
    # get the existing links
    links = get_all_links(authed_jira, issue_id)
    if url in links:
        # FIXME: This is almost exactly factor 100 slower than initial creation.
        links[url].update(linkobject)
    else:
        authed_jira.add_simple_link(issue_id, linkobject)

@functools.lru_cache(maxsize=60000)
def get_all_links(authed_jira, issue_id):
    """Retrieve all links defined in the issue."""
    links = authed_jira.remote_links(issue_id)
    return {i.object.url: i for i in links}

Note the FIXME: you do not want to be forced to re-do this, so start small and only run the full code when you’re ready.

Also note the title[255:]. It is really, really annoying when you find out after 30k commits that your script breaks because JIRA only allows titles up to 255 chars.
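The truncation is easy to get off by one, so it can help to isolate it in a small function that can be tested without a JIRA server. This is a sketch; the name shorten_title and the limit parameter are made up, and the 255 comes from the limit described above.

```python
JIRA_TITLE_LIMIT = 255  # maximum title length as described in the text

def shorten_title(title, limit=JIRA_TITLE_LIMIT):
    """Cut overly long titles so that the result, including ' ...', fits the limit."""
    if len(title) > limit:
        # leave room for the four characters of " ..."
        return title[:limit - 4] + " ..."
    return title

print(len(shorten_title("x" * 300)))  # prints 255
print(len(shorten_title("x" * 255)))  # prints 255
```

A title of exactly 255 characters is left untouched, and anything longer ends up at exactly 255 characters.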

All the same, it is a huge relief when you then realize that you won’t run into the factor 100 slowdown, because your script created a FINISHED file you can use to exclude all the links which are already done. Kudos to our lead of development for suggesting that!

The last part we need is the list of tasks, and the list of tasks to exclude :-)

try:
    excluded = set(read_todo(args.exclude_file))
except FileNotFoundError as e:
    logging.warning("Exclude file not found: %s", args.exclude_file)
    excluded = set()
tasklines = []
for todofile in args.todofiles:
    # skip every taskline that is already listed in the exclude file
    tasklines.extend([line for line in read_todo(todofile)
                      if line not in excluded])

And we can finally pull it all together:

starttime = time.time()
with open(args.logfile_for_processed_tasks, "a") as f:
    for taskline in tasklines:
        issue_id, url, title = process_taskline(taskline, commit_uri_prefix)
        try:
            links = get_all_links(authed_jira, issue_id)
        except jira.exceptions.JIRAError as e:
            logging.error(e)
            continue
        if args.create_the_links:
            create_a_link(authed_jira, issue_id, url, title, args.icon_url)
            f.write(taskline)
            logging.info("created link using JIRA server %s for issue %s using url %s and title %s",
                    args.jira_api_server, issue_id, url, title)
        else:
            logging.warning("Tryout mode! Use -c to create link using JIRA server %s for issue %s using url %s and title %s",
                    args.jira_api_server, issue_id, url, title)
if args.create_the_links:
    stoptime = time.time()
    logging.info("created %s links in %s seconds", len(tasklines), stoptime - starttime)

Check the speed again

So that’s it. But we’re not done without a final speed test!

The basic test I did was running it on the commits in the tools project, and then scaling up. The script created 4 links in 0.153 seconds, so it would create about 26 per second.

So we can estimate that with the 60k commits in our Cadenza history, it would take well under an hour (roughly 40 minutes at that rate) to relink everything, provided JIRA did not slow down. And luckily JIRA stayed fast.
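The estimate is plain arithmetic on the numbers measured above. As a quick sketch (it assumes the rate from the tiny 4-link test holds for the full run, which is an optimistic assumption):

```python
measured_links = 4
measured_seconds = 0.153
total_links = 60_000  # the "60k commits" from the text, rounded

links_per_second = measured_links / measured_seconds
estimated_seconds = total_links * measured_seconds / measured_links
estimated_minutes = estimated_seconds / 60

print(round(links_per_second, 1))   # prints 26.1
print(round(estimated_minutes, 1))  # prints 38.2 or 38.3, depending on float rounding
```

That comes to a bit under 40 minutes if the rate holds, with some headroom in case JIRA slows down.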

So that’s that: We’re on GitLab!

Thank you for reading! I hope you enjoyed reading it as much as I enjoyed doing the relinking right!