Repository

The Repository object represents a single git repository. It can be created in two main ways:

  • By explicitly passing the directory of a local git repository
  • By explicitly passing a remote git repository to be cloned locally into a temporary directory.

Using each method:

Explicit Local

A local repository is specified by passing its directory path as a string to working_dir:

from gitpandas import Repository
# Point at a local checkout; the path itself must contain a .git directory.
repo = Repository(working_dir='/path/to/repo1/', verbose=True)

Subdirectories of the given path are not searched, so the path itself must contain a .git directory.

Explicit Remote

A remote repository is specified by passing its git URL as working_dir:

from gitpandas import Repository
# A git URL is cloned into a temporary directory before analysis.
repo = Repository(working_dir='git://github.com/user/repo.git', verbose=True)

The repository will be cloned locally into a temporary directory, which can be somewhat slow for large repos.

Detailed API Documentation

class gitpandas.repository.GitFlowRepository

A special case where git flow is followed, so something is known about the branching scheme.

class gitpandas.repository.Repository(working_dir=None, verbose=False)

The base class for a generic git repository, from which to gather statistics. The object encapsulates a single gitpython Repo instance.

Parameters:
  • working_dir – the directory of the git repository, which must contain a .git directory (default None, meaning the current working directory)

blame(rev='HEAD', committer=True, by='repository', ignore_globs=None, include_globs=None)

Returns the blame at the given revision (the current HEAD by default) as a DataFrame. By default the DataFrame is grouped by committer name, so it holds the sum of all contributions to the repository by each committer. As with the commit history method, the ignore_globs and include_globs parameters can be passed to exclude certain files or to focus on others. The DataFrame will have the columns:

  • committer
  • loc
Parameters:
  • rev – (optional, default=HEAD) the specific revision to blame
  • committer – (optional, default=True) true if committer should be reported, false if author
  • by – (optional, default=repository) whether to group by repository or by file
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame
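
The interplay of ignore_globs and include_globs described above can be illustrated with a small sketch. It assumes fnmatch-style matching against file paths; keep_file is a hypothetical helper written for illustration and is not part of the gitpandas API.

```python
from fnmatch import fnmatch


def keep_file(path, ignore_globs=None, include_globs=None):
    """Hypothetical helper: True if path passes both glob filters."""
    # include_globs=None means "include everything".
    if include_globs is not None and not any(fnmatch(path, g) for g in include_globs):
        return False
    # ignore_globs=None means "exclude nothing".
    if ignore_globs is not None and any(fnmatch(path, g) for g in ignore_globs):
        return False
    return True


files = ["src/app.py", "src/app_test.py", "docs/index.md"]
kept = [
    f for f in files
    if keep_file(f, ignore_globs=["*_test.py"], include_globs=["*.py"])
]
print(kept)
```

With both filters set to None every file is kept, which matches the documented defaults.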

branches()

Returns a DataFrame of all branches in origin. The DataFrame will have the columns:

  • repository
  • branch
  • local
Returns: DataFrame

bus_factor(by='repository', ignore_globs=None, include_globs=None)

An experimental heuristic for the truck factor of a repository, calculated from the current distribution of blame in the repository’s primary branch. The factor is the smallest number of contributors whose combined contributions make up at least 50% of the codebase’s LOC.

Parameters:
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
  • by – (optional, default=repository) whether to group by repository or by file
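
The 50% rule described above can be sketched in a few lines. This is an illustration of the heuristic as stated, not the library's implementation; the function name, the threshold argument, and the sample blame counts are all made up for the example.

```python
def bus_factor_sketch(loc_by_contributor, threshold=0.5):
    """Smallest number of contributors covering `threshold` of all LOC."""
    total = sum(loc_by_contributor.values())
    if total == 0:
        return 0
    covered = 0
    # Walk contributors from largest to smallest share of blamed lines.
    ranked = sorted(loc_by_contributor.values(), reverse=True)
    for count, loc in enumerate(ranked, start=1):
        covered += loc
        if covered >= threshold * total:
            return count
    return len(ranked)


# Hypothetical blame totals (lines of code per contributor).
blame = {"alice": 500, "bob": 300, "carol": 150, "dave": 50}
print(bus_factor_sketch(blame))
```

Here a single contributor holds half of the 1000 blamed lines, so the sketch reports a factor of 1.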

commit_history(branch='master', limit=None, days=None, ignore_globs=None, include_globs=None)

Returns a pandas DataFrame containing all of the commits for a given branch. Included in that DataFrame will be the columns:

  • date (index)
  • author
  • committer
  • message
  • lines
  • insertions
  • deletions
  • net
Parameters:
  • branch – the branch to return commits for
  • limit – (optional, default=None) a maximum number of commits to return, None for no limit
  • days – (optional, default=None) number of days to return, if limit is None
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame

coverage()

If there is a .coverage file available, this will attempt to form a DataFrame with that information in it, which will contain the columns:

  • filename
  • lines_covered
  • total_lines
  • coverage

If the file can’t be found or parsed, an empty DataFrame with those columns will be returned.

Returns: DataFrame

cumulative_blame(branch='master', limit=None, skip=None, num_datapoints=None, committer=True, ignore_globs=None, include_globs=None)

Returns the blame at every revision of interest. Index is a datetime, column per committer, with number of lines blamed to each committer at each timestamp as data.

Parameters:
  • branch – (optional, default ‘master’) the branch to work in
  • limit – (optional, default None), the maximum number of revisions to return, None for no limit
  • skip – (optional, default None), the number of revisions to skip. Ex: skip=2 returns every other revision, None for no skipping.
  • num_datapoints – (optional, default=None) if limit and skip are both None and this is set, then num_datapoints evenly spaced revisions will be used
  • committer – (optional, default=True) true if committer should be reported, false if author
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame

file_change_history(branch='master', limit=None, days=None, ignore_globs=None, include_globs=None)

Returns a DataFrame of all file changes (via the commit history) for the specified branch. This is similar to the commit history DataFrame, but is one row per file edit rather than one row per commit (which may encapsulate many file changes). Included in the DataFrame will be the columns:

  • date (index)
  • author
  • committer
  • message
  • filename
  • insertions
  • deletions
Parameters:
  • branch – the branch to return commits for
  • limit – (optional, default=None) a maximum number of commits to return, None for no limit
  • days – (optional, default=None) number of days to return if limit is None
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame

file_change_rates(branch='master', limit=None, coverage=False, days=None, ignore_globs=None, include_globs=None)

Returns a DataFrame containing some basic aggregations of the file change history data, optionally joined with test coverage data from a coverage.py .coverage file. The aim is to identify files in the project that have abnormal edit rates, or that change often without growing in size. A file with a high change rate and poor test coverage is a strong candidate for more tests.

Parameters:
  • branch – (optional, default=master) the branch to return commits for
  • limit – (optional, default=None) a maximum number of commits to return, None for no limit
  • coverage – (optional, default=False) a bool for whether or not to attempt to join in coverage data.
  • days – (optional, default=None) number of days to return if limit is None
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame

file_detail(include_globs=None, ignore_globs=None, rev='HEAD', committer=True)

Returns a table of all current files in the repository, with some high-level information about each file (total LOC, file owner, extension, most recent edit date, etc.).

Parameters:
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
  • committer – (optional, default=True) true if committer should be reported, false if author

file_owner(rev, filename, committer=True)

Returns the owner (by majority blame) of a given file at a given revision. Returns the committer’s name.

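Parameters:
  • rev – the revision at which to compute the blame
  • filename – the file whose owner is returned
  • committer – (optional, default=True) true if committer should be reported, false if author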
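
Ownership by majority blame can be pictured as picking the name attached to the most lines of a file. The sketch below assumes the blame is available as one name per line; majority_owner and the sample data are hypothetical, and how ties are broken in the library is not specified here.

```python
from collections import Counter


def majority_owner(blamed_lines):
    """Return the name blamed for the most lines, or None for an empty file."""
    counts = Counter(blamed_lines)
    if not counts:
        return None
    # most_common(1) yields the (name, line_count) pair with the highest count.
    owner, _ = counts.most_common(1)[0]
    return owner


# Hypothetical per-line blame for a five-line file.
lines = ["alice", "bob", "alice", "alice", "carol"]
print(majority_owner(lines))
```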

has_coverage()

Returns a boolean indicating whether a parseable .coverage file can be found in the repository.

Returns: bool

hours_estimate(branch='master', grouping_window=0.5, single_commit_hours=0.5, limit=None, days=None, committer=True, ignore_globs=None, include_globs=None)

Inspired by: https://github.com/kimmobrunfeldt/git-hours/blob/8aaeee237cb9d9028e7a2592a25ad8468b1f45e4/index.js#L114-L143

Iterates through the commit history of the repository to estimate the time commitment of each author or committer over the period selected by limit, days, and the glob filters.

Parameters:
  • branch – the branch to return commits for
  • limit – (optional, default=None) a maximum number of commits to return, None for no limit
  • grouping_window – (optional, default=0.5 hours) the threshold for how close two commits need to be to consider them part of one coding session
  • single_commit_hours – (optional, default 0.5 hours) the time range to associate with one single commit
  • days – (optional, default=None) number of days to return, if limit is None
  • committer – (optional, default=True) whether to use committer vs. author
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame
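
The roles of grouping_window and single_commit_hours can be seen in a rough sketch of a session-based estimate for one person's commits. It follows the general idea of the git-hours approach linked above; estimate_hours is a hypothetical name, and the details (for example, whether the first commit is credited) are assumptions that may differ from the library's actual behaviour.

```python
from datetime import datetime


def estimate_hours(timestamps, grouping_window=0.5, single_commit_hours=0.5):
    """Estimate hours worked from one person's commit timestamps."""
    ts = sorted(timestamps)
    if not ts:
        return 0.0
    # Assumption: the very first commit is credited a fixed amount of time.
    hours = single_commit_hours
    for prev, cur in zip(ts, ts[1:]):
        gap = (cur - prev).total_seconds() / 3600.0
        if gap < grouping_window:
            # Same coding session: count the time between the two commits.
            hours += gap
        else:
            # New session: credit the fixed single-commit allowance.
            hours += single_commit_hours
    return hours


# Hypothetical commit times: three close together, one hours later.
commits = [
    datetime(2020, 1, 1, 9, 0),
    datetime(2020, 1, 1, 9, 10),
    datetime(2020, 1, 1, 9, 25),
    datetime(2020, 1, 1, 14, 0),
]
total = estimate_hours(commits)
print(round(total, 2))
```

In this example the first three commits form one session and the last commit starts a new one.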

is_bare()

Returns a boolean indicating whether the repository is bare.

Returns: bool

parallel_cumulative_blame(branch='master', limit=None, skip=None, num_datapoints=None, committer=True, workers=1, ignore_globs=None, include_globs=None)

Returns the blame at every revision of interest. Index is a datetime, column per committer, with number of lines blamed to each committer at each timestamp as data.

Parameters:
  • branch – (optional, default ‘master’) the branch to work in
  • limit – (optional, default None), the maximum number of revisions to return, None for no limit
  • skip – (optional, default None), the number of revisions to skip. Ex: skip=2 returns every other revision, None for no skipping.
  • num_datapoints – (optional, default=None) if limit and skip are both None and this is set, then num_datapoints evenly spaced revisions will be used
  • committer – (optional, default=True) true if committer should be reported, false if author
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
  • workers – (optional, default=1) integer, the number of workers to use in the threadpool, -1 for one per core.
Returns:

DataFrame
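
The workers convention (-1 for one worker per core) can be sketched with the standard library thread pool. This is only an illustration of the pattern; resolve_workers, blame_all, and the stand-in blame function are hypothetical and do not reflect how the library schedules its work internally.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_workers(workers=1):
    """Translate the workers argument into a concrete thread count."""
    if workers == -1:
        # One worker per core; fall back to 1 if the count is unknown.
        return os.cpu_count() or 1
    return max(1, workers)


def blame_all(revs, blame_one, workers=1):
    """Apply blame_one to every revision using a thread pool, keeping order."""
    with ThreadPoolExecutor(max_workers=resolve_workers(workers)) as pool:
        return list(pool.map(blame_one, revs))


# Stand-in for a per-revision blame computation.
results = blame_all(["r1", "r2", "r3"], lambda rev: rev.upper(), workers=-1)
print(results)
```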

punchcard(branch='master', limit=None, days=None, by=None, normalize=None, ignore_globs=None, include_globs=None)

Returns a pandas DataFrame containing all of the data for a punchcard. The DataFrame will have the columns:

  • day_of_week
  • hour_of_day
  • author / committer
  • lines
  • insertions
  • deletions
  • net
Parameters:
  • branch – the branch to return commits for
  • limit – (optional, default=None) a maximum number of commits to return, None for no limit
  • days – (optional, default=None) number of days to return, if limit is None
  • by – (optional, default=None) the aggregation option: None for no aggregation (just a high-level punchcard), or ‘committer’ or ‘author’
  • normalize – (optional, default=None) if an integer, returns the data normalized so that the maximum value equals that integer (for plotting)
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame
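
The normalize option rescales values so that the largest one equals the given integer, which is convenient for sizing plot markers. A minimal sketch of that rescaling, using a hypothetical helper and made-up counts:

```python
def normalize_to_max(values, normalize):
    """Scale values so the largest one equals `normalize`."""
    values = list(values)
    peak = max(values, default=0)
    if peak == 0:
        # Nothing to scale against; return zeros of the same length.
        return [0.0 for _ in values]
    return [v * normalize / peak for v in values]


# Hypothetical line counts for four hour-of-day buckets.
lines_per_hour = [0, 12, 48, 24]
scaled = normalize_to_max(lines_per_hour, 100)
print(scaled)
```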

revs(branch='master', limit=None, skip=None, num_datapoints=None)

Returns a DataFrame of all revision tags and their timestamps. It will have the columns:

  • date
  • rev
Parameters:
  • branch – (optional, default ‘master’) the branch to work in
  • limit – (optional, default None), the maximum number of revisions to return, None for no limit
  • skip – (optional, default None), the number of revisions to skip. Ex: skip=2 returns every other revision, None for no skipping.
  • num_datapoints – (optional, default=None) if limit and skip are both None and this is set, then num_datapoints evenly spaced revisions will be used
Returns:

DataFrame
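
How limit, skip, and num_datapoints interact can be sketched on a plain list of revisions. This is one plausible reading of the parameter descriptions above, not the library's code: select_revs is a hypothetical helper, and the order in which limit and skip are applied, as well as the exact spacing used for num_datapoints, are assumptions.

```python
def select_revs(revs, limit=None, skip=None, num_datapoints=None):
    """Pick a subset of revisions according to limit, skip, or num_datapoints."""
    revs = list(revs)
    if limit is None and skip is None and num_datapoints is not None:
        if num_datapoints <= 0:
            return []
        # Roughly evenly spaced: take every step-th revision.
        step = max(1, len(revs) // num_datapoints)
        return revs[::step][:num_datapoints]
    if limit is not None:
        revs = revs[:limit]
    if skip is not None and skip > 0:
        # skip=2 keeps every other revision.
        revs = revs[::skip]
    return revs


all_revs = list(range(10))  # stand-ins for ten revision identifiers
print(select_revs(all_revs, skip=2))
print(select_revs(all_revs, num_datapoints=5))
print(select_revs(all_revs, limit=3))
```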

tags()

Returns a DataFrame of all tags in origin. The DataFrame will have the columns:

  • repository
  • tag
Returns: DataFrame