Repository¶

The Repository object represents a single git repository and can be created in two main ways:

By explicitly passing the directory of a local git repository

By explicitly passing a remote git repository to be cloned locally into a temporary directory.

Using each method:

Explicit Local¶

An explicit local directory is passed as a string in working_dir:

from gitpandas import Repository
# named repo rather than pd to avoid confusion with the usual pandas alias
repo = Repository(working_dir='/path/to/repo1/', verbose=True)

The subdirectories of the directory passed are not searched, so it must have a .git directory in it.
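As a rough illustration of that requirement, the following sketch checks whether a directory contains a .git entry directly, without descending into subdirectories. The helper name has_git_dir is hypothetical and is not part of gitpandas; it only mirrors the rule stated above.

```python
import tempfile
from pathlib import Path


def has_git_dir(working_dir):
    """Return True if working_dir itself contains a .git entry."""
    # Only the directory itself is checked; subdirectories are not searched.
    return (Path(working_dir) / ".git").exists()


# Small demonstration using throwaway directories.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "repo1" / ".git").mkdir(parents=True)
    (root / "parent" / "nested" / ".git").mkdir(parents=True)
    direct = has_git_dir(root / "repo1")     # .git sits directly inside
    indirect = has_git_dir(root / "parent")  # .git is one level too deep
```

Running a check like this before constructing a Repository can give a clearer error message than a failure deep inside the git layer.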

Explicit Remote¶

An explicit remote repository is passed as a git URL in working_dir:

from gitpandas import Repository
# named repo rather than pd to avoid confusion with the usual pandas alias
repo = Repository(working_dir='git://github.com/user/repo.git', verbose=True)

The repository will be cloned locally into a temporary directory, which can be somewhat slow for large repos.

Detailed API Documentation¶

class gitpandas.repository.GitFlowRepository[source]¶

A special case where git flow is followed, so we know something about the branching scheme.

class gitpandas.repository.Repository(working_dir=None, verbose=False)[source]¶

The base class for a generic git repository, from which to gather statistics. The object encapsulates a single GitPython Repo instance.

Parameters:	working_dir – the directory of the git repository, meaning a .git directory is in it (default None=cwd)
Returns:

blame(rev='HEAD', committer=True, by='repository', ignore_globs=None, include_globs=None)[source]¶

Returns the blame at the given revision (HEAD by default) as a DataFrame. By default the DataFrame is grouped by committer name, so each row is the sum of all contributions to the repository by one committer. As with the commit history method, the ignore_globs and include_globs parameters can be passed to exclude certain files or to focus on certain ones. The DataFrame will have the columns:

committer

loc

Parameters:

rev – (optional, default=HEAD) the specific revision to blame
committer – (optional, default=True) true if committer should be reported, false if author
by – (optional, default=repository) whether to group by repository or by file
ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
include_globs – (optional, default=None) a list of globs to include, default of None includes everything.

Returns:

DataFrame
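The grouping and glob filtering described above can be sketched in plain Python. This is a minimal illustration, not the gitpandas implementation: the function names, the input shape (a mapping from file path to one committer name per line), and the use of fnmatch for glob matching are all assumptions made for the example.

```python
from collections import Counter
from fnmatch import fnmatch


def keep_file(path, ignore_globs=None, include_globs=None):
    """Apply include/ignore globs; None means 'no restriction'."""
    if include_globs is not None and not any(fnmatch(path, g) for g in include_globs):
        return False
    if ignore_globs is not None and any(fnmatch(path, g) for g in ignore_globs):
        return False
    return True


def blame_by_committer(line_owners, ignore_globs=None, include_globs=None):
    """Sum blamed lines per committer.

    line_owners maps a file path to a list with one committer name per line.
    """
    loc = Counter()
    for path, owners in line_owners.items():
        if keep_file(path, ignore_globs, include_globs):
            loc.update(owners)
    return dict(loc)


sample = {
    "src/app.py": ["alice", "alice", "bob"],
    "docs/index.md": ["carol"],
}
python_only = blame_by_committer(sample, include_globs=["*.py"])
```

Here python_only counts only the lines in src/app.py, because docs/index.md does not match the include glob.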

branches()[source]¶

Returns a DataFrame of all branches in origin. The DataFrame will have the columns:

repository

branch

local

Returns:	DataFrame

bus_factor(by='repository', ignore_globs=None, include_globs=None)[source]¶

An experimental heuristic for the truck factor of a repository, calculated from the current distribution of blame in the repository’s primary branch. The factor is the smallest number of contributors whose contributions make up at least 50% of the codebase’s LOC.

Parameters:

ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
by – (optional, default=repository) whether to group by repository or by file
Returns:
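The 50% rule can be sketched directly from a committer-to-LOC mapping such as the one blame produces. This is an illustrative calculation under the definition given above, not the library's own code; the threshold argument is a hypothetical knob added for the example.

```python
def bus_factor(loc_by_committer, threshold=0.5):
    """Fewest contributors whose combined LOC reaches threshold of the total."""
    total = sum(loc_by_committer.values())
    if total == 0:
        return 0
    covered = 0
    # Take the largest contributors first so the count is minimal.
    ranked = sorted(loc_by_committer.values(), reverse=True)
    for count, loc in enumerate(ranked, start=1):
        covered += loc
        if covered >= threshold * total:
            return count
    return len(ranked)


concentrated = bus_factor({"alice": 60, "bob": 30, "carol": 10})
spread_out = bus_factor({"alice": 40, "bob": 35, "carol": 25})
```

In the first mapping a single contributor already owns more than half of the lines; in the second, two contributors are needed to reach the threshold.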

commit_history(branch='master', limit=None, days=None, ignore_globs=None, include_globs=None)[source]¶

Returns a pandas DataFrame containing all of the commits for a given branch. Included in that DataFrame will be the columns:

date (index)

author

committer

message

lines

insertions

deletions

net

Parameters:

branch – the branch to return commits for
limit – (optional, default=None) a maximum number of commits to return, None for no limit
days – (optional, default=None) number of days to return, if limit is None
ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
include_globs – (optional, default=None) a list of globs to include, default of None includes everything.

Returns:

DataFrame
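The interplay of limit and days (days only applies when limit is None) can be sketched as follows. The function name, the list-of-dicts input, and the now argument are assumptions for illustration; gitpandas itself returns a DataFrame and may order or filter commits differently.

```python
from datetime import datetime, timedelta


def select_commits(commits, limit=None, days=None, now=None):
    """Pick commits from a newest-first list of dicts with a 'date' key."""
    if limit is not None:
        # limit takes precedence; days is ignored in this case.
        return commits[:limit]
    if days is not None:
        now = now or datetime.now()
        cutoff = now - timedelta(days=days)
        return [c for c in commits if c["date"] >= cutoff]
    return list(commits)


now = datetime(2024, 1, 31)
history = [
    {"message": "fix typo", "date": datetime(2024, 1, 30)},
    {"message": "add tests", "date": datetime(2024, 1, 20)},
    {"message": "initial commit", "date": datetime(2023, 12, 1)},
]
recent = select_commits(history, days=14, now=now)
```

With a 14-day window ending on 31 January, only the two January commits are kept.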

coverage()[source]¶

If there is a .coverage file available, this will attempt to form a DataFrame with that information in it, which will contain the columns:

filename

lines_covered

total_lines

coverage

If it can’t be found or parsed, an empty DataFrame of that form will be returned.

Returns:	DataFrame

cumulative_blame(branch='master', limit=None, skip=None, num_datapoints=None, committer=True, ignore_globs=None, include_globs=None)[source]¶

Returns the blame at every revision of interest. Index is a datetime, column per committer, with number of lines blamed to each committer at each timestamp as data.

Parameters:

branch – (optional, default ‘master’) the branch to work in
limit – (optional, default None), the maximum number of revisions to return, None for no limit
skip – (optional, default None), the number of revisions to skip. Ex: skip=2 returns every other revision, None for no skipping.
num_datapoints – (optional, default=None) if limit and skip are none, and this isn’t, then num_datapoints evenly spaced revs will be used
committer – (optional, default=True) true if committer should be reported, false if author
ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
include_globs – (optional, default=None) a list of globs to include, default of None includes everything.

Returns:

DataFrame
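The shape of the result (one row per timestamp, one column per committer) can be sketched without pandas. This is an assumed, simplified construction: it takes per-revision blame mappings as input and fills in zero for committers who have no lines at a given revision.

```python
from datetime import datetime


def cumulative_blame_table(blame_by_rev):
    """Build (columns, rows) from a list of (timestamp, {committer: loc}) pairs."""
    committers = sorted({name for _, blame in blame_by_rev for name in blame})
    rows = []
    # Order rows chronologically so the table reads as a time series.
    for timestamp, blame in sorted(blame_by_rev, key=lambda pair: pair[0]):
        rows.append((timestamp, [blame.get(name, 0) for name in committers]))
    return committers, rows


history = [
    (datetime(2024, 2, 1), {"alice": 120, "bob": 30}),
    (datetime(2024, 1, 1), {"alice": 100}),
]
columns, rows = cumulative_blame_table(history)
```

A table in this form is convenient for stacked area plots of code ownership over time.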

file_change_history(branch='master', limit=None, days=None, ignore_globs=None, include_globs=None)[source]¶

Returns a DataFrame of all file changes (via the commit history) for the specified branch. This is similar to the commit history DataFrame, but is one row per file edit rather than one row per commit (which may encapsulate many file changes). Included in the DataFrame will be the columns:

date (index)

author

committer

message

filename

insertions

deletions

Parameters:

branch – the branch to return commits for
limit – (optional, default=None) a maximum number of commits to return, None for no limit
days – (optional, default=None) number of days to return if limit is None
ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
include_globs – (optional, default=None) a list of globs to include, default of None includes everything.

Returns:

DataFrame

file_change_rates(branch='master', limit=None, coverage=False, days=None, ignore_globs=None, include_globs=None)[source]¶

This function will return a DataFrame containing some basic aggregations of the file change history data, and optionally test coverage data from a coverage_data.py .coverage file. The aim here is to identify files in the project which have abnormally high edit rates, or a high rate of change without growth in the file’s size. If a file has a high change rate and poor test coverage, then it is a great candidate for writing more tests.

Parameters:

branch – (optional, default=master) the branch to return commits for
limit – (optional, default=None) a maximum number of commits to return, None for no limit
coverage – (optional, default=False) a bool for whether or not to attempt to join in coverage data.
days – (optional, default=None) number of days to return if limit is None
ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
include_globs – (optional, default=None) a list of globs to include, default of None includes everything.

Returns:

DataFrame
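The kind of aggregation described above can be sketched over per-file change records. The field names edits, churn, and net are hypothetical and chosen for this example; the actual columns returned by file_change_rates are not listed in this document.

```python
from collections import defaultdict


def change_summary(file_changes, coverage=None):
    """Aggregate change records (filename, insertions, deletions) per file."""
    summary = defaultdict(lambda: {"edits": 0, "churn": 0, "net": 0})
    for change in file_changes:
        row = summary[change["filename"]]
        row["edits"] += 1
        # churn counts every touched line; net is the growth of the file.
        row["churn"] += change["insertions"] + change["deletions"]
        row["net"] += change["insertions"] - change["deletions"]
    result = {name: dict(row) for name, row in summary.items()}
    if coverage is not None:
        # Join in coverage where known; files without data get None.
        for name, row in result.items():
            row["coverage"] = coverage.get(name)
    return result


changes = [
    {"filename": "a.py", "insertions": 10, "deletions": 2},
    {"filename": "a.py", "insertions": 1, "deletions": 9},
    {"filename": "b.py", "insertions": 5, "deletions": 0},
]
rates = change_summary(changes, coverage={"a.py": 0.4})
```

In this sample a.py has high churn but zero net growth and low coverage, which is the pattern the method is meant to surface.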

file_detail(include_globs=None, ignore_globs=None, rev='HEAD', committer=True)[source]¶

Returns a table of all current files in the repos, with some high level information about each file (total LOC, file owner, extension, most recent edit date, etc.).

Parameters:

ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
committer – (optional, default=True) true if committer should be reported, false if author
Returns:

file_owner(rev, filename, committer=True)[source]¶

Returns the owner (by majority blame) of a given file in a given rev. Returns the committer’s name.

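Parameters:

rev – the revision at which to compute blame
filename – the file whose owner is returned
committer – (optional, default=True) true if committer should be reported, false if author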
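Ownership by majority blame can be illustrated with a simple count over the per-line blame of one file. The helper below is a hypothetical sketch that takes a list of names (one per line) rather than a rev and filename; how gitpandas resolves ties is not stated here, so the example avoids them.

```python
from collections import Counter


def majority_owner(line_owners):
    """Return the name blamed for the most lines, or None for an empty file."""
    if not line_owners:
        return None
    # most_common(1) yields the single (name, count) pair with the top count.
    return Counter(line_owners).most_common(1)[0][0]


owner = majority_owner(["alice", "bob", "alice", "carol", "alice"])
```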

has_coverage()[source]¶

Returns a boolean indicating whether a parseable .coverage file can be found in the repository.

Returns:	bool

hours_estimate(branch='master', grouping_window=0.5, single_commit_hours=0.5, limit=None, days=None, committer=True, ignore_globs=None, include_globs=None)[source]¶

Inspired by: https://github.com/kimmobrunfeldt/git-hours/blob/8aaeee237cb9d9028e7a2592a25ad8468b1f45e4/index.js#L114-L143

Iterates through the commit history of the repository to estimate the time commitment of each author or committer over the period selected by limit, days, and the glob filters.

Parameters:

branch – the branch to return commits for
limit – (optional, default=None) a maximum number of commits to return, None for no limit
grouping_window – (optional, default=0.5 hours) the threshold for how close two commits need to be to consider them part of one coding session
single_commit_hours – (optional, default 0.5 hours) the time range to associate with one single commit
days – (optional, default=None) number of days to return, if limit is None
committer – (optional, default=True) whether to use committer vs. author
ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
include_globs – (optional, default=None) a list of globs to include, default of None includes everything.

Returns:

DataFrame
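The roles of grouping_window and single_commit_hours can be sketched for a single contributor in the spirit of the git-hours approach linked above. This is an assumption-laden illustration, not the gitpandas code: commits closer together than the window are treated as one session and contribute their gap, while each new session contributes a flat allowance.

```python
from datetime import datetime, timedelta


def estimate_hours(timestamps, grouping_window=0.5, single_commit_hours=0.5):
    """Estimate hours worked from one contributor's commit timestamps."""
    if not timestamps:
        return 0.0
    ordered = sorted(timestamps)
    hours = single_commit_hours  # allowance for the very first commit
    for previous, current in zip(ordered, ordered[1:]):
        gap = (current - previous).total_seconds() / 3600
        if gap < grouping_window:
            hours += gap  # same coding session: count the elapsed time
        else:
            hours += single_commit_hours  # new session: flat allowance
    return hours


start = datetime(2024, 1, 1, 9, 0)
stamps = [start, start + timedelta(minutes=15), start + timedelta(hours=5)]
total = estimate_hours(stamps)
```

For these three commits the estimate is 0.5 hours for the first commit, 0.25 hours for the 15-minute gap, and 0.5 hours for the session that starts five hours later.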

is_bare()[source]¶

Returns a boolean indicating whether the repo is bare.

Returns:	bool

parallel_cumulative_blame(branch='master', limit=None, skip=None, num_datapoints=None, committer=True, workers=1, ignore_globs=None, include_globs=None)[source]¶

Returns the blame at every revision of interest. Index is a datetime, column per committer, with number of lines blamed to each committer at each timestamp as data.

Parameters:

branch – (optional, default ‘master’) the branch to work in
limit – (optional, default None), the maximum number of revisions to return, None for no limit
skip – (optional, default None), the number of revisions to skip. Ex: skip=2 returns every other revision, None for no skipping.
num_datapoints – (optional, default=None) if limit and skip are none, and this isn’t, then num_datapoints evenly spaced revs will be used
committer – (optional, default=True) true if committer should be reported, false if author
ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
workers – (optional, default=1) integer, the number of workers to use in the threadpool, -1 for one per core.

Returns:

DataFrame
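The workers parameter can be illustrated with a thread pool from the standard library. The helper names and the blame_one callable are hypothetical stand-ins; the sketch only shows how -1 might be resolved to one worker per core and how per-revision work could be fanned out while keeping the results in revision order.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_workers(workers):
    """Translate -1 into one worker per core; otherwise use at least 1."""
    if workers == -1:
        return os.cpu_count() or 1
    return max(1, workers)


def blame_all(revs, blame_one, workers=1):
    """Run blame_one over every rev in a thread pool, preserving order."""
    with ThreadPoolExecutor(max_workers=resolve_workers(workers)) as pool:
        # Executor.map returns results in the order of the inputs.
        return list(pool.map(blame_one, revs))


results = blame_all(["a1", "b2", "c3"], str.upper, workers=-1)
```

Threads help most when each unit of work waits on an external git process, which is typically the case for blame.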

punchcard(branch='master', limit=None, days=None, by=None, normalize=None, ignore_globs=None, include_globs=None)[source]¶

Returns a pandas DataFrame containing all of the data for a punchcard. The DataFrame will have the columns:

day_of_week

hour_of_day

author / committer

lines

insertions

deletions

net

Parameters:

branch – the branch to return commits for
limit – (optional, default=None) a maximum number of commits to return, None for no limit
days – (optional, default=None) number of days to return, if limit is None
by – (optional, default=None) agg by options, None for no aggregation (just a high level punchcard), or ‘committer’, ‘author’
normalize – (optional, default=None) if an integer, returns the data normalized to max value of that (for plotting)
ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
include_globs – (optional, default=None) a list of globs to include, default of None includes everything.

Returns:

DataFrame
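A punchcard is essentially a sum over (day_of_week, hour_of_day) cells, optionally rescaled so the busiest cell equals the normalize value. The sketch below assumes Python's weekday() numbering (Monday is 0) and a simple list of (datetime, lines) pairs; both are illustrative choices and may differ from what gitpandas does internally.

```python
from collections import Counter
from datetime import datetime


def punchcard(commits, normalize=None):
    """Sum lines per (day_of_week, hour_of_day) cell, optionally rescaled."""
    cells = Counter()
    for when, lines in commits:
        cells[(when.weekday(), when.hour)] += lines
    if normalize is not None and cells:
        peak = max(cells.values())
        if peak:
            # Scale so that the busiest cell equals the normalize value.
            return {cell: value * normalize / peak for cell, value in cells.items()}
    return dict(cells)


commits = [
    (datetime(2024, 1, 1, 9, 5), 10),   # a Monday morning
    (datetime(2024, 1, 1, 9, 40), 30),  # same cell as above
    (datetime(2024, 1, 2, 14, 0), 10),  # a Tuesday afternoon
]
raw = punchcard(commits)
scaled = punchcard(commits, normalize=100)
```

Normalized values are convenient as marker sizes when plotting the punchcard.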

revs(branch='master', limit=None, skip=None, num_datapoints=None)[source]¶

Returns a DataFrame of all revision tags and their timestamps. It will have the columns:

date

rev

Parameters:

branch – (optional, default ‘master’) the branch to work in
limit – (optional, default None), the maximum number of revisions to return, None for no limit
skip – (optional, default None), the number of revisions to skip. Ex: skip=2 returns every other revision, None for no skipping.
num_datapoints – (optional, default=None) if limit and skip are none, and this isn’t, then num_datapoints evenly spaced revs will be used

Returns:

DataFrame
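The way limit, skip, and num_datapoints interact can be sketched on a plain list of revisions. This is one plausible reading of the parameter descriptions above, not the library's implementation; in particular, the integer step used for "evenly spaced" is an assumption.

```python
def select_revs(revs, limit=None, skip=None, num_datapoints=None):
    """Thin out a list of revisions according to limit, skip, num_datapoints."""
    selected = list(revs)
    if limit is None and skip is None and num_datapoints is not None:
        # Only used when neither limit nor skip is given.
        if num_datapoints <= 0 or not selected:
            return []
        step = max(1, len(selected) // num_datapoints)
        return selected[::step][:num_datapoints]
    if limit is not None:
        selected = selected[:limit]
    if skip is not None and skip > 0:
        selected = selected[::skip]  # skip=2 keeps every other revision
    return selected


all_revs = list(range(10))
every_other = select_revs(all_revs, skip=2)
five_points = select_revs(all_revs, num_datapoints=5)
```

Thinning revisions this way keeps expensive per-revision work, such as cumulative blame, to a manageable number of data points.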

tags()[source]¶

Returns a DataFrame of all tags in origin. The DataFrame will have the columns:

repository

tag

Returns:	DataFrame