Repository
The Repository object represents a single git repository. It can be created in two main ways:
- By explicitly passing the directory of a local git repository
- By explicitly passing a remote git repository to be cloned locally into a temporary directory.
Using each method:
Explicit Local
Explicit local directories are passed as a string for working_dir:
from gitpandas import Repository
repo = Repository(working_dir='/path/to/repo1/', verbose=True)
The subdirectories of the directory passed are not searched, so it must have a .git directory in it.
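Because subdirectories are not searched, it can help to check the path before constructing a Repository. A minimal sketch using only the standard library; the helper name has_git_dir is hypothetical and not part of gitpandas:

```python
import os


def has_git_dir(path):
    """Return True if `path` itself contains a .git directory.

    Only the given directory is inspected; parents and subdirectories
    are deliberately not searched, mirroring the rule described above.
    """
    return os.path.isdir(os.path.join(path, ".git"))
```

A path that fails this check would need to be corrected to point at the repository root before being passed as working_dir.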
Explicit Remote
Explicit remote repositories are passed as a git URL in working_dir:
from gitpandas import Repository
repo = Repository(working_dir='git://github.com/user/repo.git', verbose=True)
The repository will be cloned locally into a temporary directory, which can be somewhat slow for large repos.
Detailed API Documentation
class gitpandas.repository.GitFlowRepository
A special case where git flow is followed, so we know something about the branching scheme.
class gitpandas.repository.Repository(working_dir=None, verbose=False)
The base class for a generic git repository, from which to gather statistics. The object encapsulates a single gitpython Repo instance.
Parameters:
- working_dir – the directory of the git repository, meaning a .git directory is in it (default None=cwd)
blame(rev='HEAD', committer=True, by='repository', ignore_globs=None, include_globs=None)
Returns the blame at the given revision (by default the current HEAD) of the repository as a DataFrame. The DataFrame is grouped by committer name, so it will be the sum of all contributions to the repository by each committer. As with the commit history method, the ignore_globs and include_globs parameters can be passed to exclude certain files or to focus on others. The DataFrame will have the columns:
- committer
- loc
Parameters:
- rev – (optional, default=HEAD) the specific revision to blame
- committer – (optional, default=True) true if committer should be reported, false if author
- by – (optional, default=repository) whether to group by repository or by file
- ignore_globs – (optional, default=None) a list of globs to ignore, the default of None excludes nothing
- include_globs – (optional, default=None) a list of globs to include, the default of None includes everything
Returns: DataFrame
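The ignore_globs and include_globs parameters recur throughout this API. A minimal sketch of how such filtering could behave, using the standard library's fnmatch module; the helper name filter_paths and the exact matching rules are assumptions for illustration, not gitpandas's implementation:

```python
from fnmatch import fnmatchcase


def filter_paths(paths, ignore_globs=None, include_globs=None):
    """Keep paths that match include_globs (if given) and no ignore_globs.

    Assumption: None for either argument means "no filtering" on that side,
    matching the documented defaults.
    """
    kept = []
    for path in paths:
        # With include_globs set, a path must match at least one glob.
        if include_globs is not None and not any(
            fnmatchcase(path, glob) for glob in include_globs
        ):
            continue
        # With ignore_globs set, a path must match none of them.
        if ignore_globs is not None and any(
            fnmatchcase(path, glob) for glob in ignore_globs
        ):
            continue
        kept.append(path)
    return kept
```

Under these assumptions, include_globs=['*.py'] combined with ignore_globs=['tests/*'] would keep Python files outside the tests directory.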
branches()
Returns a DataFrame of all branches in origin. The DataFrame will have the columns:
- repository
- branch
- local
Returns: DataFrame
bus_factor(by='repository', ignore_globs=None, include_globs=None)
An experimental heuristic for the truck factor of a repository, calculated from the current distribution of blame in the repository’s primary branch. The factor is the smallest number of contributors whose contributions make up at least 50% of the codebase’s LOC.
Parameters:
- ignore_globs – (optional, default=None) a list of globs to ignore, the default of None excludes nothing
- include_globs – (optional, default=None) a list of globs to include, the default of None includes everything
- by – (optional, default=repository) whether to group by repository or by file
Returns:
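The heuristic described above can be sketched in plain Python. This is an illustration of the stated definition only, assuming a mapping of contributor name to blamed LOC as input; the function name and the threshold parameter are hypothetical:

```python
def bus_factor_from_loc(loc_by_committer, threshold=0.5):
    """Smallest number of contributors covering `threshold` of total LOC.

    loc_by_committer: dict mapping contributor name -> lines of code blamed.
    """
    total = sum(loc_by_committer.values())
    if total == 0:
        return 0
    running = 0
    # Take the largest contributors first so the count is minimal.
    ranked = sorted(loc_by_committer.values(), reverse=True)
    for count, loc in enumerate(ranked, start=1):
        running += loc
        if running >= threshold * total:
            return count
    return len(ranked)
```

For example, with 600, 300 and 100 lines blamed to three contributors, the first contributor alone covers at least half of the code, so the factor is 1.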
commit_history(branch='master', limit=None, days=None, ignore_globs=None, include_globs=None)
Returns a pandas DataFrame containing all of the commits for a given branch. Included in that DataFrame will be the columns:
- date (index)
- author
- committer
- message
- lines
- insertions
- deletions
- net
Parameters:
- branch – the branch to return commits for
- limit – (optional, default=None) a maximum number of commits to return, None for no limit
- days – (optional, default=None) number of days to return, if limit is None
- ignore_globs – (optional, default=None) a list of globs to ignore, the default of None excludes nothing
- include_globs – (optional, default=None) a list of globs to include, the default of None includes everything
Returns: DataFrame
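The limit and days parameters interact: days is only consulted when limit is None. A rough sketch of that precedence in plain Python, assuming commits are given newest first as (datetime, sha) tuples; the function name, the input shape, and the inclusive cutoff are assumptions, not the library's code:

```python
from datetime import datetime, timedelta


def select_commits(commits, limit=None, days=None, now=None):
    """Apply the limit/days precedence to a newest-first list of commits."""
    if limit is not None:
        # limit wins whenever it is set; days is ignored.
        return commits[:limit]
    if days is not None:
        reference = now if now is not None else datetime.now()
        cutoff = reference - timedelta(days=days)
        return [commit for commit in commits if commit[0] >= cutoff]
    # Neither filter set: return everything.
    return list(commits)
```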
coverage()
If there is a .coverage file available, this will attempt to form a DataFrame with that information in it, which will contain the columns:
- filename
- lines_covered
- total_lines
- coverage
If it can’t be found or parsed, an empty DataFrame of that form will be returned.
Returns: DataFrame
cumulative_blame(branch='master', limit=None, skip=None, num_datapoints=None, committer=True, ignore_globs=None, include_globs=None)
Returns the blame at every revision of interest. The index is a datetime, with one column per committer; the data is the number of lines blamed to each committer at each timestamp.
Parameters:
- branch – (optional, default ‘master’) the branch to work in
- limit – (optional, default None) the maximum number of revisions to return, None for no limit
- skip – (optional, default None) the number of revisions to skip. Ex: skip=2 returns every other revision, None for no skipping.
- num_datapoints – (optional, default=None) if limit and skip are None and this is not, then num_datapoints evenly spaced revs will be used
- committer – (optional, default=True) true if committer should be reported, false if author
- ignore_globs – (optional, default=None) a list of globs to ignore, the default of None excludes nothing
- include_globs – (optional, default=None) a list of globs to include, the default of None includes everything
Returns: DataFrame
file_change_history(branch='master', limit=None, days=None, ignore_globs=None, include_globs=None)
Returns a DataFrame of all file changes (via the commit history) for the specified branch. This is similar to the commit history DataFrame, but is one row per file edit rather than one row per commit (which may encapsulate many file changes). Included in the DataFrame will be the columns:
- date (index)
- author
- committer
- message
- filename
- insertions
- deletions
Parameters:
- branch – the branch to return commits for
- limit – (optional, default=None) a maximum number of commits to return, None for no limit
- days – (optional, default=None) number of days to return, if limit is None
- ignore_globs – (optional, default=None) a list of globs to ignore, the default of None excludes nothing
- include_globs – (optional, default=None) a list of globs to include, the default of None includes everything
Returns: DataFrame
file_change_rates(branch='master', limit=None, coverage=False, days=None, ignore_globs=None, include_globs=None)
This function will return a DataFrame containing some basic aggregations of the file change history data, and optionally test coverage data from a coverage.py .coverage file. The aim here is to identify files in the project which have abnormal edit rates, or a high rate of change without growth in the file’s size. If a file has a high change rate and poor test coverage, then it is a great candidate for writing more tests.
Parameters:
- branch – (optional, default=master) the branch to return commits for
- limit – (optional, default=None) a maximum number of commits to return, None for no limit
- coverage – (optional, default=False) a bool for whether or not to attempt to join in coverage data
- days – (optional, default=None) number of days to return, if limit is None
- ignore_globs – (optional, default=None) a list of globs to ignore, the default of None excludes nothing
- include_globs – (optional, default=None) a list of globs to include, the default of None includes everything
Returns: DataFrame
file_detail(include_globs=None, ignore_globs=None, rev='HEAD', committer=True)
Returns a table of all current files in the repository, with some high-level information about each file (total LOC, file owner, extension, most recent edit date, etc.).
Parameters:
- ignore_globs – (optional, default=None) a list of globs to ignore, the default of None excludes nothing
- include_globs – (optional, default=None) a list of globs to include, the default of None includes everything
- committer – (optional, default=True) true if committer should be reported, false if author
Returns:
file_owner(rev, filename, committer=True)
Returns the owner (by majority blame) of a given file in a given rev. Returns the committer’s name.
Parameters:
- rev –
- filename –
- committer –
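Ownership by majority blame can be illustrated with a few lines of plain Python. This sketch assumes the per-file blame is already available as a mapping of name to blamed lines, and it picks the largest share; the function name and the tie-breaking behaviour are assumptions:

```python
def owner_by_blame(loc_by_name):
    """Return the name with the most blamed lines, or None if there are none.

    Ties go to whichever name appears first in the mapping (an arbitrary
    choice made for this sketch).
    """
    if not loc_by_name:
        return None
    return max(loc_by_name, key=loc_by_name.get)
```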
has_coverage()
Returns a boolean indicating whether a parseable .coverage file can be found in the repository.
Returns: bool
hours_estimate(branch='master', grouping_window=0.5, single_commit_hours=0.5, limit=None, days=None, committer=True, ignore_globs=None, include_globs=None)
Inspired by: https://github.com/kimmobrunfeldt/git-hours/blob/8aaeee237cb9d9028e7a2592a25ad8468b1f45e4/index.js#L114-L143
Iterates through the commit history of the repository to estimate the time commitment of each author or committer over the period indicated by limit, days, and the glob filters.
Parameters:
- branch – the branch to return commits for
- limit – (optional, default=None) a maximum number of commits to return, None for no limit
- grouping_window – (optional, default=0.5 hours) the threshold for how close two commits need to be to consider them part of one coding session
- single_commit_hours – (optional, default 0.5 hours) the time range to associate with one single commit
- days – (optional, default=None) number of days to return, if limit is None
- committer – (optional, default=True) whether to use committer vs. author
- ignore_globs – (optional, default=None) a list of globs to ignore, the default of None excludes nothing
- include_globs – (optional, default=None) a list of globs to include, the default of None includes everything
Returns: DataFrame
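The session-grouping idea behind grouping_window and single_commit_hours can be sketched for a single contributor as follows. This is a simplified reading of the git-hours approach, not the code gitpandas runs; the function name and the handling of the first commit are assumptions:

```python
from datetime import datetime, timedelta


def estimate_hours(timestamps, grouping_window=0.5, single_commit_hours=0.5):
    """Estimate hours worked from one contributor's commit datetimes."""
    if not timestamps:
        return 0.0
    ordered = sorted(timestamps)
    window = timedelta(hours=grouping_window)
    # The first commit opens a session and is credited a fixed amount.
    hours = single_commit_hours
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap < window:
            # Same coding session: count the time between the two commits.
            hours += gap.total_seconds() / 3600.0
        else:
            # New session: credit the fixed single-commit amount.
            hours += single_commit_hours
    return hours
```

Under these assumptions, three commits 20 minutes apart followed by one several hours later would count as two sessions.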
parallel_cumulative_blame(branch='master', limit=None, skip=None, num_datapoints=None, committer=True, workers=1, ignore_globs=None, include_globs=None)
Returns the blame at every revision of interest, computed using a pool of workers. The index is a datetime, with one column per committer; the data is the number of lines blamed to each committer at each timestamp.
Parameters:
- branch – (optional, default ‘master’) the branch to work in
- limit – (optional, default None) the maximum number of revisions to return, None for no limit
- skip – (optional, default None) the number of revisions to skip. Ex: skip=2 returns every other revision, None for no skipping.
- num_datapoints – (optional, default=None) if limit and skip are None and this is not, then num_datapoints evenly spaced revs will be used
- committer – (optional, default=True) true if committer should be reported, false if author
- ignore_globs – (optional, default=None) a list of globs to ignore, the default of None excludes nothing
- include_globs – (optional, default=None) a list of globs to include, the default of None includes everything
- workers – (optional, default=1) integer, the number of workers to use in the threadpool, -1 for one per core
Returns: DataFrame
punchcard(branch='master', limit=None, days=None, by=None, normalize=None, ignore_globs=None, include_globs=None)
Returns a pandas DataFrame containing all of the data for a punchcard. The DataFrame will have the columns:
- day_of_week
- hour_of_day
- author / committer
- lines
- insertions
- deletions
- net
Parameters:
- branch – the branch to return commits for
- limit – (optional, default=None) a maximum number of commits to return, None for no limit
- days – (optional, default=None) number of days to return, if limit is None
- by – (optional, default=None) aggregation option: None for no aggregation (just a high-level punchcard), or ‘committer’ or ‘author’
- normalize – (optional, default=None) if an integer, returns the data normalized so that the maximum value equals that integer (for plotting)
- ignore_globs – (optional, default=None) a list of globs to ignore, the default of None excludes nothing
- include_globs – (optional, default=None) a list of globs to include, the default of None includes everything
Returns: DataFrame
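A punchcard buckets activity by day of week and hour of day. The following plain-Python sketch shows that aggregation and the effect of normalize; the function name, the (datetime, lines) input shape, and the Monday=0 day numbering are assumptions for illustration:

```python
from collections import Counter
from datetime import datetime


def punchcard_counts(commits, normalize=None):
    """Sum changed lines per (day_of_week, hour_of_day) bucket.

    commits: iterable of (datetime, lines_changed) tuples.
    normalize: if set, rescale so the largest bucket equals this value.
    """
    card = Counter()
    for when, lines in commits:
        # datetime.weekday() numbers Monday as 0 and Sunday as 6.
        card[(when.weekday(), when.hour)] += lines
    if normalize is not None and card:
        peak = max(card.values())
        if peak > 0:
            return {key: value * normalize / peak for key, value in card.items()}
    return dict(card)
```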
revs(branch='master', limit=None, skip=None, num_datapoints=None)
Returns a DataFrame of all revision tags and their timestamps. It will have the columns:
- date
- rev
Parameters:
- branch – (optional, default ‘master’) the branch to work in
- limit – (optional, default None) the maximum number of revisions to return, None for no limit
- skip – (optional, default None) the number of revisions to skip. Ex: skip=2 returns every other revision, None for no skipping.
- num_datapoints – (optional, default=None) if limit and skip are None and this is not, then num_datapoints evenly spaced revs will be used
Returns: DataFrame
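The interplay of limit, skip, and num_datapoints can be sketched over a plain list of revisions. This is one plausible reading of the parameter descriptions, not the library's code; the function name and the integer-stride rounding are assumptions, so the result for num_datapoints is only approximately that many revisions:

```python
def pick_revs(revs, limit=None, skip=None, num_datapoints=None):
    """Thin out a list of revisions according to limit/skip/num_datapoints."""
    if limit is None and skip is None and num_datapoints:
        # Derive a stride that gives roughly num_datapoints evenly spaced revs.
        skip = max(1, len(revs) // num_datapoints)
    selected = list(revs)
    if limit is not None:
        selected = selected[:limit]
    if skip is not None and skip > 0:
        # skip=2 keeps every other revision, skip=3 every third, and so on.
        selected = selected[::skip]
    return selected
```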
tags()
Returns a DataFrame of all tags in origin. The DataFrame will have the columns:
- repository
- tag
Returns: DataFrame