In my last WordPress entry I wrote some scripts to back up the entire site (files, database and SSL certificates) on a daily basis. This has been running for about six weeks, so I now need to prune the backups down to a more reasonable number.
I decided that for completeness I wanted to:
- Keep the 10 most recent backup files
- Keep one backup per week for the past 10 weeks
- Keep one backup per month for the past year
This sounded like it would be a pretty simple script: some rules for retaining files, and everything else gets deleted. The backups were named in the format YYYYMMDD_domain.tar.gz to make this easier.
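One nice property of that naming, which any pruning script can lean on: with a YYYYMMDD prefix, plain lexical order is chronological order. A quick illustration (hypothetical file list):

# With a YYYYMMDD prefix, lexical sort == chronological sort
names = ["20250413_tangentuniverse.net.tar.gz",
         "20250406_tangentuniverse.net.tar.gz"]
print(sorted(names))  # oldest first, no date parsing needed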
I decided to give the task to Claude.ai to see what sort of job it could do. This was a well-defined problem with a limited number of permutations, so it seemed well suited to AI.
The prompt
write a scrip that will prune backup files that are dated in the form YYYYMMDD where YYYY is the year, MM is the month number and DD is the day. I would like to keep the last 10 backup files, and then one backup for the past 10 weeks. finally I would like to keep one backup for each month for the past year
It took about a minute thinking about it and then replied:
I'll write a script that prunes backup files according to your retention policy. This script will:
- Keep the 10 most recent backup files
- Keep one backup per week for the past 10 weeks
- Keep one backup per month for the past year
<script here - omitted for clarity>
Here's how it works:
- It looks for files with dates in the YYYYMMDD format in their names
- It implements your retention policy:
  - Keeps the 10 most recent backup files
  - Keeps one backup per week for the past 10 weeks
  - Keeps one backup per month for the past year
The script runs in "dry run" mode by default, showing which files would be deleted without actually removing anything. To actually delete the files, you'll need to run it with the --delete flag.
Usage:
# To see which files would be deleted (dry run):
python backup_pruner.py /path/to/backup/directory
# To actually delete the files:
python backup_pruner.py /path/to/backup/directory --delete
The script provides detailed output showing which files are being kept and why (recent backup, weekly backup, or monthly backup), along with a summary at the end.
Would you like me to explain any part of the script in more detail?
The script itself was:
#!/usr/bin/env python3

import os
import re
import sys
from datetime import datetime, timedelta
from collections import defaultdict

def parse_date_from_filename(filename):
    """Extract date from filename matching YYYYMMDD pattern."""
    match = re.search(r'(\d{8})', filename)
    if match:
        date_str = match.group(1)
        try:
            return datetime.strptime(date_str, '%Y%m%d')
        except ValueError:
            return None
    return None

def group_by_week(files_with_dates):
    """Group files by week."""
    week_groups = defaultdict(list)
    for filename, date in files_with_dates:
        # Use ISO calendar week as the key
        year, week_num, _ = date.isocalendar()
        week_key = f"{year}-W{week_num:02d}"
        week_groups[week_key].append((filename, date))
    return week_groups

def group_by_month(files_with_dates):
    """Group files by month."""
    month_groups = defaultdict(list)
    for filename, date in files_with_dates:
        month_key = f"{date.year}-{date.month:02d}"
        month_groups[month_key].append((filename, date))
    return month_groups

def prune_backups(backup_dir, dry_run=True):
    """
    Prune backup files according to the retention policy:
    - Keep the 10 most recent backups
    - Keep one backup per week for the past 10 weeks
    - Keep one backup per month for the past year
    """
    # Get all backup files with dates
    backup_files = []
    for filename in os.listdir(backup_dir):
        date = parse_date_from_filename(filename)
        if date:
            backup_files.append((filename, date))

    if not backup_files:
        print("No backup files found with YYYYMMDD date pattern.")
        return

    # Sort files by date (newest first)
    backup_files.sort(key=lambda x: x[1], reverse=True)

    # Calculate retention dates
    now = datetime.now()
    ten_weeks_ago = now - timedelta(weeks=10)
    one_year_ago = now - timedelta(days=365)

    # Files to keep
    keep_files = set()

    # 1. Keep 10 most recent backups
    for i, (filename, _) in enumerate(backup_files):
        if i < 10:
            keep_files.add(filename)
            print(f"Keeping recent backup: {filename}")
        else:
            break

    # 2. Keep one backup per week for the past 10 weeks
    weekly_backups = []
    for filename, date in backup_files:
        if ten_weeks_ago <= date <= now:
            weekly_backups.append((filename, date))

    weekly_groups = group_by_week(weekly_backups)
    for week, files in weekly_groups.items():
        # Skip if we already kept a file from this week in the recent backups
        if all(f[0] not in keep_files for f in files):
            # Keep the newest file from each week
            files.sort(key=lambda x: x[1], reverse=True)
            keep_files.add(files[0][0])
            print(f"Keeping weekly backup: {files[0][0]} (week {week})")

    # 3. Keep one backup per month for the past year
    monthly_backups = []
    for filename, date in backup_files:
        if one_year_ago <= date <= now:
            monthly_backups.append((filename, date))

    monthly_groups = group_by_month(monthly_backups)
    for month, files in monthly_groups.items():
        # Skip if we already kept a file from this month
        if all(f[0] not in keep_files for f in files):
            # Keep the newest file from each month
            files.sort(key=lambda x: x[1], reverse=True)
            keep_files.add(files[0][0])
            print(f"Keeping monthly backup: {files[0][0]} (month {month})")

    # Delete files not in keep_files
    deleted_count = 0
    for filename, _ in backup_files:
        if filename not in keep_files:
            file_path = os.path.join(backup_dir, filename)
            if dry_run:
                print(f"Would delete: {filename}")
            else:
                try:
                    os.remove(file_path)
                    print(f"Deleted: {filename}")
                    deleted_count += 1
                except Exception as e:
                    print(f"Failed to delete {filename}: {e}")

    print("\nSummary:")
    print(f"Total backup files: {len(backup_files)}")
    print(f"Files to keep: {len(keep_files)}")
    if dry_run:
        print(f"Files that would be deleted: {len(backup_files) - len(keep_files)}")
    else:
        print(f"Files deleted: {deleted_count}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python backup_pruner.py <backup_directory> [--delete]")
        sys.exit(1)

    backup_dir = sys.argv[1]
    dry_run = True
    if len(sys.argv) > 2 and sys.argv[2] == "--delete":
        dry_run = False

    if not os.path.isdir(backup_dir):
        print(f"Error: {backup_dir} is not a valid directory")
        sys.exit(1)

    print(f"Analyzing backup files in: {backup_dir}")
    if dry_run:
        print("Running in DRY RUN mode. No files will be deleted.")
        print("To actually delete files, run with the --delete flag.")
    else:
        print("WARNING: Running in DELETE mode. Files will be permanently deleted.")
    print("-" * 60)

    prune_backups(backup_dir, dry_run)
The code seemed well structured and intuitive. It took a few minutes to work out what was going on, and in general I was impressed.
The code parsed the filenames to extract the date and then filtered them according to the rules I gave. The ten most recent files went straight into the keep_files set. For the weeks it took a two-pass approach: in pass one it worked out the ISO week number from each date and grouped the filenames by week; in pass two it took the most recent filename from each group within the past ten weeks and added it to the keep_files set. It took a similar approach for the monthly filter.
Finally, the code deleted every file that matched the filename format, bar those in the keep_files set.
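Boiled down, the weekly filter works like this (a condensed sketch of my reading of the generated code, with made-up filenames):

from collections import defaultdict
from datetime import datetime

files = [("20250413_example.tar.gz", datetime(2025, 4, 13)),
         ("20250412_example.tar.gz", datetime(2025, 4, 12))]
keep = set()

# Pass one: bucket the filenames by ISO calendar week
weeks = defaultdict(list)
for name, date in files:
    year, week_num, _ = date.isocalendar()
    weeks[(year, week_num)].append((name, date))

# Pass two: keep only the newest file in each week's bucket
for bucket in weeks.values():
    keep.add(max(bucket, key=lambda x: x[1])[0])

print(keep)  # {'20250413_example.tar.gz'}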
Would I have written it that way? Probably not, but that doesn't make it a bad script. I wouldn't have used a regex for parsing the filename, as it seems overkill, though that is personal preference. I would have used a keep-list approach too, but my filtering would probably have been different, though not by much.
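For what it's worth, here is a minimal sketch of the slicing approach I had in mind; parse_backup_date is my name, not something from the generated script, and it assumes the date always sits at the start of the filename:

from datetime import datetime

def parse_backup_date(filename):
    """Slice the leading YYYYMMDD off the name instead of using a regex."""
    try:
        return datetime.strptime(filename[:8], '%Y%m%d')
    except ValueError:
        return None  # name doesn't start with a valid date

print(parse_backup_date("20250429_tangentuniverse.net.tar.gz"))  # 2025-04-29 00:00:00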
Testing
Of course I had to test the code before I could run it on my backups, so I asked Claude to write a script that would generate some test data. It took a couple of attempts; the first time it generated the filenames using an incorrect format, but the second attempt was solid.
write a script to generate empty files for each day in the past year following the naming convention given
The code:
#!/usr/bin/env python3

import os
import sys
from datetime import datetime, timedelta

def generate_test_backup_files(output_dir, suffix=".tar.gz"):
    """
    Generate empty test backup files for each day in the past year.
    Files will be named using the pattern: YYYYMMDD + suffix
    """
    today = datetime.now()

    # Create output directory if it doesn't exist
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        print(f"Created directory: {output_dir}")

    # Generate a file for each day in the past 365 days
    count = 0
    for days_ago in range(365, -1, -1):  # From 365 days ago to today
        target_date = today - timedelta(days=days_ago)
        date_str = target_date.strftime('%Y%m%d')
        filename = f"{date_str}{suffix}"
        file_path = os.path.join(output_dir, filename)

        # Create an empty file
        with open(file_path, 'w') as f:
            pass  # Just create an empty file

        count += 1

        # Print progress every 30 days
        if count % 30 == 0 or days_ago == 0:
            print(f"Generated {count} files... Latest: {filename}")

    print(f"\nFinished generating {count} test backup files in {output_dir}")
    print(f"Files follow the naming pattern: YYYYMMDD{suffix}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python generate_test_backups.py <output_directory> [suffix]")
        print("Default suffix: '.tar.gz'")
        sys.exit(1)

    output_dir = sys.argv[1]

    # Optional argument
    suffix = ".tar.gz"
    if len(sys.argv) > 2:
        suffix = sys.argv[2]

    print(f"Generating test backup files in: {output_dir}")
    print(f"Using naming pattern: YYYYMMDD{suffix}")
    print("-" * 60)

    generate_test_backup_files(output_dir, suffix)
Again, a solid result. Running the script generated 365 files which I could then run backup_pruner.py against. I manually edited the script so that it generated 400 files, as I wanted to see it delete backups over a year old.
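For reference, the change was just the loop bound in the generator; a standalone sketch of the edited loop:

from datetime import datetime, timedelta

today = datetime.now()
# was range(365, -1, -1); 400 puts the oldest files outside the one-year window
for days_ago in range(400, -1, -1):
    date_str = (today - timedelta(days=days_ago)).strftime('%Y%m%d')
    # ...create the empty file exactly as in the original script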
Running on Live
I edited the script to remove the ‘dry_run’ parameter (I could equally have just run it with the --delete flag) and ran it on the live folder:
Analyzing backup files in: .
------------------------------------------------------------
Keeping recent backup: 20250429_tangentuniverse.net.tar.gz
Keeping recent backup: 20250428_tangentuniverse.net.tar.gz
Keeping recent backup: 20250427_tangentuniverse.net.tar.gz
Keeping recent backup: 20250426_tangentuniverse.net.tar.gz
Keeping recent backup: 20250425_tangentuniverse.net.tar.gz
Keeping recent backup: 20250424_tangentuniverse.net.tar.gz
Keeping recent backup: 20250423_tangentuniverse.net.tar.gz
Keeping recent backup: 20250422_tangentuniverse.net.tar.gz
Keeping recent backup: 20250421_tangentuniverse.net.tar.gz
Keeping recent backup: 20250420_tangentuniverse.net.tar.gz
Keeping weekly backup: 20250413_tangentuniverse.net.tar.gz (week 2025-W15)
Keeping weekly backup: 20250406_tangentuniverse.net.tar.gz (week 2025-W14)
Keeping weekly backup: 20250330_tangentuniverse.net.tar.gz (week 2025-W13)
Keeping weekly backup: 20250323_tangentuniverse.net.tar.gz (week 2025-W12)
Keeping weekly backup: 20250316_tangentuniverse.net.tar.gz (week 2025-W11)
Deleted: 20250419_tangentuniverse.net.tar.gz
Deleted: 20250418_tangentuniverse.net.tar.gz
Deleted: 20250417_tangentuniverse.net.tar.gz
Deleted: 20250416_tangentuniverse.net.tar.gz
Deleted: 20250415_tangentuniverse.net.tar.gz
Deleted: 20250414_tangentuniverse.net.tar.gz
Deleted: 20250412_tangentuniverse.net.tar.gz
Deleted: 20250411_tangentuniverse.net.tar.gz
Deleted: 20250410_tangentuniverse.net.tar.gz
Deleted: 20250409_tangentuniverse.net.tar.gz
Deleted: 20250408_tangentuniverse.net.tar.gz
Deleted: 20250407_tangentuniverse.net.tar.gz
Deleted: 20250405_tangentuniverse.net.tar.gz
Deleted: 20250404_tangentuniverse.net.tar.gz
Deleted: 20250403_tangentuniverse.net.tar.gz
Deleted: 20250402_tangentuniverse.net.tar.gz
Deleted: 20250401_tangentuniverse.net.tar.gz
Deleted: 20250331_tangentuniverse.net.tar.gz
Deleted: 20250329_tangentuniverse.net.tar.gz
Deleted: 20250328_tangentuniverse.net.tar.gz
Deleted: 20250327_tangentuniverse.net.tar.gz
Deleted: 20250326_tangentuniverse.net.tar.gz
Deleted: 20250325_tangentuniverse.net.tar.gz
Deleted: 20250324_tangentuniverse.net.tar.gz
Deleted: 20250322_tangentuniverse.net.tar.gz
Deleted: 20250321_tangentuniverse.net.tar.gz
Deleted: 20250320_tangentuniverse.net.tar.gz
Deleted: 20250319_tangentuniverse.net.tar.gz
Deleted: 20250318_tangentuniverse.net.tar.gz
Deleted: 20250317_tangentuniverse.net.tar.gz
Deleted: 20250315_tangentuniverse.net.tar.gz
Deleted: 20250314_tangentuniverse.net.tar.gz
Summary:
Total backup files: 47
Files to keep: 15
Files deleted: 32
Running the script for a second time:
Analyzing backup files in: .
------------------------------------------------------------
Keeping recent backup: 20250429_tangentuniverse.net.tar.gz
Keeping recent backup: 20250428_tangentuniverse.net.tar.gz
Keeping recent backup: 20250427_tangentuniverse.net.tar.gz
Keeping recent backup: 20250426_tangentuniverse.net.tar.gz
Keeping recent backup: 20250425_tangentuniverse.net.tar.gz
Keeping recent backup: 20250424_tangentuniverse.net.tar.gz
Keeping recent backup: 20250423_tangentuniverse.net.tar.gz
Keeping recent backup: 20250422_tangentuniverse.net.tar.gz
Keeping recent backup: 20250421_tangentuniverse.net.tar.gz
Keeping recent backup: 20250420_tangentuniverse.net.tar.gz
Keeping weekly backup: 20250413_tangentuniverse.net.tar.gz (week 2025-W15)
Keeping weekly backup: 20250406_tangentuniverse.net.tar.gz (week 2025-W14)
Keeping weekly backup: 20250330_tangentuniverse.net.tar.gz (week 2025-W13)
Keeping weekly backup: 20250323_tangentuniverse.net.tar.gz (week 2025-W12)
Keeping weekly backup: 20250316_tangentuniverse.net.tar.gz (week 2025-W11)
Summary:
Total backup files: 15
Files to keep: 15
Files deleted: 0
Adding a cron job
Editing the cron jobs, I added an entry to run the pruning script every morning at 6am. The backups themselves are scheduled for 3am, so they will be long finished by then.
0 6 * * * /home/john/tu_backup/backup_pruner.sh /home/john/tu_backup
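If I later want a record of what gets pruned, the standard cron trick of appending the job's output to a log would do; prune.log here is just a name I picked:

0 6 * * * /home/john/tu_backup/backup_pruner.sh /home/john/tu_backup >> /home/john/tu_backup/prune.log 2>&1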