WordPress – tidying up the backups

In my last WordPress entry I wrote some scripts to back up the site files, database and SSL certificates on a daily basis. These have been running for about six weeks, so it is now time to prune the backups down to a more reasonable number.

I decided that for completeness I wanted to:

  • Keep the 10 most recent backup files
  • Keep one backup per week for the past 10 weeks
  • Keep one backup per month for the past year

This sounded like it would be a pretty simple script: some rules for retaining files, and delete the rest. The backups were named in the format YYYYMMDD_domain.tar.gz (e.g. 20250413_tangentuniverse.net.tar.gz) to make this easier.

I decided to give the task to Claude.ai to see what sort of job it could do. This was a well-defined problem with a limited number of permutations, so it seemed well suited to AI.

The prompt

write a scrip that will prune backup files that are dated in the form YYYYMMDD where YYYY is the year, MM is the month number and DD is the day. I would like to keep the last 10 backup files, and then one backup for the past 10 weeks. finally I would like to keep one backup for each month for the past year

It spent about a minute thinking about it and then replied:

I'll write a script that prunes backup files according to your retention policy. This script will:

  • Keep the 10 most recent backup files
  • Keep one backup per week for the past 10 weeks
  • Keep one backup per month for the past year

<script here - omitted for clarity>

Here's how it works:

  • It looks for files with dates in the YYYYMMDD format in their names
  • It implements your retention policy:
      • Keeps the 10 most recent backup files
      • Keeps one backup per week for the past 10 weeks
      • Keeps one backup per month for the past year

The script runs in "dry run" mode by default, showing which files would be deleted without actually removing anything. To actually delete the files, you'll need to run it with the --delete flag.

Usage:

# To see which files would be deleted (dry run):
python backup_pruner.py /path/to/backup/directory

# To actually delete the files:
python backup_pruner.py /path/to/backup/directory --delete

The script provides detailed output showing which files are being kept and why (recent backup, weekly backup, or monthly backup), along with a summary at the end.

Would you like me to explain any part of the script in more detail?

The script itself was:

#!/usr/bin/env python3
import os
import re
import sys
from datetime import datetime, timedelta
from collections import defaultdict

def parse_date_from_filename(filename):
    """Extract date from filename matching YYYYMMDD pattern."""
    match = re.search(r'(\d{8})', filename)
    if match:
        date_str = match.group(1)
        try:
            return datetime.strptime(date_str, '%Y%m%d')
        except ValueError:
            return None
    return None

def group_by_week(files_with_dates):
    """Group files by week."""
    week_groups = defaultdict(list)
    for filename, date in files_with_dates:
        # Use ISO calendar week as the key
        year, week_num, _ = date.isocalendar()
        week_key = f"{year}-W{week_num:02d}"
        week_groups[week_key].append((filename, date))
    return week_groups

def group_by_month(files_with_dates):
    """Group files by month."""
    month_groups = defaultdict(list)
    for filename, date in files_with_dates:
        month_key = f"{date.year}-{date.month:02d}"
        month_groups[month_key].append((filename, date))
    return month_groups

def prune_backups(backup_dir, dry_run=True):
    """
    Prune backup files according to the retention policy:
    - Keep the 10 most recent backups
    - Keep one backup per week for the past 10 weeks
    - Keep one backup per month for the past year
    """
    # Get all backup files with dates
    backup_files = []
    for filename in os.listdir(backup_dir):
        date = parse_date_from_filename(filename)
        if date:
            backup_files.append((filename, date))
    
    if not backup_files:
        print("No backup files found with YYYYMMDD date pattern.")
        return
    
    # Sort files by date (newest first)
    backup_files.sort(key=lambda x: x[1], reverse=True)
    
    # Calculate retention dates
    now = datetime.now()
    ten_weeks_ago = now - timedelta(weeks=10)
    one_year_ago = now - timedelta(days=365)
    
    # Files to keep
    keep_files = set()
    
    # 1. Keep 10 most recent backups
    for i, (filename, _) in enumerate(backup_files):
        if i < 10:
            keep_files.add(filename)
            print(f"Keeping recent backup: {filename}")
        else:
            break
    
    # 2. Keep one backup per week for the past 10 weeks
    weekly_backups = []
    for filename, date in backup_files:
        if ten_weeks_ago <= date <= now:
            weekly_backups.append((filename, date))
    
    weekly_groups = group_by_week(weekly_backups)
    for week, files in weekly_groups.items():
        # Skip if we already kept a file from this week in the recent backups
        if all(f[0] not in keep_files for f in files):
            # Keep the newest file from each week
            files.sort(key=lambda x: x[1], reverse=True)
            keep_files.add(files[0][0])
            print(f"Keeping weekly backup: {files[0][0]} (week {week})")
    
    # 3. Keep one backup per month for the past year
    monthly_backups = []
    for filename, date in backup_files:
        if one_year_ago <= date <= now:
            monthly_backups.append((filename, date))
    
    monthly_groups = group_by_month(monthly_backups)
    for month, files in monthly_groups.items():
        # Skip if we already kept a file from this month
        if all(f[0] not in keep_files for f in files):
            # Keep the newest file from each month
            files.sort(key=lambda x: x[1], reverse=True)
            keep_files.add(files[0][0])
            print(f"Keeping monthly backup: {files[0][0]} (month {month})")
    
    # Delete files not in keep_files
    deleted_count = 0
    for filename, _ in backup_files:
        if filename not in keep_files:
            file_path = os.path.join(backup_dir, filename)
            if dry_run:
                print(f"Would delete: {filename}")
            else:
                try:
                    os.remove(file_path)
                    print(f"Deleted: {filename}")
                    deleted_count += 1
                except Exception as e:
                    print(f"Failed to delete {filename}: {e}")
    
    print("\nSummary:")
    print(f"Total backup files: {len(backup_files)}")
    print(f"Files to keep: {len(keep_files)}")
    if dry_run:
        print(f"Files that would be deleted: {len(backup_files) - len(keep_files)}")
    else:
        print(f"Files deleted: {deleted_count}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python backup_pruner.py <backup_directory> [--delete]")
        sys.exit(1)
    
    backup_dir = sys.argv[1]
    dry_run = True
    
    if len(sys.argv) > 2 and sys.argv[2] == "--delete":
        dry_run = False
    
    if not os.path.isdir(backup_dir):
        print(f"Error: {backup_dir} is not a valid directory")
        sys.exit(1)
    
    print(f"Analyzing backup files in: {backup_dir}")
    if dry_run:
        print("Running in DRY RUN mode. No files will be deleted.")
        print("To actually delete files, run with the --delete flag.")
    else:
        print("WARNING: Running in DELETE mode. Files will be permanently deleted.")
    print("-" * 60)
    
    prune_backups(backup_dir, dry_run)

The code seemed well structured and intuitive. It took a few minutes to work out what was going on, and in general I was impressed.

The code parsed the filenames to get the date and then filtered them according to the rules I gave. The ten most recent files were added to the keep_files set. For the weeks it took a two-pass approach: in pass one it worked out the week number from each date and grouped the filenames by week number; in pass two it took the most recent filename from each group in the 10-week window and added that to the keep_files set. It took a similar approach for the monthly filter.
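The week keys come from Python's datetime.isocalendar(), which returns the ISO year, week number and weekday. For example:

from datetime import datetime

# 13 April 2025 is a Sunday, the last day of ISO week 15
year, week, weekday = datetime(2025, 4, 13).isocalendar()
print(f"{year}-W{week:02d}")  # 2025-W15, matching the week keys in the output below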

Finally, the code deleted every file that matched the filename format, bar those in the keep_files set.

Would I have written it that way? Probably not – but that doesn’t make it a bad script. I wouldn’t have used a regex for parsing the filename as it seems overkill, but that is personal preference. I would have used a keep-list approach too, though my filtering would probably have been different – but not by much.
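For what it’s worth, the non-regex parsing I had in mind is just string slicing – a sketch of my preference, not Claude’s code:

from datetime import datetime

def parse_date_from_filename(filename):
    """The backup names start with YYYYMMDD, so slice off the
    first eight characters and parse them as a date."""
    try:
        return datetime.strptime(filename[:8], '%Y%m%d')
    except ValueError:
        return None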

Testing

Of course I had to test the code before I could run it on my backups, so I asked Claude to write a script to generate some test data. It took a couple of attempts – the first time it generated the filenames using an incorrect format – but the second attempt was solid.

write a script to generate empty files for each day in the past year following the naming convention given

The code:

#!/usr/bin/env python3
import os
import sys
from datetime import datetime, timedelta

def generate_test_backup_files(output_dir, suffix=".tar.gz"):
    """
    Generate empty test backup files for each day in the past year.
    Files will be named using the pattern: YYYYMMDD + suffix
    """
    today = datetime.now()
    
    # Create output directory if it doesn't exist
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        print(f"Created directory: {output_dir}")
    
    # Generate a file for each day in the past 365 days
    count = 0
    for days_ago in range(365, -1, -1):  # From 365 days ago to today
        target_date = today - timedelta(days=days_ago)
        date_str = target_date.strftime('%Y%m%d')
        filename = f"{date_str}{suffix}"
        file_path = os.path.join(output_dir, filename)
        
        # Create an empty file
        with open(file_path, 'w') as f:
            pass  # Just create an empty file
        
        count += 1
        
        # Print progress every 30 days
        if count % 30 == 0 or days_ago == 0:
            print(f"Generated {count} files... Latest: {filename}")
    
    print(f"\nFinished generating {count} test backup files in {output_dir}")
    print(f"Files follow the naming pattern: YYYYMMDD{suffix}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python generate_test_backups.py <output_directory> [suffix]")
        print("Default suffix: '.tar.gz'")
        sys.exit(1)
    
    output_dir = sys.argv[1]
    
    # Optional argument
    suffix = ".tar.gz"
    
    if len(sys.argv) > 2:
        suffix = sys.argv[2]
    
    print(f"Generating test backup files in: {output_dir}")
    print(f"Using naming pattern: YYYYMMDD{suffix}")
    print("-" * 60)
    
    generate_test_backup_files(output_dir, suffix)

Again a solid result. Running the script generated a file for each day of the past year, which I could then run backup_pruner.py against. I manually edited the script so that it generated 400 days’ worth of files, as I wanted to see the pruner delete backups over a year old.
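The edit was just the loop range – changing range(365, -1, -1) to range(400, -1, -1). A tidier version would take the number of days as a parameter; a minimal sketch of that approach (the days argument is my addition, not Claude’s):

#!/usr/bin/env python3
import os
from datetime import datetime, timedelta

def generate_test_backup_files(output_dir, suffix=".tar.gz", days=400):
    """Create one empty YYYYMMDD-named file per day, going back `days` days."""
    os.makedirs(output_dir, exist_ok=True)
    today = datetime.now()
    for days_ago in range(days, -1, -1):
        date_str = (today - timedelta(days=days_ago)).strftime('%Y%m%d')
        # An empty file is enough to exercise the pruner
        open(os.path.join(output_dir, f"{date_str}{suffix}"), 'w').close()

generate_test_backup_files("test_backups")  # hypothetical test directory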

Running on Live

Before running it on the live folder, I edited the script to remove the ‘dry_run’ default.
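The edit amounted to something like this (my change, so a sketch rather than Claude’s code):

backup_dir = sys.argv[1]
dry_run = False  # was True; deletes without needing the --delete flag

Running it on the live folder gave: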

Analyzing backup files in: .
------------------------------------------------------------
Keeping recent backup: 20250429_tangentuniverse.net.tar.gz
Keeping recent backup: 20250428_tangentuniverse.net.tar.gz
Keeping recent backup: 20250427_tangentuniverse.net.tar.gz
Keeping recent backup: 20250426_tangentuniverse.net.tar.gz
Keeping recent backup: 20250425_tangentuniverse.net.tar.gz
Keeping recent backup: 20250424_tangentuniverse.net.tar.gz
Keeping recent backup: 20250423_tangentuniverse.net.tar.gz
Keeping recent backup: 20250422_tangentuniverse.net.tar.gz
Keeping recent backup: 20250421_tangentuniverse.net.tar.gz
Keeping recent backup: 20250420_tangentuniverse.net.tar.gz
Keeping weekly backup: 20250413_tangentuniverse.net.tar.gz (week 2025-W15)
Keeping weekly backup: 20250406_tangentuniverse.net.tar.gz (week 2025-W14)
Keeping weekly backup: 20250330_tangentuniverse.net.tar.gz (week 2025-W13)
Keeping weekly backup: 20250323_tangentuniverse.net.tar.gz (week 2025-W12)
Keeping weekly backup: 20250316_tangentuniverse.net.tar.gz (week 2025-W11)
Deleted: 20250419_tangentuniverse.net.tar.gz
Deleted: 20250418_tangentuniverse.net.tar.gz
Deleted: 20250417_tangentuniverse.net.tar.gz
Deleted: 20250416_tangentuniverse.net.tar.gz
Deleted: 20250415_tangentuniverse.net.tar.gz
Deleted: 20250414_tangentuniverse.net.tar.gz
Deleted: 20250412_tangentuniverse.net.tar.gz
Deleted: 20250411_tangentuniverse.net.tar.gz
Deleted: 20250410_tangentuniverse.net.tar.gz
Deleted: 20250409_tangentuniverse.net.tar.gz
Deleted: 20250408_tangentuniverse.net.tar.gz
Deleted: 20250407_tangentuniverse.net.tar.gz
Deleted: 20250405_tangentuniverse.net.tar.gz
Deleted: 20250404_tangentuniverse.net.tar.gz
Deleted: 20250403_tangentuniverse.net.tar.gz
Deleted: 20250402_tangentuniverse.net.tar.gz
Deleted: 20250401_tangentuniverse.net.tar.gz
Deleted: 20250331_tangentuniverse.net.tar.gz
Deleted: 20250329_tangentuniverse.net.tar.gz
Deleted: 20250328_tangentuniverse.net.tar.gz
Deleted: 20250327_tangentuniverse.net.tar.gz
Deleted: 20250326_tangentuniverse.net.tar.gz
Deleted: 20250325_tangentuniverse.net.tar.gz
Deleted: 20250324_tangentuniverse.net.tar.gz
Deleted: 20250322_tangentuniverse.net.tar.gz
Deleted: 20250321_tangentuniverse.net.tar.gz
Deleted: 20250320_tangentuniverse.net.tar.gz
Deleted: 20250319_tangentuniverse.net.tar.gz
Deleted: 20250318_tangentuniverse.net.tar.gz
Deleted: 20250317_tangentuniverse.net.tar.gz
Deleted: 20250315_tangentuniverse.net.tar.gz
Deleted: 20250314_tangentuniverse.net.tar.gz

Summary:
Total backup files: 47
Files to keep: 15
Files deleted: 32

Running the script a second time showed there was nothing more to delete:

Analyzing backup files in: .
------------------------------------------------------------
Keeping recent backup: 20250429_tangentuniverse.net.tar.gz
Keeping recent backup: 20250428_tangentuniverse.net.tar.gz
Keeping recent backup: 20250427_tangentuniverse.net.tar.gz
Keeping recent backup: 20250426_tangentuniverse.net.tar.gz
Keeping recent backup: 20250425_tangentuniverse.net.tar.gz
Keeping recent backup: 20250424_tangentuniverse.net.tar.gz
Keeping recent backup: 20250423_tangentuniverse.net.tar.gz
Keeping recent backup: 20250422_tangentuniverse.net.tar.gz
Keeping recent backup: 20250421_tangentuniverse.net.tar.gz
Keeping recent backup: 20250420_tangentuniverse.net.tar.gz
Keeping weekly backup: 20250413_tangentuniverse.net.tar.gz (week 2025-W15)
Keeping weekly backup: 20250406_tangentuniverse.net.tar.gz (week 2025-W14)
Keeping weekly backup: 20250330_tangentuniverse.net.tar.gz (week 2025-W13)
Keeping weekly backup: 20250323_tangentuniverse.net.tar.gz (week 2025-W12)
Keeping weekly backup: 20250316_tangentuniverse.net.tar.gz (week 2025-W11)

Summary:
Total backup files: 15
Files to keep: 15
Files deleted: 0

Adding a cron job

Editing the cron jobs, I added an entry to run the pruning script every morning at 6am. The backups are scheduled to run at 3am so they will be long finished by then. As the script now deletes by default, there is no need for the --delete flag.

0 6 * * * /home/john/tu_backup/backup_pruner.py /home/john/tu_backup