Background:

The company is working on a project that is, roughly, a face recognition system for passing through access gates. When people who need to pass through access control register, the system stores an original picture in the data folder on the server. This includes permanent storage as well as temporary storage for visitor registration. On Friday, df -h showed that the root directory was already 98% used. The partition mounted at root is 50G, and df reported 3.8G still available. At only about 200K per face, the daily flow of people should not, by my calculation, produce much data.
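As a rough back-of-the-envelope check of that estimate, the sketch below works out how many 200K images fit in the 3.8G that df reported as free. The variable names are illustrative, and it assumes G and K are binary units (GiB and KiB), which the post does not state.

```shell
# Assumption: 3.8G and 200K are binary units (GiB / KiB).
free_kb=$((38 * 1024 * 1024 / 10))   # 3.8 GiB expressed in KiB (integer math)
per_image_kb=200                     # approximate size of one face image

images_until_full=$((free_kb / per_image_kb))
echo "$images_until_full"
```

That is roughly twenty thousand registrations before the disk fills, which is why normal traffic alone did not look like a threat.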

On Monday a phone call came in: the system was not working properly. My first thought was that the storage might be full. I logged in and, sure enough, the root partition was full. I considered two options:

  • Expand the data directory, since the partition mounted at / is an LVM volume
  • Clean up some large files in the current file system

Plan A: find the large files under / (no screenshots were taken at the time)

Use the find command to locate large log files under /.

1. First, find log files larger than 500M.

find / -type f -name "*.log" -size +500M
# This found the log file; ls -l shows it is 3.8G
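When several files match, it can help to see them ordered by size. A minimal sketch, assuming GNU find (for -printf); the function name largest_logs and the default limit of 10 are my own choices, not part of the original troubleshooting session.

```shell
# List *.log files under a directory, largest first.
# Assumes GNU find, whose -printf prints size in bytes (%s) and path (%p).
largest_logs() {
    search_dir="$1"
    limit="${2:-10}"
    find "$search_dir" -type f -name "*.log" -printf '%s\t%p\n' 2>/dev/null \
        | sort -nr \
        | head -n "$limit"
}

# Example: largest_logs / 5
```

Each output line is the size in bytes, a tab, and the path, so the biggest offender is always on the first line.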

2. Copy the file to another directory

This guards against accidental deletion: if the file is deleted by mistake, it can be restored from the copy. -a preserves the original attributes, so that a restored file does not lose them and cause program errors.

# A separate partition is mounted at /home
cp -a /wls/appsystems/nginx-1.14.0/install/logs/access.log /home/

3. Clear the log file
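The original command for this step was not recorded. One common approach, shown here as a sketch rather than what was actually run, is to truncate the file in place instead of deleting it: a process such as nginx that still holds the log open keeps writing to the same file, and the space is released right away. The temporary file below is a stand-in for the real access.log.

```shell
# Stand-in for the real log file (hypothetical demo data).
log_file=$(mktemp)
echo "old log line" > "$log_file"

# Truncate to zero bytes; the file itself (and its inode) stays in place.
: > "$log_file"

# Show the size in bytes after truncation.
wc -c < "$log_file"
```

Deleting the file with rm, by contrast, would not free the space until the process holding it open closes or reopens its log.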


This freed up some space at the time, but by noon the partition was almost full again. I was still inclined to expand the partition, but on reflection, data should not grow this fast with the usual number of users. So I dropped the expansion idea and started looking into why the data was growing so quickly.

Plan B: find the cause of the abnormal data growth and treat it at the source

1. Use the du command to find the larger directories

Start at / and check how much space each directory under it occupies. At that time this directory occupied 47G, and that was after the cleanup.

du -sh ./*
-s      summarize: show only a total for each argument
-h      human-readable: show sizes in units such as G or M
-m      show sizes in units of M
        with only -s, the default unit is K
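To avoid scanning the output by eye, the per-directory totals can be sorted. A small sketch, assuming GNU coreutils; biggest_dirs is a hypothetical helper name. It uses -k instead of -h so that the sizes sort numerically.

```shell
# Print the largest entries directly under a directory, biggest first.
# Sizes are in 1K blocks (-k) so that sort -n orders them correctly.
biggest_dirs() {
    parent="$1"
    limit="${2:-5}"
    du -sk "$parent"/* 2>/dev/null | sort -nr | head -n "$limit"
}

# Example: biggest_dirs / 5
```

Running it again on the top result walks down the tree one level at a time.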

Then work down level by level to find the directory generating the most data. It turned out to be /wls/files/FTR/faceFile/image/, the directory where the face data is stored. How can we watch it change?

2. Watch how the data directory changes per unit of time

du shows the size of a directory; combined with the watch command, it shows how the size changes over a given interval.

watch -n 1 -d "du -sm ./*"
-n      update interval, in seconds
-d      highlight the differences between updates

This shows the changes in the image directory. We looked directly at how much it changed over five minutes: by our reckoning it grew by about 100M in 5 minutes. That confirmed there had to be something wrong with the system.
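The same measurement can be taken non-interactively by sampling du twice and subtracting. This is a sketch under assumptions: growth_kb is a hypothetical helper, and the 300-second interval in the usage line simply mirrors the five minutes mentioned above.

```shell
# Report how many KiB a directory grew over an interval (in seconds).
growth_kb() {
    target="$1"
    interval="$2"
    before=$(du -sk "$target" | cut -f1)
    sleep "$interval"
    after=$(du -sk "$target" | cut -f1)
    echo $((after - before))
}

# Example: growth_kb /wls/files/FTR/faceFile/image/ 300
```

At roughly 100M per 5 minutes, the directory grows by about 28G per day (100M x 12 x 24 = 28,800M), which explains why the space freed in Plan A was gone by noon.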

3. Look at what was generated in those five minutes to determine the cause

Use the find command to export the data generated in the last 5 minutes and see what it is.

# Run this in the directory where the data is being generated.
find ./ -type f -mmin -5 -print0 | xargs -0 cp -t /opt/save/
# Archive the save directory from its parent directory, /opt
cd /opt/
tar -czvf export_save.tar.gz ./save
sz export_save.tar.gz

Download the archive to the Windows desktop and unpack it to take a look.

The cause became clear: a person can only register once, yet the same picture appeared 4 times, and the earlier watch command showed that this data was still being generated. In short, there had to be a bug in the program, so I called the developers to solve the problem at its root.
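If the duplicated pictures are byte-for-byte identical (an assumption; the post only says the same picture appeared 4 times), the duplication can also be confirmed on the server by grouping files by checksum. duplicate_counts is a hypothetical helper, and md5sum is used only as a convenient content fingerprint.

```shell
# Print "<count> <md5>" for every checksum shared by more than one file.
duplicate_counts() {
    find "$1" -type f -exec md5sum {} + \
        | awk '{ seen[$1]++ } END { for (h in seen) if (seen[h] > 1) print seen[h], h }'
}

# Example: duplicate_counts /opt/save/
```

A line beginning with 4 would correspond to one picture stored four times.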

Finally: use a shell script to clean up the dirty data

On the Linux server, find the files that start with this prefix.

# There is a lot of data, so looking at the head and tail of the results is enough
find ./ -type f -name "20000009_6ae05_*" |xargs ls -lh |head
find ./ -type f -name "20000009_6ae05_*" |xargs ls -lh |tail

Combining the two results above: duplicate data is generated every five minutes, the duplicate images share the same prefix, and the fault began after 2021-1-9 4:00. So two kinds of data have been generated since then: duplicate images produced by the program, and pictures produced by normal use. The logic is as follows.

  1. For a given file-name prefix, data from before 2021-1-9 4:00 is normal. Duplicates created later are dirty.
  2. Find the duplicate photos created after 2021-1-9 4:00.
  3. Check whether the photo already existed before 2021-1-9 4:00. If it did, delete the copies created after 2021-1-9 4:00.
  4. If it did not, do nothing.
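The decision logic above can be sketched as a dry-run helper that only lists the files it would treat as dirty, without moving anything. It assumes GNU find (for -newermt); list_dirty is a hypothetical name, and the cutoff is passed in rather than hard-coded.

```shell
# For one file-name prefix, list the files that count as dirty:
# those modified after the cutoff, but only if the prefix also has
# at least one file from before the cutoff.
list_dirty() {
    image_dir="$1"
    cutoff="$2"
    prefix="$3"
    old_count=$(find "$image_dir" -type f -name "${prefix}*" ! -newermt "$cutoff" | wc -l)
    if [ "$old_count" -ne 0 ]; then
        find "$image_dir" -type f -name "${prefix}*" -newermt "$cutoff"
    fi
}

# Example: list_dirty /wls/files/FTR/faceFile/image/ "2021-1-9 4:00" 20000009_6ae05_
```

Reviewing this listing for a few prefixes before running any mv or rm is a cheap way to check that the cutoff and prefix matching behave as intended.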

Note: after 2021-1-9 there may still be new users registering every day, but their number should not be large, because we dealt with the problem promptly on 1-18. New users after that date may also have generated duplicate files, but finding those is more complicated, so I first cleared most of the duplicate files created after 2021-1-9 4:00.

1. Find the duplicate photos created after 2021-1-9 4:00

# Find the files created after 2021-1-9 4:00
find ./ -type f -newermt "2021-1-9 4:00" >~/out-19400-image.txt
# Cut out the file-name prefixes, count how often each occurs, and sort by count.
cat ~/out-19400-image.txt |sort|cut -c 3-17|uniq -c|sort -n > after1-9-4.txt

The result is similar to this .

I only need the file-name prefixes, so the output needs one more processing step; then the script can be written according to the logic above.

cat after1-9-4.txt |awk -F' ' '{print $2}' > filename_fina.txt
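The script in the next step mentions prefixes with 10 or more pictures, but the command above keeps every prefix regardless of count. If only heavily duplicated prefixes are wanted, the count column from uniq -c can be filtered first. A sketch, with the threshold as a parameter and prefixes_with_min_count as a hypothetical helper name:

```shell
# Read "count prefix" lines (the uniq -c format) on stdin and print
# only the prefixes whose count is at least the given threshold.
prefixes_with_min_count() {
    awk -v min="$1" '$1 >= min { print $2 }'
}

# Example: prefixes_with_min_count 10 < after1-9-4.txt > filename_fina.txt
```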

The result after processing: just the file prefixes.

2. Write the script

# File prefixes with >= 10 pictures after 2021-1-9 4:00
# Assumption: dir is the face image directory identified earlier;
# the original script did not show where $dir was set.
dir=/wls/files/FTR/faceFile/image/
for i in `cat filename_fina.txt`;do
#for i in "20004788_d2d9e_";do
    # Check whether there is a record from before 2021-1-9 4:00
    num_pic=`find "$dir" -type f -name "${i}*" ! -newermt "2021-1-9 4:00"|wc -l`
    if [ "$num_pic" -ne 0 ];then
        echo "This picture has a record from before 2021-1-9 4:00, prefix ${i}: running mv"
        # -r: do not run mv at all when find returns nothing
        find "$dir" -type f -name "${i}*" -newermt "2021-1-9 4:00" |xargs -r mv -t /home/laji
        echo "============================================================="
    else
        echo -e "This picture only has records from after 2021-1-9 4:00, prefix ${i}\nNo operation"
        echo "============================================================="
    fi
done

The result of running the script.

The space that was finally freed.

The duplicate content, cleaned up.