资源说明:Find Duplicate Music Files
########### FDMF - find duplicate music files ############### Installation instructions are in the file INSTALL. THANKS: John Jolly, Lawrence Wade, Paul Russo, Paul Forgey, Luis Magaña, David Parker, Alain Deschenes, and Andre Pool. This program is designed to detect music files that contain the same music, even if the files are differently named and contain different or nonexistent metainformation such as ID3 tags. USAGE: fdmffdmf_vector_pairs USAGE EXAMPLE: ./fdmf /home/bob/music/ ./fdmf_vector_pairs fdmf_vector_pairs will print on stdout a list of pairs of files that may contain the same music. NOTE: The first time you run this program it will do a lot of processing. It takes 10 minutes to process a music_dir containing 140 mp3 files on my Apple G4. It will attempt to cache the result so that subsequent runs are faster. You can interrupt it and your progress will not be lost. ANOTHER NOTE: On my current development box, an IBM T20, mplayer is the fastest mp3 decoder. mpg123 is also quite fast, much faster than mpg321. Some Linux distros come with mpg123 being a symbolic link to mpg321 since mpg123 is not under GPL. Bottom line: you can use any program for decoding if it can take the name of a music file as a command line arg and write the raw decoded 44100/stereo/16-bit stream on stdout. You might want to take some measurements on your system to see which decoder is fastest. ################################################## TUNING: There are important threshold parameters that affect the the detection of similar files by fdmf_vector_pairs. If you don't give fdmf_vector_pairs any command line arguments, then it will use the default parameters, which should hopefully work OK. Or you can tune these parameters to improve the performance. ################################################## HOW IT WORKS: 1) Build a list of all music files in music_dir. FOR EACH FILE: 2) Decode/decompress file to raw binary data. 3) Calculate the energy in 4 frequency bands for each 250 msec chunk. 3.3) Calculate the sum of the 4 bands for each chunk. 3.5) Calculate the (b2 + b3):(b0 + b1) ratio for each chunk. 3.7) Calculate the (b0 + b2):(b1 + b3) ratio for each chunk. 4) Calculate the power spectrum of steps 3.3, 3.5, and 3.7. 5) Spline fit power spectra to a fixed set of frequency points. 6) Quantize the result of step 5 to one bit using median as threshold. 6.5) Store the result of step 6 in a database. FOR EACH POSSIBLE PAIR OF FILES: 7) Count the number of bits that are the same in the two spectra. 8) If the result of 7 exceeds a threshold, and both files still exist, print the filenames. fdmf keeps a cache of the results of step 6. This cache persists from one invocation of fdmf to the next in the form of a database file in the user's home directory. (~/.fdmf) If a music file has its step 6 result already in the cache, then step 2 through step 6 will be skipped for that file. That speeds things up by about two orders of magnitude. If a music file is identical (data has the same MD5 hash) to another file that has already been indexed, then the step 6 result will be copied from the existing index entry to the new index entry. This skips steps 2 through 6 and speeds up processing of that file by about two orders of magnitude. In the worst case (no identical files) this slows things down by about 1%. After finding that two files in the database seem to match, fdmf_vector_pairs checks to make sure that both files exist. If they don't exist, it will suppress the output regarding that pair. The goal of this is to only print messages about actual duplicate files, but to keep the records for all files that have been analyzed. If sometime later you get the exact same file (based on MD5 sum) then fdmf will just reuse the summary that it already has. The summaries are small, so for most people, it is worth it to cache these "ghost" records. Steps 1 through 6.5 are implemented in fdmf. Steps 7 and 8 are implemented in fdmf_vector_pairs in C. ################################################ DEVELOPMENT GOALS: It would be really nice to synchronize/merge two music collections by exchanging tables of results from step 6 of the algorithm. This reminds me of rsync. I think that it might make sense to qualify the filenames in the database with the host that they are on. Right now it is just a UNIX path. Maybe better would be an NFS-style name like foobox:/home/john/music/funsong.mp3. It should be easy to extend the program to support all types of music files. It currently supports mp3, ogg, m4a, wma, wav, and realaudio using mplayer. But any format that has a command line player that can output a raw stream should be easy to support. It would be really good to have a GNU configure script. If somebody wants to write one, that would be great. The FFTW fourier transform library that fdmf_sonic_reducer uses has a "plan" structure that controls how the FFT will be performed. It is possible to cache this plan from one invocation to the next in a file. This might speed things up. But currently about 90% of the CPU time required to index music files is the time it takes to decode the music file (mpg123, for example), it is questionable whether it is worth worrying about optimizing fdmf_sonic_reducer. It might be more beneficial to see whether there is something that can be done to speed up the decoding. Skipping frames in the decoding is possible in mpg123, but in my early testing it seemed to increase the number of classification errors. ############################################## KNOWN BUGS: ############################################## HACKING: If you want to hack around, you might find useful the stderr output of fdmf_vector_pairs. It is four numbers in the following format: This can be used to print out the number of classification errors that the program is making. It will tell you how many times fdmf_vector_pairs said that the pair of files contained the same music when they really didn't, and how many times fdmf_vector_pairs said that the pair of files did not contain the same music when they actually do. For the program to do this, you must give it a big hint in the naming of the files. It looks at the basenames. Here is how to use this feature: 1) Create a directory inside your music_dir. 2) Recode your music files and put the results in the new directory. When you recode your files, the basename of the file should be the same as the original. Just the directory is different. After you recode, the contents will be different, even though the files my sound the same when you play them. A shell command for recoding is mpg123 -s foo.mp3 | bladeenc STDIN new_dir/foo.mp3 When fdmf_vector_pairs runs, it checks to see whether the basenames of the filenames are the same. If the classification algorithm judged the contents to be different but the files have the same basename, it increments the type2 error counter. If the classification algorithm judged them to have the same contents but basenames are different, the type1 error counter is incremented. These two counters are printed to stderr at the end of the run. When generating this synthetic test data, you can also add some silence to the beginning of the files. You can do this with dd if=/dev/zero. This helps to make the tests more realistic. Also, it is good to use different encoders. I use blade and lame for MP3 encoding. Anything that you can do to simulate real-world variations in the representation of the music will make the tests more meaningful. FDMF_BENCH: A little script called fdmf_bench is included in the fdmf tarball. It will run fdmf_vector_pairs over a range of parameter values. It will output a file called gpd. gpd output line format: To use this, you must have a set of music files that were prepared as described above. It will be totally meaningless to run this for a regular (non-synthetic) set of files. To find good parameter values for a set of synthetic data, you can use something like this: sort -n -k 4 gpd | less OPTPARAMS: Another little script is included that uses fdmf_bench to search the entire parameter space. It is useful for finding reasonable parameters in a reasonable amount of time. Usually this isn't necessary because the default parameters defined in fdmf_vector_pairs.h should be OK, but when somebody makes a change to the fdmf_sonic_reducer algorithm, it is necessary to find new parameter values. AUTOMATIC DUPLICATE HANDLING (cleanup_dups): The fdmf_vector_pairs program takes and optional -0 (that's a zero) argument. This has the same purpose as the -print0 option for the UNIX find program. It causes the output of the program to be delimited by null bytes instead of the usual human readable delimiters, tab and newline. This way, the output can be fed to another program and the other program can unambiguously parse the stream into separate filenames. With the -0 option, the output of fdmf_vector_pairs can be piped to another program which takes in the list of matching pairs of music files and takes some action. All kinds of logic can be employed to decide what to do. Maybe you want to move duplicate files to some other place. Or maybe you want to delete them. Or maybe you want to delete one and replace it with a link to the other. Or maybe you want to have some user interaction. A program called cleanup_dups is included as an example of this possibility. The program is just an example, and it just finds which file of each pair is smaller and copies that file to a directory in /tmp. This is documented at the top of the file cleanup_dups. You are encouraged to hack cleanup_dups to add your own logic. Please be careful. Your music collection is precious. Back it up to CDROM/DVD/whatever. Have fun.
本源码包内暂不包含可直接显示的源代码文件,请下载源码包。