Mourchid asked this 7 years ago

How to match two text files using Python

Hi
I'm looking for a Python algorithm that can align (match) two text files, for example a movie script and the file of its subtitles. If you have any idea how I can do this step, I will be very grateful, because I'm really stuck.

18/01/2017 Update: This question is not about finding exact matches. It is about finding the lines of text with the highest similarity using natural language processing methods.

dig_abacus 7 years ago

13 likes

Similarity between lines of text can be measured with various similarity measures.

One such measure, cosine similarity, can be implemented in Python as follows.

Before calculating cosine similarity you have to convert each line of text to a vector.

text1 = 'You have a choice in this world, I believe, about how to tell sad stories.'

text2 = 'i believe we have a choice in this world about how to tell sad stories.'

text3 = 'totally unrelated sentence.'

The corresponding vectors have each word in the text and its number of occurrences.

({'a': 1, 'to': 1, 'world': 1, 'about': 1, 'believe': 1, 'this': 1, 'choice': 1, 'how': 1, 'stories': 1, 'sad': 1, 'tell': 1, 'in': 1, 'You': 1, 'I': 1, 'have': 1})

({'a': 1, 'to': 1, 'world': 1, 'about': 1, 'believe': 1, 'i': 1, 'this': 1, 'choice': 1, 'how': 1, 'stories': 1, 'tell': 1, 'have': 1, 'we': 1, 'sad': 1, 'in': 1})

({'unrelated': 1, 'totally': 1, 'sentence': 1})

You can use Counter objects (from the collections module) for these vectors.

#!/usr/bin/env python
import math
import re
from collections import Counter

WORD = re.compile(r'\w+')
text1 = 'You have a choice in this world, I believe, about how to tell sad stories.'
text2 = 'i believe we have a choice in this world about how to tell sad stories.'
text3 = 'totally unrelated sentence.'

def get_cosine(vec1, vec2):
    # Dot product over the words that occur in both vectors
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the two vector lengths (Euclidean norms)
    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has length 0; treat it as not similar to anything
    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    # Count how often each word occurs (case-sensitive)
    words = WORD.findall(text)
    return Counter(words)


vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
vector3 = text_to_vector(text3)

print(vector1)
print(vector2)
print(vector3)

cosine12 = get_cosine(vector1, vector2)
cosine13 = get_cosine(vector1, vector3)

print('Cosine similarity between text1 and text2:', cosine12)
print('Cosine similarity between text1 and text3:', cosine13)

You may also use the scikit-learn Python module to convert the text to tf-idf vectors and then compute the cosine similarity.

scikit tf-idf

Also refer:

text to tf-idf
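
As a rough sketch of the scikit-learn route mentioned above, assuming scikit-learn is installed: TfidfVectorizer builds the tf-idf vectors and cosine_similarity compares them. The three texts are the ones from the example above; the variable names are only illustrative. With its default settings the vectorizer lowercases the text and ignores one-letter words such as "a" and "I", so the scores will differ from the Counter version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    'You have a choice in this world, I believe, about how to tell sad stories.',
    'i believe we have a choice in this world about how to tell sad stories.',
    'totally unrelated sentence.',
]

# Fit one vocabulary over all texts so that every vector has the same dimensions.
tfidf = TfidfVectorizer().fit_transform(texts)

# similarity[i, j] is the cosine similarity between texts[i] and texts[j].
similarity = cosine_similarity(tfidf)

print('Cosine similarity between text1 and text2:', similarity[0, 1])
print('Cosine similarity between text1 and text3:', similarity[0, 2])
```

For two real files you would probably fit the vectorizer on all lines of both files together, so that the idf weights reflect the whole text.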

So you could read both text sources line by line and match the ones with highest cosine similarity.
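
A minimal sketch of that idea, reusing the Counter-based cosine similarity from the code above. The function name best_matches and the lowercasing step are my own additions, not part of the original code, and the script lines here were joined by hand because the real script wraps one sentence over several lines. It assumes the target list is not empty.

```python
import math
import re
from collections import Counter

WORD = re.compile(r'\w+')

def text_to_vector(text):
    # Lowercasing is an assumption added here so that 'You' and 'you' count as one word.
    return Counter(WORD.findall(text.lower()))

def get_cosine(vec1, vec2):
    numerator = sum(vec1[w] * vec2[w] for w in set(vec1) & set(vec2))
    denominator = (math.sqrt(sum(c * c for c in vec1.values()))
                   * math.sqrt(sum(c * c for c in vec2.values())))
    return numerator / denominator if denominator else 0.0

def best_matches(source_lines, target_lines):
    """For each source line return (source_index, best_target_index, score)."""
    target_vectors = [text_to_vector(line) for line in target_lines]
    matches = []
    for i, line in enumerate(source_lines):
        vec = text_to_vector(line)
        scores = [get_cosine(vec, tv) for tv in target_vectors]
        # Index of the target line with the highest score
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, best, scores[best]))
    return matches

subtitles = [
    'i believe we have a choice in this world about how to tell sad stories.',
    'on the one hand, you can sugarcoat it.',
]
script = [
    'HAZEL GRACE LANCASTER (16) lies in the grass, staring up at the stars.',
    'You have a choice in this world, I believe, about how to tell sad stories.',
    'CUT TO a SERIES OF QUICK IMAGES:',
]

for sub_idx, script_idx, score in best_matches(subtitles, script):
    print(sub_idx, script_idx, round(score, 3))
```

Every line gets a "best" match even when nothing really fits, so in practice you would likely discard matches below some score threshold that you tune on your own data.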

Mourchid 7 years ago

5 likes

Sorry for not being clear enough. My problem is how to align the movie script (or the dialogue), which is a text file (.txt), with its subtitles (.srt), which is also a text file after conversion to .txt. For example:

Subtitle file:

i believe we have a choice in this world about how to tell sad stories.

on the one hand, you can sugarcoat it.

the way they do in movies and romance novels...

where beautiful people learn beautiful lessons...

where nothing is too messed up that can't be fixed...

with an apology and a peter gabriel song.

i like that version as much as the next girl does,

believe me.

it's just not the truth.

this is the truth.

sorry.

Script File:

HAZEL GRACE LANCASTER (16) lies in the grass, staring up at

the stars. We're CLOSE ON her FACE and we hear:

HAZEL (V.O.)

You have a choice in this world, I

believe, about how to tell sad

stories.

CUT TO a SERIES OF QUICK IMAGES:

- Hazel and the BOY we will come to know as AUGUSTUS "GUS"

WATERS (17) at an outdoor restaurant in some magical place.

[They look very much like the perfect Hollywood couple.]

sonja 7 years ago

4 likes

So file1(subtitle.txt) contains only the dialogues from file2(script.txt)?

So I am assuming script.txt is bigger than subtitle.txt and every line in subtitle.txt appears somewhere inside script.txt.

So what is the output to be? Find the matching lines and show the line numbers of matching lines?

You could go about it like this: read a line from file 1 and a line from file 2 and compare them. If they do not match, advance the line pointer in file 2 until you find a match. When a match is found, show the line number of the matching line.

Do this until end of file is reached.

Refer to the following code

http://www.opentechguides.com/askotg/question/64/how-to-match-two-text-files-using-python

which prints the line numbers that are different. You could alter it slightly to print the line numbers that match.
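
A small sketch of the pointer approach described above, not the code from the link. The helper names normalize and sequential_matches are made up for illustration, the comparison ignores case and punctuation (my assumption, since the subtitles are lowercase), and the sample script lines are invented. If a subtitle line is not found, the pointer stays where it was so later subtitle lines can still match.

```python
import re

def normalize(line):
    # Hypothetical normalization: lowercase and keep only the words.
    return ' '.join(re.findall(r'\w+', line.lower()))

def sequential_matches(subtitle_lines, script_lines):
    """Return (subtitle_line_no, script_line_no) pairs, counting from 1."""
    matches = []
    pos = 0  # pointer into script_lines
    for sub_no, sub in enumerate(subtitle_lines, start=1):
        target = normalize(sub)
        # Scan forward in the script from the current pointer.
        for j in range(pos, len(script_lines)):
            if normalize(script_lines[j]) == target:
                matches.append((sub_no, j + 1))
                pos = j + 1  # continue after the matched line
                break
    return matches

subtitles = ["it's just not the truth.", 'this is the truth.', 'sorry.']
script = [
    'HAZEL (V.O.)',
    "It's just not the truth.",
    'CUT TO:',
    'This is the truth.',
    'Sorry.',
]

print(sequential_matches(subtitles, script))
```

This only works when a subtitle line and a script line contain the same words. It will not find sentences that the script wraps over several lines or that are worded differently.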

gareth 7 years ago

3 likes

try this script

https://gist.github.com/altermarkive/4dfc62346796eb1e07f5

Mourchid 7 years ago

1 like

I want to use natural language processing methods to extract the contained metadata, match the text sources against each other, and finally assign each subtitle to a script dialogue according to the highest similarity between text passages.

sonja 7 years ago

match two text files?

Do you mean match a video to its subtitles(SRT)?

Remy Pereira 7 years ago

Can you be more specific? What file formats are the two files? Are they both text?

sonja 7 years ago

Is it 2 text files or 1 video file and 1 text file? Are we matching video with subtitles? It is not clear. Are you trying to extract subtitles from script?