Hi
I'm asking for a Python algorithm that can align (match) two text files, for example a movie script and its subtitle file. If you have any idea how I can do this, I would be very grateful, because I'm in a desperate situation.
18/01/2017 Update: This question is not about finding exact matches. It is about finding the lines of text with the highest similarity using natural language processing methods.
Similarity between lines of text can be measured with various similarity measures:
five most popular similarity measures implementation in python
One such measure, cosine similarity, can be implemented in Python as follows.
Before calculating cosine similarity, you have to convert each line of text to a vector.
text1 = 'You have a choice in this world, I believe, about how to tell sad stories.'
text2 = 'i believe we have a choice in this world about how to tell sad stories.'
text3 = 'totally unrelated sentence.'
The corresponding vectors have each word in the text and its number of occurrences.
({'a': 1, 'to': 1, 'world': 1, 'about': 1, 'believe': 1, 'this': 1, 'choice': 1, 'how': 1, 'stories': 1, 'sad': 1, 'tell': 1, 'in': 1, 'You': 1, 'I': 1, 'have': 1})
({'a': 1, 'to': 1, 'world': 1, 'about': 1, 'believe': 1, 'i': 1, 'this': 1, 'choice': 1, 'how': 1, 'stories': 1, 'tell': 1, 'have': 1, 'we': 1, 'sad': 1, 'in': 1})
({'unrelated': 1, 'totally': 1, 'sentence': 1})
You can use Counter objects (from the collections module) for these vectors.
#!/usr/bin/env python
import math
import re
from collections import Counter

# Matches runs of word characters; used to split a line into words.
WORD = re.compile(r'\w+')

text1 = 'You have a choice in this world, I believe, about how to tell sad stories.'
text2 = 'i believe we have a choice in this world about how to tell sad stories.'
text3 = 'totally unrelated sentence.'


def get_cosine(vec1, vec2):
    # Only words present in both vectors contribute to the dot product.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the Euclidean norms of the two vectors.
    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has norm 0; return 0.0 instead of dividing by zero.
    if not denominator:
        return 0.0
    return float(numerator) / denominator


def text_to_vector(text):
    # Count the occurrences of each word (case-sensitive).
    words = WORD.findall(text)
    return Counter(words)


vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
vector3 = text_to_vector(text3)
print(vector1)
print(vector2)
print(vector3)

cosine12 = get_cosine(vector1, vector2)
cosine13 = get_cosine(vector1, vector3)
print('Cosine similarity between text1 and text2:', cosine12)
print('Cosine similarity between text1 and text3:', cosine13)
You may also use the scikit-learn Python module to convert the text to TF-IDF vectors and then compute cosine similarity.
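To make the TF-IDF idea concrete without depending on an installed library, here is a minimal hand-rolled sketch using only the standard library. It is not scikit-learn's implementation: the function names (tokenize, tfidf_vectors, cosine), the lowercasing step, and the particular smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def tokenize(text):
    # Assumption: lowercase so that 'You' and 'you' are the same term.
    return WORD.findall(text.lower())


def tfidf_vectors(lines):
    """Return one {word: tf-idf weight} dict per input line."""
    tokenized = [tokenize(line) for line in lines]
    n_lines = len(tokenized)

    # Document frequency: in how many lines does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed IDF (an illustrative choice): rarer words get larger weights,
    # and no weight is ever zero or negative.
    idf = {w: math.log((1 + n_lines) / (1 + df)) + 1 for w, df in doc_freq.items()}

    return [
        {w: count * idf[w] for w, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(vec1, vec2):
    numerator = sum(vec1[w] * vec2[w] for w in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(x * x for x in vec1.values()))
    norm2 = math.sqrt(sum(x * x for x in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


lines = [
    "You have a choice in this world, I believe, about how to tell sad stories.",
    "i believe we have a choice in this world about how to tell sad stories.",
    "totally unrelated sentence.",
]
v1, v2, v3 = tfidf_vectors(lines)
print("text1 vs text2:", cosine(v1, v2))
print("text1 vs text3:", cosine(v1, v3))
```

Compared with raw word counts, the IDF weighting lets words that occur in few lines count for more than words that occur in almost every line, which usually helps when many dialogue lines share common words.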
Also refer:
cosine similarity between 2 lines of text
So you could read both text sources line by line and match the ones with the highest cosine similarity.
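The line-by-line matching just described could be sketched roughly as follows. This is only an illustration under some assumptions: best_matches is a hypothetical helper name, words are lowercased before counting, the script list is non-empty, and the script lines are assumed to have already been joined into whole dialogue passages (in a real screenplay one sentence is often wrapped over several lines).

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def text_to_vector(text):
    # Assumption: lowercase so that 'You' and 'you' count as the same word.
    return Counter(WORD.findall(text.lower()))


def get_cosine(vec1, vec2):
    numerator = sum(vec1[w] * vec2[w] for w in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


def best_matches(subtitle_lines, script_lines):
    """For each subtitle line, return (subtitle index, best script index, score)."""
    script_vectors = [text_to_vector(line) for line in script_lines]
    matches = []
    for sub_index, sub_line in enumerate(subtitle_lines):
        sub_vector = text_to_vector(sub_line)
        scores = [get_cosine(sub_vector, vec) for vec in script_vectors]
        # Index of the highest score; on ties the earliest script line wins.
        best_index = max(range(len(scores)), key=scores.__getitem__)
        matches.append((sub_index, best_index, scores[best_index]))
    return matches


subtitles = [
    "i believe we have a choice in this world about how to tell sad stories.",
    "on the one hand, you can sugarcoat it.",
]
script = [
    "HAZEL (V.O.)",
    "You have a choice in this world, I believe, about how to tell sad stories.",
    "CUT TO a SERIES OF QUICK IMAGES:",
]
matches = best_matches(subtitles, script)
for sub_index, script_index, score in matches:
    print(sub_index, script_index, round(score, 3))
```

Every subtitle gets a best candidate even when nothing in the script really corresponds to it, so in practice you would probably discard matches whose score falls below some threshold that you tune on your own data.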
Sorry for not being clear enough. My problem is how to align the movie script (the dialogue), which is a text file (.txt), with its subtitles (.srt), which are also a text file after conversion (.txt). For example:
Subtitle file:
i believe we have a choice in this world about how to tell sad stories.
on the one hand, you can sugarcoat it.
the way they do in movies and romance novels...
where beautiful people learn beautiful lessons...
where nothing is too messed up that can't be fixed...
with an apology and a peter gabriel song.
i like that version as much as the next girl does,
believe me.
it's just not the truth.
this is the truth.
sorry.
Script File:
HAZEL GRACE LANCASTER (16) lies in the grass, staring up at
the stars. We're CLOSE ON her FACE and we hear:
HAZEL (V.O.)
You have a choice in this world, I
believe, about how to tell sad
stories.
CUT TO a SERIES OF QUICK IMAGES:
- Hazel and the BOY we will come to know as AUGUSTUS "GUS"
WATERS (17) at an outdoor restaurant in some magical place.
[They look very much like the perfect Hollywood couple.]
So file1 (subtitle.txt) contains only the dialogue from file2 (script.txt)?
So I am assuming script.txt is bigger than subtitle.txt and every line in subtitle.txt appears in some shot inside script.txt.
So what should the output be? Find the matching lines and show their line numbers?
You could read a line from file 1 and a line from file 2 and compare them; if they do not match, advance the line pointer in file 2 until you find a match. When a match is found, show the line number of the matching line.
Do this until end of file is reached.
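A rough sketch of that pointer-based scan, working on lists of lines rather than open files. The function name sequential_matches is hypothetical, lines are compared exactly after stripping surrounding whitespace, and one detail is an assumption of this sketch: when a subtitle line is not found, the script pointer stays where it was so that later subtitle lines can still be matched.

```python
def sequential_matches(subtitle_lines, script_lines):
    """Return (subtitle line number, script line number) pairs, 1-based."""
    matches = []
    script_pos = 0  # index of the next script line that has not been consumed
    for sub_no, sub_line in enumerate(subtitle_lines, start=1):
        target = sub_line.strip()
        # Scan forward in the script from the current pointer.
        for pos in range(script_pos, len(script_lines)):
            if script_lines[pos].strip() == target:
                matches.append((sub_no, pos + 1))
                script_pos = pos + 1  # continue after the matched line
                break
        # No match: leave script_pos unchanged (assumption of this sketch).
    return matches


subtitle_lines = ["this is the truth.", "not in the script", "sorry."]
script_lines = [
    "HAZEL (V.O.)",
    "this is the truth.",
    "CUT TO:",
    "  sorry.  ",
]
result = sequential_matches(subtitle_lines, script_lines)
for sub_no, script_no in result:
    print("subtitle line", sub_no, "matches script line", script_no)
```

This only finds exact matches that appear in the same order in both files, so it would miss lines that are reworded between the script and the subtitles.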
Refer to the following code, which prints the line numbers that are different:
http://www.opentechguides.com/askotg/question/64/how-to-match-two-text-files-using-python
You could alter it slightly to print the line numbers that match.
try this script
I want to use natural language processing methods to extract the contained metadata, match the text sources against each other, and finally assign each subtitle to a script dialogue according to the highest similarity between text passages.
match two text files?
Do you mean match a video to its subtitles(SRT)?
Can you be more specific? What file formats are the two files? Are they both text?
Is it 2 text files or 1 video file and 1 text file? Are we matching video with subtitles? It is not clear. Are you trying to extract subtitles from script?