
Question: SenticGCN Trained Model Not Appearing

7 Posts
2 Users
0 Likes
410 Views
 yen
(@yen)
Active Member
Joined: 10 months ago
Posts: 4
Topic starter  

Hi, I've been training my SenticGCN model, and while my embed_models and tokenizers folders are populated, it seems like my models folder has not been populated at all... 

Here is my sentic_gcn_config.json 

[screenshot: sentic_gcn_config.json contents]

From this, my model is supposed to be saved and updated into ./models/senticgcn/

However, when I enter that directory, it's empty, with no models saved at all 

[screenshot: the empty ./models/senticgcn/ directory]

 

This screenshot, for example, shows the file path the model is supposed to be saved to, but when I go to that path, it's empty.

[screenshot: output showing the intended save path]

 

I've run the code for a few hours already, and the folder is still not populated. Is there something I'm doing wrong, or something I'm not aware of? 🙁
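For anyone who wants to reproduce the check, here is a quick sanity script (a minimal sketch; it assumes the config key is spelled save_model_path, as in the replies below):

import json
import pathlib

# Read the training config and check whether the save folder exists and has anything in it.
config = json.loads(pathlib.Path("sentic_gcn_config.json").read_text())
save_path = pathlib.Path(config["save_model_path"]).resolve()
print("Model should be saved to:", save_path)
print("Exists:", save_path.exists())
if save_path.exists():
    print("Contents:", list(save_path.iterdir()))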

 


   
Raymond Ng
(@raymond_aisg)
Eminent Member AISG Staff
Joined: 1 year ago
Posts: 35
 

Hi,

From your screenshot, I'm assuming that you are using Windows to run the training script.

Could you kindly try replacing the `save_model_path` config with an absolute path to see if the weights could be saved? (e.g. C:\Users\user\Desktop\AISG\models\senticgcn\).

Please also note that the default config uses Unix-style paths with forward slashes, whereas Windows uses backslashes.

https://jrogel.com/backslashes-v-forward-slashes-windows-linux-and-mac/
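One way to sidestep the separator question entirely is pathlib, which builds paths with the correct separator for the current OS. A minimal sketch (the base folder below is illustrative):

import pathlib

# Build an absolute save path that works on both Windows and Unix.
save_dir = pathlib.Path.home() / "Desktop" / "AISG" / "models" / "senticgcn"
save_dir.mkdir(parents=True, exist_ok=True)  # create it up front so the trainer can write into it
print(save_dir)  # e.g. C:\Users\user\Desktop\AISG\models\senticgcn on Windows

You can then paste the printed absolute path into the save_model_path config.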

Hope this helps.

All things NLP


   
 yen
(@yen)
Active Member
Joined: 10 months ago
Posts: 4
Topic starter  

Hi, thank you for your suggestion! I tried using an absolute file path, but the model still isn't being saved. The tokenizer and the embedding model, however, are being saved (those folders were updated); it's just the model that isn't. Has anyone else encountered this problem?


   
 yen
(@yen)
Active Member
Joined: 10 months ago
Posts: 4
Topic starter  

Hi, I spent a little bit of time going through the module, and I think there's probably nothing wrong. Here's my take (see the code explanation below):

Basically, the model does NOT save its weights to your save_model_path folder UNTIL the end of the run, unlike the tokenizer and embed_model, which are saved at the start.

 

So, in conclusion, my laptop is probably just slow (and I need to rerun the training :/)

 

Hope this helps anyone else who may be facing this problem (and who can't sleep at 3 a.m. wondering whether their model will be saved after 6 hours of training)!

# How the model is saved eventually
def _save_model(self):
    # Other stuff
    self.model.save_pretrained(self.config.save_model_path)


# self._save_model() is called in:


class SenticGCNTrainer:
    def train(self):
        repeat_result = self._train(train_dataloader, val_dataloader)
        # Other important stuff
        self._save_model()  # Only saved at the very end, after the full training run
        # Other important stuff

    def _train(
        self,
        train_dataloader: Union[DataLoader, BucketIterator],
        val_dataloader: Union[DataLoader, BucketIterator],
    ) -> Dict[str, Dict[str, Union[int, float]]]:
        # Setting up variables (criterion, optimizer, etc.)
        for i in range(self.config.repeats):
            # This is crucial: repeat_tmpdir is where the models are actually
            # saved before the run is complete
            repeat_tmpdir = self.temp_dir.joinpath(f"repeat{i + 1}")
            self._reset_params()
            # Calls self._train_loop -- critically, with repeat_tmpdir as the directory
            max_val_acc, max_val_f1, max_val_epoch = self._train_loop(
                criterion, optimizer, train_dataloader, val_dataloader, repeat_tmpdir
            )
            # Record repeat run results
            # Overwrite global stats
        return repeat_result

    def _train_loop(
        self, criterion, optimizer, train_dataloader, val_dataloader, tmpdir: pathlib.Path
    ) -> Tuple[float, float, int]:
        # Setting up some config variables
        for epoch in range(self.config.epochs):
            global_step += 1
            self.model.train()  # Put the model into training mode
            # Other steps: forward pass, loss, validation, etc.
            if val_acc > max_val_acc:
                # Save best-so-far variables
                self.model.save_pretrained(tmpdir)  # Saved to tmpdir, NOT save_model_path
            # Code for early stopping
        return max_val_acc, max_val_f1, max_val_epoch


# The big question now is: what is repeat_tmpdir?
with tempfile.TemporaryDirectory() as tmpdir:
    self.temp_dir = pathlib.Path(tmpdir)

"""
Doing a little more tracing: TemporaryDirectory() ends up calling mkdtemp() with
prefix = "tmp", suffix = "", dir = None
so the directory is picked from the default candidate list.
"""

# The default dir comes from here (CPython's tempfile module):
def _candidate_tempdir_list():
    """Generate a list of candidate temporary directories which
    _get_default_tempdir will try."""

    dirlist = []

    # First, try the environment.
    for envname in 'TMPDIR', 'TEMP', 'TMP':
        dirname = _os.getenv(envname)
        if dirname: dirlist.append(dirname)

    # Failing that, try OS-specific locations.
    if _os.name == 'nt':
        dirlist.extend([_os.path.expanduser(r'~\AppData\Local\Temp'),
                         _os.path.expandvars(r'%SYSTEMROOT%\Temp'),
                         r'c:\temp', r'c:\tmp', r'\temp', r'\tmp'])
    else:
        dirlist.extend(['/tmp', '/var/tmp', '/usr/tmp'])

    # As a last resort, the current directory.
    try:
        dirlist.append(_os.getcwd())
    except (AttributeError, OSError):
        dirlist.append(_os.curdir)

    return dirlist


"""After getting the dirlist, it creates a binary file by getting the absolute path of the directory and writing into a binary file."""


   
 yen
(@yen)
Active Member
Joined: 10 months ago
Posts: 4
Topic starter  

(To AISG staff, please confirm if my suspicions are correct, thanks!)


   
Raymond Ng
(@raymond_aisg)
Eminent Member AISG Staff
Joined: 1 year ago
Posts: 35
 

@yen Hi,

Thanks for trying out the suggestions.

The tokenizer and embedding models are pre-trained models that are downloaded directly from the Hugging Face hub as part of the setup required for training, which is why those folders are populated right away.
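
Conceptually, the setup step does something like this (a rough sketch; the checkpoint name and output folder are illustrative, not necessarily what the training script uses):

from transformers import AutoTokenizer

# Fetched from the hub and written to disk during setup, before any training happens.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("./tokenizers/senticgcn")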

All things NLP


   
Raymond Ng
(@raymond_aisg)
Eminent Member AISG Staff
Joined: 1 year ago
Posts: 35
 

@yen Hi,

Your observation is correct. As stated in the paper, the full training loop is run 10 times and the best model out of the 10 runs is saved at the end. Intermediate model weights are saved in a temp folder between training runs, as indicated here,

https://github.com/aisingapore/sgnlp/blob/main/sgnlp/models/sentic_gcn/train.py#L236

 

For a quick test run to check if it's possible to save the final model, first reduce the number of repeats to 1 for a single run,

https://github.com/aisingapore/sgnlp/blob/main/sgnlp/models/sentic_gcn/config/sentic_gcn_bert_config.json#L37

Next, reduce the number of epochs to a small figure like 1 or 2 here,

https://github.com/aisingapore/sgnlp/blob/main/sgnlp/models/sentic_gcn/config/sentic_gcn_bert_config.json#L28

Run the training script again and you should be able to quickly observe whether the model saves to the folder indicated in the `save_model_path` config.
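
If you prefer to script the change, something like this works (a sketch; the key names follow the config linked above, so adjust if your file differs):

import json
import pathlib

# Dial training down to a single short run for a quick end-to-end save test.
cfg_path = pathlib.Path("sentic_gcn_config.json")
config = json.loads(cfg_path.read_text())
config["repeats"] = 1
config["epochs"] = 2
cfg_path.write_text(json.dumps(config, indent=4))

Remember to restore the original values before a real training run.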

 

Lastly, for the quick test above, could you try running the training script in debug mode with a breakpoint at the following line and observe whether the breakpoint is hit,

https://github.com/aisingapore/sgnlp/blob/main/sgnlp/models/sentic_gcn/train.py#L583
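
If you are not using an IDE, you can do the same check with pdb from the command line (invocation is illustrative; pass whatever arguments you normally use):

python -m pdb train.py
(Pdb) b sgnlp/models/sentic_gcn/train.py:583
(Pdb) c

If execution stops at that breakpoint, the final save is at least being reached.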

 

As indicated in the model card, our training on an A100 40GB GPU with the SemEval14/15/16 datasets takes only around an hour. If you are training on CPU, please ensure that you have enough system RAM and disk space available throughout the training run.

Hope this helps.

All things NLP


   